Providing efficient processing and other IP for leading-edge storage devices

Storage

Flash memory has spread very quickly, from simple portable "thumb" drives to low-power, high-performance enterprise servers, and the market is still changing.

The unique capabilities of Cadence® Tensilica® processors have proven to be ideal for this market—where there are few standards yet lots of innovation for solving the write amplification, energy, and error correction problems, as well as increasing IOPS.

Our licensees are shipping industry-leading products that use Tensilica Xtensa® processors. With more efficient processing logic and data I/O for their particular product or product line, they can increase IOPS with fewer gates and consume less energy. No other processor can offer this.

But that's not all! Other Cadence IP is instrumental in storage devices.

Standards-Based IP Solutions

  • Interfaces: Ethernet, MIPI®, PCI Express®, and USB IP
  • Memory: DDR, LPDDR, NAND Flash, SD, SDIO, eMMC, Wide IO, and more
  • Analog IP: AFE, ADC, DAC, clock, and power/sensor IP

Verification IP

See the customers that use Tensilica cores in Storage

Application Areas

Where can Tensilica processors be used in your flash controller?

Cryptography

  • AES-XTS up to 265x faster for 35kGates
  • Triple-DES up to 50x faster for 5kGates
  • SHA-1 up to 12x faster for 33kGates
  • ... compared to general purpose processors.

Cyclic Redundancy Check (CRC)

  • Up to 12x faster at 8 bits per cycle for 3kGates
  • Up to 24x faster at 16 bits per cycle for 3kGates
  • ... compared to general purpose processors.

Lempel Ziv Compression

  • ~5.5x faster for <17kGates
  • ... compared to general purpose processors.

LDPC Error Correction

  • Software programmable for flexibility
  • Shorter development and easier maintenance
  • Similar in size to RTL

Custom/Proprietary acceleration

  • Customer algorithms can be accelerated
  • No one else will have the same acceleration

Linked List Search

  • 3x faster for 1 key match in <200Gates
  • 4x-6x faster for 3 key matches
  • ... compared to general purpose processors.

Host Protocol processing

  • Multiple protocol support
  • Single cycle per header
  • Any width up to 1024 bits
  • Initiate Data DMA
  • More processing is available if needed

Table Lookup

  • 7x faster for <1kGates
  • ... compared to general purpose processors.

Trends in SSD Controller Design

Accelerating the computational workload with processors

Virtually Unlimited Bandwidth

Bypass the system bus to get RTL-like bandwidth

When evaluating a processor for any design, consider not only its compute capability but also how data gets into it for processing and back out to the rest of the system to take effect.

Conventional processors connect to the rest of the system via a system bus (32 to 128 bits wide) and perhaps an inbound DMA port. This sets an upper bound on the amount of data the processor can operate on. To consume or produce more data, high-bandwidth operations are either offloaded, or more processors are added and the task is split across them. Either way, the result is more development time, risk, and energy consumption.

Xtensa DPUs give the designer the ability to add multiple data ports to the processor, each up to 1024 bits wide, as well as the registers to hold and process them internally. Typically we'll see a few ports up to 256 bits wide in designs that either take inputs directly from one part of the system (RTL/processor), or provide processed results to another part (RTL/processor), as shown in the following diagram.

Flexible, wide I/O

The system bus is still there, of course, but there are other ways to get the large amounts of data required in flash controllers into and out of the processor.

Overall, this increases IOPS and reduces energy consumption, because fewer transactions occur and there is no need to add more processors or offload engines.
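
As a rough illustration of why wider ports mean fewer transactions, the sketch below counts the transfers needed to move one payload across interfaces of different widths. The 4 KiB payload and the function name `transfers_needed` are assumptions chosen for illustration. The widths are the ones mentioned above (32-bit and 128-bit buses, 256-bit and 1024-bit ports).

```python
def transfers_needed(payload_bytes: int, width_bits: int) -> int:
    """Number of single-beat transfers needed to move a payload.

    Assumes one transfer moves exactly `width_bits` bits and ignores
    arbitration, bursts, and protocol overhead.
    """
    if width_bits <= 0 or width_bits % 8 != 0:
        raise ValueError("width must be a positive multiple of 8 bits")
    bytes_per_transfer = width_bits // 8
    # Ceiling division so a partial final transfer still counts.
    return -(-payload_bytes // bytes_per_transfer)


PAGE_BYTES = 4096  # hypothetical payload: one 4 KiB flash page

for width in (32, 128, 256, 1024):
    count = transfers_needed(PAGE_BYTES, width)
    print(f"{width:>4}-bit interface: {count} transfers")
```

Under these assumptions, a 1024-bit port moves the page in 32 transfers, against 1024 transfers over a 32-bit bus.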

Higher Performance with Lower Energy Consumption

Bypass the system bus to get RTL-like bandwidth

Adding logic gates to address specific bottlenecks in any algorithm reduces cycle counts and naturally leads to lower energy consumption when the cycle count reduction outweighs the extra energy consumed by those additional gates.

To demonstrate, consider a set of common functions performed by flash controllers, either on a processor or by an offload accelerator, as indicated in the charts below:

The performance chart shows the speedup relative to the high-performance Diamond Standard 570T real-time controller. The 570T is a 3-issue VLIW CPU that can sustain up to 3 RISC operations per cycle, making it competitive with other leading high-end real-time control CPUs.

The gates used to accelerate the performance were added to the 570T as new instructions using the Verilog-like TIE (Tensilica Instruction Extension) language.

There are two implementation options for CRC16:

  1. "Hash" uses a hash table, which takes fewer cycles but requires additional memory to store the table. It's typically used when run on a processor.
  2. "NoHash" is logic-only and is typically used in offload accelerator implementations.
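
The trade-off between the two options can be sketched in software. The text does not name the CRC16 polynomial, so this sketch assumes the common CRC-16/CCITT-FALSE parameters (polynomial 0x1021, initial value 0xFFFF). It also reads the "hash table" as a precomputed 256-entry lookup table. Both are assumptions, and the function names are illustrative only.

```python
POLY = 0x1021   # assumed polynomial (CRC-16/CCITT-FALSE)
INIT = 0xFFFF   # assumed initial value


def _step_bit(crc: int) -> int:
    """Advance the 16-bit CRC register by one bit (MSB first)."""
    if crc & 0x8000:
        return ((crc << 1) ^ POLY) & 0xFFFF
    return (crc << 1) & 0xFFFF


def crc16_bitwise(data: bytes, crc: int = INIT) -> int:
    """'NoHash'-style: pure shift/XOR logic, eight steps per byte, no table."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = _step_bit(crc)
    return crc


def _make_table() -> list:
    """Precompute the CRC of every possible byte value (256 entries)."""
    table = []
    for value in range(256):
        crc = value << 8
        for _ in range(8):
            crc = _step_bit(crc)
        table.append(crc)
    return table


CRC_TABLE = _make_table()  # costs 256 x 16 bits of memory


def crc16_table(data: bytes, crc: int = INIT) -> int:
    """'Hash'-style: one table lookup per byte instead of eight bit steps."""
    for byte in data:
        crc = ((crc << 8) & 0xFFFF) ^ CRC_TABLE[(crc >> 8) ^ byte]
    return crc


message = b"123456789"
print(hex(crc16_bitwise(message)), hex(crc16_table(message)))
```

Both functions return the same value. The table version trades memory for fewer operations per byte, which is the trade-off the two options describe.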

The performance chart does not show the number of gates used to achieve the indicated gains; that is reflected in the energy chart below.

This energy chart shows an energy consumption comparison for the same algorithms in the performance chart above.

EnergyConsumption ∝ GateCount × TotalCycles

The "Reference Energy" column on the far left is a reference point showing 20% of the total height for each of the five algorithms being shown.

The "Xtensa Energy" column on the right shows the reduction in energy for each algorithm. The net reduction shown over all algorithms would only occur in a design if each algorithm were actually taking 20% of the total cycles to begin with. As all controller architectures are different, once you know how many processor cycles these algorithms take in your own design, you can assess how these accelerations may improve overall performance there.
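
A back-of-the-envelope version of this assessment can be sketched as follows. It takes energy as proportional to gate count multiplied by cycle count, in line with the earlier statement that energy falls when the cycle reduction outweighs the cost of the added gates. The 100 kGate baseline core is a hypothetical figure, and the CRC inputs (3 kGates, 12x) are borrowed from the Application Areas list purely as illustrative values.

```python
def relative_energy(base_gates: int, added_gates: int, speedup: float) -> float:
    """Energy of the extended core relative to the baseline core.

    Model: energy ~ gate_count * total_cycles, so the ratio is
    (gate ratio) * (cycle ratio).
    """
    gate_ratio = (base_gates + added_gates) / base_gates
    cycle_ratio = 1.0 / speedup
    return gate_ratio * cycle_ratio


def workload_energy(shares, energies) -> float:
    """Weighted total, where `shares` are fractions of baseline cycles."""
    return sum(share * energy for share, energy in zip(shares, energies))


BASE_GATES = 100_000  # hypothetical baseline core size, not from the text
CRC_GATES = 3_000     # illustrative: gates added for the CRC instructions
CRC_SPEEDUP = 12.0    # illustrative: CRC speedup

crc = relative_energy(BASE_GATES, CRC_GATES, CRC_SPEEDUP)
# Unaccelerated code still runs on the slightly larger core (speedup 1.0).
other = relative_energy(BASE_GATES, CRC_GATES, 1.0)

# Suppose CRC takes 20% of the cycles and everything else takes 80%.
total = workload_energy([0.2, 0.8], [crc, other])
print(f"CRC alone: {crc:.3f}, whole workload: {total:.3f}")
```

In this simple model the accelerated algorithm itself uses far less energy, but the saving for the whole design depends on how large a share of the cycles that algorithm actually takes.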

Differentiation and Scalability

Your unique solution - no one can copy

Our processors are being chosen for use in everything from low-cost designs in the consumer space to high-end multi-core designs for the demanding enterprise market. Differentiation is made easier and lower cost for each customer as a natural part of product development.

Performance

Typically, adding a few hundred gates (as new instructions) can increase the performance of an algorithm by factors of 5X or 10X without noticeably increasing the power consumption. So, when more performance is required and it's not possible (or desirable) to increase the clock speed any further, adding a few critical instructions can reduce the number of cycles required—avoiding the need to add more processors or even create an offload accelerator.

When it comes to adding much higher levels of performance to support increased data rates and multiple channels, then multiple processors and accelerators are typically added. Often the processors are dedicated to a few types of tasks that each have different characteristics and benefit from instruction-level optimizations for highest efficiency. There are general control tasks as well as data operations that are specific to parts of each customer's design and not efficient on general-purpose processors.

Algorithmic

Most processors operate natively on 32-bit quantities. If the algorithm's datatypes are shorter, then some additional processing is often required to shift and mask before computation. This may add one or two cycles.
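
A minimal sketch of that shift-and-mask overhead, modeling a 32-bit register in Python. The 12-bit field, its bit offset, and the helper names are hypothetical.

```python
WORD_MASK = 0xFFFFFFFF  # model a 32-bit native register


def extract_field(word: int, offset: int, width: int) -> int:
    """Read a `width`-bit field starting at bit `offset` (shift, then mask)."""
    return (word >> offset) & ((1 << width) - 1)


def insert_field(word: int, offset: int, width: int, value: int) -> int:
    """Write `value` into the field, leaving the other bits untouched."""
    mask = ((1 << width) - 1) << offset
    return ((word & ~mask) | ((value << offset) & mask)) & WORD_MASK


packed = 0xABCD1234

# Extra work before the computation: unpack the 12-bit field.
field = extract_field(packed, offset=8, width=12)

# The computation itself (a 12-bit increment), then repack the result.
updated = insert_field(packed, 8, 12, (field + 1) & 0xFFF)

print(hex(field), hex(updated))
```

The unpack and repack steps are the overhead. The increment is the only useful work.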

If the datatypes are wider than 32 bits, then a function call is typically required to handle all the additional operations. This may add tens of cycles.
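
A sketch of what such a helper routine might do: a 64-bit addition built from 32-bit operations with explicit carry handling. The `(low, high)` word layout and the names `add64`, `to_words`, and `from_words` are assumptions for illustration.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def to_words(value: int):
    """Split a 64-bit value into (low, high) 32-bit words."""
    return value & WORD_MASK, (value >> WORD_BITS) & WORD_MASK


def from_words(words) -> int:
    """Join (low, high) 32-bit words back into one integer."""
    low, high = words
    return (high << WORD_BITS) | low


def add64(a, b):
    """Add two 64-bit values held as (low, high) pairs of 32-bit words."""
    a_low, a_high = a
    b_low, b_high = b
    low_sum = a_low + b_low
    low = low_sum & WORD_MASK
    carry = low_sum >> WORD_BITS                 # 1 if the low words overflowed
    high = (a_high + b_high + carry) & WORD_MASK  # result wraps modulo 2**64
    return low, high


x, y = 0x00000001FFFFFFFF, 0x0000000000000001
print(hex(from_words(add64(to_words(x), to_words(y)))))
```

Even this simple operation takes several native steps plus carry propagation. Wider multiplies and shifts need considerably more.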

Xtensa processors give designers the flexibility to add registers and instructions that operate in a single cycle on exactly the data size required. Increasing resolution or accuracy by using more bits in the future is easily accommodated by expanding the register width and updating the instruction itself. The instruction opcode used doesn't even need to change!

I/O Throughput

Demand for higher data throughput, driven by newer interface standards, multiple channels, or competitive pressure, can be met by:

  • Expanding the width of the existing ports
  • Adding more I/O ports to spread the load across multiple processing units

Xtensa processors give you both options, expanding the width of the ports up to 1024 bits each and adding more of them.

Product Line

OEMs often provide products in multiple markets that have different requirements or priorities. Consumer markets tend to require lower-cost products, whereas enterprise markets look for higher IOPS and reliability.

Much of the firmware that is developed can be re-used in all markets, so a common processor architecture is desirable from both a hardware- and software-development perspective.

Using processor architectures that have fixed performance/power/area points typically forces each design to carry excess capability: the next processor up in the range has to be chosen, even if it is 50% more capable than the design needs. This leads to extra cost and energy consumption compared to a processor that is just sufficient for the design.

Xtensa processors give designers the power to customize their processors to exactly fit their requirements. Any differentiating customizations are known only to the OEM and can be re-used in other Xtensa processors. This makes Xtensa-based processors ideal for use across all solid-state controller designs, where OEMs must differentiate.

Resources

Learn more about using Cadence IP in your storage products