Tensilica Xtensa 10 Customizable Processor
Small, Ultra-Low-Power 32-bit Embedded Controller

Features

- Highly efficient, small, ultra-low-power core with a 32-bit modern architecture and 5-stage pipeline
- Wide range of configurable options
- Local memories configurable with option for parity or error correcting code (ECC)
- Hardware prefetch unit
- Extensible with designer-defined application-specific instructions, execution units, register files, and I/Os
- 32-bit wire input and 32-bit wire output general-purpose I/O (GPIO) port option for peripheral control and monitoring
- 2x32-bit queue FIFO interface option for data streaming, bypassing the system bus
- Power domains for power shut off
- Semantic and memory data gating
- ARM® CoreSight™ compatible interfaces for debug and trace
- Optional IEEE 754-compliant double-precision floating point unit plus double-precision floating point acceleration
- Complete matching development tool chain automatically generated for each core
- Wide range of operating system support including Linux

Benefits

- Extremely efficient base architecture that is smaller and lower power than most other 32-bit embedded controllers
- Application-specific instruction extensions provide orders of magnitude performance improvements over traditional CPUs, in many cases eliminating the need to develop register-transfer level (RTL) blocks
- Lower verification effort with pre-verified, correct-by-construction RTL generation
- Post-silicon programmability to extend the life of any design
- Highly accurate, high-speed system simulation models automatically created for software development

- Develop, simulate, debug, and profile in one integrated development environment (IDE)
- Low leakage power design
- Dynamic power savings
- Easy integration into a CoreSight-based debug and trace infrastructure
- Double-precision operations with greater precision and wider range (than single precision)
- Ability to generate a complete matching software toolchain automatically for each core
- Mature, highly optimizing C/C++ compiler, which lets you work at the ‘C’ level for most applications

Ideal for SoC Dataplane Processing

The Cadence® Tensilica® Xtensa® 10 dataplane processing unit (DPU) is an exceptional controller built on unique Xtensa technology. It can easily be extended to outperform any other embedded processor for a specific application using a combination of configurable options and custom instructions. The Xtensa 10 processor even includes options for highly efficient digital signal processing (DSP). This makes it ideal for applications that require a combination of DSP and control.

Designers can use the Xtensa 10 DPU not only for control functions, but also for some of the finite state machine tasks that manage RTL blocks, and even for some of the RTL logic itself. This makes for a smaller, much more efficient chip design, and it significantly reduces the verification challenges associated with new RTL designs.

Create an Optimized DPU in Minutes

The Xtensa 10 DPU was designed from the start to be a basic building block in system-on-chip (SoC) designs. The Xtensa 10 processor is unlike conventional embedded processor cores: the system designer can mold the processor to fit the target application. By selecting and configuring predefined elements of the architecture and by inventing completely new instructions and hardware execution units, the designer can make the Xtensa 10 processor deliver performance levels that are orders of magnitude higher than those of other standard 32-bit cores. And you can do this in a fraction of the time it takes to develop and verify an RTL-based solution.

You can define new instructions using the Tensilica Instruction Extension (TIE) methodology, adding Verilog-like descriptions of datapaths, execution units, and register files that can deliver performance, area, and power characteristics approaching those of custom logic design.

**Feature Overview**

**Backwards-compatible ISA since 1999**
- Xtensa ISA fundamentally architected for extensibility
- Base instruction set of 80 instructions for compatibility across every Xtensa core
- Dozens of available optional blocks
- Any differentiating designer-defined instructions written since 1999 can still be re-used today

**Optional pre-defined execution units**
- 32-bit multiplier and/or 16-bit multiplier and multiplier accumulator (MAC)
- Integer divide
- IEEE-754-compliant single- and double-precision floating point unit
- Double-precision floating point acceleration
- Pre-defined 32-bit GPIO and FIFO-like queue interfaces

**Differentiate with designer-defined functions**
- Make your specific algorithm run even more efficiently by adding the instructions it needs
- Development tools automatically adapt for full support

**Natural connectivity with RTL blocks**
- Multiple custom-width I/O ports for peripheral control and monitoring
- Multiple custom-width queue interfaces to FIFOs for data streaming into and out of the processor
- Co-simulation with RTL down to the pin level in SystemC

**Highly configurable interfaces**
- Optional processor interface (PIF) to system bus, choice of 32-, 64-, or 128-bit width
- Hardware prefetch unit
- Optional high-speed Xtensa Local Memory Interface (XLMI)
- Write buffer: selectable from 1-32 entries
- Optional ARM® AMBA® AXI and AHB-Lite bridges with synchronous or asynchronous clocking
- Choice of 1-, 2-, or 4-way cache and/or local memories
- Up to 32 interrupts

**Multi-core design style support**
- Multi-core system modeling and SystemC co-simulation out of the box, fully supported within the Xtensa Xplorer IDE
- Homogeneous and heterogeneous subsystems supported
- Inter-core on-chip debug with break-in/out control
- Optional 16-bit processor ID, supporting massively parallel array architectures
- Conditional store instruction option and synchronization library provide shared memory semaphore operations and the “release consistency model” of memory access ordering

**Complete hardware implementation and verification flow support**
- Automatic generation of RTL and tailored electronic design automation (EDA) scripts for leading-edge process technologies, including physical synthesis and 3D extraction tools
- Auto-insertion of fine-grained clock gating for low power
- Hardware emulation support including automated FPGA netlist generation for rapid SoC prototyping
- Comprehensive diagnostic testbench to verify connectivity
- Formal verification support for designer-defined instructions

**High-speed, high-accuracy system simulation models automatically created**
- High-speed instruction-accurate simulator for software development
- Pipeline-modeling, cycle-accurate Xtensa instruction set simulator (ISS)
- Xtensa SystemC (XTSC) transaction-level modeling (TLM) support, including out-of-the-box multi-core simulation
- Hardware co-simulation with RTL in SystemC with pin-level XTSC

**Integrated design environment**
- Create, simulate, debug, and profile whole designs in one tool—Xtensa Xplorer is a high-productivity IDE
- Tenth-generation software development tools target each processor. The advanced Xtensa C/C++ compiler (XCC) includes optimizations for base, optional, and designer-defined instructions
- Vectorization Assistant directs the programmer to areas of the application that can benefit most from modifications to enable better vectorization
- Multi-core subsystem design and simulation support
- Custom data display formatting for easy debug of vector and fixed-point data types as well as bit-mapped status and control

**Multi-core debug and ease of use**
- Interfaces to support ARM CoreSight infrastructure
- Multi-core on-chip debug (OCD) support
- Multi-core debug improvements, including sharing of a single trace memory across multiple TRAX modules, hardware/software support for synchronous restart/resume, cross triggering, and more

**Dynamic and leakage power improvements**

• New power shut-off (PSO) feature allows Xtensa DPUs to be completely powered off. To help achieve low leakage, designs can now be divided into multiple “power domains,” and each power domain operates at the same voltage and can be shut down and powered up individually.
• New dynamic power-saving features including semantic and data power gating

**Robust operating system support**

• Use Mentor Graphics’ Nucleus+, Express Logic’s ThreadX, Micrium’s µC/OS-II, or Linux

**Efficient Base Architecture**

The Xtensa 32-bit architecture features a compact instruction set optimized for embedded designs. The base architecture has a 32-bit arithmetic logic unit (ALU), up to 64 general-purpose physical registers, six special-purpose registers, and 80 base instructions, including improved 16- and 24-bit (rather than 32-bit) RISC instruction encoding. Key features include:

• A wide range of configurable options to ensure you get just the logic you need to meet your functional and performance requirements
• The ability to modelessly intermix 16- or 24-bit instructions for lowest code and performance overhead

• Efficient 5-stage pipeline

• Configurable local memories with optional parity or ECC
• Optional hardware prefetch, which reduces memory latencies
• The ability to cache up to 32KB and up to 4-way set associativity on both instruction and data sides
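
To make the cache figures above concrete, the following minimal sketch shows how a 32KB, 4-way set-associative cache would split a 32-bit address into tag, set index, and byte offset. The 32KB size and 4-way associativity come from the list above; the 32-byte line size and all names (`split_address`, `cache_addr_t`) are assumptions made for illustration and do not describe the actual Xtensa cache implementation.

```c
#include <assert.h>
#include <stdint.h>

/* 32KB and 4 ways come from the text; the 32-byte line is an assumption. */
#define CACHE_BYTES 32768u
#define CACHE_WAYS  4u
#define LINE_BYTES  32u
#define CACHE_SETS  (CACHE_BYTES / (CACHE_WAYS * LINE_BYTES))

typedef struct {
    uint32_t tag;     /* compared against the tag stored in each way */
    uint32_t set;     /* selects one set; every way in it is checked */
    uint32_t offset;  /* byte position within the cache line */
} cache_addr_t;

/* Split a byte address into the three fields a set-associative lookup uses. */
static cache_addr_t split_address(uint32_t addr)
{
    cache_addr_t parts;

    /* Sanity check: the geometry must divide evenly into sets. */
    assert(CACHE_BYTES % (CACHE_WAYS * LINE_BYTES) == 0u);

    parts.offset = addr % LINE_BYTES;
    parts.set    = (addr / LINE_BYTES) % CACHE_SETS;
    parts.tag    = addr / (LINE_BYTES * CACHE_SETS);
    return parts;
}
```

With these assumed parameters the cache has 256 sets, so two addresses that are 8KB apart map to the same set and compete for its four ways.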

**Base instruction set compatibility**

Configurability of a Tensilica processor core never compromises the underlying base Xtensa instruction set architecture (ISA), thereby ensuring availability of a robust ecosystem of third-party application software and development tools. All configurable, extensible Xtensa processors are always compatible with major operating systems, debug probes, and ICE solutions. For each processor, the automatically generated complete software development toolchain includes an advanced IDE based on the Eclipse framework, a world-class C/C++ compiler, a cycle-accurate SystemC-compatible instruction set simulator, and the full industry-standard GNU toolchain. Tensilica uses an ISA that has been backwards compatible since its introduction in 1999. It uses a base instruction set of 80 instructions and was fundamentally architected for extensibility. Application code written back in 1999 will still run on the Xtensa LX5 processor today. Any differentiating designer-defined instructions from earlier designs can be reused today.

**Smaller code size**

The Xtensa 10 DPU can modelessly issue 24-bit and 16-bit instructions, leading to 25-50% better code density and, therefore, smaller memories than mixed 32- and 16-bit architectures. Since memories typically dominate SoC area, this code density advantage translates into significant SoC area savings.

Figure 2: Xtensa 10 DPU showing standard, optional, and designer-defined blocks

www.cadence.com

**Powerful base ISA**

The Xtensa ISA includes powerful compare-and-branch instructions and zero-overhead loops, which allow the compiler to generate tight, optimized loops. It also provides bit manipulations including funnel shifts and field-extract operations that are critical for applications such as networking that process the fields in packet headers and perform rule-based checks.
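
As a rough illustration of the bit manipulations mentioned above, the sketch below writes a field extract and a funnel shift in portable C. It only shows what the operations compute; the function names are hypothetical, and whether a compiler maps them onto single Xtensa instructions depends on the toolchain and configuration. The packet-header helpers assume the first 32-bit word of an IPv4 header, already converted to host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `width` bits starting at bit `lsb` (bit 0 is the least significant). */
static uint32_t extract_field(uint32_t word, unsigned lsb, unsigned width)
{
    assert(width >= 1u && width <= 32u && lsb + width <= 32u);
    uint32_t mask = (width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (word >> lsb) & mask;
}

/* Funnel shift: treat hi:lo as one 64-bit value, shift it right by
 * `shift` (0..31), and return the low 32 bits of the result. */
static uint32_t funnel_shift_right(uint32_t hi, uint32_t lo, unsigned shift)
{
    assert(shift < 32u);
    uint64_t pair = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(pair >> shift);
}

/* Illustrative packet-header checks on the first word of an IPv4 header. */
static unsigned ipv4_version(uint32_t first_word)
{
    return extract_field(first_word, 28u, 4u);
}

static unsigned ipv4_header_words(uint32_t first_word)
{
    return extract_field(first_word, 24u, 4u);
}
```

A rule-based check such as "version is 4 and the header has no options" then reduces to two field extracts and two compares, which is the pattern that compare-and-branch instructions serve well.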

**Extensible ISA**

One of the fundamental technology innovations in the Xtensa processor is the ability to easily and seamlessly add new instructions into the processor's datapath. The associated C data types, software tool chain support, and the EDA scripts required to synthesize the processor are all generated automatically, just as if they had been there from the start. The specification of this new datapath and associated instructions and C data types is written in the Tensilica Instruction Extension (TIE) language, which is explained in more detail in a later section.

**Highly configurable functionality**

Select from click-box options to add functionality to your processor and evaluate performance improvements in a matter of minutes. Basic interface options include:

- Processor interface (PIF)
  - Width: 32/64/128-bit
  - Optional “no PIF” configuration
- AMBA AXI and AHB-Lite bridges with synchronous or asynchronous clocking
- 16-bit processor ID
- Inbound DMA
- XLMI high-speed local interface
- Big-Endian/Little-Endian byte ordering
- On-chip debug port (IEEE 1149.1 compliant)
- Trace port signals
- Up to 32 interrupts with up to 7 levels of priority plus a separate Non-Maskable Interrupt level
- Write buffer: selectable from 1 to 32 entries
- 2x32-wire GPIO ports for direct control and monitoring of peripherals
- 2x32-bit queue interfaces for streaming data into and out of the processor via FIFOs
- Single 16-bit MAC
- 16- or 32-bit multipliers
- Low-area integer divider
- IEEE 754-compliant single-/double-precision scalar floating-point coprocessor
- Double-precision floating-point accelerator

Memory subsystem options include:

- Multibank RAM support
- Up to 2 local instruction and data RAMs, and ROMs up to 8 Mbytes each
- Local data and instruction caches
  - Up to 4-way set associative
  - Up to 32KB
  - Write-back and write-through cache write policy
- 4-way cache plus local memories
- Memory management options including:
  - Region protection
  - Region protection with translation
  - MMU for the Linux operating system
  - Memory management unit (MMU) with translation lookaside buffers (TLBs), including no-execute bit security support
- Optional parity or ECC for all local memories

![Diagram showing configuration options for the Xtensa 10 DPU](image)

**Figure 3:** Offers pre-verified major configuration options for the Xtensa 10 DPU

Configuration options bypass the bus for fast, efficient data I/O

Two configuration click-box options allow Xtensa 10 processors to very quickly communicate data, control, or status information with RTL blocks or other Xtensa processors.

The GPIO32 configuration option adds two 32-wire ports to the Xtensa 10 processor (one input, one output) to quickly control and monitor peripherals or other logic in the system.

The QIF32 configuration option adds two 32-bit queue interfaces for FIFO-like data streaming into and out of the processor. The input queue presents a familiar pop/empty/data interface to external logic, while the output queue presents a similar push/full/data interface. All interactions with the Xtensa 10 processor pipeline are automatically implemented when the option is selected.

These options are accessed as registers in the processor, so no separate load/store is required to operate on the data.
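
The FIFO-like handshake described above can be sketched as a small software model. This is not the hardware interface or any Cadence API: it only mimics the behavior of a 32-bit queue in which a push is refused while the queue is full and a pop is refused while it is empty. The depth of 4 and all names (`fifo32_t`, `fifo_push`, `fifo_pop`) are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4u  /* hypothetical depth; the text does not give one */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    unsigned head;   /* index of the next word to pop */
    unsigned count;  /* number of words currently stored */
} fifo32_t;

static void fifo_init(fifo32_t *f)        { f->head = 0u; f->count = 0u; }
static bool fifo_full(const fifo32_t *f)  { return f->count == FIFO_DEPTH; }
static bool fifo_empty(const fifo32_t *f) { return f->count == 0u; }

/* Producer side: the word is accepted only when the queue is not full. */
static bool fifo_push(fifo32_t *f, uint32_t word)
{
    if (fifo_full(f)) {
        return false;
    }
    f->data[(f->head + f->count) % FIFO_DEPTH] = word;
    f->count++;
    return true;
}

/* Consumer side: a word is delivered only when the queue is not empty. */
static bool fifo_pop(fifo32_t *f, uint32_t *word)
{
    assert(word != NULL);
    if (fifo_empty(f)) {
        return false;
    }
    *word = f->data[f->head];
    f->head = (f->head + 1u) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

In the model a refused push or pop simply returns `false`; how the real pipeline behaves when a queue is full or empty is not described in this section.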

Add Flexibility and Extensibility to Your SoC Designs with Xtensa Processors

General-purpose processors offer fixed options for memory size, cache size, and bus interface. Performance is generally proportional to the clock speed. Beyond that, application code optimization or a move to the next-generation processor is required to get incremental performance benefits.

Cadence offers the unique ability to add flexibility and longevity to your SoC designs through software programmability, as well as differentiation through processor implementations tailored for the specific application. You can now design a processor whose functions, especially its instruction set, can be extended to include features never considered or imagined by designers of the original processor, all using the TIE language.

The TIE language can be used to describe instructions, registers, execution units, and I/Os that are then automatically added to the processor. TIE is a Verilog-like language used to describe desired instruction mnemonics, operands, encoding, and execution semantics. TIE files are inputs to the Xtensa Processor Generator. The Generator automatically builds the processor and the complete software toolchain that incorporates all configuration options and new TIE instructions. The base instruction set remains for maximum compatibility with third-party development tools and operating systems.

The TIE language unlocks the true power of the Xtensa DPU. It lets you get orders of magnitude performance increases for your applications and create differentiation. Extensibility with the Xtensa DPU allows features to be added or adapted in any form that optimizes the processor’s cost, power, and application performance.

Flexibility - add just what you need

Just as you can choose from a set of predefined functional options to improve processor performance, you can now create instructions that can speed up standard or proprietary algorithms. Using the tools provided, application hot spots can be identified and additional instructions created to process these hot spots more efficiently, without the need to increase the clock frequency or re-write a lot of the software.
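
To picture what a hot spot might look like, consider the plain-C kernel below, a sum of absolute differences over 8-bit samples. The kernel is an assumed example, not one taken from Cadence material: profiling a video or imaging application could plausibly flag a loop like this, and the per-element step in `abs_diff_u8` is the kind of work a designer-defined instruction could absorb. No TIE code is shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-element step: a candidate for a designer-defined instruction
 * (hypothetical choice made for this illustration). */
static uint32_t abs_diff_u8(uint8_t a, uint8_t b)
{
    return (a > b) ? (uint32_t)(a - b) : (uint32_t)(b - a);
}

/* Hot loop: sum of absolute differences over n 8-bit samples. */
static uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    uint32_t sum = 0u;
    for (size_t i = 0u; i < n; i++) {
        sum += abs_diff_u8(a[i], b[i]);
    }
    return sum;
}
```

Because the C source stays the same, the software can be profiled before and after an extension is added to judge whether the new instruction pays for its area.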

Differentiate—make a processor that’s uniquely your own

With fixed-function general-purpose processors, differentiation is often limited to the algorithm implementation itself. General-purpose processors are good at general-purpose computing, but not so good at any specific algorithm. Xtensa processors give you the opportunity to differentiate through the ability to implement algorithms more efficiently by designing hardware that will accelerate your particular algorithm. This means that your design will be almost impossible to copy, as only your custom processor will reach the performance required on the same software implementation.

Rapid design development, simulation, debug, and profiling

The Xtensa Xplorer™ integrated design environment (IDE) serves as the GUI for the entire design experience. From the Xtensa Xplorer IDE, if you have existing application software code, you can profile your application, identify hot spots, decide on configuration options, and add new instructions and execution units to optimize performance and generate a new processor—all within a matter of hours. No other IP provider puts such flexibility directly into the hands of the designer with a tool that integrates software development, processor optimization, and multiple processor SoC architecture in one IDE.

Hardware designers now have new options for implementing algorithms. Interfaces can be added to the processor to offer direct, deterministic connectivity to SoC logic. With the GPIO port and queue interface options, you can stream data into or out of the processor. This direct connectivity with the rest of the SoC offers great control and predictable bandwidth. The simple C programs needed to control the Xtensa processor can be written and debugged within the Xtensa Xplorer IDE.

The Xtensa Processor Generator creates a complete hardware design with matching software tools, including a mature, world-class compiler, a cycle-accurate SystemC-compatible instruction set simulator (ISS), and the full industry-standard GNU toolchain.

Figure 6: Proven methodology automates the creation of customized processors and matching software tools

**Hardware Development**

Hardware designers can profile, compare, and save many different processor configurations. Use the ISS to simulate a single processor or, for multiple-processor subsystems, choose Cadence’s Xtensa Modeling Protocol (XTMP) or Xtensa SystemC (XTSC) modeling tools.

The Xtensa Xplorer IDE serves as the gateway to the Xtensa Processor Generator. Once a processor configuration is finalized, the Xtensa Processor Generator creates the automatically verified Xtensa processor to match all of the configuration options and extensions you have defined, in about an hour. A full software toolchain that matches all of the processor modifications is created at the same time. See the Processor Developer’s Toolkit product brief for more information.

**Complete hardware implementation and verification flow support**

- Automatic generation of RTL and tailored EDA scripts for leading-edge process technologies, including physical synthesis and 3D extraction tools
- Auto-insertion of fine-grained clock gating, which delivers ultra-low power
- Hardware emulation support including automated FPGA netlist generation
- Comprehensive diagnostic testbench
- Formal verification support for designer-defined functions
- Pipeline-modeling, cycle-by-cycle accurate Xtensa instruction set simulator (ISS)
- System modeling capabilities with optional XTMP and XTSC simulation environments
- Multiple-processor on-chip debug capable with break-in/out control
- Hardware co-simulation in SystemC with Cadence’s pin-level XTSC connectivity to RTL
- XTSC transaction-level modeling support, including out-of-the-box multi-core co-simulation

Figure 7: Xtensa Xplorer can display valuable information including performance comparisons, instruction sizes, and processor size, area and power

Software Development

The Xtensa Software Developer’s Toolkit (SDK) provides a comprehensive collection of code generation and analysis tools that speed the software application development process. The Eclipse-based Xtensa Xplorer GUI serves as the cockpit for the entire development experience and also provides powerful visualization tools to aid application optimization.

The entire Xtensa software development toolchain, along with simulation models, RTOS ports, optimized C libraries, and more, is automatically generated by the Xtensa Processor Generator. This ensures that all the software tools (such as the compiler, linker, assembler, debugger, and instruction set simulator) always match and are tuned exactly to the custom processor hardware.

Complete software development tools

- Mature, highly optimizing C/C++ compiler (XCC) that rivals hand-coded assembly applications on other processors
- GNU-based assembler and linker
- Pipeline-modeled, cycle-accurate ISS
- High-speed (40-80X), instruction-accurate TurboXim™ simulator, which speeds software development
- XTMP and XTSC for multiple processor simulation and modeling
- Debug offers full GUI and command line support for single and multiple processor designs
- Profiling views of processor pipeline utilization and of time spent in functions across multiple processors, allowing “what if” comparisons
- Vectorization Assistant, which identifies code that could not be vectorized and explains why, helping the programmer modify the code so that it can be vectorized
- Support for major operating systems including Mentor Graphics’ Nucleus Plus, Express Logic’s ThreadX, Micrium’s µC/OS-II, Sophia Systems’ µTRON, and open-source Linux

Ideal for Applications Where Low Power is Critical

Power is often the key issue in an SoC design. Many techniques are employed to reduce power consumption, both in the base hardware and in the configuration options, allowing more control over system and memory resources. Xtensa processors consistently consume less power than other licensable embedded CPUs at equivalent gate counts.

Insertion of fine-grained clock gating for every functional element is automated, including those defined by the designer. This automation gives the Xtensa DPUs a significant advantage over RTL design where manual, error-prone post-layout tuning of clock circuits is often required.

Accessing local memories is one of the most power-consuming activities. Xtensa LX5 processors eliminate any unnecessary local memory interface activation if that memory is not directly addressed by the processor. With Xtensa LX5, you can now do semantic and memory data gating to save dynamic power.

Figure 8: Xtensa Xplorer shows debug/trace, profiling of pipeline utilization, and a cycle comparison for a multiple core simulation

As process geometries shrink, leakage power consumes a larger portion of the total power budget. To substantially reduce leakage power, Xtensa LX5 provides options during processor configuration that will:

- Instantiate a power control module (PCM) in the Xtmem level of design hierarchy
- Specify the number of power domains within the design and their operation via industry-standard power format files

Implementation of these energy-saving techniques is automated by the Xtensa Processor Generator.

The designer can configure the external data bus width and internal local memory data widths independently. This allows system-level power optimizations depending on whether the processor is constrained by external or internal instruction and data access.

**Multi-processor features and debug options**

Placing multiple processors on the same IC die introduces significant complexity in SoC software debugging.

All versions of the Xtensa processor have certain optional PIF operations that enhance support for multi-processor (MP) systems.

Xtensa debug features include:

- Interfaces to support ARM CoreSight infrastructure
- Multi-core on-chip debug (OCD) support
- Multi-core debug improvements, including sharing of a single trace memory across multiple TRAX modules, hardware/software support for synchronous restart/resume, cross triggering, and more

You can access these debug functions via:

- JTAG
- APB
- the Xtensa core itself

Some SoC designs use multiple Xtensa processors that execute from the same instruction space. The processor ID option helps software distinguish one processor from another via a PRID special register.
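
A minimal sketch of how shared code might use the processor ID follows. It assumes a GCC-style toolchain in which the `PRID` special register can be read with the `rsr.prid` assembler mnemonic, and it assumes the SoC designer has numbered the cores 0 to N-1; neither assumption is confirmed by this document. The names `read_prid`, `work_slice`, and `host_stub_prid` are hypothetical, and the stub exists only so that the example also builds on a non-Xtensa host.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in value used only when building for a non-Xtensa host. */
static uint32_t host_stub_prid = 0u;

/* Return this core's processor ID. */
static uint32_t read_prid(void)
{
#if defined(__XTENSA__)
    uint32_t id;
    /* Assumed syntax: read special register PRID into a general register. */
    __asm__ __volatile__("rsr.prid %0" : "=r"(id));
    return id;
#else
    return host_stub_prid;
#endif
}

/* Give core `id` (0..num_cores-1) a contiguous slice [begin, end) of n_items. */
static void work_slice(uint32_t id, uint32_t num_cores, uint32_t n_items,
                       uint32_t *begin, uint32_t *end)
{
    assert(num_cores > 0u && id < num_cores && begin && end);
    *begin = (uint32_t)(((uint64_t)id * n_items) / num_cores);
    *end   = (uint32_t)((((uint64_t)id + 1u) * n_items) / num_cores);
}
```

Every core runs the same binary, calls `read_prid()` once at start-up, and then processes only its own slice, so no per-core build is needed.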

The break-in/break-out option for the Xtensa Debug Module simplifies multi-core debugging. This capability enables one Xtensa processor to selectively communicate a break to other Xtensa processors in a multiple-processor system.

In addition to MP debug, it is also possible to non-intrusively trace multiple processors if they are configured with the trace extraction and analysis tool, TRAX. TRAX, which is detailed in the Debug Guide, is a collection of hardware and software components that provides visibility into the activity of running processors using compressed execution traces. The ability to capture real-time activity in a deployed device or prototype is particularly valuable for MP systems where there are a large number of interactions between hardware and software.

When multiple processors are used in a system, some sort of communication and synchronization between processors is required. The Xtensa Multiprocessor Synchronization configuration option provides ISA support for shared-memory communication protocols.
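
One common shared-memory protocol is a spinlock built on an atomic compare-and-swap. The sketch below uses standard C11 atomics rather than any Xtensa-specific instruction; how a given compiler lowers these operations onto the synchronization option is an assumption that is not verified here. Acquire ordering on lock and release ordering on unlock make writes performed inside the critical section visible to the next core that takes the lock. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_uint locked;  /* 0 = free, 1 = held */
} spinlock_t;

static void spin_init(spinlock_t *l)
{
    assert(l != NULL);
    atomic_init(&l->locked, 0u);
}

/* Try once: atomically change 0 -> 1; succeeds only if the lock was free. */
static bool spin_try_lock(spinlock_t *l)
{
    unsigned expected = 0u;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1u,
        memory_order_acquire,   /* ordering on success */
        memory_order_relaxed);  /* ordering on failure */
}

/* Busy-wait until the lock is obtained. */
static void spin_lock(spinlock_t *l)
{
    while (!spin_try_lock(l)) {
        /* spin */
    }
}

/* Release the lock; earlier writes become visible to the next owner. */
static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0u, memory_order_release);
}
```

A counting semaphore can be built the same way by replacing the 0/1 flag with a counter that is decremented through a compare-and-swap loop.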

The Performance Monitor module is used to count performance-related events, such as cache misses. Accessing the counts through JTAG or APB is non-intrusive, but it is also possible to configure an interrupt to software running on the Xtensa core.

**Specifications**

Because they are highly customizable, Xtensa 10 DPUs can run very efficiently at low clock frequencies and very fast at over 1GHz.

The latest EDA tools, process flows, and other input are tracked to provide detailed performance information. For the latest data, please contact your local representative.