Kishan Jainandunsing
(February 2006)
Multi-core architectures are becoming the next important evolutionary step in processor design as clock frequency alone has run out of headroom for further performance gains. In this article we take a closer look at the technology and its relevance to embedded applications.
Summary
Multi-core architectures are becoming the next important evolutionary step in processor design, as clock frequency alone has run out of headroom for further performance gains. Raising clock frequency for higher performance has reached its limits in terms of heat dissipation, so much so that power density has nearly reached that of a nuclear reactor. The ways out of this dilemma are parallel and pipelined processing, paradigms that were adopted long ago in scientific computing and high-performance signal processing and are now finding mainstream use in single-chip devices that hold more than one processor core. The significant performance gains delivered by these new multi-core devices open up new possibilities for embedded computing applications.
No Two Multi-Core Architectures are the Same
By current count all major CPU manufacturers have multi-core products in the market. They include AMD, IBM, Intel and Freescale. Their multi-core products have one thing in common – they run mainstream general-purpose operating systems, such as Microsoft Windows and/or Linux. But that is where the similarities end. Architecturally they are very different from each other, as each manufacturer has a different philosophy on performance gains through parallel processing. Table 1 provides a summary.
SMP = Symmetric Multi-Processing
AMP = Asymmetric Multi-Processing
SIMD = Single Instruction, Multiple Data
VM = Virtual Machine
Table 1. Summary of parallel processing implementations by the major CPU manufacturers
In the x86 camp the programming models and targets are the same for AMD and Intel. The differences lie beneath the surface, in the way the cores are connected to each other and to peripheral subsystems. AMD deploys a crossbar switch between the cores, memory controller and I/O, a technology that AMD brands as Direct Connect Architecture. The crossbar and memory controller are integrated with the cores on the same chip, and integrated HyperTransport links can be used to connect CPUs directly to each other in a multi-processor configuration and to companion chips. Intel adheres to a conventional frontside bus interconnect with an external memory controller hub (MCH), and relies on the frontside bus for multi-processor configurations. Assessing the impact of these differences on an application requires a careful analysis of price, performance, power dissipation and board area. Figure 1 illustrates both architectures.
Figure 1. The AMD and Intel approach
In the PowerPC camp the differences are more pronounced. The programming models and targets show less overlap. Under the hood, the architectures of the IBM Cell Broadband (CB) processor and the Freescale MPC8641D PowerPC are even more distinct. The MPC8641D PowerPC has a conventional shared-bus architecture and qualifies as an SoC because it integrates a large number of high-speed I/O interfaces, such as 1 Gb Ethernet, Serial RapidIO and PCI Express. The CB processor consists of up to 8 synergistic processing elements (SPEs) combined with a PowerPC element (PPE). Central to the architecture is a memory-coherent bus made up of four rings, called the Element Interconnect Bus (EIB). The EIB interconnects the SPEs, the PPE, off-chip memory and external I/O. Each ring is a 16-byte-wide bus. An arbiter arbitrates between the different requestors and decides which ring is granted to each one: the memory controller has overriding priority, and all other subsystem requests are scheduled in round-robin fashion. CB processors can be meshed in a multi-processor grid by connecting them through their FlexIO interconnects. Figure 2 illustrates both architectures.

Figure 2. The IBM and Freescale approach
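The arbitration policy described for the EIB (the memory controller always wins, everyone else takes turns) can be illustrated with a small model. The following is only a sketch under simplifying assumptions: the class name RingArbiter, the requestor names and the one-grant-per-call behaviour are hypothetical, ring selection is ignored, and the real arbiter is considerably more involved.

```python
from collections import deque


class RingArbiter:
    """Toy model of fixed-priority plus round-robin arbitration.

    Assumptions (not from the article): one grant per call, requestor
    names are arbitrary strings, and the choice of ring is ignored.
    """

    def __init__(self, requestors, priority="memory_controller"):
        self.priority = priority
        # Round-robin order for every requestor except the priority one.
        self.order = deque(r for r in requestors if r != priority)

    def grant(self, pending):
        """Return the requestor granted this cycle, or None if nobody asks."""
        if self.priority in pending:
            return self.priority  # overriding priority
        for _ in range(len(self.order)):
            candidate = self.order[0]
            self.order.rotate(-1)  # candidate moves to the back of the queue
            if candidate in pending:
                return candidate
        return None


arbiter = RingArbiter(["memory_controller", "PPE", "SPE0", "SPE1"])
print(arbiter.grant({"SPE0", "SPE1"}))               # SPE0
print(arbiter.grant({"SPE0", "SPE1"}))               # SPE1
print(arbiter.grant({"SPE0", "memory_controller"}))  # memory_controller
```

Because a granted requestor moves to the back of the queue, two SPEs that request continuously alternate, while a pending memory controller request pre-empts both without disturbing their order.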
Left out of the discussion here are the MPCore and MK4 multi-core offerings from ARM and MIPS, respectively. These are IP blocks rather than off-the-shelf products, and are targeted at ASSPs.
Implications of Multi-Core for Embedded Applications
Multi-core processors can bring tremendous performance gains to data- and compute-intensive applications. The extent to which these processors improve performance depends on the application and on the architecture of the processor. An architecture that is optimized for the SIMD programming model, like the IBM Cell Broadband processor, can easily yield a significantly larger performance gain than an SMP processor in data- and compute-intensive, real-time signal processing applications. The situation is reversed when the application is less homogeneous than a signal processing application and requires many different tasks or threads to run more or less non-deterministically.
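The contrast between the two programming models can be made concrete with a small sketch. It shows only the shape of each model, not its performance: the function names simd_scale and run_tasks, the lane count and the worker count are illustrative assumptions, and plain Python will not actually execute the data-parallel loop on vector hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def simd_scale(samples, gain, lanes=4):
    """Data-parallel shape: one operation applied to `lanes` samples at a time."""
    out = []
    for i in range(0, len(samples), lanes):
        chunk = samples[i:i + lanes]
        out.extend(x * gain for x in chunk)  # same operation, many data items
    return out


def run_tasks(tasks, workers=2):
    """Task-parallel shape: unrelated callables handed to a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in futures]  # results in submission order


print(simd_scale([1, 2, 3, 4, 5], 2))
print(run_tasks([lambda: "logged", lambda: sum(range(10))]))
```

A signal processing kernel resembles the first function: every sample receives identical treatment, so it maps well onto SIMD hardware. An application made of many dissimilar tasks resembles the second, which is where an SMP design tends to be the better fit.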
Another area where multi-core processors may provide an important opportunity is in functional integration. For instance, some cores could run a security, monitoring or data acquisition application in the background, while the remaining cores run interactive applications, without the background task impacting the performance of the foreground applications and vice versa. The final result is therefore a smaller, more functional product than would have been possible with a single-core CPU solution.
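One way to picture this kind of functional integration is to reserve a subset of cores for the background function and leave the rest to the interactive applications. The sketch below assumes a Linux-style CPU affinity interface. The helper names partition_cores and pin_current_process are hypothetical, and a real product would also have to consider interrupt routing and shared caches.

```python
import os


def partition_cores(cores, n_background):
    """Split core IDs into a background set and a foreground set."""
    cores = sorted(cores)
    if not 0 < n_background < len(cores):
        raise ValueError("need at least one core in each partition")
    return set(cores[:n_background]), set(cores[n_background:])


def pin_current_process(core_set):
    """Restrict the calling process to core_set where the OS supports it.

    Returns True if the affinity was applied, False otherwise.
    os.sched_setaffinity is only available on some platforms (e.g. Linux).
    """
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, core_set)  # pid 0 means the calling process
        return True
    return False


background, foreground = partition_cores(range(4), n_background=1)
print(background, foreground)  # {0} {1, 2, 3}
```

A monitoring or data acquisition process would call pin_current_process(background) at start-up, and the interactive applications would do the same with the foreground set, so that neither group competes with the other for CPU time.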
Multi-Processing With Multi-Core COM Express Modules
The computer-on-module advantages of time-to-market, scalability and flexible footprint make COM Express modules very attractive in multi-processor embedded applications. And since the COM Express specification is processor- and chipset-agnostic, it is possible to implement modules with whichever of the processors mentioned here best fits the target class of applications.
Using the PCI Express links on the modules as the standardized point-to-point interconnects between the processors, it is possible to implement scalable configurations using ASI (Advanced Switching Interconnect) or non-transparent PCI Express switches (see for instance PLX) on the carrier board. See Figure 3.
Figure 3. COM Express multi-processing configuration
Note that the modules need not contain the same type of CPU. In fact, it is perfectly possible to mix modules with CPUs based on different architectures and programming models, which makes it possible to allocate the best resource to a particular task.
Conclusions
Multi-core processors can bring tremendous gains in performance to data- and compute-intensive embedded applications, including industrial and medical imaging, radar, surveillance, training and simulation. At the same time they provide an opportunity to implement more functional products by integrating in a single device several functions that would otherwise have needed several systems. The computer-on-module advantages of time-to-market, scalability and flexible footprint make COM Express modules very attractive in these applications. And with PCI Express as the standard fabric interconnect, COM Express modules easily scale into homogeneous or heterogeneous multi-processing configurations.