How Much Better?
Dual-core x86 processors are available today and quad-core processors are appearing on the horizon. We all know that two heads are better than one, but how much better? Standard benchmarks show improvements such as double the compute performance and a four-fold improvement in performance/watt over the single-core versions.
The latest-generation Dual-Core Intel® Xeon® processor LV 2.0 GHz outperforms its predecessor, the 2.8 GHz low-voltage Xeon processor, by a factor of 2.2 as measured by the industry-standard SPECint_rate_base2000 benchmark*, shown in Figure 1. Performance per watt improves by a factor of four, helped by a reduction in processor power consumption from 85 watts to 31 watts. This is accomplished by running multiple cores at reduced frequency and voltage, a two-pronged approach that significantly lowers power consumption. For more information on the benchmark conditions, please see http://www.intel.com/design/intarch/prodbref/31137502.pdf.
Figure 1. Comparison of single and dual core Intel® Xeon™ Processors
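The effect of lowering both voltage and frequency can be sketched with the textbook approximation that dynamic CMOS power scales with V² × f. This is a rough model that ignores leakage power, and the 80% scaling factors below are illustrative assumptions, not the operating points of the processors in Figure 1.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float) -> float:
    """Dynamic power relative to baseline, assuming P is proportional to V^2 * f."""
    return voltage_scale ** 2 * frequency_scale


# Hypothetical example: each core runs at 80% of the baseline voltage and frequency.
per_core = relative_dynamic_power(0.8, 0.8)
two_cores = 2 * per_core
# Upper bound on throughput if the workload splits perfectly across both cores.
ideal_throughput = 2 * 0.8

print(f"per-core power:   {per_core:.3f} x baseline")
print(f"two-core power:   {two_cores:.3f} x baseline")
print(f"ideal throughput: {ideal_throughput:.1f} x baseline")
```

Under these assumed factors, two slower cores draw roughly the same dynamic power as one baseline core while offering up to 1.6 times the throughput, provided the software can keep both cores busy.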
How It Was Done
Complementing the integration of more execution cores, multi-core processors continue to evolve architecturally. Advances in cache memory, the fast static memory that feeds the CPU execution unit with data and instructions, can significantly increase compute performance and reduce power dissipation. One such advance allows execution cores to share cache data, eliminating the need to store and manage multiple copies of the same data. Dynamic cache allocation is a feature that assigns cache memory according to the amount needed by each execution core. Therefore, if one core is idle, the other core can use the entire cache to step up its performance. Continuing the idle core example, the power consumed by a lightly loaded execution core can be reduced by controlling the voltage and frequency to each core independently. Software written to make use of multiple execution cores and new processor features can enable multi-fold improvements in compute performance and performance per watt.
Some Effort Required
Reaping the rewards from multi-core processors typically requires more development effort than in the past. Previously, software took advantage of faster single-core CPUs with little or no software changes – the software just ran faster. Today, the performance gain from multi-core processors will vary significantly from application to application and from OEM system to OEM system. Some applications are naturally better-suited for parallel processing and some OEMs are determined to make the software changes required to maximize performance.
To exploit multi-core processors, multiple software processes need to run at the same time. Perhaps the coarsest parallelism is multi-tasking, where full applications or discrete functions, such as print spooling, run concurrently on different cores. A finer level of parallelism is threading, where common application models are either task-parallel or data-parallel. The task-parallel model is applicable when independent threads can readily service separate functions (e.g., breaking up TCP/IP packet processing such that inherently independent tasks are performed in parallel on the same data set). The data-parallel threading model is used for compute-intensive loops, where the same independent operation (e.g., comparing a word in a file against a dictionary) is performed repeatedly. In both cases, multiple threads from a single application execute in tandem across multiple cores.
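The two threading models can be sketched as follows. This is a minimal illustration: the word list, the function names and the checksum task are hypothetical. In standard CPython, threads running CPU-bound work are serialized by the global interpreter lock, so the sketch shows the structure of each model rather than a real multi-core speedup; a process pool would be one way to spread such work across cores.

```python
from concurrent.futures import ThreadPoolExecutor

DICTIONARY = {"core", "cache", "thread", "packet"}  # hypothetical word list


def misspelled(words):
    """Data-parallel work item: check one chunk of words against the dictionary."""
    return [w for w in words if w.lower() not in DICTIONARY]


def data_parallel_check(words, workers=2):
    """Split the word list into chunks and run the same check on every chunk."""
    chunks = [words[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(misspelled, chunks)
    return sorted(w for chunk in results for w in chunk)


def task_parallel(packet: bytes):
    """Task-parallel: two different, independent functions on the same data."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        checksum = pool.submit(lambda: sum(packet) % 256)
        length = pool.submit(len, packet)
        return checksum.result(), length.result()


print(data_parallel_check(["core", "cahce", "Thread", "pakcet"]))
print(task_parallel(b"\x01\x02\x03"))
```

In the data-parallel function every worker runs the same operation on a different slice of the input. In the task-parallel function each worker runs a different operation on the same input.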
Parallel Software Examples
Many application segments that benefit from parallelism, such as storage and image rendering, are already threaded and run on symmetric multi-processing (SMP) systems. The operating system (OS) is typically threaded too, so both the OS and the application software workloads are spread amongst multiple cores. For these application segments, the heavy lifting to parallelize software is already done and the migration from using multiple processors to multi-core processors is natural and relatively simple.
For some embedded applications, such as self-checkout registers, systems typically have ample performance: the bottleneck is the customers, not the registers. This application can still benefit from additional performance when systems with multi-core processors run two, three or four self-checkout registers at the same time. The performance and responsiveness from multiple cores enable the OEM to ship one compute board instead of two or more, saving the cost of multiple boards, hard drives, memory modules and OSes. This application is also well suited to task-parallelism. Tasks that can be done in parallel include reading bar codes, printing entries on the receipt, looking up prices and calling the bank for credit.
Industrial control applications require real-time system response to control equipment. Multi-core processors provide an extra degree of freedom by allowing application software to provision tasks among multiple execution cores for optimal performance. For those embedded applications that would benefit from a faster and more repeatable real-time response, it is possible to run real-time tasks on a dedicated execution core, without interference from other tasks that would otherwise compete for CPU resources. The TenAsys* Corporation uses this approach to significantly improve the determinism of real-time response and claims a factor of ten improvement in control loop clock jitter.
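On a general-purpose operating system, the idea of reserving a core can be approximated with CPU affinity. The sketch below is not the TenAsys approach, which the article does not detail; it is a Linux-only illustration that uses Python's os.sched_setaffinity, and the function name and the choice of the highest-numbered CPU are assumptions.

```python
import os


def reserve_core_for_realtime():
    """Pin the caller to one CPU and return (realtime_cpu, other_cpus).

    Returns (None, set()) where affinity control is unavailable or not permitted.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None, set()
    try:
        allowed = sorted(os.sched_getaffinity(0))
        realtime_cpu = allowed[-1]  # hypothetical choice: highest-numbered CPU
        os.sched_setaffinity(0, {realtime_cpu})
    except OSError:
        return None, set()
    return realtime_cpu, set(allowed[:-1])


cpu, others = reserve_core_for_realtime()
print("real-time task pinned to CPU:", cpu)
print("CPUs left for other tasks:", sorted(others))
```

Pinning only keeps the real-time task on its core. For the core to be truly dedicated, the remaining tasks also have to be restricted to the other CPUs, which is a system configuration step outside this sketch.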
Multi-core processors increase performance for packet processing applications such as intrusion detection. For example, running this application in parallel across four execution cores, using pipelining and flow-pinning programming techniques, generates more than six times the throughput of a single-core configuration.
This gain comes from improved cache efficiency (e.g., in the L2 cache), as measured by the cache-hit rate: the percentage of accesses for which the execution unit is fed from the L2 cache. The cache-hit rate often correlates with the application program's locality of reference, the degree to which a program's memory accesses are limited to a relatively small number of addresses. A program with high locality of reference keeps reusing data that is already cached. Conversely, a program that accesses a large amount of data from scattered addresses is less likely to use cache memory efficiently.
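The link between locality of reference and cache-hit rate can be illustrated with a toy simulation. This sketch assumes a tiny fully associative cache with least-recently-used replacement and one address per cache line, which is far simpler than a real L2 cache; the two access traces are made up for the illustration.

```python
from collections import OrderedDict


def hit_rate(addresses, cache_lines=4):
    """Fraction of accesses served by a small LRU cache (one address per line)."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)


local_trace = [i % 4 for i in range(1000)]       # keeps reusing 4 addresses
scattered_trace = [i % 64 for i in range(1000)]  # cycles through 64 addresses

print(f"local trace hit rate:     {hit_rate(local_trace):.3f}")
print(f"scattered trace hit rate: {hit_rate(scattered_trace):.3f}")
```

The first trace misses only while the cache warms up. The second trace never finds its data in the cache, because each address is evicted before it is used again.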
When TCP reassembly is distributed among several execution cores, the locality of reference in a packet processing application can be improved by "pinning" each TCP flow to a single execution core. Because all packets from a TCP flow are processed on one core, the processor L2 cache-hit rate increases significantly, which results in supra-linear performance gains. Please visit http://www.intel.com/technology/advanced_comm/311566.htm for more details.
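One simple way to pin flows is to hash each flow's endpoints to a core index. The article does not say how the flows were assigned in the measured system, so the CRC-based hash, the function names and the packet tuple layout below are assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def core_for_flow(src_ip, src_port, dst_ip, dst_port, num_cores=4):
    """Map a TCP flow to a core index using a stable hash of its endpoints."""
    # Sort the endpoints so both directions of a connection map to the same core.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a[0]}:{a[1]}-{b[0]}:{b[1]}".encode()
    return zlib.crc32(key) % num_cores


def dispatch(packets, num_cores=4):
    """Group packets into per-core queues; a flow never spans two queues."""
    queues = defaultdict(list)
    for pkt in packets:
        queues[core_for_flow(*pkt[:4], num_cores=num_cores)].append(pkt)
    return queues


# Hypothetical packets: (src_ip, src_port, dst_ip, dst_port, payload)
packets = [
    ("10.0.0.1", 40000, "10.0.0.9", 80, b"GET /"),
    ("10.0.0.9", 80, "10.0.0.1", 40000, b"200 OK"),
    ("10.0.0.2", 40001, "10.0.0.9", 80, b"GET /a"),
]
for core, queue in sorted(dispatch(packets).items()):
    print(core, [p[4] for p in queue])
```

Because the mapping depends only on the flow's endpoints, the reassembly state for a flow is touched by one core only, which is what keeps that state resident in the core's cache.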
Tools Demystify Parallelism
There are no effortless ways to parallelize software. However, there are sophisticated tools available to create multi-threaded applications and then to examine thread integrity and test for race conditions.
Compilers ease the migration to a parallel software architecture through two key features: OpenMP and auto-parallelization. OpenMP* (www.openmp.org) is the industry standard for portable multi-threaded application development, and is effective at fine-grained (loop-level) and coarse-grained (function-level) threading. OpenMP directives provide an easy and powerful way to convert serial applications into parallel applications, enabling potentially large performance gains from parallel execution on multi-core processor systems. Auto-parallelization detects loops that can safely be executed in parallel and automatically generates multi-threaded code for them. It relieves the programmer of the low-level details of iteration partitioning, data sharing, thread scheduling and synchronization.
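OpenMP targets C, C++ and Fortran, where a loop is typically parallelized by placing a directive such as `#pragma omp parallel for reduction(+:total)` above it. The Python sketch below does not use OpenMP. It writes out by hand the kind of bookkeeping that such a directive or an auto-parallelizing compiler takes over: partitioning the iterations into blocks, keeping a private partial result per thread and combining the results at the end. The function names are hypothetical, and the block layout only resembles a static schedule; in CPython it illustrates the structure rather than delivering a multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def static_partition(n_iterations, n_threads):
    """Split range(n_iterations) into contiguous, near-equal blocks."""
    base, extra = divmod(n_iterations, n_threads)
    bounds, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def parallel_sum_of_squares(data, n_threads=4):
    """Each thread reduces its own block; partial sums are combined at the end."""
    def block_sum(bound):
        lo, hi = bound
        return sum(x * x for x in data[lo:hi])  # private accumulator per thread

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(block_sum, static_partition(len(data), n_threads)))


print(static_partition(10, 4))
print(parallel_sum_of_squares(list(range(10))))
```

The loop is safe to split this way because each iteration is independent and the only shared result is a sum, which can be combined in any order.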
Thread integrity can be verified by automated runtime debuggers that check for storage conflicts and look for places where threads may lock or stall. They identify memory locations that are accessed by one thread and then accessed without protection by another thread, which exposes the program to data corruption.
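The kind of storage conflict these tools look for can be shown with a shared counter. The sketch below is a generic illustration with hypothetical names, not the output or interface of any particular checker: the increment is a read-modify-write on shared memory, and the lock is the protection whose absence such a tool would report.

```python
import threading


class Counter:
    """Shared counter whose read-modify-write is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could both read the same old value
        # and one of the two updates would be lost.
        with self._lock:
            self.value += 1


def run(n_threads=4, n_increments=10_000):
    """Increment one shared counter from several threads and return the total."""
    counter = Counter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print(run())
```

With the lock in place the total always equals the number of threads multiplied by the number of increments. Lost updates in the unprotected version may appear only occasionally, which is why automated checking is more reliable than testing alone.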
To assist performance tuning, programmers can utilize profiling tools to check load balance, lock contention, synchronization bottlenecks, and parallel overhead. These tools can drill down to source code for threads created by OpenMP or thread libraries. Profilers identify the critical path in the program and indicate the processor utilization by thread. The programmer can view the threads in the critical path as well as the CPU time spent in each thread.
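A very small version of the load-balance check can be written with per-thread CPU timers. This sketch is not a substitute for a real profiler; the workloads, the function names and the imbalance metric (busiest thread divided by the mean) are assumptions chosen for illustration.

```python
import threading
import time


def busy_work(n):
    """Stand-in for real work: a CPU-bound loop of n iterations."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def per_thread_cpu_time(workloads):
    """Run one thread per workload and record the CPU time each one used."""
    cpu_seconds = [0.0] * len(workloads)

    def worker(index, n):
        start = time.thread_time()
        busy_work(n)
        cpu_seconds[index] = time.thread_time() - start

    threads = [threading.Thread(target=worker, args=(i, n))
               for i, n in enumerate(workloads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cpu_seconds


def imbalance(cpu_seconds):
    """Busiest thread's time divided by the mean; 1.0 means perfectly balanced."""
    mean = sum(cpu_seconds) / len(cpu_seconds)
    return max(cpu_seconds) / mean if mean else 1.0


times = per_thread_cpu_time([200_000, 200_000, 800_000])  # deliberately uneven
print("CPU seconds per thread:", [round(t, 4) for t in times])
print("imbalance:", round(imbalance(times), 2))
```

An imbalance well above 1.0 suggests that one thread is on the critical path while the others wait, so redistributing work toward the lighter threads is the first thing to try.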
A real challenge for software developers is properly identifying performance bottlenecks. Performance analyzer tools help locate and remove software performance bottlenecks by collecting, analyzing, and displaying performance data from the system-wide level down to the source level. A performance analyzer was used in the intrusion detection example described previously to measure and optimize cache hit rate and notably increase performance. For more information on compilers, threading tools and performance analyzers, please visit http://www.intel.com/cd/software/products/asmo-na/eng/index.htm
Future Proofing COM Express
Standard form factors such as COM Express allow board vendors to amortize their development cost across a larger number of customers and board SKUs. COM Express defines two power budgets as a function of pin-out type: pin-out Type-1 has a 101 watt limit and all other pin-out types have a 160 watt limit. Paired with the extended form factor, this larger power envelope and larger footprint can be used to create multi-core and multi-processor modules.
Lower-power multi-core processors further protect this development investment by extending the lifetime of the COM Express Basic form factor. More powerful yet less power-hungry multi-core processors will enable at least another generation of higher-performance COM Express products. The additional performance headroom from multi-core processors also permits COM Express systems to incorporate future virtualization and security functionality.
Conclusions
The future is dual-core, quad-core, oct-core processors and so on. This approach provides greater performance headroom with reasonable power dissipation. Customers who capitalize on opportunities to parallelize their software applications will experience the greatest performance gains.
* Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.
* Intel and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
* Other names and brands may be claimed as the property of others.
Copyright © 2006 Intel Corporation. All rights reserved