Manufacturer : AMD
File Size : 486.48 KB
File Name : 40555.pdf
Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems
40555 Rev. 3.00 June 2006

Chapter 1 Introduction

The ccNUMA architecture of AMD Athlon™ 64 and AMD Opteron™ multiprocessor systems is a mature architecture designed to extract greater performance potential from multiprocessor systems. As developers deploy more demanding workloads on these systems, common performance questions arise: Where should threads or processes be scheduled (thread or process placement)? Where should memory be allocated (memory placement)? The underlying operating system (OS), tuned for AMD Athlon 64 and AMD Opteron multiprocessor ccNUMA systems, makes these performance decisions transparent and easy.

Advanced developers, however, should be aware of the more advanced tools and techniques available for performance tuning. In addition to recommending mechanisms provided by the OS for explicit thread (or process) and memory placement, this application note explores advanced techniques such as node interleaving of memory to boost performance.

This document also characterizes an AMD ccNUMA multiprocessor system, giving advanced developers the fundamentals needed to enhance the performance of synthetic and real applications and to develop advanced tools.

In general, applications can be memory latency sensitive or memory bandwidth sensitive; both classes are important for performance tuning. In a multiprocessor system, in addition to memory latency and memory bandwidth, other factors influence performance:

• the latency of remote memory access (hop latency)
• the latency of maintaining cache coherence (probe latency)
• the bandwidth of the HyperTransport interconnect links
• the lengths of various buffer queues in the system

The empirical analysis presented in this document is based on data obtained by running a multithreaded synthetic test. While this test is neither a pure memory latency test nor a pure memory bandwidth test, it exercises both of these modes of operation. The test serves as a latency-sensitive test case when the test threads perform read-only operations and as a bandwidth-sensitive test case when the test threads carry out write-only operations. The discussion below explores the performance results of this test, with an emphasis on the behavior exhibited when the test imposes high bandwidth demands on the low-level resources of the system.

Additionally, the tests are run in undersubscribed, highly subscribed, and fully subscribed modes. In undersubscribed mode, there are significantly fewer threads than processors. In highly subscribed mode, the number of threads approaches the number of processors. In fully subscribed mode, the number of threads equals the number of processors. Testing these conditions provides an understanding of the impact of thread subscription on performance.

Based on the data and analysis gathered from this synthetic test bench, this application note presents recommendations to software developers who work on applications, compiler tool chains, virtual machines, and operating systems. Finally, the test results should also dispel the common myth that workloads which are symmetrical in all respects except for thread and memory placement deliver identical performance.
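The subscription modes and explicit thread placement described in the introduction can be illustrated with a short Python sketch. It is not the synthetic test used in this application note. The function names `subscription_mode` and `run_pinned` are hypothetical, the `high_fraction` cutoff is an assumed value (the text gives no numeric threshold for "approaches the number of processors"), and the placement call assumes Linux, where `os.sched_setaffinity` is available.

```python
import os
import threading


def subscription_mode(num_threads, num_processors, high_fraction=0.75):
    """Classify a run by its thread count relative to the processor count.

    high_fraction is a hypothetical cutoff for "approaches the number of
    processors"; the application note does not define a numeric threshold.
    """
    if num_threads < 1 or num_processors < 1:
        raise ValueError("thread and processor counts must be positive")
    if num_threads > num_processors:
        raise ValueError("more threads than processors is outside the three modes")
    if num_threads == num_processors:
        return "fully subscribed"
    if num_threads >= high_fraction * num_processors:
        return "highly subscribed"
    return "undersubscribed"


def run_pinned(num_threads, work):
    """Run work(index) in num_threads threads, each pinned to its own CPU.

    Linux only. Returns a dict mapping thread index to the CPU set that
    thread was restricted to.
    """
    cpus = sorted(os.sched_getaffinity(0))  # CPUs the caller may run on
    if num_threads > len(cpus):
        raise ValueError("more threads than available processors")
    placement = {}

    def body(index):
        # pid 0 means "the caller"; on Linux the underlying system call
        # applies to the calling thread, so each worker pins only itself.
        os.sched_setaffinity(0, {cpus[index]})
        placement[index] = os.sched_getaffinity(0)
        work(index)

    threads = [threading.Thread(target=body, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return placement
```

Under the assumed cutoff, `subscription_mode(7, 8)` returns `"highly subscribed"` and `subscription_mode(2, 8)` returns `"undersubscribed"`. Pinning a thread controls only where it runs; where its memory is allocated is a separate decision that this sketch does not address.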
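The introduction names node interleaving of memory as an advanced technique. As a rough illustration of the idea only, the following Python sketch compares which node owns each page of a buffer under an interleaved layout and under a contiguous layout. The 4 KiB granule, the round-robin mapping, and the function names are assumptions made for this example; the actual interleave granularity and address mapping depend on the platform and its configuration.

```python
from collections import Counter

GRANULE = 4096  # assumed interleave granularity in bytes (hypothetical)


def node_interleaved(address, num_nodes, granule=GRANULE):
    """Owning node when consecutive granules rotate across the nodes."""
    return (address // granule) % num_nodes


def node_contiguous(address, num_nodes, bytes_per_node):
    """Owning node when each node holds one contiguous address range."""
    node = address // bytes_per_node
    if node >= num_nodes:
        raise ValueError("address is beyond the memory of the last node")
    return node


def pages_per_node(start, length, owner, granule=GRANULE):
    """Count how many granules of [start, start + length) each node owns."""
    return Counter(owner(a) for a in range(start, start + length, granule))


if __name__ == "__main__":
    nodes = 4
    buffer_bytes = 64 * 1024  # 16 granules of 4 KiB

    spread = pages_per_node(0, buffer_bytes,
                            lambda a: node_interleaved(a, nodes))
    packed = pages_per_node(0, buffer_bytes,
                            lambda a: node_contiguous(a, nodes, 1 << 30))

    print(dict(spread))  # {0: 4, 1: 4, 2: 4, 3: 4}
    print(dict(packed))  # {0: 16}
```

In this model the interleaved layout spreads the buffer evenly over all four nodes, while the contiguous layout places the whole buffer on node 0.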