Slide 1: Efficient Asynchronous Message Passing via SCI with Zero-Copying
Joachim Worringen*, Friedrich Seifert+, Thomas Bemmerl*
* Lehrstuhl für Betriebssysteme (Chair for Operating Systems), RWTH Aachen
+ Lehrstuhl für Rechnerarchitektur (Chair for Computer Architecture), TU Chemnitz
SCI Europe 2001 – Trinity College Dublin
Slide 2: Agenda
- What is zero-copying? What is it good for?
- Zero-copying with SCI
- Support through the SMI library (Shared Memory Interface)
- Zero-copy protocols in SCI-MPICH: memory allocation, setups, performance optimizations
- Performance evaluation: point-to-point, application kernel, asynchronous communication
Slide 3: Zero-Copying
- Transfer of data between two user-level accessible memory buffers
- With N explicit intermediate copies: N-way copying
- With no intermediate copy: zero-copying
- Effective bandwidth and efficiency: (formula on the original slide, lost in this transcript)
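The slide's formula for effective bandwidth and efficiency did not survive the transcript. A plausible reconstruction (an assumption, not necessarily the authors' exact definition): if a message of size S crosses the interconnect once at bandwidth b_net and is additionally copied N times at copy bandwidths b_copy,i, the copy times add to the transfer time, giving

```latex
t_{\text{total}} = \frac{S}{b_{\text{net}}} + \sum_{i=1}^{N} \frac{S}{b_{\text{copy},i}},
\qquad
\frac{1}{b_{\text{eff}}} = \frac{1}{b_{\text{net}}} + \sum_{i=1}^{N} \frac{1}{b_{\text{copy},i}},
\qquad
E = \frac{b_{\text{eff}}}{b_{\text{net}}} \le 1
```

With zero-copying (N = 0), b_eff = b_net and the efficiency E is 1; every intermediate copy lowers E, most visibly when copy bandwidth is close to network bandwidth.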
Slide 4: Efficiency Comparison
(Chart comparing the efficiency of Fast Ethernet, Gigabit Ethernet, and SCI DMA; not recoverable from this transcript.)
Slide 5: Zero-Copying with SCI
- SCI does zero-copy by nature. But SCI via the I/O bus is limited:
  - No SMP-style shared memory
  - Specially allocated memory regions were required
  - No general zero-copy was possible
- New possibility: using user-allocated buffers for SCI communication allows general zero-copy!
- Connection setup is always required.
Slide 6: SMI Library (Shared Memory Interface)
- High-level SCI support library for parallel applications or libraries
- Application startup
- Synchronization & basic communication
- Shared-memory setup: collective regions, point-to-point regions, individual regions
- Dynamic memory management
- Data transfer
Slide 7: Data Moving (I)
- Shared-memory paradigm: import remote memory into the local address space, then perform memcpy() or possibly DMA
- SMI support:
  - Region type REMOTE
  - Synchronous (PIO): SMI_Memcpy()
  - Asynchronous (DMA if possible): SMI_Imemcpy() followed by SMI_Mem_wait()
- Problems: high mapping overhead; resource usage (ATT entries on the PCI-SCI adapter)
Slide 8: Mapping Overhead
(Chart of mapping overhead; not recoverable from this transcript.)
- Not suitable for dynamic memory setups!
Slide 9: Data Moving (II)
- Connection paradigm: connect to a remote memory location; no representation in the local address space, so only DMA is possible
- SMI support:
  - Region type RDMA
  - Synchronous / asynchronous DMA: SMI_Put() / SMI_Iput(), SMI_Get() / SMI_Iget(), SMI_Memwait()
- Problems: alignment restrictions; the source buffer needs to be pinned down
Slide 10: Setup Acceleration
- Memory buffer setup costs time! Reduce the number of setup operations to increase performance
- Desirable: only one setup operation per buffer
- Problem: limited resources
- Solution: caching of SCI segment states via lazy release
  - Leave buffers registered, remote segments connected or mapped
  - Release unneeded resources only if setup of a new resource fails
  - Different replacement strategies possible: LRU, LFU, best-fit, random, immediate
  - Attention: remote segment deallocation! Callback on the connection event to release the local connection
- MPI persistent communication operations: pre-register the user buffer & give it a higher "hold" priority
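The lazy-release idea above can be sketched as a small resource cache: mappings stay cached after use and are evicted, LRU-first, only when setting up a new one fails for lack of resources. This is a minimal illustration with hypothetical names (`SegmentCache`, `map_fn`, `unmap_fn`), not the actual SMI/SCI-MPICH code, and it models the limited ATT entries as a fixed capacity.

```python
from collections import OrderedDict

class SegmentCache:
    """Sketch of lazy-release caching of SCI segment mappings.
    Hypothetical names; the real SMI/SCI-MPICH implementation differs."""

    def __init__(self, capacity, map_fn, unmap_fn):
        self.capacity = capacity      # e.g. available ATT entries
        self.map_fn = map_fn          # establishes a mapping (the costly setup)
        self.unmap_fn = unmap_fn      # releases a mapping
        self.cached = OrderedDict()   # segment id -> mapping handle (LRU order)

    def acquire(self, seg_id):
        if seg_id in self.cached:                 # hit: reuse the mapping,
            self.cached.move_to_end(seg_id)       # mark it most recently used
            return self.cached[seg_id]
        while len(self.cached) >= self.capacity:  # setup would fail: evict LRU
            victim, handle = self.cached.popitem(last=False)
            self.unmap_fn(victim, handle)
        handle = self.map_fn(seg_id)              # costly setup runs only here
        self.cached[seg_id] = handle
        return handle

# Usage: count how often the costly map operation actually runs.
maps, unmaps = [], []
cache = SegmentCache(capacity=2,
                     map_fn=lambda s: maps.append(s) or f"handle-{s}",
                     unmap_fn=lambda s, h: unmaps.append(s))
cache.acquire(1); cache.acquire(2)
cache.acquire(1)   # cache hit: no new mapping
cache.acquire(3)   # cache full: evicts segment 2 (LRU), then maps 3
```

Re-acquiring segment 1 costs nothing, which is exactly why leaving buffers registered pays off for repeated communication with the same peers.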
Slide 11: Memory Allocation
- Allocate "good" memory: MPI_Alloc_mem() / MPI_Free_mem()
- Part of MPI-2 (mostly intended for one-sided operations)
- SCI-MPICH defines attributes:
  - type: shared, private or default (shared memory performs best)
  - alignment: none, specified or default (non-shared memory should be page-aligned)
- "Good" memory should only be enforced for communication buffers!
Slide 12: Zero-Copy Protocols
- Applicable to the handshake-based rendezvous protocol
- Requirements: registered user-allocated buffers, or regular SCI segments ("good" memory via MPI_Alloc_mem())
- The state of a memory range must be known; SMI provides query functionality
- Registering / connecting / mapping may fail
- Several different setups are possible, so a fallback mechanism is required
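The fallback requirement can be sketched as a simple decision chain: try the cheapest zero-copy setup first, and drop to a copy-based rendezvous protocol if any step fails. The helper names and states here are hypothetical; SCI-MPICH's real decision logic is more involved.

```python
def choose_protocol(buffer_state, try_register, try_connect):
    """Sketch of the rendezvous fallback logic (hypothetical names).
    Zero-copy requires the buffer to be usable as an SCI segment;
    if any setup step fails, fall back to a copy-based protocol."""
    if buffer_state == "sci_segment":        # already "good" memory
        return "zero-copy"
    if buffer_state == "user" and try_register():  # pin/register user buffer
        if try_connect():                    # connect the remote side
            return "zero-copy"
    return "rendezvous-with-copy"            # safe fallback, always works

# Usage: registration succeeds but the remote connection fails,
# so the transfer silently falls back to the copying protocol.
assert choose_protocol("user", lambda: True, lambda: False) == "rendezvous-with-copy"
assert choose_protocol("sci_segment", lambda: False, lambda: False) == "zero-copy"
```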
Slide 13: Data Transfer – Asynchronous Rendezvous
(Timeline diagram: on both sender and receiver, an application thread and a device thread exchange "ask to send" / "OK to send" control messages around Isend/Irecv; the application threads continue computing while the device threads handle the transfer, and Wait completes once the transfer is done. Exact layout not recoverable from this transcript.)
Slide 14: Test Setup
Systems used for performance evaluation:
- Pentium-III @ 800 MHz
- 512 MB RAM @ 133 MHz
- 64-bit / 66 MHz PCI (ServerWorks ServerSet III LE)
- Dolphin D330 SCI adapter (single-ring topology)
- Linux 2.4.4-bigphysarea
- Modified SCI driver (user memory for SCI)
Slide 15: Bandwidth Comparison
(Chart; not recoverable from this transcript.)
Slide 16: Application Kernel – NPB IS
- Parallel bucket sort; keys are integer numbers
- Dominant communication: MPI_Alltoallv for the distributed key array

Class | Array size [MiB] | Procs | Msg size [kiB] | Alltoallv [ms] | % of execution time
A     | 1                | 4     | 256            | 16.363         | 34.6
W     | 8                | 4     | 2048           | 123.921        | 36.2
Slide 17: MPI_Alltoallv Performance
- MPI_Alltoallv is translated into point-to-point operations: MPI_Isend / MPI_Irecv / MPI_Waitall
- Improved performance with asynchronous DMA operations
- Application speedup deduced:

Class | Procs | regular [ms] | speedup | user [ms] | speedup
A     | 4     | 7.578        | 1.22    | 9.617     | 1.16
W     | 4     | 52.415       | 1.26    | 63.957    | 1.21
Slide 18: Asynchronous Communication
- Goal: overlap computation & communication
- How can the efficiency of this overlap be quantified?
- Typical overlapping effect: total time vs. computation time (chart comparing synchronous and asynchronous runs; not recoverable from this transcript)
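The overlapping effect from the chart can be captured in a simplified model (an assumption for illustration, not the authors' formulation): synchronously, computation and communication times add; asynchronously, the CPU pays only the non-overlappable overhead of the message, and the total time is bounded below by the transfer itself.

```python
def total_time_sync(t_compute, t_msg):
    """Synchronous: communication and computation run back to back."""
    return t_compute + t_msg

def total_time_async(t_compute, t_msg, overhead):
    """Asynchronous: the transfer proceeds in the background; the CPU
    only pays 'overhead' (the non-overlappable part of the message
    latency). The transfer itself is a lower bound on total time."""
    return max(t_compute + overhead, t_msg)

# Usage with 64 kiB-like numbers in ms (t_msg from slide 26; the
# overhead value is a made-up example). With enough computation to
# hide the transfer, the asynchronous total is computation + overhead.
t_msg, overhead = 0.49, 0.06
assert abs(total_time_sync(1.0, t_msg) - 1.49) < 1e-9
assert abs(total_time_async(1.0, t_msg, overhead) - 1.06) < 1e-9
```

When computation is shorter than the transfer, the asynchronous total collapses to t_msg, which is exactly the saturation behaviour the next slides measure.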
Slide 19: Saturation and Efficiency (I)
Two parameters are required:
1. Saturation s: the duration of the computation period required to make the total time (communication & computation) increase
2. Efficiency: the relation of overhead to message latency
Slide 20: Saturation and Efficiency (II)
(Timing diagram relating t_total, t_busy, the asynchronous and synchronous message latencies t_msg_a and t_msg_s, the overhead t_total − t_busy, and the saturation s for synchronous vs. asynchronous runs; exact layout not recoverable from this transcript.)
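The slide's exact formulas are lost, but the quantities named here together with the values tabulated on slide 26 are consistent with the following reconstruction (hedged: inferred from the data, not taken from the slide). Interpreting the saturation s as the overlappable portion of the message latency, the per-message overhead and the efficiency become

```latex
\text{overhead} = t_{\text{msg}} - s, \qquad
\varepsilon \;=\; 1 - \frac{t_{\text{msg}} - s}{t_{\text{msg}}} \;=\; \frac{s}{t_{\text{msg}}}
```

As a check, the first row of slide 26 gives s / t_msg = 0.285 / 0.490 ≈ 0.581, matching the tabulated efficiency, and nearly all other rows agree the same way.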
Slide 21: Experimental Setup – Overlap
Micro-benchmark to quantify overlapping (pseudocode):

  latency = MPI_Wtime()
  if (sender)
      MPI_Isend(msg, msgsize)
      while (elapsed_time < spinning_duration)
          spin (with multiple threads)
      MPI_Wait()
  else
      MPI_Recv()
  latency = MPI_Wtime() - latency
Slide 22: Experimental Setup – Spinning
Different ways of keeping the CPU busy:
- FIXED: spin on a single variable for a given amount of CPU time (no memory stress)
- DAXPY: perform a given number of DAXPY operations on vectors x, y whose sizes are equivalent to the message size (stresses the memory system)
Slide 23: DAXPY – 64 kiB Message
(Chart; not recoverable from this transcript.)

Slide 24: DAXPY – 256 kiB Message
(Chart; not recoverable from this transcript.)

Slide 25: FIXED – 64 kiB Message
(Chart; not recoverable from this transcript.)
Slide 26: Asynchronous Performance
Saturation and efficiency derived from the experiments:

Experiment    | Protocol  | t_msg [ms] | s [ms]  | Efficiency
64 kiB DAXPY  | a-DMA-0-R | 0.490      | 0.285   | 0.581
64 kiB DAXPY  | a-DMA-0-U | 0.735      | 0.473   | 0.643
64 kiB DAXPY  | s-PIO-1   | 0.572      | 0.056   | 0.043
256 kiB DAXPY | a-DMA-0-R | 1.300      | 1.099   | 0.845
256 kiB DAXPY | a-DMA-0-U | 1.506      | 1.148   | 0.762
256 kiB DAXPY | s-PIO-1   | 1.895      | -0.030  | -0.015
64 kiB FIXED  | a-DMA-0-R | 0.493      | 0.446   | 0.904
64 kiB FIXED  | a-DMA-0-U | 0.738      | 0.691   | 0.936
64 kiB FIXED  | s-PIO-1   | 0.567      | 0.016   | 0.028
Slide 27: Summary & Outlook
- Efficient utilization of new SCI driver functionality for MPI communication:
  - Max. bandwidth of 230 MiB/s (regular SCI segments) / 190 MiB/s (user memory)
  - Connection overhead hidden by segment caching
  - Asynchronous communication pays off much earlier than before
- New (?) quantification scheme for the efficiency of asynchronous communication
- Flexible MPI memory allocation supports the MPI application writer
- Connection-oriented DMA transfers reduce resource utilization
- Open issues: DMA alignment problems; segment callback required for improved connection caching