
1 Efficient Asynchronous Message Passing via SCI with Zero-Copying
Joachim Worringen*, Friedrich Seifert+, Thomas Bemmerl*
* Lehrstuhl für Betriebssysteme, RWTH Aachen
+ Lehrstuhl für Rechnerarchitektur, TU Chemnitz
SCI Europe 2001 – Trinity College Dublin

2 Agenda
- What is Zero-Copying? What is it good for?
- Zero-Copying with SCI
- Support through SMI-Library (Shared Memory Interface)
- Zero-Copy Protocols in SCI-MPICH: memory allocation, setups, performance optimizations
- Performance Evaluation: point-to-point, application kernel, asynchronous communication

3 Zero-Copying
Transfer of data between two user-level accessible memory buffers with N explicit intermediate copies: N-way copying. With no intermediate copy: zero-copying. Effective bandwidth and efficiency:
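The slide's closing formulas did not survive transcription. A hedged reconstruction (the notation is assumed here, not taken from the slide): for a message of size S delivered in total time t_total over a path whose slowest stage has raw bandwidth b_max:

```latex
% Effective bandwidth: payload size over end-to-end transfer time
B_{\mathrm{eff}} = \frac{S}{t_{\mathrm{total}}}
% Efficiency: effective bandwidth relative to the raw bandwidth
% of the slowest stage on the transfer path
E = \frac{B_{\mathrm{eff}}}{b_{\max}}
```

Each intermediate copy adds to t_total, so N-way copying lowers both quantities, while zero-copying maximizes them.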

4 Efficiency Comparison
[Chart comparing efficiency for FastEthernet, GigaEthernet, and SCI DMA]

5 Zero-Copying with SCI
SCI does zero-copy by nature. But SCI via the I/O bus is limited:
- No SMP-style shared memory
- Specially allocated memory regions were required → no general zero-copy possible
New possibility: using user-allocated buffers for SCI communication → allows general zero-copy! Connection setup is always required.

6 SMI Library: Shared Memory Interface
High-level SCI support library for parallel applications or libraries:
- Application startup
- Synchronization & basic communication
- Shared-memory setup: collective regions, point-to-point regions, individual regions
- Dynamic memory management
- Data transfer

7 Data Moving (I)
Shared-memory paradigm: import remote memory into the local address space, then perform memcpy() or possibly DMA.
SMI support:
- Region type REMOTE
- Synchronous (PIO): SMI_Memcpy()
- Asynchronous (DMA if possible): SMI_Imemcpy() followed by SMI_Mem_wait()
Problems:
- High mapping overhead
- Resource usage (ATT entries on the PCI-SCI adapter)

8 Mapping Overhead
[Chart: mapping overhead] → Not suitable for dynamic memory setups!

9 Data Moving (II)
Connection paradigm: connect to a remote memory location; no representation in the local address space → only DMA possible.
SMI support:
- Region type RDMA
- Synchronous/asynchronous DMA: SMI_Put/SMI_Iput, SMI_Get/SMI_Iget, SMI_Memwait
Problems:
- Alignment restrictions
- Source needs to be pinned down

10 Setup Acceleration
Memory buffer setup costs time! → Reduce the number of setup operations to increase performance. Desirable: only one setup operation per buffer. Problem: limited resources.
Solution: caching of SCI segment states by lazy release:
- Leave buffers registered, remote segments connected or mapped
- Release unneeded resources if setup of a new resource fails
- Different replacement strategies possible: LRU, LFU, best-fit, random, immediate
- Attention: remote segment deallocation! → Callback on connection event to release the local connection
MPI persistent communication operations: pre-register the user buffer & give it a higher "hold" priority.

11 Memory Allocation
Allocate "good" memory: MPI_Alloc_mem() / MPI_Free_mem(), part of MPI-2 (mostly for one-sided operations). SCI-MPICH defines attributes:
- type: shared, private or default → shared memory performs best
- alignment: none, specified or default → non-shared memory should be page-aligned
"Good" memory should only be enforced for communication buffers!

12 Zero-Copy Protocols
Applicable for the handshake-based rendezvous protocol. Requirements:
- Registered user-allocated buffers or regular SCI segments → "good" memory via MPI_Alloc_mem()
- State of the memory range must be known → SMI provides query functionality
Registering/connecting/mapping may fail, and several different setups are possible → a fallback mechanism is required.

13 Data Transfer: Asynchronous Rendezvous
[Timing diagram: sender and receiver each run an application thread and a device thread. Isend/Irecv post the operations; the device threads exchange the control messages "ask to send" and "OK to send" and carry out the transfer while the application threads continue; Wait returns once the transfer is done.]

14 Test Setup
Systems used for performance evaluation:
- Pentium-III @ 800 MHz, 512 MB RAM @ 133 MHz
- 64-bit / 66 MHz PCI (ServerWorks ServerSet III LE)
- Dolphin D330 (single ring topology)
- Linux 2.4.4-bigphysarea, modified SCI driver (user memory for SCI)

15 Bandwidth Comparison

16 Application Kernel: NPB IS
Parallel bucket sort; keys are integer numbers. Dominant communication: MPI_Alltoallv for the distributed key array:

Class | Array size [MiB] | Procs | Msg size [kiB] | Alltoallv [ms] | % of execution time
A     | 1                | 4     | 256            | 16.363         | 34.6
W     | 8                | 4     | 2048           | 123.921        | 36.2

17 MPI_Alltoallv Performance
MPI_Alltoallv is translated into point-to-point operations: MPI_Isend / MPI_Irecv / MPI_Waitall. Improved performance with asynchronous DMA operations; application speedup deduced:

Class | Procs | regular [ms] | speedup | user [ms] | speedup
A     | 4     | 7.578        | 1.22    | 9.617     | 1.16
W     | 4     | 52.415       | 1.26    | 63.957    | 1.21

18 Asynchronous Communication
Goal: overlap computation & communication. How can the efficiency of this be quantified?
[Chart of the typical overlapping effect: total time vs. computation time for synchronous and asynchronous communication]

19 Saturation and Efficiency (I)
Two parameters are required:
1. Saturation s: duration of the computation period required to make the total time (communication & computation) increase
2. Efficiency ε: relation of overhead to message latency

20 Saturation and Efficiency (II)
[Timing diagram: computation time t_busy within total time t_total, for synchronous (message time t_msg_s) and asynchronous (t_msg_a) communication; the overhead is t_total - t_busy, and the saturation s marks where the total time starts to increase]

21 Experimental Setup: Overlap
Micro-benchmark to quantify overlapping:

    latency = MPI_Wtime()
    if (sender)
        MPI_Isend(msg, msgsize)
        while (elapsed_time < spinning_duration)
            spin (with multiple threads)
        MPI_Wait()
    else
        MPI_Recv()
    latency = MPI_Wtime() - latency

22 Experimental Setup: Spinning
Different ways of keeping the CPU busy:
- FIXED: spin on a single variable for a given amount of CPU time → no memory stress
- DAXPY: perform a given number of DAXPY operations on vectors (vector sizes x, y equivalent to the message size) → stress the memory system

23 DAXPY – 64 kiB Message

24 DAXPY – 256 kiB Message

25 FIXED – 64 kiB Message

26 Asynchronous Performance
Saturation and efficiency derived from the experiments:

Experiment    | Protocol  | t_msg [ms] | s [ms] | ε
64 kiB DAXPY  | a-DMA-0-R | 0.490      | 0.285  | 0.581
64 kiB DAXPY  | a-DMA-0-U | 0.735      | 0.473  | 0.643
64 kiB DAXPY  | s-PIO-1   | 0.572      | 0.056  | 0.043
256 kiB DAXPY | a-DMA-0-R | 1.300      | 1.099  | 0.845
256 kiB DAXPY | a-DMA-0-U | 1.506      | 1.148  | 0.762
256 kiB DAXPY | s-PIO-1   | 1.895      | -0.030 | -0.015
64 kiB FIXED  | a-DMA-0-R | 0.493      | 0.446  | 0.904
64 kiB FIXED  | a-DMA-0-U | 0.738      | 0.691  | 0.936
64 kiB FIXED  | s-PIO-1   | 0.567      | 0.016  | 0.028

27 Summary & Outlook
Efficient utilization of new SCI driver functionality for MPI communication:
- Max. bandwidth of 230 MiB/s (regular) / 190 MiB/s (user)
- Connection overhead hidden by segment caching → asynchronous communication pays off much earlier than before
- New (?) quantification scheme for the efficiency of asynchronous communication
- Flexible MPI memory allocation supports the MPI application writer
- Connection-oriented DMA transfers reduce resource utilization
Outlook:
- DMA alignment problems
- Segment callback required for improved connection caching

