Efficient Asynchronous Message Passing via SCI with Zero-Copying (* Lehrstuhl für Betriebssysteme, RWTH Aachen; + Lehrstuhl für Rechnerarchitektur, TU Chemnitz)


Efficient Asynchronous Message Passing via SCI with Zero-Copying
Joachim Worringen *, Friedrich Seifert +, Thomas Bemmerl *
* Lehrstuhl für Betriebssysteme, RWTH Aachen
+ Lehrstuhl für Rechnerarchitektur, TU Chemnitz
SCI Europe 2001 – Trinity College Dublin

Agenda
- What is Zero-Copying? What is it good for?
- Zero-Copying with SCI
- Support through the SMI Library (Shared Memory Interface)
- Zero-Copy Protocols in SCI-MPICH
  - Memory Allocation Setups
  - Performance Optimizations
- Performance Evaluation
  - Point-to-Point
  - Application Kernel
  - Asynchronous Communication

Zero-Copying
Transfer of data between two user-level accessible memory buffers:
- With N explicit intermediate copies: N-way copying
- → No intermediate copy: zero-copying
Effective bandwidth and efficiency: [formulas not preserved in this transcript]
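The slide's own formulas for effective bandwidth and efficiency are not preserved here. As a rough illustration only, the sketch below assumes a simple serial model that is not taken from the slides: the transfer costs `latency + size / link_bw`, and each of the N intermediate copies adds `size / copy_bw`. All function and parameter names are hypothetical.

```python
def effective_bandwidth(size, link_bw, copy_bw, n_copies, latency=0.0):
    """Bytes per second seen by the application under the assumed serial model."""
    if size <= 0:
        raise ValueError("size must be positive")
    transfer_time = latency + size / link_bw   # time on the interconnect
    copy_time = n_copies * size / copy_bw      # N explicit intermediate copies
    return size / (transfer_time + copy_time)


def efficiency(size, link_bw, copy_bw, n_copies, latency=0.0):
    """Fraction of the raw link bandwidth that reaches the application."""
    return effective_bandwidth(size, link_bw, copy_bw, n_copies, latency) / link_bw
```

With `n_copies=0` and no latency the model yields an efficiency of 1; every additional copy lowers it, and the loss grows as the link bandwidth approaches the memory copy bandwidth.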

Efficiency Comparison
[Figure: efficiency of FastEthernet, GigaEthernet and SCI DMA]

Zero-Copying with SCI
SCI does zero-copy by nature. But SCI via the I/O bus is limited:
- No SMP-style shared memory
- Specially allocated memory regions were required
→ No general zero-copy possible
New possibility: using user-allocated buffers for SCI communication
→ Allows general zero-copy!
Connection setup is always required.

SMI Library (Shared Memory Interface)
High-level SCI support library for parallel applications or libraries:
- Application startup
- Synchronization & basic communication
- Shared-memory setup:
  - Collective regions
  - Point-2-point regions
  - Individual regions
- Dynamic memory management
- Data transfer

Data Moving (I)
Shared Memory Paradigm:
- Import remote memory into the local address space
- Perform memcpy() or maybe DMA
SMI support:
- Region type REMOTE
- Synchronous (PIO): SMI_Memcpy()
- Asynchronous (DMA if possible): SMI_Imemcpy() followed by SMI_Mem_wait()
Problems:
- High mapping overhead
- Resource usage (ATT entries on PCI-SCI adapter)

Mapping Overhead
[Figure: mapping overhead]
→ Not suitable for dynamic memory setups!

Data Moving (II)
Connection Paradigm:
- Connect to remote memory location
- No representation in local address space
→ Only DMA possible
SMI support:
- Region type RDMA
- Synchronous / asynchronous DMA: SMI_Put/SMI_Iput, SMI_Get/SMI_Iget, SMI_Memwait
Problems:
- Alignment restrictions
- Source needs to be pinned down

Setup Acceleration
Memory buffer setup costs time!
→ Reduce the number of operations to increase performance
Desirable: only one operation per buffer
Problem: limited resources
Solution: caching of SCI segment states by lazy release
- Leave buffers registered, remote segments connected or mapped
- Release unneeded resources if setup of a new resource fails
- Different replacement strategies possible: LRU, LFU, best-fit, random, immediate
- Attention: remote segment deallocation!
  → Callback on connection event to release the local connection
MPI persistent communication operations: pre-register user buffer & higher "hold" priority
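The lazy-release idea can be sketched in a few lines. This is a minimal illustration, not SCI-MPICH or SMI code: it assumes the LRU strategy, and the `setup` and `release` callables, the `ResourceExhausted` exception and the `SegmentCache` class are all hypothetical stand-ins for the real registration, connection and mapping calls.

```python
from collections import OrderedDict


class ResourceExhausted(Exception):
    """Raised by the (hypothetical) setup routine when no resources are left."""


class SegmentCache:
    """Keeps segments set up after use and releases them only under pressure."""

    def __init__(self, setup, release):
        self._setup = setup            # callable: key -> handle, may raise
        self._release = release        # callable: handle -> None
        self._segments = OrderedDict() # key -> handle, least recently used first

    def acquire(self, key):
        if key in self._segments:      # cache hit: no setup cost at all
            self._segments.move_to_end(key)
            return self._segments[key]
        while True:
            try:
                handle = self._setup(key)
                break
            except ResourceExhausted:
                if not self._segments: # nothing left to give back
                    raise
                # Lazy release: free the least recently used segment and retry.
                _, old_handle = self._segments.popitem(last=False)
                self._release(old_handle)
        self._segments[key] = handle
        return handle

    def invalidate(self, key):
        """Callback for remote segment deallocation: drop the local connection."""
        if key in self._segments:
            self._release(self._segments.pop(key))
```

Resources are released only when a new setup fails, so a buffer that is reused pays the setup cost once. The `invalidate` method plays the role of the connection-event callback mentioned on the slide.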

Memory Allocation
Allocate "good" memory: MPI_Alloc_mem() / MPI_Free_mem()
Part of MPI-2 (mostly for single-sided operations)
SCI-MPICH defines attributes:
- type: shared, private or default
  → Shared memory performs best.
- alignment: none, specified or default
  → Non-shared memory should be page-aligned
"Good" memory should only be enforced for communication buffers!

Zero-Copy Protocols
Applicable for hand-shake based rendez-vous protocol
Requirements:
- Registered user-allocated buffers or regular SCI segments
  → "good" memory via MPI_Alloc_mem()
- State of memory range must be known
  → SMI provides query functionality
- Registering / connection / mapping may fail
- Several different setups possible
  → Fallback mechanism required
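Because any of the setup steps may fail, a fallback chain is needed. The following is only a sketch of that control flow under stated assumptions: the protocol names, the `SetupFailed` exception and the `choose_protocol` helper are hypothetical and do not reflect the actual SCI-MPICH implementation.

```python
class SetupFailed(Exception):
    """Raised by a (hypothetical) setup attempt: register, connect or map failed."""


def choose_protocol(setups, fallback):
    """Try the setups in order of preference and return (name, handle).

    setups   -- list of (name, callable) pairs; a callable returns a handle
                or raises SetupFailed
    fallback -- name of a protocol that needs no setup (for example copying
                through a preallocated buffer)
    """
    for name, try_setup in setups:
        try:
            return name, try_setup()
        except SetupFailed:
            continue                 # this setup is unavailable, try the next one
    return fallback, None
```

A caller would list the zero-copy variants first and name a copying protocol as the fallback, so a message is always delivered even when no zero-copy setup succeeds.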

Data Transfer: Asynchronous Rendez-Vous
[Diagram: sender and receiver each consist of an application thread and a device thread. Labels in the diagram: Isend, Irecv, Wait, control messages "Ask to send" and "OK to send", Continue, Done.]
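The handshake can be mimicked with two threads and two queues. This is a simplified sketch, not the SCI-MPICH implementation: the two threads stand in for the device threads, the receiver's buffer size is assumed to be known in advance, and the direct write into the receiver's `bytearray` stands in for the zero-copy transfer. The `rendezvous` function and its message tuples are hypothetical.

```python
import queue
import threading


def rendezvous(payload):
    """Run one ask-to-send / OK-to-send exchange and return (data, trace)."""
    to_receiver = queue.Queue()              # control messages sender -> receiver
    to_sender = queue.Queue()                # control messages receiver -> sender
    trace = []
    recv_buffer = bytearray(len(payload))    # the receiver's user buffer

    def sender():
        trace.append("ask to send")
        to_receiver.put(("ask to send", len(payload)))
        _, target = to_sender.get()          # blocks until the receiver is ready
        target[:len(payload)] = payload      # write straight into the user buffer
        trace.append("done")
        to_receiver.put(("done", None))

    def receiver():
        to_receiver.get()                    # wait for "ask to send"
        trace.append("OK to send")
        to_sender.put(("OK to send", recv_buffer))
        to_receiver.get()                    # wait for "done"

    threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(recv_buffer), trace
```

Each side appends to the trace before it sends and only continues after a blocking receive, so the order of the trace entries is fixed by the protocol rather than by thread scheduling.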

Test Setup
Systems used for performance evaluation:
- 800 MHz 512 MB 133 MHz
- 64-bit / 66 MHz PCI (ServerWorks ServerSet III LE)
- Dolphin D330 (single ring topology)
- Linux bigphysarea
- Modified SCI driver (user memory for SCI)

Bandwidth Comparison
[Figure: bandwidth comparison]

Application Kernel: NPB IS
- Parallel bucket sort
- Keys are integer numbers
- Dominant communication: MPI_Alltoallv for distributed key array:
Class | Array size [MiB] | Procs | Msg size [kiB] | Alltoallv [ms] | % of execution time
A     | [values not preserved in this transcript]
W     | [values not preserved in this transcript]

MPI_Alltoallv Performance
- MPI_Alltoallv is translated into point-to-point operations: MPI_Isend / MPI_Irecv / MPI_Waitall
- Improved performance with asynchronous DMA operations
- Application speedup deduced
Class | Procs | regular [ms] | speedup | user [ms] | speedup
A     | [values not preserved in this transcript]
W     | [values not preserved in this transcript]

Asynchronous Communication
Goal: overlap computation & communication
How to quantify the efficiency for this?
→ Typical overlapping effect:
[Figure: total time plotted over computation time, with curves for Computation, Synchronous and Asynchronous]
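The shape of the typical overlapping effect can be reproduced with an idealized model. This is an assumption for illustration, not a formula from the slides: a synchronous transfer adds its full latency to the computation, while an asynchronous transfer is hidden behind the computation except for a fixed `overhead`. The function names are hypothetical.

```python
def total_time_sync(t_busy, t_msg):
    """Synchronous transfer: communication and computation run back to back."""
    return t_busy + t_msg


def total_time_async(t_busy, t_msg, overhead=0.0):
    """Asynchronous transfer: total time stays at t_msg until the computation
    plus the non-overlappable overhead exceeds the message latency."""
    return max(t_msg, t_busy + overhead)
```

In this model the asynchronous curve is flat for short computation periods and then rises in parallel to the pure computation line, offset by the overhead. The synchronous curve rises from the start.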

Saturation and Efficiency (I)
Two parameters are required:
1. Saturation s: duration of computation period required to make total time (communication & computation) increase
2. Efficiency: relation of overhead to message latency

Saturation and Efficiency (II)
[Figure: total time over computation time with curves for Computation, Synchronous and Asynchronous. Marked quantities: t_total, t_busy, t_msg_a, t_msg_s, the overhead t_total - t_busy, and the saturation s]
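Given a measured curve of total time over computation time, the two quantities can be estimated as sketched below. The tolerance-based detection of the saturation point and the helper names are assumptions. The exact efficiency formula is not preserved in this transcript, so `overhead_ratio` returns only the plain ratio of overhead to message latency as one possible reading of "relation of overhead to message latency".

```python
def saturation(samples, rel_tol=0.05):
    """Largest busy time for which the total time has not yet increased.

    samples -- iterable of (t_busy, t_total) pairs; the pair with the smallest
               t_busy provides the baseline total time
    rel_tol -- assumed tolerance for measurement noise
    """
    ordered = sorted(samples)
    baseline = ordered[0][1]
    s = ordered[0][0]
    for t_busy, t_total in ordered:
        if t_total <= baseline * (1.0 + rel_tol):
            s = t_busy
        else:
            break
    return s


def overhead(samples):
    """t_total - t_busy, taken at the longest computation period measured."""
    t_busy, t_total = max(samples)
    return t_total - t_busy


def overhead_ratio(samples, t_msg):
    """Overhead relative to the message latency (hypothetical reading)."""
    return overhead(samples) / t_msg
```

The resolution of the saturation estimate is limited by the spacing of the sampled computation periods, so a finer sweep around the knee of the curve gives a better value.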

Experimental Setup: Overlap
Micro-benchmark to quantify overlapping:
    latency = MPI_Wtime()
    if (sender)
        MPI_Isend(msg, msgsize)
        while (elapsed_time < spinning_duration)
            spin (with multiple threads)
        MPI_Wait()
    else
        MPI_Recv()
    latency = MPI_Wtime() - latency

Experimental Setup: Spinning
Different ways of keeping the CPU busy:
- FIXED: spin on a single variable for a given amount of CPU time
  → No memory stress
- DAXPY: perform a given number of DAXPY operations on vectors (vector sizes x, y equivalent to message size)
  → Stress memory system
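The two spinning modes can be sketched as follows. This is an illustration only: the function names are hypothetical, the vectors are plain Python lists, and `time.perf_counter()` measures wall-clock time, which only approximates the "given amount of CPU time" of the FIXED mode.

```python
import time


def daxpy(a, x, y):
    """In-place DAXPY: y[i] = a * x[i] + y[i] for every element."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]


def spin_fixed(duration):
    """FIXED: spin on a single counter until `duration` seconds have passed."""
    end = time.perf_counter() + duration
    count = 0
    while time.perf_counter() < end:
        count += 1                    # touches one variable, no memory stress
    return count


def spin_daxpy(repetitions, n):
    """DAXPY: stream over two vectors of n elements `repetitions` times."""
    x = [1.0] * n
    y = [0.0] * n
    for _ in range(repetitions):
        daxpy(2.0, x, y)
    return y
```

To follow the slide, `n` would be chosen so that the vectors occupy as much memory as the message, which makes the computation compete with the transfer for memory bandwidth.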

[Figure: DAXPY – 64 kiB Message]

[Figure: DAXPY – 256 kiB Message]

[Figure: FIXED – 64 kiB Message]

Asynchronous Performance
Saturation and efficiency derived from experiments:
Experiment   | Protocol                     | t_msg [ms] | s [ms] | Efficiency
64 kiB DAXPY | a-DMA-0-R, a-DMA-0-U, s-PIO  | [values not preserved in this transcript]
kiB DAXPY    | a-DMA-0-R, a-DMA-0-U, s-PIO  | [values not preserved in this transcript]
kiB FIXED    | a-DMA-0-R, a-DMA-0-U, s-PIO  | [values not preserved in this transcript]

Summary & Outlook
Efficient utilization of new SCI driver functionality for MPI communication:
→ Max. bandwidth of 230 MiB/s (regular), 190 MiB/s (user)
Connection overhead hidden by segment caching
→ Asynchronous communication pays off much earlier than before
New (?) quantification scheme for efficiency of asynchronous communication
Flexible MPI memory allocation supports MPI application writer
Connection-oriented DMA transfers reduce resource utilization
DMA alignment problems
Segment callback required for improved connection caching