Mapping of Applications to Multi-Processor Systems

Slides:



Advertisements
Ähnliche Präsentationen
Cadastre for the 21st Century – The German Way
Advertisements

Service Oriented Architectures for Remote Instrumentation
PRESENTATION HEADLINE
Vernetzung von Repositorien : DRIVER Guidelines Dr Dale Peters, SUB Goettingen 4. Helmholtz Open Access Workshop Potsdam, 17 Juni 2008.
Finding the Pattern You Need: The Design Pattern Intent Ontology
E-Solutions mySchoeller.com for Felix Schoeller Imaging
SION Vacuum Circuit-Breakers 3AE5 and 3AE1
Vorlesung: 1 Betriebliche Informationssysteme 2003 Prof. Dr. G. Hellberg Studiengang Informatik FHDW Vorlesung: Betriebliche Informationssysteme Teil3.
P. Marwedel Informatik 12, U. Dortmund
Finite state machines & message passing: SDL
Fakultät für informatik informatik 12 technische universität dortmund Mapping of Applications to Multi-Processor Systems Peter Marwedel Informatik 12 TU.
R. Zankl – Ch. Oelschlegel – M. Schüler – M. Karg – H. Obermayer R. Gottanka – F. Rösch – P. Keidler – A. Spangler th Expert Meeting Business.
Die Senatorin für Arbeit, Frauen, Gesundheit, Jugend und Soziales ESF-Verwaltungsbehörde Freie Hansestadt Bremen Hildegard Jansen, head of Unit labour.
Steinbeis Forschungsinstitut für solare und zukunftsfähige thermische Energiesysteme Nobelstr. 15 D Stuttgart WP 4 Developing SEC.
Institut für Softwaresysteme in Wirtschaft, Umwelt und Verwaltung Folie 1 DER UMWELT CAMPUS BIRKENFELD ISS Institut für Softwaresysteme in Wirtschaft,
Embedded & Real-time Operating Systems
Fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/17 Graphics:
fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:
Fakultät für informatik informatik 12 technische universität dortmund Mapping of Applications to Platforms Peter Marwedel TU Dortmund, Informatik 12 Germany.
Fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2010/01/13 Graphics:
Embedded System Hardware - Processing -
Fakultät für informatik informatik 12 technische universität dortmund Universität Dortmund Middleware Peter Marwedel TU Dortmund, Informatik 12 Germany.
Fakultät für informatik informatik 12 technische universität dortmund Specifications Peter Marwedel TU Dortmund, Informatik 12 Graphics: © Alexandra Nolte,
Peter Marwedel TU Dortmund, Informatik 12
Fakultät für informatik informatik 12 technische universität dortmund Hardware/Software Partitioning Peter Marwedel Informatik 12 TU Dortmund Germany Chapter.
Fakult ä t f ü r informatik informatik 12 technische universit ä t dortmund Data flow models Peter Marwedel TU Dortmund, Informatik 12 Graphics: © Alexandra.
Technische universität dortmund fakultät für informatik informatik 12 Embedded System Hardware - Processing - Peter Marwedel Informatik 12 TU Dortmund.
1 JIM-Studie 2010 Jugend, Information, (Multi-)Media Landesanstalt für Kommunikation Baden-Württemberg (LFK) Landeszentrale für Medien und Kommunikation.
Rechneraufbau & Rechnerstrukturen, Folie 2.1 © W. Oberschelp, G. Vossen W. Oberschelp G. Vossen Kapitel 2.
Vorlesung: 1 Betriebliche Informationssysteme 2003 Prof. Dr. G. Hellberg Studiengang Informatik FHDW Vorlesung: Betriebliche Informationssysteme Teil2.
Hier wird Wissen Wirklichkeit Computer Architecture – Part 10 – page 1 of 31 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt Part 10 Thread and.
Hier wird Wissen Wirklichkeit Computer Architecture – Part 5 – page 1 of 25 – Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting Part 5 Fundamentals in.
Differentielles Paar UIN rds gm UIN
Lehrstuhl Informatik III: Datenbanksysteme AstroGrid-D Meeting Heidelberg, Informationsfusion und -Integrität: Grid-Erweiterungen zum Datenmanagement.
Thomas Herrmann Software - Ergonomie bei interaktiven Medien Step 6: Ein/ Ausgabe Instrumente (Device-based controls) Trackball. Joystick.
Rechneraufbau & Rechnerstrukturen, Folie 12.1 © W. Oberschelp, G. Vossen W. Oberschelp G. Vossen Kapitel 12.
Seminar Telematiksysteme für Fernwartung und Ferndiagnose Basic Concepts in Control Theory MSc. Lei Ma 22 April, 2004.
INSTITUT FÜR DATENTECHNIK UND KOMMUNIKATIONS- NETZE 1 Steffen Stein, TU Braunschweig, 2009 A Timing-Aware Update Mechanism for Networked Real-Time Systems.
20:00.
Zusatzfolien zu B-Bäumen
Case Study Session in 9th GCSM: NEGA-Resources-Approach
Institut AIFB, Universität Karlsruhe (TH) Forschungsuniversität gegründet 1825 Towards Automatic Composition of Processes based on Semantic.
Lehrstuhl Technische Informatik - Computer Engineering Brandenburgische Technische Universität Cottbus 1 Hierarchical Test Technology for Systems on a.
Sanjay Patil Standards Architect – SAP AG April 2008
Centre for Public Administration Research E-Government for European Cities Thomas Prorok
BAS5SE | Fachhochschule Hagenberg | Daniel Khan | S SPR5 MVC Plugin Development SPR6P.
1 Ein kurzer Sprung in die tiefe Vergangenheit der Erde.
INTAKT- Interkulturelle Berufsfelderkundungen als ausbildungsbezogene Lerneinheiten in berufsqualifizierenden Auslandspraktika DE/10/LLP-LdV/TOI/
Algorithm Engineering Parallele Algorithmen Stefan Edelkamp.
Berner Fachhochschule Hochschule für Agrar-, Forst- und Lebensmittelwissenschaften HAFL Recent activities on ammonia emissions: Emission inventory Rindvieh.
Ein Projekt des Technischen Jugendfreizeit- und Bildungsvereins (tjfbv) e.V. kommunizieren.de Blended Learning for people with disabilities.
Ertragsteuern, 5. Auflage Christiana Djanani, Gernot Brähler, Christian Lösel, Andreas Krenzin © UVK Verlagsgesellschaft mbH, Konstanz und München 2012.
Fakultät für informatik informatik 12 technische universität dortmund Memory-architecture aware compilation - Sessions Peter Marwedel TU Dortmund.
Symmetrische Blockchiffren DES – der Data Encryption Standard
ESSnet Workshop Conclusions.
MINDREADER Ein magisch - interaktives Erlebnis mit ENZO PAOLO
1 (C)2006, Hermann Knoll, HTW Chur, FHO Quadratische Reste Definitionen: Quadratischer Rest Quadratwurzel Anwendungen.
3rd Review, Vienna, 16th of April 1999 SIT-MOON ESPRIT Project Nr Siemens AG Österreich Robotiker Technische Universität Wien Politecnico di Milano.
Berner Fachhochschule Hochschule für Agrar-, Forst- und Lebensmittelwissenschaften HAFL 95% der Ammoniakemissionen aus der Landwirtschaft Rindvieh Pflanzenbau.
1 Intern | ST-IN/PRM-EU | | © Robert Bosch GmbH Alle Rechte vorbehalten, auch bzgl. jeder Verfügung, Verwertung, Reproduktion, Bearbeitung,
Fakultät für informatik informatik 12 technische universität dortmund Standard Optimization Techniques 2010/12/20 Peter Marwedel TU Dortmund, Informatik.
Fakultät für informatik informatik 12 technische universität dortmund Memory architecture description languages - Session 20 - Peter Marwedel TU Dortmund.
1 Stevens Direct Scaling Methods and the Uniqueness Problem: Empirical Evaluation of an Axiom fundamental to Interval Scale Level.
Lehrstuhl für Waldbau, Technische Universität MünchenBudapest, 10./11. December 2006 WP 1 Status (TUM) Bernhard Felbermeier.
Selectivity in the German Mobility Panel Tobias Kuhnimhof Institute for Transport Studies, University of Karlsruhe Paris, May 20th, 2005.
Technische Universität München 1 CADUI' June FUNDP Namur G B I The FUSE-System: an Integrated User Interface Design Environment Frank Lonczewski.
TUM in CrossGrid Role and Contribution Fakultät für Informatik der Technischen Universität München Informatik X: Rechnertechnik und Rechnerorganisation.
Institut für Nachrichtentechnik U. Reimers Technische Universität Braunschweig The MultiMedia Home Platform (MHP): Hype or Reality ?
Technische Universität München Fakultät für Informatik Computer Graphics SS 2014 Rüdiger Westermann Lehrstuhl für Computer Graphik und Visualisierung.
1 Medienpädagogischer Forschungsverbund Südwest KIM-Studie 2014 Landesanstalt für Kommunikation Baden-Württemberg (LFK) Landeszentrale für Medien und Kommunikation.
 Präsentation transkript:

Mapping of Applications to Multi-Processor Systems Peter Marwedel Informatik 12 TU Dortmund Germany Graphics: © Alexandra Nolte, Gesine Marwedel, 2003 2008/12/14

Structure of this course New clustering 3: Embedded System HW 5: Scheduling, HW/SW-Partitioning, Applications to MP-Mapping 2: Specifications 8: Testing Application Knowledge 4: Standard Software, Real-Time Operating Systems 7: Optimization of Embedded Systems 6: Evaluation [Digression: Standard Optimization Techniques (1 Lecture)]

Energy Efficiency GOPs/J IPE=Inherent power efficiency © Hugo De Man, IMEC, 2007 Courtesy: Philips GOPs/J IPE=Inherent power efficiency AmI=Ambient Intelligence

Heterogeneous Architectures http://www.mpsoc-forum.org/2007/slides/Hattori.pdf

Homogeneous Architectures © J. Goodacre, ARM, 2008

Not many contributions yet! Problem Description Tools urgently needed! Given A set of applications Scenarios on how these applications will be used A set of candidate architectures comprising (Possibly heterogeneous) processors (Possibly heterogeneous) communication architectures Possible scheduling policies Find A mapping of applications to processors Appropriate scheduling techniques (if not fixed) A target architecture (if DSE is included) Objectives Keeping deadlines and/or maximizing performance Minimizing cost, energy consumption Not many contributions yet!

Key distinctions between parallel processing for embedded applications and PC-like systems Architectures Frequently heterogeneous, very compact Mostly homogeneous, not compact (x86 etc) x86 compatibility Not relevant Very relevant Architecture fixed? Sometimes not Yes MoCs C+multiple MoCs (SDF, …) Mostly von Neumann Applications Several concurrent apps. Mostly single app. Apps. known at design time Most, if not all Only some (WORD, ..) Objectives Multiple (energy, size, …) Average performance dominating Real-time relevant Yes (!) Hardly

Practical problem in automotive design Which processor should run the software?

Focus of the ArtistDesign Network 1st Workshop on Mapping Applications To MPSoCs, Rheinfels castle, June, 2008 Future architectures of MPSoCs (John Goodacre, ARM) SPEA2 (Lothar Thiele, ETHZ) Car-Entertainment Applications -> MP Systems (Marco Bekooij, NXP) MAPS (Rainer Leupers, Aachen) Overview over parallelization techniques (C. Lengauer, U. Passau) DSE of Heterogeneous MPSoCs (Ristau, Fettweis, TU Dresden) Daedalus (Ed Deprettere, U. Leiden) Mapping to the CELL processor (U. Bologna and others) Timing analysis issues (various) Programme and slides: http://www.artist-embedded.org/artist/ -Mapping-of-Applications-to-MPSoCs-.html

Related Work Scheduling theory: Provides insight for the mapping task  start times Hardware/software partitioning: Can be applied if it supports multiple processors High performance computing (HPC) Automatic parallelization, but only for single applications, fixed architectures, no support for scheduling, memory and communication model usually different High-level synthesis Provides useful terms like scheduling, allocation, assignment Optimization theory

A Simple Classification Architecture fixed/ Auto-parallelizing Fixed Architecture Architecture to be designed Starting from given model Map to CELL, HOPES, ETHAM COOL codesign tool; EXPO/SPEA2 Auto-parallelizing Franke, O’Boyle et al., Mnemee Daedalus MAPS

A fixed architecture approach: Map  CELL Martino Ruggiero, Luca Benini: Mapping task graphs to the CELL BE processor, 1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008

Partitioning into Allocation and Scheduling © Ruggiero, Benini, 2008

Related work ETHAM: An Energy-Aware Task Allocation Algorithm for Heterogeneous Multiprocessor (Chang et al. – DAC’08) Input: Directed Acyclic Task Graph (DAG) Combines mapping, scheduling + DVS utilization in 1 step Builds tree with all possible solutions Uses Ant Colony Optimizations (ACO) on this tree Temperature Aware Task Scheduling in MPSoCs (Coskun et al. – DATE’07) Higher reliability through temperature optimized dynamic OS scheduling Strategy is called “Adaptive Random” Considers history of core’s temperate for scheduling Session on Task Mapping @ DATE 2009

A Simple Classification Architecture fixed/ Auto-parallelizing Fixed Architecture Architecture to be designed Starting from given model Map to CELL, HOPES, ETHAM COOL codesign tool; EXPO/SPEA2 Auto-parallelizing Franke, O’Boyle et al. Mnemee Daedalus MAPS

Design Space Which architecture is better suited for our application? Communication Templates Computation Templates DSP mE Cipher SDRAM RISC FPGA LookUp Scheduling/Arbitration proportional share WFQ static dynamic fixed priority EDF TDMA FCFS Which architecture is better suited for our application? Architecture # 1 Architecture # 2 LookUp RISC EDF mE mE mE TDMA static Priority mE mE mE WFQ Cipher DSP © L. Thiele, ETHZ

Target Platform Communication micro-network on chip for synchronization and data exchange consisting of busses, routers, drivers some critical issues: topology, switching strategies (packet, circuit), routing strategies (static – reconfigurable – dynamic), arbitration policies (dynamic, TDM, CDMA, fixed priority) challenges: heterogeneous components and requirements, compose network that matches the traffic characteristics of a given application (domain) © L. Thiele, ETHZ

Mapping Scenario: Overview Given specification of the task structure (task model) = for each flow the corresponding tasks to be executed different usage scenarios (flow model) Sought processor implementation (resource model) = architecture* + task mapping + scheduling Objectives: maximize performance minimize cost Subject to: memory constraints delay constraints *: 2 cases: fixed architecture architecture to be designed (performance model) based on Thiele’s slides

Example: System Synthesis © L. Thiele, ETHZ

Basic Model – Problem Graph © L. Thiele, ETHZ

Basic Model – architecture graph © L. Thiele, ETHZ

Basic Model: Specification Graph © L. Thiele, ETHZ

Basic model - synthesis © L. Thiele, ETHZ

Basic model - implementation © L. Thiele, ETHZ

Example © L. Thiele, ETHZ

Evolutionary Algorithms for Design Space Exploration (DSE) © L. Thiele, ETHZ

Challenges © L. Thiele, ETHZ

EXPO – Tool architecture (1) MOSES system architecture performance values EXPO SPEA 2 task graph, scenario graph, flows & resources Exploration Cycle selection of “good” architectures © L. Thiele, ETHZ

EXPO – Tool architecture (2) Tool available online: http://www.tik.ee.ethz.ch/expo/expo.html © L. Thiele, ETHZ

EXPO – Tool (3) © L. Thiele, ETHZ

Application Model Example of a simple stream processing task structure: © L. Thiele, ETHZ

Exploration – Case Study (1) © L. Thiele, ETHZ

Exploration – Case Study (2) © L. Thiele, ETHZ

Exploration – Case Study (3) © L. Thiele, ETHZ

EA Case Study – Design Space © L. Thiele, ETHZ

Exploration Case Study – Solution 1 © L. Thiele, ETHZ

Exploration Case Study – Solution 2 © L. Thiele, ETHZ

More Results Performance for encryption/decryption Performance for RT voice processing © L. Thiele, ETHZ

A 2nd approach based on evolutionary algorithms: SYMTA/S: [R. Ernst et al.: A framework for modular analysis and exploration of heteterogenous embedded systems, Real-time Systems, 2006, p. 124]

Additional Contributions SystemCoDesigner (Teich et al.) Compaan (B. Kienhuis et al.) Milan (J. David et al.) Charmed (S. Bhattacharyya et al.) Metropolis (A. Sangiovanni-Vincentelli et al.)

A Simple Classification Architecture fixed/ Auto-parallelizing Fixed Architecture Architecture to be designed Starting from given model Map to CELL, HOPES, ETHAM COOL codesign tool; EXPO/SPEA2 Auto-parallelizing Franke, O’Boyle et al. Mnemee Daedalus MAPS

Xilinx Platform Studio (XPS) Daedalus Design-flow Sequential application Sequential application System-level design space exploration Explore, modify, select instances Library of IP cores Library of IP cores High-level Models Sesame KPNgen Automatic Parallelization System-Level Specification Common XML Interface Platform specification Mapping specification Kahn Process Network Parallel application specification RTL-level Models System-level synthesis ESPAM RTL-Level Specification Synthesizable VHDL C/C++ code for processors MP-SoC Ed Deprettere et al.: Toward Composable Multimedia MP-SoC Design,1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008 Xilinx Platform Studio (XPS) Multi-processor System on Chip (Synthesizable VHDL and C/C++ code for processors) © E. Deprettere, U. Leiden

JPEG/JPEG200 case study Example architecture instances for a single-tile JPEG encoder: 16KB 32KB 32KB 4KB 2KB Vin,DCT Q,VLE,Vout Vin,Q,VLE,Vout DCT 2 MicroBlaze processors (50KB) 1 MicroBlaze, 1HW DCT (36KB) DCT, Q 2KB 4x2KB DCT Q 2KB 2KB 8KB 32KB 8KB DCT, Q Vin VLE, Vout Vin 8KB 32KB VLE, Vout DCT, Q 8KB 4x2KB 2KB 2KB 2KB DCT Q DCT, Q 4x16KB 6 MicroBlaze processors (120KB) 4 MicroBlaze, 2HW DCT (68KB) © E. Deprettere, U. Leiden

Sesame DSE results: Single JPEG encoder DSE © E. Deprettere, U. Leiden

A Simple Classification Architecture fixed/ Auto-parallelizing Fixed Architecture Architecture to be designed Starting from given model Map to CELL, HOPES, ETHAM COOL codesign tool; EXPO/SPEA2 Auto-parallelizing Franke, O’Boyle et al. Mnemee Daedalus MAPS

MAPS-TCT Framework Rainer Leupers, Weihua Sheng: MAPS: An Integrated Framework for MPSoC Application Parallelization, 1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008 © Leupers, Sheng, 2008

MAPS: MPSoC Application Programming Studio © Leupers, Sheng, 2008

A Simple Classification Architecture fixed/ Auto-parallelizing Fixed Architecture Architecture to be designed Starting from given model Map to CELL, HOPES, ETHAM COOL codesign tool; EXPO/SPEA2 Auto-parallelizing Franke, O’Boyle et al. Mnemee Daedalus MAPS

Auto-Parallelizing Compilers Discipline “High Performance Computing”: Research on vectorizing compilers for more than 25 years. Traditionally: Fortran compilers. Such vectorizing compilers usually inappropriate for Multi-DSPs, since assumptions on memory model unrealistic: Communication between processors via shared memory Memory has only one single common address space  De Facto no auto-parallelizing compiler for Multi-DSPs! © Falk, 2008

Auto-Parallelizing Compilers (2) LooPo (C. Lengauer et al.) Various commercial compilers … Mostly focusing on HPC

Workflow of Auto-Parallelization for Multi-DSPs (B. Franke, M. O’Boyle) Program Recovery Removal of undesired low-level constructs in IR Replacement by equivalent high-level constructs Parallelism Detection Identification of parallelizable loops Partitioning and Mapping of Data Minimization of communication overhead between DSPs Memory Access Localization Minimization of accesses to remote memories Data Transfer Optimization Exploitation of DMA for burst transfers Beyond Loop Parallelization and Static Analysis © Falk, 2008

Problems with Real Codes “Hand-optimized” for specific platform E.g. pointer based array traversals for efficient address calculation Unusual idioms to implement frequently used operations E.g. modulo operations in array index expressions for circular buffers Defeat auto-parallelization! “Plain C” is best for parallelization Need to recover canonical program representation © Björn Franke, 2008

Program Recovery - Eliminate Pointer Conversion - Affine array index expressions are critical! int A[N],B[N],C[N]; int *ptr_a = &A[0]; int *ptr_b = &B[0]; int *ptr_c = &C[N-1]; for (i = 0; i < N; i++) { *ptr_a++ = *ptr_b++ * *ptr_c--; } int A[N],B[N],C[N]; for (i = 0; i < N; i++) { A[i] = B[i] * C[N-i-1]; } Affine Index Expressions © Björn Franke, 2008

Program Recovery - Modulo Elimination - Affine array index expressions are critical! int A[N], B[M]; for (i = 0; i < N; i++) { X = A[i] * B[i%M]; } int A[N], B[M]; for (i = 0; i < N/M; i++) { for (j = 0; j < M; j++) X = A[i*M+j] * B[j]; } Affine Index Expressions © Björn Franke, 2008

Rank Modifying Code and Data Transformations Unified Linear Algebra Framework for Code and Data Transformations Extensions to Unimodular Transformations Elements of transformation matrix are functions with div & mod Representation of additional transformations Array dimensionality transformations Strip-mining and linearization of loops © Björn Franke, 2008

Coarse Grain Parallelism & Task Graph Extraction Profiling Infrastructure © Björn Franke, 2008

Coarse Grain Parallelism & Task Graph Extraction © Björn Franke, 2008

Coarse Grain Parallelism & Task Graph Extraction © Björn Franke, 2008

Memory architecture aware compilation: Proposed Mnemee Tool Flow (Simplified) Mnemee=Memory management technology for adaptive and efficient design of embedded systems, (European Project: IMEC, Dortmund, Eindhoven, U. Athens, Intracom, Thales) C-Source Codes Task Graph Extraction Task To Processor - Mapping Memory Optimizations Compilation C-Source Codes C-Source Codes Load Time Optimization Architecture Archi-tecture GUI Run Time Optimization Using ICD-C Compiler Tool-Box (www.icd.de/es)

Future Work Implementation of the Mnemee tool flow 2nd Workshop on Mapping of Applications to MPSoCs To be held June 29-30, 2009, at Rheinfels Castle Information: http://www.artist-embedded.org/artist/ Mapping-Applications-to-MPSoCs.html Work by other researchers in the area

Summary (1) Mapping applications onto heterogeneous MP systems needs allocation (if hardware is not fixed) binding of tasks to resources scheduling There exists a large design space Evolutionary algorithms currently are the only algorithms known to solve the existing multi-objective optimization problems. Details on the encoding, optimization strategies etc. are available in papers by L. Thiele et al. (ETH Zürich) and R. Ernst (TU Braunschweig)

Summary (2) Clear trend toward multi-processor systems for embedded systems Using architecture crucially depends on mapping tools ArtistDesign network cluster focusing on mapping Providing an overview of available techniques Two criteria for classification Fixed / flexible architecture Auto parallelizing / non-parallelizing Introduction to proposed Mnemee tool chain Future work Summary