Fakultät für informatik informatik 12 technische universität dortmund Optimizations - Compilation for Embedded Processors - Peter Marwedel TU Dortmund.

Presentation transcript:

Optimizations - Compilation for Embedded Processors -
Peter Marwedel, TU Dortmund, Fakultät für Informatik, Informatik 12, Germany
These slides use Microsoft clip arts. Microsoft copyright restrictions apply. © Springer, 2010

technische universität dortmund, fakultät für informatik, P. Marwedel, Informatik 12, 2013
Structure of this course
- 2: Specification & modelling
- 3: ES-hardware
- 4: System software (RTOS, middleware, …)
- 5: Evaluation & validation (energy, cost, performance, …)
- 6: Application mapping
- 7: Optimization
- 8: Test
Figure labels: Application Knowledge, Design repository, Design
Numbers denote sequence of chapters.

SPM+MMU (1)
How to use SPM in a system with virtual addressing?
- Virtual SPM: typically accesses MMU + SPM in parallel; not energy efficient
- Real SPM: suffers from potentially long VA translation
- Egger, Lee, Shin (Seoul Nat. U.): introduction of a small µTLB that translates recent addresses fast
[B. Egger, J. Lee, H. Shin: Scratchpad memory management for portable systems with a memory management unit, CASES, 2006 (best paper)]
Figure labels: Proc., $, SPM, MMU

SPM+MMU (2)
- µTLB generates the physical address in 1 cycle
- If the address corresponds to SPM, the SPM is used; otherwise, the mini-cache is accessed
- Mini-cache provides reasonable performance for non-optimized code
- µTLB miss triggers main TLB/MMU
- SPM is used only for instructions
- Instructions are stored in pages
- Pages are classified as cacheable, non-cacheable, and pageable (= suitable for SPM)
Figure labels: CPU core, MMU, unified TLB, mini-cache (tag RAM, data RAM), SPM, SPM base reg., comparator, µTLB, instruction VA, PA
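The address routing described on this slide can be sketched as follows. This is a minimal illustration, not the hardware of Egger et al.: the page size, the dictionary-based TLBs, and the names `MicroTLB` and `route_fetch` are assumptions made for the example.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size, not given on the slide


@dataclass
class MicroTLB:
    """Tiny translation buffer: virtual page number -> physical page number."""
    entries: dict

    def translate(self, va):
        vpn, offset = divmod(va, PAGE_SIZE)
        ppn = self.entries.get(vpn)
        return None if ppn is None else ppn * PAGE_SIZE + offset


def route_fetch(va, utlb, main_tlb, spm_base, spm_size):
    """Return (physical address, target memory, translation source)."""
    pa = utlb.translate(va)
    source = "utlb"
    if pa is None:
        # A µTLB miss falls back to the main TLB/MMU and refills the µTLB.
        vpn, offset = divmod(va, PAGE_SIZE)
        ppn = main_tlb[vpn]
        utlb.entries[vpn] = ppn
        pa = ppn * PAGE_SIZE + offset
        source = "main_tlb"
    # Comparator against the SPM base register: SPM if in range, else mini-cache.
    in_spm = spm_base <= pa < spm_base + spm_size
    return pa, ("spm" if in_spm else "minicache"), source


utlb = MicroTLB({})
main_tlb = {0: 16, 1: 2}  # hypothetical page mappings
first = route_fetch(0x10, utlb, main_tlb, spm_base=16 * PAGE_SIZE, spm_size=PAGE_SIZE)
second = route_fetch(0x10, utlb, main_tlb, spm_base=16 * PAGE_SIZE, spm_size=PAGE_SIZE)
other = route_fetch(0x1004, utlb, main_tlb, spm_base=16 * PAGE_SIZE, spm_size=PAGE_SIZE)
```

The first fetch misses in the µTLB and is served through the main TLB; the repeated fetch is translated by the µTLB alone.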

SPM+MMU (3)
- Application binaries are modified: frequently executed code is put into pageable pages
- Initially, page-table entries for pageable code are marked invalid
- If an invalid page is accessed, a page table exception invokes the SPM manager (SPMM)
- SPMM allocates space in SPM and sets the page table entry
- If SPMM detects more requests than fit into SPM, SPM eviction is started
- Compiler does not need to know the SPM size
Figure labels: virtual memory, physical memory, SPM, main memory, stack/heap region, pageable region, cached region, uncached region, PC
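The on-demand behaviour of the SPM manager can be sketched as below. The slide does not state the eviction policy, so FIFO is assumed here purely for illustration; the class name `SPMManager` and the page names are hypothetical.

```python
from collections import OrderedDict


class SPMManager:
    """Toy SPMM: maps pageable code pages into a fixed number of SPM frames."""

    def __init__(self, spm_pages):
        self.free_frames = list(range(spm_pages))
        self.resident = OrderedDict()   # page -> SPM frame, oldest first
        self.page_table_valid = {}      # page -> valid bit (initially invalid)

    def access(self, page):
        if self.page_table_valid.get(page, False):
            return "hit"
        # Invalid entry: the page table exception invokes the manager.
        if not self.free_frames:
            # More requests than fit into SPM: evict (FIFO is an assumption).
            victim, frame = self.resident.popitem(last=False)
            self.page_table_valid[victim] = False
            self.free_frames.append(frame)
        frame = self.free_frames.pop()
        self.resident[page] = frame
        self.page_table_valid[page] = True  # set the page table entry
        return "fault"


spmm = SPMManager(spm_pages=2)
trace = [spmm.access(p) for p in ["A", "B", "A", "C", "A"]]
```

Because the manager only sees page requests at run time, the same binary works for any SPM size, which matches the last point of the slide.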

Multi-processor ARM (MPARM) Framework
- Homogeneous SMP, similar to the CELL processor
- Processing unit: ARM7T processor
- Shared coherent main memory
- Private memory: scratchpad memory
Figure labels: ARM, SPM, Interrupt Device, Semaphore Device, Interconnect (AMBA or STBus), Shared Main Memory

Application Example: Multi-Processor Edge Detection
- Source, sink and n compute processors
- Each image is processed by an independent compute processor
- Communication overhead is minimized
Figure labels: Source, Sink, Compute Processors

Results: Scratchpad Overlay for Edge Detection
- 2 CPs are better than 1 CP, then energy consumption stabilizes
- Best scratchpad size: 4 kB (1 CP & 2 CPs), 8 kB (3 CPs & 4 CPs)

Results: DES-Encryption
- DES-Encryption on 4 processors: 2 controllers + 2 compute engines
- Energy values from ST Microelectronics
- Result of ongoing cooperation between U. Bologna and U. Dortmund, supported by the ARTIST2 network of excellence

Let's look at timing!
Definition of worst-case execution time (WCET); WCET_EST denotes its estimate.
WCET_EST must be
1. safe (i.e. WCET_EST ≥ WCET) and
2. tight (WCET_EST - WCET ≪ WCET_EST)
Figure labels: WCET, WCET_EST, Deadline
© Graphics: R. Wilhelm + Microsoft Cliparts

Current Trial-and-Error Based Development
1. Specification of CPS/ES system
2. Generation of code (ANSI-C or similar)
3. Compilation of code
4. Execution and/or simulation of code, using a (e.g. random) set of input data
5. Measurement-based computation of estimated worst-case execution time (WCET_meas)
6. Adding safety margin m on top of WCET_meas: WCET_hypo := (1 + m) * WCET_meas
7. If WCET_hypo > deadline: change some detail, go back to 1 or 2
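Steps 5 to 7 amount to a one-line computation, sketched here. The function names and the sample measurements are hypothetical; note that the result is only as good as the measured inputs, so it is not a safe bound.

```python
def wcet_hypothesis(measured_times, margin):
    """WCET_hypo := (1 + m) * WCET_meas, with WCET_meas the largest observed time."""
    wcet_meas = max(measured_times)
    return (1 + margin) * wcet_meas


def needs_another_iteration(measured_times, margin, deadline):
    """Step 7: go back to step 1 or 2 if WCET_hypo exceeds the deadline."""
    return wcet_hypothesis(measured_times, margin) > deadline


samples = [80, 95, 100]  # hypothetical measured execution times (e.g. in µs)
hypo = wcet_hypothesis(samples, margin=0.25)
retry = needs_another_iteration(samples, margin=0.25, deadline=120)
```

If an unmeasured input takes longer than `(1 + m)` times the largest sample, the deadline check passes although the real system can still miss it.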

Problems with this Approach
Dependability
- Computed WCET_hypo is not a safe approximation
- Time constraint may be violated
Design time
- How to find necessary changes?
- How many iterations until successful?
"Make the common case fast" is a wrong approach for RT-systems
- Computer architecture and compiler techniques focus on average speed
- Circuit designers know it's wrong
- Compiler designers (typically) don't
Optimizing compilers are unaware of cost functions other than code size

Integration of WCET estimation and compilation
- Computing WCET_EST after code generation is too late
- Why not consider WCET_EST as an objective function already in the compiler?
- Integration of aiT and compiler

Challenges for WCET_EST-Minimization
Worst-Case Execution Path (WCEP)
- WCET_EST of a program = length of the longest execution path (WCEP) in that program
- WCET_EST-minimization: reduction of the longest path
- Other optimizations do not result in a reduction of WCET_EST
- Optimizations need to know the WCEP
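The longest-path view of WCET_EST can be illustrated with a small sketch. It assumes an acyclic control-flow graph with a fixed cost per basic block, which is a strong simplification of what a WCET analyzer does; the graph, the costs, and the name `worst_case_path` are hypothetical.

```python
from functools import lru_cache


def worst_case_path(cfg, cost, entry):
    """Return (length, path) of the longest path from entry in an acyclic CFG."""

    @lru_cache(maxsize=None)
    def longest(node):
        tails = [longest(succ) for succ in cfg.get(node, ())]
        tail_len, tail_path = max(tails, key=lambda t: t[0]) if tails else (0, ())
        return cost[node] + tail_len, (node,) + tail_path

    return longest(entry)


# Hypothetical if/else diamond; block costs are in cycles.
cfg = {"entry": ("then", "else"), "then": ("exit",), "else": ("exit",), "exit": ()}

base = worst_case_path(cfg, {"entry": 2, "then": 10, "else": 4, "exit": 1}, "entry")
# Optimizing a block that is not on the WCEP leaves WCET_EST unchanged.
off_path = worst_case_path(cfg, {"entry": 2, "then": 10, "else": 1, "exit": 1}, "entry")
# Optimizing the WCEP helps, but the WCEP may then switch to another path.
on_path = worst_case_path(cfg, {"entry": 2, "then": 3, "else": 4, "exit": 1}, "entry")
```

The third call shows why optimizations need to track the WCEP: after shortening `then`, the `else` branch becomes the new longest path.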

WCET-oriented optimizations
- Extended loop analysis (CGO 09)
- Instruction cache locking (CODES/ISSS 07, CGO 12)
- Cache partitioning (WCET-Workshop 09)
- Procedure cloning (WCET-WS 07, CODES 07, SCOPES 08)
- Procedure/code positioning (ECRTS 08, CASES 11 (2x))
- Function inlining (SMART 09, SMART 10)
- Loop unswitching/invariant paths (SCOPES 09)
- Loop unrolling (ECRTS 09)
- Register allocation (DAC 09, ECRTS 11)
- Scratchpad optimization (DAC 09)
- Extension towards multi-objective optimization (RTSS 08)
- Superblock-based optimizations (ICESS 10)
- Surveys (Springer 10, Springer 12)

Loop Unrolling as an Example
Unrolling replaces the original loop with several instances of the loop body.
Positive effects
- Reduced overhead for loop control
- Enables instruction level parallelism (ILP)
- Offers potential for following optimizations; unroll early in the optimization chain
Negative effects
- Aggressive unrolling leads to I-cache overflows
- Additional spill code instructions
- Control code may cancel positive effects
Consequences of the transformation are hardly known
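As an illustration of the transformation itself, the sketch below shows a loop before and after unrolling by a factor of 4, with a remainder loop for trip counts that are not a multiple of 4. A compiler applies this to its intermediate representation; Python is used here only to show the shape of the code, and the function names are hypothetical.

```python
def sum_rolled(a):
    """Original loop: one loop-control check per element."""
    total = 0
    for i in range(len(a)):
        total += a[i]
    return total


def sum_unrolled4(a):
    """Unrolled by 4: one loop-control check per four elements."""
    total = 0
    n = len(a)
    main = n - n % 4
    for i in range(0, main, 4):
        # Four instances of the loop body; independent loads expose ILP.
        total += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
    for i in range(main, n):
        # Remainder loop: the extra control code mentioned under negative effects.
        total += a[i]
    return total


data = list(range(10))
same = sum_rolled(data) == sum_unrolled4(data)
```

The unrolled version is longer, which is where the I-cache and spill-code concerns listed above come from.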

WCET_EST-aware Loop Unrolling via Back-annotation
- WCET_EST information is available at assembly level
- Unrolling is to be applied at the internal representation of the source code
- Solution: back-annotation. The experimental WCET_EST-aware compiler WCC allows copying information from assembly code to source code:
  - WCET_EST data
  - Assembly code size
  - Amount of spill code
- Memory architecture info is available
Figure labels: High-Level IR (Source Code), Code generation, Low-Level IR (Assembly Code), Back-Annotation, Mem. Spec.

Results for unrolling: WCET_EST
- 100%: avg. WCET_EST for all benchmarks with -O3 & no unrolling
- WCET_EST reduction between 10.2% and 15.4%
- WCET_EST-driven unrolling outperforms standard unrolling by 13.7%
Figure labels: Relative WCET [%], STANDARD

WCET-aware superblock optimizations
- WCC's superblocks: proposed for the first time at ANSI-C code level, rely on WCET_EST timing data
- WCET_EST-aware superblock optimizations: Common Subexpression Elimination (CSE) and Dead Code Elimination (DCE) ported to WCC

Relative WCET_EST with I-Cache Locking
5 benchmarks / ARM920T / postpass optimization [S. Plazar et al.]

Results: Register allocation
- Registers = fastest level in the memory hierarchy
- Interest in good global register allocation techniques
- Frequently based on coloring of the interference graph
Figure labels: lifetimes of v1, v2, v3; interference graph with nodes v1, v2, v3
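Coloring of an interference graph can be sketched as follows. This is a simplified greedy coloring in highest-degree-first order, not a full Chaitin-style allocator and not the WCET-aware allocator evaluated in the lecture; the lifetimes of v1, v2, v3 and all function names are assumptions for the example.

```python
def interference_graph(lifetimes):
    """Variables interfere if their lifetimes (half-open intervals) overlap."""
    graph = {v: set() for v in lifetimes}
    names = list(lifetimes)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (s1, e1), (s2, e2) = lifetimes[a], lifetimes[b]
            if s1 < e2 and s2 < e1:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def color(graph, num_regs):
    """Assign registers greedily, highest degree first; uncolorable nodes spill."""
    order = sorted(graph, key=lambda v: (-len(graph[v]), v))
    assignment, spilled = {}, []
    for v in order:
        used = {assignment[n] for n in graph[v] if n in assignment}
        free = [r for r in range(num_regs) if r not in used]
        if free:
            assignment[v] = free[0]
        else:
            spilled.append(v)
    return assignment, spilled


lifetimes = {"v1": (0, 4), "v2": (2, 6), "v3": (5, 8)}  # hypothetical
graph = interference_graph(lifetimes)
two_regs = color(graph, num_regs=2)
one_reg = color(graph, num_regs=1)
```

With two registers, v1 and v3 share a register because their lifetimes do not overlap; with one register, two variables are spilled to memory.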

Register Allocation
100% = WCET_EST using standard graph coloring (highest degree)
Chart values: 93%, 24%, 69% [H. Falk]

Potential cache thrashing
[S. Plazar: Memory-based Optimization Techniques for Real-Time Systems, PhD thesis, TU Dortmund, June 2012]

Avoiding cache thrashing
Key idea: selectively allocate code to uncached memory
[S. Plazar: Memory-based Optimization Techniques for Real-Time Systems, PhD thesis, TU Dortmund, June 2012]

Results
[S. Plazar: Memory-based Optimization Techniques for Real-Time Systems, PhD thesis, TU Dortmund, June 2012]

Improving predictability for caches
- Loop caches
- Mapping code to less used part(s) of the index space
- Cache locking/freezing
- Changing the memory allocation for code or data
- Mapping pieces of software to specific ways
Methods:
- Generating appropriate way in software
- Allocation of certain parts of the address space to a specific way
- Including way-identifiers in virtual to real-address translation
Caches then behave almost like a scratch pad

Code Layout Transformations (1)
Execution counts based approach:
- Sort the functions according to execution counts: f4 > f1 > f2 > f5 > f3
- Place functions in decreasing order of execution counts
Figure labels: f1 (1100), f2 (900), f3 (400), f4 (2000), f5 (700)
[S. McFarling: Program optimization for instruction caches, 3rd International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), 1989]

Code Layout Transformations (2)
Execution counts based approach:
- Sort the functions according to execution counts: f4 > f1 > f2 > f5 > f3
- Place functions in decreasing order of execution counts
- Transformation increases spatial locality
- Does not take calling order into account
Figure: resulting layout f4 (2000), f1 (1100), f2 (900), f5 (700), f3 (400), shown next to the call graph
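The execution-count-based placement reduces to a sort, sketched here with the counts shown in the figure. The function name `layout_by_execution_count` is hypothetical.

```python
def layout_by_execution_count(counts):
    """Place functions in decreasing order of their execution counts."""
    return sorted(counts, key=counts.get, reverse=True)


counts = {"f1": 1100, "f2": 900, "f3": 400, "f4": 2000, "f5": 700}
layout = layout_by_execution_count(counts)
```

The sort looks only at per-function counts, so two functions that call each other frequently can still end up far apart in the layout.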

Code Layout Transformations (3)
Call-graph based algorithm:
- Create weighted call-graph
- Place functions according to weighted depth-first traversal: f4 > f2 > f1 > f3 > f5
- Increases spatial locality
Figure: placed so far: f4 (2000)
[W. W. Hwu et al.: Achieving high instruction cache performance with an optimizing compiler, 16th Annual International Symposium on Computer Architecture, 1989]

Code Layout Transformations (3)
Call-graph based algorithm:
- Create weighted call-graph
- Place functions according to weighted depth-first traversal: f4 > f2 > f1 > f3 > f5
- Increases spatial locality
Figure: placed so far: f4 (2000), f2 (900)

Code Layout Transformations (4)
Call-graph based algorithm:
- Create weighted call-graph
- Place functions according to weighted depth-first traversal: f4 > f2 > f1 > f3 > f5
- Increases spatial locality
Figure: placed so far: f4 (2000), f2 (900), f1 (1100)

Code Layout Transformations (5)
Call-graph based algorithm:
- Create weighted call-graph
- Place functions according to weighted depth-first traversal: f4 > f2 > f1 > f3 > f5
- Increases spatial locality
Figure: placed so far: f4 (2000), f2 (900), f1 (1100), f3 (400)

Code Layout Transformations (6)
Call-graph based algorithm:
- Create weighted call-graph
- Place functions according to weighted depth-first traversal: f4 > f2 > f1 > f3 > f5
- Combined with placing frequently executed traces at the top of the code space for functions
- Increases spatial locality
Figure: final placement: f4 (2000), f2 (900), f1 (1100), f3 (400), f5 (700)
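A weighted depth-first placement can be sketched as below. The transcript does not preserve the edges or edge weights of the call graph, so the graph used here is hypothetical and chosen only so that the traversal reproduces the order shown on the slides; the function name is also an assumption, and this is not the full algorithm of Hwu et al.

```python
def layout_by_call_graph(call_graph, root):
    """Depth-first traversal that follows the heaviest call edge first."""
    order, seen = [], set()

    def visit(func):
        seen.add(func)
        order.append(func)
        callees = sorted(call_graph.get(func, {}).items(), key=lambda kv: -kv[1])
        for callee, _weight in callees:
            if callee not in seen:
                visit(callee)

    visit(root)
    return order


# Hypothetical call graph: caller -> {callee: call count}.
call_graph = {
    "f4": {"f2": 900, "f5": 700},
    "f2": {"f1": 600},
    "f1": {"f3": 400},
}
layout = layout_by_call_graph(call_graph, "f4")
```

Compared with sorting by execution counts, callers and their most frequent callees end up adjacent, which is the calling-order information the count-based approach ignores.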

Way prediction/selective direct mapping
[M. D. Powell, A. Agarwal, T. N. Vijaykumar, B. Falsafi, K. Roy: Reducing Set-Associative Cache Energy via Way-Prediction and Selective Direct-Mapping, MICRO-34, 2001] © ACM

Results for the paper on way prediction (2)
© ACM

Prefetching
- Prefetch instructions load values into the cache
- Pipeline is not stalled for prefetching
- Prefetching instructions introduced in ~
- Potentially, all miss latencies can be avoided
Disadvantages:
- Increased # of instructions
- Potential premature eviction of cache line
- Potentially pre-loads lines that are never used
Steps:
- Determination of references requiring prefetches
- Insertion of prefetches (early enough!)
[R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures, Morgan Kaufmann, 2002]
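Why prefetches must be inserted early enough can be seen in a toy model of a sequential array traversal. Everything here is an assumption made for illustration: time is counted in loop iterations, the cache is unbounded, a prefetched line becomes usable `latency` iterations after the prefetch is issued, and the name `count_stalls` is hypothetical.

```python
def count_stalls(n, line_size, latency, distance=None):
    """Count demand accesses that have to wait for their cache line.

    distance: how many elements ahead a prefetch is issued (None = no prefetching).
    """
    ready_at = {}  # cache line -> iteration at which the line is usable
    stalls = 0
    for i in range(n):
        if distance is not None and i + distance < n:
            # Issue a prefetch; it does not stall the pipeline.
            ready_at.setdefault((i + distance) // line_size, i + latency)
        line = i // line_size
        if ready_at.get(line, float("inf")) > i:
            stalls += 1          # line missing or still in flight
            ready_at[line] = i   # demand fetch: pipeline waits until it arrives
    return stalls


no_prefetch = count_stalls(16, line_size=4, latency=3)
too_late = count_stalls(16, line_size=4, latency=3, distance=1)
early_enough = count_stalls(16, line_size=4, latency=3, distance=4)
```

In this model a prefetch distance shorter than the latency removes no stalls at all, while a sufficient distance leaves only the stall on the very first line.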

Results for prefetching
[Mowry, as cited by R. Allen & K. Kennedy] © Morgan Kaufmann, 2002
Not very impressive!

Memory Aware Compilation and Simulation Framework (for C): MACC
[M. Verma, L. Wehmeyer, R. Pyka, P. Marwedel, L. Benini: Compilation and Simulation Tool Chain for Memory Aware Energy Optimizations, Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS VI), 2006]
Figure labels: Compilation Framework, Application C code, Source-level memory optimizer, encc, ARM gcc, M5 DSP, Array partitioning, SPM overlay, Executable binary, Energy database, Memory hierarchy description, Simulation Framework, Profile report, Memory simulator, Processor simulators (ARM7/M5), Profiler, MPSoC simulator

Memory architecture MACCv2
- Query can include address, time stamp, value, …
- Query can request energy, delay, stored values
- Query is processed along a chain of HW components, incl. busses, ports, address translations etc., each adding delay & energy
- API query to model simplifies integration into compiler
- External XML representation
[R. Pyka et al.: Versatile System level Memory Description Approach for embedded MPSoCs, University of Dortmund, Informatik 12, 2007]
Figure labels: REQ (Energy = ?, Cycles = ?); +10 Energy, +5 Cycles; +1 Energy, +2 Cycles; +1 Energy, +0 Cycles; CPU1; MM; ASPC-1 (IFETCH, DRD, DWR, MAINAS); ASPC-M (0…ffff); ASPC-B (0…3ffff)
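The idea of a query travelling along a chain of components, each adding delay and energy, can be sketched as follows. This is not the MACCv2 API: the class and function names are hypothetical, and the assignment of the three cost pairs from the figure to named components is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One hardware element on the access path, with its cost contribution."""
    name: str
    energy: int
    cycles: int


def query(chain):
    """Accumulate energy and cycles over all components the request passes."""
    total_energy = sum(c.energy for c in chain)
    total_cycles = sum(c.cycles for c in chain)
    return {"energy": total_energy, "cycles": total_cycles}


# Cost pairs taken from the figure; the component names are assumed.
chain = [
    Component("address space", energy=1, cycles=0),
    Component("bus", energy=1, cycles=2),
    Component("main memory", energy=10, cycles=5),
]
answer = query(chain)
```

A compiler pass could issue such a query per memory access to compare candidate placements without knowing the internals of the memory model.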

Case study: MPEG4 + TCP/IP transmission
[A. Mallik et al.: Deliverable D 5.3: Implementation of the optimization approach in the non-open source application domain and evaluation, 2011]
Confidential

Execution platforms
Figure labels: Flat, Hierarchical

Impact of parallelization
Significant contention limits usefulness of parallelization for the flat architecture.
Figure labels: 6 frames, 11 frames

Impact of scratchpad optimization

Overall result
- Design time: -80%
- Memory bandwidth: reduction possible, but with increased footprint
- Memory footprint: -1%
- Execution time: -30%
- Energy: -30% (estimated) if DVS is added

Memory hierarchies beyond main memory
- Massive datasets are being collected everywhere
- Storage management software is a billion-$ industry
- More new information over the next 2 years than in all previous history
Examples (2002):
- Phone: AT&T 20 TB phone call database, wireless tracking
- Consumer: WalMart 70 TB database, buying patterns
- Web: web crawl of 200M pages and 2000M links; Akamai stores 7 billion clicks per day
- Geography: NASA satellites generate 1.2 TB per day
[© Lars Arge, I/O-Algorithms]

Example: LIDAR Terrain Data
COWI A/S (and others) is currently scanning Denmark.
Figure labels: ~1.2 km; ~280 km/h at m; ~1.5 m between measurements
[© Lars Arge, I/O-Algorithms]

Application Example: Flooding Prediction
Figure labels: +1 meter, +2 meter
[© Lars Arge, I/O-Algorithms]