Fakultät für informatik informatik 12 technische universität dortmund Optimizations - Compilation for Embedded Processors - Peter Marwedel TU Dortmund.

Slides:



Advertisements
Ähnliche Präsentationen
Cadastre for the 21st Century – The German Way
Advertisements

Service Oriented Architectures for Remote Instrumentation
Finding the Pattern You Need: The Design Pattern Intent Ontology
E-Solutions mySchoeller.com for Felix Schoeller Imaging
H - A - M - L - E - IC T Teachers Acting Patterns while Teaching with New Media in the Subjects German, Mathematics and Computer Science Prof. S. Blömeke,
Embedded System Hardware
P. Marwedel Informatik 12, U. Dortmund
Finite state machines & message passing: SDL
Fakultät für informatik informatik 12 technische universität dortmund Mapping of Applications to Multi-Processor Systems Peter Marwedel Informatik 12 TU.
R. Zankl – Ch. Oelschlegel – M. Schüler – M. Karg – H. Obermayer R. Gottanka – F. Rösch – P. Keidler – A. Spangler th Expert Meeting Business.
fakultät für informatik informatik 12 technische universität dortmund Test Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/17 Graphics: © Alexandra.
Embedded & Real-time Operating Systems
Fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/17 Graphics:
fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:
Fakultät für informatik informatik 12 technische universität dortmund Mapping of Applications to Platforms Peter Marwedel TU Dortmund, Informatik 12 Germany.
Fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2010/01/13 Graphics:
Embedded System Hardware - Processing -
Fakultät für informatik informatik 12 technische universität dortmund Universität Dortmund Middleware Peter Marwedel TU Dortmund, Informatik 12 Germany.
Embedded System Hardware - Reconfigurable Hardware -
Fakultät für informatik informatik 12 technische universität dortmund Specifications Peter Marwedel TU Dortmund, Informatik 12 Graphics: © Alexandra Nolte,
Peter Marwedel TU Dortmund, Informatik 12
Fakultät für informatik informatik 12 technische universität dortmund Embedded System Hardware Peter Marwedel Informatik 12 TU Dortmund Germany 2011/03/09.
Fakultät für informatik informatik 12 technische universität dortmund Hardware/Software Partitioning Peter Marwedel Informatik 12 TU Dortmund Germany Chapter.
Fakult ä t f ü r informatik informatik 12 technische universit ä t dortmund Data flow models Peter Marwedel TU Dortmund, Informatik 12 Graphics: © Alexandra.
Fakultät für informatik informatik 12 technische universität dortmund Optimizations - Compilation for Embedded Processors - Peter Marwedel TU Dortmund.
Technische universität dortmund fakultät für informatik informatik 12 Embedded System Hardware Peter Marwedel Informatik 12 TU Dortmund Germany
Embedded System Hardware - Processing -
Technische universität dortmund fakultät für informatik informatik 12 Embedded System Hardware - Processing - Peter Marwedel Informatik 12 TU Dortmund.
Hier wird Wissen Wirklichkeit Computer Architecture – Part 10 – page 1 of 31 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt Part 10 Thread and.
Hier wird Wissen Wirklichkeit Computer Architecture – Part 5 – page 1 of 25 – Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting Part 5 Fundamentals in.
Lancing: What is the future? Lutz Heinemann Profil Institute for Clinical Research, San Diego, US Profil Institut für Stoffwechselforschung, Neuss Science.
Lehrstuhl für Künstliche Intelligenz - Univ. Würzburg Optimization of simulated biological multi-agent systems by means of evolutionary processes Alexander.
Three minutes presentation I ArbeitsschritteW Seminar I-Prax: Inhaltserschließung visueller Medien, Spree WS 2010/2011 Giving directions.
Lehrstuhl Informatik III: Datenbanksysteme AstroGrid-D Meeting Heidelberg, Informationsfusion und -Integrität: Grid-Erweiterungen zum Datenmanagement.
Thomas Herrmann Software - Ergonomie bei interaktiven Medien Step 6: Ein/ Ausgabe Instrumente (Device-based controls) Trackball. Joystick.
Introduction to BOS Supplier Guidelines v6
Deutsche Gesellschaft für Technische Zusammenarbeit GmbH Integrated Experts as interface between technical cooperation and the private sector – An Example.
Seminar Telematiksysteme für Fernwartung und Ferndiagnose Basic Concepts in Control Theory MSc. Lei Ma 22 April, 2004.
Methods Fuzzy- Logic enables the modeling of rule based knowledge by the use of fuzzy criteria instead of exact measurement values or threshold values.
Case Study Session in 9th GCSM: NEGA-Resources-Approach
Institut AIFB, Universität Karlsruhe (TH) Forschungsuniversität gegründet 1825 Towards Automatic Composition of Processes based on Semantic.
Lehrstuhl Technische Informatik - Computer Engineering Brandenburgische Technische Universität Cottbus 1 Hierarchical Test Technology for Systems on a.
Sanjay Patil Standards Architect – SAP AG April 2008
A good view into the future Presented by Walter Henke BRIT/SLL Schweinfurt, 14. November 2006.
BAS5SE | Fachhochschule Hagenberg | Daniel Khan | S SPR5 MVC Plugin Development SPR6P.
Algorithm Engineering Parallele Algorithmen Stefan Edelkamp.
Algorithm Engineering Parallele Algorithmen Stefan Edelkamp.
Institut für Umweltphysik/Fernerkundung Physik/Elektrotechnik Fachbereich 1 K. Bramstedt, L. Amekudzi, J. Meyer IFE/IUP Tangent heights in occultation.
Fusszeilentext – bitte in (Ansicht – Master – Folienmaster, 1. Folie oben) individuell ändern! Danach wieder zurück in Normalansicht gehen! 1 OTR Shearography.
Einführung Bild und Erkenntnis Einige Probleme Fazit Eberhard Karls Universität Tübingen Philosophische Fakultät Institut für Medienwissenschaft Epistemic.
Berner Fachhochschule Hochschule für Agrar-, Forst- und Lebensmittelwissenschaften HAFL Recent activities on ammonia emissions: Emission inventory Rindvieh.
4th Symposium on Lidar Atmospheric Applications
Ein Projekt des Technischen Jugendfreizeit- und Bildungsvereins (tjfbv) e.V. kommunizieren.de Blended Learning for people with disabilities.
Fakultät für informatik informatik 12 technische universität dortmund Memory-architecture aware compilation - Sessions Peter Marwedel TU Dortmund.
© Boardworks Ltd of 8 Time Manner Place © Boardworks Ltd of 8 This icon indicates that the slide contains activities created in Flash. These.
Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) dCache T1 admins and experts Welcome to FZK.
KIT – die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Vorlesung Knowledge Discovery - Institut AIFB Tempus fugit Towards.
1 Intern | ST-IN/PRM-EU | | © Robert Bosch GmbH Alle Rechte vorbehalten, auch bzgl. jeder Verfügung, Verwertung, Reproduktion, Bearbeitung,
Fakultät für informatik informatik 12 technische universität dortmund Standard Optimization Techniques 2010/12/20 Peter Marwedel TU Dortmund, Informatik.
Fakultät für informatik informatik 12 technische universität dortmund Memory architecture description languages - Session 20 - Peter Marwedel TU Dortmund.
1 Stevens Direct Scaling Methods and the Uniqueness Problem: Empirical Evaluation of an Axiom fundamental to Interval Scale Level.
Technische Universität München Spatial aspects of the formation of GMO-free or GMO clubs Maarten J. Punt Technische Universität München.
Lehrstuhl für Waldbau, Technische Universität MünchenBudapest, 10./11. December 2006 WP 1 Status (TUM) Bernhard Felbermeier.
Selectivity in the German Mobility Panel Tobias Kuhnimhof Institute for Transport Studies, University of Karlsruhe Paris, May 20th, 2005.
EN/FAD Ericsson GmbH EDD/ Information im 21. Jahrundert muss Erwünscht Relevant Erreichbar Schnell Kostenlos!?
Technische Universität München 1 CADUI' June FUNDP Namur G B I The FUSE-System: an Integrated User Interface Design Environment Frank Lonczewski.
TUM in CrossGrid Role and Contribution Fakultät für Informatik der Technischen Universität München Informatik X: Rechnertechnik und Rechnerorganisation.
Software Prefetching Reduzierung der Miss-Rate Erfordert Prefetch-Instruktionen –Nicht blockierend –Erzeugen keine Ausnahmen Nicht blockierende Caches.
Fakultät für informatik informatik 12 technische universität dortmund Memory-architecture aware compilation - Session 13 - Peter Marwedel TU Dortmund Informatik.
Technische Universität München Fakultät für Informatik Computer Graphics SS 2014 Rüdiger Westermann Lehrstuhl für Computer Graphik und Visualisierung.
 Präsentation transkript:

fakultät für informatik informatik 12 technische universität dortmund Optimizations - Compilation for Embedded Processors - Peter Marwedel TU Dortmund Informatik 12 Germany Graphics: © Alexandra Nolte, Gesine Marwedel, 2003 These slides use Microsoft clip arts. Microsoft copyright restrictions apply

- 2 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Structure of this course 2: Specification 3: ES-hardware 4: system software (RTOS, middleware, …) 8: Test 5: Evaluation & validation & (energy, cost, performance, …) 7: Optimization 6: Application mapping Application Knowledge Design repository Design Numbers denote sequence of chapters

- 3 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund SPM+MMU (1) How to use SPM in a system with virtual addressing? Virtual SPM Typically accesses MMU + SPM in parallel not energy efficient Real SPM suffers from potentially long VA translation Egger, Lee, Shin (Seoul Nat. U.): Introduction of small µTLB translating recent addresses fast. [B. Egger, J. Lee, H. Shin: Scratchpad memory management for portable systems with a memory management unit, CASES, 2006, p (best paper)] Proc. $SPM MMU

- 4 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund SPM+MMU (2) µTLB generates physical address in 1 cycle if address corresponds to SPM, it is used otherwise, mini-cache is accessed Mini-cache provides reasonable performance for non-optimized code µTLB miss triggers main TLB/MMU SPM is used only for instructions instructions are stored in pages pages are classified as cacheable, non-cacheable, and pageable (= suitable for SPM) CPU core MMU unified TLB minicache TAG RAM DATA RAM SPM SPM base reg. comparator µTLB instruction VA PA

- 5 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund SPM+MMU (3) Application binaries are modified: frequently executed code put into pageable pages. Initially, page-table entries for pageable code are marked invalid If invalid page is accessed, a page table exception invokes SPM manager (SPMM). SPMM allocates space in SPM and sets page table entry If SPMM detects more requests than fit into SPM, SPM eviction is started Compiler does not need to know SPM size virtual memory SPM main memory physical memory stack/heap region pageable region cached region uncached region pageable region cached region uncached region PC

- 6 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Extension to SNACK-pop (post-pass optimization) H. Cho, B. Egger, J. Lee, H. Shin: Dynamic Data Scratchpad Memory Management for a Memory Subsystem with an MMU, LCTES, 2007 object files libraries disassemble profile code generation building dynamic call graph cloning functions inserting SPM manager calls executable image generation ILP solver dynamic call graph profiling image architecture simulator profile data executable image input data

- 7 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Cloning of functions q mn g f q mn g f g Computation of which block should be moved in and out for a certain edge Generation of an ILP Decision about copy operations at compile time.

- 8 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Results for SNACK-pop (1) © ACM, 2007

- 9 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Results for SNACK-pop (2) © ACM, 2007

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Multi-processor ARM (MPARM) Framework Homogenous SMP ~ CELL processor Processing Unit : ARM7T processor Shared Coherent Main Memory Private Memory: Scratchpad Memory SPM Interrupt Device Semaphore Device ARM Interconnect (AMBA or STBus) Shared Main Memory

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Application Example: Multi-Processor Edge Detection SourceSink Compute Processors Source, sink and n compute processors Each image is processed by an independent compute processor Communication overhead is minimized.

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Results: Scratchpad Overlay for Edge Detection 2 CPs are better than 1 CP, then energy consumption stabilizes Best scratchpad size: 4kB (1CP& 2CP) 8kB (3CP & 4CP)

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Results DES-Encryption DES-Encryption: 4 processors: 2 Controllers+2 Compute Engines Energy values from ST Microelectronics Result of ongoing cooperation between U. Bologna and U. Dortmund supported by ARTIST2 network of excellence.

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund MPSoC with shared SPMs [M. Kandemir, I. Kadayif, A. Choudhary, J. Ramanujam, I. Kolcu: Compiler-Directed Scratch Pad Memory Optimization for Embedded Multiprocessors, IEEE Trans. on VLSI, Vol. 12, 2004, pp ] © IEEE, 2004

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Energy benefits despite large latencies for remote SPMs © IEEE, 2004 DRAM: 80 cycles

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Extensions Using DRAM Applications to Flash memory (copy code or execute in place): according to own experiments: very much parameter dependent Trying to imitate advantages of SPM with caches: partitioned caches, etc. PhD thesis of Lars Wehmeyer

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Improving predictability for caches Loop caches Mapping code to less used part(s) of the index space Cache locking/freezing Changing the memory allocation for code or data Mapping pieces of software to specific ways Methods: -Generating appropriate way in software -Allocation of certain parts of the address space to a specific way -Including way-identifiers in virtual to real-address translation Caches behave almost like a scratch pad

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Code Layout Transformations (1) Execution counts based approach: Sort the functions according to execution counts f 4 > f 1 > f 2 > f 5 > f 3 Place functions in decreasing order of execution counts f1f1 f2f2 f3f3 f4f4 f5f5 (1100) (900) (400) (2000) (700) [S. McFarling: Program optimization for instruction caches, 3rd International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), 1989]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Code Layout Transformations (2) Execution counts based approach: Sort the functions according to execution counts f 4 > f 1 > f 2 > f 5 > f 3 Place functions in decreasing order of execution counts Transformation increases spatial locality. Does not take in account calling order f4f4 (1100) (700) (400) (2000) (900) f4f4 f1f1 f2f2 f5f5 f3f3 f4f4 f2f2 f5f5 f1f1 f3f3

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Code Layout Transformations (3) Call-Graph Based Algorithm: Create weighted call-graph. Place functions according to weighted depth-first traversal. f 4 > f 2 > f 1 > f 3 > f 5 Increases spatial locality. (2000) f4f4 f4f4 f2f2 f5f5 f1f1 f3f3 [W. W. Hwu et al.: Achieving high instruction cache performance with an optimizing compiler, 16th Annual International Symposium on Computer Architecture, 1989]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Code Layout Transformations (3) Call-Graph Based Algorithm: Create weighted call-graph. Place functions according to weighted depth-first traversal. f 4 > f 2 > f 1 > f 3 > f 5 Increases spatial locality. (900) (2000) f4f4 f2f2 f4f4 f2f2 f5f5 f1f1 f3f3

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Code Layout Transformations (4) Call-Graph Based Algorithm: Create weighted call-graph. Place functions according to weighted depth-first traversal. f 4 > f 2 > f 1 > f 3 > f 5 Increases spatial locality. (900) (1100) (2000) f4f4 f2f2 f1f1 f4f4 f2f2 f5f5 f1f1 f3f3

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Code Layout Transformations (5) Call-Graph Based Algorithm: Create weighted call-graph. Place functions according to weighted depth-first traversal. f 4 > f 2 > f 1 > f 3 > f 5 Increases spatial locality. f4f4 f2f2 f5f5 f1f1 f3f3 f4f4 (900) (1100) (400) (2000) f4f4 f2f2 f1f1 f3f3

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Code Layout Transformations (6) Call-Graph Based Algorithm: Create weighted call-graph. Place functions according to weighted depth-first traversal. f 4 > f 2 > f 1 > f 3 > f 5 Combined with placing frequently executed traces at the top of the code space for functions. Increases spatial locality. f4f4 (900) (1100) (400) (2000) (700) f4f4 f2f2 f5f5 f1f1 f3f3 f4f4 f2f2 f5f5 f1f1 f3f3

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Way prediction/selective direct mapping [M. D. Powell, A. Agarwal, T. N. Vijaykumar, B. Falsafi, K. Roy: Reducing Set-Associative Cache Energy via Way-Prediction and Selective Direct-Mapping, MICRO- 34, 2001] © ACM

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Hardware organization for way prediction © ACM

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Results for the paper on way prediction (1) Cache energy and prediction overhead Energy componentRelative energy Parallel access cache read (4 ways read) way read0.21 Cache write0.24 Tag array energy (incl. in the above numbers) x4bit prediction table read/write © ACM System configuration parameters Instruction issue & decode bandwidth 8 issues per cycle L1 I-Cache16K, 4-way, 1 cycle Base L1 D-Cache16K, 4-way, 1 or 2 cycles, 2ports L2 cache1M, 8-way, 12 cycle latency Memory access latency 80 cycles+4 cycles per 8 bytes Reorder buffer size64 LSQ size32 Branch predictor2-level hybrid

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Results for the paper on way prediction (2) © ACM

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Results for the paper on way prediction (2) © ACM

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Prefetching Prefetch instructions load values into the cache Pipeline not stalled for prefetching Prefetching instructions introduced in ~ Potentially, all miss latencies can be avoided Disadvantages: Increased # of instructions Potential premature eviction of cache line Potentially pre-loads lines that are never used Steps Determination of references requiring prefetches Insertion of prefetches (early enough!) [R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures, Morgan-Kaufman, 2002]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Results for prefetching [Mowry, as cited by R. Allen & K. Kennedy] © Morgan-Kaufman, 2002 Not very impressive!

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Optimization for exploiting processor-memory interface: Problem Definition (1) [A. Shrivastava, E. Earlie, N. Dutt, A. Nicolau: Aggregating processor free time for energy reduction, Intern. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp ] XScale is stalled for 30% of time, but each stall duration is small Average stall duration = 4 cycles Longest stall duration < 100 cycles Break-even stall duration for profitable switching 360 cycles Maximum processor stall < 100 cycles NOT possible to switch the processor to IDLE mode Based on slide by A. Shrivastava

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Optimization for exploiting processor-memory interface: Problem Definition (2) CT (Computation Time): Time to execute an iteration of the loop, assuming all data is present in the cache DT (Data Transfer Time): Time to transfer data required by an iteration of a loop between cache and memory Consider the execution of a memory-bound loop (DT > CT) Processor has to stall Time Activity Processor Activity Memory Bus Activity Processor activity is dis-continuous Memory activity is dis-continuous for (int i=0; i<1000; i++) c[i] = a[i] + b[i]; Based on slide by A. Shrivastava

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Optimization for exploiting processor-memory interface: Prefetching Solution Time Activity Processor Activity Memory Bus Activity Each processor activity period increases for (int i=0; i<1000; i++) prefetch a[i+4]; prefetch b[i+4]; prefetch c[i+4]; c[i] = a[i] + b[i]; Memory activity is continuous Processor activity is dis-continuous Memory activity is continuous Total execution time reduces Based on slide by A. Shrivastava

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Memory hierarchy description languages: ArchC Consists of description of ISA and HW architecture Extension of SystemC (can be generated from ArchC): Storage class structure [P. Viana, E. Barros, S. Rigo, R. Azevedo, G. Araújo: Exploring Memory Hierarchy with ArchC, 15th Symposium on Computer Architecture and High Performance Computing, 2003, pp. 2 – 9]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Example: Description of a simple cache-based architecture

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Memory Aware Compilation and Simulation Framework (for C) MACC [M. Verma, L. Wehmeyer, R. Pyka, P. Marwedel, L. Benini: Compilation and Simulation Tool Chain for Memory Aware Energy Optimizations, Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS VI), 2006]. Application C code Source-level memory optimizer encc, ARM gcc, M5 DSP Array partitioning SPM overlay Executable binary Energy database Memory hierarchy description Compilation Framework Profile report Memory simulator Processor simulators (ARM7/M5) Profiler Simulation Framework MPSoC simulator

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Memory architecture MACCv2 Query can include address, time stamp, value, … Query can request energy, delay, stored values Query processed along a chain of HW components, incl. busses, ports, address translations etc., each adding delay & energy [R. Pyka et al.: Versatile System level Memory Description Approach for embedded MPSoCs, University of Dortmund, Informatik 12, 2007] API query to model simplifies integration into compiler External XML representation REQ Energy= ? Cycles= ? +10 Energy +5 Cycles +1 Energy +2 Cycles +1 Energy +0 Cycles CPU1 MM ASPC-1 - IFETCH - DRD - DWR - MAINAS ASPC-M - 0…ffff ASPC-B - 0 … 3ffff

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Controlling tool chain generation through an architecture description language (ADL): EXPRESSION Overall information flow [P. Mishra, A. Shrivastava, N. Dutt: Architecture description language (ADL)-driven software toolkit generation for architectural exploration of programmable SOCs, ACM Trans. Des. Autom. Electron. Syst. (TODAES), 2006, pp ]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Description of Memories in EXPRESSION Generic approach, based on the analysis of a wide range of systems; Used for verification. (STORAGE_SECTION (DataL1 (TYPE DCACHE) (WORDSIZE 64) (LINESIZE 8) (NUM_LINES 1024) (ASSOCIATIVITY 2) (READ_LATENCY 1)... (REPLACEMENT_POLICY LRU) (WRITE_POLICY WRITE_BACK) ) (ScratchPad (TYPE SRAM) (ADDRESS_RANGE ) …. ) (SB (TYPE STREAM_BUFFER) ….. (InstL1 (TYPE ICACHE) ……… ) (L2 (TYPE DCACHE) ……. ) (MainMemory (TYPE DRAM) ) (Connect (TYPE CONNECTIVITY) (CONNECTIONS (InstL1, L2) (DataL1, SB) (SB, L2) (L2, MainMemory) )))

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund EXPRESSION: results q

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Optimization for main memory Exploiting burst mode of DRAM (1) [P. Grun, N. Dutt, A. Nicolau: Memory aware compilation through accurate timing extraction, DAC, 2000, pp. 316 – 321] Supported trafos: memory mapping, code reordering or loop unrolling

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Optimization for main memory Exploiting burst mode of DRAM (2) Timing extracted from EXPRESSION model for(i=0; i<9;i+=3){ a=a+x[i]+x[i+1]+x[i+2]+ y[i]+y[i+1]+y[i+2]; b=b+z[i]+z[i+1]+z[i+2]+ u[i]+u[i+1]+u[i+2];} Open circles of original paper changed into closed circles (column decodes). 2 banks

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Memory hierarchies beyond main memory Massive datasets are being collected everywhere Storage management software is billion-$ industry Examples (2002): Phone: AT&T 20TB phone call database, wireless tracking Consumer: WalMart 70TB database, buying patterns WEB: Web crawl of 200M pages and 2000M links, Akamai stores 7 billion clicks per day Geography: NASA satellites generate 1.2TB per day [© Larse Arge, I/O-Algorithms, More New Information Over Next 2 Years Than in All Previous History

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Example: LIDAR Terrain Data ~1,2 km ~ 280 km/h at m ~ 1,5 m between measurements COWI A/S (and others) is currently scanning Denmark [© Larse Arge, I/O-Algorithms,

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2011 TU Dortmund Application Example: Flooding Prediction +1 meter +2 meter [© Larse Arge, I/O-Algorithms,