Memory-architecture aware compilation - Session 13 - Peter Marwedel, TU Dortmund, Fakultät für Informatik, Informatik 12

Presentation transcript:

Memory-architecture aware compilation - Session 13 - Peter Marwedel, TU Dortmund, Informatik 12, Germany (fakultät für informatik, informatik 12, technische universität dortmund)

- 2 - technische universität dortmund, fakultät für informatik, © p. marwedel, informatik 12, 2008
Schedule of the course
Time        | Monday | Tuesday | Wednesday | Thursday | Friday
09:30-11:00 | 1: Orientation, introduction; 2: Models of computation + specs | 5: Models of computation + specs | 9: Mapping of applications to platforms | 13: Memory aware compilation | 17: Memory aware compilation
11:00       | Brief break
11:15-12:30 | | 6: Lab*: Ptolemy | 10: Lab*: Scheduling | 14: Lab*: Mem. opt. | 18: Lab*: Mem. opt.
12:30       | Lunch
14:00-15:20 | 3: Models of computation + specs | 7: Mapping of applications to platforms | 11: High-level optimizations* | 15: Memory aware compilation | 19: WCET & compilers*
15:20       | Break
15:40-17:00 | 4: Lab*: Kahn process networks | 8: Mapping of applications to platforms | 12: High-level optimizations* | 16: Memory aware compilation | 20: Wrap-up
* Dr. Heiko Falk

- 3 - The Problem with Memories
Or: Why work on processors if memory is where the bottleneck is?
1. Increasing speed gap
2. Major consumer of electrical energy
3. Timing predictability difficult to achieve
4. …
Memories? Oops! Memories!

- 4 - Trends for the speeds
The speed gap between processor and main DRAM increases [P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]
[Figure: speed over years; CPU performance grows by a factor of 1.5-2 p.a., DRAM by a factor of 1.07 p.a.; gap: ≥ 2x every 2 years]
Similar problems also for embedded systems & MPSoCs
⇒ In the future: memory access times >> processor cycle times
⇒ "Memory wall" problem
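The "2x every 2 years" figure follows from compounding the two growth rates quoted on the slide. A minimal sketch, assuming both rates stay constant over the period (the function name is illustrative, not from the slides):

```python
def gap_factor(cpu_rate: float, dram_rate: float, years: float) -> float:
    """Factor by which the CPU/DRAM speed ratio grows over `years`,
    assuming constant yearly growth factors for both."""
    return (cpu_rate / dram_rate) ** years


# Growth factors per year as quoted on the slide: CPU 1.5-2, DRAM 1.07.
low = gap_factor(1.5, 1.07, 2)
high = gap_factor(2.0, 1.07, 2)
print(f"gap grows by a factor of {low:.2f} to {high:.2f} in 2 years")
```

Even the lower CPU rate of 1.5 p.a. nearly doubles the gap in two years.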

- 5 - Importance of Energy Efficiency
[Figure: power efficiency in GOPs/J; IPE = inherent power efficiency, AmI = ambient intelligence; annotation: "poor design techniques". Courtesy: Philips; © Hugo De Man, IMEC, 2007]
Efficient software design is needed; otherwise, the price for software flexibility cannot be paid.

- 6 - Energy consumption in mobile devices
[O. Vargas (Infineon Technologies): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design, September 2005]
Thanks to Thorsten Koch (Nokia / Univ. Dortmund) for providing this source.

- 7 - Memory system frequently consumes >50 % of the energy used for processing
Average over 200 benchmarks analyzed by Verma (U. Dortmund) [M. Verma, P. Marwedel: Advanced Memory Optimization Techniques for Low-Power Embedded Processors, Springer, 2007]
Cache ($)-less monoprocessor: 29% / 71% (legend: Processor Energy, Main Mem. Energy)
Multiprocessor with cache ($): 14.8% / 5.2% / 28.1% / 51.9% (legend: Proc. Energy, I-Cache Energy, D-Cache Energy, Main Mem. Energy)

- 8 - Similar information according to other sources
IEEE Journal of SSC, Nov. 96
[Based on slide by and ©: Osman S. Unsal, Israel Koren, C. Mani Krishna, Csaba Andras Moritz, U. of Massachusetts, Amherst, 2001]
[Segars 01 according to

- 9 - "Everything" is large for large memories
[Figure panels: Cycle Time (ns)*, Area (λ² × 10^6), Power (W)]
* Monolithic register file; model of Rixner et al. [HPCA'00]; 0.18 µm technology; VLIW configurations for a certain number of ports ("GPxMyREGz", where x={6}, y={2, 3} and z={16, 32, 64, 128})
Based on slide by and ©: Harry Valero, U. Barcelona, 2001

Dependency on the size
[Figure panels: Energy, Access times]
Applications are getting larger and larger …
Sub-banking + locality of references ⇒ memory hierarchies

Timing Predictability
Many embedded systems are real-time systems ⇒ computations have to be finished in a given amount of time.
Most memory hierarchies (e.g. caches) for PC-like systems are designed for good average-case behavior, not for good worst-case behavior.
[Figure: G.721, using a unified cache: worst case execution time (WCET) larger than without cache; see later slide for experimental setup]

Multiple Objectives for Memory System Design
• (Average) performance: throughput, latency
• Energy consumption
• Predictability, good worst case execution time (WCET) bound
• Size
• Cost
• …

Pareto curves
[Figure: Pareto curve; axis labels: better, worse]

Pareto points
Definition: A (design) point \(J_i\) is dominated by point \(J_k\) if \(J_k\) is equal or better than \(J_i\) in each criterion (\(J_i \le J_k\)).
Definition: A (design) point is Pareto-optimal or a Pareto point if it is not dominated by any other point.
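The two definitions translate almost directly into a filter over a set of design points. A minimal sketch, assuming that lower values are better in every criterion and that the design points are distinct; the (energy, cycles) tuples are made-up illustration data:

```python
def dominates(jk, ji):
    """True if jk is equal or better than ji in each criterion (lower = better)."""
    return all(a <= b for a, b in zip(jk, ji))


def pareto_points(points):
    """Keep every point that is not dominated by any other point."""
    return [p for p in points
            if not any(q != p and dominates(q, p) for q in points)]


# Hypothetical design points: (energy, cycles)
designs = [(10, 50), (20, 30), (30, 30), (15, 60), (40, 10)]
front = pareto_points(designs)
print(front)
```

Here (30, 30) is dominated by (20, 30), and (15, 60) is dominated by (10, 50); the remaining three points trade energy against cycles.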

Energy models
• Commercial tools: frequently very imprecise
• Model of Tiwari (Dissertation, Princeton 1996): cost of instructions and of transitions between instructions; does not separate out the cost of memory accesses
• Model of Simunic, de Micheli (DAC 99): based on data sheets; does not require measurements; does not take transitions into account
• Russell, Jacome (ICCD, 1998): based on precise measurements for two fixed configurations; cannot predict the effect of changes to the memory architecture
• Lee (LCTES 2001): detailed analysis of the effect of pipeline stages; does not include multi-cycle operations and stalls
• Dedicated energy models

Source of energy models: 1) measurements
E.g.: ATMEL board with ARM7TDMI and ext. SRAM
[Figure: block diagram of the measurement setup: ARM7 (ALU, multiplier, barrel shifter, register file, instruction decoder & control logic) with instruction memory (IAddr, Instr) and data memory (DAddr, Data); current measurement (mA) at VDD; internal signals: Opcode, Reg#, Reg Value, Imm]

Instruction dependent costs in the CPU
Cost for a sequence of m instructions:
\[
\begin{aligned}
E_{cpu\_instr} ={} & \sum_{i} MinCostCPU(Opcode_i) \\
& + \alpha_1 \sum_{i,j} w(Imm_{i,j}) + \beta_1 \sum_{i,j} h(Imm_{i-1,j}, Imm_{i,j}) \\
& + \alpha_2 \sum_{i,k} w(Reg_{i,k}) + \beta_2 \sum_{i,k} h(Reg_{i-1,k}, Reg_{i,k}) \\
& + \alpha_3 \sum_{i,k} w(RegVal_{i,k}) + \beta_3 \sum_{i,k} h(RegVal_{i-1,k}, RegVal_{i,k}) \\
& + \alpha_4 \sum_{i} w(IAddr_i) + \beta_4 \sum_{i} h(IAddr_{i-1}, IAddr_i) \\
& + \sum_{i} FUCost(Instr_{i-1}, Instr_i)
\end{aligned}
\]
w: number of ones; h: Hamming distance; FUCost: cost of switching functional units; \(\alpha\), \(\beta\): determined through experiments

Other costs
\[
\begin{aligned}
E_{cpu\_data} ={} & \sum_{i} \big[ \alpha_5\, w(DAddr_i) + \beta_5\, h(DAddr_{i-1}, DAddr_i) + \alpha_6\, w(Data_i) + \beta_6\, h(Data_{i-1}, Data_i) \big] \\
E_{mem\_instr} ={} & \sum_{i} \big[ MinCostMem(InstrMem, Word\_width_i) + \alpha_7\, w(IAddr_i) + \beta_7\, h(IAddr_{i-1}, IAddr_i) \\
& \qquad + \alpha_8\, w(IData_i) + \beta_8\, h(IData_{i-1}, IData_i) \big] \\
E_{mem\_data} ={} & \sum_{i} \big[ MinCostMem(DataMem, Direction, Word\_width_i) + \alpha_9\, w(DAddr_i) + \beta_9\, h(DAddr_{i-1}, DAddr_i) \\
& \qquad + \alpha_{10}\, w(Data_i) + \beta_{10}\, h(Data_{i-1}, Data_i) \big]
\end{aligned}
\]

Results
• It is not important which address bit is set to '1'.
• The number of '1's on the address bus is irrelevant.
• The cost of flipping a bit on the address bus is independent of the bit position.
• It is not important which data bit is set to '1'.
• The number of '1's on the data bus has a minor effect (3%).
• The cost of flipping a bit on the data bus is independent of the bit position.

The Hamming distance between adjacent addresses plays a major role
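To make the role of w (number of ones) and h (Hamming distance) concrete, the address-bus part of the model can be evaluated on two short address traces. This is a sketch only: the coefficient values and the traces are invented for illustration and are not the measured values from the experiments.

```python
def w(value: int) -> int:
    """Number of ones in the binary representation of value."""
    return bin(value).count("1")


def h(prev: int, cur: int) -> int:
    """Hamming distance: number of bit positions in which prev and cur differ."""
    return w(prev ^ cur)


def address_bus_cost(addresses, alpha, beta):
    """alpha * w(addr_i) summed over the trace, plus beta * h(addr_{i-1}, addr_i)
    summed over adjacent pairs. alpha and beta are hypothetical coefficients."""
    ones = sum(w(a) for a in addresses)
    flips = sum(h(p, c) for p, c in zip(addresses, addresses[1:]))
    return alpha * ones + beta * flips


# alpha = 0 mirrors the finding that the number of ones on the address bus is irrelevant.
sequential = [0x100, 0x104, 0x108, 0x10C]
scattered = [0x100, 0x2F8, 0x104, 0x3FC]
cost_seq = address_bus_cost(sequential, alpha=0.0, beta=1.0)
cost_scat = address_bus_cost(scattered, alpha=0.0, beta=1.0)
print(cost_seq, cost_scat)
```

With these assumed coefficients the sequential trace flips far fewer address bits than the scattered one, which is why placing frequently used objects at nearby addresses can pay off.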

Source of energy models: 2) computer models, e.g. CACTI
[Figure panels: cache model used; comparison with SPICE]

Register allocation
• Registers = fastest level in the memory hierarchy
• Interest in good global register allocation techniques
• Frequently based on coloring of the interference graph
• Registers are suitable neither for objects much larger than single words nor for code
• Register allocation is not considered in the remainder
[Figure: lifetimes of v1, v2, v3 and the corresponding graph with nodes v1, v2, v3]
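The coloring idea can be sketched in a few lines: variables whose lifetimes overlap get an edge in the interference graph, and adjacent nodes must receive different colors (registers). The lifetimes below are invented, and the greedy heuristic is only one simple way to color a graph; it is not claimed to be the technique used in any particular compiler.

```python
def interference_graph(lifetimes):
    """lifetimes: name -> (start, end), half-open interval.
    Two variables interfere if their lifetimes overlap."""
    names = sorted(lifetimes)
    graph = {n: set() for n in names}
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            start_a, end_a = lifetimes[a]
            start_b, end_b = lifetimes[b]
            if start_a < end_b and start_b < end_a:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Assign each node the smallest color not used by an already colored neighbor.
    Nodes are visited by decreasing degree (a heuristic, not optimal in general)."""
    colors = {}
    for node in sorted(graph, key=lambda n: -len(graph[n])):
        used = {colors[m] for m in graph[node] if m in colors}
        colors[node] = next(c for c in range(len(graph)) if c not in used)
    return colors


# Hypothetical lifetimes: v1 and v3 never overlap, so they can share a register.
lifetimes = {"v1": (0, 4), "v2": (2, 6), "v3": (5, 8)}
graph = interference_graph(lifetimes)
registers = greedy_coloring(graph)
print(registers)
```

In this example two registers suffice for three variables, because v1 and v3 are never live at the same time.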

Scratch pad memories (SPM): fast, energy-efficient, timing-predictable
SPMs are small, physically separate memories mapped into the address space; selection is by an appropriate address decoder (simple!). Small; no tag memory.
Example: ARM7TDMI cores, well-known for low power consumption
[Figure: address space from 0 to FFF.. containing the scratch pad memory; address decoder generating "SPM select"]
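What "selection by an address decoder" means can be modelled as a plain range check. The base address and size below are hypothetical and not taken from any specific ARM7TDMI system:

```python
SPM_BASE = 0x00300000  # hypothetical base address of the scratch pad
SPM_SIZE = 0x1000      # hypothetical capacity: 4 KiB


def spm_select(addr: int) -> bool:
    """Models the address decoder: a pure range check, no tag lookup."""
    return SPM_BASE <= addr < SPM_BASE + SPM_SIZE


def route(addr: int) -> str:
    """Which memory serves an access to addr."""
    return "SPM" if spm_select(addr) else "main memory"


print(route(0x00300010), route(0x00400000))
```

Because the outcome depends only on the address, every access to an object placed in the SPM has the same known latency, which is what makes the timing predictable.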

Predictability and scratch-pad memories
"… In essence, we must reinvent computer science. Fortunately, we have quite a bit of knowledge and experience to draw upon. Architecture techniques such as software-managed caches promise to deliver much of the benefit of memory hierarchy without the timing unpredictability." [Ed Lee: Absolutely Positively on Time: What would it take?, IEEE Computer, 2005]
"… pre-run-time scheduling is often the only practical means of providing predictability in a complex system." [J. Xu, D. Parnas: On satisfying timing constraints in hard real-time systems, IEEE Trans. Soft. Engineering, 1993, p. 70–84]

Comparison of currents using measurements
E.g.: ATMEL board with ARM7TDMI and ext. SRAM

Scratchpad vs. main memory energy
Example: Atmel ARM-Evaluation board
[Figure: energy comparison; annotations: energy reduction: / %; predictable]
Savings (86%) even larger.

Why not just use a cache?
Energy consumption in tags, comparators and muxes is significant.
[R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, P. Marwedel: Scratchpad Memory: A Design Alternative for Cache On-chip Memory in Embedded Systems, Intern. Workshop on Hardware/Software Codesign (CODES), 2002]

Set-associative cache
n-way cache, shown for n = 2
[Figure: the address is split into tag and index; each way (way 0, way 1) holds tags and data blocks; one comparator (=) per way checks the stored tag against the address tag and selects the data; cache symbol: $ (€)]
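The lookup in an n-way cache can be sketched as follows: the index selects a set, and the tag is then compared against every way of that set. The slide does not name a replacement policy, so the LRU policy here is an assumption, as are the sizes and the address trace.

```python
class SetAssociativeCache:
    """Toy model of an n-way set-associative cache (tags only, no data)."""

    def __init__(self, num_sets: int, ways: int, block_size: int):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # One list of tags per set, least recently used first (LRU is an assumption).
        self.sets = [[] for _ in range(num_sets)]

    def split(self, addr: int):
        """Split an address into (tag, index); the block offset is dropped."""
        block = addr // self.block_size
        return block // self.num_sets, block % self.num_sets

    def access(self, addr: int) -> bool:
        """Return True on a hit. Every access compares the tag against all ways."""
        tag, index = self.split(addr)
        tags = self.sets[index]
        hit = tag in tags
        if hit:
            tags.remove(tag)
        elif len(tags) == self.ways:
            tags.pop(0)  # evict the least recently used tag
        tags.append(tag)
        return hit


cache = SetAssociativeCache(num_sets=4, ways=2, block_size=16)
trace = [0x000, 0x040, 0x000, 0x080, 0x040, 0x000]  # all map to set 0
results = [cache.access(a) for a in trace]
print(results)
```

Three blocks competing for a two-way set keep evicting each other. Whether a given access hits therefore depends on the access history, unlike the fixed range check of a scratch pad.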

Influence of the associativity
Technology different from previous slide.

Availability of SPMs ("Tightly Coupled Memories")
Source: ARM
CPU Core          | Caches   | TCM Available
ARM 1026EJ-S      | Variable | yes
ARM 1136J(F)-S    | Variable | yes
ARM 1176JZ(F)-S   | Variable | yes
ARM 926EJ-S       | Variable | yes
ARM 1026EJ-S      | Variable | yes
ARM 1156T2(F)-S   | Variable | yes
ARM 946E-S        | Variable | yes
ARM 966E-S        | -        | yes
ARM 968E-S        | -        | yes
All others        |          | no

Current usage for ARM
1. Use a pragma in the C source to allocate to a specific section. For example:
     #pragma arm section rwdata = "foo", rodata = "bar"
     int x2 = 5;                   // in foo (data part of region)
     int const z2[3] = {1,2,3};    // in bar
2. Input a scatter loading file to the linker for allocating sections to specific address ranges
   Software_Development_Tools/index.html (2 different examples)
   Scatter description:
     LOAD_ROM_1 0x000 {            ; load region description
       EXEC_ROM_1 0x000 {          ; execution region description
         program1.o (+RO)          ; input section description
       }
       DRAM 0x x8000 {             ; execution region description
         program1.o (+RW,+ZI)
       }
     }

Popular among designers?
Information received from designers:
• Used for buffering speech in mobile phones (mobile phone company)
• Manual mapping of frequently accessed data
• Essentially no idea on how to exploit it (WLAN/Bluetooth specialists & major vendor)
Why not change this situation?

Similar concept for the Cell processor*
Local SPE processors fetch instructions and data from local storage LS (256 kB). The LS is not designed as a cache. Separate DMA transfers are required to fill and spill.
Motivation same as for this tutorial:
• Large memory latency
• Huge overhead for automatically managed caches
Similar for Infineon TriCore
[Figure: SPEs with local storage and main memory]
* Sony, IBM, Toshiba

Infineon TriCore
[Figure: © Infineon, 2005]

Migration of instructions to small memory
Yasuura's architecture (Kyushu U.): 2 instruction memories: main memory + "decompressor" memory; the compiler optimizes the code allocation & the size of the 2 instruction memories.
[Figure labels: merged instruction; small and low power; large and high power]
[T. Ishihara, H. Yasuura: A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors, Design, Automation and Test in Europe Conference (DATE), 2000]

Energy savings
[Figure; based on slide by & ©: H. Yasuura, 2000]

Migration of data and instructions, global optimization model (TU Dortmund)
Which object (array, loop, etc.) is to be stored in the SPM?
Non-overlaying memory allocation: gain \(g_k\) and size \(s_k\) for each object k. Maximise the gain \(G = \sum_k g_k\), respecting the size of the SPM: \(SSP \ge \sum_k s_k\), where both sums run over the objects k placed in the SPM. Solution: knapsack algorithm.
Overlaying allocation: moving objects back and forth between hierarchy levels.
[Figure: example with processor, scratch pad memory (capacity SSP) and main memory; candidate objects: for loops, while loop, repeat loop, function, arrays, int]
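The non-overlaying case is a 0/1 knapsack problem: each object is either placed in the SPM or not. A minimal dynamic-programming sketch, assuming integer sizes; the object names, gains and sizes are invented for illustration:

```python
def knapsack(objects, capacity):
    """objects: list of (name, gain, size) with integer sizes.
    Returns (best total gain, names of the objects placed in the SPM)."""
    # best[c] = (gain, names) achievable with at most c units of SPM capacity
    best = [(0, [])] * (capacity + 1)
    for name, gain, size in objects:
        updated = best[:]
        for c in range(size, capacity + 1):
            candidate = best[c - size][0] + gain
            if candidate > updated[c][0]:
                updated[c] = (candidate, best[c - size][1] + [name])
        best = updated
    return best[capacity]


# Hypothetical objects: (name, gain g_k, size s_k in bytes); SSP = 500 bytes
objects = [("array_a", 60, 100), ("loop_i", 100, 200), ("func_f", 120, 300)]
total_gain, chosen = knapsack(objects, capacity=500)
print(total_gain, chosen)
```

The object with the best gain per byte (array_a) is left out here, which shows why a greedy choice by gain density is not sufficient in general.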

ILP representation - migrating functions and variables
Symbols:
\(S(var_k)\) = size of variable k
\(n(var_k)\) = number of accesses to variable k
\(\Delta e(var_k)\) = energy saved per variable access, if \(var_k\) is migrated
\(\Delta E(var_k)\) = energy saved if variable \(var_k\) is migrated (\(= \Delta e(var_k) \cdot n(var_k)\))
\(x(var_k)\) = decision variable, = 1 if variable k is migrated to the SPM, = 0 otherwise
K = set of variables; similar for functions (set I)
Integer linear programming formulation:
\[ \text{Maximize} \quad \sum_{k \in K} x(var_k)\, \Delta E(var_k) + \sum_{i \in I} x(F_i)\, \Delta E(F_i) \]
\[ \text{subject to} \quad \sum_{i \in I} S(F_i)\, x(F_i) + \sum_{k \in K} S(var_k)\, x(var_k) \le SSP \]

Reduction in energy and average run-time
[Figure: cycles [x100] and energy [µJ] for Multi_sort (mix of sort algorithms)]
Feasible with standard compiler & pre- or postpass optimization.
Measured processor / external memory energy + CACTI values for SPM (combined model).
Numbers will change with technology; algorithms remain unchanged.

Allocation of basic blocks
Fine granularity smooths the dependency on the size of the scratch pad. Requires additional jump instructions to return to "main" memory.
[Figure: main memory with basic blocks BB1, BB2 and jumps Jump1 to Jump4]
For consecutive basic blocks: statically 2 jumps, but only one is taken.

Taking consecutive basic blocks into account
Approach: consider sets of consecutive BBs as a new kind of basic block ("multi blocks").
Add a constraint preventing the same block from being selected twice:
\[ x(BB_b) + x(F_i) + \sum_{j \in multiblocks(b),\ j \ne b} x(BB_j) \le 1 \qquad \forall b \in \{\text{blocks}\} \cup \{\text{multi blocks}\} \]
⇒ Block b is either moved individually, as part of a function, as part of one of its enclosing multi-blocks or not at all.
[Figure: BB1, BB2, BB3 with multi-blocks BB12, BB23, BB123]
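The multi-blocks and the "selected at most once" rule can be illustrated with a small check. This is a simplified sketch: it verifies the rule per basic block for a given selection instead of solving the ILP, and the helper names are made up for illustration.

```python
def multi_blocks(blocks):
    """All runs of two or more consecutive basic blocks, as tuples."""
    n = len(blocks)
    return [tuple(blocks[i:j]) for i in range(n) for j in range(i + 2, n + 1)]


def selection_is_valid(blocks, selected, function_selected=False):
    """selected: list of tuples (single blocks as 1-tuples, multi-blocks as longer tuples).
    Each basic block may be covered at most once: individually, as part of the
    enclosing function, or as part of one selected multi-block."""
    for b in blocks:
        covered = int(function_selected) + sum(1 for s in selected if b in s)
        if covered > 1:
            return False
    return True


blocks = ["BB1", "BB2", "BB3"]
candidates = multi_blocks(blocks)
print(candidates)
ok = selection_is_valid(blocks, [("BB1", "BB2"), ("BB3",)])
clash = selection_is_valid(blocks, [("BB1", "BB2"), ("BB2", "BB3")])
print(ok, clash)
```

For three consecutive blocks this yields the multi-blocks BB12, BB123 and BB23 from the slide; selecting BB12 together with BB23 is rejected because BB2 would be moved twice.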

Allocation of basic blocks, sets of adjacent basic blocks and the entire stack
Requires generation of additional jumps (special compiler) or procedure exlining ("procedural abstraction").
[Figure: cycles [x100] and energy [µJ]]

Savings for memory system energy alone
Combined model for memories

Underlying reasons for large energy savings
The potential energy saving is determined by the difference between the energy consumption for large and small memories.
In this case: factor of ≥ 20 → savings of up to 95%
Can be even larger for larger memories.
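The step from "factor of 20" to "95%" is a one-line calculation. As a sketch, with \(E_{large}\) and \(E_{small}\) introduced here as assumed names for the energy per access of the large and the small memory:

```latex
\[
\text{saving} \;=\; \frac{E_{large} - E_{small}}{E_{large}}
\;=\; 1 - \frac{E_{small}}{E_{large}}
\;=\; 1 - \frac{1}{20} \;=\; 0.95
\]
```

This is the saving for accesses that are actually served by the small memory; the saving for the whole memory system depends on how many accesses can be redirected to it.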

Summary
• Multiple objectives for the design of the memory system: energy efficiency, (predictable) performance, cost, …
• Small is beautiful (efficient) ⇒ efficiency(memory hierarchies) > efficiency(single mem)
• Energy models: based on measurements + tools
• Scratch pads: energy-efficient, fast, timing predictable
• Scratch pad allocation strategies: static (non-overlaying): globals, functions, basic blocks

Brief break (if on schedule)
Q&A?