
Technische Universität Dortmund, Fakultät für Informatik, Informatik 12
Memory-architecture aware compilation - Session 13 -
Peter Marwedel, TU Dortmund, Informatik 12, Germany

- 2 - technische universität dortmund, fakultät für informatik, p. marwedel, informatik 12, 2008
Schedule of the course
Time        | Monday | Tuesday | Wednesday | Thursday | Friday
09:30-11:00 | 1: Orientation, introduction | 5: Models of computation + specs | 9: Mapping of applications to platforms | 13: Memory aware compilation | 17: Memory aware compilation
11:00       | Brief break
11:15-12:30 | 2: Models of computation + specs | 6: Lab*: Ptolemy | 10: Lab*: Scheduling | 14: Lab*: Mem. opt. | 18: Lab*: Mem. opt.
12:30       | Lunch
14:00-15:20 | 3: Models of computation + specs | 7: Mapping of applications to platforms | 11: High-level optimizations* | 15: Memory aware compilation | 19: WCET & compilers*
15:20       | Break
15:40-17:00 | 4: Lab*: Kahn process networks | 8: Mapping of applications to platforms | 12: High-level optimizations* | 16: Memory aware compilation | 20: Wrap-up
* Dr. Heiko Falk

- 3 - The Problem with Memories
1. Increasing speed gap
2. Major consumer of electrical energy
3. Timing predictability difficult to achieve
4. …
Or: Why work on processors if memory is where the bottleneck is?
Memories? Oops! Memories!

- 4 - Trends for the speeds
The speed gap between processor and main DRAM increases [P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane].
[Figure: speed over years; CPU performance grows by a factor of 1.5-2 p.a., DRAM by a factor of 1.07 p.a.; the gap grows by ≥ 2x every 2 years]
Similar problems also exist for embedded systems & MPSoCs.
In the future: memory access times >> processor cycle times ("memory wall" problem).
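The growth rates on this slide can be turned into a quick calculation of how the gap evolves. The following is a minimal sketch; the function name and the normalization (both speeds equal 1 in year 0) are assumptions made for illustration, only the rates 1.5 to 2 p.a. and 1.07 p.a. come from the slide.

```python
def speed_gap(years, cpu_rate=1.5, dram_rate=1.07):
    """Ratio of CPU speed to DRAM speed after `years`.

    Both speeds are normalized to 1 in year 0 (an assumption for this sketch).
    """
    return (cpu_rate / dram_rate) ** years


if __name__ == "__main__":
    for years in (1, 2, 4, 8):
        low = speed_gap(years, cpu_rate=1.5)   # lower CPU growth rate from the slide
        high = speed_gap(years, cpu_rate=2.0)  # upper CPU growth rate from the slide
        print(f"after {years} years: gap between {low:.2f}x and {high:.2f}x")
```

Even with the lower CPU rate of 1.5 p.a., the ratio roughly doubles within two years, which is consistent with the trend sketched in the figure.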

- 5 - Importance of Energy Efficiency
[Figure: power efficiency in GOPs/J; IPE = inherent power efficiency, AmI = ambient intelligence; one region is marked "poor design techniques". Courtesy: Philips; © Hugo De Man, IMEC, 2007]
Efficient software design is needed; otherwise, the price for software flexibility cannot be paid.

- 6 - Energy consumption in mobile devices
[O. Vargas (Infineon Technologies): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design, September 2005]
Thanks to Thorsten Koch (Nokia/Univ. Dortmund) for providing this source.

- 7 - Memory system frequently consumes >50 % of the energy used for processing
Average over 200 benchmarks analyzed by Verma (U. Dortmund) [M. Verma, P. Marwedel: Advanced Memory Optimization Techniques for Low-Power Embedded Processors, Springer, 2007]
[Figure: two pie charts. Cache ($)-less monoprocessor: 29%, 71%; legend: Processor Energy, Main Mem. Energy. Multiprocessor with cache ($): 14.8%, 5.2%, 28.1%, 51.9%; legend: Proc. Energy, I-Cache Energy, D-Cache Energy, Main Mem. Energy]

- 8 - Similar information according to other sources
[Figure: energy breakdowns from IEEE Journal of SSC, Nov. 96, and from [Segars 01 according to Vahid@ISSS01]]
[Based on slide by and ©: Osman S. Unsal, Israel Koren, C. Mani Krishna, Csaba Andras Moritz, U. of Massachusetts, Amherst, 2001]

- 9 - "Everything" is large for large memories
[Figure: three plots: Cycle Time (ns)*, Area (λ² × 10^6), Power (W)]
* Monolithic register file; model of Rixner et al. [HPCA'00]; technology of 0.18 µm; VLIW configurations for a certain number of ports ("GPxMyREGz", where x = {6}, y = {2, 3} and z = {16, 32, 64, 128}). Based on slide by and ©: Harry Valero, U. Barcelona, 2001

- 10 - Dependency on the size
[Figure: energy and access times as a function of memory size]
Applications are getting larger and larger …
Sub-banking + locality of references ⇒ memory hierarchies

- 11 - Timing Predictability
Many embedded systems are real-time systems ⇒ computations have to be finished in a given amount of time.
Most memory hierarchies (e.g. caches) for PC-like systems are designed for good average-case, not for good worst-case behavior.
Example: G.721 using a unified cache @ ARM7TDMI: the worst case execution time (WCET) is larger than without cache (see later slide for experimental setup).

- 12 - Multiple Objectives for Memory System Design
• (Average) performance
  - Throughput
  - Latency
• Energy consumption
• Predictability, good worst case execution time bound (WCET)
• Size
• Cost
• …

- 13 - Pareto curves
[Figure: Pareto curve; the axes range from "better" to "worse"]

- 14 - Pareto points
Definition: A (design) point $J_i$ is dominated by point $J_k$ if $J_k$ is equal to or better than $J_i$ in each criterion ($J_i \le J_k$).
Definition: A (design) point is Pareto-optimal, or a Pareto point, if it is not dominated by any other point.
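The two definitions translate almost directly into a filter over a set of design points. The sketch below is illustrative only and rests on two assumptions that are not on the slide: smaller values are better in every criterion, and a dominating point must be strictly better in at least one criterion (so that identical points do not eliminate each other). The example design points are made up.

```python
def dominates(a, b):
    """True if point a dominates point b (assumption: lower is better everywhere).

    a must be equal or better in each criterion and, as an added assumption,
    strictly better in at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_points(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical design points: (execution time, energy)
    designs = [(10, 5.0), (8, 7.0), (12, 4.0), (11, 6.0), (8, 9.0)]
    print(pareto_points(designs))
```

In this example, (11, 6.0) is dominated by (10, 5.0) and (8, 9.0) is dominated by (8, 7.0); the remaining three points form the Pareto front.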

- 15 - Energy models
• Commercial tools are frequently very imprecise.
• Model of Tiwari (Dissertation, Princeton 1996): cost of instructions and of transitions between instructions; does not separate out the cost of memory access.
• Model of Simunic, de Micheli (DAC 99): model based on data sheets; does not require measurements; does not take transitions into account.
• Russell, Jacome (ICCD, 1998): based on precise measurement for two fixed configurations; cannot predict the effect of changes to the memory architecture.
• Lee (LCTES 2001): detailed analysis of the effect of pipeline stages; does not include multi-cycle operations and stalls.
• Dedicated energy models.

- 16 - Source of energy models 1) measurements
E.g.: ATMEL board with ARM7TDMI and ext. SRAM
[Figure: ARM7 core (instruction decoder & control logic, register file, ALU, multiplier, barrel shifter) connected to the instruction memory (IAddr, Instr) and the data memory (DAddr, Data); the current drawn from VDD is measured in mA; instruction fields: Opcode, Reg#, Reg Value, Imm]

- 17 - Instruction dependent costs in the CPU
Cost for a sequence of m instructions (sum over the instructions i of the sequence):
$$E_{cpu\_instr} = \sum_i \Big[ MinCostCPU(Opcode_i) + \alpha_1 \sum_j w(Imm_{i,j}) + \beta_1 \sum_j h(Imm_{i-1,j}, Imm_{i,j}) + \alpha_2 \sum_k w(Reg_{i,k}) + \beta_2 \sum_k h(Reg_{i-1,k}, Reg_{i,k}) + \alpha_3 \sum_k w(RegVal_{i,k}) + \beta_3 \sum_k h(RegVal_{i-1,k}, RegVal_{i,k}) + \alpha_4\, w(IAddr_i) + \beta_4\, h(IAddr_{i-1}, IAddr_i) + FUCost(Instr_{i-1}, Instr_i) \Big]$$
w: number of ones; h: Hamming distance; FUCost: cost of switching functional units; α, β: determined through experiments
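The building blocks w and h of this model are easy to compute. The following sketch shows them together with one generic "α·w + β·h" term summed over a sequence of bus values. The function name bus_cost, the coefficient values and the address trace are hypothetical; the real α and β are determined through experiments, as stated on the slide.

```python
def w(value):
    """Number of ones in the binary representation of a non-negative value."""
    return bin(value).count("1")


def h(prev, curr):
    """Hamming distance: number of bit positions in which prev and curr differ."""
    return w(prev ^ curr)


def bus_cost(values, alpha, beta):
    """Sum of alpha * w(v_i) + beta * h(v_{i-1}, v_i) over a sequence of values.

    Assumption for this sketch: before the first access the bus already
    holds the first value, so the first Hamming distance is 0.
    """
    total = 0.0
    prev = values[0]
    for v in values:
        total += alpha * w(v) + beta * h(prev, v)
        prev = v
    return total


if __name__ == "__main__":
    # Hypothetical instruction address trace (word-aligned, sequential)
    trace = [0x1000, 0x1004, 0x1008, 0x100C]
    print(bus_cost(trace, alpha=0.0, beta=1.0))  # counts bit flips only
```

Setting alpha to 0 reduces the term to a pure bit-flip count, which is a convenient way to compare different address traces.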

- 18 - Other costs
$$E_{cpu\_data} = \sum_i \Big[ \alpha_5\, w(DAddr_i) + \beta_5\, h(DAddr_{i-1}, DAddr_i) + \alpha_6\, w(Data_i) + \beta_6\, h(Data_{i-1}, Data_i) \Big]$$
$$E_{mem\_instr} = \sum_i \Big[ MinCostMem(InstrMem, Word\_width_i) + \alpha_7\, w(IAddr_i) + \beta_7\, h(IAddr_{i-1}, IAddr_i) + \alpha_8\, w(IData_i) + \beta_8\, h(IData_{i-1}, IData_i) \Big]$$
$$E_{mem\_data} = \sum_i \Big[ MinCostMem(DataMem, Direction, Word\_width_i) + \alpha_9\, w(DAddr_i) + \beta_9\, h(DAddr_{i-1}, DAddr_i) + \alpha_{10}\, w(Data_i) + \beta_{10}\, h(Data_{i-1}, Data_i) \Big]$$

- 19 - Results
• It is not important which address bit is set to '1'.
• The number of '1's on the address bus is irrelevant.
• The cost of flipping a bit on the address bus is independent of the bit position.
• It is not important which data bit is set to '1'.
• The number of '1's on the data bus has a minor effect (3%).
• The cost of flipping a bit on the data bus is independent of the bit position.

- 20 - The Hamming distance between adjacent addresses plays a major role

- 21 - Source of energy models 2) computer models, e.g. CACTI
[Figure: cache model used; comparison with SPICE]
http://research.compaq.com/wrl/people/jouppi/CACTI.html

- 22 - Register allocation
• Registers = fastest level in the memory hierarchy
• Interest in good global register allocation techniques
• Frequently based on coloring of the interference graph
• Registers are suitable neither for objects much larger than single words nor for code
• Register allocation is not considered in the remainder
[Figure: lifetimes of v1, v2, v3 and a graph with the nodes v1, v2, v3]
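Coloring of an interference graph can be illustrated with a small sketch. It builds the graph from lifetimes (two variables interfere if their lifetimes overlap) and colors it greedily, where each color stands for one register. The greedy heuristic, the half-open interval convention and the example lifetimes are assumptions made for illustration; production compilers use more elaborate algorithms with spilling.

```python
def interference_graph(lifetimes):
    """Build the interference graph from lifetimes.

    lifetimes: dict mapping a variable name to a half-open interval (start, end).
    Two variables interfere if their intervals overlap.
    """
    names = list(lifetimes)
    graph = {name: set() for name in names}
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            (s1, e1), (s2, e2) = lifetimes[a], lifetimes[b]
            if s1 < e2 and s2 < e1:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Assign each node the smallest color (register) unused by its neighbors."""
    colors = {}
    # Heuristic: handle nodes with many neighbors first.
    for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {colors[m] for m in graph[node] if m in colors}
        colors[node] = next(c for c in range(len(graph)) if c not in used)
    return colors


if __name__ == "__main__":
    # Hypothetical lifetimes for the three variables of the figure
    lifetimes = {"v1": (0, 4), "v2": (2, 6), "v3": (5, 8)}
    graph = interference_graph(lifetimes)
    print(graph)
    print(greedy_coloring(graph))
```

With these lifetimes, v1 and v3 never live at the same time and can share a register, so two registers suffice for three variables.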

- 23 - Scratch pad memories (SPM): fast, energy-efficient, timing-predictable
SPMs are small, physically separate memories mapped into the address space; selection is by an appropriate address decoder (simple!).
Small; no tag memory.
Example: ARM7TDMI cores, well-known for low power consumption.
[Figure: address space from 0 to FFF.. containing a region for the scratch pad memory; the address decoder generates the "SPM select" signal]
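The simplicity of the address decoder can be made concrete with a few lines of code. This is only a behavioral sketch: the base address and the size of the SPM are hypothetical values chosen for illustration and do not describe any particular ARM7TDMI system.

```python
# Hypothetical memory map: base address and size are illustrative only.
SPM_BASE = 0x0030_0000
SPM_SIZE = 4 * 1024  # assumed 4 KiB scratch pad


def spm_select(addr: int) -> bool:
    """Address decoder: True if addr falls into the SPM address range."""
    return SPM_BASE <= addr < SPM_BASE + SPM_SIZE


def access(addr: int) -> str:
    """Route an access; no tag lookup or miss handling is involved."""
    return "SPM" if spm_select(addr) else "main memory"


if __name__ == "__main__":
    for addr in (0x0000_0000, 0x0030_0000, 0x0030_0FFF, 0x0030_1000):
        print(hex(addr), "->", access(addr))
```

Because the decision depends only on the address, the access time of every memory reference is known in advance, which is the basis for the timing predictability claimed on the slide.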

- 24 - Predictability and scratch-pad memories
"… In essence, we must reinvent computer science. Fortunately, we have quite a bit of knowledge and experience to draw upon. Architecture techniques such as software-managed caches promise to deliver much of the benefit of memory hierarchy without the timing unpredictability." [Ed Lee: Absolutely Positively on Time: What would it take?, IEEE Computer, 2005]
"… pre-run-time scheduling is often the only practical means of providing predictability in a complex system." [J. Xu, D. Parnas: On satisfying timing constraints in hard real-time systems, IEEE Trans. Soft. Engineering, 1993, p. 70–84]

- 25 - Comparison of currents using measurements
E.g.: ATMEL board with ARM7TDMI and ext. SRAM

- 26 - Scratchpad vs. main memory energy
Example: Atmel ARM-Evaluation board
Savings (86%) even larger.
Energy reduction: / 7.06; 100% predictable

- 27 - Why not just use a cache?
Energy consumption in tags, comparators and muxes is significant.
[R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, P. Marwedel: Scratchpad Memory: A Design Alternative for Cache On-chip Memory in Embedded Systems, Intern. Workshop on Hardware/Software Codesign (CODES), 2002]

- 28 - Set-associative cache
n-way cache
[Figure: n = 2; the address is split into tag and index; way 0 and way 1 each hold tags and data blocks; the tags of both ways are compared (==) with the tag of the address to select the data]
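The lookup in an n-way cache can be modeled in a few lines, which also shows where the tag comparisons of the previous slide come from. The sketch below assumes n = 2 as in the figure; the block size, the number of sets and the LRU replacement policy are assumptions for illustration and are not taken from the slide.

```python
BLOCK_BITS = 4  # assumed: 16-byte blocks
INDEX_BITS = 2  # assumed: 4 sets
WAYS = 2        # n = 2, as in the figure


def split(addr):
    """Split an address into (tag, index, block offset)."""
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset


class SetAssociativeCache:
    """Tag store of an n-way cache with LRU replacement (assumed policy)."""

    def __init__(self):
        # One list of tags per set; the most recently used tag is last.
        self.sets = [[] for _ in range(1 << INDEX_BITS)]

    def access(self, addr):
        """Return True on a hit, False on a miss, and update the tag store."""
        tag, index, _ = split(addr)
        ways = self.sets[index]
        hit = tag in ways  # models the comparison with the tags of all ways
        if hit:
            ways.remove(tag)
        elif len(ways) == WAYS:
            ways.pop(0)  # evict the least recently used tag
        ways.append(tag)
        return hit


if __name__ == "__main__":
    cache = SetAssociativeCache()
    for addr in (0x000, 0x040, 0x000, 0x080, 0x040):
        print(hex(addr), "hit" if cache.access(addr) else "miss")
```

All five example addresses map to set 0, so the third distinct block evicts one of the first two. Every access, hit or miss, reads and compares the tags of all ways, which is the energy overhead an SPM avoids.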

- 29 - Influence of the associativity
Technology different from previous slide.

- 30 - Availability of SPMs (= "Tightly Coupled Memories")
Source: http://www.arm.com/products/CPUs/core_selector.html
ARM CPU Core      | Caches   | TCM available
ARM 1026EJ-S      | Variable | yes
ARM 1136J(F)-S    | Variable | yes
ARM 1176JZ(F)-S   | Variable | yes
ARM 926EJ-S       | Variable | yes
ARM 1156T2(F)-S   | Variable | yes
ARM 946E-S        | Variable | yes
ARM 966E-S        | -        | yes
ARM 968E-S        | -        | yes
All others        |          | no

- 31 - Current usage for ARM
1. Use a pragma in the C source to allocate to a specific section. For example:
     #pragma arm section rwdata = "foo", rodata = "bar"
     int x2 = 5;                  // in foo (data part of region)
     int const z2[3] = {1,2,3};   // in bar
2. Input a scatter loading file to the linker for allocating a section to a specific address range (scatter description):
     LOAD_ROM_1 0x000             ; load region description
     {
         EXEC_ROM_1 0x000         ; execution region description
         {
             program1.o (+RO)     ; input section description
         }
         DRAM 0x18000 0x8000      ; execution region description
         {
             program1.o (+RW,+ZI)
         }
     }
   http://www.arm.com/documentation/Software_Development_Tools/index.html
(2 different examples)

- 32 - Popular among designers?
Information received from designers:
• Used for buffering speech in mobile phones (mobile phone company)
• Manual mapping of frequently accessed data
• Essentially no idea on how to exploit it (WLAN/Bluetooth specialists & major vendor)
Why not change this situation?

- 33 - Similar Concept for the Cell processor*
Local SPE processors fetch instructions and data from local storage LS (256 kB). LS is not designed as a cache. Separate DMA transfers are required to fill and spill.
Motivation same as for this tutorial:
• Large memory latency
• Huge overhead for automatically managed caches
Similar for Infineon TriCore.
[Figure: Cell processor with main memory]
* Sony, IBM, Toshiba

- 34 - Infineon TriCore
[Figure: © Infineon, 2005]

- 35 - Migration of instructions to small memory
Yasuura's architecture (Kyushu U.): 2 instruction memories: main memory + "decompressor" memory. The compiler optimizes the code allocation & the size of the 2 instruction memories.
[Figure: merged instruction; one memory labeled "Small and Low Power", the other "Large and High Power"]
[T. Ishihara, H. Yasuura: A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors, Design, Automation and Test in Europe Conference (DATE), 2000]

- 36 - Energy savings
[Based on slide by & ©: H. Yasuura, 2000]

- 37 - Migration of data and instructions, global optimization model (TU Dortmund)
Which object (array, loop, etc.) is to be stored in the SPM?
Non-overlaying memory allocation: gain $g_k$ and size $s_k$ for each object $k$. Maximise the gain $G = \sum_k g_k$, respecting the size of the SPM: $SSP \ge \sum_k s_k$ (both sums run over the objects selected for the SPM). Solution: knapsack algorithm.
Overlaying allocation: moving objects back and forth between hierarchy levels.
[Figure: processor, scratch pad memory (capacity SSP) and main memory. Example program with candidate objects: loops (for i, for j, while, repeat), a function, arrays, an int]
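The non-overlaying allocation is a 0/1 knapsack problem, which can be solved exactly with dynamic programming when sizes are integers. The sketch below is one textbook way of doing this and is not claimed to be the implementation used at TU Dortmund; the object names, gains and sizes are made up.

```python
def allocate_spm(objects, capacity):
    """Select objects for the SPM so that the total gain is maximal.

    objects:  list of (name, gain, size) tuples; sizes are integers (e.g. bytes)
    capacity: SPM capacity SSP, in the same unit as the sizes
    Returns (total gain, list of selected object names).
    """
    # best[c] = (gain, names) of the best selection that fits into capacity c
    best = [(0, []) for _ in range(capacity + 1)]
    for name, gain, size in objects:
        # Walk capacities downwards so that each object is selected at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + gain
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    return best[capacity]


if __name__ == "__main__":
    # Hypothetical objects: (name, gain g_k, size s_k)
    objects = [("array_a", 60, 10), ("loop_b", 100, 20), ("func_c", 120, 30)]
    print(allocate_spm(objects, capacity=50))
```

In the example, the object with the best gain per byte (array_a) is not part of the optimal solution, which shows why a simple greedy selection by gain density is not sufficient.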

- 38 - ILP representation - migrating functions and variables -
Symbols:
• $S(var_k)$ = size of variable $k$
• $n(var_k)$ = number of accesses to variable $k$
• $e(var_k)$ = energy saved per variable access if $var_k$ is migrated
• $E(var_k)$ = energy saved if variable $var_k$ is migrated ($= e(var_k) \cdot n(var_k)$)
• $x(var_k)$ = decision variable, $= 1$ if variable $k$ is migrated to the SPM, $= 0$ otherwise
• $K$ = set of variables; similar for functions $F_i$, $i \in I$
Integer linear programming formulation:
Maximize $$\sum_{k \in K} x(var_k)\, E(var_k) + \sum_{i \in I} x(F_i)\, E(F_i)$$
Subject to the constraint $$\sum_{i \in I} S(F_i)\, x(F_i) + \sum_{k \in K} S(var_k)\, x(var_k) \le SSP$$

- 39 - Reduction in energy and average run-time
[Figure: cycles [x100] and energy [µJ] for Multi_sort (mix of sort algorithms)]
Feasible with standard compiler & pre- or postpass optimization.
Measured processor / external memory energy + CACTI values for SPM (combined model).
Numbers will change with technology; algorithms remain unchanged.

- 40 - Allocation of basic blocks
Fine-grained granularity smoothens the dependency on the size of the scratch pad.
Requires additional jump instructions to return to "main" memory.
For consecutive basic blocks: statically 2 jumps, but only one is taken.
[Figure: main memory with the basic blocks BB1 and BB2 and the jumps Jump1 to Jump4]

- 41 - Taking consecutive basic blocks into account
Approach: Consider sets of consecutive BBs as a new kind of basic blocks ("multi blocks"), e.g. BB12, BB23 and BB123 for the basic blocks BB1, BB2 and BB3.
Add a constraint preventing the same block from being selected twice:
$$x(BB_b) + x(F_i) + \sum_{j \in multiblocks(b)} x(BB_j) \le 1 \quad \forall b \in \{blocks\} \cup \{multi\ blocks\}$$
⇒ Block b is either moved individually, as part of a function, as part of one of its enclosing multi-blocks, or not at all.
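The effect of this constraint can be checked on small examples. The sketch below enumerates the multi-blocks of a function with n basic blocks and tests whether a given selection moves any basic block more than once. It is a simplification made for illustration: it checks the condition per basic block only and assumes a single function containing all blocks; the names and the (first, last) representation of a multi-block are hypothetical.

```python
def multi_blocks(n):
    """All runs of at least two consecutive basic blocks among BB1..BBn.

    A run is represented as (first, last); for n = 3 these are
    (1, 2), (1, 3) and (2, 3), i.e. BB12, BB123 and BB23.
    """
    return [(a, b) for a in range(1, n + 1) for b in range(a + 1, n + 1)]


def enclosing(block, runs):
    """Multi-blocks that contain the given basic block."""
    return [(a, b) for (a, b) in runs if a <= block <= b]


def is_valid(selected_blocks, selected_runs, function_selected, n):
    """True if no basic block is moved to the SPM more than once."""
    runs = multi_blocks(n)
    for block in range(1, n + 1):
        count = int(block in selected_blocks) + int(function_selected)
        count += sum(1 for run in enclosing(block, runs) if run in selected_runs)
        if count > 1:
            return False
    return True


if __name__ == "__main__":
    print(multi_blocks(3))
    print(is_valid({1}, {(2, 3)}, False, 3))  # BB1 and BB23: no overlap
    print(is_valid({2}, {(1, 2)}, False, 3))  # BB2 and BB12: BB2 selected twice
```

An ILP solver would receive one such inequality per block instead of testing selections one by one; the enumeration here only serves to make the meaning of the constraint visible.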

- 42 - Allocation of basic blocks, sets of adjacent basic blocks and the entire stack
Requires generation of additional jumps (special compiler) or procedure exlining ("procedural abstraction").
[Figure: cycles [x100] and energy [µJ]]

- 43 - Savings for memory system energy alone
Combined model for memories

- 44 - Underlying reasons for large energy savings
The potential energy saving is determined by the difference between the energy consumption for large and small memories.
In this case: factor of ≈ 20 ⇒ savings of up to 95%.
Can even be larger for larger memories.
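The relation between the energy factor and the percentage saved is simple arithmetic. The sketch below assumes that every access is moved from the large to the small memory, which gives the upper bound of the saving; the function name is chosen for illustration.

```python
def saving(factor: float) -> float:
    """Fraction of energy saved if each access becomes `factor` times cheaper.

    Assumption: all accesses move to the cheaper memory (upper bound).
    """
    return 1.0 - 1.0 / factor


if __name__ == "__main__":
    for factor in (7.06, 20.0):
        print(f"factor {factor}: saving {saving(factor):.1%}")
```

A factor of 20 corresponds to a saving of 95%, and the factor of 7.06 from the scratchpad vs. main memory measurement corresponds to about 86%.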

- 45 - Summary
• Multiple objectives for the design of the memory system
  - Energy efficiency, (predictable) performance, cost, …
  - Small is beautiful (efficient)
  - efficiency(memory hierarchies) > efficiency(single mem)
• Energy models: based on measurements + tools
• Scratch pads: energy-efficient, fast, timing predictable
• Scratch pad allocation strategies:
  - Static (non-overlaying): globals, functions, basic blocks

- 46 - Brief break (if on schedule)
Q&A?
