
Optimizations
Peter Marwedel
TU Dortmund, Informatik 12, Germany
technische universität dortmund, fakultät für informatik, informatik 12
2009/01/17
Graphics: © Alexandra Nolte, Gesine Marwedel, 2003

Structure of this course
- Application Knowledge
- 2: Specifications
- 3: Embedded System HW
- 4: Standard Software, Real-Time Operating Systems
- 5: Scheduling, HW/SW-Partitioning, Applications to MP-Mapping
- 6: Evaluation and Validation
- 7: Optimization of Embedded Systems
- 8: Testing

Prefetching
- Prefetch instructions load values into the cache
- The pipeline is not stalled for prefetching
- Prefetching instructions introduced in ~ …
- Potentially, all miss latencies can be avoided
Disadvantages:
- Increased # of instructions
- Potential premature eviction of cache lines
- Potentially pre-loads lines that are never used
Steps:
- Determination of references requiring prefetches
- Insertion of prefetches (early enough!)
[R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures, Morgan Kaufmann, 2002]

Results for prefetching
[Figure: results for prefetching; Mowry, as cited by R. Allen & K. Kennedy, © Morgan Kaufmann, 2002]
Not very impressive!

Optimization for exploiting processor-memory interface: Problem Definition (1)
- XScale is stalled for 30% of the time, but each stall is short
  - Average stall duration = 4 cycles
  - Longest stall duration < 100 cycles
- Break-even stall duration for profitable switching: 360 cycles
- Maximum processor stall < 100 cycles, so it is NOT possible to switch the processor to IDLE mode
[A. Shrivastava, E. Earlie, N. Dutt, A. Nicolau: Aggregating processor free time for energy reduction, Intern. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005]
Based on slide by A. Shrivastava

Optimization for exploiting processor-memory interface: Problem Definition (2)
- CT (Computation Time): time to execute an iteration of the loop, assuming all data is present in the cache
- DT (Data Transfer Time): time to transfer the data required by an iteration of the loop between cache and memory
- Consider the execution of a memory-bound loop (DT > CT): the processor has to stall

for (int i=0; i<1000; i++)
  c[i] = a[i] + b[i];

[Figure: processor activity and memory bus activity over time]
- Processor activity is discontinuous
- Memory activity is discontinuous
Based on slide by A. Shrivastava

Optimization for exploiting processor-memory interface: Prefetching Solution

for (int i=0; i<1000; i++) {
  prefetch a[i+4];
  prefetch b[i+4];
  prefetch c[i+4];
  c[i] = a[i] + b[i];
}

[Figure: processor activity and memory bus activity over time]
- Each processor activity period increases
- Memory activity is continuous
- Processor activity is discontinuous
- Total execution time reduces
Based on slide by A. Shrivastava
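The prefetching loop above is pseudocode. A minimal C sketch of the same idea follows, assuming a GCC- or Clang-compatible compiler that provides the `__builtin_prefetch` builtin; the function name `vector_add_prefetch`, the bounds guard, and the constant `PREFETCH_DISTANCE` are illustrative choices, not part of the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Prefetch distance in elements. The value 4 mirrors the slide; a suitable
 * value depends on memory latency and loop body time (assumption). */
enum { PREFETCH_DISTANCE = 4 };

/* Computes c[i] = a[i] + b[i] and requests the operands of a later
 * iteration ahead of time. __builtin_prefetch is a GCC/Clang builtin;
 * other compilers need their own intrinsic. */
void vector_add_prefetch(const int *a, const int *b, int *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Guard so that no address beyond the arrays is formed. */
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument: 0 = prefetch for read, 1 = prefetch for write. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0);
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE], 0);
            __builtin_prefetch(&c[i + PREFETCH_DISTANCE], 1);
        }
        c[i] = a[i] + b[i];
    }
}
```

The prefetches are only hints: the result is the same with or without them, and only the overlap of memory and processor activity changes.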

Memory hierarchy description languages: ArchC
- Consists of a description of the ISA and of the HW architecture
- Extension of SystemC (can be generated from ArchC)
- Storage class structure: [figure]
[P. Viana, E. Barros, S. Rigo, R. Azevedo, G. Araújo: Exploring Memory Hierarchy with ArchC, 15th Symposium on Computer Architecture and High Performance Computing, 2003, pp. 2 – 9]

Example: Description of a simple cache-based architecture
[Figure: description of a simple cache-based architecture]

Memory Aware Compilation and Simulation Framework (for C): MACC
[Figure: tool flow]
- Compilation framework: application C code; source-level memory optimizer; encc, ARM gcc, M5 DSP; array partitioning; SPM overlay; executable binary; energy database; memory hierarchy description
- Simulation framework: profile report; memory simulator; processor simulators (ARM7/M5); profiler; MPSoC simulator
[M. Verma, L. Wehmeyer, R. Pyka, P. Marwedel, L. Benini: Compilation and Simulation Tool Chain for Memory Aware Energy Optimizations, Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS VI), 2006]

Memory architecture MACCv2
- Query can include address, time stamp, value, …
- Query can request energy, delay, stored values
- Query processed along a chain of HW components, incl. busses, ports, address translations etc., each adding delay & energy
- API query to model simplifies integration into compiler
- External XML representation
[Figure: a request (REQ: Energy = ?, Cycles = ?) travels between CPU1 and MM through the address spaces ASPC-1 (IFETCH, DRD, DWR), MAINAS, ASPC-M (0…ffff) and ASPC-B (0…3ffff); the components add +10 energy / +5 cycles, +1 energy / +2 cycles, +1 energy / +0 cycles]
[R. Pyka et al.: Versatile System level Memory Description Approach for embedded MPSoCs, University of Dortmund, Informatik 12, 2007]

Controlling tool chain generation through an architecture description language (ADL): EXPRESSION
[Figure: overall information flow]
[P. Mishra, A. Shrivastava, N. Dutt: Architecture description language (ADL)-driven software toolkit generation for architectural exploration of programmable SOCs, ACM Trans. Des. Autom. Electron. Syst. (TODAES), 2006]

Description of Memories in EXPRESSION
- Generic approach, based on the analysis of a wide range of systems
- Used for verification

(STORAGE_SECTION
  (DataL1
    (TYPE DCACHE)
    (WORDSIZE 64)
    (LINESIZE 8)
    (NUM_LINES 1024)
    (ASSOCIATIVITY 2)
    (READ_LATENCY 1) ...
    (REPLACEMENT_POLICY LRU)
    (WRITE_POLICY WRITE_BACK)
  )
  (ScratchPad
    (TYPE SRAM)
    (ADDRESS_RANGE ) ….
  )
  (SB
    (TYPE STREAM_BUFFER) …..
  )
  (InstL1
    (TYPE ICACHE) ………
  )
  (L2
    (TYPE DCACHE) …….
  )
  (MainMemory
    (TYPE DRAM)
  )
  (Connect
    (TYPE CONNECTIVITY)
    (CONNECTIONS
      (InstL1, L2)
      (DataL1, SB)
      (SB, L2)
      (L2, MainMemory)
    )
  )
)

EXPRESSION: results
[Figure: EXPRESSION results]

Optimization for main memory: Exploiting burst mode of DRAM (1)
- Supported transformations: memory mapping, code reordering or loop unrolling
[P. Grun, N. Dutt, A. Nicolau: Memory aware compilation through accurate timing extraction, DAC, 2000, pp. 316 – 321]

Optimization for main memory: Exploiting burst mode of DRAM (2)
- Timing extracted from EXPRESSION model
- 2 banks

for(i=0; i<9; i+=3){
  a = a + x[i] + x[i+1] + x[i+2] + y[i] + y[i+1] + y[i+2];
  b = b + z[i] + z[i+1] + z[i+2] + u[i] + u[i+1] + u[i+2];
}

[Figure: DRAM access timing] Open circles of the original paper changed into closed circles (column decodes).

Comparison Flash/Microdrive
Property | Sandisk Type I Flash | Sandisk Type II Flash | IBM Microdrive DSCM
Capacity [MB] | … | … | …
Power [W] (standby/operating) | 0.15/0.66 | 0.15/0.66 | 0.07/0.83
Write cycles | … | … | unlimited
Mean-time between failures [h] | > … | > … | service-life = min(5 years, 8800 h operating)
Error rates, uncorrectable | < 1 per … | < 1 per … | …
Max. power ons | unlimited | … | …
Shock tolerance | 2000 G; 2000 G | 2000 G; … | 175 G; 1500 G
Source: Hennessy/Patterson, Computer Architecture, 2002

NOR- and NAND-Flash
- NOR: transistor between bit line and ground
- NAND: several transistors between bit line and ground
[was at … Semiconductor/Flash/FlashNews/FlashStructure.htm (2007)]

Properties of NOR- and NAND-Flash memories
Type/Property | NOR | NAND
Random access | Yes | No
Erase block | Slow | Fast
Size of cell | Larger | Small
Reliability | Larger | Smaller
Execute in place | Yes | No
Applications | Code storage, boot flash, set top box | Data storage, USB sticks, memory cards
[… tor/Flash/FlashNews/FlashStructure.htm]

Characteristics of NAND Flash memory
- Memory partitioned into blocks (typ. … KB), blocks partitioned into pages (typ. … KB)
- Read/write operations performed in page units
Operation | Single Level Cell (SLC) | Multi Level Cell (MLC)
Read (page) | 25 µs | …
Write (page) | 300 µs | …
Erase (block) | 2 ms | 1.5 ms
J. Lee, S. Kim, H. Kwin, C. Hyun, S. Ahn, J. Choi, D. Lee, S. Noh: Block Recycling Schemes and Their Cost-based Optimization in NAND Flash Memory Based Storage System, EMSOFT07, Sept. 2007

Page/sector mapping flash translation layer (FTL)
- Inverted page table stored in flash memory (extra bits); normal page table constructed during initialization
- Page table may become large
- Used in low-capacity NOR Flash memories
[Figure: the logical sector number (bits 15 … 0) indexes the page mapping table, which selects a page in Block 0 … Block 3; each page holds a sector plus extra bits]
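The initialization step named above (building the normal page table from the inverted table kept in the extra bits) can be sketched as follows. This is a simplified illustration under stated assumptions: the extra bits of each physical page are modelled as one array entry holding a logical sector number, the marker `UNMAPPED` and the function name `build_page_table` are hypothetical, and stale copies of a sector are resolved by letting the last page scanned win, where a real FTL would use valid flags or version counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "no sector stored here" / "sector not mapped". */
#define UNMAPPED UINT32_MAX

/* inverted[p]: logical sector number read from the extra bits of physical
 * page p (or UNMAPPED). table[s]: physical page holding logical sector s.
 * Assumption: if several pages claim the same sector, the last one wins. */
void build_page_table(const uint32_t *inverted, size_t num_pages,
                      uint32_t *table, size_t num_sectors)
{
    assert(inverted != NULL && table != NULL);
    for (size_t s = 0; s < num_sectors; s++)
        table[s] = UNMAPPED;
    for (size_t p = 0; p < num_pages; p++) {
        uint32_t sector = inverted[p];
        if (sector != UNMAPPED && sector < num_sectors)
            table[sector] = (uint32_t)p;
    }
}
```

The table has one entry per logical sector, which illustrates why it may become large for high-capacity devices.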

Exploiting regularity
- Usually, long sequences of sequential writes

Block mapping flash translation layer (FTL)
- Mapping tables smaller than for page-based FTLs
- Used in high-capacity NAND Flash memories
- Overall operation is simple, but successive writes require copying into a new block
- Degraded performance for random and repeated writes
- Hybrid schemes
[Figure: the upper bits of the logical sector number (bits 15 … 0) index the block mapping table; the resulting block number is concatenated with the offset to form the physical sector number in Block 0 … Block 3]
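The address translation of a block-mapping FTL (table lookup on the block part, concatenation with the offset) can be sketched as below. The split of the sector number into block and offset bits is an assumption: `OFFSET_BITS = 4` (16 sectors per block) and the function name `physical_sector` are chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: each block holds 2^OFFSET_BITS sectors. */
enum { OFFSET_BITS = 4 };

/* Translates a logical sector number into a physical sector number:
 * the upper bits select an entry of the block mapping table, the lower
 * bits (the offset) are carried over unchanged and concatenated. */
uint32_t physical_sector(const uint32_t *block_map, size_t num_logical_blocks,
                         uint32_t logical_sector)
{
    uint32_t logical_block = logical_sector >> OFFSET_BITS;
    uint32_t offset = logical_sector & ((1u << OFFSET_BITS) - 1u);
    assert(block_map != NULL && logical_block < num_logical_blocks);
    return (block_map[logical_block] << OFFSET_BITS) | offset;
}
```

Because only one table entry per block is kept, the table is smaller than a page-level table by a factor of 2^OFFSET_BITS, at the price of copying a whole block when a sector is rewritten.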

Wear-leveling
Example (Lofgren et al., 2000, 2003):
- Each erase unit carries an erase counter
- One erase unit is set aside as a spare
- When one of the most worn-out units is reclaimed, its counter is compared to that of the least worn-out unit. If the difference is large:
  - content of the least worn-out unit (constants) is copied to the spare
  - content of the most worn-out unit is copied to the least worn-out unit
  - the most worn-out unit becomes the new spare
- Counter increment may be lost if power is lost between erase and counter update
- Attempts to avoid erase counter in the same erase unit
Source: Gal, Toledo, ACM Computing Surveys, June 2005
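The decision part of this scheme (find the least worn-out unit, compare counters, swap only if the difference is large) might look as follows in C. This is a sketch only: the function names, the array representation of the erase counters and the `threshold` parameter are assumptions, and the actual copying of contents is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the unit with the smallest erase counter, ignoring the spare. */
size_t least_worn_unit(const uint32_t *erase_count, size_t num_units,
                       size_t spare)
{
    assert(erase_count != NULL && num_units >= 2 && spare < num_units);
    size_t best = (spare == 0) ? 1 : 0;
    for (size_t u = 0; u < num_units; u++) {
        if (u != spare && erase_count[u] < erase_count[best])
            best = u;
    }
    return best;
}

/* Returns 1 if reclaiming unit `reclaimed` should trigger the swap with
 * the least worn-out unit, i.e. if the counter difference exceeds the
 * (hypothetical) threshold; returns 0 otherwise. */
int needs_wear_leveling(const uint32_t *erase_count, size_t reclaimed,
                        size_t least_worn, uint32_t threshold)
{
    assert(erase_count != NULL);
    return erase_count[reclaimed] > erase_count[least_worn] &&
           erase_count[reclaimed] - erase_count[least_worn] > threshold;
}
```

When `needs_wear_leveling` returns 1, the three copy steps listed on the slide would follow, after which the reclaimed unit takes over the role of the spare.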

Flash-specific file systems
- Two-layer approach can be inefficient:
  - FTL emulates flash as a magnetic disc
  - Standard file system assumes magnetic disc
  - Example: deleted sectors are not marked and hence not reclaimed
- Log-structured file systems just append new information
  - For disc-based file systems: fast writes; slow reads (head movement for gather operations)
  - Ideal for flash-based file systems: writes done in new sectors; reads not slow (no head movement)
- Specific log-based flash file systems:
  - JFFS2 (NOR)
  - YAFFS (NAND)
Source: Gal, Toledo, ACM Computing Surveys, June 2005

Flash-aware application data structures
- Direct use of flash-specific properties in applications
  - Typically requires partitioning of the flash memory and possibly wasted space within partitions
- Execute-in-place
  - Used with NOR-flash, directly addressable by processor
  - Problematic in systems without MMU (no FTL feasible!):
    - instructions must be stored contiguously in flash
    - instructions cannot move
  - Code needed during erase cannot be stored in flash, unless suspended writing or erasing is feasible
Source: Gal, Toledo, ACM Computing Surveys, June 2005

Flash memory as main memory
One approach published (Wu, Zwaenepoel, 1994):
- Uses MMU
- RAM + Flash mapped to memory map
- Reads from Flash read single words from Flash
- Writes copy a block of data into RAM; all updates are done in RAM
- If the RAM is full, a block is copied back to Flash
- Crucial issue: speed of writes. Proposal based on a wide bus between Flash and RAM, so that writes are sufficiently fast
- Larger erase units, increased wear-out feasible
M. Wu, W. Zwaenepoel: eNVy: A nonvolatile, main memory storage system. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, 1994, pp. 86–97.

Memory hierarchies beyond main memory
- Massive datasets are being collected everywhere
- Storage management software is a billion-$ industry
- More new information over the next 2 years than in all previous history
Examples (2002):
- Phone: AT&T 20TB phone call database, wireless tracking
- Consumer: WalMart 70TB database, buying patterns
- WEB: Web crawl of 200M pages and 2000M links, Akamai stores 7 billion clicks per day
- Geography: NASA satellites generate 1.2TB per day
[© Lars Arge, I/O-Algorithms]

Example: LIDAR Terrain Data
- COWI A/S (and others) is currently scanning Denmark
[Figure labels: ~1.2 km; ~280 km/h at … m; ~1.5 m between measurements]
[© Lars Arge, I/O-Algorithms]

Application Example: Flooding Prediction
[Figure panels: +1 meter; +2 meter]
[© Lars Arge, I/O-Algorithms]

External Memory Model
- N = # of items in the problem instance
- B = # of items per disk block
- M = # of items that fit in main memory
- T = # of items in output
- I/O: move a block between memory and disk
- We assume (for convenience) that $M > B^2$
[Figure: processor P, main memory M and disk D, connected by block I/O]
[© Lars Arge, I/O-Algorithms]

Scalability Problems: Block Access Matters
Example: reading an array from disk
- Array size N = 10 elements
- Disk block size B = 2 elements
- Main memory size M = 4 elements (2 blocks)
- Algorithm 1: N = 10 I/Os
- Algorithm 2: N/B = 5 I/Os
The difference between N and N/B is large since the block size is large
- Example: $N = 256 \times 10^6$, $B = 8000$, 1 ms disk access time
- N I/Os take $256 \times 10^3$ sec = 4266 min = 71 hr
- N/B I/Os take 256/8 sec = 32 sec
[© Lars Arge, I/O-Algorithms]
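The arithmetic of this example can be reproduced with a few lines of C. The function names are illustrative; the model simply charges one disk access per I/O, as the slide does.

```c
#include <assert.h>
#include <stdint.h>

/* I/Os needed when every item is fetched with its own disk access. */
uint64_t ios_item_at_a_time(uint64_t n)
{
    return n;
}

/* I/Os needed when the array is read block by block: ceil(n / b). */
uint64_t ios_blockwise(uint64_t n, uint64_t b)
{
    assert(b > 0);
    return (n + b - 1) / b;
}

/* Total time in milliseconds for a given number of I/Os. */
uint64_t io_time_ms(uint64_t ios, uint64_t ms_per_io)
{
    return ios * ms_per_io;
}
```

For N = 256 × 10^6 items, B = 8000 and 1 ms per access, the item-at-a-time variant needs 256 × 10^6 ms (256 × 10^3 s), while the block-wise variant needs 32000 ms (32 s).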

Re-writing algorithms for memory hierarchies
- Analysis of algorithm complexity mostly using the RAM (random access machine; constant memory access times) model is outdated: take memory hierarchies explicitly into account
- Example: usually, divide-&-conquer algorithms are good
- Re-writing is unlikely to ever be automatic
- Cache-oblivious algorithms (are good for any size of the faster memory and any block size), assuming:
  - Optimal replacement (Belady's algorithm)
  - 2 memory levels considered (there can be more)
  - Full associativity
  - Automatic replacement
[Piyush Kumar: Cache Oblivious Algorithms, in: U. Meyer et al. (eds.): Algorithms for Memory Hierarchies, Lecture Notes in Computer Science, Volume 2625, 2003]
[Naila Rahman: Algorithms for Hardware Caches and TLB, in: U. Meyer et al. (eds.): Algorithms for Memory Hierarchies, Lecture Notes in Computer Science, Volume 2625, 2003]

Fundamental Bounds
Problem | Internal | External
Scanning | $N$ | $\frac{N}{B}$
Sorting | $N \log N$ | $\frac{N}{B} \log_{M/B} \frac{N}{B}$
Permuting | $N$ | $\min\{N, \frac{N}{B} \log_{M/B} \frac{N}{B}\}$
Searching | $\log_2 N$ | $\log_B N$
Note:
- Linear I/O: $O(N/B)$
- Permuting not linear
- Permuting and sorting bounds are equal in all practical cases
- B factor VERY important: $\frac{N}{B} < \frac{N}{B} \log_{M/B} \frac{N}{B} \ll N$
- Which results apply to flash memory?
[© Lars Arge, I/O-Algorithms]

Dynamic Voltage Scaling
Peter Marwedel, TU Dortmund, Informatik 12, Germany
2009/01/17
Graphics: © Alexandra Nolte, Gesine Marwedel, 2003

Voltage Scaling and Power Management
Dynamic Voltage Scaling
[Figure label: $V_{dd}$]

Recap from chapter 3: Fundamentals of dynamic voltage scaling (DVS)
Power consumption of CMOS circuits (ignoring leakage):
$P = \alpha \, C_L \, V_{dd}^2 \, f$
with $\alpha$: switching activity, $C_L$: load capacitance, $V_{dd}$: supply voltage, $f$: clock frequency
Delay for CMOS circuits:
$\tau = k \, C_L \, \frac{V_{dd}}{(V_{dd} - V_t)^2}$
with $V_t$: threshold voltage ($V_t \ll V_{dd}$)
Decreasing $V_{dd}$ reduces P quadratically, while the run-time of algorithms is only linearly increased (ignoring the effects of the memory system).
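The following short derivation makes the last statement explicit. It is a sketch under the usual textbook assumptions (leakage ignored, $V_t \ll V_{dd}$); the symbols $\alpha$ (switching activity), $C_L$ (load capacitance), $k$ (a technology constant), $V_t$ (threshold voltage) and $N_{cyc}$ (number of cycles of the task) are assumed notation.

```latex
% Power and gate delay of a CMOS circuit (leakage ignored)
P = \alpha \, C_L \, V_{dd}^2 \, f
\qquad
\tau = k \, C_L \, \frac{V_{dd}}{(V_{dd} - V_t)^2}

% For V_t \ll V_{dd} the delay is roughly inversely proportional to V_{dd},
% so the maximum clock frequency grows about linearly with V_{dd}
\tau \approx \frac{k \, C_L}{V_{dd}}
\quad\Rightarrow\quad
f_{max} \propto V_{dd}

% Energy and run time of a task of N_{cyc} cycles
E = P \cdot \frac{N_{cyc}}{f} = \alpha \, C_L \, V_{dd}^2 \, N_{cyc}
\qquad
t = \frac{N_{cyc}}{f_{max}} \propto \frac{1}{V_{dd}}
```

The energy of a task with a fixed cycle count therefore depends on $V_{dd}^2$ but not on the frequency, whereas its run time grows only in proportion to $1/V_{dd}$.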

Example: Processor with 3 voltages
Case a): Complete task ASAP
Task that needs to execute $10^9$ cycles within 25 seconds.
$E_a = 10^9 \times 40 \times 10^{-9} = 40$ [J]

Case b): Two voltages
$E_b = \ldots = 32.5$ [J]

Case c): Optimal voltage
$E_c = 10^9 \times 25 \times 10^{-9} = 25$ [J]
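The three cases can be checked with a small C sketch that sums energy and time over the segments of a voltage schedule. The per-cycle energies of 40 nJ and 25 nJ follow from the results on the slides; the 10 nJ value and the clock periods of 20, 25 and 40 ns are assumptions made here, chosen so that case b) yields 32.5 J and meets the 25 s deadline. The type and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One segment of a voltage schedule: a number of cycles executed at one
 * voltage setting. Integer units (nJ, ns) avoid rounding issues. */
struct segment {
    uint64_t cycles;              /* cycles executed in this segment */
    uint64_t energy_nj_per_cycle; /* energy per cycle at this voltage */
    uint64_t cycle_time_ns;       /* clock period at this voltage */
};

/* Total energy of a schedule in nanojoules. */
uint64_t schedule_energy_nj(const struct segment *s, size_t n)
{
    assert(s != NULL);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += s[i].cycles * s[i].energy_nj_per_cycle;
    return total;
}

/* Total execution time of a schedule in nanoseconds. */
uint64_t schedule_time_ns(const struct segment *s, size_t n)
{
    assert(s != NULL);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += s[i].cycles * s[i].cycle_time_ns;
    return total;
}
```

With these assumed values, a schedule of 750 × 10^6 cycles at the highest setting followed by 250 × 10^6 cycles at the lowest setting gives 32.5 × 10^9 nJ and exactly 25 × 10^9 ns.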

Observations
- A minimum energy consumption is achieved for the ideal supply voltage of 4 Volts.
- In the following: variable voltage processor = processor that allows any supply voltage up to a certain maximum.
- It is expensive to support truly variable voltages, and therefore, actual processors support only a few fixed voltages.

Generalisation
Lemma [Ishihara, Yasuura]:
- If a variable voltage processor completes a task before the deadline, the energy consumption can be reduced.
- If a processor uses a single supply voltage V and completes a task T just at its deadline, then V is the unique supply voltage which minimizes the energy consumption of T.
- If a processor can only use a number of discrete voltage levels, then a voltage schedule with at most two voltages minimizes the energy consumption under any time constraint.
- If a processor can only use a number of discrete voltage levels, then the two voltages which minimize the energy consumption are the two immediate neighbors of the ideal voltage $V_{ideal}$ possible for a variable voltage processor.
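Given the two neighboring voltage levels, the split of cycles between them follows from requiring that the task finishes at the deadline. A minimal C sketch, assuming that each level is characterized by its clock period in ns and that voltage switching costs no time; the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of cycles to run at the faster level so that total_cycles finish
 * by the deadline; the remaining cycles run at the slower level.
 * Derived from x*fast + (C - x)*slow = D, i.e.
 * x = (C*slow - D) / (slow - fast), rounded up so that the deadline holds. */
uint64_t cycles_at_fast_level(uint64_t total_cycles, uint64_t fast_ns,
                              uint64_t slow_ns, uint64_t deadline_ns)
{
    assert(fast_ns < slow_ns);
    /* The task must be feasible at the faster level alone. */
    assert(total_cycles * fast_ns <= deadline_ns);
    if (total_cycles * slow_ns <= deadline_ns)
        return 0; /* the slower level alone meets the deadline */
    uint64_t numerator = total_cycles * slow_ns - deadline_ns;
    uint64_t denominator = slow_ns - fast_ns;
    return (numerator + denominator - 1) / denominator;
}
```

For a task of 10^9 cycles, assumed clock periods of 20 ns and 40 ns and a deadline of 25 s, the function yields 750 × 10^6 cycles at the faster level.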

The case of multiple tasks: Assignment of optimum voltages to a set of tasks
- $N$: the number of tasks
- $EC_j$: the number of executed cycles of task $j$
- $L$: the number of voltages of the target processor
- $V_i$: the $i$-th voltage, with $1 \le i \le L$
- $F_i$: the clock frequency for supply voltage $V_i$
- $T$: the global deadline at which all tasks must have been completed
- $X_{i,j}$: the number of clock cycles task $j$ is executed at voltage $V_i$
- $SC_j$: the average switching capacitance during the execution of task $j$ ($SC_j$ comprises the actual capacitance $C_L$ and the switching activity $\alpha$)

Designing an I(L)P model
Simplifying assumptions of the IP-model include the following:
- There is one target processor that can be operated at a limited number of discrete voltages.
- The time for voltage and frequency switches is negligible.
- The worst-case number of cycles for each task is known.

Energy Minimization using an Integer Programming Model
Minimize
$E = \sum_{j=1}^{N} \sum_{i=1}^{L} SC_j \cdot X_{i,j} \cdot V_i^2$
subject to
$\forall j: \sum_{i=1}^{L} X_{i,j} = EC_j$
and
$\sum_{j=1}^{N} \sum_{i=1}^{L} \frac{X_{i,j}}{F_i} \le T$
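A candidate assignment $X_{i,j}$ can be evaluated against such a model with a few lines of C. The sketch assumes the usual form of the model: the energy is the sum of $SC_j \cdot X_{i,j} \cdot V_i^2$ over all voltages and tasks, every task must execute exactly $EC_j$ cycles, and the total time, the sum of $X_{i,j}/F_i$, must not exceed $T$. The row-major storage of X and the function names are illustrative; solving the integer program itself would require an ILP solver and is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Objective value: sum over i, j of SC_j * X_ij * V_i^2 (in joules if SC is
 * given in farads and V in volts). x[i * num_tasks + j] holds X_ij. */
double ilp_objective(const double *sc, const double *v, const uint64_t *x,
                     size_t num_voltages, size_t num_tasks)
{
    assert(sc != NULL && v != NULL && x != NULL);
    double energy = 0.0;
    for (size_t i = 0; i < num_voltages; i++)
        for (size_t j = 0; j < num_tasks; j++)
            energy += sc[j] * (double)x[i * num_tasks + j] * v[i] * v[i];
    return energy;
}

/* Returns 1 if every task j executes exactly ec[j] cycles and the total
 * execution time sum of X_ij / F_i stays within the deadline (seconds). */
int ilp_feasible(const uint64_t *ec, const double *f, const uint64_t *x,
                 size_t num_voltages, size_t num_tasks, double deadline)
{
    assert(ec != NULL && f != NULL && x != NULL);
    double time = 0.0;
    for (size_t j = 0; j < num_tasks; j++) {
        uint64_t cycles = 0;
        for (size_t i = 0; i < num_voltages; i++) {
            cycles += x[i * num_tasks + j];
            time += (double)x[i * num_tasks + j] / f[i];
        }
        if (cycles != ec[j])
            return 0;
    }
    return time <= deadline;
}
```

Such a checker is useful for validating the output of a solver or for comparing hand-made schedules before formulating the full integer program.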

Dynamic power management (DPM)
- Dynamic power management tries to assign optimal power-saving states
- Questions: When to go to a power-saving state?
- Different, but typically complex models: Markov chains, renewal theory, …

Experimental Results
[Figure: experimental results, © Yasuura, 2000]

Summary
- Optimizations exploiting memory hierarchies
  - Prefetching
  - Memory-architecture aware compilation
    - Burst mode access exploited by EXPRESSION
  - Support for FLASH memory
  - Memory hierarchies beyond main memory
- Dynamic voltage scaling (DVS)
  - An ILP model for voltage assignment in a multi-tasking system
- Dynamic power management (DPM) (briefly)