Optimizations - Compilation for Embedded Processors - Peter Marwedel, TU Dortmund (Technische Universität Dortmund, Fakultät für Informatik, Informatik 12)


Optimizations - Compilation for Embedded Processors -
Peter Marwedel, TU Dortmund, Informatik 12, Germany
January 14, 2014
These slides use Microsoft clip arts. Microsoft copyright restrictions apply. © Springer, 2010

- 2 - Structure of this course
technische universität dortmund, fakultät für informatik, p. marwedel, informatik 12, 2014
• 2: Specification & modelling
• 3: ES-hardware
• 4: System software (RTOS, middleware, …)
• 5: Evaluation & validation (energy, cost, performance, …)
• 6: Application mapping
• 7: Optimization
• 8: Test
Numbers denote sequence of chapters.
Figure labels: Application Knowledge, Design repository, Design

- 3 - Compilers for embedded systems
Book section 7.3
• Introduction
• Energy-aware compilation
• Memory-architecture-aware compilation
• Reconciling compilers and timing analysis
• Compilation for digital signal processors
• Compilation for VLIW processors
• Compiler generation, retargetable compilers, design space exploration

- 4 - SPM+MMU (1)
How to use SPM in a system with virtual addressing?
• Virtual SPM: typically accesses MMU + SPM in parallel => not energy efficient
• Real SPM => suffers from potentially long VA translation
• Egger, Lee, Shin (Seoul Nat. U.): introduction of a small µTLB that translates recent addresses fast.
[B. Egger, J. Lee, H. Shin: Scratchpad memory management for portable systems with a memory management unit, CASES, 2006 (best paper)]
Figure labels: Proc., $, SPM, MMU

- 5 - SPM+MMU (2)
• µTLB generates physical address in 1 cycle
  • if address corresponds to SPM, it is used
  • otherwise, mini-cache is accessed
• Mini-cache provides reasonable performance for non-optimized code
• µTLB miss triggers main TLB/MMU
• SPM is used only for instructions
  • instructions are stored in pages
  • pages are classified as cacheable, non-cacheable, and "pageable" (= suitable for SPM)
Figure labels: CPU core, MMU, unified TLB, minicache (TAG RAM, DATA RAM), SPM, SPM base reg., comparator, µTLB, instruction VA, PA

- 6 - SPM+MMU (3)
• Application binaries are modified: frequently executed code is put into pageable pages.
• Initially, page-table entries for pageable code are marked invalid.
• If an invalid page is accessed, a page table exception invokes the SPM manager (SPMM).
• SPMM allocates space in SPM and sets the page table entry.
• If SPMM detects more requests than fit into SPM, SPM eviction is started.
• Compiler does not need to know SPM size.
Figure labels: virtual memory (stack/heap region, pageable region, cached region, uncached region), physical memory (SPM, main memory; pageable region, cached region, uncached region), PC
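The page-fault-driven allocation described on this slide can be sketched as follows. This is a minimal model under stated assumptions, not the implementation from the cited paper: the class name `SpmManager`, the method `fetch`, and the FIFO eviction policy are all hypothetical (the slides do not say which eviction policy is used).

```python
from collections import OrderedDict


class SpmManager:
    """Toy model of a demand-paging scratchpad manager (SPMM)."""

    def __init__(self, spm_frames):
        self.free_frames = list(range(spm_frames))
        # page -> SPM frame, kept in loading order (FIFO eviction is an assumption)
        self.resident = OrderedDict()
        self.faults = 0

    def fetch(self, page):
        """Model an instruction fetch from a pageable page; return its SPM frame."""
        if page in self.resident:
            # Page-table entry is valid: the access goes straight to the SPM.
            return self.resident[page]
        # Invalid entry: the page table exception invokes the manager.
        self.faults += 1
        if not self.free_frames:
            # More requests than fit into the SPM: evict the oldest page.
            _victim, frame = self.resident.popitem(last=False)
            self.free_frames.append(frame)
        frame = self.free_frames.pop()
        self.resident[page] = frame  # "set the page table entry"
        return frame


mgr = SpmManager(spm_frames=2)
for page in ["A", "B", "A", "C", "A"]:
    mgr.fetch(page)
print(mgr.faults, list(mgr.resident))
```

Since the manager rather than the compiler decides what is resident, the same access sequence can be replayed with a different `spm_frames` value without changing the program.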

- 7 - Multi-processor ARM (MPARM) Framework
• Homogeneous SMP ~ CELL processor
• Processing Unit: ARM7T processor
• Shared Coherent Main Memory
• Private Memory: Scratchpad Memory
Figure labels: SPM, Interrupt Device, Semaphore Device, ARM, Interconnect (AMBA or STBus), Shared Main Memory

- 8 - Application Example: Multi-Processor Edge Detection
• Source, sink and n compute processors
• Each image is processed by an independent compute processor
  • Communication overhead is minimized.
Figure labels: Source, Sink, Compute Processors

- 9 - Results: Scratchpad Overlay for Edge Detection
• 2 CPs are better than 1 CP, then energy consumption stabilizes
• Best scratchpad size: 4 kB (1 CP & 2 CP), 8 kB (3 CP & 4 CP)

Results DES-Encryption
• DES-Encryption: 4 processors: 2 Controllers + 2 Compute Engines
• Energy values from ST Microelectronics
• Result of cooperation between U. Bologna and TU Dortmund, supported by ARTIST2 network of excellence.

Compilers for embedded systems
Book section 7.3
• Introduction
• Energy-aware compilation
• Memory-architecture-aware compilation
• Reconciling compilers and timing analysis
• Compilation for digital signal processors
• Compilation for VLIW processors
• Compiler generation, retargetable compilers, design space exploration

Code Layout Transformations (1)
Execution counts based approach:
• Sort the functions according to execution counts: f4 > f1 > f2 > f5 > f3
• Place functions in decreasing order of execution counts
Execution counts (from the figure): f1 (1100), f2 (900), f3 (400), f4 (2000), f5 (700)
[S. McFarling: Program optimization for instruction caches, 3rd International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), 1989]

Code Layout Transformations (2)
Execution counts based approach:
• Sort the functions according to execution counts: f4 > f1 > f2 > f5 > f3
• Place functions in decreasing order of execution counts
Transformation increases spatial locality.
Does not take into account calling order.
[Figure: call graph of f1 … f5 and resulting placement f4 (2000), f1 (1100), f2 (900), f5 (700), f3 (400)]
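The execution-count heuristic amounts to a sort followed by contiguous address assignment. The sketch below uses the execution counts shown on the slides; the function sizes, the base address, and the name `layout_by_count` are hypothetical and only there to make the example runnable.

```python
def layout_by_count(counts, sizes, base=0):
    """Place functions contiguously in decreasing order of execution count."""
    order = sorted(counts, key=counts.get, reverse=True)
    addresses = {}
    addr = base
    for fn in order:
        addresses[fn] = addr
        addr += sizes[fn]
    return order, addresses


# Execution counts as given on the slides.
counts = {"f1": 1100, "f2": 900, "f3": 400, "f4": 2000, "f5": 700}
# Code sizes in bytes: invented for illustration, not from the slides.
sizes = {"f1": 256, "f2": 128, "f3": 512, "f4": 64, "f5": 192}

order, addresses = layout_by_count(counts, sizes)
print(order)
print(addresses)
```

The heuristic only looks at how often each function runs. Two functions that call each other frequently may still end up far apart, which is the limitation the call-graph based approach addresses.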

Code Layout Transformations (3)
Call-Graph Based Algorithm:
• Create weighted call-graph.
• Place functions according to weighted depth-first traversal: f4 > f2 > f1 > f3 > f5
Increases spatial locality.
[Figure: call graph of f1 … f5; placement so far: f4 (2000)]
[W. W. Hwu et al.: Achieving high instruction cache performance with an optimizing compiler, 16th Annual International Symposium on Computer Architecture, 1989]

Code Layout Transformations (3)
Call-Graph Based Algorithm:
• Create weighted call-graph.
• Place functions according to weighted depth-first traversal: f4 > f2 > f1 > f3 > f5
Increases spatial locality.
[Figure: call graph of f1 … f5; placement so far: f4 (2000), f2 (900)]

Code Layout Transformations (4)
Call-Graph Based Algorithm:
• Create weighted call-graph.
• Place functions according to weighted depth-first traversal: f4 > f2 > f1 > f3 > f5
Increases spatial locality.
[Figure: call graph of f1 … f5; placement so far: f4 (2000), f2 (900), f1 (1100)]

Code Layout Transformations (5)
Call-Graph Based Algorithm:
• Create weighted call-graph.
• Place functions according to weighted depth-first traversal: f4 > f2 > f1 > f3 > f5
Increases spatial locality.
[Figure: call graph of f1 … f5; placement so far: f4 (2000), f2 (900), f1 (1100), f3 (400)]

Code Layout Transformations (6)
Call-Graph Based Algorithm:
• Create weighted call-graph.
• Place functions according to weighted depth-first traversal: f4 > f2 > f1 > f3 > f5
• Combined with placing frequently executed traces at the top of the code space for functions.
Increases spatial locality.
[Figure: call graph of f1 … f5; final placement: f4 (2000), f2 (900), f1 (1100), f3 (400), f5 (700)]
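A weighted depth-first traversal can be sketched as below. The edge weights of the call graph did not survive in this transcript, so the graph in the example is hypothetical: it was chosen only so that the traversal reproduces the order f4, f2, f1, f3, f5 shown on the slides. The function name `weighted_dfs_layout` is likewise an assumption.

```python
def weighted_dfs_layout(call_graph, root):
    """Return a placement order: depth-first, heaviest call edge first."""
    order, seen = [], set()

    def visit(fn):
        seen.add(fn)
        order.append(fn)
        callees = call_graph.get(fn, {})
        # Follow the most frequently taken call edge first, so that a caller
        # and its hottest callee end up adjacent in memory.
        for callee in sorted(callees, key=callees.get, reverse=True):
            if callee not in seen:
                visit(callee)

    visit(root)
    return order


# Hypothetical call graph: caller -> {callee: number of calls}.
call_graph = {
    "f4": {"f2": 40, "f5": 10},
    "f2": {"f1": 30},
    "f1": {"f3": 5},
    "f5": {"f3": 2},
}

print(weighted_dfs_layout(call_graph, "f4"))
```

Compared with sorting by execution count alone, the traversal keeps callers next to their most frequent callees, which is where the additional spatial locality comes from.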

Prefetching
• Prefetch instructions load values into the cache; pipeline not stalled for prefetching
• Prefetching instructions introduced in ~
• Potentially, all miss latencies can be avoided
• Disadvantages:
  • Increased # of instructions
  • Potential premature eviction of cache line
  • Potentially pre-loads lines that are never used
• Steps:
  • Determination of references requiring prefetches
  • Insertion of prefetches (early enough!)
[R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures, Morgan Kaufmann, 2002]
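"Early enough" is commonly quantified as a prefetch distance: the number of loop iterations by which a prefetch must run ahead so that the data arrives before it is used. The rule of thumb used below, distance = ceil(miss latency / cycles per iteration), is a widely used heuristic and is not taken from these slides; the function names and the numbers in the example are assumptions.

```python
def prefetch_distance(miss_latency, cycles_per_iteration):
    """Iterations to prefetch ahead so that a miss is fully hidden."""
    return -(-miss_latency // cycles_per_iteration)  # ceiling division


def prefetch_schedule(n_iterations, distance):
    """Pair each iteration i with the element prefetched during it (or None)."""
    return [
        (i, i + distance if i + distance < n_iterations else None)
        for i in range(n_iterations)
    ]


d = prefetch_distance(miss_latency=100, cycles_per_iteration=30)
print(d)
print(prefetch_schedule(6, d))
```

In this schedule the first `d` elements are never prefetched inside the loop, so a real transformation would add a prologue for them. The trailing `None` entries show where prefetches are suppressed to avoid loading lines that are never used.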

Results for prefetching
[Figure: results by Mowry, as cited by R. Allen & K. Kennedy; © Morgan Kaufmann, 2002]
Not very impressive!

Memory Aware Compilation and Simulation Framework (for C): MACC
[M. Verma, L. Wehmeyer, R. Pyka, P. Marwedel, L. Benini: Compilation and Simulation Tool Chain for Memory Aware Energy Optimizations, Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS VI), 2006]
Figure labels: Application C code; Source-level memory optimizer; encc, ARM gcc, M5 DSP; Array partitioning; SPM overlay; Executable binary; Energy database; Memory hierarchy description; Compilation Framework; Profile report; Memory simulator; Processor simulators (ARM7/M5); Profiler; Simulation Framework; MPSoC simulator

Memory architecture MACCv2
• Query can include address, time stamp, value, …
• Query can request energy, delay, stored values
• Query processed along a chain of HW components, incl. busses, ports, address translations etc., each adding delay & energy
• API query to model simplifies integration into compiler
• External XML representation
[R. Pyka et al.: Versatile System level Memory Description Approach for embedded MPSoCs, University of Dortmund, Informatik 12, 2007]
[Figure: a request (REQ: Energy = ? Cycles = ?) travels along a chain of components that add +10 energy / +5 cycles, +1 energy / +2 cycles, and +1 energy / +0 cycles. Labels: CPU1, MM, ASPC-1 (IFETCH, DRD, DWR), MAINAS, ASPC-M (0…ffff), ASPC-B (0 … 3ffff)]
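The idea of a query that accumulates cost along a chain of hardware components can be sketched as follows. This does not reproduce the MACC API: the `Component` class, the `query` function, and the component names are hypothetical. Only the three cost pairs (+10/+5, +1/+2, +1/+0) come from the figure, and which component contributes which pair is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str      # hypothetical label
    energy: int    # energy added by this component (units as in the figure)
    cycles: int    # delay added by this component


def query(chain):
    """Sum energy and cycles over all components a request passes through."""
    return sum(c.energy for c in chain), sum(c.cycles for c in chain)


chain = [
    Component("main memory", energy=10, cycles=5),
    Component("bus", energy=1, cycles=2),
    Component("port", energy=1, cycles=0),
]

energy, cycles = query(chain)
print(f"Energy = {energy}, Cycles = {cycles}")
```

A compiler pass could call such a query for each candidate memory region and compare the results, which is the kind of integration the API is meant to simplify.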

Introduction of Memory Architecture-Aware Optimization
The MACC PMS (Processor/Memory/Switch) Model
• Explicit memory architecture
• API provides access to memory information
Figure labels: MACC_System; CPU1, CPU2, CPU3; SPM; L1$; L2$; BRI; BUS1, BUS2; MM1, MM2, MM3; C code

MaCC Modeling Example via GUI

Memory hierarchies beyond main memory
• Massive datasets are being collected everywhere
• Storage management software is billion-$ industry
Examples (2002):
• Phone: AT&T 20TB phone call database, wireless tracking
• Consumer: WalMart 70TB database, buying patterns
• WEB: Web crawl of 200M pages and 2000M links, Akamai stores 7 billion clicks per day
• Geography: NASA satellites generate 1.2TB per day
More New Information Over Next 2 Years Than in All Previous History
[© Lars Arge, I/O-Algorithms]

Example: LIDAR Terrain Data
• ~1.2 km
• ~ 280 km/h at m
• ~ 1.5 m between measurements
COWI A/S (and others) is currently scanning Denmark
[© Lars Arge, I/O-Algorithms]

Application Example: Flooding Prediction
[Figure: flooded areas for +1 meter and +2 meter]
[© Lars Arge, I/O-Algorithms]

Compilers for embedded systems
Book section 7.3
• Introduction
• Energy-aware compilation
• Memory-architecture-aware compilation
• Reconciling compilers and timing analysis
• Compilation for digital signal processors
• Compilation for VLIW processors
• Compiler generation, retargetable compilers, design space exploration

Worst case execution time (1)
Definition of worst case execution time:
WCET_EST must be
1. safe (i.e. ≥ WCET) and
2. tight (WCET_EST - WCET ≪ WCET_EST)
[Figure: distribution of possible execution times over time, marking BCET_EST, BCET, WCET, WCET_EST and the time constraint; annotations: timing predictability, worst-case performance, worst-case guarantee]
© Graphics: adopted from R. Wilhelm + Microsoft Cliparts

Current Trial-and-Error Based Development
1. Specification of CPS/ES system
2. Generation of code (ANSI-C or similar)
3. Compilation of code
4. Execution and/or simulation of code, using a (e.g. random) set of input data
5. Measurement-based computation of "estimated worst case execution time" (WCET_meas)
6. Adding safety margin m on top of WCET_meas: WCET_hypo := (1 + m) * WCET_meas
7. If "WCET_hypo" > deadline: change some detail, go back to 1 or 2.
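Steps 4 to 6 can be sketched in a few lines. The task model `run_time`, its cost figures, and the chosen margin are invented for illustration; the point is only that a measured maximum plus a margin depends entirely on which inputs happened to be tested.

```python
def run_time(x):
    """Hypothetical task: execution time in cycles as a function of the input."""
    return 100 + (900 if x == 0 else x % 50)


def wcet_hypo(measured_times, margin):
    """Step 5 and 6: measured maximum plus safety margin m."""
    wcet_meas = max(measured_times)
    return (1 + margin) * wcet_meas


# Step 4: the test inputs happen to miss the rare worst-case input x == 0.
test_inputs = range(1, 201)
estimate = wcet_hypo([run_time(x) for x in test_inputs], margin=0.2)

true_wcet = run_time(0)
print(estimate, true_wcet, estimate >= true_wcet)
```

Here the estimate stays far below the true worst case even with a 20% margin, because no tested input triggered the expensive path.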

Problems with this Approach
Dependability
• Computed "WCET_hypo" not a safe approximation
• Time constraint may be violated
Design time
• How to find necessary changes?
• How many iterations until successful?
"Make the common case fast" a wrong approach for RT-systems
• Computer architecture and compiler techniques focus on average speed
• Circuit designers know it's wrong
• Compiler designers (typically) don't
"Optimizing" compilers unaware of cost functions other than code size
Figure label: Common case fast

Integration of WCET estimation and compilation
Computing WCET_EST after code generation is too late.
Why not consider WCET_EST as an objective function already in the compiler?
=> Integration of aiT and compiler

Challenges for WCET_EST-Minimization
Worst-Case Execution Path (WCEP)
• WCET_EST of a program = length of longest execution path (WCEP) in that program
• WCET_EST-minimization: reduction of the longest path
• Other optimizations do not result in a reduction of WCET_EST
• Optimizations need to know the WCEP
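The role of the WCEP can be illustrated with a longest-path computation on a small acyclic control-flow graph. This is a strongly simplified sketch: real WCET analysis (e.g. with aiT) also handles loop bounds and hardware state, and the graph, block costs, and function name `wcep` below are hypothetical.

```python
from functools import lru_cache


def wcep(cfg, cost, entry):
    """Return (length, path) of the longest path from entry in an acyclic CFG."""

    @lru_cache(maxsize=None)
    def longest(node):
        best_len, best_path = 0, ()
        for succ in cfg.get(node, ()):
            length, path = longest(succ)
            if length > best_len:
                best_len, best_path = length, path
        return cost[node] + best_len, (node,) + best_path

    return longest(entry)


# Hypothetical if/else diamond with per-block cycle counts.
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"]}
cost = {"entry": 5, "then": 40, "else": 10, "exit": 5}

print(wcep(cfg, cost, "entry"))
# Speeding up a block that is not on the WCEP leaves the bound unchanged.
print(wcep(cfg, {**cost, "else": 2}, "entry"))
# Speeding up a block on the WCEP reduces the bound.
print(wcep(cfg, {**cost, "then": 20}, "entry"))
```

Note that after shortening a block on the WCEP, a different path may become the longest one, so an optimization has to re-evaluate the WCEP after each change.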

WCET-oriented optimizations
• Extended loop analysis (CGO 09)
• Instruction cache locking (CODES/ISSS 07, CGO 12)
• Cache partitioning (WCET-Workshop 09)
• Procedure cloning (WCET-WS 07, CODES 07, SCOPES 08)
• Procedure/code positioning (ECRTS 08, CASES 11 (2x))
• Function inlining (SMART 09, SMART 10)
• Loop unswitching/invariant paths (SCOPES 09)
• Loop unrolling (ECRTS 09)
• Register allocation (DAC 09, ECRTS 11)
• Scratchpad optimization (DAC 09)
• Extension towards multi-objective optimization (RTSS 08)
• Superblock-based optimizations (ICESS 10)
• Surveys (Springer 10, Springer 12)

Loop Unrolling as an Example
• Unrolling replaces the original loop with several instances of the loop body
• Positive effects
  • Reduced overhead for loop control
  • Enables instruction level parallelism (ILP)
  • Offers potential for following optimizations
  => Unroll early in optimization chain
• Negative effects
  • Aggressive unrolling leads to I-cache overflows
  • Additional spill code instructions
  • Control code may cancel positive effects
  => Consequences of transformation hardly known
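The transformation itself can be shown on a small example. A compiler applies it to its intermediate representation; Python is used here only to make the before/after shape visible, and the function names and the unroll factor of 4 are assumptions.

```python
def dot(a, b):
    """Original loop: one loop-control check per element."""
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """Same computation, unrolled by 4: one loop-control check per 4 elements."""
    n = len(a)
    main = n - n % 4          # largest multiple of 4 not exceeding n
    total = 0
    for i in range(0, main, 4):
        total += a[i] * b[i]
        total += a[i + 1] * b[i + 1]
        total += a[i + 2] * b[i + 2]
        total += a[i + 3] * b[i + 3]
    for i in range(main, n):  # remainder loop: extra control code
        total += a[i] * b[i]
    return total


a = list(range(10))
b = list(range(10, 20))
print(dot(a, b), dot_unrolled4(a, b))
```

The unrolled version has roughly four times the body code plus a remainder loop, which is the code-size growth and additional control code listed among the negative effects.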

WCET_EST-aware Loop Unrolling via Back-annotation
• WCET_EST information available at assembly level
• Unrolling to be applied at internal representation of source code
• Solution: back-annotation. The experimental WCET_EST-aware compiler WCC allows copying information from assembly code to source code:
  • WCET_EST data
  • Assembly code size
  • Amount of spill code
• Memory architecture info available
Figure labels: High-Level IR (Source Code), Low-Level IR (Assembly Code), Back-Annotation, Mem. Spec., Code generation

Results for unrolling: WCET_EST
• 100%: avg. WCET_EST for all benchmarks with -O3 & no unrolling
• WCET_EST reduction between 10.2% and 15.4%
• WCET_EST-driven unrolling outperforms standard unrolling by 13.7%
[Figure: relative WCET [%]; series label: STANDARD]

WCET-aware superblock optimizations
• WCC's superblocks: proposed 1st time at ANSI-C code level, rely on WCET_EST timing data
• WCET_EST-aware superblock optimizations: Common Subexpression Elimination (CSE) and Dead Code Elimination (DCE) ported to WCC
[Figure: WCET_EST-aware superblock optimizations]

Results: Register allocation
• Registers = fastest level in the memory hierarchy
• Interest in good global register allocation techniques
• Frequently based on coloring of interference graph
[Figure: lifetimes of v1, v2, v3 and the corresponding interference graph]
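Coloring an interference graph can be sketched as below: two variables interfere if their lifetimes overlap, and interfering variables must not share a register. The lifetimes are hypothetical (the figure's values are not in the transcript), and the greedy highest-degree-first order is one simple heuristic, not the WCET-aware allocator discussed in these slides.

```python
def interference_graph(lifetimes):
    """Build an undirected graph: edge if two half-open lifetimes overlap."""
    names = list(lifetimes)
    graph = {v: set() for v in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (start_a, end_a), (start_b, end_b) = lifetimes[a], lifetimes[b]
            if start_a < end_b and start_b < end_a:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def color(graph, num_registers):
    """Greedy coloring, highest degree first; None marks a spilled variable."""
    assignment = {}
    for v in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {assignment[n] for n in graph[v] if n in assignment}
        free = [r for r in range(num_registers) if r not in used]
        assignment[v] = free[0] if free else None
    return assignment


# Hypothetical lifetimes as (first definition, last use) program points.
lifetimes = {"v1": (0, 4), "v2": (2, 6), "v3": (5, 8)}
graph = interference_graph(lifetimes)
print(color(graph, num_registers=2))
print(color(graph, num_registers=1))
```

With two registers, v1 and v3 share one register because their lifetimes do not overlap. With a single register, some variables are marked `None`, which stands for spilling to memory.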

Register Allocation
100% = WCET_EST using Standard Graph Coloring (highest degree)
[Figure: relative WCET_EST; values labeled in the figure: 93%, 24%, 69%]
[H. Falk]

Improving predictability for caches
• Loop caches
• Mapping code to less used part(s) of the index space
• Cache locking/freezing
• Changing the memory allocation for code or data
• Mapping pieces of software to specific ways
  Methods:
  - Way prediction in hardware
  - Generating appropriate way in software
  - Allocation of certain parts of the address space to a specific way
  - Including way-identifiers in virtual to real-address translation

Relative WCET_EST with I-Cache Locking
5 Benchmarks/ARM920T/Postpass-Opt (ARM920T)
[S. Plazar et al.]

Potential cache thrashing
[S. Plazar: Memory-based Optimization Techniques for Real-Time Systems, PhD thesis, TU Dortmund, June 2012]

Avoiding cache thrashing
Key idea: selectively allocate code to uncached memory
[S. Plazar: Memory-based Optimization Techniques for Real-Time Systems, PhD thesis, TU Dortmund, June 2012]
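The effect of moving conflicting code out of the cache can be illustrated with a toy direct-mapped cache model. This is not the algorithm from the cited thesis: the cache geometry, the addresses, the access trace, and the function name `simulate` are all hypothetical, chosen so that two functions map to the same cache line.

```python
def simulate(trace, num_lines, line_size, uncached=()):
    """Return (hits, misses, uncached accesses) for a direct-mapped cache."""
    tags = [None] * num_lines
    hits = misses = bypassed = 0
    for name, addr in trace:
        if name in uncached:
            bypassed += 1          # fixed, predictable cost; cache untouched
            continue
        block = addr // line_size
        index, tag = block % num_lines, block // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag      # replaces whatever was mapped here before
    return hits, misses, bypassed


# A and B are 4 KiB apart and therefore collide in a 256-line, 16-byte-line cache.
trace = [("A", 0x0000), ("B", 0x1000)] * 10

print(simulate(trace, num_lines=256, line_size=16))
print(simulate(trace, num_lines=256, line_size=16, uncached={"B"}))
```

With both functions cached, each access evicts the other function's line and nothing ever hits. With B placed in uncached memory, A misses once and then hits, while B pays a constant uncached access cost on every call.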

Results
[S. Plazar: Memory-based Optimization Techniques for Real-Time Systems, PhD thesis, TU Dortmund, June 2012]

Way prediction/selective direct mapping
[M. D. Powell, A. Agarwal, T. N. Vijaykumar, B. Falsafi, K. Roy: Reducing Set-Associative Cache Energy via Way-Prediction and Selective Direct-Mapping, MICRO-34, 2001]
© ACM

Results for the paper on way prediction (2)
© ACM

Conclusion
• SPM allocation:
  • MMUs
  • multi-cores
• Code-Layout-Transformations
• Prefetching
• Reconciling Compilers and Timing Analysis
• Improving Timing Predictability