Embedded System Hardware - Processing -



Presentation transcript:

Embedded System Hardware - Processing - Peter Marwedel, Informatik 12, TU Dortmund, Germany. © Springer, 2010. 2013/11/12. These slides use Microsoft clip arts. Microsoft copyright restrictions apply.

Embedded System Hardware: Embedded system hardware is frequently used in a loop (“hardware in a loop”) → cyber-physical systems

Efficiency: slide from lecture 1 applied to processing. CPS & ES must be efficient: code-size efficient (especially for systems on a chip); run-time efficient; weight efficient; cost efficient; energy efficient. © Graphics: Microsoft, P. Marwedel, M. Engel, 2011

Why care about energy efficiency? [Table: relevance during use, by execution platform. Columns: plugged (e.g. factory); uncharged periods (e.g. car); unplugged (e.g. sensor). Rows: global warming; cost of energy; increasing performance; problems with cooling, avoiding hot spots; avoiding high currents & metal migration; reliability; energy a very scarce resource. Side label: power. Check marks not preserved.] © Graphics: P. Marwedel, 2011

Should we care about energy consumption or about power consumption? Both are closely related, but still different

Should we care about energy consumption or about power consumption (2)? Minimizing power consumption is important for: design of the power supply & regulators; dimensioning of interconnect; short term cooling. Minimizing energy consumption is important due to: restricted availability of energy (mobile systems); cooling: high costs, limited space; thermal effects; dependability, long lifetimes. In general, we need to care about both.
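The difference between the two quantities can be sketched in a few lines of Python. This is a minimal illustration, assuming that energy is the time integral of power, approximated here by a rectangle sum over equally spaced samples. The two power traces and the function names are hypothetical and not taken from the slides.

```python
def energy_joules(power_watts, dt_seconds):
    """Approximate E = integral of P(t) dt by a rectangle sum over samples."""
    return sum(power_watts) * dt_seconds


def peak_power_watts(power_watts):
    """Peak power, which drives power-supply and interconnect dimensioning."""
    return max(power_watts)


# Hypothetical traces, one sample per millisecond.
bursty = [4.0, 0.0, 0.0, 0.0]  # short burst, then nothing
flat = [1.0, 1.0, 1.0, 1.0]    # same work spread out evenly

for name, trace in (("bursty", bursty), ("flat", flat)):
    print(name, energy_joules(trace, 1e-3), "J,", peak_power_watts(trace), "W peak")
```

Both traces draw the same energy from the battery, but the bursty one needs a supply dimensioned for four times the peak power, which is why both quantities matter.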

Energy Efficiency of different target platforms “inherent power efficiency of silicon“ © Hugo De Man, IMEC, Philips, 2007

Application Specific Circuits (ASICs) or Full Custom Circuits: The approach suffers from long design times, lack of flexibility (changing standards) and high costs (e.g. millions of $ in mask costs). Custom-designed circuits are necessary if ultimate speed or energy efficiency is the goal and large numbers can be sold. → HW synthesis is not covered in this course; let’s look at processors. © Graphics: M. Engel, 2012

Energy Efficiency of different target platforms “inherent power efficiency of silicon“ © Hugo De Man, IMEC, Philips, 2007

PCs: Problem: power density increasing (reference level in the figure: nuclear reactor). Prescott: 90 W/cm², 90 nm [c’t 4/2004]. © Intel, M. Pollack, Micro-32

PCs: Surpassed hot (kitchen) plate …? Why not use it? Strictly speaking, energy is not “consumed”, but converted from electrical energy into heat energy http://www.phys.ncku.edu.tw/~htsu/humor/fry_egg.html

PCs: Just adding transistors would have resulted in this: 2018 S. Borkar, A. Chien: The future of microprocessors, Communications of the ACM, May 2011 © ACM, 2011

Keep it simple, stupid (KISS) S. Borkar, A. Chien: The future of microprocessors, Communications of the ACM, May 2011 © ACM, 2011

Prerequisite: Static and dynamic power consumption. Dynamic power consumption: power consumption caused by charging capacitors (load capacitance CL at a CMOS output) when logic levels are switched. Decreasing Vdd reduces P quadratically. Static power consumption (caused by leakage current): power consumed in the absence of clock signals. Leakage is becoming more important due to smaller devices.

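How to make systems energy efficient: Fundamentals of dynamic voltage scaling (DVS). Power consumption of CMOS circuits (ignoring leakage): P = α · CL · Vdd² · f, with α: switching activity, CL: load capacitance, Vdd: supply voltage, f: clock frequency. Delay for CMOS circuits: τ = k · CL · Vdd / (Vdd − Vt)², with Vt: threshold voltage (Vt much smaller than Vdd). Decreasing Vdd reduces P quadratically, while the run-time of algorithms is only linearly increased.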
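The quadratic/linear trade-off can be checked numerically. The sketch below assumes the usual first-order CMOS model, P = α · CL · Vdd² · f for dynamic power and τ = k · CL · Vdd / (Vdd − Vt)² for gate delay. The formulas are stated here as assumptions, and all parameter values and function names are hypothetical.

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """First-order dynamic power of a CMOS circuit (leakage ignored)."""
    return alpha * c_load * vdd ** 2 * freq


def gate_delay(k, c_load, vdd, vt=0.0):
    """First-order CMOS gate delay; vt is the threshold voltage."""
    return k * c_load * vdd / (vdd - vt) ** 2


def energy_for_cycles(alpha, c_load, vdd, n_cycles):
    """Energy for a fixed number of clock cycles: P * (n_cycles / f).

    The clock frequency cancels, so the energy depends on Vdd squared only.
    """
    return alpha * c_load * vdd ** 2 * n_cycles


# Hypothetical operating points: nominal and half supply voltage.
nominal, scaled = 1.0, 0.5
power_ratio = dynamic_power(0.5, 1e-9, scaled, 1e8) / dynamic_power(0.5, 1e-9, nominal, 1e8)
delay_ratio = gate_delay(1.0, 1.0, scaled) / gate_delay(1.0, 1.0, nominal)
print(power_ratio, delay_ratio)
```

With the threshold voltage neglected, halving Vdd quarters the power at a given frequency while the delay only doubles. For a realistic Vt the delay grows faster than linearly as Vdd approaches Vt.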

Voltage scaling: Example © ACM, 2011 S. Borkar, A. Chien: The future of microprocessors, Communications of the ACM, May 2011

10 GHz clock: many cores unusable (“dark silicon”). (For servers.) © Babak Falsafi, 2010

Dynamic power management (DPM). Example: STRONGARM SA1100. States: RUN (400 mW): operational; IDLE (50 mW): a SW routine may stop the CPU when not in use, while monitoring interrupts; SLEEP (160 µW): shutdown of on-chip activity. Transitions: RUN → IDLE: 10 µs; IDLE → RUN: 10 µs; RUN → SLEEP: 90 µs (power fault signal); IDLE → SLEEP: 90 µs (power fault signal); SLEEP → RUN: 160 ms.
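A rough break-even calculation shows why the long wake-up time from SLEEP matters. The sketch below uses the power and transition figures of the SA1100 example. It additionally assumes that the CPU draws RUN power during every transition, which the slide does not state. The function names are hypothetical.

```python
POWER_W = {"RUN": 0.400, "IDLE": 0.050, "SLEEP": 160e-6}

TRANSITION_S = {
    ("RUN", "IDLE"): 10e-6,
    ("IDLE", "RUN"): 10e-6,
    ("RUN", "SLEEP"): 90e-6,
    ("SLEEP", "RUN"): 160e-3,
}


def energy_for_gap(low_state, gap_s, transition_power_w=POWER_W["RUN"]):
    """Energy over an idle gap when going RUN -> low_state -> RUN.

    Assumption: transitions are charged at RUN power.
    Returns None if the round trip does not fit into the gap.
    """
    overhead_s = TRANSITION_S[("RUN", low_state)] + TRANSITION_S[(low_state, "RUN")]
    if overhead_s > gap_s:
        return None
    return transition_power_w * overhead_s + POWER_W[low_state] * (gap_s - overhead_s)


def best_state(gap_s):
    """State with the lowest energy for an idle gap of gap_s seconds."""
    candidates = {"RUN": POWER_W["RUN"] * gap_s}
    for state in ("IDLE", "SLEEP"):
        energy = energy_for_gap(state, gap_s)
        if energy is not None:
            candidates[state] = energy
    return min(candidates, key=candidates.get)


for gap in (5e-6, 1e-3, 1.0, 2.0):
    print(gap, best_state(gap))
```

Under these assumptions SLEEP only pays off for idle gaps well above one second, because the 160 ms wake-up at RUN power has to be amortized first.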

Energy Efficiency of different target platforms “inherent power efficiency of silicon“ © Hugo De Man, IMEC, Philips, 2007

Low voltage, parallel operation more efficient than high voltage, sequential operation. Basic equations: power: P ~ VDD²; maximum clock frequency: f ~ VDD; energy to run a program: E = P × t, with t = runtime (fixed); time to run a program: t ~ 1/f. Changes due to parallel processing, with β operations per clock: clock frequency reduced to f’ = f / β; voltage can be reduced to VDD’ = VDD / β; power for parallel processing: P° = P / β² per operation; power for β operations per clock: P’ = β × P° = P / β; time to run a program is still t’ = t; energy required to run program: E’ = P’ × t = E / β. Argument in favour of voltage scaling and parallel processing. Rough approximations!
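The chain of approximations can be followed with concrete numbers. The sketch below simply applies the proportionalities from the slide in order. The starting values and the function name are hypothetical, and the result inherits the slide's "rough approximations" caveat.

```python
def parallel_scaling(p, f, vdd, t, beta):
    """Apply the slide's scaling rules for beta operations per clock."""
    f_new = f / beta              # f' = f / beta
    vdd_new = vdd / beta          # VDD' = VDD / beta, since f ~ VDD
    p_per_op = p / beta ** 2      # P° = P / beta², since P ~ VDD²
    p_new = beta * p_per_op       # P' = beta * P° = P / beta
    t_new = t                     # beta ops per clock at f / beta: same runtime
    e_new = p_new * t_new         # E' = P' * t = E / beta
    return {"f": f_new, "vdd": vdd_new, "p": p_new, "t": t_new, "e": e_new}


# Hypothetical sequential design: 1 W, 1 GHz, 1.2 V, 2 s runtime, so E = 2 J.
result = parallel_scaling(p=1.0, f=1e9, vdd=1.2, t=2.0, beta=4)
print(result)
```

With β = 4 the energy drops from 2 J to 0.5 J at unchanged runtime, which is the E / β result derived on the slide.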

More energy-efficient architectures: Domain- and application specific “inherent power efficiency of silicon“ © Hugo De Man: From the Heaven of Software to the Hell of Nanoscale Physics: An Industry in Transition, Keynote Slides, ACACES, 2007 Close to power efficiency of silicon

Energy-efficient architectures: Domain- and application specific “inherent power efficiency of silicon“ © Hugo De Man: From the Heaven of Software to the Hell of Nanoscale Physics: An Industry in Transition, Keynote Slides, ACACES, 2007 Close to power efficiency of silicon

Mobile phones: Increasing performance requirements C.H. van Berkel: Multi-Core for Mobile Phones, DATE, 2009; Workload [MOPs] Many more instances of the power/energy problem

Mobile phones: Where does the power go? [O. Vargas: Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design - September 2005] It’s not just I/O, don’t ignore processing!

Mobile phones: Where does the power go? (2) Mobile phone use, breakdown by type of computation: (geometry processing, rasterization, pixel shading); (display & camera processing, video (de)coding); (front-end, demodulation, decoding, protocol); (user interface, browsing, …). C.H. van Berkel: Multi-Core for Mobile Phones, DATE, 2009 (no explicit percentages in original paper). With special purpose HW! During use, all components & computations are relevant.

Mobile phones: Where is the energy consumed? According to International Technology Roadmap for Semiconductors (ITRS), 2010 update, [www.itrs.net]: current trends → violation of 0.5-1 W constraint for small mobiles; large mobiles: ~ 7 W. © ITRS, 2010

Energy-efficient architectures: Heterogeneous processors http://www.mpsoc-forum.org/2007/slides/Hattori.pdf → “Dark silicon” (not all silicon can be powered at the same time, due to current, power or temperature constraints)

ARM’s big.LITTLE as an example. Used in Samsung S4. © ARM, 2013

Key requirement #2: Code-size efficiency. Overview: http://www-perso.iro.umontreal.ca/~latendre/codeCompression/codeCompression/node1.html Compression techniques: key idea

Code-size efficiency. Compression techniques (continued): 2nd instruction set, e.g. ARM Thumb instruction set. 16-bit Thumb instruction ADD Rd #constant: 001 (major opcode) | 10 (minor opcode) | Rd (source = destination) | Constant. Dynamically decoded at run-time into the 32-bit ARM instruction: 1110 | 001 | 01001 | 0 Rd | 0 Rd | 0000 | Constant (Rd and the constant are zero extended). Reduction to 65-70 % of original code size; 130% of ARM performance with 8/16 bit memory; 85% of ARM performance with 32-bit memory. Same approach for LSI TinyRisc, … Requires support by compiler, assembler etc. [ARM, R. Gupta]
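The decoding step for this one instruction can be written down directly from the bit fields on the slide. The following is a minimal sketch for the ADD Rd #constant case only, not a Thumb decoder. The function name is hypothetical.

```python
def expand_thumb_add_imm(halfword):
    """Expand a 16-bit Thumb 'ADD Rd, #constant' into a 32-bit ARM word.

    Thumb layout: 001 | 10 | Rd (3 bits) | constant (8 bits).
    ARM layout:   1110 | 001 | 01001 | 0 Rd | 0 Rd | 0000 | constant.
    """
    if (halfword >> 11) != 0b00110:
        raise ValueError("not a Thumb ADD Rd, #constant encoding")
    rd = (halfword >> 8) & 0x7   # 3-bit register number, zero extended to 4 bits
    constant = halfword & 0xFF   # 8-bit immediate
    return (
        (0b1110 << 28)     # condition field: always
        | (0b001 << 25)    # immediate operand
        | (0b01001 << 20)  # ADD opcode with the set-flags bit
        | (rd << 16)       # first operand register (same as destination)
        | (rd << 12)       # destination register
        | constant         # rotate field stays 0000
    )


# Example: Thumb halfword 0x3105 encodes ADD R1, #5.
print(hex(expand_thumb_add_imm(0x3105)))
```

The 16-bit form gives up the condition field, the high registers and the rotate field, which is where the code-size saving comes from.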

Dictionary approach, two level control store (indirect addressing of instructions). “Dictionary-based coding schemes cover a wide range of various coders and compressors. Their common feature is that the methods use some kind of a dictionary that contains parts of the input sequence which frequently appear. The encoded sequence in turn contains references to the dictionary elements rather than containing these over and over.” [Á. Beszédes et al.: Survey of Code-Size Reduction Methods, ACM Computing Surveys, Vol. 35, Sept. 2003, pp 223-267]

Key idea (for d-bit instructions): For each of the a instruction addresses, S contains the b-bit table address of the instruction (b ≪ d). The table of used instructions (“dictionary”) holds c ≤ 2^b entries of d bits each, with c small, and feeds the CPU. Uncompressed storage of a d-bit-wide instructions requires a × d bits. In compressed code, each instruction pattern is stored only once. Hopefully, a × b + c × d < a × d. Called nanoprogramming in the Motorola 68000.
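The size comparison can be tried out on a toy program. The sketch below builds the index store S and the dictionary and then evaluates a × d against a × b + c × d. The instruction words and function names are hypothetical, and b is chosen as the smallest index width that can address all c dictionary entries.

```python
def compress(instructions):
    """Return (indices, dictionary): each distinct word is stored only once."""
    dictionary = []
    index_of = {}
    indices = []
    for word in instructions:
        if word not in index_of:
            index_of[word] = len(dictionary)
            dictionary.append(word)
        indices.append(index_of[word])
    return indices, dictionary


def sizes_in_bits(instructions, d):
    """Return (uncompressed, compressed) storage in bits for d-bit instructions."""
    indices, dictionary = compress(instructions)
    a, c = len(indices), len(dictionary)
    b = max(1, (c - 1).bit_length())  # smallest b with c <= 2**b
    return a * d, a * b + c * d


# Hypothetical 32-bit program with many repeated instruction words.
program = [0xE2911005, 0xE1A00000, 0xE2911005, 0xE1A00000,
           0xE2911005, 0xE1A00000, 0xE3A00001, 0xE1A00000]
indices, dictionary = compress(program)
print(sizes_in_bits(program, d=32))
```

Here a = 8, c = 3 and b = 2, so 256 bits shrink to 112 bits. For a program with few repeated words the dictionary overhead c × d can outweigh the saving, hence the "hopefully" on the slide.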

Key requirement #3: Run-time efficiency - Domain-oriented architectures - Example: Filtering in Digital signal processing (DSP) Signal at t=ts (sampling points)

Filtering in digital signal processing (ADSP 2100)

-- outer loop over
-- sampling times ts
{ MR:=0; A1:=1; A2:=s-1;
  MX:=w[s]; MY:=a[0];
  for (k=0; k <= (n−1); k++)
    { MR:=MR + MX * MY;
      MX:=w[A2]; MY:=a[A1];
      A1++; A2--; }
  x[s]:=MR; }

Maps nicely

DSP-Processors: multiply/accumulate (MAC) and zero-overhead loop (ZOL) instructions

MR:=0; A1:=1; A2:=s-1; MX:=w[s]; MY:=a[0];
for (k:=0; k <= n-1; k++)
  { MR:=MR+MX*MY; MY:=a[A1]; MX:=w[A2]; A1++; A2-- }

The loop body is a multiply/accumulate (MAC) instruction. A zero-overhead loop (ZOL) instruction precedes the MAC instruction. Loop testing is done in parallel to MAC operations.
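The register-level pseudocode can be mirrored in Python to check what it computes. This sketch assumes the filter is x[s] = sum over k = 0..n-1 of w[s-k] * a[k], which is what the pseudocode implies, and it requires s >= n-1. The function name is hypothetical, and the variable names follow the register names on the slide.

```python
def fir_output(w, a, s):
    """Compute one output sample x[s] the way the MAC loop does.

    mr accumulates, mx/my hold the operands for the next multiply,
    a1/a2 play the role of the address registers A1 and A2.
    """
    n = len(a)
    mr = 0
    a1, a2 = 1, s - 1
    mx, my = w[s], a[0]
    for _ in range(n):
        mr += mx * my              # multiply/accumulate
        if a1 < n:                 # skip the final, unused operand fetch
            mx, my = w[a2], a[a1]  # fetch operands for the next iteration
        a1 += 1
        a2 -= 1
    return mr


# Hypothetical input samples and coefficients.
w = [1, 2, 3, 4, 5]
a = [1, 0, -1]
print(fir_output(w, a, s=4))
```

On the DSP the multiply/accumulate, both operand fetches and both address updates of one iteration happen in a single instruction, and the loop test costs nothing. The Python version only reproduces the arithmetic.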

Heterogeneous registers. Example (ADSP 210x): [Figure: memories P and D; address registers A0, A1, A2, .. with address generation unit (AGU; +,-); arithmetic unit (+,-,..) with registers AX, AY, AF, AR; multiplier (*) with registers MX, MY, MF, MR.] Different functionality of registers An, AX, AY, AF, MX, MY, MF, MR.

Separate address generation units (AGUs). Example (ADSP 210x): Data memory can only be fetched with address contained in A, but this can be done in parallel with operation in main data path (takes effectively 0 time). A := A ± 1 also takes 0 time, same for A := A ± M; A := <immediate in instruction> requires extra instruction → minimize load immediates → optimization in optimization chapter.

Modulo addressing. Modulo addressing: Am++ ≡ Am:=(Am+1) mod n (implements ring or circular buffer in memory, holding a sliding window of the n most recent values of w(t)). Memory at t=t1: .., w[t1-1], w[t1], w[t1-n+1], w[t1-n+2], .. Memory at t2=t1+1: .., w[t1-1], w[t1], w[t1+1], w[t1-n+2], .. (the new value w[t1+1] overwrites the oldest value w[t1-n+1]).
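A software version of the same addressing scheme looks as follows. It is a minimal sketch in which the attribute am stands in for the address register Am. The class and method names are hypothetical.

```python
class RingBuffer:
    """Sliding window over the n most recent samples using modulo addressing."""

    def __init__(self, n):
        self.n = n
        self.mem = [0] * n
        self.am = 0  # next write position, plays the role of register Am

    def push(self, sample):
        """Overwrite the oldest sample, then advance: Am := (Am + 1) mod n."""
        self.mem[self.am] = sample
        self.am = (self.am + 1) % self.n

    def most_recent(self):
        """Return the n stored samples, newest first."""
        return [self.mem[(self.am - 1 - k) % self.n] for k in range(self.n)]


window = RingBuffer(3)
for sample in (1, 2, 3, 4):
    window.push(sample)
print(window.mem, window.most_recent())
```

No sample is ever moved in memory. Only the address wraps around, which a DSP address generation unit does without extra instructions.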

Saturating arithmetic: returns largest/smallest number in case of over/underflows. Example: a = 0111, b = 1001. a + b: standard wrap-around arithmetic: (1)0000; saturating arithmetic: 1111. (a+b)/2: correct: 1000; wrap-around arithmetic: 0000; saturating arithmetic + shifted: 0111 (“almost correct”). Appropriate for DSP/multimedia applications: no timeliness of results if interrupts are generated for overflows; precise values less important; wrap-around arithmetic would be worse.
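The 4-bit example can be reproduced with two small helper functions. The sketch below covers unsigned operands only, and the function names are hypothetical.

```python
def wrap_add(a, b, bits=4):
    """Unsigned addition that discards the carry out of the top bit."""
    return (a + b) & ((1 << bits) - 1)


def sat_add(a, b, bits=4):
    """Unsigned addition that clamps to the largest representable value."""
    return min(a + b, (1 << bits) - 1)


a, b = 0b0111, 0b1001
print(format(wrap_add(a, b), "04b"), format(sat_add(a, b), "04b"))

# Average (a + b) / 2 computed as add followed by a right shift.
print(format((a + b) >> 1, "04b"),        # exact result
      format(wrap_add(a, b) >> 1, "04b"), # wrap-around result
      format(sat_add(a, b) >> 1, "04b"))  # saturated result
```

The saturated average is off by one from the exact value 1000, while the wrap-around average is off by eight. That is the "almost correct" behaviour the slide refers to.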

Example MATLAB Demo

Fixed-point arithmetic Shifting required after multiplications and divisions in order to maintain binary point.
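The need for the shift can be seen in a small example. The sketch below assumes an unsigned format with frac_bits fractional bits. The format, the values and the function names are hypothetical.

```python
def to_fixed(x, frac_bits):
    """Convert a real number to an integer with frac_bits fractional bits."""
    return round(x * (1 << frac_bits))


def to_float(value, frac_bits):
    """Interpret a fixed-point integer as a real number."""
    return value / (1 << frac_bits)


def fx_mul(a, b, frac_bits):
    """The raw product has 2 * frac_bits fractional bits: shift right to restore."""
    return (a * b) >> frac_bits


def fx_div(a, b, frac_bits):
    """The raw quotient would have 0 fractional bits: pre-shift the dividend."""
    return (a << frac_bits) // b


F = 4                     # 4 fractional bits
x = to_fixed(1.5, F)      # stored as 24
y = to_fixed(2.25, F)     # stored as 36
product = fx_mul(x, y, F)
print(product, to_float(product, F))
```

Without the shift the product 24 * 36 = 864 would be read as 54.0 instead of 3.375, because the binary point would sit four bits too far to the right.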

Real-time capability: Timing behavior has to be predictable. Features that cause problems: unpredictable access to shared resources; caches with difficult to predict replacement strategies; unified caches (conflicts between instructions and data); pipelines with difficult to predict stall cycles ("bubbles"); unpredictable communication times for multiprocessors; branch prediction, speculative execution; interrupts that are possible any time; memory refreshes that are possible any time; instructions that have data-dependent execution times. → Trying to avoid as many of these as possible. [Dagstuhl workshop on predictability, Nov. 17-19, 2003]

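Multiple memory banks or memories. [Figure: data memory D; address registers A0, A1, A2, .. with address generation unit (AGU; +,-); arithmetic unit (+,-,..) with registers AX, AY, AF, AR; multiplier (*) with registers MX, MY, MF, MR.] Simplifies parallel fetches.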

Multimedia instructions, short vector extensions, streaming extensions, SIMD instructions. Multimedia instructions exploit that many registers, adders etc. are quite wide (32/64 bit), whereas most multimedia data types are narrow. 2-8 values can be stored per register and added. E.g.: (a1, a2) + (b1, b2) = (c1, c2), each pair packed into 32 bits: 2 additions per instruction; no carry at bit 16. Cheap way of using parallelism. Examples: SSE instruction set extensions, SIMD instructions.
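The effect of suppressing the carry at bit 16 can be emulated in software. The sketch below packs two 16-bit lanes into one 32-bit word and adds them lane by lane with a well-known masking trick. It illustrates the idea only and does not model any particular instruction set. The function names are hypothetical.

```python
LOW15 = 0x7FFF7FFF  # the low 15 bits of each 16-bit lane
TOP = 0x80008000    # the top bit of each 16-bit lane


def pack(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack(word):
    """Split a 32-bit word into its (hi, lo) 16-bit lanes."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def packed_add16(x, y):
    """Add two pairs of 16-bit lanes without a carry crossing bit 16."""
    low = (x & LOW15) + (y & LOW15)    # carries stop at each lane's top bit
    return low ^ ((x ^ y) & TOP)       # add the top bits without carry out


x = pack(0x0001, 0xFFFF)
y = pack(0x0002, 0x0001)
print(hex(packed_add16(x, y)), hex((x + y) & 0xFFFFFFFF))
```

The plain 32-bit addition lets the overflow of the low lane spill into the high lane, whereas the packed addition keeps both lanes independent. A SIMD adder achieves the same by cutting the carry chain in hardware.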

Summary. Hardware in a loop: sensors; discretization; information processing (importance of energy efficiency; special purpose HW very expensive; energy efficiency of processors; code size efficiency; run-time efficiency; MPSoCs); D/A converters; actuators.

SPARES

Voltage scaling: Example Vdd [Courtesy, Yasuura, 2000]

Variable-voltage/frequency example: INTEL Xscale From Intel’s Web Site