Embedded System Hardware - Processing -

Slides:

Advertisements

Ähnliche Präsentationen

Digital Output Board and Motherboard

Advertisements

Cadastre for the 21st Century – The German Way

Service Oriented Architectures for Remote Instrumentation

Finding the Pattern You Need: The Design Pattern Intent Ontology

E-Solutions mySchoeller.com for Felix Schoeller Imaging

SION Vacuum Circuit-Breakers 3AE5 and 3AE1

Embedded System Hardware

DNS-Resolver-Mechanismus

P. Marwedel Informatik 12, U. Dortmund

Finite state machines & message passing: SDL

PPTmaster_BRC_ pot Rexroth Inline compact I/O technology in your control cabinet SERCOS III Components Abteilung; Vor- und Nachname.

R. Zankl – Ch. Oelschlegel – M. Schüler – M. Karg – H. Obermayer R. Gottanka – F. Rösch – P. Keidler – A. Spangler th Expert Meeting Business.

Herzlich Willkommen zum Informations-Forum: SAP Interoperabilität

Dr. M. Schlottke Common Description of Web Services Source: P. Gerbert, 2002 Web Services allow to establish B2B e-commerce on the fly Web Services allow.

Embedded & Real-time Operating Systems

Fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/17 Graphics:

fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:

Fakultät für informatik informatik 12 technische universität dortmund Mapping of Applications to Platforms Peter Marwedel TU Dortmund, Informatik 12 Germany.

Fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2010/01/13 Graphics:

Fakultät für informatik informatik 12 technische universität dortmund Universität Dortmund Middleware Peter Marwedel TU Dortmund, Informatik 12 Germany.

Embedded System Hardware - Reconfigurable Hardware -

Fakultät für informatik informatik 12 technische universität dortmund Specifications Peter Marwedel TU Dortmund, Informatik 12 Graphics: © Alexandra Nolte,

Peter Marwedel TU Dortmund, Informatik 12

Fakultät für informatik informatik 12 technische universität dortmund Embedded System Hardware Peter Marwedel Informatik 12 TU Dortmund Germany 2011/03/09.

Fakultät für informatik informatik 12 technische universität dortmund Hardware/Software Partitioning Peter Marwedel Informatik 12 TU Dortmund Germany Chapter.

Fakult ä t f ü r informatik informatik 12 technische universit ä t dortmund Data flow models Peter Marwedel TU Dortmund, Informatik 12 Graphics: © Alexandra.

Technische universität dortmund fakultät für informatik informatik 12 Embedded System Hardware Peter Marwedel Informatik 12 TU Dortmund Germany

Embedded System Hardware - Processing -

Fakultät für informatik informatik 12 technische universität dortmund Optimizations - Compilation for Embedded Processors - Peter Marwedel TU Dortmund.

Technische universität dortmund fakultät für informatik informatik 12 Embedded System Hardware - Processing - Peter Marwedel Informatik 12 TU Dortmund.

Wozu die Autokorrelationsfunktion?

Hier wird Wissen Wirklichkeit Computer Architecture – Part 4 – page 1 of 35 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt Part 4 Fundamentals.

Hier wird Wissen Wirklichkeit Computer Architecture – Part 10 – page 1 of 31 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt Part 10 Thread and.

Hier wird Wissen Wirklichkeit Computer Architecture – Part 5 – page 1 of 25 – Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting Part 5 Fundamentals in.

Lancing: What is the future? Lutz Heinemann Profil Institute for Clinical Research, San Diego, US Profil Institut für Stoffwechselforschung, Neuss Science.

Thomas Herrmann Software - Ergonomie bei interaktiven Medien Step 6: Ein/ Ausgabe Instrumente (Device-based controls) Trackball. Joystick.

5th IAEA Technical Meeting on ECRH Gandhinagar – February 2009

Seminar Telematiksysteme für Fernwartung und Ferndiagnose Basic Concepts in Control Theory MSc. Lei Ma 22 April, 2004.

Institut für Umweltphysik/Fernerkundung Physik/Elektrotechnik Fachbereich 1 Pointing Meeting Nov 2006 S. Noël IFE/IUP Elevation and Azimuth Jumps during.

Institut AIFB, Universität Karlsruhe (TH) Forschungsuniversität gegründet 1825 Towards Automatic Composition of Processes based on Semantic.

Institut für Solare Energieversorgungstechnik Verein an der Universität Kassel Bereich Energetische Biomassenutzung, Hanau Dipl.-Ing. J. Müller Bioturbine,

Lehrstuhl Technische Informatik - Computer Engineering Brandenburgische Technische Universität Cottbus 1 Hierarchical Test Technology for Systems on a.

Sanjay Patil Standards Architect – SAP AG April 2008

A good view into the future Presented by Walter Henke BRIT/SLL Schweinfurt, 14. November 2006.

BAS5SE | Fachhochschule Hagenberg | Daniel Khan | S SPR5 MVC Plugin Development SPR6P.

INTAKT- Interkulturelle Berufsfelderkundungen als ausbildungsbezogene Lerneinheiten in berufsqualifizierenden Auslandspraktika DE/10/LLP-LdV/TOI/

Institut für Umweltphysik/Fernerkundung Physik/Elektrotechnik Fachbereich 1 K. Bramstedt, L. Amekudzi, J. Meyer IFE/IUP Tangent heights in occultation.

Berner Fachhochschule Hochschule für Agrar-, Forst- und Lebensmittelwissenschaften HAFL Recent activities on ammonia emissions: Emission inventory Rindvieh.

Ein Projekt des Technischen Jugendfreizeit- und Bildungsvereins (tjfbv) e.V. kommunizieren.de Blended Learning for people with disabilities.

Image Processing and Analysis Introduction. How do we see things ?

Fakultät für informatik informatik 12 technische universität dortmund Memory-architecture aware compilation - Sessions Peter Marwedel TU Dortmund.

3rd Review, Vienna, 16th of April 1999 SIT-MOON ESPRIT Project Nr Siemens AG Österreich Robotiker Technische Universität Wien Politecnico di Milano.

Berner Fachhochschule Hochschule für Agrar-, Forst- und Lebensmittelwissenschaften HAFL 95% der Ammoniakemissionen aus der Landwirtschaft Rindvieh Pflanzenbau.

Wind Energy in Germany 2004 Ralf Christmann, BMU Joachim Kutscher, PTJ

KIT – die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Vorlesung Knowledge Discovery - Institut AIFB Tempus fugit Towards.

1 Intern | ST-IN/PRM-EU | | © Robert Bosch GmbH Alle Rechte vorbehalten, auch bzgl. jeder Verfügung, Verwertung, Reproduktion, Bearbeitung,

Fakultät für informatik informatik 12 technische universität dortmund Standard Optimization Techniques 2010/12/20 Peter Marwedel TU Dortmund, Informatik.

Fakultät für informatik informatik 12 technische universität dortmund Memory architecture description languages - Session 20 - Peter Marwedel TU Dortmund.

Adjective Endings Nominative & Accusative Cases describing auf deutsch The information contained in this document may not be duplicated or distributed.

EN/FAD Ericsson GmbH EDD/ Information im 21. Jahrundert muss Erwünscht Relevant Erreichbar Schnell Kostenlos!?

Technische Universität München 1 CADUI' June FUNDP Namur G B I The FUSE-System: an Integrated User Interface Design Environment Frank Lonczewski.

TUM in CrossGrid Role and Contribution Fakultät für Informatik der Technischen Universität München Informatik X: Rechnertechnik und Rechnerorganisation.

Institut für Nachrichtentechnik U. Reimers Technische Universität Braunschweig The MultiMedia Home Platform (MHP): Hype or Reality ?

Fakultät für informatik informatik 12 technische universität dortmund Memory-architecture aware compilation - Session 13 - Peter Marwedel TU Dortmund Informatik.

Technische Universität München Fakultät für Informatik Computer Graphics SS 2014 Rüdiger Westermann Lehrstuhl für Computer Graphik und Visualisierung.

Institut für Angewandte Mikroelektronik und Datentechnik Phase 5 Architectural impact on ASIC and FPGA Nils Büscher Selected Topics in VLSI Design (Module.

Institut für Angewandte Mikroelektronik und Datentechnik Course and Contest Results of Phase 5 Eike Schweißguth Selected Topics in VLSI Design (Module.

CSL211 Computer Architecture

Creating Web Documents

Cooling R&D plans in Aachen

Präsentation transkript:

Embedded System Hardware - Processing - Peter Marwedel Informatik 12 TU Dortmund Germany Graphics: © Alexandra Nolte, Gesine Marwedel, 2003 2010年 11 月 15 日 These slides use Microsoft clip arts. Microsoft copyright restrictions apply.

Embedded System Hardware Embedded system hardware is frequently used in a loop (“hardware in a loop“):  cyber-physical systems

Processing units Need for efficiency (power + energy): Why worry about energy and power? “Power is considered as the most important constraint in embedded systems“ [in: L. Eggermont (ed): Embedded Systems Roadmap 2002, STW] Energy consumption by IT is the key concern of green computing initiatives (embedded computing leading the way) http://www.esa.int/images/earth,4.jpg

Importance of Energy Efficiency “inherent power efficiency of silicon“ © Hugo De Man, IMEC, Philips, 2007

Power and energy are related to each other In many cases, faster execution also means less energy, but the opposite may be true if power has to be increased to allow faster execution.

Low Power vs. Low Energy Consumption Minimizing power consumption important for the design of the power supply the design of voltage regulators the dimensioning of interconnect short term cooling Minimizing energy consumption important due to restricted availability of energy (mobile systems) limited battery capacities (only slowly improving) very high costs of energy (solar panels, in space) cooling high costs limited space dependability long lifetimes, low temperatures

Power density continues to get worse Nuclear reactor Prescott: 90 W/cm², 90 nm [c‘t 4/2004] © Intel M. Pollack, Micro-32

Surpassed hot (kitchen) plate …? Why not use it? http://www.phys.ncku.edu.tw/~htsu/humor/fry_egg.html

Energy consumption in mobile devices [O. Vargas (Infineon Technologies): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design - September 2005;] Thanks to Thorsten Koch (Nokia/ Univ. Dortmund) for providing this source. 9

Need to consider CPU & System Power Mobile PC (notebook) Thermal Design (TDP) System Power 600/500 MHz uP 37% LCD 10" 19% HDD 9% Memory+Graphics 12% Power Supply 10% Other 13% CPU Dominates Thermal Design Power [Courtesy: N. Dutt; Source: V. Tiwari] Note: Based on Actual Measurements Mobile PC (notebook) Average System Power 600/500 MHz uP 13% LCD 10" 30% HDD 19% Memory+Graphics 15% Power Supply 10% Other Multiple Platform Components Comprise Average Power

Application Specific Circuits (ASICS) or Full Custom Circuits Custom-designed circuits necessary if ultimate speed or energy efficiency is the goal and large numbers can be sold. Approach suffers from long design times, lack of flexibility (changing standards) and high costs (e.g. Mill. $ mask costs).

Mask cost for specialized HW becomes very expensive Trend towards implementation in Software HW synthesis not covered in this course. [http://www.molecularimprints.com/Technology/ tech_articles/MII_COO_NIST_2001.PDF9]

Key requirements for processors Energy/ power-efficiency

Dynamic power management (DPM) Example: STRONGARM SA1100 400mW RUN: operational IDLE: a sw routine may stop the CPU when not in use, while monitoring interrupts SLEEP: Shutdown of on-chip activity RUN 90µs Power fault signal 10µs 160ms 10µs IDLE 90µs SLEEP Power fault signal 50mW 160µW

Fundamentals of dynamic voltage scaling (DVS) Power consumption of CMOS circuits (ignoring leakage): Delay for CMOS circuits: Decreasing Vdd reduces P quadratically, while the run-time of algorithms is only linearly increased

Voltage scaling: Example Vdd [Courtesy, Yasuura, 2000]

Variable-voltage/frequency example: INTEL Xscale OS should schedule distribution of the energy budget. From Intel’s Web Site

Rough approxi-mations! Low voltage, parallel operation more efficient than high voltage, sequential operation Basic equations Power: P ~ VDD² , Maximum clock frequency: f ~ VDD , Energy to run a program: E = P  t, with: t = runtime (fixed) Time to run a program: t ~ 1/f Changes due to parallel processing, with  operations per clock: Clock frequency reduced to: f’ = f / , Voltage can be reduced to: VDD’ =VDD / , Power for parallel processing: P° = P / ² per operation, Power for  operations per clock: P’ =   P° = P / , Time to run a program is still: t’ = t, Energy required to run program: E’ = P’  t = E /  Argument in favour of voltage scaling, VLIW processors, and multi-cores Rough approxi-mations!

Application: VLIW procesing and voltage scaling in the Crusoe processor VDD: 32 levels (1.1V - 1.6V) Clock: 200MHz - 700MHz in increments of 33MHz Scaling is triggered when CPU load change is detected by software (~1/2 ms). More load: Increase of supply voltage (~20 ms/step), followed by scaling clock frequency Less load: reduction of clock frequency, followed by reduction of supply voltage Worst case (1.1V to 1.6V VDD, 200MHz to 700MHz) takes 280 ms

Result (as published by transmeta) Pentium Crusoe Running the same multimedia application. [www.transmeta.com]

Key requirement #2: Code-size efficiency CISC machines: RISC machines designed for run-time-, not for code-size-efficiency Compression techniques: key idea

Code-size efficiency Compression techniques (continued): 2nd instruction set, e.g. ARM Thumb instruction set: Reduction to 65-70 % of original code size 130% of ARM performance with 8/16 bit memory 85% of ARM performance with 32-bit memory 1110 001 01001 0 Rd 0 Rd 0000 Constant 16-bit Thumb instr. ADD Rd #constant 001 10 Rd Constant zero extended major opcode minor opcode source= destination [ARM, R. Gupta] Dynamically decoded at run-time Same approach for LSI TinyRisc, … Requires support by compiler, assembler etc.

Dictionary approach, two level control store (indirect addressing of instructions) “Dictionary-based coding schemes cover a wide range of various coders and compressors. Their common feature is that the methods use some kind of a dictionary that contains parts of the input sequence which frequently appear. The encoded sequence in turn contains references to the dictionary elements rather than containing these over and over.” [Á. Beszédes et al.: Survey of Code size Reduction Methods, Survey of Code-Size Reduction Methods, ACM Computing Surveys, Vol. 35, Sept. 2003, pp 223-267]

Key idea (for d bit instructions) Uncompressed storage of a d-bit-wide instructions requires axd bits. In compressed code, each instruction pattern is stored only once. Hopefully, axb+cxd < axd. Called nanoprogramming in the Motorola 68000. For each instruction address, S contains table address of instruction. b instruction address a S b « d bit table of used instructions (“dictionary”) c ≦ 2b small d bit CPU

Cache-based decompression Main idea: decompression whenever cache-lines are fetched from memory. Cache lines ↔ variable-sized blocks in memory  line address tables (LATs) for translation of instruction addresses into memory addresses. Tables may become large and have to be bypassed by a line address translation buffer. [A. Wolfe, A. Chanin, MICRO-92]

More information on code compaction Popular code compaction library by Rik van de Wiel [http://www.extra.research.philips.com/ccb] has been moved to http://www-perso.iro.umontreal.ca/~latendre/ codeCompression/codeCompression/node1.html http://www.iro.umontreal.ca/~latendre/compactBib/ (153 entries as per 11/2004)

Key requirement #3: Run-time efficiency - Domain-oriented architectures - Example: Filtering in Digital signal processing (DSP) Signal at t=ts (sampling points)

Filtering in digital signal processing ADSP 2100 -- outer loop over -- sampling times ts { MR:=0; A1:=1; A2:=s-1; MX:=w[s]; MY:=a[0]; for (k=0; k <= (n−1); k++) { MR:=MR + MX * MY; MX:=w[A2]; MY:=a[A1]; A1++; A2--; } x[s]:=MR; } Maps nicely

DSP-Processors: multiply/accumulate (MAC) and zero-overhead loop (ZOL) instructions MR:=0; A1:=1; A2:=s-1; MX:=w[s]; MY:=a[0]; for ( k:=1 <= n-1) {MR:=MR+MX*MY; MY:=a[A1]; MX:=w[A2]; A1++; A2--} Multiply/accumulate (MAC) instruction Zero-overhead loop (ZOL) instruction preceding MAC instruction. Loop testing done in parallel to MAC operations.

Heterogeneous registers Example (ADSP 210x): P D AX AY MX MY Address- registers A0, A1, A2 .. AF MF +,-,.. * Address generation unit (AGU) +,- AR MR Different functionality of registers An, AX, AY, AF,MX, MY, MF, MR

Separate address generation units (AGUs) Example (ADSP 210x): Data memory can only be fetched with address contained in A, but this can be done in parallel with operation in main data path (takes effectively 0 time). A := A ± 1 also takes 0 time, same for A := A ± M; A := <immediate in instruction> requires extra instruction  Minimize load immediates  Optimization in optimization chapter

Modulo addressing sliding window Modulo addressing: Am++  Am:=(Am+1) mod n (implements ring or circular buffer in memory) w t t1 .. w[t1-1] w[t1] w[t1-n+1] w[t1-n+2] .. .. w[t1-1] w[t1] w[t1+1] w[t1-n+2] .. n most recent values Memory, t=t1 Memory, t2= t1+1

Saturating arithmetic Returns largest/smallest number in case of over/underflows Example: a 0111 b + 1001 standard wrap around arithmetic (1)0000 saturating arithmetic 1111 (a+b)/2: correct 1000 wrap around arithmetic 0000 saturating arithmetic + shifted 0111 Appropriate for DSP/multimedia applications: No timeliness of results if interrupts are generated for overflows Precise values less important Wrap around arithmetic would be worse. “almost correct“

Example MATLAB Demo

Fixed-point arithmetic Shifting required after multiplications and divisions in order to maintain binary point.

Properties of fixed-point arithmetic Automatic scaling a key advantage for multiplications. Example: x= 0.5 x 0.125 + 0.25 x 0.125 = 0.0625 + 0.03125 = 0.09375 For iwl=1 and fwl=3 decimal digits, the less significant digits are automatically chopped off: x = 0.093 Like a floating point system with numbers  (-1..1), with no stored exponent (bits used to increase precision). Appropriate for DSP/multimedia applications (well-known value ranges).

Real-time capability Timing behavior has to be predictable Features that cause problems: Unpredictable access to shared resources Caches with difficult to predict replacement strategies Unified caches (conflicts between instructions and data) Pipelines with difficult to predict stall cycles ("bubbles") Unpredictable communication times for multiprocessors Branch prediction, speculative execution Interrupts that are possible any time Memory refreshes that are possible any time Instructions that have data-dependent execution times  Trying to avoid as many of these as possible. [Dagstuhl workshop on predictability, Nov. 17-19, 2003]

Multiple memory banks or memories D AX AY MX MY Address- registers A0, A1, A2 .. AF MF +,-,.. * Address generation unit (AGU) +,- AR MR Simplifies parallel fetches

Multimedia-Instructions/Processors Multimedia instructions exploit that many registers, adders etc are quite wide (32/64 bit), whereas most multimedia data types are narrow (e.g. 8 bit per color, 16 bit per audio sample per channel) 2-8 values can be stored per register and added. E.g.: + 4 additions per instruction; carry disabled at word boundaries.

Early example: HP precision architecture (hp PA) Half word add instruction HADD: Half word add? Optional saturating arithmetic. Up to 10 instructions can be replaced by HADD.

Pentium MMX-architecture (1) 64-bit vectors representing 8 byte encoded, 4 word encoded or 2 double word encoded numbers. wrap around/saturating options. Multimedia registers mm0 - mm7, consistent with floating-point registers (OS unchanged). Instruction Options Comments Padd[b/w/d] PSub[b/w/d] wrap around, saturating addition/subtraction of bytes, words, double words Pcmpeq[b/w/d] Pcmpgt[b/w/d] Result= "11..11" if true, "00..00" otherwise Pmullw Pmulhw multiplication, 4*16 bits, least significant word multiplication, 4*16 bits, most significant word

Pentium MMX-architecture (2) Psra[w/d] Psll[w/d/q] Psrl[w/d/q] No. of positions in register or instruction Parallel shift of words, double words or 64 bit quad words Punpckl[bw/wd/dq] Punpckh[bw/wd/dq] Parallel unpack Packss[wb/dw] saturating Parallel pack Pand, Pandn Por, Pxor Logical operations on 64 bit words Mov[d/q] Move instruction

Appli-cation Scaled interpolation between two images Next word = next pixel, same color. 4 pixels processed at a time. pxor mm7,mm7 ;clear register mm7 movq mm3,fade_val;load scaling value movd mm0,imageA ;load 4 red pixels for A movd mm1,imageB ;load 4 red pixels for B unpcklbw mm1,mm7 ;unpack,bytes to words unpcklbw mm0,mm7 ;upper bytes from mm7 psubw mm0,mm1 ;subtract pixel values pmulhw mm0,mm3 ;scale paddw mm0,mm1 ;add to image B packuswb mm0,mm7 ;pack, words to bytes

Short vector instruction set extensions for Intel® Pentium®/AMD® processors 3DNow! (AMD, 1989) Streaming SIMD Extensions SSE (Intel, 1999) 16 new registers, floating point SIMD SSE2 (Intel, 2001; AMD, 2003) MMX instructions available for new SSE registers SSE3 (Intel, 2004; AMD) vector reduction, floating point conversion independent of global rounding mode, relaxed alignment restrictions SSE4 (Intel, 2006; AMD: 4 instructions implemented) String comparison, counting 1‘s, CRC, … SSE5 (AMD, 2007) 3-address instructions, … Advanced vector extensions AVX (Intel, 2008) Registers 256, … bit wide

Summary Hardware in a loop Sensors Discretization Information processing Importance of energy efficiency Special purpose HW very expensive Energy efficiency of processors Code size efficiency Run-time efficiency MPSoCs Reconfigurable Hardware D/A converters Actuators