Fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2010/01/13 Graphics:

Slides:

Advertisements

Ähnliche Präsentationen

Cadastre for the 21st Century – The German Way

Advertisements

Vernetzung von Repositorien : DRIVER Guidelines Dr Dale Peters, SUB Goettingen 4. Helmholtz Open Access Workshop Potsdam, 17 Juni 2008.

Finding the Pattern You Need: The Design Pattern Intent Ontology

E-Solutions mySchoeller.com for Felix Schoeller Imaging

H - A - M - L - E - IC T Teachers Acting Patterns while Teaching with New Media in the Subjects German, Mathematics and Computer Science Prof. S. Blömeke,

Embedded System Hardware

P. Marwedel Informatik 12, U. Dortmund

Peter Marwedel Informatik 12 TU Dortmund Germany

Finite state machines & message passing: SDL

Managing the Transition from School-to-Work Empirical Findings from a Mentoring Programme in Germany Prof. i.V. Dr. Martin Lang.

Fakultät für informatik informatik 12 technische universität dortmund Mapping of Applications to Multi-Processor Systems Peter Marwedel Informatik 12 TU.

Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung Two Issues on Remote Data Access.

1 | R. Steinbrecher | IMK-IFU | KIT – die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Natural Sources SNAP11.

Multi electron atoms Atoms with Z>1 contain >1 electron. This changes the atomic structure considerably because in addition to the electron-nucleus interaction,

Informatik 12, TU Dortmund

fakultät für informatik informatik 12 technische universität dortmund Test Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/17 Graphics: © Alexandra.

Embedded & Real-time Operating Systems

Fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/17 Graphics:

fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/10 Graphics:

Fakultät für informatik informatik 12 technische universität dortmund Mapping of Applications to Platforms Peter Marwedel TU Dortmund, Informatik 12 Germany.

Embedded System Hardware - Processing -

Fakultät für informatik informatik 12 technische universität dortmund Universität Dortmund Middleware Peter Marwedel TU Dortmund, Informatik 12 Germany.

Fakultät für informatik informatik 12 technische universität dortmund Specifications Peter Marwedel TU Dortmund, Informatik 12 Graphics: © Alexandra Nolte,

Peter Marwedel TU Dortmund, Informatik 12

Fakultät für informatik informatik 12 technische universität dortmund Embedded System Hardware Peter Marwedel Informatik 12 TU Dortmund Germany 2011/03/09.

Fakultät für informatik informatik 12 technische universität dortmund Hardware/Software Partitioning Peter Marwedel Informatik 12 TU Dortmund Germany Chapter.

Fakult ä t f ü r informatik informatik 12 technische universit ä t dortmund Data flow models Peter Marwedel TU Dortmund, Informatik 12 Graphics: © Alexandra.

Technische universität dortmund fakultät für informatik informatik 12 Embedded System Hardware Peter Marwedel Informatik 12 TU Dortmund Germany

Technische universität dortmund fakultät für informatik informatik 12 Embedded System Hardware - Processing - Peter Marwedel Informatik 12 TU Dortmund.

Aufgabenbesprechung Programming Contest. Order 7 Bo Pat Jean Kevin Claude William Marybeth 6 Jim Ben Zoe Joey Frederick Annabelle 0 SET 1 Bo Jean Claude.

NUMEX – Numerical experiments for the GME Fachhochschule Bonn-Rhein-Sieg Wolfgang Joppich PFTOOL - Precipitation forecast toolbox Semi-Lagrangian Mass-Integrating.

Wozu die Autokorrelationsfunktion?

Hier wird Wissen Wirklichkeit Computer Architecture – Part 10 – page 1 of 31 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt Part 10 Thread and.

Hier wird Wissen Wirklichkeit Computer Architecture – Part 5 – page 1 of 25 – Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting Part 5 Fundamentals in.

Institut für Verkehrsführung und Fahrzeugsteuerung > Technologien aus Luft- und Raumfahrt für Straße und Schiene Automatic Maneuver Recognition in the.

Lancing: What is the future? Lutz Heinemann Profil Institute for Clinical Research, San Diego, US Profil Institut für Stoffwechselforschung, Neuss Science.

Lehrstuhl für Künstliche Intelligenz - Univ. Würzburg Optimization of simulated biological multi-agent systems by means of evolutionary processes Alexander.

Thomas Herrmann Software - Ergonomie bei interaktiven Medien Step 6: Ein/ Ausgabe Instrumente (Device-based controls) Trackball. Joystick.

Methods Fuzzy- Logic enables the modeling of rule based knowledge by the use of fuzzy criteria instead of exact measurement values or threshold values.

Institut für Umweltphysik/Fernerkundung Physik/Elektrotechnik Fachbereich 1 Pointing Meeting Nov 2006 S. Noël IFE/IUP Elevation and Azimuth Jumps during.

Institut AIFB, Universität Karlsruhe (TH) Forschungsuniversität gegründet 1825 Towards Automatic Composition of Processes based on Semantic.

| DC-IAP/SVC3 | © Bosch Rexroth Pneumatics GmbH This document, as well as the data, specifications and other information set forth in.

BAS5SE | Fachhochschule Hagenberg | Daniel Khan | S SPR5 MVC Plugin Development SPR6P.

Quelle: Standish Group, 2006 Fourth Quarter Research Report, CHAOS Research Results.

Department of Computer Science Homepage HTML Preprocessor Perl Database Revision Control System © 1998, Leonhard Jaschke, Institut für Wissenschaftliches.

INTAKT- Interkulturelle Berufsfelderkundungen als ausbildungsbezogene Lerneinheiten in berufsqualifizierenden Auslandspraktika DE/10/LLP-LdV/TOI/

Faculty of Public Health Department of Health Economics and Management University of Bielefeld WP 3.1 and WP 4.1: Macrocost EUprimecare Plenary Meeting.

Impairments in Polarization-Multiplexed DWDM Channels due to Cross- Polarization Modulation Marcus Winter Christian-Alexander Bunge Klaus Petermann Hochfrequenztechnik-Photonik.

Berner Fachhochschule Hochschule für Agrar-, Forst- und Lebensmittelwissenschaften HAFL Recent activities on ammonia emissions: Emission inventory Rindvieh.

4th Symposium on Lidar Atmospheric Applications

Ein Projekt des Technischen Jugendfreizeit- und Bildungsvereins (tjfbv) e.V. kommunizieren.de Blended Learning for people with disabilities.

Image Processing and Analysis Introduction. How do we see things ?

Fakultät für informatik informatik 12 technische universität dortmund Memory-architecture aware compilation - Sessions Peter Marwedel TU Dortmund.

ESSnet Workshop Conclusions.

Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) dCache T1 admins and experts Welcome to FZK.

3rd Review, Vienna, 16th of April 1999 SIT-MOON ESPRIT Project Nr Siemens AG Österreich Robotiker Technische Universität Wien Politecnico di Milano.

Wind Energy in Germany 2004 Ralf Christmann, BMU Joachim Kutscher, PTJ

AVL-Trees (according to Adelson-Velskii & Landis, 1962) In normal search trees, the complexity of find, insert and delete operations in search.

KIT – die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Vorlesung Knowledge Discovery - Institut AIFB Tempus fugit Towards.

Separable Verbs Turn to page R22 in your German One Book R22 is in the back of the book There are examples at the top of the page.

1 Intern | ST-IN/PRM-EU | | © Robert Bosch GmbH Alle Rechte vorbehalten, auch bzgl. jeder Verfügung, Verwertung, Reproduktion, Bearbeitung,

Fakultät für informatik informatik 12 technische universität dortmund Standard Optimization Techniques 2010/12/20 Peter Marwedel TU Dortmund, Informatik.

Fakultät für informatik informatik 12 technische universität dortmund Memory architecture description languages - Session 20 - Peter Marwedel TU Dortmund.

Launch ON Global.vi System ID object name classname Services to suscribe Observer Control Ref vi-path Service name Step 1 : Objects register to the Global.vi´s,

1 Stevens Direct Scaling Methods and the Uniqueness Problem: Empirical Evaluation of an Axiom fundamental to Interval Scale Level.

Lehrstuhl für Waldbau, Technische Universität MünchenBudapest, 10./11. December 2006 WP 1 Status (TUM) Bernhard Felbermeier.

Selectivity in the German Mobility Panel Tobias Kuhnimhof Institute for Transport Studies, University of Karlsruhe Paris, May 20th, 2005.

Technische Universität München 1 CADUI' June FUNDP Namur G B I The FUSE-System: an Integrated User Interface Design Environment Frank Lonczewski.

TUM in CrossGrid Role and Contribution Fakultät für Informatik der Technischen Universität München Informatik X: Rechnertechnik und Rechnerorganisation.

Work in Progress Ignacio Yaselli, Brunel University

Präsentation transkript:

fakultät für informatik informatik 12 technische universität dortmund Optimizations Peter Marwedel TU Dortmund Informatik 12 Germany 2010/01/13 Graphics: © Alexandra Nolte, Gesine Marwedel, 2003

- 2 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Structure of this course 2: Specification 3: ES-hardware 4: system software (RTOS, middleware, …) 8: Test 5: Validation & Evaluation (energy, cost, performance, …) 7: Optimization 6: Application mapping Application Knowledge Design repository Design Numbers denote sequence of chapters

- 3 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Task-level concurrency management Granularity: size of tasks (e.g. in instructions) Readable specifications and efficient implementations can possibly require different task structures. Granularity changes

- 4 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Merging of tasks Reduced overhead of context switches, More global optimization of machine code, Reduced overhead for inter-process/task communication.

- 5 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Splitting of tasks No blocking of resources while waiting for input, more flexibility for scheduling, possibly improved result.

- 6 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Merging and splitting of tasks The most appropriate task graph granularity depends upon the context merging and splitting may be required. Merging and splitting of tasks should be done automatically, depending upon the context.

- 7 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Automated rewriting of the task system - Example -

- 8 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Attributes of a system that needs rewriting Tasks blocking after they have already started running

- 9 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Work by Cortadella et al. 1.Transform each of the tasks into a Petri net, 2.Generate one global Petri net from the nets of the tasks, 3.Partition global net into sequences of transition 4.Generate one task from each such sequence Mature, commercial approach not yet available

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Result, as published by Cortadella Reads only at the beginning Initialization task Always true Never true

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Optimized version of Tin Tin () { READ (IN, sample, 1); sum += sample; i++; DATA = sample; d = DATA; L0: if (i < N) return; DATA = sum/N; d = DATA; d = d*c; WRITE(OUT,d,1); sum = 0; i = 0; return; } Always true j==i-1 j i Never true

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Task-level concurrency management (2) The dynamic behavior of applications getting more attention. Energy consumption reduction is the main target. Some classes of applications (i.e. video processing) have a considerable variation in processing power requirements depending on input data. Static design-time methods becoming insufficient. Runtime-only methods not feasible for embedded systems. How about mixed approaches?

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Example of a mixed TCM [IMEC, Belgium, …or they can define a probability for violating the deadline. t Deadline Task1 Task2 Task3 Static (compile-time) methods can ensure WCET feasible schedules, but waste energy in the average case. t E Deadline Runtime scheduler selects the most energy saving, deadline preserving combination. t Deadline Mixed methods use compile-time analysis to define a set of possible execution parameters for each task.

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Floating-point to fixed point conversion Pros: Lower cost Faster Lower power consumption Sufficient SQNR, if properly scaled Suitable for portable applications Cons: Decreased dynamic range Finite word-length effect, unless properly scaled Overflow and excessive quantization noise Extra programming effort © Ki-Il Kum, et al. (Seoul National University): A Floating-point To Fixed-point C Converter For Fixed-point Digital Signal Processors, 2nd SUIF Workshop, 1996

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Fixed-Point Data Format S hypothetical binary point IWL=3 S (a) Integer (b) Fixed-Point FWL © Ki-Il Kum, et al Floating-Point vs. Fixed-Point Integer vs. Fixed-Point exponent, mantissa Floating-Point automatic computation and update of each exponent at run-time Fixed-Point implicit exponent determined off-line exponent, mantissa Floating-Point automatic computation and update of each exponent at run-time Fixed-Point implicit exponent determined off-line

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Assignment and Addition/Subtraction Assume y = x, with -x (IWL=2) and -y (IWL=3): s s x x>>1 y s Let result = x + y: equalizing each IWL s y s result + © Ki-Il Kum, et al s x x>>1 s

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Multiplication Assume result = x * y, with -x (IWL=2) and -y (IWL=3) ->result (IWL=2+3) s x * y s s result © Ki-Il Kum, et al s s

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Development Procedure Range Estimation C Program Execution Floating-Point C Program Fixed-Point C Program Floating- Point to Fixed-Point C Program Converter Range Estimator Manual specification IWL information © Ki-Il Kum, et al

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Range Estimator C pre-processor C front-end ID assignment Subroutine call insertion SUIF-to-C converter Floating-Point C Program Range Estimation C Program IWL Information Execution float iir1(float x) { static float s = 0; float y; y = 0.9 * s + x; range(y, 0); s = y; range(s, 1); return y; } float iir1(float x) { static float s = 0; float y; y = 0.9 * s + x; range(y, 0); s = y; range(s, 1); return y; } Range Estimation C Program © Ki-Il Kum, et al

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Operations in fixed point program 0.9 x 2 15 s iwl=4.xxxxxxxxxxxx * + x iwl=0.xxxxxxxxxxxx >>5 overflow if result

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Floating-Point to Fixed-Point Program Converter int iir1(int x) { static int s = 0; int y; y=sll(mulh(29491,s)+ (x>> 5),1); s = y; return y; } Fixed-Point C Program mulh to access the upper half of the multiplied result target dependent implementation sll to remove 2 nd sign bit opt. overflow check © Ki-Il Kum, et al

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Performance Comparison - Machine Cycles - © Ki-Il Kum, et al

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Performance Comparison - Machine Cycles - © Ki-Il Kum, et al

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Performance Comparison - SNR - © Ki-Il Kum, et al

fakultät für informatik informatik 12 technische universität dortmund High-level software transformations Peter Marwedel TU Dortmund Informatik 12 Germany Graphics: © Alexandra Nolte, Gesine Marwedel, 2003

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Impact of memory allocation on efficiency Array p[j][k] Row major order (C) Column major order (FORTRAN) j=0 j=1 j=2 … … k=0 k=1 k=2 j=0 j=1 … j=0 j=1 … j=0 j=1 …

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Best performance if innermost loop corresponds to rightmost array index Two loops, assuming row major order (C): for (k=0; k<=m; k++) for (j=0; j<=n; j++) for (j=0; j<=n; j++) ) for (k=0; k<=m; k++) p[j][k] =... p[j][k] =... For row major order j=0 j=1j=2 Good cache behavior Poor cache behavior Same behavior for homogeneous memory access, but: memory architecture dependent optimization

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Program transformation Loop interchange (SUIF interchanges array indexes instead of loops) Improved locality

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Results: strong influence of the memory architecture Loop structure: i j k Time [s] [Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004] Ti C6xx ~ 57% Intel Pentium 3.2 % Sun SPARC 35% Processor reduction to [%] Dramatic impact of locality Not always the same impact..

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Transformations Loop fusion (merging), loop fission for(j=0; j<=n; j++) for (j=0; j<=n; j++) p[j]=... ; {p[j]=... ; for (j=0; j<=n; j++), p[j]= p[j] +...} p[j]= p[j] +... Loops small enough toBetter locality for allow zero overheadaccess to p. LoopsBetter chances for parallel execution. Which of the two versions is best? Architecture-aware compiler should select best version.

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Example: simple loops void ss1() {int i,j; for (i=0;i<size;i++){ for (j=0;j<size;j++){ a[i][j]+= 17;}} for(i=0;i<size;i++){ for (j=0;j<size;j++){ b[i][j]-=13;}}} void ms1() {int i,j; for (i=0;i< size;i++){ for (j=0;j<size;j++){ a[i][j]+=17; } for (j=0;j<size;j++){ b[i][j]-=13; }}} void mm1() {int i,j; for(i=0;i<size;i++){ for(j=0;j<size;j++){ a[i][j] += 17; b[i][j] -= 13;}}} #define size 30 #define iter int a[size][size]; float b[size][size]; #define size 30 #define iter int a[size][size]; float b[size][size];

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Results: simple loops Merged loops superior; except Sparc with –o3 ss1 ms1 mm1 (100% max)

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Loop unrolling for (j=0; j<=n; j++) p[j]=... ; for (j=0; j<=n; j+=2) {p[j]=... ; p[j+1]=...} factor = 2 Better locality for access to p. Less branches per execution of the loop. More opportunities for optimizations. Tradeoff between code size and improvement. Extreme case: completely unrolled loop (no branch).

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Example: matrixmult #define s 30 #define iter 4000 int a[s][s],b[s][s],c[s] [s]; void compute(){int i,j,k; for(i=0;i<s;i++){ for(j=0;j<s;j++){ for(k=0;k<s;k++){ c[i][k]+= a[i][j]*b[j][k]; }}}} extern void compute2() {int i, j, k; for (i = 0; i < 30; i++) { for (j = 0; j < 30; j++) { for (k = 0; k <= 28; k += 2) {{int *suif_tmp; suif_tmp = &c[i][k]; *suif_tmp= *suif_tmp+a[i][j]*b[j][k];} {int *suif_tmp; suif_tmp=&c[i][k+1]; *suif_tmp=*suif_tmp +a[i][j]*b[j][k+1]; }}}} return;}

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Results Ti C6xx Intel PentiumSun SPARC Processor Benefits quite small; penalties may be large [Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004] factor

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Results: benefits for loop dependences Small benefits; Ti C6xx Processor reduction to [%] #define s 50 #define iter int a[s][s], b[s][s]; void compute() { int i,k; for (i = 0; i < s; i++) { for (k = 1; k < s; k++) { a[i][k] = b[i][k]; b[i][k] = a[i][k-1]; }}} [Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004] factor

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Program transformation Loop tiling/loop blocking: - Original version - for (i=1; i<=N; i++) for(k=1; k<=N; k++){ r=X[i,k]; /* to be allocated to a register*/ for (j=1; j<=N; j++) Z[i,j] += r* Y[k,j] } % Never reusing information in the cache for Y and Z if N is large or cache is small (2 N³ references for Z). j++ k++ i++ j++ k++ i++ j++ k++ i++

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Loop tiling/loop blocking - tiled version - for (kk=1; kk<= N; kk+=B) for (jj=1; jj<= N; jj+=B) for (i=1; i<= N; i++) for (k=kk; k<= min(kk+B-1,N); k++){ r=X[i][k]; /* to be allocated to a register*/ for (j=jj; j<= min(jj+B-1, N); j++) Z[i][j] += r* Y[k][j] } Reuse factor of B for Z, N for Y O(N³/B) accesses to main memory k++, j++ jj kk j++ k++ i++ jj k++ i++ Same elements for next iteration of i Compiler should select best option Monica Lam: The Cache Performance and Optimization of Blocked Algorithms, ASPLOS, 1991

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Example In practice, results by Buchwald are disappointing. One of the few cases where an improvement was achieved: Source: similar to matrix mult. [Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004] Tiling-factor SPARC Pentium

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Transformation Loop nest splitting Example: Separation of margin handling + many if- statements for margin-checking no checking, efficient only few margin elements to be processed

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 if (x>=10||y>=14) for (; y<49; y++) for (k=0; k<9; k++) for (l=0; l<9;l++ ) for (i=0; i<4; i++) for (j=0; j<4;j++) { then_block_1; then_block_2} else {y1=4*y; for (k=0; k<9; k++) {x2=x1+k-4; for (l=0; l<9; ) {y2=y1+l-4; for (i=0; i<4; i++) {x3=x1+i; x4=x2+i; for (j=0; j<4;j++) {y3=y1+j; y4=y2+j; if (0 || 35<x3 ||0 || 48<y3) then-block-1; else else-block-1; if (x4<0|| 35<x4||y4<0||48<y4) then_block_2; else else_block_2; }}}}}} Loop nest splitting at University of Dortmund Loop nest from MPEG-4 full search motion estimation for (z=0; z<20; z++) for (x=0; x<36; x++) {x1=4*x; for (y=0; y<49; y++) {y1=4*y; for (k=0; k<9; k++) {x2=x1+k-4; for (l=0; l<9; ) {y2=y1+l-4; for (i=0; i<4; i++) {x3=x1+i; x4=x2+i; for (j=0; j<4;j++) {y3=y1+j; y4=y2+j; if (x3<0 || 35<x3||y3<0||48<y3) then_block_1; else else_block_1; if (x4<0|| 35<x4||y4<0||48<y4) then_block_2; else else_block_2; }}}}}} for (z=0; z<20; z++) for (x=0; x<36; x++) {x1=4*x; for (y=0; y<49; y++) analysis of polyhedral domains, selection with genetic algorithm [H. Falk et al., Inf 12, UniDo, 2002]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Results for loop nest splitting - Execution times - [H. Falk et al., Inf 12, UniDo, 2002]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Results for loop nest splitting - Code sizes - [Falk, 2002]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Array folding Initial arrays

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Array folding Unfolded arrays

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Inter-array folding Intra-array folding

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Application Array folding is implemented in the DTSE optimization proposed by IMEC. Array folding adds div and mod ops. Optimizations required to remove these costly operations. At IMEC, ADOPT address optimizations perform this task. For example, modulo operations are replaced by pointers (indexes) which are incremented and reset.

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Results (Mcycles for cavity benchmark) ADOPT&DTSE required to achieve real benefit [C.Ghez et al.: Systematic high-level Address Code Transformations for Piece-wise Linear Indexing: Illustration on a Medical Imaging Algorithm, IEEE WS on Signal Processing System: design & implementation, 2000, pp ]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2010 Summary Task concurrency management Re-partitioning of computations into tasks Dynamic exploitation of slack Floating-point to fixed point conversion Range estimation Conversion Analysis of the results High-level loop transformations Fusion Unrolling Tiling Loop nest splitting Array folding