The Future of Parallel Computing Special Purpose Mesh Architectures Heiko Schröder, 1998 P A R C SA ISA PIPS RM OH
Heiko Schröder, 1998Slide 2 P A R C Contents Why meshes ??? Application specific parallel mesh architectures Fine grain 1983 Coarse grain Systolic Arrays -Instruction Systolic Arrays -PIPS -Reconfigurable mesh -Optical Highway
Heiko Schröder, 1998Slide 3 P A R C Physical limits c= km/sec OPS mm/OP 1000 PEs with OPS -- 30cm/OP massive parallelism distributed memory
Heiko Schröder, 1998Slide 4 P A R C Performance (MIPS) Year Pentium 2 Pentium Processor power
Heiko Schröder, 1998Slide 5 P A R C 0,5 µ 0,25 µ Scaling Faktor 2: 1/2 width 1/2 hight 1/2 switching time 8 x performance!
Heiko Schröder, 1998Slide 6 P A R C CMOS transistors ,01 0, Size of minimal transistor ca. 0,03
Heiko Schröder, 1998Slide 7 P A R C Mesh/Torus diameter bisection width 2D mesh
Heiko Schröder, 1998Slide 8 P A R C Hypercube 0-D D D D 01 4-D diameter log n bisection width n
Heiko Schröder, 1998Slide 9 P A R C VLSI Very Large Scale Integration simple cells few types regular architecture short connections mesh -- torus
Heiko Schröder, 1998Slide 10 P A R C Pin limitations 16x16 pins diameter pins diameter 16 16x12 pins
Heiko Schröder, 1998Slide 11 P A R C Bisection width Bisection width cm Bisection width 32K 32 m
Heiko Schröder, 1998Slide 12 P A R C Programming SA --- Systolic Array SIMD --- Single Instruction Multiple Data ISA --- Instruction Systolic Array MIMD --- Multiple Instruction Multiple Data
Heiko Schröder, 1998Slide 13 P A R C parallel merge initial situation: 1.) sort columns (odd-even-transposition sort) 2.) sort rows (odd-even-transposition sort) sorted !!!! x1x2 x3 x4 x5 x6 x7 x17 x18 y1 y2 y3 y4 y5 y6 y7 y17 y18...
Heiko Schröder, 1998Slide 14 P A R C 0-1 principle The 0-1 principle states that if all sequences of 0 and 1 are sorted properly than this is a correct sorter. The sorter must be based on moving data. initially 0s 1s after vertical sort 0s 1s after horizontal sort 0s 1s
Heiko Schröder, 1998Slide 15 P A R C MIMD-mesh (clocked) min maxTime: 2n
Heiko Schröder, 1998Slide 16 P A R C systolic merge
Heiko Schröder, 1998Slide 17 P A R C systolic merge
Heiko Schröder, 1998Slide 18 P A R C systolic merge
Heiko Schröder, 1998Slide 19 P A R C systolic merge
Heiko Schröder, 1998Slide 20 P A R C systolic merge
Heiko Schröder, 1998Slide 21 P A R C systolic merge
Heiko Schröder, 1998Slide 22 P A R C systolic merge
Heiko Schröder, 1998Slide 23 P A R C systolic merge
Heiko Schröder, 1998Slide 24 P A R C systolic merge
Heiko Schröder, 1998Slide 25 P A R C systolic merge
Heiko Schröder, 1998Slide 26 P A R C systolic merge
Heiko Schröder, 1998Slide 27 P A R C systolic merge
Heiko Schröder, 1998Slide 28 P A R C systolic merge
Heiko Schröder, 1998Slide 29 P A R C systolic merge
Heiko Schröder, 1998Slide 30 P A R C systolic merge
Heiko Schröder, 1998Slide 31 P A R C systolic merge
Heiko Schröder, 1998Slide 32 P A R C systolic merge
Heiko Schröder, 1998Slide 33 P A R C systolic merge
Heiko Schröder, 1998Slide 34 P A R C systolic merge
Heiko Schröder, 1998Slide 35 P A R C systolic merge
Heiko Schröder, 1998Slide 36 P A R C sorted !!! systolic merge
Heiko Schröder, 1998Slide 37 P A R C Characteristics of SAs Extremely high cost-performance no flexibility -- long development time Suitable for special signal processing tasks ???
Heiko Schröder, 1998Slide 38 P A R C Systolic architectures I
Heiko Schröder, 1998Slide 39 P A R C Systolic architectures II
Heiko Schröder, 1998Slide 40 P A R C ISA merge C:=min{C, CE} C:=max{C, CW}
Heiko Schröder, 1998Slide 41 P A R C ISA merge
Heiko Schröder, 1998Slide 42 P A R C ISA merge
Heiko Schröder, 1998Slide 43 P A R C ISA merge
Heiko Schröder, 1998Slide 44 P A R C ISA merge
Heiko Schröder, 1998Slide 45 P A R C ISA merge
Heiko Schröder, 1998Slide 46 P A R C ISA merge
Heiko Schröder, 1998Slide 47 P A R C ISA merge
Heiko Schröder, 1998Slide 48 P A R C ISA merge
Heiko Schröder, 1998Slide 49 P A R C ISA merge
Heiko Schröder, 1998Slide 50 P A R C ISA merge
Heiko Schröder, 1998Slide 51 P A R C ISA merge
Heiko Schröder, 1998Slide 52 P A R C ISA merge
Heiko Schröder, 1998Slide 53 P A R C ISA merge
Heiko Schröder, 1998Slide 54 P A R C ISA merge
Heiko Schröder, 1998Slide 55 P A R C ISA merge
Heiko Schröder, 1998Slide 56 P A R C ISA merge
Heiko Schröder, 1998Slide 57 P A R C ISA merge
Heiko Schröder, 1998Slide 58 P A R C ISA merge
Heiko Schröder, 1998Slide 59 P A R C ISA merge
Heiko Schröder, 1998Slide 60 P A R C ISA merge
Heiko Schröder, 1998Slide 61 P A R C ISA merge
Heiko Schröder, 1998Slide 62 P A R C ISA merge
Heiko Schröder, 1998Slide 63 P A R C ISA merge
Heiko Schröder, 1998Slide 64 P A R C ISA merge
Heiko Schröder, 1998Slide 65 P A R C ISA merge
Heiko Schröder, 1998Slide 66 P A R C ISA merge
Heiko Schröder, 1998Slide 67 P A R C ISA merge
Heiko Schröder, 1998Slide 68 P A R C ISA merge
Heiko Schröder, 1998Slide 69 P A R C ISA merge
Heiko Schröder, 1998Slide 70 P A R C ISA merge
Heiko Schröder, 1998Slide 71 P A R C Hough transform on the ISA good line detection method shear Fast tomography
Heiko Schröder, 1998Slide 72 P A R C robot vision projector CCD stereo vision
Heiko Schröder, 1998Slide 73 P A R C Use of the ISA Areas of application for ISA: automatic optical quality control real time signal processing computer graphics /visualization linear equations Cryptography --> Tele-medicine ? Special features: fast aggregate functions (sum, carry) fast local communication no local memory typical improvement over PC: Factor 20-30
Heiko Schröder, 1998Slide 74 P A R C Instruction Systolic Array
Heiko Schröder, 1998Slide 75 P A R C PIPS ( ) 1 M bit 1 M bit 32x32 torus 16 bit parallel communication 16 bit add prefetch memory control BHP -- CSIRO -- NU -- ADFA 1.4 M
Heiko Schröder, 1998Slide 76 P A R C Special features: local memory SIMD-torus memory pre-fetch Applications: visualization 3D-simulation (CFD, FEM)
Heiko Schröder, 1998Slide 77 P A R C
Heiko Schröder, 1998Slide 78 P A R C PIPS
Heiko Schröder, 1998Slide 79 P A R C Use in industry ? Performance [Gflops] Research Industry
Heiko Schröder, 1998Slide 80 P A R C Investments Investments into parallel computers [M$] Research Industry
Heiko Schröder, 1998Slide 81 P A R C Concentration Number of manufacturers
Heiko Schröder, 1998Slide 82 P A R C Degree of Parallelism May-93 Nov-93 May-94 Nov-94 May-95 Nov-95 May-96 Nov-96 May-97 Nov-97 1 to to to and more Number of new Systems
Heiko Schröder, 1998Slide 83 P A R C Evaluation Cost * computation time Parallel computers with standard components Imbedded parallel systems
Heiko Schröder, 1998Slide 84 P A R C reconfigurable mesh reconfigurable mesh = mesh + interior connections 15 positions low cost
Heiko Schröder, 1998Slide 85 P A R C global OR and modulo ** V log n on EREW-PRAM * 1 mod 3 log n / log log n on CRCW-PRAM
Heiko Schröder, 1998Slide 86 P A R C sorting with all-to-all mapping Sorting: sort blocks all-to-all (columns) sort blocks all-to-all (rows) o-e-sort blocks
Heiko Schröder, 1998Slide 87 P A R C all-to-all mapping n x n
Heiko Schröder, 1998Slide 88 P A R C vertical all-to-all
Heiko Schröder, 1998Slide 89 P A R C horizontal all-to-all
Heiko Schröder, 1998Slide 90 P A R C 1 step 2 steps 3 steps k/2 steps 3 steps 2 steps 1 step (k/2) 2 steps
Heiko Schröder, 1998Slide 91 P A R C sorting in optimal time (k/2) 2 steps k=n 1/3 each step takes n 1/3 time --> T= n/4 x 2 /2 T = n/2 all-to-all Sorting: sort blocks (O(n 2/3 )) all-to-all (n/2) sort blocks (O(n 2/3 )) all-to-all (n/2) sort blocks (O(n 2/3 )) time: n + o(n)
Heiko Schröder, 1998Slide 92 P A R C Reconfigurable mesh Special features SIMD constant diameter faster than PRAM ? Suitable applications routing/sorting/load balancing sparse matrix multiplication segmentation / component labeling feature extraction image database ?
Heiko Schröder, 1998Slide 93 P A R C Reconfigurable mesh
Heiko Schröder, 1998Slide 94 P A R C Optical Highway W=1; P=100 W=100; P=22 All-to-all connection
Heiko Schröder, 1998Slide 95 P A R C Horizontal all-to-all Vertical all-to-all
Heiko Schröder, 1998Slide 96 P A R C Features of optically connected meshes SIMD/SPMD/MIMD implement all major architectures all-to-all communication in 2 steps Bulk synchronous processing (BSP) no latency hiding no pin-limitation Applications coarse grain parallel computing only? ray-tracing ? ???
Heiko Schröder, 1998Slide 97 P A R C Optical Highway 1. H. Schröder et al, RMB --- A Reconfigurable Multiple Bus Network, HPCA 96, San Jose, H. Schröder, O. Sykora, I. Vrto, Optical All-to-All Communication for some Product Graphs, SOFSEM '97, Milovy, Czech Republic, 1997
Heiko Schröder, 1998Slide 98 P A R C Bisection-width / Diameter Diameter log n bisection width n diameter bisection width SA ISA PIPS Diameter 1 bisection width RM Diameter 1 bisection width n OH
Heiko Schröder, 1998Slide 99 P A R C Suitable problems ? diameter log n bisection width n SA: suitable applications? ISA: 2D-problems, aggregate functions local communication PIPS: 3D-problems, local communication RM: diameter-bound > bisection-width-bound OH: PRAM equivalent? SA ISA PIPS RM OH
? ? ? ?