Algorithm Engineering: Parallel Algorithms
Stefan Edelkamp
Overview
- Parallel External Search
  - Parallel Delayed Duplicate Elimination
  - Parallel Expansion
  - Distributed Sorting
- Parallel Structured Duplicate Elimination
  - Disjoint Duplicate-Detection Scopes
  - Locks
- Parallel Algorithms
  - Matrix Multiplication
  - List Ranking
  - Euler Tour
Distributed Search
The distributed setting provides more space. Experiments show that internal computation time dominates I/O time.
Exploiting Independence
Since the states in a bucket are independent of one another, they can be expanded in parallel. Duplicate removal can be distributed over different processors. Bulk (streamed) transfers are much faster than single ones.
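Since bucket entries are independent, expansion is an embarrassingly parallel map followed by one bulk duplicate-removal pass. A minimal sketch; the `expand` successor function here is a hypothetical stand-in, not from the slides:

```python
from concurrent.futures import ThreadPoolExecutor

def expand(state):
    # Hypothetical successor function: neighbors on a number line.
    return [state - 1, state + 1]

def expand_bucket_parallel(bucket, workers=4):
    # Each state is independent, so the map needs no synchronization.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        successor_lists = list(pool.map(expand, bucket))
    # Merge all successors and remove duplicates in one bulk pass.
    return set(s for succs in successor_lists for s in succs)

print(sorted(expand_bucket_parallel([0, 1, 2])))  # [-1, 0, 1, 2, 3]
```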
Distributed Queue for Parallel Best-First Search
Processors P0, P1, P2 share access to the TOP element. Beware of the mutual-exclusion problem!
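A minimal sketch of such a shared open list, with a single lock guarding every TOP access to avoid the mutual-exclusion problem (class and method names are illustrative):

```python
import heapq
import threading

class SharedOpenList:
    """Priority queue shared by worker threads; one lock guards
    every access, so pushes and pops never interleave unsafely."""
    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()

    def push(self, priority, state):
        with self._lock:
            heapq.heappush(self._heap, (priority, state))

    def pop(self):
        # Returns the (priority, state) pair with minimum priority.
        with self._lock:
            return heapq.heappop(self._heap) if self._heap else None

q = SharedOpenList()
for f, s in [(3, "c"), (1, "a"), (2, "b")]:
    q.push(f, s)
print(q.pop())  # (1, 'a')
```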
Multiple Processors, Multiple Disks Variant
[Figure: processors P1 .. P4 fill buffers sorted w.r.t. the hash value; the buffers are divided according to the hash ranges (h_0 .. h_{k-1}, h_k .. h_{l-1}); the sorted buffers from every processor are merged into a sorted file per range.]
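The divide-w.r.t.-hash-ranges step can be sketched as follows; the table size and equal-width ranges are assumptions for illustration:

```python
def partition_by_hash_range(states, num_procs, table_size=64):
    # Assign each state to the processor that owns its hash range,
    # so all duplicates of a state meet at the same processor.
    parts = [[] for _ in range(num_procs)]
    width = table_size // num_procs
    for s in states:
        h = hash(s) % table_size
        parts[min(h // width, num_procs - 1)].append(s)
    return parts

print(partition_by_hash_range(list(range(8)), 2, table_size=8))
```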
Parallel External A*
Distributed Heuristic Evaluation
Assume one child processor for each tile and one master processor.
[Figure: two puzzle boards with positions B0 .. B15.]
Distributed Pattern Database Search
- Only the pattern databases whose pattern includes the client's tile need to be loaded on that client.
- Because a pattern contains multiple tiles, from a bird's-eye view each PDB is loaded multiple times (once per client whose tile it covers).
- In the 15-Puzzle with corner and fringe PDBs this saves RAM on the order of a factor of 2 on each machine, compared to loading all databases.
- In the 36-Puzzle with 6-tile pattern databases this saves RAM on the order of a factor of 6 on each machine, compared to loading all databases.
- Extends to additive pattern databases.
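The loading rule itself is simple to state in code; the concrete tile sets below are illustrative assumptions, not taken from the slides:

```python
def databases_for_client(patterns, client_tile):
    # Load on a client only the pattern databases whose
    # pattern contains that client's tile.
    return [p for p in patterns if client_tile in p]

# Illustrative 15-Puzzle patterns (tile sets assumed for the example):
fringe = {3, 7, 11, 12, 13, 14, 15}
corner = {1, 2, 4, 5, 6, 8, 9, 10}

assert databases_for_client([fringe, corner], client_tile=7) == [fringe]
assert databases_for_client([fringe, corner], client_tile=5) == [corner]
```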
Same bottleneck as in external-memory search: duplicate detection. Duplicate paths cause parallelization overhead.
[Figure: a diamond-shaped graph over states A, B, C, D whose expansion emits the duplicate stream B, C, D, D, D, D; internal memory (fast) vs. external memory (slow).]
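Delayed duplicate elimination handles such successor streams by sorting, so duplicates become adjacent and are removed in a single scan; a minimal in-memory sketch of the external-memory idea:

```python
def delayed_duplicate_elimination(successors):
    # External-memory style: sort the successor stream so that
    # duplicates become adjacent, then drop them in one scan.
    successors.sort()
    out = []
    for s in successors:
        if not out or out[-1] != s:
            out.append(s)
    return out

print(delayed_duplicate_elimination(["B", "C", "D", "D", "D", "D"]))
# ['B', 'C', 'D']
```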
Disjoint Duplicate-Detection Scopes
[Figure: abstract graph over nodes B0 .. B15 with the duplicate-detection scopes of two abstract nodes highlighted; the scopes are disjoint.]
Finding Disjoint Duplicate-Detection Scopes
[Figure: the abstract graph partitioned into scopes such as {B0, B1, B4}, {B2, B3, B7}, {B8, B11, B12, B13, B14, B15}, and {B1, B4, B5, B6, B9}.]
Implementation of Parallel SDD
- Hierarchical organization of hash tables: one hash table for each abstract node; the top-level hash function is the state-space projection function.
- Shared-memory management: minimum memory-allocation size m; wasted memory is bounded by O(m · #processors).
- External-memory version: I/O-efficient order of node expansions, I/O-efficient replacement strategy.
- Requires only one mutex lock.
[Figure: abstract graph over nodes B0 .. B15.]
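A minimal sketch of the hierarchical hash-table organization, with one table per abstract node and a single mutex serializing only table allocation (class and method names are illustrative):

```python
import threading

class HierarchicalHashTable:
    """One hash table per abstract node; the state-space projection
    function acts as the top-level hash. Only allocation of a new
    table is guarded by the single mutex lock."""
    def __init__(self, project):
        self.project = project               # state-space projection
        self.tables = {}                     # abstract node -> hash table
        self._alloc_lock = threading.Lock()  # the single mutex lock

    def insert(self, state):
        key = self.project(state)
        if key not in self.tables:
            with self._alloc_lock:           # serialize only allocation
                self.tables.setdefault(key, set())
        table = self.tables[key]
        if state in table:
            return False                     # duplicate detected
        table.add(state)
        return True

h = HierarchicalHashTable(project=lambda s: s % 4)
print(h.insert(5), h.insert(5))  # True False
```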
Parallel Matrix Multiplication
Parallel Matrix Multiplication
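Row computations in matrix multiplication are independent, so they can be distributed over processors; a minimal thread-pool sketch (worker count and helper names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_row(args):
    # Compute one output row; rows are independent, so each
    # can be assigned to a different processor.
    row, B = args
    cols = len(B[0])
    return [sum(row[k] * B[k][j] for k in range(len(B))) for j in range(cols)]

def parallel_matmul(A, B, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(matmul_row, [(row, B) for row in A]))

A = [[1, 2], [3, 4]]
I = [[1, 0], [0, 1]]
print(parallel_matmul(A, I))  # [[1, 2], [3, 4]]
```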
Exclusive Write
Parallel Copies
Conclusion: Matrix Multiplication
Parallel List Ranking
List Ranking
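The classic PRAM approach is pointer jumping (Wyllie's algorithm): each node's distance to the tail doubles per round, so O(log n) synchronous rounds suffice. A sketch in which the parallel rounds are simulated sequentially:

```python
import math

def list_ranking(succ):
    # succ[i] is the successor of node i; the tail points to itself.
    # rank[i] ends as the number of links from i to the tail.
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    succ = list(succ)
    for _ in range(max(1, math.ceil(math.log2(n)))):
        # Conceptually: for all i in parallel (synchronous update,
        # both lists read the values from the previous round).
        rank = [rank[i] + rank[succ[i]] for i in range(n)]
        succ = [succ[succ[i]] for i in range(n)]
    return rank

print(list_ranking([1, 2, 3, 3]))  # [3, 2, 1, 0]
```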
First Algorithm
Principle
Complexity
Improvements
Strategy
Independent Sets
2-Coloring
Reduction
Restoration
Example
Variables
Example (cont.)
Pseudocode
Next Step
Analysis
Backup
Algorithm
Memory
Analysis
Outlook: Randomized in O(n) w.h.p.?
Problems with DFS
Idea: Euler Tour
Parallel DFS
DFS Numbers
General Case
Example
One Cycle or Several?
Correctness
Example
Constructing the Euler Tour
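The standard construction assigns every arc (u, v) a successor arc (v, w), where w follows u in the cyclic adjacency order of v; each successor is computable independently per arc, hence in parallel on a PRAM. A minimal sketch on a small tree (the example tree is an assumption):

```python
def euler_tour(adj):
    # Euler tour of a tree: each undirected edge becomes two arcs.
    # next arc of (u, v) is (v, w) with w the neighbor after u in
    # v's cyclic adjacency list.
    nxt = {}
    for v, neigh in adj.items():
        for i, u in enumerate(neigh):
            nxt[(u, v)] = (v, neigh[(i + 1) % len(neigh)])
    return nxt

# Star tree: node 0 adjacent to 1 and 2.
adj = {0: [1, 2], 1: [0], 2: [0]}
nxt = euler_tour(adj)
# Follow the tour starting from arc (0, 1); 2*(n-1) = 4 arcs total.
arc, tour = (0, 1), []
for _ in range(4):
    tour.append(arc)
    arc = nxt[arc]
print(tour)  # [(0, 1), (1, 0), (0, 2), (2, 0)]
```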
Conclusion: Euler Tours
GPU Architecture
Effectiveness
Hierarchical Memory
Hash-based Partitioning
BFS
Kernel Functions