Large Scale Data Analysis
Ulf Leser, Marc Bux, Astrid Rheinländer, Johannes Starlinger
Seminar, Summer 2012


Ulf Leser et al.: Large Scale Data Analysis, Seminar, SoSe 2012

A web-scale search engine maintains a copy of the World Wide Web
– estimated at 7.3 billion pages (March 2012)
Challenges
– search this copy in sub-second time
– identify duplicate content
– compute the 'importance' of pages (PageRank)
– display content-related ads
– …

The world's largest social network
– 721 million active users (May 2011)
– 68.7 billion friendship links, 2.7 billion likes per day
Challenges
– provide real-time updates of friends' activities
– suggest new friends (link prediction)
– display content-related ads
– …

The world's fastest news medium
Real-time communication via short messages
– 2009: 2 million tweets per day
– 2010: 65 million tweets per day
– 2011: 200 million tweets per day
Challenges
– allow search in (near) real time
– recommend interesting people (link prediction)
– find topics in the messages
– …

Big Data: Terabyte, petabyte, and beyond
The data does not fit in main memory and does not fit on a single disk
A simple scan may already take a long time
Anything with worse than linear complexity is practically infeasible
Indexing is very hard (and tasks are not that repetitive)
Not thousands of ad-hoc queries, but dozens of ad-hoc analysis tasks
Main trick: Parallelization
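To make the scan-time point concrete, here is a back-of-the-envelope sketch. The figures are assumptions for illustration and not from the slides: 1 PB of data, 100 MB/s sequential read throughput per disk, and a perfectly even spread of the data over the machines.

```python
def scan_time_seconds(data_bytes, throughput_bytes_per_s, machines=1):
    """Time for one full sequential scan if data is spread evenly over `machines`."""
    return data_bytes / (throughput_bytes_per_s * machines)


PETABYTE = 10**15               # assumed data volume
DISK_THROUGHPUT = 100 * 10**6   # assumed: 100 MB/s sequential read per disk

single = scan_time_seconds(PETABYTE, DISK_THROUGHPUT)
parallel = scan_time_seconds(PETABYTE, DISK_THROUGHPUT, machines=10_000)

print(f"1 machine:       {single / 86_400:.0f} days")
print(f"10,000 machines: {parallel / 60:.1f} minutes")
```

Under these assumptions a single disk needs roughly 116 days for one scan, while 10,000 machines need under 17 minutes. Real clusters will not reach this ideal speed-up because of skew, network transfer, and failures.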

Embarrassingly Parallel Problems
Parallel algorithms have been a research focus for decades
But: Most research focused on problems that are hard to parallelize (e.g. simulation)
Yet many tasks are very easy ("embarrassingly" easy) to parallelize: data parallelism
– The analysis is always performed on exactly one "record"
– All records can be analyzed in parallel
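A minimal sketch of data parallelism, assuming a hypothetical per-record function `analyze` (here it just counts tokens). Threads stand in for machines; in CPython they do not speed up CPU-bound work, so this illustrates only the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(record):
    """Hypothetical per-record analysis: count the tokens in one record."""
    return len(record.split())


def analyze_all(records, workers=4):
    # Every record is processed independently of all others, so the calls may
    # run in any order and on any worker. map() returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, records))


records = ["a b c", "d e", "f"]
print(analyze_all(records))
```

Because `analyze` never looks at more than one record, no coordination between workers is needed. That independence is what makes the problem embarrassingly parallel.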

Problems Remaining
Describe your task in a way that allows scheduling and parallelization
Manage 10,000 machines running concurrently
Schedule your tasks (data partitions) on 10,000 machines
Where are your machines? Cloud computing

Who should be here
Bachelor Informatik, Diplominformatik
Ability to read English papers
Good knowledge of databases
– Queries, operator trees, query optimization
Some knowledge of distributed/parallel systems

How it will work
Today: Presentation and choice of topics
Meet your advisor by 15.5. to discuss topic and papers
Present your topic in a 5-minute flash presentation at the end of May
Pass a mini-exam on the background papers
Meet your advisor by 20.6. to discuss slides
Present your topic (30-40 min) at the Blockseminar
Challenge another student as opponent at the Blockseminar
Write a seminar thesis (10-15 pages) by 31.8.2012

ToC
Introduction
Topics
Assignment
Hints on presenting your topic and writing your thesis

Describe your Task

Two General Approaches
Workflows
– Strong tradition in BPM; control flow versus dataflow; complex semantics if cycles are allowed
– Scientific workflows: (only) dataflow, no cycles
Query languages
– Strong tradition in databases; selection and manipulation; clear (relational) semantics based on logic; much work on (parallel) optimization
– Dataflow languages: only SELECT, richer model than SQL
Common ground: A dataflow
– Data flows along edges
– Nodes implement analysis steps (filter, compute, group, …)
– Special nodes may implement bifurcation

JAQL

(Parallel) Execution of a Dataflow
Principle: Distribute nodes (tasks) across machines and let the data run through them
Which tasks should run where?
How is data passed around?
– Data to tasks or tasks to data?
Multi-core and many-core?
Failures?
Re-ordering of tasks?
…

Parallel Execution of a Dataflow
[Figure: Data parallelism (Partition 1 and Partition 2 each run through your entire dataflow) versus task parallelism (all data passes through Task 1 and Task 2)]

Pipelining
[Figure: Task 1 and Task 2 process Partition 1 and Partition 2 as a pipeline]
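A minimal sketch of pipelining with two stages connected by a queue: while the second task works on the first partition, the first task can already process the next one. The function `pipeline` and the two example tasks are hypothetical and only illustrate the idea; threads stand in for machines.

```python
import queue
import threading


def pipeline(partitions, task1, task2):
    """Run task1 and task2 as two concurrent stages connected by a FIFO queue."""
    handoff = queue.Queue()
    results = []
    done = object()  # sentinel that marks the end of the stream

    def stage1():
        for partition in partitions:
            handoff.put(task1(partition))  # hand over as soon as it is ready
        handoff.put(done)

    def stage2():
        while True:
            item = handoff.get()
            if item is done:
                break
            results.append(task2(item))

    threads = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Task 1 doubles every value of a partition, Task 2 sums the partition.
print(pipeline([[1, 2], [3, 4]], lambda p: [x * 2 for x in p], sum))
```

The queue preserves the order of the partitions because there is exactly one producer and one consumer.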

Map Reduce
Dean and Ghemawat: MapReduce: simplified data processing on large clusters, CACM, 2008
– One of the main drivers of this entire line of research
– ~4500 citations since 2008
Separates the code of your tasks (user code) from the code that schedules and manages task execution
Requires a certain (restricted) programming model
Builds on a distributed file system (GFS)
Open source implementation: Hadoop

Map and Reduce
Map
– Reads a file with key/value pairs
– Creates an intermediate set of key/value pairs
Reduce
– Reads all sets of intermediate key/value pairs
– Sorts, groups and aggregates
[Figure: All data is split into partitions P1, P2, P3; MAP tasks process the partitions and REDUCE tasks produce the results]
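The programming model can be sketched in a few lines of single-process Python. This is an illustration under stated assumptions, not Hadoop's API: `run_mapreduce`, the example data, and the two user functions are all hypothetical, and the "shuffle" is a plain in-memory grouping.

```python
from collections import defaultdict


def run_mapreduce(partitions, map_fn, reduce_fn):
    """Apply map_fn to every input pair, group by key, then apply reduce_fn."""
    # Map phase: each input key/value pair yields intermediate key/value pairs.
    intermediate = []
    for partition in partitions:
        for key, value in partition:
            intermediate.extend(map_fn(key, value))

    # Shuffle: collect all intermediate values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: aggregate the values of each key, visiting keys in sorted order.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# User code: maximum reading per city (made-up example data).
partitions = [
    [("s1", ("Berlin", 21)), ("s2", ("Paris", 25))],
    [("s3", ("Berlin", 24))],
]
result = run_mapreduce(
    partitions,
    map_fn=lambda _key, reading: [(reading[0], reading[1])],
    reduce_fn=lambda _city, values: max(values),
)
print(result)
```

Only `map_fn` and `reduce_fn` are user code. Everything inside `run_mapreduce` corresponds to what the system would handle, in a real deployment across many machines.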

Example: Word Count
[Figure: Many texts are split into sets of some texts; each set is counted into word:count pairs; the pairs are grouped by word and the counts are aggregated]
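A minimal single-process sketch of the word count example. The names `map_count` and `reduce_counts` are hypothetical; each partition of texts is counted on its own (the map side), and the partial word:count results are then merged per word (the reduce side).

```python
from collections import Counter
from functools import reduce


def map_count(texts):
    """Map: count the words of one partition of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def reduce_counts(partial_counts):
    """Reduce: group by word and add up the partial counts."""
    return reduce(lambda a, b: a + b, partial_counts, Counter())


partitions = [["to be or not", "to be"], ["be quick"]]
total = reduce_counts(map_count(p) for p in partitions)
print(total["be"], total["to"], total["quick"])
```

Counting locally before merging keeps the intermediate data small, since each partition emits one count per distinct word instead of one pair per occurrence.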

Using Map Reduce
Implement your map function(s)
Implement your reduce function(s)
The system takes care of partitioning, scheduling, sorting after map, logging, restarting after failure, …
How to map a dataflow into a MapReduce flow?

Background knowledge
Markl, V., Ewen, S. and Alexandrov, A. (2012). "Web-Scale Analytics for BIG Data." To appear.
– You'll get a hardcopy
Varia, J. (2008). "Cloud Architectures". Amazon Technical White Paper.
Deelman, E., Gannon, D., Shields, M. and Taylor, I. (2009). "Workflows and e-Science: An overview of workflow system features and capabilities." Future Generation Computer Systems 25(5): 528-540.

Topic | Assigned to
Pig Latin (Leser) |
DryadLinq (Leser) |
JAQL & JSON (Rheinländer) |
Stratosphere (Rheinländer) |
Text Mining with MapReduce (Rheinländer) |
Cluster versus cloud (Leser) |
Taverna (Starlinger) |
Kepler (Starlinger) |
Scientific Workflow Interoperability (Starlinger) |
Pegasus (Bux) |
Workflow Scheduling (Bux) |
Workflow Optimization (Bux) |

PIG-LATIN (Leser)
Gates et al. Building a high-level dataflow system on top of Map-Reduce: the Pig experience, VLDB 2009
Olston et al. Pig Latin: a not-so-foreign language for data processing, SIGMOD 2008
Simple language to program series of FILTER, SORT, JOIN, GROUP-BY etc. plus UDFs on files
Compiles to a series of MapReduce jobs on Hadoop
– Virtually no optimization, only data parallelism
Sweet spot between declarative SQL and procedural MapReduce
In heavy use at Yahoo!
Good engineering, little theory or breakthrough

DRYAD-LINQ (Leser)
Ekanayake et al. DryadLINQ for Scientific Analyses, eScience 2009
Yu et al. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, OSDI 2008
Similar system from Microsoft
Compiles LINQ programs to DAG-structured dataflows
– Close embedding
– No higher-order functions
Optimization during scheduling with Dryad
Second paper compares DryadLINQ with Hadoop using four examples from scientific data analysis
Less database-centric, more systems-oriented

JAQL & JSON (Rheinländer)
JSON: JavaScript Object Notation
– Machine and human readable
– Data exchange between applications
– Schema specification possible
JAQL
– Declarative scripting language for large-scale data analysis
– Data model: JSON
– Parallelized and executed on Hadoop
[Beyer et al.: Jaql: A Scripting Language for Large Scale Semistructured Data Analysis. 2011. PVLDB]

Stratosphere (Rheinländer)
Programming Model: PACT
– Generalization of MapReduce
– Tasks are expressed as data flow graphs
– Parallelization Contracts (PACTs) process data
– PACT: Input contract + UDF + Output contract
Execution Engine: Nephele
– Handles parallel execution in local, cluster, or cloud mode
[http://www.stratosphere.eu/projects/Stratosphere/wiki/ExecutePactProgram]

Text Processing with MapReduce (Rheinländer)
Inverted index creation
– Fast full-text and web search
– Baseline: distributed sort + group by
Problem: The number of term occurrences can grow large → scalability bottleneck
– Mechanisms to eliminate the bottleneck: modified algorithm, compression techniques
Apache Mahout
– Machine-learning algorithms using MapReduce
– Clustering, classification, …
[http://mahout.apache.org]
[Lin/Dyer: Data-intensive Text Processing with MapReduce. 2010. Morgan & Claypool]
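A minimal single-process sketch of the baseline inverted index construction (map, group by term, sort). The function names and the tiny corpus are hypothetical. It also shows where the bottleneck arises: all postings of a term are collected in one in-memory list before they are sorted.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit (term, (doc_id, term_frequency)) for each distinct term."""
    freqs = defaultdict(int)
    for term in text.lower().split():
        freqs[term] += 1
    return [(term, (doc_id, tf)) for term, tf in freqs.items()]


def build_index(docs):
    """Group postings by term, then sort each postings list by document id."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_doc(doc_id, text):
            # For a frequent term this list can become very large.
            postings[term].append(posting)
    return {term: sorted(plist) for term, plist in postings.items()}


docs = {1: "big data big clusters", 2: "data flows"}
index = build_index(docs)
print(index["data"])
```

Each postings list maps a term to the documents containing it, together with the term frequency in that document.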

Cloud versus Cluster (Leser)
Jackson et al. Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud, Cloud Computing Technology and Science 2010
– You are expected to find further literature on your own
Runtime comparison of a local cluster with the cloud
– Here: Amazon EC2
Takes into account data upload, virtualization layer, hardware, communication model, load balancing, instance type, …
Some surprising (and simple) observations
High-level viewpoint

Taverna (Starlinger)
Missier et al. (2010): "Taverna Reloaded."
Popular SWFM in the life sciences
– Offers many preconfigured tasks for bioinformatics web services
Dataflow-based workflow model
Automatic data and task parallelism and pipelining
– As parallel threads
– No distributed execution
Extensible plug-in infrastructure and dispatch stack

Kepler (Starlinger)
Bowers et al. (2005) "Actor-oriented design of scientific workflows".
Supports multiple 'directors' (execution engines)
– Corresponding to different models of computation
– With different levels of parallelization
SWFM with loads of backing research
– "Let's formalize it!"
– Here: formal model for SWFs
Recent developments towards distributed execution with MapReduce on Hadoop

Workflow Interoperability (Starlinger)
Elmroth et al. "Three fundamental dimensions of scientific workflow interoperability..."
How can we plug workflows from different SWFMs together?
Need to consider
– Model of computation
– Workflow language
– Execution environment
Questions:
– What exactly is interoperability?
– What are these dimensions?
– How do they affect interoperability?

Pegasus (Bux)
Deelman et al. (2005). "Pegasus: A framework for mapping complex scientific workflows onto distributed systems."
Developed for the Grid, extended for the Cloud
Workflows specified as DAG in XML
Only supports task parallelism (no data parallelism)
Many workflows from different large data projects

Workflow Scheduling (Bux)
Mandal et al. (2005). "Scheduling Strategies for Mapping Application Workflows onto the Grid."
Map tasks of a workflow onto Grid resources
Objective: minimize overall execution time
1st step: model the performance of each task on each resource
2nd step: apply heuristics to find a good mapping
Is it better to dispatch short-running or long-running tasks first?
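The closing question can be explored with a small greedy list scheduler. This is a simplified sketch, not the algorithm from the paper: it assumes independent tasks with known durations and identical machines, whereas the paper maps workflow tasks onto heterogeneous Grid resources. The function name and the task durations are made up.

```python
import heapq


def makespan(durations, machines, longest_first):
    """Greedily assign each task to the machine that becomes free first."""
    order = sorted(durations, reverse=longest_first)
    loads = [0] * machines  # finish time of each machine, kept as a min-heap
    heapq.heapify(loads)
    for duration in order:
        earliest = heapq.heappop(loads)
        heapq.heappush(loads, earliest + duration)
    return max(loads)


tasks = [2, 3, 4, 6, 2, 2]  # assumed task durations
print("longest first: ", makespan(tasks, machines=2, longest_first=True))
print("shortest first:", makespan(tasks, machines=2, longest_first=False))
```

On this particular input, dispatching long-running tasks first gives a makespan of 10 and dispatching short-running tasks first gives 11. Other inputs, task dependencies, or heterogeneous resources can change the picture, which is why the choice of heuristic is worth studying.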

Workflow Schedule Optimization (Bux)
Yu and Buyya (2006). "A budget constrained scheduling of workflow applications on utility grids using genetic algorithms."

Topic | Assigned to
Pig Latin (Leser) |
DryadLinq (Leser) |
JAQL & JSON (Rheinländer) |
Stratosphere (Rheinländer) | Schultze (Will)
Text Mining with MapReduce (Rheinländer) |
Cluster versus cloud (Leser) |
Taverna (Starlinger) | Weber (Schultze)
Kepler (Starlinger) |
Scientific Workflow Interoperability (Starlinger) |
Pegasus (Bux) |
Workflow Scheduling (Bux) |
Workflow Optimization (Bux) | Will (Weber)

General Hints
The lecturers are approachable!
– Preliminary discussion of the topic
– Going through the slides
– Scoping the written thesis
Discussion is welcome
– Don't be afraid of questions: questions are not criticism
– Not being able to answer a question is fine
Depth, not breadth
– Better to narrow the topic and explain the details
Make connections
– Compare with other work (in the seminar)

General Hints
Assess and evaluate
– Don't be afraid of statements that are not entirely accurate, as long as you have good reasons for them
– Justify and argue
– Uncritical copying is out of place
Literature research is expected
– The papers handed out are anchors
– Further literature must be consulted
– Also read up on the fundamentals
We will pass around a checklist

How Do I Give a Seminar Talk
1. If you have to give such a seminar talk, it is advisable to write sentences that are as long as possible on the slides, so that after the talk the audience can still tell from the slide handouts what you actually said.
2. During such a talk everyone looks at the projection anyway, so you may as well do the same; this also lets you check whether the projector really shows everything that is visible on the laptop. Besides, you save power for the laptop display.
3. Overview slides at the beginning are boring, contain no content, and take away all the suspense. After all, the cinema does not give you a synopsis at the start either.
4. PowerPoint offers many funny effects and has great designs and animations. You should definitely use all of them to liven up the talk and to show how well you master the tool.
5. Do not write too little on the slides. You never know whether you might have to print them after all, and you save valuable time if you do not have to advance to the next slide.
6. Try to talk as long as possible. Time limits are only for people who do not know enough; what the examiner really wants to see is that you have studied the topic beyond what was required.
Whatever you do, no emphasis in the text; otherwise the audience would not need to pay attention at all!

Hints for the Talk
~30 minutes plus discussion
Clear structure
– Indicate now and then where you are in the talk
Pictures and graphics; examples
Font: at least 16pt
Keywords rather than long sentences
Talks can also be entertaining
– Gimmicks, changes of rhythm, involving the audience, etc.
The audience is all participants, not just the advisors
Equipment: Laptop? PowerPoint?

The Opponent
Opponents know the papers
– But do not write about them
They look for weak spots, unclear passages, or points of criticism
After the talk, the opponents question the speaker on these points; the speaker defends as well as possible
Goal:
– Exposing the speaker
– Encouraging critical engagement with a topic
– Recognizing weak spots

Hints for the Written Thesis
Hand in a printed version
– Sign the declaration of independent work
Send an electronic version
References
– All the ones you used, and only those
– Cite in the text, list at the end
Cite correctly
– Be careful about copying complete text passages; if you do, mark them clearly
– Support statements with evidence or a reference to the literature
Use of papers found on the web
– Possible, but CAUTION
Possibly shift the focus of the topic: ask your advisor

Format
Use our LaTeX template
Only one typeface; few and consistent changes in font size and weight
Table of contents
Figures: number them and refer to them
References:
– [1] Yan, X., Yu, P. S. and Han, J. (2004). "Graph Indexing: A Frequent Structure-Based Approach". SIGMOD, Paris, France.
– [YYH04] Yan, X., Yu, P. S. and Han, J. (2004). "Graph Indexing: A Frequent Structure-Based Approach". SIGMOD, Paris, France.
May one cite Wikipedia?
– Only as a last resort; use a permalink!

Hints for the Written Thesis (2)
Write in a focused and factual manner
– Keep digressions such as "Philosophical considerations on the merits of probabilistic methods compared to Dempster's Theory of Evidence" or "Remarks on the trivialization of political discourse in social networks, with special consideration of Twitter" as short as possible
– Avoid filler words (e.g. "dabei", "hierbei", "dann", …)
– Concise presentation, precise language
A good outline is half the battle
Arrive at conclusions
– Advantages, disadvantages, related work, possible extensions, applicability, your own experience, …

We are looking for student assistants (SHKs)
