Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of.

Slides:



Advertisements
Ähnliche Präsentationen
Service Oriented Architectures for Remote Instrumentation
Advertisements

R. Zankl – Ch. Oelschlegel – M. Schüler – M. Karg – H. Obermayer R. Gottanka – F. Rösch – P. Keidler – A. Spangler th Expert Meeting Business.
1 | R. Steinbrecher | IMK-IFU | KIT – die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Natural Sources SNAP11.
Peter Marwedel TU Dortmund, Informatik 12
Aufgabenbesprechung Programming Contest. Order 7 Bo Pat Jean Kevin Claude William Marybeth 6 Jim Ben Zoe Joey Frederick Annabelle 0 SET 1 Bo Jean Claude.
Art der Arbeit (Projekt-/Studien-/Diplomarbeit/
Lancing: What is the future? Lutz Heinemann Profil Institute for Clinical Research, San Diego, US Profil Institut für Stoffwechselforschung, Neuss Science.
Java in 9 Folien Besser: Online-Buch Go to Java 2.
SOMA Service-Oriented Mobile learning Architecture.
Can you think of some KEY phrases which would be useful in multiple contexts? Take 2 minutes with a partner and come up with as many as you can!
Institut AIFB, Universität Karlsruhe (TH) Forschungsuniversität gegründet 1825 Towards Automatic Composition of Processes based on Semantic.
| DC-IAP/SVC3 | © Bosch Rexroth Pneumatics GmbH This document, as well as the data, specifications and other information set forth in.
BAS5SE | Fachhochschule Hagenberg | Daniel Khan | S SPR5 MVC Plugin Development SPR6P.
CALPER Publications From Handouts to Pedagogical Materials.
4th Symposium on Lidar Atmospheric Applications
Design Patterns Ein Muster (pattern) ist eine Idee, die sich in einem praktischen Kontext als nützlich erwiesen hat und dies auch in anderen sein wird.
External Labels – The rules For all external labels the following rules apply (external labels are all labels which are not inside of a shape) - all labels.
RZPD Deutsches Ressourcenzentrum für Genomforschung GmbH DESPRAD-Meeting 02/09/2003 Steffen Schulze-Kremer (until 7/2003) Bernd Drescher (since 8/2003)
Launch ON Global.vi System ID object name classname Services to suscribe Observer Control Ref vi-path Service name Step 1 : Objects register to the Global.vi´s,
Technische Universität München 1 CADUI' June FUNDP Namur G B I The FUSE-System: an Integrated User Interface Design Environment Frank Lonczewski.
Coordinating Conjunctions Why we need them & how to use them deutschdrang.com.
Präsentiert von Riccardo Fuda.  Klassische (symmetrische) Kryptographie  Der weg zur modernen Kryptographie  Message Authentification Codes  Asymmetrische.
HADOOP – a Brief overview
Institut für Angewandte Mikroelektronik und Datentechnik Phase 5 Architectural impact on ASIC and FPGA Nils Büscher Selected Topics in VLSI Design (Module.
1 Bauhaus-Universität Weimar ArchitekturProgrammierung Generative Entwurfsmethoden Processing Grundlagen Professur Informatik in der Architektur.
Synchronization: Multiversion Concurrency Control
SiPass standalone.
Stephanie Müller, Rechtswissenschaftliches Institut, Universität Zürich, Rämistrasse 74/17, 8001 Zürich, Criminal liability.
Literary Machines, zusammengestellt für ::COLLABOR:: von H. Mittendorfer Literary MACHINES 1980 bis 1987, by Theodor Holm NELSON ISBN
FTS usage at GridKa Forschungszentrum Karlsruhe GmbH
Welcome to Web Services & Grid Computing Jens Mache
Zürcher Fachhochschule Einführung in Big Data II Kurt Stockinger 1.
Physik multimedial Lehr- und Lernmodule für das Studium der Physik als Nebenfach Julika Mimkes: Links to e-learning content for.
1 Konica Minolta IT Solutions Prinzip Partnerschaft MANAGED MONITORING ÜBERWACHJUNG DER SERVERINFRASTRUKTUR UND ANWENDUNGEN DIREKT AUS DER CLOUD.
1/15 Thursday, 21 June 2007 Werner Sudendorf, Jürgen Keiper Deutsche Kinemathek – Museum für Film und Fernsehen Werner Sudendorf, Jürgen Keiper Reconstructing.
Gregor Graf Oracle Portal (Part of the Oracle Application Server 9i) Gregor Graf (2001,2002)
Berliner Elektronenspeicherring-Gesellschaft für Synchrotronstrahlung m.b.H., Albert-Einstein-Straße 15, Berlin frontend control at BESSY R. Fleischhauer.
1 Persistence Strategies for WebServices Senior Consultant Java Forum Stuttgart, 27. Juni 2002.
D. Sturzebecher, Institut für Betriebssysteme und Rechnerverbund, TU-Braunschweig Tele-Teaching with MACS n Tele-Teaching Requirements and Scenarios n.
Sven Koerber-Abe, 2015 Grammatik: können, wollen, möchten Grammatik: können, wollen, möchten.
Sven Koerber-Abe, 2015 Grammatik: müssen, dürfen Grammatik: müssen, dürfen.
How to play: Students are broken up into 2-3 teams (depending on class size). Students can see the game board and the categories, but not point values.
Sven Koerber-Abe, 2016 Grammatik: Artikel (Zusammenfassung) Grammatik: Artikel (Zusammenfassung)
Technische Universität München Alexander Neidhardt Forschungseinrichtung Satellitengeodäsie 1 Concepts for remote control of VLBI-telescopes: on the way.
LLP DE-COMENIUS-CMP Dieses Projekt wurde mit Unterstützung der Europäischen Kommission finanziert. Die Verantwortung für den Inhalt dieser.
Kanton Basel-Stadt Howto crash a sequencer …and a path to get a difficult package to work APP-V Swissgroup / Daniel Müller.
SAS Backstage Biljana Gigić1, Andreas Deckert2
Interrogatives and Verbs
Apache Camel Christian Schneider
The IT–Information System
Premiere Conferencing GmbH
Agenda Eröffnung und Begrüßung durch Mag.a Elisabeth Rosenberger
Azure Countdown Wenn der Freund und Helfer Freunde und Helfer braucht: Sichere Content-Upload-Plattform für Bürger.
Aspect-Oriented Programming: Fad or the Future
Grammatik: waren / hatten
Senseo Wir erweitern die statischen Informationen mit Hilfe von Laufzeitinformationen. Einbindung in IDE resp. JDT.
Pädagogische Hochschule NÖ
THE PAST TENSE (Part 3) VERBS WHICH TAKE SEIN
The new online recognition process
ELECTR IC CARS Karim Aly University of Applied Sciences.
Official Statistics Web Cartography in Germany − Regional Statistics, Federal and European Elections, Future Activities − Joint Working Party meeting.
CSL211 Computer Architecture
THE PAST TENSE (Part 3) VERBS WHICH TAKE SEIN
OFFICE 365 FOCUS SESSION SHAREPOINT ONLINE 101:LERNE DIE BASICS 19. März 2018 Höhr-Grenzhausen.
Integrating Knowledge Discovery into Knowledge Management
Practical Exercises and Theory
- moodle – a internet based learning platform
Eine Präsentation von Heiko Gericke
Zhunussova G., AA 81. Linguistic communication, i.e. the use of language, is characteristically vocal and verbal behaviour, involving the use of discrete.
 Präsentation transkript:

Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Why Map/Reduce? 2 | J. Tao | MapReduce & Hadoop | Massive data Can not be stored on a single machine Takes too long to process in serial Application developers Never dealt with large amount (petabytes) of data Less experience with parallel programming Google proposed a simple programming model – MapReduce – to process large scale data The first MapReduce library on March 2003 Hadoop: Map/Reduce implementation for clusters Open source Widely adopted by large Web companies

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Introduction to MapReduce 3 | J. Tao | MapReduce & Hadoop | A programming model for processing large scale data Automatic parallelization Friendly to procedural languages Data processing is based on a map step and a reduce step Apply a map operation to each logical “record” in the input to generate intermediate values Followed by the reduce operation to merge the same intermediate values Sample use cases Web data processing Data mining and machine learning Volume rendering Biological applications (DNA sequence alignments)

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) The MapReduce Execution Model 4 | J. Tao | MapReduce & Hadoop | Splitting the input data Sorting the output of the map functions Passing the output of map functions to reduce functions Data Partitions Reduce outputs map tasks reduce tasks

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Simple Example: WordCount 5 | J. Tao | MapReduce & Hadoop | Count the occurrence of each word in a text Map ReduceShuffle Input (Text) See Bob run See Jim come Jim leave John come Output See 1 Bob 1 run 1 See 1 Jim 1 come 1 Jim 1 leave 1 John 1 come 1 Input See 1 Bob 1 run 1 Jim 1 come 1 leave 1 John 1 Output See 2 Bob 1 run 1 Jim 2 come 2 leave 1 John 1

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Programming WordCount 6 | J. Tao | MapReduce & Hadoop | Write map and reduce methods Input and output: map: (input of WordCount: no key) reduce: maps transform input records into intermediate records reduces reduce a set of intermediate values which share a key to a smaller set of values Implement Mapper and Reducer interfaces MapReduce user interfaces Most important: Mapper, Reducer Other core interfaces: JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, OutputFormat, OutputCommitter

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) The Map Class 7 | J. Tao | MapReduce & Hadoop | public static class Map extends MapReduceBase implements Mapper { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); }

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Job Configuration 8 | J. Tao | MapReduce & Hadoop | public static void main(String[] args) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(Map.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); ……………………………. JobClient.runJob(conf); } How much data a map processes? Use the InputFormat in the job configuration How many maps? Driven by the total size of the inputs and the number of processor nodes Suggestion: a map takes at least a minute to execute

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) The Reduce Class 9 | J. Tao | MapReduce & Hadoop | public static class Reduce extends MapReduceBase implements Reducer { public void reduce(Text key, Iterator values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } All intermediate values associated with an output key are grouped and passed to the Reducer(s) Control which keys go to which Reducer by implementing the Partitioner interface

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Run WordCount on Hadoop 10 | J. Tao | MapReduce & Hadoop | Create the executable Sample input files Run the application $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java $ jar -cvf /user/xxx/wordcount.jar -C wordcount_classes/. $ bin/hadoop dfs -ls /user/xxx/wordcount/input/ /user/xxx/wordcount/input/file01 /user/xxx/wordcount/input/file02 $bin/hadoop jar /user/xxx/wordcount.jar /user/xxx/wordcount/input /user/xxx/wordcount/output Output: $ bin/hadoop dfs -cat /user/xxx/wordcount/output/part Bob 1 come 2 Jim 2 ……

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Hadoop MapReduce Data to process do not fit on one node Move computation to data Distribute filesystem, built in replication, automatic failover in case of failure Apache Hadoop provides  Automatic parallelization and distribution  Fault tolerance  Monitoring and status reporting  Clean abstraction interface for developers Based on the Hadoop Distributed File System (HDFS) 11 | J. Tao | MapReduce & Hadoop |

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Hadoop MapReduce Architecture Single master running a JobTracker instance Multiple cluster-nodes running each a TaskTracker instance JobTracker distributes Map/Reduce tasks to TaskTrackers 12 | J. Tao | MapReduce & Hadoop | MapReduce job submitted by client computer JobTracker Master node TaskTracker Slave node TaskTracker Slave node TaskTracker

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Hadoop MapReduce Data Flow 13 | J. Tao | MapReduce & Hadoop |

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Hadoop Distributed File System A scalable, fault tolerant distributed file system able to run on a cluster of commodity hardware A master/slave architecture  Single NameNode  Multiple DataNodes Files are broken into 64 MB or 128 MB Blocks Blocks replicated across several DataNodes to handle hardware failure (default replication number is 3) Single file system Namespace for the whole cluster  NameNode holds the file system metadata (files names, block locations, replication factor, etc) 14 | J. Tao | MapReduce & Hadoop |

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Hadoop Distributed File System (cont.) Data operations via FS shell - the command line interface  bin/hadoop dfs -mkdir dirname  bin/hadoop dfs -rmr dirname  bin/hadoop dfs -ls dirname  bin/hadoop dfs -cat filename Example: copy the input of WordCount to HDFS  /bin/hadoop dfs –put file01 file02 /user/xxx/wordcount/input/  /bin/hadoop dfs -ls /user/xxx/wordcount/input/ Found 2 items …………… /user/xxx/wordcount/input/file01 …………… /user/xxx/wordcount/input/file02  /bin/hadoop jar ……. –input /user/xxx/wordcount/input/….. 15 | J. Tao | MapReduce & Hadoop |

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Hadoop on Distributed Clusters Motivation  Hadoop is limited to a single cluster  Existing distributed data centers owns solid cluster scheduling frameworks  Requirement on computational capacity from large applications Goal: g-Hadoop – runing MapReduce on multiple clusters An on-going work at SCC/KIT; collaboration with the Indiana University Tasks  Replacing HDFS with Gfarm – a Grid Datafarm  Integration with the batch systems 16 | J. Tao | MapReduce & Hadoop |

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) g-Hadoop vs. Hadoop: Hadoop Architecture View 17 | J. Tao | MapReduce & Hadoop | Hadoop Cluster Hadoop Map/Reduce Master Hadoop HDFS Master Hadoop Map/Reduce Slave HDFS Slave... Hadoop Map/Reduce Slave HDFS Slave Hadoop Map/Reduce Slave HDFS Slave

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) g-Hadoop vs. Hadoop: g-Hadoop Architecture 18 | J. Tao | MapReduce & Hadoop | G-Hadoop enabled Cluster N G-Hadoop enabled Cluster A TORQUE Master... GfarmFS Slave slave node Hadoop Map/Reduce Slave Hadoop Gfarm Plugin Hadoop TORQUE Plugin G-Hadoop Master Hadoop Map/Reduce Master GfarmFS Master Hadoop Gfarm Plugin Single Sign-on Map/Reduce Slave slave node

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) 19 | J. Tao | MapReduce & Hadoop | Conclusion Map/Reduce is a framework for distributed computing on large data sets Hadoop is a widely accepted and used implementation of the Map/Reduce framework, but  No load-balancing mechanisms  Fault tolerance is based on a simple scheme Other implementations of MapReduce on clusters  Twister – iterative MapReduce  DryadLINQ MapReduce framework MapReduce on other architectures  Mars - Map/Reduce on NVIDIA Graphics processors  Phoenix - Map/Reduce for Shared memory

KIT - Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) 20 | J. Tao | MapReduce & Hadoop | Links Hadoop (also MapReduce and HDFS) Twister Interactive MapReduce 1st MapReduce Workshop (by HPDC 2010) Dean, Jeff and Ghemawat, Sanjay, MapReduce: Simplified Data Processing on Large Clusters