Data Warehouse Day 2 Day 1 Review / Recall


Data Warehouse Day 2: Day 1 Review / Recall
- Name the phases of the Business Intelligence process.
- How would you describe the current business dynamic?
- Why focus on customers and customer behavior?
- How would you describe a customer?
- What is a profitable customer?
- What information do we need to record about them?
- What are the technical and logical reasons for a Data Warehouse solution as opposed to an operational system?
Ben Martin, BA Lörrach, WI 4th semester, 4/21/2002

Data Warehouse Glossary: Data Warehousing Requirements
- Independence between data sources and analysis systems (with respect to availability, load, and ongoing changes)
- Permanent provision of integrated and derived data (persistence)
- Reusability of the provided data
- Ability to carry out, in principle, arbitrary analyses

Data Warehouse Glossary: Data Warehouse Requirements II
- Support for individual views (e.g. with respect to time horizon or structure)
- Extensibility (e.g. integration of new sources)
- Automation of processes
- Unambiguous definition of data structures, access permissions, and processes
- Orientation toward the purpose: analysis of the data

Data Warehouse Glossary: Data Warehouse Characteristics
- Priorities: ease of use, flexible access, refresh, query
- Processor use: highly unpredictable
- Response time: seconds to hours (data mining may take hours)
- Database: usually relational (RDBMS)
- Data content: organized by subject, partitioned
- Nature of data: historical
- Application processing: unstructured, heuristic, analytical
- End users: management, decision makers, knowledge workers

Data Warehouse Glossary: Data Warehouse Characteristics II
User Expectations
- Differences in response time between a DWH and a client-server front-end application may be significant.
- You need to manage users' expectations regarding response time.
- Set reasonable and achievable targets for query response that can be assessed and proved in the first increment of development.
- Then you can define, specify, and agree on an SLA.
- Talk to the users!

Data Warehouse Glossary: Data Warehouse Characteristics III
Exponential Growth and Use
- Once implemented, a DWH continues to grow in size.
- With each refresh, more data is added (or archived).
- A DWH grows very quickly: on the order of gigabytes per month, terabytes per year.
- Once the success of a DWH implementation is proven, its use increases dramatically.
- Use often grows faster than expected.

Data Warehouse Glossary: Data Warehouse Properties

Data Warehouse Glossary: Data Warehouse Properties II

Data Warehouse Glossary: Data Warehouse Properties III
Subject Areas
- For a given subject: snapshots of data across the business, covering different time periods and different emphases of the data view
- Typical subject areas:
  - Customer accounts
  - Product sales
  - Customer savings
  - Toll calls (telecommunication)
  - Airline passenger booking information
  - Insurance claim data

Data Warehouse Glossary: Data Warehouse Properties IV
Subject Areas and Warehouse Data Model
- You develop a data model to hold the data that you will use to measure the business.
- You include the information that you will use to analyze the business.
- You measure the business according to sales figures.
- You analyze the sales by customer, region, salesperson, territory, or store (or any combination).
- Subject-oriented information provides the departments within a corporation with a common understanding of their business.

Data Warehouse Glossary: Data Warehouse Properties V

Data Warehouse Glossary: Data Warehouse Properties VI
Data status of online transaction processing data:
- It is dispersed across diverse and independent legacy systems.
- It is impossible to measure business performance because of the diversity, inconsistency in the data, differences in database management systems, and lack of external information.

Data Warehouse Glossary: Data Warehouse Properties VII
- The DWH integrates the data into one set of quality information that is meaningful, accurate, and intelligible for analysis.
- Standardization and integration of data:
  - Naming conventions
  - Coding structures
  - Physical data attributes
  - Measurement of variables
- The cleaning and integration process is time-consuming and costly!

Data Warehouse Glossary: Data Warehouse Properties VIII

Data Warehouse Glossary: Data Warehouse Properties IX
- The time key is a vital database attribute.
- Analysis of data is over a time period (days, weeks, months, quarters, years).
- Database key columns contain an element of time that determines the business period to which the data relates.
- The structure and meaning of this element vary between implementations and business needs.
- Refresh cycles must be determined in the early stages of analyzing the business users' requirements.

Data Warehouse Glossary: Data Warehouse Properties X
Grain of Data (granularity)
- The grain is the level at which the data is held in the DWH tables.
- Operational system: the grain of data is transactional (one record for each transaction).
- The refresh cycle may not have the same grain as the data cycle.
- It is more usual to store data in a summarized form by week, month, or other business-defined time period.
- You may choose to refresh the data warehouse every week while the grain of the data is daily totals (or refresh monthly with weekly totals, etc.).
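
The difference between the transactional grain and a summarized grain can be sketched in a few lines of Python. This is only an illustration: the records, the field names (`day`, `store`, `amount`), and the function `summarize_daily` are invented for the example and do not come from the slides.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactional records (operational grain: one row per transaction).
transactions = [
    {"day": date(2002, 4, 15), "store": "A", "amount": 20.0},
    {"day": date(2002, 4, 15), "store": "A", "amount": 5.5},
    {"day": date(2002, 4, 16), "store": "A", "amount": 12.0},
    {"day": date(2002, 4, 16), "store": "B", "amount": 7.0},
]

def summarize_daily(rows):
    """Roll transactional rows up to one total per (day, store)."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["day"], row["store"])] += row["amount"]
    return dict(totals)

# Warehouse grain: daily totals, regardless of how often the refresh runs.
daily_totals = summarize_daily(transactions)
for (day, store), total in sorted(daily_totals.items()):
    print(day.isoformat(), store, total)
```

A weekly refresh would simply run this summarization over the week's transactions; the stored grain stays at daily totals.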

Data Warehouse Glossary: Data Warehouse Properties XI

Data Warehouse Glossary: Data Warehouse Properties XII
Changing Data: the following operations are typical of a DWH
- An initial set of data is loaded (first-time load).
- Frequent snapshots of core data are added according to the refresh cycle.
- DWH data may need to be changed in other ways:
  - The business determines how much historical data is needed for analysis (older data is archived or purged).
  - Inappropriate or inaccurate data values may be deleted from or migrated out of the DWH.

Data Warehouse Glossary: Enterprise-Wide Data Warehouse
- Stores all data from all subject areas within the business for analysis by end users.
- The scope is the entire business and all operational aspects within the business.
- Normally created through a series of incrementally developed solutions.
- The EDWH provides:
  - a single source of corporate enterprise-wide data
  - a single source of synchronized data for each subject area
  - a single point for distribution of data to dependent data marts

Data Warehouse Glossary: Data Marts
- Task: provide a view of the DW that is restricted in content (e.g. for a department or for functions)
- Reasons: autonomy, data protection, load distribution, data volume, etc.
- Realization: distribution of the DW data
- Forms: dependent data marts, independent data marts

Data Warehouse Glossary: Data Marts II
Benefits
- Provide localization: they serve users at a specific level or for a specific purpose.
- Smaller and easier to manage than an EDWH.
- The need may come from geographical or functional divisions or from technical groups within an enterprise.
- Data marts reduce the demands on warehouse data and also the data access traffic.

Data Warehouse Glossary: Data Marts Independent

Data Warehouse Glossary: Data Marts Independent II
- Built and loaded directly from the operational systems.
- Motivation for this kind of implementation:
  - Line of Business (LOB) empowerment
  - short time frame for implementation
- The methods for extracting and loading operational data are the same as in the DWH solution.
- Retrospective integration and transformation into a single DW solution is possible.
- Issue: independent data transformation processes.

Data Warehouse Glossary: Data Marts Dependent

Data Warehouse Glossary: Data Marts Dependent II
- A subset of enterprise-wide data, built and loaded from the enterprise DW.
- They need only extract from the data warehouse and transport the data into themselves, at a higher grain than the DW.
- They do not transform any data (faster, cheaper).
- Other advantages:
  - performance, availability, connection costs
  - more resistant to change
  - maintains a single version of the data

Data Warehouse Glossary: Data Mart Dependent III
- Structural extracts: restriction to parts of the schema. Example: only certain measures or dimensions.
- Content extracts: restriction by content. Example: only certain branches or the last annual result.
- Aggregated extracts: reduction of granularity. Example: restriction to monthly results.
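
The three kinds of extract can be illustrated with a small Python sketch. The sample rows, column names, and the three helper functions are hypothetical; they only show the idea of restricting columns, restricting rows, and reducing granularity.

```python
# Hypothetical warehouse rows: one row per branch and month.
warehouse = [
    {"branch": "Lörrach", "month": "2001-11", "revenue": 100, "cost": 60},
    {"branch": "Lörrach", "month": "2001-12", "revenue": 120, "cost": 70},
    {"branch": "Basel", "month": "2001-12", "revenue": 90, "cost": 50},
]

def structural_extract(rows, columns):
    """Keep only part of the schema (selected measures or dimensions)."""
    return [{c: row[c] for c in columns} for row in rows]

def content_extract(rows, predicate):
    """Keep only the rows that match a content restriction."""
    return [row for row in rows if predicate(row)]

def aggregated_extract(rows, key, measure):
    """Reduce granularity by summing a measure per key value."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[measure]
    return totals

revenue_only = structural_extract(warehouse, ["branch", "month", "revenue"])
loerrach_only = content_extract(warehouse, lambda r: r["branch"] == "Lörrach")
revenue_by_month = aggregated_extract(warehouse, "month", "revenue")
```

In practice a data mart often combines all three, for example monthly revenue for a single branch.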

Data Warehouse Glossary: Data Mart Considerations
- Avoid disparate data mart solutions.
- Build towards the enterprise-wide strategy.
- Consistent use of products, technology, and processes is vital.
- Always employ dependent data mart solutions to avoid disparity problems.

Data Warehouse Glossary: Data Mart Characteristics
- Priorities: ease of use, flexible data access
- Processor use: highly unpredictable
- Response time: seconds to several minutes
- Database: relational, multidimensional
- Data content: organized by subject for the LOB
- Nature of data: historical (months or weeks rather than years)
- Application processing: unstructured, heuristic, analytical
- End users: as for the DW, plus statisticians

Data Warehouse Glossary: Operational Data Store

Data Warehouse Glossary: Operational Data Store
- Holds the current data for analysis or application integration.
- May form a staging area for the warehouse.
- May contain integrated, clean, summarized data.
- Limited life expectancy for summary data.
- May be updated synchronously with the operational system on a store-and-forward basis.
- Exists in a separate environment.

Data Warehouse Glossary: ODS Characteristics
- Priorities: ease of use, flexible data access
- Response time: seconds to minutes
- Database: relational
- Data content: organized by subject, current-value data, integrated
- Nature of data: dynamic
- Processing: structured, analytical
- End users: DBAs, clerical users

Data Warehouse Glossary: Meta Data
- Definition: "any kind of information that is needed for the design, construction, and use of an information system"
- For the DW:
  - necessary to cover the information, protection, and security needs of the users and the software
  - produced and used in all phases
  - consistent provision of metadata from different sources is necessary -> repository

Data Warehouse Glossary: Meta Data Usage
- Passive: as documentation of the various aspects of a DW system
- Active: storage of semantic aspects (e.g. transformation rules) and their interpretation at runtime
- Semi-active: storage of structural information (table definitions, configuration specifications) and its use for verification (not directly for execution)

Data Warehouse Glossary: Meta Data Objects
- Business key figures
- Views for individual user groups
- Transformation of data from the source systems into the DW
- Load routines and rules
- Structure of queries, filters, display templates, ...

Data Warehouse Glossary: Meta Data Objects II
- Administration information: access statistics, backup/recovery, building of aggregates, ...
- Database parameters and settings: server, hardware environment, tuning parameters
- Query performance: precomputed aggregates, caching, optimization strategies
- Granularity of the data

Data Warehouse Glossary: Meta Data Objects III
- General attributes: units of measure, etc.
- Security strategy: user profiles and groups, restrictions on views
- Reporting and analysis objects, reports

Data Warehouse Glossary: Meta Data Repository
Goal 1: minimize the effort for building and operating a DW
- System integration:
  - integration at schema and data level requires information about the structure and semantics of the source and target systems
  - uniform management of metadata for integrating the DW tools
- Automation of administration:
  - control of the DW processes via scheduling and configuration metadata
  - data about the execution of the processes (logs, etc.)

Data Warehouse Glossary: Meta Data Repository II
Goal 1 (cont.): minimize the effort for building and operating a DW
- Flexible software design:
  - explicit representation of frequently changing aspects (e.g. transformation rules)
  - improved maintainability and extensibility
- Protection and security aspects:
  - treatment of access and user rights as metadata
  - global access mechanisms

Data Warehouse Glossary: Meta Data Repository III
Goal 2: ensure an optimal gain of information for all user groups
- Data quality:
  - ensuring the required quality through validation rules
  - traceability information (source system, author, time, etc.)
- Terminology:
  - uniform terminology as a prerequisite for uniform interpretation
  - central management in the metadata repository

Data Warehouse Glossary: Meta Data Repository IV
Goal 2 (cont.): ensure an optimal gain of information for all user groups
- Data analysis: metadata about the meaning of data, systems of key figures, ...

Data Warehouse Glossary: Meta Data Requirements Regarding Functionality
- User access:
  - mechanisms for navigation, filtering, and selection of metadata
  - support for manual updates
- Interoperability and tool support:
  - programming interface for read and write access
  - import and export interfaces
  - extensible metamodel
- Change management:
  - version and configuration management
  - notification mechanisms

Data Warehouse Architecture: Reference Architecture I

Data Warehouse Architecture: Reference Architecture II

Data Warehouse Architecture: Extraction, Transformation and Load Process (ETL)
- ETL process
- Integration problems
- Data cleaning
- Data capture methods
- Staging area
- Load window
- This area typically takes 70% of the overall effort in building a DWH!

Data Warehouse Architecture: ETL Problems
- Large number of sources
- Heterogeneity
- Data volume
- Complexity of the transformation:
  - schema and instance integration
  - data cleaning
- Hardly any end-to-end method and system support, although a large number of tools is available

Data Warehouse Architecture: Extraction, Transformation and Load Process (ETL)
- Extraction: selection of a subset of the data from the sources and its provision for transformation
- Transformation: adaptation of the data to given schema and quality requirements
- Load: physical insertion of the data from the staging area into the data warehouse (including any necessary aggregations)
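
The three steps can be sketched as a tiny pipeline in Python using the standard `sqlite3` module. Everything specific here is an assumption made for illustration: the source rows, the `sales` table, and the quality rule that drops rows with non-numeric amounts.

```python
import sqlite3

# Hypothetical source rows as they might arrive from an operational system.
source_rows = [
    {"customer": " alice ", "amount": "10.50", "country": "de"},
    {"customer": "Bob", "amount": "3.00", "country": "CH"},
    {"customer": "Carol", "amount": "n/a", "country": "DE"},
]

def extract(rows):
    """Select the slice of source data needed (here: rows that carry an amount)."""
    return [r for r in rows if "amount" in r]

def transform(rows):
    """Adapt rows to the target schema and reject rows failing quality checks."""
    clean = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # assumed quality rule: drop rows with non-numeric amounts
        clean.append((r["customer"].strip().title(), r["country"].upper(), amount))
    return clean

def load(conn, staged):
    """Insert staged rows into the warehouse table, then compute an aggregate."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", staged)
    conn.commit()
    return conn.execute(
        "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country"
    ).fetchall()

conn = sqlite3.connect(":memory:")
totals_by_country = load(conn, transform(extract(source_rows)))
print(totals_by_country)
```

A real ETL job would read from the source systems and write to a staging area between the steps; the in-memory lists stand in for both here.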

Data Warehouse Architecture: ETL Definition Phase

Data Warehouse Architecture: ETL Integration Problems
- Focus: problems of data integration
- Starting point: the data resides in the operational information systems
- Different systems -> heterogeneity

Data Warehouse Architecture: ETL Requirements for Integration
- It must be possible to bring all relevant data from the operational systems into the data warehouse.
- Different structures and representations of semantically identical or related data from the source systems must be converted into a common representation.
- Identical information that originates from several systems must be identified.
- Unwanted redundancy, which can distort analysis results, must be eliminated.

Data Warehouse Architecture: ETL Integration Conflicts
- Description conflicts
- Heterogeneity conflicts
- Structural conflicts
- These types of conflict usually occur in combination.
- In addition, and particularly important for data warehouses: data conflicts

Data Warehouse Architecture: ETL Description Conflicts
- Different properties or attributes of the same objects in the local schemas
- Homonymous and synonymous names
- Data type conflicts and value range conflicts: different data types or value ranges for the same property
- Scaling conflicts: use of different units of measure that can be converted into one another
- Examples?
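
One possible answer to the "Examples?" prompt, sketched in Python: two source systems describe the same kind of object with synonymous attribute names and different units. The systems, attribute names, and the `SYNONYMS` mapping are invented for the illustration.

```python
# Hypothetical records from two source systems describing products.
system_a = [{"product_id": 1, "weight_kg": 2.0}]
system_b = [{"prod_no": 2, "weight_g": 1500}]  # synonym "prod_no", unit is grams

# Assumed mapping of source attribute names to the common schema.
SYNONYMS = {"prod_no": "product_id"}

def to_common(record):
    """Rename synonymous attributes and convert weights to kilograms."""
    out = {}
    for name, value in record.items():
        name = SYNONYMS.get(name, name)
        if name == "weight_g":  # scaling conflict: grams -> kilograms
            name, value = "weight_kg", value / 1000
        out[name] = value
    return out

integrated = [to_common(r) for r in system_a + system_b]
print(integrated)
```

Homonyms need the opposite treatment: the same name in two systems is mapped to two different attributes in the common schema.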

Data Warehouse Architecture: ETL Heterogeneity Conflicts
- Different data models of the schemas to be integrated
- Different modeling constructs and expressive power
- Often implies structural conflicts as well
- Resolved by transformation into a common global data model
- Example: object-oriented DB versus relational model

Data Warehouse Architecture: ETL Structural Conflicts
- Even when the same data model is used (object or relational), the same facts are often modeled differently.
- This applies especially to semantically rich data models (with many modeling constructs).
- Example?

Data Warehouse Architecture: ETL Data Conflicts
A. Incorrect data
  1. incorrect entries
  2. outdated data
B. Different representations
  1. different expressions
  2. different units
  3. different precision
Examples?

Data Warehouse Architecture: ETL Data Cleaning
- Correction of incorrect, inconsistent, or incomplete data
- Also called: data cleansing, data scrubbing
- Techniques:
  - conversion of different formats (e.g. text files into DB tables via Oracle SQL*Loader)
  - mapping of data fields into a common format (strings in upper case, date format dd/mm/yyyy, currencies)
  - use of special tools is possible (often dictionary-based)
- Examples:
  - product names in the pharmaceutical sector
  - addresses via address databases (postal codes, telephone area codes)
  - synonyms and abbreviations ("Str." for "Straße")
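
A few of these techniques can be sketched in Python. The abbreviation dictionary and the list of accepted source date formats are assumptions for this example; dictionary-based tools work with far larger reference data.

```python
from datetime import datetime

# Assumed abbreviation dictionary; real tools use much larger dictionaries.
ABBREVIATIONS = {"Str.": "Straße"}

# Assumed set of date formats that occur in the sources.
DATE_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y"]

def clean_name(value):
    """Map names to a common format: trimmed and in upper case."""
    return value.strip().upper()

def clean_date(value):
    """Parse a date in any known source format and emit dd/mm/yyyy."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue
    raise ValueError(f"unknown date format: {value!r}")

def expand_abbreviations(street):
    """Replace known abbreviations with their full form."""
    for short, full in ABBREVIATIONS.items():
        street = street.replace(short, full)
    return street

print(clean_name("  müller "))
print(clean_date("21.04.2002"))
print(expand_abbreviations("Basler Str. 12"))
```

Values that match none of the known formats raise an error instead of being guessed, so they can be routed to manual correction.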

Data Warehouse Architecture: ETL Data Capture Methods
- Problem: after the initial load, incremental loads need to identify only the data that has changed on the source system.
- Triggers on the operational system:
  - whenever a record has changed, the changed value is written to a file
  - problem: performance of the operational system (database)
- The operational system generates a delta file:
  - code can be added to the operational system to generate a file containing the changed records
  - problem: code has to be added to the operational system

Data Warehouse Architecture: ETL Data Capture Methods II
- Analyze the log file of the operational system:
  - a copy of the log file can be used by checking the LAST UPDATE DATE field
  - recommended method
  - Example?
- Compare the current extract to the last extract:
  - obtain a specified extract file containing the latest snapshot of the operational data
  - this is compared with the last extract file
  - changes are inserted into the warehouse
  - most commonly used method
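
The extract-comparison method can be sketched as a snapshot diff in Python. The sketch assumes both extract files have already been parsed into dictionaries keyed by primary key; the data and the function name `diff_extracts` are hypothetical.

```python
def diff_extracts(previous, current):
    """Compare two snapshots keyed by primary key and return the delta."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deleted = [k for k in previous if k not in current]
    return inserted, updated, deleted

# Hypothetical extract files, already parsed into {primary_key: row} dicts.
last_extract = {
    1: {"name": "Alice", "city": "Basel"},
    2: {"name": "Bob", "city": "Lörrach"},
}
current_extract = {
    1: {"name": "Alice", "city": "Freiburg"},
    3: {"name": "Carol", "city": "Basel"},
}

inserted, updated, deleted = diff_extracts(last_extract, current_extract)
print(inserted, updated, deleted)
```

The method needs no change to the operational system, but both snapshots must be read in full on every run.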

Data Warehouse Architecture: ETL Staging Area
- Contains the tables that are transported to the data warehouse platform.
- Supplies the warehouse with both the first-time load and the regular refresh.
- A typical requirement of a DWH implementation.
- It may be an Operational Data Store (ODS), a series of tables in a relational database server, or flat files manipulated using in-house scripts and programs.
- Multi-tier staging (optional).

Data Warehouse Architecture: ETL Load Window

Data Warehouse Architecture: ETL Load Window
- The load window is simply the amount of time you have available to extract, transform, load, and post-load process the data and make the data warehouse available to the user.
- The load performs many sequential tasks that take time to execute.
- You must ensure that every event that occurs during the load window is planned, tested, proved, and constantly monitored.
- You may have to face poor load performance and gaps in providing the data for user access.
- Careful planning, defining, testing, and scheduling is critical!

Data Warehouse Architecture: ETL Load Window
- Load window strategy:
  - Load time depends on a number of factors such as data volumes, network capacity, and load utility capabilities.
  - Consider the user requirements first, then work out the load schedule backwards from that point.
- Load recovery:
  - You may also have to allow sufficient time within the batch load window to recover back to a logical business point in time (up to the close of business the previous day).
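
Working the schedule backwards from the user requirement is simple arithmetic, sketched here in Python. All durations, the 07:00 availability target, and the function name `latest_load_start` are made-up figures for the example.

```python
from datetime import datetime, timedelta

def latest_load_start(available_at, step_durations, recovery_allowance):
    """Work backwards from when users need the data to the latest safe start."""
    total = sum(step_durations.values(), timedelta()) + recovery_allowance
    return available_at - total

# Hypothetical figures: users need the warehouse at 07:00.
steps = {
    "extract": timedelta(hours=1),
    "transform": timedelta(hours=2),
    "load": timedelta(hours=1, minutes=30),
    "post_load": timedelta(minutes=30),
}
start = latest_load_start(
    available_at=datetime(2002, 4, 22, 7, 0),
    step_durations=steps,
    recovery_allowance=timedelta(hours=1),  # time reserved for load recovery
)
print(start)
```

If the computed start falls before the operational systems close for the day, the load window is too small and either the steps or the availability target must change.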

Warehouse Data Schemas: Overview

Warehouse Data Schemas: Overview
- A warehouse or mart will contain a large number of objects.
- Core objects:
  - fact data tables
  - dimensional data tables
  - reference data tables
  - summary data tables

Warehouse Data Schemas: Star Schema

Warehouse Data Schemas: Star Schema II
- A single, large central table surrounded by a number of smaller tables radiating from it, connected by database primary and foreign keys.
- The outlying tables are dimension tables; they control the query because they contain the data found in the query predicates.
- The most dominant warehouse schema.
- A DWH will contain many stars, not just one; each subject area will have its own fact table.
- Many fact tables may share dimensions (e.g. time).
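
A minimal star schema can be tried out with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_time`, `dim_store`) and the data are hypothetical; the point is that the predicate sits on a dimension table while the measure comes from the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (hypothetical names and columns).
    CREATE TABLE dim_time  (time_id INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
    -- Fact table: multi-part primary key, each part references a dimension.
    CREATE TABLE fact_sales (
        time_id  INTEGER REFERENCES dim_time(time_id),
        store_id INTEGER REFERENCES dim_store(store_id),
        revenue  REAL,
        PRIMARY KEY (time_id, store_id)
    );
    INSERT INTO dim_time  VALUES (1, '2002-03'), (2, '2002-04');
    INSERT INTO dim_store VALUES (1, 'North'), (2, 'South');
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 70.0);
""")

# The predicate is expressed on a dimension column; facts are reached via joins.
rows = conn.execute("""
    SELECT t.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_time  t ON t.time_id  = f.time_id
    JOIN dim_store s ON s.store_id = f.store_id
    WHERE s.region = 'North'
    GROUP BY t.month
    ORDER BY t.month
""").fetchall()
print(rows)
```

A second fact table, for example for customer savings, could reuse `dim_time` unchanged, which is what sharing a dimension means.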

Warehouse Data Schemas: Star Schema III

Warehouse Data Schemas: Snowflake Schema

Warehouse Data Schemas: Snowflake Schema II
- Closer to an entity relationship diagram than the classic star model.
- The dimension data is normalized.
- Developing a snowflake model means building class hierarchies out of each dimension (normalizing the data).

Warehouse Data Schemas: Snowflake Schema III

Warehouse Data Schemas Star Schema Advantages: easy to understand, the structure is simple and straightforward provides fast response to queries with optimization and reductions of joins required between fact and dimension tables supported by many front end tools Disadvantages may require more frequent rebuilding slow to build because of the level of denormalization not easy to design and use if you need to maintain the history of data or hierarchy within a dimension Ben Martin BA Lörrach, WI 4.Semester 4/21/2002

Warehouse Data Schemas Snowflake Schema
Advantages:
- certain advanced DSS tools and servers can use this structure directly
- provides a structure that is easier to change as requirements change
- loading data into smaller normalized tables is quicker than loading into huge denormalized tables
Disadvantages:
- the large number of dimension hierarchy tables may become an unmanageable model
- more joins may mean that performance declines

Warehouse Data Schemas Fact Table
- comprises the bulk of the data within the data warehouse, often many millions of rows
- holds the numerical measurements of business performance, such as sales figures or customer banking transactions
- is accessed by data values stored in dimension tables
- contains a multi-part primary key; each part of the key references a dimension by which the fact data is accessed
- you should consider the design of the fact table extremely carefully

Warehouse Data Schemas Fact Table - Granularity
- granularity is the level of detail: individual transactions, daily snapshots, monthly, quarterly
- high level of detail: transaction/daily
- low level of detail: week/month ...
- determines the size of the data warehouse
- the users, not technical restrictions, define the level of granularity
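The effect of granularity on row count can be shown with a small roll-up. The transactions below are made-up sample data, and roll_up is a hypothetical helper that truncates an ISO date string to obtain the period.

```python
from collections import defaultdict

# Transaction-level facts: (date, product, amount). Sample values are assumptions.
transactions = [
    ("2002-04-20", "Pen", 5.0),
    ("2002-04-20", "Pen", 2.5),
    ("2002-04-21", "Pen", 7.5),
    ("2002-05-02", "Pen", 1.0),
]


def roll_up(rows, period_length):
    """Aggregate amounts per (period, product).

    period_length=10 keeps 'YYYY-MM-DD' (daily grain),
    period_length=7 keeps 'YYYY-MM' (monthly grain).
    """
    totals = defaultdict(float)
    for date, product, amount in rows:
        totals[(date[:period_length], product)] += amount
    return dict(totals)


daily = roll_up(transactions, 10)
monthly = roll_up(transactions, 7)
print(len(transactions), len(daily), len(monthly))  # prints: 4 3 2
```

The coarser the grain, the fewer rows are stored, but detail that has been aggregated away cannot be recovered later, which is why the users' needs should drive the choice.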

Warehouse Data Schemas Fact Table - Design Considerations
- access performance, flexibility and manageability
- Partitioning
  - Horizontal: the fact table is broken into a number of smaller tables (load into one table, performance)
  - Vertical: the fact table is sliced into a number of narrower tables (performance, different user groups)

Warehouse Data Schemas Dimension Data Tables
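The two partitioning directions can be sketched on an in-memory list of rows. The column names and the two helper functions are assumptions for illustration; a real warehouse would use the partitioning features of its database.

```python
from collections import defaultdict

# Hypothetical fact rows; column names are assumptions.
fact_rows = [
    {"sale_id": 1, "month": "2002-03", "amount": 5.0, "cost": 3.0},
    {"sale_id": 2, "month": "2002-04", "amount": 7.5, "cost": 4.0},
    {"sale_id": 3, "month": "2002-04", "amount": 2.0, "cost": 1.0},
]


def partition_horizontally(rows, key):
    """Split by rows: each partition holds all columns for one key value."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


def partition_vertically(rows, key, column_groups):
    """Split by columns: each narrower table keeps the key plus one column group."""
    return {
        name: [{col: row[col] for col in (key, *cols)} for row in rows]
        for name, cols in column_groups.items()
    }


by_month = partition_horizontally(fact_rows, "month")
by_columns = partition_vertically(
    fact_rows, "sale_id", {"revenue": ["month", "amount"], "costing": ["cost"]}
)
print(sorted(by_month))
print(by_columns["costing"])
```

With the horizontal split, a monthly load only touches one partition. With the vertical split, a user group that only needs cost figures reads a narrower table.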

Warehouse Data Schemas Dimension Data Tables II
Updating dimension data
- dimension data is not refreshed in the same way as fact data
- changes in a dimension table are updates rather than inserts
- Example?

Warehouse Data Schemas Dimension Data Tables III
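One possible example, sketched with sqlite3: a customer moves to another city, so the existing dimension row is updated in place instead of a new row being inserted. The table dim_customer and its values are hypothetical.

```python
import sqlite3

dim_db = sqlite3.connect(":memory:")
dim_db.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
dim_db.execute("INSERT INTO dim_customer VALUES (1, 'A. Example', 'Basel')")

# The customer moves: the dimension row is updated, no new row is inserted.
dim_db.execute(
    "UPDATE dim_customer SET city = ? WHERE customer_id = ?", ("Freiburg", 1)
)

customer_rows = dim_db.execute(
    "SELECT customer_id, name, city FROM dim_customer"
).fetchall()
print(customer_rows)
```

Note that an in-place update like this overwrites the old value, so the previous city is no longer available to queries.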

Warehouse Data Schemas Dimension Data Tables - Time
Time in different environments:
- Operational
  - up-to-date snapshot of the business transactions at any point in time
  - the time element changes constantly; the system does not contain a serious amount of historical data
- Warehouse
  - provides an explicit time series of data
  - snapshots of the operational system are moved into the warehouse in a series of layers

Warehouse Data Schemas Dimension Data Tables - Time II

Warehouse Data Schemas Reference Data Tables

Warehouse Data Schemas Summary Data

Warehouse Data Schemas Summary Data Tables

Warehouse Data Schemas Summary Data Tables II
- Performance
  - improves query performance by giving queries direct access to pre-computed summaries and pre-defined views
  - because of its effect on user acceptance, one of the most important implementation considerations of a warehouse
- Content
  - based on data stored in dimension tables (e.g. customer attributes)
- Number of tables
  - hundreds

Warehouse Data Schemas Summary Data Tables III
- summaries are stored as additional tables or even within the fact tables (in that case a separate level indicator field/index is used)
- Benefits of separate summary fact tables
  - easier to manage: created, dropped, loaded and indexed separately
  - accessed faster than summaries embedded within the facts
  - but: as this information must refer to dimensional data, additional dimension tables may also have to be created
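A separate summary table can be sketched with sqlite3 as follows. The names fact_sales and sum_sales_by_segment, the customer_segment attribute and the sample rows are assumptions made for the example.

```python
import sqlite3

summary_db = sqlite3.connect(":memory:")
summary_db.executescript("""
CREATE TABLE fact_sales (day TEXT, customer_segment TEXT, amount REAL);
INSERT INTO fact_sales VALUES
    ('2002-04-20', 'retail',   5.0),
    ('2002-04-21', 'retail',   7.5),
    ('2002-04-21', 'business', 40.0);

-- Pre-computed summary kept in its own table, so it can be
-- dropped, reloaded and indexed independently of the fact table.
CREATE TABLE sum_sales_by_segment AS
    SELECT customer_segment,
           SUM(amount) AS total_amount,
           COUNT(*)    AS row_count
    FROM fact_sales
    GROUP BY customer_segment;
""")

# Queries read the small summary table instead of scanning the facts.
summary = dict(
    summary_db.execute(
        "SELECT customer_segment, total_amount FROM sum_sales_by_segment"
    )
)
print(summary)
```

The summary has to be rebuilt or refreshed after each load of the fact table, which is part of the management effort the slide refers to.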

Managing the Warehouse Sizing Storage
Attention must be paid to the storage requirements of the warehouse:
- data: fact, dimension, reference and summary tables
- staging file store
- indexes
- backup and recovery strategies
- temporary files
- log files
The database should be three to four times the size of the base fact table.
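The rule of thumb above can be turned into a back-of-the-envelope calculation. The row count and the average row width below are assumed example figures, and estimate_database_size is a hypothetical helper.

```python
def estimate_database_size(fact_rows, bytes_per_row, factor_low=3, factor_high=4):
    """Return (base fact table size, low estimate, high estimate) in bytes.

    Applies the rule of thumb that the whole database needs three to four
    times the space of the base fact table.
    """
    base = fact_rows * bytes_per_row
    return base, base * factor_low, base * factor_high


# Assumed example: 100 million fact rows at roughly 100 bytes each.
base, low, high = estimate_database_size(100_000_000, 100)
GB = 10**9  # decimal gigabytes
print(f"fact table ~{base / GB:.0f} GB, database ~{low / GB:.0f} to {high / GB:.0f} GB")
```

The result is only a first estimate; index strategy, number of summary tables and staging space can shift it considerably.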

Managing the Warehouse Sizing Storage II

Managing the Warehouse Sizing Storage III

Managing the Warehouse Monitoring and Performance Tuning
- not the same as OLTP: the DBA's job is not to hunt down and kill expensive queries
- DWH: high-throughput, insert/update-intensive systems that may contain large volumes of data, grow continuously and are accessed concurrently by hundreds of users
- tuning goals:
  - availability
  - transaction speed
  - concurrency (number of users and transactions)
  - recoverability

Managing the Warehouse Monitoring and Performance Tuning II
- techniques depend on the database vendor (Oracle, IBM ..)
- parallel query option

Managing the Warehouse Monitoring and Performance Tuning III

Managing the Warehouse Monitoring and Performance Tuning IV

Managing the Warehouse Monitoring and Performance Tuning V
Partitioning by dimension (region, time)
- high query performance and high scalability
- high availability, as each partition can be managed independently
- faster backup and restore, as these operations can be done on an individual partition

Managing the Warehouse Monitoring and Performance Tuning VI

Managing the Warehouse Monitoring and Performance Tuning VII

Managing the Warehouse Monitoring and Performance Tuning VIII

Managing the Warehouse Monitoring and Performance Tuning IX

Managing the Warehouse Archiving Data
- old data may need to be archived
- you need to identify an archive frequency
- use the partitioning option for archiving
- archive by dimension
- purge data and move the details to the archive
- plan and design early!

Managing the Warehouse Archiving Data II
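Archiving by time partition can be sketched as follows. The partition keys, the retention of two months and the helper archive_old_partitions are assumptions for illustration; the live partitions are modelled as a plain dictionary.

```python
def archive_old_partitions(partitions, keep_latest):
    """Move all but the newest `keep_latest` partitions out of the live set.

    `partitions` maps a 'YYYY-MM' key to its rows; such keys sort
    chronologically as plain strings. Archived partitions are removed
    from `partitions` (purged) and returned so they can be stored elsewhere.
    """
    ordered = sorted(partitions)
    cutoff = max(len(ordered) - keep_latest, 0)
    return {key: partitions.pop(key) for key in ordered[:cutoff]}


# Hypothetical monthly partitions of a fact table.
live = {
    "2002-01": [("Pen", 5.0)],
    "2002-02": [("Pen", 2.5)],
    "2002-03": [("Lamp", 40.0)],
    "2002-04": [("Pen", 7.5)],
}

archived = archive_old_partitions(live, keep_latest=2)
print(sorted(archived), sorted(live))
```

Because whole partitions are moved, the purge needs no row-by-row delete, which is the reason for choosing the partitioning key with archiving in mind.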

Managing the Warehouse Backup and Recovery
- the strategy needs to be developed early in the project
- technology and approach are driven by the user requirements
- impact of: partitioning, batch load window
- approaches: hot, cold, standby; full, incremental
- what: facts, dimensions and reference data, dependent data marts
- when: before the DWH refresh, after it, or both before and after?
- recovery: structure, data export/import

Managing the Warehouse Hardware Architectures
- SMP - Symmetric Multiprocessing
- Cluster - processor cluster
- MPP - Massively Parallel Processing
- NUMA - Non-Uniform Memory Access
- Hybrids combine SMP and MPP

Managing the Warehouse Hardware Architectures II

Managing the Warehouse Hardware Architectures - SMP

Managing the Warehouse Hardware Architectures - SMP II

Managing the Warehouse Hardware Architectures - Clusters

Managing the Warehouse Hardware Architectures - Clusters II

Managing the Warehouse Hardware Architectures - NUMA

Managing the Warehouse Hardware Architectures - NUMA II

Managing the Warehouse Hardware Architectures - MPP

Managing the Warehouse Hardware Architectures - MPP II