Data Analysis and Data Mining with SAS Software


1 Data Analysis and Data Mining with SAS Software
SAS Enterprise Miner™. Reinhard Strüby, SAS Institute Heidelberg

2 Contents
Why? - Data mining and requirements
What? - Data mining definition
Who? - User groups
How? - Success factors for data mining
How? - The SEMMA process
SAS! - The SAS data mining solution
Data mining is the technology that drives today's major IT application: customer relationship management. Besides the Web, data mining is one of the hottest IT topics today. In this presentation we want to address the following questions:
- Why is data mining of importance today?
- What is it?
- Who is the target audience for data mining, and how can their needs be addressed? A related question is who would be the decision maker on a data mining sale.
- What is needed to make data mining a success?
The SAS solution covers all of this today.

3 ??? DATA MINING ?
This is the situation today: everybody talks about data mining, but despite the hype in the press very few understand it or know how to use it for business advantage. This is exactly the question we want to address.

4 The Business Task: "Know Your Customers!"
Who are they?
What do they want?
What contacts have there been so far?
How can a lasting relationship be established?
Which customers might leave us?
We are living in a time of worldwide competition. This means more opportunities, but also more potential problems. In the past, IT was mainly focused on getting the daily job done, that is, mastering the transactions in telco, banking, insurance, retail, etc. Nowadays the customer generating all of these transactions becomes the focus of interest. Knowledge about your customer is power and makes you more successful.
- Who are they? Customer segmentation. Do I have different customer groups with different needs? Can I differentiate "good" and "bad" customers?
- What are they like? Customer profiling. What attributes (e.g. demographics) make a good or bad customer?
- What interactions have already taken place? Can I build a model of customer behaviour from past data? This "training" is needed to generalise findings to new customers.
- How to build up a long-term relationship? Depending on the industry, it is 4-10 times more expensive to acquire a new customer than to keep an existing customer happy.
- Which customers might leave us? Churn management. A major concern, e.g. in the telco industry.

5 Data Mining - Why Now?
Increased competitive pressure
Lower storage costs and higher computing speeds
Data warehouses often already in place
Hidden information in large files
Data mining methods find patterns
GUI data mining applications
Customer pressure for change
Increase ROI
The increased competitive pressure is the major business reason for the popularity of data mining. Some IT infrastructure came into play to make data mining more accessible. A first step is the ever-decreasing cost of storing large amounts of data and processing it with ever-faster CPUs. The larger amounts of data (including the Web) need proper ways of storing and managing them for business advantage: the data warehouse. Traditional reporting on large amounts of data gives consistent results, but we cannot be sure that we are asking the right questions. The question "which customers should we concentrate on in future?" might be more relevant than the traditional "which products in what area generated the most profit last year?" Data mining is exactly the technology that looks for interesting patterns in customer behaviour which could be exploited for business advantage. Data mining was a specialist area until very recently. The widespread availability of user-friendly data mining packages in the last few years has made a big difference. Data mining is almost always directed towards a business question, e.g. how can I improve my return rate on direct mail? How can I find out which potential customer will turn into a bad debtor?

6 Data Mining Definition
Data mining is the process of selecting, exploring, and modelling large amounts of data in order to exploit previously unknown data patterns for a business advantage.
Data mining is defined in different ways depending on who you are talking to, ranging from artificial intelligence to statistical analysis to OLAP. So let's start with a basic definition of data mining. Note that data mining implies some use and acceptance of new methodologies to understand relationships in data. Data mining is a process of steps that we will look at in more detail later on. Implicit in this definition is the fact that some traditional techniques are well suited for data mining activities as well. Key to the definition of data mining is the idea of large data sets; many sample sizes are in the range of 50 gigabytes and up. Also implicit in the definition is the concept of a model. Determining which customer or process attributes are important and then building a predictive model is the key to successful data mining.

7 Data Mining Is a Process
Data mining is a process.
Data mining involves close cooperation between IT, business departments, and data miners.
Data mining is not restricted to particular industries or problems.
It is now generally accepted that data mining is a process of steps (META, IBM), and more insight into this process is needed. The next characteristic is that data mining, like data warehousing, involves a cooperation issue. Data warehousing needs the cooperation of IT and business. Data mining builds on top of data warehousing and additionally needs the expertise of "data miners": staff who are trained in analytical methods and who should be part of the data mining team (but not the only group involved!). Data mining can be used anywhere, as we will discuss on the next slides.

General: Customer Segmentation, Targeted/Cross Marketing, Pricing Analysis, Associations & Demography
Insurance & Health Care: Claim Analysis, Fraudulent Behavior
Banking: Credit Authorization, Credit Card Fraud Detection, Portfolio Analysis, Cash Planning
Telecommunications: Call Behaviour Analysis, Churn Management
Retail/Marketing: Market Basket Analysis, Database Marketing, Category Management
Production and Utilities: Process Management, Demand Patterns, Capacity Planning, Inventory Planning
When relating data mining applications to industries, we can differentiate between applications common to all industries ("general") and those specific to particular industries. The financial sector, telecoms and retail, in that order, are the industries with the highest demand for data mining solutions. Industry-specific data mining applications are often related to the particular operational business of that industry, e.g. banking: financial transactions; insurance: claim analysis; telco: call behaviour analysis. This means that there is a chance to get a foot in the door to own this business-critical data via data mining, with data warehousing as a possible follow-up activity.

Postbank N.V.: "50% response on first mailing paid for DM investment"
US West: "Reducing customer churn by any amount is 10 times cheaper than gaining a new customer"
ABN AMRO: "Interest earned on 40% reduction in cash in ATMs"
Neckermann Versand AG: "Increased number of good customers getting credit by 80 a day"
Gloucestershire Constabulary: "For the public, increased crime pattern identification and prevention is priceless"
Data mining is being used in many areas today, and the companies listed here have had very favourable experiences with it. Finally, all of these companies have something in common: they use the SAS System for data mining and they are happy to talk about it.

10 DATA MINING - Users
Heads of business departments / specialists
Data miners
Who is the user and who decides on data mining projects? As we will see, these groups are NOT identical. Users first. These are:
- Users in business departments, business analysts or knowledge workers, and their department heads.
- Data miners or quantitative experts, who depending on industry or company may sit in IT, marketing, or planning departments. They are not usually known under that generic name but may have titles such as marketing analyst, financial analyst, actuary or similar. Data miners are more common in the financial industry than in, e.g., retail.
Heads of business departments are often initiators of a data mining project and influence the buying decision, although they will rarely buy themselves (that is probably IT). Data miners are at best influencers, but they play an important role as users and as support for business staff.

11 Users: Business Department
Know the subject area
Understand the contents of the data
Look for information, but often have limited analytical skills
Often work in marketing departments as analysts
Business technologists are individuals who know their industry and selected business processes of their company very well. Moreover, they are familiar with the contents of the data reflecting their business processes, although they might not know where or how the data is stored. Business analysts need to analyze business-critical data beyond the level of standard reporting but are not professional statisticians. They are not familiar with the syntax and options of SAS System quantitative procedures or with statistical terminology. Their focus is on making more money for their organizations through the execution of specific business objectives. The business technologist requires a high-level interface to the data mining capabilities and expects to be able to apply the SEMMA methodology without having to be a statistician. The business technologist may also wish to do campaign management.

12 Users: DATA MINER
Quantitative experts: statistical/mathematical background or comparable knowledge
Somewhat isolated from business questions
Familiar with algorithms and the data analysis process
Common in the financial sector, otherwise rather rare
Data miners or quantitative experts may come from IT, marketing, or planning departments. As mentioned, they are fairly common in the financial industry but rarely found in telecoms or retail. They have a statistical or mathematical background and know a lot about analytical techniques: the traditional SAS/STAT user. They probably already build models for tasks such as customer profitability, credit scoring, or financial modelling and deliver these to department heads and finally to upper management, but they are not involved in the decision-making process. If there are no data miners at your prospect, this needs to be compensated for by SAS consulting or partners/QPs in order to exploit the SAS Data Mining Solution to its full potential.

Breakdown of the figures from the previous page. I have left out non-software categories such as data providers (Nielsen etc.) and systems integrators (HW suppliers, Big 6 etc.). With these omitted, META splits the data mining software providers into 5 categories:
- Vertical applications: providers of business functions such as churn management, supply chain optimisation etc. for particular industries.
- Horizontal applications (cross-industry), such as marketing and sales tracking, customer service orientation.
- Macro mining: basic toolkits to provide custom data mining solutions.
- Micro mining: these include Q&R, OLAP and pure statistical solutions.
- Data visualization, as part of statistics or GIS.
With the exception of vertical applications, the SAS System is mentioned in all categories as one of the leading contenders, if not the leading one. And we still have room to grow into the profitable vertical applications area.
Source: META Group, Data Mining Market Trends. * SAS System mentioned in this category.

14 THIRD GENERATION DATA MINING - Integrated
[Diagram: Data Warehousing / Data Mining integrated. Labels: Corporate OLTP Systems; VSAM; IMS; DB2; Demographic Data; Lifestyle and Behaviour Data; Industry Data; Data Warehouses; Data Marts; Data Mining; DSS, EIS; OLAP; IT; Data Miners; Business Depts.; Mkt Analysts; Exec Mgmt]
The market is now looking for integrated solutions such as the one in the diagram above. First-generation data mining involved outsourcing data mining to a service provider, with turnaround times of weeks, unsatisfactory support and, most often, results in paper form with no way to integrate them into corporate IT. Second-generation data mining already included in-house data mining, but in the form of stand-alone data mining packages which often had serious deficiencies in reading all sorts of corporate data directly. For example, neural network packages often could not even deal with character data and needed flat-file input. A much more desirable situation is the one above, where data storage and management is done in a data warehouse environment and extracts are made at a departmental level as data marts, on which data mining can be pursued easily and painlessly. Functions are clearly separated: IT manages the operational data and the data warehouse, and data miners as well as business staff have access to both the data marts and the data to be mined. Furthermore, anyone, including executive management, is in a position to consume the results of data mining via customised reporting.

15 Success Factors
Access to all data sources - data warehousing
Scalability: HW / SW
Broad spectrum of DM methods: focus on business problems
Implementation strategy
These are the 4 basic criteria for a successful data mining project.
1. Esoteric discussions on data mining methodology often miss the most important aspect: that ALL available data, no matter what source or format, can be read immediately for use in data mining. This is best done using data warehousing, which also covers Web data as a source.
2. Data mining is typically done on A LOT of data: gigabytes, sometimes even terabytes. Data mining solutions therefore need to be scalable to cover this amount of data. Stand-alone PC packages quickly run out of steam. Data mining typically requires a client-server setup.
3. There is quite a discussion on which data mining technique gives the best result. Again this discussion misses an important point: what fits best depends on the business question and on the data, so the more data mining methods are available, the more flexibly the solution can be tailored to the problem. This way we can concentrate on the business problem rather than being forced to twist the data or the problem to fit the technique at hand.
4. Do not forget the user. Key to the success of data mining is its acceptance by the user. The SEMMA methodology has proven to be a practical approach for data mining.

16 Comparison: OLAP versus Data Mining
OLAP (report writing): user-driven reporting, dimensions known. Example: best-selling product in 1997 in region X?
Data mining (methodology): data-driven exploration, search for dimensions. Example: which customers should we concentrate on?
Within data warehousing, OLAP is another popular technique for data exploitation. OLAP vendors have tried in the past to position OLAP as a way to do data mining. This is far from the truth. OLAP is reporting on data along known and defined dimensions. A typical question would be: tell me the best-selling product in the year 1966 for Europe. OLAP is a top-down approach: ask an exact question where all dimensions of the question are well defined. Data mining is exactly the opposite. Data mining is a bottom-up strategy: starting from the data, what are the interesting patterns which might lead to a business-relevant question? Another way to put it: data mining is an exploration for the relevant questions, such as: which are the customers we should be concentrating on? OLAP and data mining are not alternatives but work together well. Data mining finds the relevant questions, and OLAP is very well suited to tracking the answer at any point in time.
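To make the top-down nature of an OLAP question concrete, here is a minimal Python sketch. It assumes a small in-memory list of sales rows; the field names (`year`, `region`, `product`, `units`) and all figures are invented for illustration and are not taken from the presentation. The point is only that every dimension of the question is fixed in advance, so answering it reduces to a filter plus an aggregation.

```python
from collections import defaultdict

# Hypothetical sales records; names and numbers are illustrative only.
SALES = [
    {"year": 1997, "region": "X", "product": "A", "units": 120},
    {"year": 1997, "region": "X", "product": "B", "units": 200},
    {"year": 1997, "region": "Y", "product": "A", "units": 500},
    {"year": 1996, "region": "X", "product": "A", "units": 900},
    {"year": 1997, "region": "X", "product": "A", "units": 50},
]


def best_selling(sales, year, region):
    """Return the product with the most units for a fixed year and region."""
    totals = defaultdict(int)
    for row in sales:
        # The dimensions (year, region) are known before the query is run.
        if row["year"] == year and row["region"] == region:
            totals[row["product"]] += row["units"]
    return max(totals, key=totals.get) if totals else None


print(best_selling(SALES, 1997, "X"))
```

A data mining question such as "which customers should we concentrate on?" has no such predefined filter, which is why it needs the exploratory process described on the following slides.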

17 SAS DATA MINING SOLUTION Data Mining, IT and Business
[Diagram labels: Identify Problem; Transform Data into Information; Act on Business Question; Measure Results; Data Warehouse DBMS; Data Mining Processing; EIS, Business Reporting, Graphics]
The "virtuous cycle of data mining" (Berry, Linoff: Data Mining Techniques for Marketing, Sales and Customer Support, Wiley 1997) gives a good picture of how well data mining integrates with business and IT strategies. The first phase is to identify a business opportunity, for example that a sizeable number of profitable customers are leaving a bank. Business problems should be reflected in the corporate database. The data warehouse works as the corporate memory, and we assume that it holds information about the customers who have left. In the next step, data mining, we would like to gain actionable results, such as the profile of those leaving our bank as opposed to those who stay with us. Let's suppose we have identified a major portion of the leavers as, on average, having multiple accounts, often using ATMs of other banks, and often trying to call us on business matters but sometimes having to wait for a response. We would like to act on this information. If we could clearly identify the group of leavers mentioned above, then we could provide them with an additional telephone number with a higher service priority. We follow up using standard reporting (EIS systems or the like), because we would like to measure the results. It may well be that this measure yields the desired result of reducing customer churn. If so, we have to redefine the business opportunity and look for further ways to reduce churn, or we may attack a completely new business opportunity such as introducing new financial services for customer groups with different needs. The virtuous cycle of data mining starts again.

18 SEMMA
Sample: Sampling?
Explore: Visual Exploration, Data Reduction
Manipulate: Grouping, Subsetting, Transform
Model: Neural Networks, Decision Trees, Statistical Techniques, Associations, Sequences
Assess: Model Comparison, New Questions
Within the data mining stage (see previous slide) SAS Institute offers SEMMA, a refined data mining methodology which has been welcomed by our early adopters of data mining.
Sample: Data mining might involve many attempts to find patterns, with multiple passes over the data. Taking a random sample of the data is very appropriate to save execution time. Once we have found a reasonable explanation for an opportunity such as customer profitability, we would of course score the whole data set, that is, each customer, for profitability. One word of caution: when in competition with IBM, you might find that they wrongly claim that sampling indicates we cannot deal with larger data sizes. Not true at all: sampling is just efficient, not required.
Explore: Before any modelling is done, it is a good idea to survey the data graphically or statistically for significant features, outliers, groupings and other obvious major characteristics. Exploration of the data gives us a good idea of what we should look for at the modelling stage.
Manipulate: Insights gained in exploration should be reflected in the data in a data manipulation stage. Maybe we can concentrate on a set of customer characteristics which have proven to be interesting; certain groups with different behaviour may have been identified, and the group information should be added; missing values need to be dealt with; and so on.
Model: All the information gained so far can now be used to sensibly model customer behaviour with the appropriate technique. We offer a wide range of data mining techniques to fit any business question.
Assess: The final step, of most interest to the business user, is the assessment. How well does the model explain customer behaviour? If we have several alternative explanations, which one is the best in terms of associated profits? Results of the assessment should be presented in a business-like and graphical way to meet the needs of the business user.

19 SAMPLING ?
Recommended, not a prerequisite: no content is lost.
Considerable performance advantages
Model checking: training, testing, validation samples
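The split into training, validation and testing samples mentioned on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Enterprise Miner implementation: the function name `partition`, the 60/20/20 proportions and the fixed seed are all assumptions chosen for the example.

```python
import random


def partition(records, fractions=(0.6, 0.2, 0.2), seed=42):
    """Randomly split records into training, validation and test samples.

    The fractions and the seed are illustrative assumptions.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)               # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # reproducible random order
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_valid = round(n * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains
    return train, valid, test


customer_ids = list(range(1000))           # hypothetical customer keys
train, valid, test = partition(customer_ids)
print(len(train), len(valid), len(test))
```

Every record lands in exactly one sample, so a model fitted on the training part can be tuned on the validation part and judged on records it has never seen.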

20 EXPLORATION
Detecting outliers, groups, associations ...
Visual exploration: 3-D charts, graphical data analysis, GIS
Analytical exploration: cluster analysis, correspondence analysis, PCA, factor analysis, MDS …
Which questions should be asked?
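As one simple instance of analytical exploration, the sketch below flags outliers with the common interquartile-range rule. The slide does not prescribe a particular method; the 1.5 × IQR threshold, the function name and the sample values are assumptions made for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical number of calls per customer in one month.
monthly_calls = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(monthly_calls))
```

Whether a flagged value is an error to be filtered out or an interesting customer to be studied is a business decision that the exploration step only prepares.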

21 DATA MANIPULATION
Which variables are essential?
Missing values?
Variable transformation?
Adding new information: groups, labels etc.
Which information should I work with?
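Two of the manipulation questions on this slide, missing values and variable transformation, can be illustrated with a short Python sketch. Mean imputation and a log transform are merely common choices picked for the example; the slide does not say which techniques are used, and the function names and income figures are hypothetical.

```python
import math
import statistics


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def log_transform(values):
    """Compress a right-skewed, non-negative variable with log(1 + x)."""
    return [math.log1p(v) for v in values]


# Hypothetical yearly incomes with one missing value.
incomes = [30000.0, None, 50000.0]
filled = impute_mean(incomes)
print(filled)
print(log_transform(filled))
```

Imputing before modelling keeps the customer record in the analysis; the transform keeps a handful of very large values from dominating a model.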

22 MODELLING
NNs
Statistical
Tree-based Methods
Time Series
What form do my data have? ...

23 Assessment: How Good Is My Model?
Explanatory contribution of the variables, outliers
Assessment - Scoring
Classification
Lift charts
Generalization to other data
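The numbers behind a lift chart can be computed as follows. This is a hedged sketch of the general technique, not of Enterprise Miner's Assessment node: customers are ranked by model score, cut into equal-sized bins, and each bin's response rate is divided by the overall response rate. The function name, the bin count and the toy data are assumptions.

```python
def lift_table(scores, outcomes, bins=10):
    """Return (bin number, lift) pairs, best-scored bin first.

    scores   -- model scores, higher means more likely to respond
    outcomes -- 1 if the customer actually responded, else 0
    """
    base_rate = sum(outcomes) / len(outcomes)
    if base_rate == 0:
        raise ValueError("no positive outcomes, lift is undefined")
    ranked = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    table = []
    for b in range(bins):
        chunk = ranked[b * n // bins:(b + 1) * n // bins]
        if not chunk:
            continue  # more bins than records
        rate = sum(outcome for _, outcome in chunk) / len(chunk)
        table.append((b + 1, rate / base_rate))
    return table


# Toy data: ten customers, the three responders have the highest scores.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
for bin_no, lift in lift_table(scores, outcomes, bins=5):
    print(bin_no, round(lift, 2))
```

A lift well above 1 in the first bins means that mailing only the top-scored customers reaches far more responders than a random mailing of the same size.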

24 SAS Data Mining Solution Currently (Feb 98)
Data Warehousing incl. Web Technology
Analytical Solutions: NNA - Production on Win, OS/2 and all major UNIX, ORLANDO I and II; Tree Menue System; Exploration: INSIGHT, SPECTRAVIEW, GIS; Statistics; Time Series Forecasting; Market Research Methods
EIS, Enterprise Reporter, Graphics
SAS Institute offered solutions for data mining before the term was fashionable. These have always consisted of three key ingredients:
- Solutions for data access and management, now with the Warehouse Administrator and all-new technology for parallel access to data (SPDS) and access to Web data (SAS/IntrNet).
- The multitude of analytical solutions SAS Institute has always been known for; the latest enhancements for data mining were the SAS Neural Network Application and the Tree Menue System.
- End-user reporting tools such as the Enterprise Reporter.
Together these solutions make up the SAS System for Data Mining, which according to the META Group and the Two Crows Corporation is the market leader for data mining in terms of customer usage (see later slides).

25 New SAS DM Solution: SAS Enterprise Miner™
A unified and fully scalable business solution for data mining.
Fills the gap between data warehousing and end-user reporting.
The GUI provides a user-friendly front end for the SEMMA process.

26 SAS ENTERPRISE MINER Benefits for the Users
IT: DW access, scalability
Business users: intuitive interface and focus on business questions
Data miners: analytical depth and flexibility

Projects/models in a Win95-style hierarchy structure
SEMMA process in process flow diagrams
Existing SAS programs and applications can be integrated easily.
All functionality of the SAS Enterprise Miner, such as the DMDB and all analytical tools, is available exclusively in this data mining solution.
SEMMA is an acronym for a data mining process. Data mining is part of a larger process: 1) begin by generating ideas and hypotheses, 2) validate ideas based on patterns in the data, 3) transform them into actionable segments, 4) measure the results. Note that SEMMA is preceded by a data warehousing or data mart strategy and followed by the delivery of the information to the "downstream" consumers of the data mining work. The downstream users are the ones who execute the marketing/sales campaign around the results of the data mining activities. The workflow of a campaign includes: 1) list pulling from a scored file, 2) campaign analysis, 3) campaign tracking, 4) campaign profitability, and 5) history. Campaigns are launched as outreach efforts (such as mailings, telemarketing, and web activities) to the list of customers identified from the list pull. The results of the campaign are then passed back to the data mining environment as input into the next pass.
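The first campaign step, pulling a list from a scored file, can be sketched as below. The record layout (`customer_id`, `score`) and the function name `pull_list` are hypothetical; the notes do not describe the actual file format, so this only illustrates the idea of selecting the best-scored customers for an outreach effort.

```python
def pull_list(scored_customers, top_n):
    """Return the ids of the top_n customers with the highest model score."""
    ranked = sorted(scored_customers, key=lambda r: r["score"], reverse=True)
    return [r["customer_id"] for r in ranked[:top_n]]


# Hypothetical output of a scoring run over the whole customer base.
scored = [
    {"customer_id": "C1", "score": 0.12},
    {"customer_id": "C2", "score": 0.87},
    {"customer_id": "C3", "score": 0.55},
    {"customer_id": "C4", "score": 0.91},
]
print(pull_list(scored, top_n=2))
```

The responses to the resulting mailing would then be recorded against these ids and fed back as input for the next modelling pass.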

28 ENTERPRISE MINER User Interface
3 main windows: Projects, Data Mining Workspace, Tools Palette

29 ENTERPRISE MINER Project Window
Start: double-click the EM icon
Window of available projects
Mouse control
Pull-down menus: File, Edit, View, Insert, Globals, Options, Help
Toolbar: Up one level, Delete, Properties, Help
Pop-up menu: Open, Rename, Delete, Properties
Projects: Create, Open, Save, Run, Close, Delete

30 ENTERPRISE MINER Other Windows
Data Mining Window (DMW): default: open. Build, edit, run process flow diagrams (PFDs).
Tools Window: tool palette, covers EM functionality. Drag and drop tools onto the DMW.
Message Window: default: closed. Messages generated when creating/running PFDs.

31 ENTERPRISE MINER Process Flow Diagrams

32 ENTERPRISE MINER DM Workspace Window
Toolbar: Open, Save, Cut, Copy, Paste, Undo, Help
Pull-down menu: File, Edit, View, Actions, Globals, Options, Windows, Help
Pop-up menu: Add node, add endpoints, paste, undelete, select all, create subdiagram, refresh, up one level, top level, connect items, move and connect items
Add nodes: drag and drop icons or use the pop-up menu
Connect, cut, delete nodes
PFD logic: tools loosely organized according to SEMMA

33 ENTERPRISE MINER Functionality
Data: Input Data Source, Random Sample, Partition, DMDB
Explore/Modify: Transform Data, Filter Outliers, Bar Chart, INSIGHT, Clustering, Variable Selection
Modelling: DM Regression, Neural Networks, Tree Models, Associations
Assessment: Scoring, Assessment
Utilities: Group Processing, Data Replacement, SAS Code Node, Administrator, Nodes Manager, Control Points, Subdiagrams

34 Rules for the Nodes
The Input Data Source node comes first in a PFD.
Sampling comes after Input, followed by any exploration, modification or modelling.
At any position: Filter Outliers, Transform, Bar Chart
After Cluster: Filter Outliers, Transform, Bar Chart, or modelling nodes
An Assessment node must be preceded by a modelling node.
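Some of the ordering rules on this slide can be expressed as a small checker. The sketch below is an illustration only: it assumes a purely linear flow given as a list of node names, the node identifiers are invented (they are not Enterprise Miner's internal names), and it reads "sampling after Input" as "directly after the Input Data Source node", which is one possible interpretation. The rule for nodes following Cluster is not covered.

```python
# Hypothetical identifiers for the modelling nodes.
MODELLING_NODES = {"regression", "neural_network", "tree", "associations"}


def check_flow(nodes):
    """Return a list of rule violations for a linear process flow."""
    problems = []
    if not nodes or nodes[0] != "input_data_source":
        problems.append("flow must start with input_data_source")
    if "sample" in nodes and nodes.index("sample") != 1:
        problems.append("sample must directly follow input_data_source")
    for i, node in enumerate(nodes):
        # An assessment needs at least one modelling node earlier in the flow.
        if node == "assessment" and not MODELLING_NODES & set(nodes[:i]):
            problems.append("assessment must be preceded by a modelling node")
    return problems


good = ["input_data_source", "sample", "transform", "tree", "assessment"]
bad = ["input_data_source", "assessment"]
print(check_flow(good))
print(check_flow(bad))
```

A real process flow diagram is a graph with branches rather than a list, so a fuller check would walk each node's predecessors instead of a list prefix.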

35 Uniform Appearance of the Nodes
Tabbed dialogs
Data dialog
Variable dialogs
Notes dialog (some nodes): browser for results

[Diagram: SAS Enterprise Miner Data Mining Database]
DMDB: Data + Metadata
Data sources: DBMS, Data Warehouse, Other Data
Sampling: Random, Stratified
DMINE: Numerical Exploration
DMREG: (Logistic) Regression
NEURAL: Neural Networks
SPLIT: CHAID/CART
Other labels: Graphical; Factor, Discrim ...; Reporting, EIS; Assessment; Comparison

37 SAS ENTERPRISE MINER System Requirements
Pentium PC
Windows NT 4.0+ or Win 95
250 MB+ free disk space
CD-ROM drive

Client-server solution:
Clients: Win 95, Win NT
Servers: Win NT, all major UNIX
Mainframe as data server, later also as compute server
Beta: only Win95, Win NT initially. Unix: AIX, HP-UX, Solaris

About 100 EM beta applications in the USA
About 60 EM beta testers in Europe

40 Summary
SAS Enterprise Miner:
Models data mining as a process
Enables cooperation between IT, business and data miners
Complete SEMMA implementation
Integration of DW, DM and reporting
Competitive advantage through data mining
