Presentation on the topic "Data Analysis and Data Mining with SAS Software". Presentation transcript:
1 Data Analysis and Data Mining with SAS Software. SAS Enterprise Miner™. Reinhard Strüby, SAS Institute Heidelberg
2 Contents. Why? - Data mining and requirements. What? - Data mining definition. Who? - User groups. How? - Success factors for data mining. How? - The SEMMA process. SAS! - The SAS data mining solution.
Data mining is the technology that drives the major IT application of today: Customer Relationship Management. Besides the Web, data mining is one of the hottest IT topics today. In this presentation we want to address the following questions:
- Why is data mining important today?
- What is it?
- Who is the target audience for data mining, and how do we address their needs? A related question is who would be the decision maker on a data mining sale.
- What is needed to make data mining a success?
- The SAS solution covers all of this today.
3 ??? DATA MINING ? This is the situation today: Everybody talks about data mining, but despite the hype in the press very few understand it and know how to use it for business advantage. This is exactly the question which we want to address.
4 The business task: “Know your customers!” Who are they? What do they want? What contacts have there been so far? How can a lasting relationship be established? Which customers might leave us?
We are living in a time of world-wide competition. This means more opportunities, but also more potential problems. In the past, IT was mainly focused on getting the daily job done, that is, mastering the transactions in telco, banking, insurance, retail etc. Nowadays the customer generating all of these transactions becomes the focus of interest. Knowledge about your customers is power and makes you more successful.
- Who are they? Customer segmentation. Do I have different customer groups with different needs? Can I differentiate “good” and “bad” customers?
- What are they like? Customer profiling. What attributes (e.g. demographics) make a good or bad customer?
- What interactions have already taken place? Can I build a model of customer behaviour from past data? This “training” is needed to generalise findings to new customers.
- How to build up a long-term relationship? Depending on the industry it is 4-10 times more expensive to acquire a new customer than to make an old customer happy.
- Which customers might leave us? Churn management. A major concern, e.g. in the telco industry.
5 Data mining - why now? Increased competitive pressure. Lower storage costs and higher processing speeds. Data warehouses often already in place. Hidden information in large files. Data mining methods find patterns. GUI data mining applications. Customer pressure for change. Increase ROI.
The increased competitive pressure is the major business reason for the popularity of data mining. Some IT infrastructure came into play to make data mining more accessible. A first step is the ever-decreasing cost of storing large amounts of data and processing it with ever faster CPUs. The larger amounts of data (incl. the Web) need proper ways of storing and managing them for business advantage: the data warehouse. Traditional reporting on large amounts of data gives consistent results, but we cannot be sure that we are asking the right questions. The question “which customers should we concentrate on in future?” might be more relevant than the traditional “which products in what area generated the most profit in the last year?” Data mining is exactly the technology that looks for interesting patterns in customer behaviour which could be exploited for business advantage. Data mining was a specialist area until very recently. The widespread availability of user-friendly data mining packages in the last few years has made a big difference. Data mining is almost always directed towards a business question, e.g. how can I improve the return rate on my direct mail? How can I find out which potential customer will turn into a bad debtor?
6 Data mining definition: Data mining is the process of selecting, explaining and modelling large amounts of data in order to use previously unknown data patterns for a business advantage.
Data mining is defined in different ways depending on whom you are talking to, ranging from artificial intelligence to statistical analysis to OLAP. So let’s start with a basic definition of data mining. Note that data mining implies some use and acceptance of new methodologies to understand relationships in data. Data mining is a process of steps, which we will look at in more detail later on. Implicit in this definition is the fact that some traditional techniques are well suited to data mining activities as well. Key to the definition of data mining is the idea of large data sets; many sample sizes are in the range of 50 gigabytes and up. Also implicit in the definition is the concept of a model. Determining which customer/process attributes are important and then building a predictive model is the key to successful data mining.
7 Data mining is a process. Data mining involves close cooperation between IT, business departments and data miners. Data mining is not restricted to particular industries or problems.
It is now generally accepted that data mining is a process of steps (META, IBM), and more insight into this process is needed. The next characteristic is that data mining, like data warehousing, involves a cooperation issue. Data warehousing needs the cooperation of IT and business. Data mining builds on top of data warehousing and in addition needs the expertise of “data miners”, staff who are trained in analytical methods and who should be part of the data mining team (but not the only group involved!). Data mining can be used anywhere, as we will discuss on the next slides.
8 DATA MINING - INDUSTRIES
General: Customer Segmentation, Targeted/Cross Marketing, Pricing Analysis, Associations & Demography
Insurance & Health Care: Claim Analysis, Fraudulent Behavior
Banking: Credit Authorization, Credit Card Fraud Detection, Portfolio Analysis, Cash Planning
Telecommunications: Call Behaviour Analysis, Churn Management
Retail/Marketing: Market Basket Analysis, Database Marketing, Category Management
Production and Utilities: Process Management, Demand Patterns, Capacity Planning, Inventory Planning
Data mining applications can be divided into those common to all industries (“general”) and those specific to particular industries. The financial sector, telecoms and retail, in that order, are the industries with the highest demand for data mining solutions. Industry-specific data mining applications are often related to the particular operational business of that industry, e.g. banking: financial transactions; insurance: claim analysis; telco: call behaviour analysis. This means that there is a chance to get a foot in the door and own this business-critical data via data mining, with data warehousing as a possible follow-up activity.
9 IS DATA MINING IMPORTANT?
Postbank N.V.: “50% response on first mailing paid for DM investment”
US West: “Reducing customer churn by any amount is 10 times cheaper than gaining a new customer”
ABN AMRO: “Interest earned on 40% reduction in cash in ATMs”
Neckermann Versand AG: “Increased number of good customers getting credit by 80 a day”
Gloucestershire Constabulary: “For the public, increased crime pattern identification and prevention is priceless”
Data mining is being used in many areas today, and the companies listed here have had very favourable experiences with it. Finally, all of these companies have something in common: they use the SAS System for data mining and they are happy to talk about it.
10 DATA MINING - Users: heads of business departments / specialists; data miners.
Who is the user and who decides on data mining projects? As we will see, these groups are NOT identical. Users first. These are:
- Users in business departments, business analysts or knowledge workers, and their department heads.
- Data miners or quantitative experts, who depending on industry or company may sit in IT, marketing, or planning departments. They are not usually known under that generic name but may have titles such as marketing analyst, financial analyst, actuary or similar. Data miners are more common in the financial industry than in, e.g., retail.
Heads of business departments are often the initiators of a data mining project and influence the buying decision, although they will rarely buy themselves (IT probably will). Data miners are at best influencers but play an important role as users and as support for business staff.
11 Users: business department. Know the subject area. Understand the content of the data. Look for information, but often have limited analytical skills. Often work in marketing departments as analysts.
Business technologists are individuals who know their industry and selected business processes of their company very well. Moreover, they are familiar with the contents of the data reflecting their business processes, although they might not know where or how it is stored. Business analysts need to analyze business-critical data beyond the level of standard reporting but are not professional statisticians. They are not familiar with the syntax and options of SAS System quantitative procedures or with statistical terminology. Their focus is on making more money for their organizations through the execution of specific business objectives. The business technologist requires a high-level interface to the data mining capabilities and expects to be able to apply the SEMMA methodology without having to be a statistician. The business technologist may also wish to do campaign management.
12 Users: DATA MINER. Quantitative experts: statistical/mathematical background or comparable skills. Somewhat isolated from business questions. Familiar with algorithms and the data analysis process. Common in the financial sector, otherwise rather rare.
Data miners or quantitative experts may come from IT, marketing, or planning departments. As mentioned, they are fairly common in the financial industry but rarely found in telecoms or retail. They have a statistical or mathematical background and know a lot about analytical techniques: the traditional SAS/STAT user. They probably already build models for tasks such as customer profitability, credit scoring, or financial modelling and deliver these to department heads and finally to upper management, but they are not involved in the decision-making process. If there are no data miners at your prospect, this needs to be compensated for by SAS consulting or partners/QPs in order to exploit the SAS Data Mining Solution to its full potential.
13 THE DATA MINING MARKET - IN MILLION $
Breakdown of figures from the previous page. I have left off non-software-related categories such as data providers (Nielsen etc.) and systems integrators (HW suppliers, big 6 etc.). With these omitted, Meta splits the data mining software providers into 5 different categories:
- Vertical applications, which provide business functions such as churn management, supply chain optimisation etc. for particular industries.
- Horizontal (cross-industry) applications such as marketing and sales tracking, customer service orientation.
- Macro mining. These are basic toolkits to provide custom data mining solutions.
- Micro mining. These include Q&R, OLAP and pure statistical solutions.
- Data visualization, as part of statistics or GIS.
With the exception of vertical applications, the SAS System is mentioned in all categories as one of the leading contenders, if not the leading one. And we still have room to grow into the profitable vertical applications area.
Source: META Group, Data Mining Market Trends. * SAS System mentioned in this category.
14 THIRD GENERATION DATA MINING - Integrated. Data Warehousing / Data Mining integrated.
Diagram labels: Corporate OLTP Systems; VSAM; IMS; DB2; Demographic Data; Lifestyle and Behaviour Data; Industry Data; Data Warehouses; Data Marts; Data Mining; DSS, EIS; OLAP; IT; Data Miners; Business Depts.; Mkt Analysts; Exec Mgmt.
The market is now looking for integrated solutions such as the one in the diagram above. First generation data mining involved outsourcing data mining to a service provider, with turnaround times of weeks, unsatisfactory support and, most often, results in paper form with no way to integrate them into corporate IT. Second generation data mining already included in-house data mining, but in the form of stand-alone data mining packages which often had serious deficiencies in reading all sorts of corporate data directly. For example, neural network packages often could not even deal with character data and needed flat-file input. A much more desirable situation is the one above, where data storage and management is done in a data warehouse environment and extracts are made at a departmental level as data marts, on which data mining can be pursued easily and painlessly. Functions are clearly separated: IT manages the operational data and the data warehouse, and data miners as well as business staff have access to both the data marts and the data to be mined. Furthermore, anyone including executive management is in a position to consume the results of data mining via customised reporting.
15 Success factors. Access to all data sources - data warehousing. Scalability: HW / SW. Broad spectrum of DM methods: focus on business problems. Implementation strategy.
These are the 4 basic criteria for a successful data mining project.
1. Esoteric discussions on data mining methodology often miss the most important aspect: that ALL available data, no matter what source or format, can be read immediately for use in data mining. This is best done using data warehousing, also covering Web data as a source.
2. Data mining is typically done on A LOT of data: gigabytes, sometimes even terabytes. Data mining solutions therefore need to be scalable to cover this amount of data. Stand-alone PC packages quickly run out of steam. Data mining typically requires a client-server setup.
3. There is quite a discussion on which data mining technique gives the best result. Again this discussion misses an important point: what fits best depends on the business question and on the data, so the more data mining methods are available, the more flexibly the solution can be tailored to the problem. This way we can concentrate on the business problem rather than being forced to twist the data or the problem to fit the technique at hand.
4. Do not forget the user. Key to the success of data mining is its acceptance by the user. The SEMMA methodology has proven to be a practical approach to data mining.
16 Comparison: OLAP versus data mining.
Report writing. Methodology: user-driven reporting, dimensions known. Example: the best-selling product in the year 1997 in region X?
Data mining. Methodology: data-driven exploration, search for dimensions. Example: which customers should we concentrate on?
Within data warehousing, OLAP is another popular technique for data exploitation. OLAP vendors have tried in the past to position OLAP as a way to do data mining. This is far from the truth. OLAP is reporting on data along known and defined dimensions. A typical question would be: tell me the best-selling product in the year 1966 for Europe. OLAP is a top-down approach: ask an exact question where all dimensions of the question are well defined. Data mining is exactly the opposite. Data mining is a bottom-up strategy: starting from the data, what are the interesting patterns which might lead to a business-relevant question? Another way to put it: data mining is an exploration for the relevant questions, such as: which are the customers we should be concentrating on? OLAP and data mining are not alternatives but work together well. Data mining finds the relevant questions, and OLAP is very well suited to tracking the answer at any point in time.
17 SAS DATA MINING SOLUTION: Data Mining, IT and Business.
Diagram labels: Identify Problem; Transform Data into Information; Act on Business Question; Measure Results; Data Warehouse DBMS; Data Mining Processing; EIS, Business Reporting, Graphics.
The “virtuous cycle of data mining” (Berry, Linoff: Data Mining Techniques for Marketing, Sales and Customer Support, Wiley 1997) gives a good picture of how well data mining integrates with business and IT strategies.
The first phase is to identify a business opportunity, for example that a sizeable number of profitable customers are leaving a bank.
Business problems should be reflected in the corporate database. The data warehouse works as the corporate memory, and we assume that it holds information about the customers who have left.
In the next step, data mining, we would like to gain actionable results, such as the profile of those leaving our bank as opposed to those who stay with us. Let’s suppose we have identified a major portion of the leavers as, on average, having multiple accounts, often using ATMs of other banks, and often trying to call us on business matters but sometimes having to wait for a response.
We would like to act on this information. If we could clearly identify the group of leavers mentioned above, we could provide them with an additional telephone number with a higher service priority.
We would like to measure the results, following up with standard reporting (EIS systems or the like). It may well be that this measure yields the desired result of reducing customer churn. If so, we have to redefine the business opportunity: look for further ways to reduce churn, or attack a completely new business opportunity such as introducing new financial services for customer groups with different needs. The virtuous cycle of data mining starts again.
18 SEMMA: Sample, Explore, Manipulate, Model, Assess.
Diagram labels: Sampling?; Visual Exploration; Data Reduction; Grouping, Subsetting; Transform; Neural Networks; Decision Trees; Statistical Techniques; Associations, Sequences; Model Comparison, New Questions.
Within the data mining stage (see the previous slide) SAS Institute offers SEMMA, a refined data mining methodology which has been welcomed by our early adopters of data mining.
Data mining might involve many attempts to find patterns, with multiple passes over the data. Taking a random sample of the data is very appropriate to save execution time. Once we have found a reasonable explanation for an opportunity such as customer profitability, we would of course score the whole data, that is, score each customer for profitability. One word of caution: when in competition with IBM you might find that they wrongly claim that sampling indicates we cannot deal with larger data sizes. Not true at all: sampling is just efficient, not required.
Before any modelling is done it is a good idea to survey the data graphically or statistically for significant features, outliers, groupings and other obvious major characteristics. Exploration of the data gives us a good idea of what we should look for at the modelling stage.
Insights gained in exploration should be reflected in the data in a data manipulation stage. Maybe we can concentrate on a set of customer characteristics which have proven to be interesting; certain groups with different behaviour may have been identified and the group information should be added; missing values need to be dealt with, and so on.
All the information gained so far can now be used to model customer behaviour sensibly with the appropriate technique. We offer a wide range of data mining techniques to fit any business question.
The final step, of most interest to the business user, is the assessment. How well does the model explain customer behaviour? If we have several alternative explanations, which one is the best in terms of associated profits?
Results of the assessment should be presented in a business-like and graphical way to meet the needs of the business user.
19 SAMPLING? Recommended, not a prerequisite: content is not lost. Considerable performance advantages. Model checking: training, testing and validation samples.
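The training, testing and validation samples mentioned on this slide can be illustrated with a small sketch. This is plain Python, not Enterprise Miner code; the function name `partition`, the 60/20/20 split and the fixed seed are assumptions chosen only for illustration.

```python
import random


def partition(records, fractions=(0.6, 0.2, 0.2), seed=42):
    """Randomly split records into training, validation and test samples.

    The fractions and the seed are illustrative assumptions, not values
    taken from the presentation.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # reproducible random order
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_valid = round(n * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains
    return train, valid, test


# Hypothetical customer ids standing in for rows of a data mart.
customers = list(range(1000))
train, valid, test = partition(customers)
print(len(train), len(valid), len(test))
```

Every record lands in exactly one of the three samples, so a model fitted on the training sample can be checked on data it has not seen.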
20 EXPLORATION: detecting outliers, groups, associations ... Visual exploration: 3-D charts, graphical data analysis, GIS. Analytical exploration: cluster analysis, correspondence analysis, PCA, Factor, MDS ... Which questions should be asked?
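As one possible illustration of analytical exploration for outliers, the sketch below flags values outside the common 1.5 x IQR fences. The slide does not name a specific outlier rule, so the IQR rule, the function name `find_outliers` and the sample values are all assumptions.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return the values lying outside the k * IQR fences.

    The IQR rule is a generic textbook technique used here for
    illustration; it is not taken from the presentation.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical monthly call counts for eight customers.
calls = [10, 12, 11, 13, 12, 11, 10, 95]
print(find_outliers(calls))
```

Values flagged this way are candidates for a closer look, not automatically errors; whether to filter them is a decision for the data manipulation stage.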
21 DATA MANIPULATION. Which variables are essential? Missing values? Variable transformation? Add new information: groups, labels etc. Which information should I work with?
22 MODELLING. What shape does my data have? ... NNs, Statistical, Tree-based Methods, Time Series.
23 Assessment - Scoring. Assessment: how good is my model? Explanatory contribution of the variables, outliers. Classification. Lift charts. Generalisation to other data.
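The numbers behind a lift chart can be computed with a short sketch: customers are ranked by model score, cut into equal-sized bins, and each bin's response rate is divided by the overall response rate. The function name `lift_by_bin`, the bin count and the sample data are assumptions for illustration, not Enterprise Miner output.

```python
def lift_by_bin(scores, outcomes, n_bins=10):
    """Return the lift of each score bin, best-scored bin first.

    scores   -- model scores, higher means more likely to respond
    outcomes -- 1 if the customer actually responded, else 0
    """
    if len(scores) != len(outcomes) or len(scores) < n_bins:
        raise ValueError("need matching inputs with at least n_bins records")
    overall_rate = sum(outcomes) / len(outcomes)
    if overall_rate == 0:
        raise ValueError("no responders, lift is undefined")
    # Rank customers from highest to lowest score.
    ranked = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    lifts = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        rate = sum(outcome for _, outcome in chunk) / len(chunk)
        lifts.append(rate / overall_rate)
    return lifts


# Hypothetical scores and responses for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
responded = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(lift_by_bin(scores, responded, n_bins=5))
```

A lift above 1 in the first bins means the model concentrates responders among the highest-scored customers, which is what makes a targeted mailing cheaper than a random one.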
24 SAS Data Mining Solution, currently (Feb 98). Data warehousing incl. Web technology. Analytical solutions: NNA - production on Win, OS/2 and all major UNIX, ORLANDO I and II; Tree Menu System; exploration: INSIGHT, SPECTRAVIEW, GIS; statistics; time series forecasting; market research methods. EIS, Enterprise Reporter, graphics.
SAS Institute offered solutions for data mining before the term was fashionable. These have always consisted of three key ingredients:
- Solutions for data access and management, now with the Warehouse Administrator and all-new technology for parallel access to data (SPDS) and access to Web data (SAS/IntrNet).
- The multitude of analytical solutions SAS Institute has always been known for; the latest enhancements for data mining were the SAS Neural Network Application and the Tree Menu System.
- End-user reporting tools such as the Enterprise Reporter.
Together these solutions make up the SAS System for Data Mining, which according to the Meta Group and the Two Crows Corporation is the market leader for data mining in terms of customer usage (see later slides).
25 New SAS DM solution: SAS Enterprise Miner™. A unified and fully scalable business solution for data mining. Fills the gap between data warehousing and end-user reporting. The GUI provides a user-friendly front end for the SEMMA process.
26 SAS ENTERPRISE MINER: benefits for the users. IT: DW access, scalability. Business users: intuitive interface and focus on the business questions. Data miners: analytical depth and flexibility.
27 SAS ENTERPRISE MINER environment. Projects/models in a Win95 hierarchy structure. SEMMA process in process flow diagrams. Existing SAS programs and applications can be integrated easily. All SAS Enterprise Miner functionality, such as the DMDB and all analytical tools, is available exclusively in this data mining solution.
SEMMA is an acronym for a data mining process. Data mining is part of a process: 1) begin by generating ideas and hypotheses, 2) validate ideas based on patterns in the data, 3) transform into actionable segments, 4) measure the results. Note that preceding SEMMA is a data warehousing or data mart strategy, and that following SEMMA is the delivery of this information to the “downstream” consumers of the data mining work. The downstream users are the ones who execute the marketing/sales campaign around the results of the data mining activities. The workflow of a campaign includes: 1) list pulling from a scored file, 2) campaign analysis, 3) campaign tracking, 4) campaign profitability, and 5) history. Campaigns are launched as outreach efforts (such as mailings, telemarketing, and web activities) to the list of customers identified from the list pull. The results of the campaign are then passed back to the data mining environment for input into the next pass.
28 ENTERPRISE MINER user interface. 3 main windows: Projects, Data Mining Workspace, Tools Palette.
29 ENTERPRISE MINER project window. Start: double-click the EM icon. Window of available projects. Mouse control. Pull-down menus: File, Edit, View, Insert, Globals, Options, Help. Toolbar: Up one level, Delete, Properties, Help. Pop-up menu: Open, Rename, Delete, Properties. Projects: Create, Open, Save, Run, Close, Delete.
30 ENTERPRISE MINER other windows. Data Mining Window (DMW): default: open; build, edit, run process flow diagrams. Tools Window: tool palette, covers EM functionality; drag and drop tools onto the DMW. Message Window: default: closed; messages generated when creating/running PFDs.
32 ENTERPRISE MINER DM Workspace window. Toolbar: Open, Save, Cut, Copy, Paste, Undo, Help. Pull-down menu: File, Edit, View, Actions, Globals, Options, Windows, Help. Pop-up menu: add node, add endpoints, paste, undelete, select all, create subdiagram, refresh, up one level, top level, connect items, move and connect items. Add nodes: drag and drop icons or use the pop-up menu. Connect, cut, delete nodes. PFD logic: tools loosely organized according to SEMMA.
33 ENTERPRISE MINER functionality. Data: Input Data Source, Random Sample, Partition, DMDB. Explore/Modify: Transform Data, Filter Outliers, Bar Chart, INSIGHT, Clustering, Variable Selection. Modelling: DM Regression, Neural Networks, Tree Models, Associations. Assessment: Scoring, Assessment. Utilities: Group Processing, Data Replacement, SAS Code Node, Administrator, Nodes Manager, Control Points, Subdiagrams.
34 Rules for the nodes. The Input Data Source node comes first in the PFD. Sampling follows input, then any exploration, modification or modelling. At any position: filter outliers, transform, bar chart. After clustering: filter outliers, transform, bar chart, or modelling. An assessment must be preceded by modelling.
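Some of these ordering rules can be expressed as a small checker for a linear process flow. This is only a sketch: the node names, the `check_flow` function and the reading that sampling sits directly after the input node are assumptions, and the rule about nodes following a clustering node is left out.

```python
# Hypothetical node names; real Enterprise Miner nodes are not addressed this way.
MODEL_NODES = {"regression", "neural_network", "tree", "associations"}


def check_flow(nodes):
    """Return a list of rule violations for a linear process flow diagram."""
    problems = []
    if not nodes or nodes[0] != "input_data_source":
        problems.append("the Input Data Source node must come first")
    # Assumed reading of the slide: sampling directly follows the input node.
    if "sampling" in nodes and nodes.index("sampling") != 1:
        problems.append("sampling must directly follow the input node")
    seen_model = False
    for node in nodes:
        if node in MODEL_NODES:
            seen_model = True
        elif node == "assessment" and not seen_model:
            problems.append("assessment must be preceded by a modelling node")
    return problems


print(check_flow(["input_data_source", "sampling", "transform", "tree", "assessment"]))
print(check_flow(["input_data_source", "assessment"]))
```

An empty list means the flow passes the rules that were encoded; nodes such as filter outliers, transform or bar chart are allowed anywhere, so the checker does not constrain them.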
35 Uniform appearance of the nodes. Dialogs organized in tabs: data dialog, variables dialog, notes dialog, and (for some nodes) a browser for results.
37 SAS ENTERPRISE MINER system requirements. Pentium PC. Windows NT 4.0+ or Win 95. 250 MB+ free disk space. CD-ROM drive.
38 SAS ENTERPRISE MINER architecture. Client-server solution. Clients: Win 95, Win NT. Servers: Win NT, all major UNIX. Mainframe as data server, later also as compute server. Beta: only Win95, Win NT initially. UNIX: AIX, HP-UX, Solaris.
39 SAS ENTERPRISE MINER beta. About 100 EM beta applications in the USA. About 60 EM beta testers in Europe.
40 Summary. SAS Enterprise Miner: models data mining as a process. Enables cooperation between IT, business and data miners. Complete SEMMA implementation. Integration of DW, DM and reporting. Competitive advantage through data mining.