Testing the Importance of Cleansing Procedures for Overlaps in German Administrative Data
1 Testing the Importance of Cleansing Procedures for Overlaps in German Administrative Data
Patrycja Scioch (Research Data Centre of the BA at the IAB, Germany)
New Techniques and Technologies for Statistics

I'd like to present part of a research project that analyses variations in the processing of administrative data and the stability of evaluation results based on these differently processed data. Why is this of interest?
2 Motivation
- increasing importance of administrative data for research
- in Germany there are two types of such data:
  - data collected for official statistical purposes
  - by-products of administration (e.g. the federal employment services)
- administrative data:
  - not collected for research
  - come from different and independent sources
  - merging may cause contradictions in the information

In the last couple of years researchers have increasingly discovered administrative data as a basis for their analyses. In Germany two types of data relevant for researchers can be identified. The first is data for official statistics, which are collected and processed for administrative needs; these are in most cases survey data, which are available to researchers too. The second kind arises from administration processes, for example as by-products of the daily business in the employment services. This kind of data is receiving more and more attention from research and is also the centre of my studies. The main characteristics of these data are the following: they are not collected for research, which means they are not in the condition a researcher would prefer to have; and in most cases they are collected from different, independent sources, so merging this information may cause contradictions. Why is it then a good idea to use them? Because it costs nothing to produce these data: they are by-products and so cheaper than surveys, and they contain a lot of information. BUT, and there is always a but, it costs a lot of time to bring them into the desired shape, and not all information of interest is always in one data set, so a combination is required, which leads to other problems and makes further research on quality necessary.
3 The Integrated Employment Biographies (IEB)
combination of four different sources:
- Employee History
- Benefit Recipient History
- Applicants Pool Data
- Participants-in-Measure Dataset
subsample:
- 2.2% random sample
- latest update 2006
characteristics:
- daily records
- split into episodes
- quality depends on the source of information

The data set I use is the Integrated Employment Biographies, created by the Institute for Employment Research. These are individual data, not aggregated. It is a combination of data from four different and independent sources. It contains information on employment periods and also on times when persons receive wage substitution from the federal employment agency. Furthermore, periods of search for employment are integrated, as well as times of participation in measures. This is a huge dataset, with about 65 million individuals and 950 million records. To simplify matters I use a 2.2% random sample. The data records are split into episodes, so that spells do not partially overlap but are either disjoint or exactly parallel. These parallel spells from different sources cause a lot of problems, because their quality depends on the respective source, and there may be contradictory statements about the same person at the same point in time, and one does not know which information to believe.
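The episode structure just described can be sketched in a few lines. This is only an illustration of the general idea, not the IAB's actual procedure; the spell sources and dates below are invented. Overlapping spells are cut at every start and end date, so that the resulting episodes are either disjoint or exactly parallel:

```python
# Illustrative sketch of episode splitting (not the IAB's actual code).
# A spell is (source, start, end) with integer day numbers, end exclusive.

def split_into_episodes(spells):
    """Cut overlapping spells at all boundary dates, so episodes
    never partially overlap: they are disjoint or exactly parallel."""
    # every start/end date becomes a cut point
    cuts = sorted({d for _, s, e in spells for d in (s, e)})
    episodes = []
    for lo, hi in zip(cuts, cuts[1:]):
        for src, s, e in spells:
            if s <= lo and hi <= e:  # spell covers this sub-interval
                episodes.append((src, lo, hi))
    return episodes

# hypothetical person: employment days 0-30, benefit receipt days 10-40
spells = [("employment", 0, 30), ("benefits", 10, 40)]
for ep in split_into_episodes(spells):
    print(ep)
# -> employment 0-10, then employment and benefits exactly parallel
#    on 10-30, then benefits alone on 30-40
```

The parallel episodes (here days 10 to 30) are exactly the cases where the cleansing rules discussed later have to decide which source to believe.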
4 Literature
previous findings:
- analyses of overlaps, qualitative and quantitative (Jaenichen et al. (2005), Bernhard et al. (2006))
- correction of single variables (Waller (2007), Kruppe et al. (2007))
evidence:
- need for data processing in the IEB
- the way to do it depends heavily on the research question
open issues:
- impact on estimates
- data processing by transformation of the structure of the dataset

To improve the quality of these data some studies were done, such as those of Jaenichen et al. and Bernhard et al., who try to identify the most common overlaps and inconsistencies and propose options to deal with them properly. Waller and Kruppe et al. investigated the impact of single variables: Waller analysed the correction of end dates of training measures, and Kruppe et al. found 60 different definitions of unemployment and analysed their implementation in the data. The conclusion of all of these papers is that there is a crucial need to put effort into data processing, and that the way this should be done depends heavily on the underlying research question. Not analysed, or only to a minor degree, is the effect of correcting the data in different ways on the results of estimations. Another interesting point is transforming the data structure in order to process the data. These open issues are the subject of my study.
5 Identification/Method
assumption: dataset → processing → method → result
within the case: Wunsch/Lechner (2007)
- evaluation of labour market programmes in West Germany
- analyses by comparing matching estimates
- time-dependent employment opportunities as outcome
1st step: replication of the data processing and variations of the analysis sample
2nd step: replication of the evaluation study
3rd step: analysis of the effects of the variations on the results

How did I arrive at this question? Let's make it simple and say that research results depend on the underlying data, the processing of the data, and the method used to analyse the data: regression, matching and so on. If I now take the same data and keep the method constant, then differences in the results should be attributable to the processing procedures. To follow up this idea I perform a replication study within the case of a paper by Wunsch/Lechner from last year. They evaluate labour market programmes in West Germany by matching estimations. Matching is the comparison of the employment state (or other outcomes) of two individuals who are the same in their characteristics, except that one of them took part in a programme and the other did not. The aim is to draw conclusions about the effectiveness of labour market programmes. In the first step I replicate the processing of the data and then vary these procedures. This creates different analysis samples, which build the basis for the next step, the evaluation. The matching remains the same for each analysis sample. The last step is to analyse the differences in the estimation results and to draw conclusions about the effect of the processing.
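The matching idea described above can be illustrated with a minimal one-nearest-neighbour sketch. This is emphatically not Wunsch/Lechner's actual estimator, which matches on many covariates with a far more elaborate procedure; the individuals, covariates and outcomes below are invented:

```python
# Minimal sketch of a 1-nearest-neighbour matching estimator of the
# average treatment effect on the treated (ATT). Illustrative only.

def nn_match_att(treated, controls):
    """treated, controls: lists of (covariates, outcome) pairs, where
    covariates is a tuple of numbers (e.g. age, years of schooling)."""
    def dist(x, y):
        # squared Euclidean distance between covariate vectors
        return sum((a - b) ** 2 for a, b in zip(x, y))

    effects = []
    for x_t, y_t in treated:
        # the control person most similar in characteristics
        _, y_c = min(controls, key=lambda c: dist(x_t, c[0]))
        effects.append(y_t - y_c)
    return sum(effects) / len(effects)

# hypothetical individuals: (age, schooling), employed afterwards (0/1)
treated = [((30, 10), 1), ((45, 12), 0)]
controls = [((31, 10), 0), ((44, 12), 0), ((25, 9), 1)]
print(nn_match_att(treated, controls))  # -> 0.5
```

Keeping this estimator fixed while the analysis samples vary is what allows differences in the results to be attributed to the processing.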
6 Approach/Framework
[Diagram: the IEB data set is processed in three variants (processing: variable) into the analysis samples V0, V1 and V2; a fixed matching estimator yields the outcomes V0, V1 and V2, which are then compared.]

A simple illustration of the workflow is shown here, with the basis data on the left, which is processed in different ways to obtain the analysis samples. V0 is the result of processing as Wunsch/Lechner did it, and V1 and V2 are variations I made. By keeping the estimation fixed, three outcomes are obtained, which are compared with each other: maybe there are no differences and the processing has no impact, or there are some and they have to be interpreted.
7 Processing rules
- time windows of two weeks
- multiple possible spells (different sources, overlaps)
- goal: exactly one state for each period
- sort by duration and priority of source
- choose the two of capital importance
- select one final state using further priority rules
- → different analysis samples

How are the data processed in particular? The periods are divided into time windows of two weeks. Every two-week window may contain parallel spells from different sources, or even from the same source, and they need not give identical information. It is not easy to say which spell from which source is the right one and therefore which one to choose. So the aim is to determine one state for each period, i.e. for each two-week window. This is done by sorting the spells by their duration within the two weeks and by the priority of the source the information comes from. This priority is defined beforehand and will be explained later on. Then the two spells of capital importance are chosen, and from them the final state for the two-week spell is selected following further priority rules. Changing the priority leads to another selection, and so different analysis samples are obtained.
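The two-step selection just described can be sketched as follows. This is my reading of the procedure, not the actual implementation; the priority ranks shown are those of model V0, and durations are invented:

```python
# Sketch of selecting one final state per two-week window:
# 1) rank parallel spells by duration, ties broken by source priority,
# 2) keep the two top-ranked spells,
# 3) choose the final state from these two by source priority.

PRIORITY_V0 = {"programme": 1, "benefits": 2, "employment": 3, "applicants": 4}

def final_state(spells, priority):
    """spells: list of (source, duration_in_days) within one window."""
    # longest spells first; among equal durations, higher-priority source first
    ranked = sorted(spells, key=lambda s: (-s[1], priority[s[0]]))
    top_two = ranked[:2]                       # the two of capital importance
    # final decision between the two by source priority alone
    return min(top_two, key=lambda s: priority[s[0]])[0]

# hypothetical window: full-length employment and benefit spells,
# plus a short applicant spell
window = [("employment", 14), ("benefits", 14), ("applicants", 5)]
print(final_state(window, PRIORITY_V0))  # -> "benefits"
```

Swapping in a different priority dictionary changes the selection, which is exactly how the different analysis samples arise.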
8 Rules of priority

Priority   Model V0     Model V1     Model V2
1          Programme    Programme    Employment
2          Benefits     Employment   Programme
3          Employment   Benefits     Benefits
4          Applicants   Applicants   Applicants

Differences:
- Model V1 prefers employment spells to benefit spells compared to V0
- Model V2 downgrades participation in programmes and prefers employment

In this table you can see the priorities in the different models. In column 1 you see the importance of the source: 1 is the highest priority and 4 the lowest. Model V0 in column 2 corresponds to Wunsch/Lechner, where participating in a programme is more important than receiving benefits or being employed. That is because they evaluate labour market programmes and so attach great importance to them. Second comes the receipt of benefits, because money is paid, and when money is involved one can assume that the data are correct. Nearly the same argument applies to employment spells. These are relatively reliable, because they are notifications by employers about their employees from the notification procedure for health, pension and unemployment insurance, and therefore again linked with money. The Applicants Pool Data is not very reliable, because it contains data that are optional and so not always recorded, or often recorded with less care. Model V1 in column 3 differs in that the priorities of benefit receipt and times of employment are inverted, because both are relatively reliable and no one can say which is more valid. In the last model, V2, the priority changed with respect to Model V1 by downgrading programme participation below employment. This comes from the consideration that these data are recorded in the employment agencies before the programme takes place, and afterwards no one knows whether the unemployed person really participated or not. More often than not they did, but one can never say with certainty. The differences I expect are more employment in Model V1 and less participation in V2 relative to V0.
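The effect of the three orderings can be shown on a single hypothetical window with parallel benefit and employment spells (a small sketch, with source names as used on this slide):

```python
# The three priority orderings (highest priority first), as on the slide.
PRIORITIES = {
    "V0": ["programme", "benefits", "employment", "applicants"],
    "V1": ["programme", "employment", "benefits", "applicants"],
    "V2": ["employment", "programme", "benefits", "applicants"],
}

def pick_state(parallel_sources, order):
    """Pick the highest-priority source among the parallel spells."""
    rank = {src: i for i, src in enumerate(order)}
    return min(parallel_sources, key=lambda src: rank[src])

# hypothetical window: parallel benefit and employment spells, no programme
window = ["benefits", "employment"]
for model, order in PRIORITIES.items():
    print(model, pick_state(window, order))
# V0 keeps the benefit spell; V1 and V2 keep the employment spell, so the
# V1 and V2 samples contain more employment, as expected
```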
9 Results before starting the estimation
[Tables, one per analysis sample, each headed by its priority ordering:
V0: programme – benefit – employment – applicant
V1: programme – employment – benefit – applicant
V2: employment – programme – benefit – applicant
Columns: two-week window, state 1, state 2, final state. For example, a window with the parallel states benefit and employment receives the final state benefit under V0 but employment under V1 and V2.]

Here you can see a simple example. The tables show the two main states (green) within the two-week windows for one person, and the final state selected out of the two in the last column. Time windows with exactly the same contents were deleted for simplicity and space.
10 Descriptive results
participants:
- differences between sample V0 and V1, V2
- of different magnitudes
- insignificant
group of non-participants:
- significant differences
- not of practical importance
11 Estimation results - 1
Effects of programme participation compared to non-participation
12 Estimation results - 2
Variance in the estimation results
13 Summary/Prospects
- large insignificant differences during the lock-in effect
- smaller at the end of the observation period
=> The effect does not depend on the procedure (only the extent)!
=> Rules are necessary, but time + effort should not exceed the benefit!
- creation of a "naive" model
- comparison with other countries
15 Back-Up: References
Bernhard, S., Dressel, C., Fitzenberger, B. and Schnitzlein, D. (2006): Überschneidungen in der IEBS: Deskriptive Auswertung und Interpretation, FDZ Methodenreport 4/2006, Nürnberg.
Jaenichen, U., Kruppe, T., Stephan, G., Ullrich, B. and Wießner, F. (2005): You can split it if you really want: Korrekturvorschläge für ausgewählte Inkonsistenzen in IEB und MTG, FDZ Datenreport 4/2005, Nürnberg.
Kruppe, T., Müller, E., Wichert, L. and Wilke, R. (2007): On the Definition of Unemployment and its Implementation in Register Data – The Case of Germany, FDZ Methodenreport 3/2007, Nürnberg.
Waller, M. (2007): Do Reported End Dates of Treatments Matter for Evaluation Results?, FDZ Methodenreport 1/2007, Nürnberg.
Wunsch, C. and Lechner, M. (2007): What Did All the Money Do? On the General Ineffectiveness of Recent West German Labour Market Programmes, University of St. Gallen Department of Economics Working Paper Series, Department of Economics, University of St. Gallen.