Testing the Importance of Cleansing Procedures for Overlaps in German Administrative Data
Patrycja Scioch (Research Data Centre of the BA at the IAB, Germany)
New Techniques and Technologies for Statistics
I would like to present part of a research project that analyses variations in the processing of administrative data and the stability of evaluation results based on these differently processed data. Why is this of interest?

Motivation
- increasing importance of using administrative data for research
- in Germany we have two types of such data:
  - collected for official statistical purposes
  - by-product of administration (e.g. federal employment services)
- administrative data:
  - not collected for research
  - different and independent sources of data
  - merging may cause contradictions in information
In the last couple of years, researchers have increasingly turned to administrative data as the basis for their analyses. In Germany, two types of data relevant for researchers can be identified. The first is data for official statistics, which are collected and processed for administrative needs. In most cases these are survey data, which are available to researchers too. The second kind is data that arise from administrative processes, for example as by-products of the daily business of the employment services. This kind of data gets more and more attention from research and is also the centre of my studies. The main characteristics of these data are the following:
- they are not collected for research, which means that they are not in the condition a researcher would prefer,
- in most cases they are collected from different, independent sources, so merging this information may cause contradictions.
So why is it a good idea to use them? Because producing them costs nothing extra: they are by-products, so they are cheaper than surveys, and they contain a lot of information. BUT, and there is always a but, it costs a lot of time to bring them into the desired shape, and not all information of interest is always in one data set. A combination is therefore required, which leads to other problems and makes further research on data quality necessary.

The Integrated Employment Biographies - IEB
- combination of four different sources:
  - Employee History
  - Benefit Recipient History
  - Applicants Pool Data
  - Participants in Measure Dataset
- subsample:
  - 2.2% random sample
  - latest update 2006
- characteristics:
  - daily records
  - split into episodes
  - quality depends on source of information
The data set I use is the Integrated Employment Biographies, created by the Institute for Employment Research. These are individual data, not aggregated. It is a combination of data from four different and independent sources. It contains information on employment periods and also on times when persons receive wage substitution from the federal employment agency. Furthermore, periods of job search are integrated, as well as times of participation in measures. This is a huge dataset, with about 65 million individuals and 950 million records. To simplify matters I use a 2.2% random sample. The data records are split into episodes, so that spells do not partially overlap but are exactly parallel. These parallel spells from different sources cause a lot of problems, because their quality depends on the respective source. There may be contradictory statements about the same person at the same point in time, and one does not know which information to believe.
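To make "split into episodes" concrete, here is a minimal sketch of one way such a split could work for a single person. It is not the IEB production procedure: the `Spell` record, the function name `split_into_episodes` and the example dates are assumptions made for illustration only.

```python
from datetime import date, timedelta
from typing import NamedTuple


class Spell(NamedTuple):
    source: str   # hypothetical source label, e.g. "employment"
    start: date   # first day of the spell
    end: date     # last day of the spell (inclusive)


def split_into_episodes(spells):
    """Cut one person's spells at every boundary of any other spell,
    so that two episodes either cover the same days or are disjoint."""
    # Cut points: every start date and the day after every end date.
    cuts = sorted({s.start for s in spells} |
                  {s.end + timedelta(days=1) for s in spells})
    episodes = []
    for s in spells:
        inner = [c for c in cuts if s.start < c <= s.end]
        starts = [s.start] + inner
        ends = [c - timedelta(days=1) for c in inner] + [s.end]
        episodes += [s._replace(start=a, end=b) for a, b in zip(starts, ends)]
    return sorted(episodes, key=lambda e: (e.start, e.source))


# Made-up example: an employment spell partially overlapping a benefit spell.
spells = [
    Spell("employment", date(2000, 1, 1), date(2000, 1, 31)),
    Spell("benefit", date(2000, 1, 15), date(2000, 2, 15)),
]
episodes = split_into_episodes(spells)
for e in episodes:
    print(e.source, e.start, e.end)
```

In this example the two original spells become four episodes, and the employment and benefit episodes for 15 to 31 January are exactly parallel, which is the situation that later needs a priority rule.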

Literature
previous findings:
- concentrate on the analysis of overlaps, qualitative and quantitative (Jaenichen et al. (2005), Bernhard et al. (2006))
- correction of single variables (Waller (2007), Kruppe et al. (2007))
evidence:
- need for data processing in the IEB
- the way heavily depends on the research question
open issues:
- impact on estimates
- data processing by transformation of structure of dataset
To improve the quality of these data some studies were done, like the ones of Jaenichen et al. and Bernhard et al., who try to identify the most common overlaps and inconsistencies and propose options to deal with them properly. Waller and Kruppe et al. investigated the impact of single variables. Waller analysed the correction of end dates of training measures, and Kruppe et al. found 60 different definitions of unemployment and analysed their implementation in the data. The conclusion of all of these papers is that there is a crucial need to put effort into data processing and that the way this should be done depends heavily on the underlying research question. What has not been analysed, or only to a minor degree, is the effect of correcting data in different ways on the results of estimations. Another interesting point is to process the data by transforming the data structure. These open issues are the matter of my study.

Identification/Method
- assumptions: dataset → processing → method → result
- within the case: Wunsch/Lechner (2007)
  - evaluation of labour market programmes in West Germany
  - analyses by comparing matching estimates
  - time-dependent employment opportunities as outcome
- 1. step: replication of the data processing and variations of the analysis sample
- 2. step: replication of the evaluation study
- 3. step: analyses of the effects of the variations on the results
How did I arrive at this question? Let's keep it simple and say that research results depend on the underlying data, the processing of the data and the method used to analyse the data (regression, matching and so on). If I now take the same data and keep the method constant, then differences in the results should be attributable to the processing procedures. To follow up this idea I perform a replication study based on a paper by Wunsch/Lechner from last year. They evaluate labour market programmes in West Germany by matching estimation. Matching is the comparison of the employment state, or other outcomes, of two individuals who are the same in their characteristics except that one of them took part in a programme and the other one did not. The aim is to draw conclusions about the effectiveness of labour market programmes. In the first step I replicate the processing of the data and then vary these procedures. In this way different analysis samples are created, which form the basis for the next step, the evaluation. The matching remains the same for each analysis sample. The last step is to analyse the differences in the estimation results and to draw conclusions about the effect of the processing.

Approach/Framework
[Diagram: the IEB data set is processed in three ways (processing: variable) into analysis samples V0, V1 and V2; the same matching estimator (fixed) is applied to each sample, giving outcomes V0, V1 and V2, which are then compared.]
A simple illustration of the workflow is shown here, with the base data on the left, which are processed in different ways to obtain the analysis samples. V0 is the result of processing the data as Wunsch/Lechner did, and V1 and V2 are variations I made. By keeping the estimation fixed, three outcomes are obtained, which are compared with each other. Maybe there are no differences and the processing has no impact, or there are some and they have to be interpreted.

Processing rules
- time windows of two weeks
- multiple possibilities of spells (different sources, overlaps)
- goal: exactly one state for each period
  - sort by duration and priority of source
  - choose the two most important spells
  - select one final state using more priority rules
- different analysis samples
How are the data processed in particular? The periods are divided into time windows of two weeks. Every two-week window may have parallel spells from different sources or even from the same source, and they do not need to give identical information. It is not so easy to say which spell from which source is the right one and therefore which one to choose. So the aim is to determine one state for each period, that is, for each two-week window. This is done by sorting the spells by their duration within the two weeks and by the priority of the source the information comes from. This priority is defined beforehand and will be explained later on. Then the two most important spells are chosen, and out of them the final state for the two-week window is selected following further priority rules. Changing the priority leads to another selection, and so different analysis samples are obtained.
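The selection steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the study: the talk does not spell out the "further priority rules", so here the final state is simply the higher-priority source among the two main spells, and duration is sorted before source priority. The function name `final_state`, the input format and the example window are hypothetical.

```python
# Priority order of Model V0 as described in the talk (highest first).
PRIORITY_V0 = ["programme", "benefit", "employment", "applicant"]


def final_state(spells_in_window, priority):
    """Pick one state for a two-week window.

    spells_in_window: list of (source, days_in_window) pairs.
    priority: list of sources, highest priority first.
    """
    if not spells_in_window:
        return None
    rank = {source: i for i, source in enumerate(priority)}
    # Step 1: sort by duration within the window (longest first),
    # breaking ties by source priority (assumed ordering).
    ranked = sorted(spells_in_window, key=lambda s: (-s[1], rank[s[0]]))
    # Step 2: keep the two most important spells.
    top_two = ranked[:2]
    # Step 3: assumed final rule, the higher-priority source wins.
    return min(top_two, key=lambda s: rank[s[0]])[0]


# Made-up window with three parallel spells.
window = [("employment", 10), ("benefit", 9), ("applicant", 14)]
print(final_state(window, PRIORITY_V0))
```

Here the applicant and employment spells are the two longest, and employment is selected because it ranks above the Applicants Pool Data in the priority list.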

Rules of Priority
Priority   Model V0     Model V1     Model V2
1          Programme    Programme    Employment
2          Benefits     Employment   Programme
3          Employment   Benefits     Benefits
4          Applicants   Applicants   Applicants
Differences:
- Model V1 prefers employment spells to benefit spells compared to V0
- Model V2 downgrades participation in programmes and prefers employment
In this table you can see the priorities in the different models. Column 1 shows the importance of the source: 1 is the highest priority and 4 the lowest. Model V0 in column 2 corresponds to Wunsch/Lechner, where participating in a programme is more important than receiving benefits or being employed. That is because they are evaluating labour market programmes and so they attach great importance to them. The second is the receipt of benefits, because money is paid, and when money is involved you can assume that the data are correct. Nearly the same argument applies to employment spells. These are relatively reliable, because they are notices from employers about their employees from the notification procedure for health, pension and unemployment insurance and are therefore again linked with money. The Applicants Pool Data is not very reliable, because it contains data which are optional and so not always recorded, or often recorded with less care. Model V1 in column 3 differs in that the priorities of benefit receipt and times of employment are inverted. That is because both are relatively reliable and no one can say which one is more valid. In the last model, V2, the priority changes with respect to Model V1 by giving programme participation a lower priority, below employment. This comes from the consideration that these data are recorded in the employment agencies before the programme takes place, and afterwards no one knows whether the unemployed person really participated or not. More often than not this is the case, but you can never say with certainty. The differences I expect are more employment in Model V1 and less participation in V2 relative to V0.
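To illustrate how changing the priority order changes the selected states, here is a small sketch that applies the three orderings to the same invented windows. The V0, V1 and V2 orders are taken from the priority sequences shown with the example on the next slide; the `choose` function, the rule "the higher-priority state of the two main states wins" and the example windows are assumptions for illustration, not data or code from the study.

```python
from collections import Counter

# Priority orders per model, highest priority first.
MODELS = {
    "V0": ["programme", "benefit", "employment", "applicant"],
    "V1": ["programme", "employment", "benefit", "applicant"],
    "V2": ["employment", "programme", "benefit", "applicant"],
}


def choose(state_1, state_2, priority):
    """Return whichever of the two main states ranks higher (assumed rule)."""
    return min((state_1, state_2), key=priority.index)


# Made-up two-week windows, each with its two main states.
windows = [
    ("benefit", "employment"),
    ("programme", "employment"),
    ("applicant", "benefit"),
]

# Count the final states each model produces for the same windows.
counts = {
    name: Counter(choose(a, b, order) for a, b in windows)
    for name, order in MODELS.items()
}
for name, c in counts.items():
    print(name, dict(c))
```

With these invented windows, V1 yields one employment state where V0 yields none, and V2 yields no programme state at all, which is the direction of the differences expected above.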

Results before starting the estimation
[Example tables: for one person in the IEB data set, each analysis sample lists the two main states (state 1, state 2) and the selected final state per two-week window.]
- analysis sample V0: programme – benefit – employment – applicant
- analysis sample V1: programme – employment – benefit – applicant
- analysis sample V2: employment – programme – benefit – applicant
Here you can see a simple example. The tables show the two main states (green) within the two-week windows for the person, and the final state selected out of the two in the last column. Time windows with exactly the same contents were deleted for simplicity and space.

Descriptive results
Participants:
- differences between sample V0 and V1, V2
- different magnitudes
- insignificant
Group of Non-Participants:
- significant differences
- not of practical importance

Estimation results - 1
[Figure: Effects of programme participation compared to non-participation]

Estimation results - 2
[Figure: Variance in the estimation results]

Summary/Prospects
- large insignificant differences during lock-in effect
- smaller at the end of observation period
=> The effect does not depend on the procedure (only the extent)!
=> Rules are necessary, but time + effort should not exceed benefit!
- creation of a "naive" model
- comparison with other countries

Thank you for your attention!

Back-Up References
- Bernhard, S., Dressel, C., Fitzenberger, B. and Schnitzlein, D. (2006): Überschneidungen in der IEBS: Deskriptive Auswertung und Interpretation, FDZ Methodenreport 4/2006, Nürnberg.
- Jaenichen, U., Kruppe, T., Stephan, G., Ullrich, B. and Wießner, F. (2005): You can split it if you really want: Korrekturvorschläge für ausgewählte Inkonsistenzen in IEB und MTG, FDZ Datenreport 4/2005, Nürnberg.
- Kruppe, T., Müller, E., Wichert, L. and Wilke, R. (2007): On the Definition of Unemployment and its Implementation in Register Data – The Case of Germany, FDZ Methodenreport 3/2007, Nürnberg.
- Waller, M. (2007): Do Reported End Dates of Treatments Matter for Evaluation Results?, FDZ Methodenreport 1/2007, Nürnberg.
- Wunsch, C. and Lechner, M. (2007): What Did All the Money Do? On the General Ineffectiveness of Recent West German Labour Market Programmes, University of St. Gallen Department of Economics working paper series, Department of Economics, University of St. Gallen.
