Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung Two Issues on Remote Data Access - Automation of Disclosure Limitation Review - Memory Allocation Peter Jacobebbinghaus 5th German Stata User's Group Meeting April 2nd, 2007 at RWI-Essen
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung Research Data Centre of the BA in the IAB in Nuremberg (FDZ) Provides access to micro data of the Institute for Employment Research (IAB) Federal Employment Agency (BA) Data surveys administrational (databases > 200 GB, samples ~ 2 GB) Data protection and data access anonymization (scientific use files) on-site use remote data access
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung Remote data access Researchers send do-files – we return log-files 90% use Stata Disclosure limitation review Data have detailed information on persons and firms Neither firms nor persons may be identifiable by the contents of the log-files One heuristic: delete all information that is based on few observations
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung Example. table sex edu | Education Sex | low medium high male | female |
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung Example. table sex edu | Education Sex | low medium high male | / 6 4 female |
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung Example. table sex edu, c(freq mean wage) | Education Sex | low medium high male | | | female | |
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung Example. table sex edu, c(freq mean wage) | Education Sex | low medium high male | / 6 4 | / | female | |
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung Disclosure limitation review Problems a lot of work if done by hand! Pearl scripts dont get all number of firms not visible weights Solutions use other software ;-) fdz-ados -tablefdz-
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung tablefdz sxntax very similar to -table- program define tablefdz, byable(recall) version 9 syntax varlist(min=1 max=2) [if] [in] [pweight /],FIRMID(string) [Contents(string) Format(string)] error messages preserve drop all needless variables and observations
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung by `vars' `firmid': gen `dum' = ( _n == 1 ) by `vars': egen `anzbetr' = sum(`dum') by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) } by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -2 if `anzbetr' <= `minoffirms' } replace `summe' = -2 if `anzbetr' <= `minoffirms' save `cell'
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung tablefdz calculation done for inner table, row, col and total results for inner table, row, col and total are appended tabdisp restore
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung Results of tablefdz. tablefdz sex edu, firmid(firmid) | Education Sex | low medium high Total male | female | | Total |
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung. tablefdz sex edu, firmid(firmid) c(mean wage sum wage) | Education Sex | low medium high Total male | | | | female | | | | Total | | |
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung. tablefdz sex edu [pw= weight], c(mean wage sum wage) firmid(firmid) | Education Sex | low medium high Total male | | | | female | | | | Total | | |
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung tablefdz deleted values can somtimes be recalculated very easily more deletion and rounding by Pearl still necessary but now it is feasible tablefdz takes about twice the time as table Questions ideas to improve? (computation time) similar programs around?
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung Topic 2: Memory usage 16 gigabyte memory in Unix server, 15 gigabyte are free How much memory is the optimum we should allocate? 13? 15? dependant on matsize? rule-of-thumb? Automation for running Stata-jobs?
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung * Program to produce tables with automatically deleted values that are based on few observations * by Stefan Seth and Peter Jacobebbinghaus *! version mar2007 program define tablefdz, byable(recall) version 9 syntax varlist(min=1 max=2) [if] [in] [pweight /], FIRMID(string) [Contents(string) Format(string)] tempvar anzbetr sv1 sv2 sv3 sv4 sv5 dum svar1 svar2 svar3 svar4 svar5 eins summe tempvar sum_gew1 sum_gew2 sum_gew3 sum_gew4 sum_gew5 // used when calculating weighted averages tempfile dat cross row col scol local minoffirms = 3 // min. number of firms per statistic local anzvar = wordcount("`varlist'") local fmt "format(`format')" preserve if _by() { marksample touse // take care of by-group } quietly {
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung if "`contents'" == "" { local anz_stats = 0 } if "`contents'" != "" { if mod( wordcount( "`contents'" ), 2 ) != 0 { di in red "Invalid statistics specification; please use the Contents option as follows:" di in red "contents( statistic variable )" di in red "You may specify up to 5 statistic/variable-pairs, e.g.: " di in red "tabfdz var1 var2, contents( mean var3 sum var4 ) exit } tokenize "`contents'" local stats "`1' `3' `5' `7' `9'" local svar "`2' `4' `6' `8' `10'" tokenize "`stats'" local anz_stats = wordcount( "`stats'" ) // list of statistics forvalues i = 1/`anz_stats' { local stat`i' "``i''" } tokenize "`svar'" // list of variables forvalues i = 1/`anz_stats' { local svar`i' "``i''" }
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung * error messages if "`weight'" != "" { forv i = 1/`anz_stats' { if inlist( "`stat`i''", "sum", "mean" ) == 0 { di in red "weights may only be used with 'sum' and 'mean'" exit } if "`weight'" != "pweight" { di in red "only pweights allowed" exit } forvalues i = 1/`anz_stats' { if inlist( "`stat`i''", "sum", "mean", "sd", "median", "min", "max" ) == 0 { di in red "Statistics sum mean sd median min max allowed only" exit }
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung * reduce size of data if _by() { keep if `touse' // restrict to by-group } capture keep `if' `in' // reduction of observations keep `firmid' `varlist' `svar' `exp' // reduction of variables gen byte `eins' = 1 tokenize `varlist' // dropping missings drop if missing( `1' ) if `anzvar' == 2 { drop if missing( `2' ) }
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung if "`weight'" != "" { replace `eins' = `exp' // multiply statistics variables with weights... forv i = 1 / `anz_stats' { local jmax = `i' - 1 local seen = 0 //... but only once forv j = 1 / `jmax' { if "`svar`i''" == "`svar`j''" { local seen = 1 } if `seen' == 0 { replace `svar`i'' = `svar`i'' * `exp' } save `dat'
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung * computation of statistics local vars "`varlist'" // inner part of the table sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) // mark obs with new firm by `vars': egen `anzbetr' = sum(`dum') // number of firms per cell by `vars': egen `summe' = sum( `eins' ) // (weighted) number of persons per cell forv i = 1 / `anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { // mean with weights per hand by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i''/`summe' // mean = sum(y_i * h_i)/sum(h_i) } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) // other statistics per person } by `vars': keep if _n == 1 // keep one line per cell forv i = 1 / `anz_stats' { replace `sv`i'' = -1 if `anzbetr' <= `minoffirms' // erase statistics based on few firms } replace `summe' = -1 if `anzbetr' <= `minoffirms' // erase number of cases based on few firms save `cross' // save results for the inner part
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung if `anzvar' == 2 { use `dat', clear // the same for rows tokenize `varlist' local vars "`1'" replace `2' =. sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) by `vars': egen `anzbetr' = sum(`dum') by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i'' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) } by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -2 if `anzbetr' <= `minoffirms' } replace `summe' = -2 if `anzbetr' <= `minoffirms' save `row'
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung use `dat', clear // the same for columns tokenize `varlist' replace `1' =. local vars "`2'" sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) by `vars': egen `anzbetr' = sum( `dum' ) by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i'' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) } by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -3 if `anzbetr' <= `minoffirms' } replace `summe' = -3 if `anzbetr' <= `minoffirms' save `col' }
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung use `dat', clear // the same for the total tokenize `varlist' replace `1' =. if `anzvar' == 2 { replace `2' =. } sort `firmid' by `firmid': gen `dum' = ( _n == 1 ) egen `anzbetr' = sum(`dum') egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { egen `sum_gew`i'' = sum( `svar`i'' ) gen `sv`i'' = `sum_gew`i'' / `summe' } else { egen `sv`i'' = `stat`i''( `svar`i'' ) } keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -4 if `anzbetr' <= `minoffirms' } replace `summe' = -4 if `anzbetr' <= `minoffirms' save `scol'
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung use `cross', clear // appending the results append using `scol' if `anzvar' == 2 { append using `col' append using `row' } * format output forvalues i = 1/`anz_stats' { // headers local svlist = "`svlist'" + " `" + "sv`i'" + "'" } if `anzvar' == 1 & "`contents'" != "" { label variable `sv1' "`stat1'(`svar1')" cap label variable `sv2' "`stat2'(`svar2')" cap label variable `sv3' "`stat3'(`svar3')" cap label variable `sv4' "`stat4'(`svar4')" cap label variable `sv5' "`stat5'(`svar5')" } label variable `summe' "Freq." local linesize = c(linesize) set linesize 255 format `summe' %9.0g } // end of quietly
Forschungsdatenzentrum der Bundesagentur für Arbeit im Institut für Arbeitsmarkt- und Berufsforschung * output di "/*§BEGIN_tabfdz*/" tabdisp `varlist', cellvar(`summe' `svlist' ) totals `fmt' csepwidth(2) missing di "/*§END_tabfdz*/" restore set linesize `linesize' end