Data Management Flashcards
Respondents’ Unique IDs can potentially be used to:
Link respondents’ personally identifiable information to their responses
Link different tables with a different structure in a relational database.
Link raw data to analysis code and to analysis output
The metadata we track for the data collection process includes:
Surveyor assignments
Completion rate
Surveyor attrition
Master code files are meant specifically to:
Run (call) all other coding files in the project
We use relative references in our code so that:
We do not need to repeat the full file path of the working directory for each file used or created
Different analysts who have different locations for their project folder do not need to change the file path for each file
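The idea behind relative references can be sketched in Python as follows. This is only an illustration: the root folder `my_project`, the subfolder names, and the helper `resolve` are hypothetical and do not come from the course material.

```python
from pathlib import Path

# Hypothetical project root: the only line an analyst would need to change.
PROJECT_ROOT = Path("my_project")

# Every other path is written relative to the root (folder names are illustrative).
RAW_DATA = PROJECT_ROOT / "data" / "raw" / "survey.csv"
OUTPUT_TABLE = PROJECT_ROOT / "output" / "table1.csv"


def resolve(root: Path, relative: str) -> Path:
    """Join a relative reference onto whichever root folder an analyst uses."""
    return root / relative
```

Two analysts with different project locations would call `resolve` with their own root and the same relative reference, so the rest of the code stays unchanged.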
When publishing data, when should the codebook be created?
After the dataset is final, before data publication
The codebook describes the data: variable names, labels, question text, and summary statistics such as the mean, minimum, and maximum values. Because variables may be generated during analysis, and summary statistics may change after certain cleaning decisions, it is best to produce the codebook at the very end, when the datasets are final.
Which documents are included in the “manual” for the published data?
ReadMe file
Codebook
According to Gentzkow and Shapiro, rather than giving the latest version of a file a name like regressions_022713_mg.do, one should instead:
Use version control software, and not use dates
What is required to merge two datasets?
There needs to be a relational parameter or “foreign key” (i.e., a variable on which to merge the two datasets)
Merge
A horizontal combination of datasets by a unique ID
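A minimal sketch of a one-to-one merge on a unique ID, using plain Python lists of dictionaries. The function `merge_on_id`, the key name `respondent_id`, and the sample rows are all made up for illustration. Statistical packages provide their own merge commands with more options.

```python
def merge_on_id(left, right, key="respondent_id"):
    """Combine two datasets horizontally, keeping rows whose ID is in both."""
    # The ID must be unique in each dataset for a one-to-one merge.
    for name, rows in (("left", left), ("right", right)):
        ids = [row[key] for row in rows]
        if len(ids) != len(set(ids)):
            raise ValueError(f"{key} is not unique in the {name} dataset")
    right_by_id = {row[key]: row for row in right}
    # Each matched row gains the variables from the other dataset.
    return [
        {**row, **right_by_id[row[key]]}
        for row in left
        if row[key] in right_by_id
    ]


# Illustrative data only.
demographics = [
    {"respondent_id": 1, "age": 34},
    {"respondent_id": 2, "age": 27},
]
responses = [
    {"respondent_id": 1, "q1": "yes"},
    {"respondent_id": 2, "q1": "no"},
    {"respondent_id": 3, "q1": "yes"},  # no match in demographics, so dropped
]
merged = merge_on_id(demographics, responses)
```

The merged rows are wider than either input: each keeps `respondent_id` and carries both `age` and `q1`.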
Append
A vertical combination of datasets that have variables in common (at least a subset), with the same variable names and data types
Adds observations to the existing variables
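An append can be sketched the same way. The function `append_datasets` and the sample rows are hypothetical. Filling a variable that is absent from one dataset with `None` is an assumption of this sketch, not something the card specifies.

```python
def append_datasets(top, bottom):
    """Stack two datasets vertically, adding observations to shared variables."""
    # Collect variable names in first-seen order across both datasets.
    variables = []
    for row in top + bottom:
        for name in row:
            if name not in variables:
                variables.append(name)
    # A variable missing from a row is filled with None (assumption of this sketch).
    return [{name: row.get(name) for name in variables} for row in top + bottom]


# Illustrative data only: two survey rounds with the same core variables.
round1 = [{"id": 1, "score": 10}]
round2 = [{"id": 2, "score": 12, "grade": "B"}]
stacked = append_datasets(round1, round2)
```

The result is longer rather than wider: one row per observation from either round.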
Master file
A file that runs ALL code in your project
Useful for:
– Setting any globals that might be used across do-files
– Installing user-written commands
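The cards describe master do-files. The same pattern can be sketched in Python under stated assumptions: `run_all` is a hypothetical helper, and the script names and the `SEED` setting in the usage note are invented. It relies on the standard library's `runpy.run_path`, which executes a script file and can pre-populate its global variables.

```python
import runpy
from pathlib import Path


def run_all(scripts, settings):
    """Run each script in order, passing the shared settings in as globals."""
    completed = []
    for script in scripts:
        # init_globals makes every shared setting visible inside the script.
        runpy.run_path(str(script), init_globals=dict(settings))
        completed.append(Path(script).name)
    return completed
```

A master file would then call something like `run_all(["01_clean.py", "02_analysis.py"], {"SEED": 12345})`, so the settings are defined once and every script runs in a fixed order.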
Codebook
• Contains information about the data: variable names, labels, question text, min/max values, etc.
• Critical for easy interpretation of the data and for furthering analysis
• Have a do-file that creates the codebook from raw data
• When: Created once the dataset is final
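A script that builds a codebook directly from the data might look like the sketch below. The function `build_codebook`, the variable labels, and the sample rows are illustrative assumptions. A real codebook would also carry question text and value labels.

```python
def build_codebook(rows, labels):
    """Return one codebook entry per variable: name, label, and summary stats."""
    codebook = []
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] is not None]
        entry = {"variable": name, "label": labels.get(name, ""), "n": len(values)}
        # Summary statistics are only meaningful for fully numeric variables.
        if values and all(isinstance(v, (int, float)) for v in values):
            entry["min"] = min(values)
            entry["max"] = max(values)
            entry["mean"] = sum(values) / len(values)
        codebook.append(entry)
    return codebook


# Illustrative data only.
data = [
    {"respondent_id": 1, "age": 34, "village": "A"},
    {"respondent_id": 2, "age": 27, "village": "B"},
]
labels = {"age": "Age in years", "village": "Village of residence"}
codebook = build_codebook(data, labels)
```

Because the entries are computed from the data, rerunning the script on the final dataset keeps the codebook in step with any late cleaning decisions.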
ReadMe Files
• Outlines key information about all published files: data and analysis files, questionnaires, codebooks
– E.g. format of the data (such as # of observations per student, # of variables)
• Describes how data/analysis files interact with one another – e.g. which came first, is one a subset of another?
• When: Immediately after each round of data collection