Working with Administrative Data Flashcards
Which of these is an example of administrative data?
Text from a tweet; Information from a birth certificate; Demographic information collected during a baseline survey; Income information from tax records; None of the above
Information from a birth certificate and income information from tax records. Both are collected during the normal operations of a program rather than primarily for research.
Compared to survey data, administrative data are less susceptible to _______ bias because _______
Recall bias, because the data are recorded at the time of occurrence rather than recalled later
T/F: In the context of randomized evaluations, researchers can obtain administrative data from both public (i.e., governmental) and private institutions.
True
Both public and private institutions have provided individual-level administrative data to researchers for the purposes of randomized evaluations.
With regard to administrative data, the _______ identified and sensitive the data that you are asking to be released, the _______ challenging it will be to get those data outside of the agency for research.
more, more
When choosing identifiers for matching study data to administrative data, which of the following identifiers would be preferable to using an individual’s street address?
A government-issued, unique identification number
Date of birth
Both are preferable because they are numerical identifiers, as opposed to identifiers composed of letters and numbers.
The exact/deterministic matching strategy may lead to more ___________, while the fuzzy/probabilistic matching strategy may lead to more ___________.
False negatives, false positives
Fuzzy and probabilistic matching strategies can account for minor discrepancies but may lead to more false positives. Exact and deterministic matching strategies do not account for minor discrepancies and may lead to more false negatives.
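The trade-off can be illustrated with a minimal sketch in Python. The names, the 0.8 threshold, and the use of the standard-library difflib similarity ratio are all assumptions made for illustration; real record-linkage tools use more careful methods.

```python
from difflib import SequenceMatcher

def exact_match(a: str, b: str) -> bool:
    """Deterministic rule: strings must be identical after trimming and lowercasing."""
    return a.strip().lower() == b.strip().lower()

def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Similarity rule: accept the pair if the similarity ratio meets the threshold.
    The threshold of 0.8 is an arbitrary, illustrative choice."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return ratio >= threshold

# Hypothetical pair: the same person, with a typo in one record.
same_person = ("Katherine Johnson", "Katherine Jonson")
# Hypothetical pair: two different people with similar names.
different_people = ("Maria Gonzales", "Mario Gonzalez")

# Exact matching misses the true match (a false negative).
missed_by_exact = not exact_match(*same_person)
# Fuzzy matching recovers it, but also links two different people (a false positive).
recovered_by_fuzzy = fuzzy_match(*same_person)
wrongly_linked_by_fuzzy = fuzzy_match(*different_people)
```

Raising the threshold reduces false positives at the cost of more false negatives, which is the same trade-off described in the card.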
During the data matching process, the _______ file and the _______ file are combined to create the _______ file
identified finder, administrative data, de-identified analysis
identified finder file
Contains the identifiers of the study sample and a study ID. The study ID is a numeric ID that uniquely identifies each person in the study.
administrative data file
The data provider has the administrative data file that contains identifiers and the outcome variables of interest to the research team.
de-identified analysis file
The data flow process will determine how the identified finder file and the administrative data file are combined to create the de-identified analysis file
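As a rough sketch of how the three files relate, the Python below links a finder file to an administrative file and keeps only the study ID and outcomes. The field names (gov_id, dob, earnings) and the values are hypothetical, and who actually performs this step depends on the agreed data flow.

```python
# Identified finder file: identifiers of the study sample plus a study ID.
finder_file = [
    {"study_id": 1, "gov_id": "A100", "dob": "1990-01-15"},
    {"study_id": 2, "gov_id": "A200", "dob": "1985-06-30"},
    {"study_id": 3, "gov_id": "A300", "dob": "1978-11-02"},
]

# Administrative data file: identifiers plus the outcome of interest.
admin_file = [
    {"gov_id": "A100", "dob": "1990-01-15", "earnings": 32000},
    {"gov_id": "A200", "dob": "1985-06-30", "earnings": 41000},
    {"gov_id": "A999", "dob": "1970-03-03", "earnings": 55000},
]

def build_analysis_file(finder, admin, keys=("gov_id", "dob")):
    """Link the two files on the identifier keys, then drop those keys.
    Only the study ID and the outcome variables are kept."""
    admin_index = {tuple(row[k] for k in keys): row for row in admin}
    analysis = []
    for person in finder:
        match = admin_index.get(tuple(person[k] for k in keys))
        if match is None:
            continue  # no administrative record found for this person
        outcomes = {k: v for k, v in match.items() if k not in keys}
        analysis.append({"study_id": person["study_id"], **outcomes})
    return analysis

analysis_file = build_analysis_file(finder_file, admin_file)
```

Here study ID 3 has no administrative record and drops out, and the unrelated record A999 never enters the analysis file.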
In addition to the data provider, who should sign the Data Use Agreement (DUA)?
An official institutional representative, rather than an individual PI or staff member.
T/F: If the research team never comes into contact with individuals in the study, they do not need to get IRB approval to use the individuals’ administrative data.
False
Even though the research team may never come into contact with the individuals whose information is included in the administrative data, it may still be necessary to complete the IRB process, even if just to confirm “exempt” status.
Reporting bias
Occurs when people have an incentive to under- or over-report information.
Why are administrative data useful?
• The outcomes and metrics required for a study may already be tracked by a government or organization
• Available retrospectively
• Enable long-term follow-up
• Reduce logistical burden
• Include a near census of the relevant population
• Often cheaper than surveys
How do administrative data minimize recall bias?
Data are recorded at the time of occurrence, so no memory is needed (e.g., banking records)
How do administrative data minimize social desirability bias?
Non-self-reported data (e.g., arrest records)
How do administrative data minimize differential attrition and non-response bias?
Near census of relevant population
Identifiable vs. partially de-identified vs. de-identified
Identifiable - very easy to identify individuals
Partially de-identified - more difficult, but still possible, especially when additional knowledge can be pieced together
De-identified - very difficult or impossible to identify
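One way to see why partially de-identified data can still identify people is to count how many records are unique on a combination of remaining fields. The sketch below is only an illustration: the records and the field names (zip, birth_year, sex) are hypothetical, and the check is not a formal standard.

```python
from collections import Counter

# Hypothetical records with names and ID numbers already removed.
records = [
    {"zip": "02139", "birth_year": 1990, "sex": "F"},
    {"zip": "02139", "birth_year": 1990, "sex": "F"},
    {"zip": "02139", "birth_year": 1985, "sex": "M"},
    {"zip": "02140", "birth_year": 1990, "sex": "F"},
]

def count_unique_records(rows, fields):
    """Count records whose combination of values on `fields` appears only once."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return sum(1 for c in counts.values() if c == 1)

# Using all three fields together, two records are one of a kind.
unique_on_all = count_unique_records(records, ["zip", "birth_year", "sex"])
# Using sex alone, only one record is unique.
unique_on_sex = count_unique_records(records, ["sex"])
```

Someone who already knows a person's ZIP code, birth year, and sex could pick out a unique record even though no name appears in the data.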
Exact/Deterministic Matching
Minor discrepancies are not well accounted for
– E.g., typos in name, reversed day and month in DOB
• Some records are not identified as matches even though they may refer to the same person (false negatives)
Fuzzy/Probabilistic Matching
Accounts for the likelihood that identifiers may not align exactly with those in a data system
– E.g., SSN and last name match, DOB is off by a month: still counts as a match
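The example rule above can be sketched in Python as follows. This is a simplified, rule-based illustration rather than a full probabilistic model; the record layout and the placeholder SSNs are hypothetical, and the date check ignores December/January pairs that cross a year boundary.

```python
from datetime import date

def dob_close(a: date, b: date) -> bool:
    """True if the dates are identical or differ only by one month
    (same year and same day). Simplification: ignores year boundaries."""
    return a.year == b.year and a.day == b.day and abs(a.month - b.month) <= 1

def is_match(study: dict, admin: dict) -> bool:
    """SSN and last name must agree; DOB may be off by a month."""
    ssn_ok = study["ssn"] == admin["ssn"]
    name_ok = study["last_name"].lower() == admin["last_name"].lower()
    return ssn_ok and name_ok and dob_close(study["dob"], admin["dob"])

# Hypothetical records; the SSNs are placeholders, not real numbers.
study_record = {"ssn": "000-00-0001", "last_name": "Rivera", "dob": date(1990, 3, 12)}
admin_record = {"ssn": "000-00-0001", "last_name": "RIVERA", "dob": date(1990, 4, 12)}
other_record = {"ssn": "000-00-0002", "last_name": "Rivera", "dob": date(1990, 3, 12)}

linked = is_match(study_record, admin_record)      # DOB off by a month, still linked
not_linked = is_match(study_record, other_record)  # SSN differs, not linked
```

An exact rule requiring identical DOBs would reject the first pair, which is the kind of false negative the fuzzy rule is meant to avoid.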
Differential Coverage Bias
Differential ability to link individuals to administrative records
• Treatment and control are differentially likely to appear in administrative records
To address differential coverage bias
• Collect identifiers for linking during the baseline survey
– To ensure that you are equally likely to be able to link treatment and control individuals to their records
• Identify the data universe
– Which individuals are included in the data and which are excluded, and why?
– To ensure the intervention does not affect the likelihood of appearing in a data set
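A simple diagnostic for differential coverage is to compare the share of individuals linked to administrative records in each study arm. The sketch below assumes a hypothetical list of study records with an arm label and a linked flag; it only describes the gap and does not test whether it is statistically significant.

```python
def match_rate_by_arm(records):
    """Return the share of individuals linked to administrative records, by arm."""
    totals, linked = {}, {}
    for row in records:
        arm = row["arm"]
        totals[arm] = totals.get(arm, 0) + 1
        linked[arm] = linked.get(arm, 0) + int(row["linked"])
    return {arm: linked[arm] / totals[arm] for arm in totals}

# Hypothetical linkage results for eight study participants.
sample = [
    {"study_id": 1, "arm": "treatment", "linked": True},
    {"study_id": 2, "arm": "treatment", "linked": True},
    {"study_id": 3, "arm": "treatment", "linked": True},
    {"study_id": 4, "arm": "treatment", "linked": False},
    {"study_id": 5, "arm": "control", "linked": True},
    {"study_id": 6, "arm": "control", "linked": True},
    {"study_id": 7, "arm": "control", "linked": False},
    {"study_id": 8, "arm": "control", "linked": False},
]

rates = match_rate_by_arm(sample)
gap = rates["treatment"] - rates["control"]
```

A large gap between arms suggests that the ability to link records, or the likelihood of appearing in the data, may differ by treatment status.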
Differential Reporting
• Likelihood of reporting outcome is correlated with treatment
– The true value of the outcome may not differ between treatment and control, but because of the intervention, the treatment group is more likely to report a certain outcome or appear in administrative records
To address differential reporting
• Identify how the intervention may affect the reporting of outcomes
– Identify the context in which the data were collected
– Determine direction in which estimates are likely to be biased
– E.g., does the number of doctor’s visits reflect severity of sickness or a stronger connection to the health system?
To address possibly inaccurate data
• Cross-reference with other sources to ensure accuracy
• Identify the data agency’s quality control protocol
• Choose indicators that are unlikely to be incorrectly reported
– Select variables that are straightforward and less susceptible to human error
– Request raw variables
• Communicate with the program or implementing partner responsible for collecting data
– Ask how and why data are collected
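Cross-referencing can be as simple as comparing the same variable from two sources for the people who appear in both. The sketch below assumes two hypothetical dictionaries that map study IDs to a count of doctor's visits, one from administrative records and one from a survey. A disagreement does not show which source is wrong.

```python
def discrepancy_rate(admin, other, tolerance=0):
    """Share of individuals present in both sources whose values differ
    by more than `tolerance`."""
    shared = admin.keys() & other.keys()
    if not shared:
        raise ValueError("the two sources have no individuals in common")
    mismatches = sum(1 for key in shared if abs(admin[key] - other[key]) > tolerance)
    return mismatches / len(shared)

# Hypothetical visit counts keyed by study ID.
admin_visits = {1: 2, 2: 0, 3: 5, 4: 1}
survey_visits = {1: 2, 2: 1, 3: 5, 5: 3}

# IDs 1, 2, and 3 appear in both sources; only ID 2 disagrees.
rate = discrepancy_rate(admin_visits, survey_visits)
```

A high rate is a prompt to ask the data provider how the variable is recorded, in line with the steps listed above.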
Reporting Bias
• From an individual
– E.g., under-reporting income to qualify for a social welfare program
• From an administrative organization
– E.g., schools over-report attendance to meet requirements
To address reporting bias
• Identify the context in which the data were collected
– Were there incentives to misreport information?
• Choose variables that are not susceptible to bias
– E.g., hospital visit vs. value of insurance claim