Chapter 1: Origins of Data Flashcards
What are the alternative names for:
1. Data table
2. Observations
3. Variables
- The data matrix
- Cases
- Features
What does a data table consist of?
Rows are observations and columns are variables. Each row holds the values of the variables for that observation.
What is the common format for data tables, and what does it stand for?
csv.
“Comma separated values”
What are csv files?
Text files that store a data table. Rows are separated by end-of-line characters and columns are separated by a delimiter (usually a comma; sometimes a semicolon or a tab). Can be imported into virtually all stat software.
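A minimal Python sketch of how such a file is read, using only the standard library. The file content, the variable names (family_id, size, income) and the semicolon delimiter are made up for illustration.

```python
import csv
import io

# Hypothetical csv content: one header line, then one line per observation.
# A semicolon is the delimiter here; a comma is the more common choice.
raw_text = "family_id;size;income\n1;4;52000\n2;2;31000\n3;5;47500\n"

# csv.DictReader takes the first line as the variable (column) names.
reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
rows = list(reader)

variables = reader.fieldnames   # column names = variables
n_observations = len(rows)      # data rows = observations

print(variables)
print(n_observations)
print(rows[0]["income"])        # values stay text until converted
```

Every value arrives as text, so numeric variables need an explicit conversion (e.g. float) before analysis.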
What are the 3 types of observation structures?
- Cross-sectional
- Time series
- Multi dimensional
What are the 5 features of xsec data?
- Observations come from the same time period and refer to different units, e.g. different families
- Ideally = all observations in an xsec dataset are observed at the exact same time (in practice, within a particular time interval)
- When the interval is narrow = is treated as a single point in time
- In most xsec data = the ordering of observations in the dataset doesn’t matter.
- Has the simplest data structure
What are the 2 features of tseries data?
- Observations refer to a single unit observed multiple times, e.g. a shop's monthly sales
- There is a natural ordering of the observations
What is an alternative name for multi-dimensional data?
Panel data
What is the common type of panel data?
Longitudinal data / cross-sectional time series data (xt data) = many units, each observed multiple times.
What are 2 examples of xt data?
Countries observed repeatedly over several years; employees of a firm observed on a monthly basis.
How can multi-dimensional datasets be represented in table formats for xt data? Explain.
The most convenient format has 1 observation representing 1 unit observed at 1 time (e.g. country-year observations), so that each unit (country) is represented by multiple observations.
In xt data tables = observations are identified by 2 ID variables: 1 for the xsec units and one for time.
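A small Python sketch of this long format, assuming made-up country-year data. The variable names (country, year, gdp) and the values are hypothetical.

```python
# Hypothetical country-year data in long format:
# each row is one unit (country) observed at one time (year).
xt_rows = [
    {"country": "A", "year": 2019, "gdp": 100.0},
    {"country": "A", "year": 2020, "gdp": 97.0},
    {"country": "B", "year": 2019, "gdp": 250.0},
    {"country": "B", "year": 2020, "gdp": 245.0},
]

# The pair (xsec ID, time ID) should identify each observation uniquely.
ids = [(row["country"], row["year"]) for row in xt_rows]
ids_are_unique = len(ids) == len(set(ids))

# Index the table by the two ID variables for direct lookup.
by_id = {(row["country"], row["year"]): row for row in xt_rows}

# One unit (country A) is represented by multiple observations.
obs_for_a = [row for row in xt_rows if row["country"] == "A"]

print(ids_are_unique)
print(by_id[("B", 2020)]["gdp"])
print(len(obs_for_a))
```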
What is balanced xt data?
When all xsec units have observations for the very same time periods.
What is unbalanced xt data?
When some xsec units are observed more times than others
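One way to check for balance, sketched in Python. The function name is_balanced and the example IDs are assumptions for illustration; the check follows the definition above (every unit observed in the very same time periods).

```python
from collections import defaultdict

def is_balanced(id_pairs):
    """Return True if every unit is observed in exactly the same time periods."""
    periods_by_unit = defaultdict(set)
    for unit, period in id_pairs:
        periods_by_unit[unit].add(period)
    period_sets = list(periods_by_unit.values())
    # Balanced: all units share one identical set of time periods.
    return all(periods == period_sets[0] for periods in period_sets)

# Made-up (unit, time) ID pairs.
balanced_ids = [("A", 2019), ("A", 2020), ("B", 2019), ("B", 2020)]
unbalanced_ids = [("A", 2019), ("A", 2020), ("B", 2020)]

print(is_balanced(balanced_ids))
print(is_balanced(unbalanced_ids))
```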
Name and explain the other important feature of data
Level of Aggregation of Observations.
Data with info on people may have observations at different levels, e.g. age is at the individual level, home location is at the family level, and real estate prices may be available as averages for zip code areas.
Time series data on transactions may have observations for each transaction, or for transactions aggregated over some time period.
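A short Python sketch of changing the level of aggregation, from one observation per transaction to one observation per month. The dates and amounts are made up.

```python
from collections import defaultdict

# Hypothetical transaction-level data: (date as YYYY-MM-DD, amount).
transactions = [
    ("2023-01-05", 20.0),
    ("2023-01-17", 35.5),
    ("2023-02-02", 12.0),
    ("2023-02-20", 8.0),
    ("2023-02-28", 40.0),
]

# Aggregate to the monthly level: one observation per month.
monthly_sales = defaultdict(float)
for date, amount in transactions:
    month = date[:7]            # keep the "YYYY-MM" part of the date
    monthly_sales[month] += amount

monthly_sales = dict(sorted(monthly_sales.items()))
print(monthly_sales)
```

Aggregation loses detail: the monthly table no longer shows how many transactions there were or how large each one was.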
Define the “garbage in - garbage out” principle
Summarises the prime importance of data quality.
The result of an analysis cannot be better than the data it uses.
What are the 6 key aspects of data quality?
- Content
- Validity
- Reliability
- Comparability
- Coverage
- Unbiased Selection
VRUCCC:
Real Value Understands Crap Crap Crap
OR/
Cash Value Comes from Reducing Useable Credit
Define “Content”
What the variables truly measure
Define “Validity”
Whether the variables measure what they are supposed to
Define “Reliability”
Whether the variables would lead to the same value if measured in the same way again.
Define “Comparability”
The extent to which the variables are measured the same way across different observations
Define “Coverage”
Is complete if all observations that were intended to be included are in the data
Define “Unbiased Selection”
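If coverage is incomplete, the observations that are included should be similar to all the observations that were intended to be covered. Otherwise there is selection bias.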
Define “Selection Bias”
Where the observations in the data are systematically different from the full set of observations that were intended to be covered.
What is API?
Application Programming Interface.
Lets software request data from a provider and load it directly into a stat software package.
Automated data collection is typically preferable to, and less costly than, manual data collection.
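A hedged Python sketch of loading data through an API. The URL, the JSON layout (a "results" list with "id" and "price" fields) and the function names are all hypothetical; a real API documents its own endpoints and formats.

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint; replace with a real API URL and follow its terms of use.
API_URL = "https://api.example.com/v1/prices?city=Vienna"

def fetch_json(url):
    """Download a JSON document from an API endpoint and return it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")

def to_rows(json_text):
    """Turn a JSON payload of the assumed shape into a list of observations."""
    payload = json.loads(json_text)
    return [
        {"hotel_id": item["id"], "price": float(item["price"])}
        for item in payload["results"]
    ]

# Offline demonstration with a made-up payload; in real use one would call
# to_rows(fetch_json(API_URL)) instead.
sample_payload = '{"results": [{"id": 1, "price": "81.0"}, {"id": 2, "price": "120.5"}]}'
api_rows = to_rows(sample_payload)
print(api_rows)
```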
Define Web Scraping
Collecting data from the web using code
Explain the feature of web scraping
Well-written web scraping code can load and extract data from multiple web pages.
Some websites are easier to scrape than others, depending on the structure/presentation of info
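A minimal Python sketch of scraping table cells from several pages with the standard library. The pages are made-up HTML strings and the class and function names are assumptions; real code would first download each page and would need to match that site's actual structure.

```python
from html.parser import HTMLParser

class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag == "td" and self._current_row is not None:
            self._in_cell = True
            self._current_row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._current_row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._current_row is not None:
            if self._current_row:          # skip rows without <td> cells
                self.rows.append(self._current_row)
            self._current_row = None

def scrape_tables(pages):
    """Extract table rows from several pages (given here as HTML strings)."""
    all_rows = []
    for html in pages:
        parser = TableCellParser()
        parser.feed(html)
        all_rows.extend(parser.rows)
    return all_rows

# Made-up pages with slightly different structure (one has a header row).
page_1 = "<table><tr><th>Hotel</th><th>Price</th></tr><tr><td>Alpha</td><td>81</td></tr></table>"
page_2 = "<table><tr><td>Beta</td><td>120</td></tr></table>"

scraped = scrape_tables([page_1, page_2])
print(scraped)
```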
Why is collecting data from admin sources important?
Because the variables they measure tend to be highly reliable, and coverage is often complete, which makes the datasets large.
What are the 2 pros of admin sources for data?
- Low costs (especially low marginal costs)
- Many observations
What are the 2 cons of admin sources for data?
- Typically includes few variables and misses many that may be useful for analysis
- Important variables may have low validity; their content may be quite different from what analysts would want to measure.
Explain what surveys are and the 3 types
Process where people (respondents) are asked questions and their answers are recorded.
- In self-administered surveys, e.g. web surveys = respondents answer questions on their own
- Interviews (personal, telephone) involve an interviewer and a respondent
- Mixed-mode surveys use different modes for different respondents, or for different parts of the survey for the same respondents
Define Population
The set of all observations relevant for the analysis
Define Sampling Frame
The list of all observations from which the sample is drawn
Define the Sample
The subset for which data is collected
What is a representative sample?
The distributions of all variables in the sample are very similar to those in the population.
How would you assess if a sample is representative?
Benchmark it: compare stats that are available for both the sample and the population.
However = It’s a good indicator, but not a guarantee.
Define Benchmarking
Looks at variables for which we know something in the population, and compares their distribution in the sample with that in the population.
Define Random Sampling
Process that most likely leads to representative samples
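A Python sketch that benchmarks a random sample and a non-random sample against a made-up population. The population, the age variable, the seed and the "only people under 40 respond" rule are all assumptions for illustration.

```python
import random
from statistics import mean

# Made-up population: 10,000 people with ages cycling through 18..80.
population = [{"id": i, "age": 18 + (i % 63)} for i in range(10_000)]
population_mean_age = mean(p["age"] for p in population)

# Random sample: every unit has the same chance of selection.
rng = random.Random(42)            # fixed seed so the sketch is reproducible
random_sample = rng.sample(population, 1_000)

# Non-random sample: only people under 40 respond (selection depends on age).
biased_sample = [p for p in population if p["age"] < 40][:1_000]

# Benchmark: compare a statistic known in both the sample and the population.
random_gap = abs(mean(p["age"] for p in random_sample) - population_mean_age)
biased_gap = abs(mean(p["age"] for p in biased_sample) - population_mean_age)

print(round(population_mean_age, 1))
print(random_gap < biased_gap)
```

A small gap on the benchmark variable is a good sign, but it says nothing certain about variables that cannot be benchmarked.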
What is the four V’s definition of Big Data?
- Volume (scale of data)
- Variety (different forms)
- Velocity (real-time data collection)
- Veracity (accuracy)
The 4th one is often left out because it relates to data quality.
What are the 3 features of Big Data?
- Massive = contains many observations and/or variables
- Complex = doesn’t fit into a single data table
- Continuous = often automatically and continuously collected and stored.
Often too big or too complex to be stored and processed on a typical single computer.
What are the 2 advantages of Big Data?
- Cheaper to collect due to it being collected from existing sources
- Coverage is often high, sometimes complete = reduces/eliminates selection bias.
Typically = Big Data is collected for purposes other than analysis.
What are the 4 good practices for data collection?
- Piloting data collection
- Assessing the validity and reliability of the important variables, e.g. cognitive interviews or test-retest measurement = when feasible and economical
- Examining sources of imperfect coverage to assess potential selection bias
- Working in teams with experts to design data collection
Explain the legal and ethical issues of data collection
- Legal and ethical rules need to be fully observed.
- Consulting experts = good practice before collecting data
- Just because data is online and published doesn’t mean it is free and legal to use.
- Important rules include: ensuring confidentiality and observing ownership rights.