Session 2.1 Flashcards
Data Structure (Volume/Velocity)
- Cross sectional
- Transactional
- Panel
Cross sectional
Data that (almost) never changes. (e.g. city names and locations, customer’s birth date, etc.)
Transactional
One observation represents one transaction (e.g., a website visit or a purchase)
Panel
One observation represents one individual during one time period (e.g., monthly bill, website visits per week)
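The link between transactional and panel data can be sketched in a few lines of Python. This is a minimal illustration, assuming a made-up list of website visits; the names `visits` and `to_panel` are hypothetical and not from the course material.

```python
from collections import Counter

# Hypothetical transactional data: one row per website visit (customer, ISO week).
visits = [
    ("alice", "2024-W01"),
    ("alice", "2024-W01"),
    ("bob", "2024-W01"),
    ("alice", "2024-W02"),
]

def to_panel(transactions):
    """Count transactions per (individual, period): one panel row per pair."""
    counts = Counter(transactions)
    return [
        {"customer": customer, "week": week, "visits": n}
        for (customer, week), n in sorted(counts.items())
    ]

panel = to_panel(visits)
for row in panel:
    print(row)
```

Note that this sketch only produces rows for individual-period pairs that had at least one transaction; a full panel would also need rows with zero visits.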
Data Structure: Tidy Data
Rules?
Put your data on a single table according to the following rules:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
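As a rough illustration of the three rules, the sketch below reshapes a "wide" table, where the year variable is hidden in the column names, into a tidy one. The table contents and the helper name `tidy` are invented for this example.

```python
# Hypothetical "wide" table: the year variable is spread across column names.
wide = [
    {"city": "Lisbon", "2022": 10, "2023": 12},
    {"city": "Porto", "2022": 7, "2023": 9},
]

def tidy(rows, id_col, var_name, value_name):
    """Reshape wide rows so each (id, variable) pair becomes its own row."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue  # the identifier is repeated on every output row
            long_rows.append({id_col: row[id_col], var_name: col, value_name: value})
    return long_rows

tidy_rows = tidy(wide, id_col="city", var_name="year", value_name="sales")
for row in tidy_rows:
    print(row)
```

After reshaping, `city`, `year`, and `sales` each have their own column, and each city-year observation has its own row.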
Structured Data (Variety)
- Qualitative/Categorical Data
Nominal: categories have no natural order
-> e.g., race, gender, country
Ordinal: there is a natural ordering of the categories
-> e.g., age bracket, satisfaction level
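One practical consequence of the nominal/ordinal distinction is that ordinal categories can be sorted by rank, while nominal ones cannot be meaningfully ranked. A small sketch, assuming a hypothetical three-level satisfaction scale:

```python
# Hypothetical satisfaction levels listed from lowest to highest.
SATISFACTION_ORDER = ["low", "medium", "high"]

def rank(level):
    """Return the position of an ordinal category in its natural order."""
    return SATISFACTION_ORDER.index(level)

responses = ["high", "low", "medium", "low"]
sorted_responses = sorted(responses, key=rank)
print(sorted_responses)
```

No comparable ordering list exists for a nominal variable such as country, so any ranking of it would be arbitrary.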
Structured Data (Variety)
- Quantitative Data
Discrete: countable number of distinct values
-> e.g., age, number of kids
Continuous: any value within an interval
-> e.g., wage, temperature
Unstructured Data (Variety)
Text-based documents (e.g., tweets, webpages, complaints, emails, etc.)
Images / Videos
Unstructured Data (Variety)
Methods to transform unstructured data into structured data:
- Topic Modeling (text)
- Sentiment Analysis (text)
- Feature extraction (image / video / sound)
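To make the idea of turning text into structured data concrete, here is a deliberately simple word-counting sentiment score. It is a toy sketch, not a real sentiment model: the word lists `POSITIVE` and `NEGATIVE` and the sample complaints are invented for illustration.

```python
# Toy lexicons, invented for illustration; real analyses use far larger ones.
POSITIVE = {"great", "love", "fast"}
NEGATIVE = {"broken", "slow", "refund"}

def sentiment_score(text):
    """Count positive minus negative words: one numeric value per document."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

complaints = ["Great service, love it!", "Slow delivery and a broken box."]

# Each unstructured text becomes a row with a numeric "sentiment" column.
structured = [{"text": t, "sentiment": sentiment_score(t)} for t in complaints]
for row in structured:
    print(row)
```

The output is a table with one row per document and a quantitative column, which can then be analyzed like any other structured variable.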
Data Quality (Veracity)
Data quality can be affected in two major ways:
- Missing Data
- Measurement Error
Missing Data (Veracity)
- Missing observations
- Missing values in some observations
Reasons for missing data (Veracity)
- Missing at random
If data are missing at random, the remaining observations are still a representative sample of the population
Simplest Solution: listwise deletion, i.e., delete all observations that do not have values for all variables in the analysis
- Missing not at random
If data are missing not at random, then the remaining observations are not a representative sample of the population
No Simple Solution
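Listwise deletion, as described above, can be sketched as a simple filter. The records and the helper name `listwise_delete` are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "wage": 52000},
    {"age": None, "wage": 48000},
    {"age": 29, "wage": None},
    {"age": 41, "wage": 61000},
]

def listwise_delete(rows, variables):
    """Keep only rows that have a value for every variable in the analysis."""
    return [r for r in rows if all(r.get(v) is not None for v in variables)]

complete = listwise_delete(records, ["age", "wage"])
print(len(records), "rows before,", len(complete), "rows after")
```

The filter is only appropriate when data are missing at random; if they are missing not at random, the rows that remain are not a representative sample.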
Selection Bias (Veracity) occurs…
when the sampling procedure is not random, and thus the sample is not representative of the population
Selection Bias (Veracity)
- Self-selection
some members of the population are more likely to be included in the sample because of their characteristics
e.g., participants in a voluntary insurance program
- Attrition
some observations are less likely to remain in the sample as time passes
e.g., tendency to look only at firms that survive
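The effect of looking only at surviving firms can be illustrated with a tiny made-up dataset. The returns below are invented numbers chosen to make the bias visible, not real data.

```python
from statistics import mean

# Hypothetical firms: annual return and whether the firm survived the period.
firms = [
    {"ret": 0.12, "survived": True},
    {"ret": 0.08, "survived": True},
    {"ret": -0.30, "survived": False},
    {"ret": -0.10, "survived": False},
]

# Average over the full population versus over survivors only.
mean_all = mean(f["ret"] for f in firms)
mean_survivors = mean(f["ret"] for f in firms if f["survived"])

print(f"all firms: {mean_all:.2f}, survivors only: {mean_survivors:.2f}")
```

In this sketch the survivors-only average is positive while the full-population average is negative, which is the kind of distortion attrition can introduce.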
Measurement Error (Veracity) occurs…
when the data collected contain errors that are non-random