Chapter 1: Origins of Data Flashcards

1
Q

What are the alternative names for:
1. Data table
2. Observations
3. Variables

A
  1. The data matrix
  2. Cases
  3. Features
2
Q

What does a data table consist of?

A

Rows and columns. Each row is an observation and each column is a variable; each cell holds the value of that variable for that observation.

3
Q

What is the common format for data tables, and what does it stand for?

A

CSV.
“Comma-separated values”

4
Q

What are csv files?

A

Text files containing a data table, with rows and columns. Rows are separated by end-of-line characters and columns are separated by a delimiter (e.g. a comma or semicolon). They can be imported into virtually all statistical software.
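
A minimal sketch of how such a file is read, using Python's standard `csv` module. The file contents, the column names and the `read_table` helper are made up for illustration; the delimiter is passed explicitly because it varies from file to file.

```python
import csv
import io

# Hypothetical file contents: semicolon-delimited, first row holds variable names.
raw_text = "family_id;size;income\n1;4;52000\n2;2;38000\n"

def read_table(text, delimiter=","):
    """Parse delimited text into a list of dicts, one dict per observation (row)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)

rows = read_table(raw_text, delimiter=";")
for row in rows:
    # Every value is read as text; convert to numbers before analysis.
    print(row["family_id"], int(row["size"]), int(row["income"]))
```

Each line of the file becomes one observation, and each delimited field becomes the value of one variable.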

5
Q

What are the 3 types of observation structures?

A
  1. Cross-sectional
  2. Time series
  3. Multi-dimensional
6
Q

What are the 5 features of xsec data?

A
  • Observations come from the same time period and refer to different units, e.g. different families.
  • Ideally, all observations in an xsec dataset are observed at the same time; in practice, within a particular time interval.
  • When the interval is narrow, it is treated as a single point in time.
  • In most xsec data, the ordering of observations in the dataset doesn’t matter.
  • It has the simplest data structure.
7
Q

What are the 2 features of tseries data?

A
  • Observations refer to a single unit observed multiple times, e.g. a shop’s monthly sales
  • There is a natural ordering of the observations
8
Q

What is an alternative name for multi-dimensional data?

A

Panel data

9
Q

What is the common type of panel data?

A

Longitudinal data, also called cross-section time series data (xt data). It has many units, each observed multiple times.

10
Q

What are 2 examples of xt data?

A

Countries observed repeatedly over several years; employees of a firm observed on a monthly basis.

11
Q

How can multi-dimensional datasets be represented in table formats for xt data? Explain.

A

The most convenient format has one observation per unit per time period (e.g. country-year observations), so that each unit (country) is represented by multiple observations.

In xt data tables, observations are identified by two ID variables: one for the xsec units and one for time.
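
A small sketch of this long format, with made-up country-year rows. The variable names (`country`, `year`, `gdp`) and the `ids_are_unique` helper are assumptions for illustration; the check confirms that the two ID variables together identify each observation.

```python
# Hypothetical country-year observations in long format:
# one row per unit (country) per time period (year).
observations = [
    {"country": "A", "year": 2019, "gdp": 1.0},
    {"country": "A", "year": 2020, "gdp": 1.1},
    {"country": "B", "year": 2019, "gdp": 2.0},
    {"country": "B", "year": 2020, "gdp": 2.2},
]

def ids_are_unique(rows, unit_key, time_key):
    """Return True if each (unit, time) pair appears at most once."""
    pairs = [(row[unit_key], row[time_key]) for row in rows]
    return len(pairs) == len(set(pairs))

unique = ids_are_unique(observations, "country", "year")
print(unique)
```

Neither ID variable is unique on its own here: country "A" appears twice and so does the year 2019.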

12
Q

What is balanced xt data?

A

When all xsec units have observations for the very same time periods.

13
Q

What is unbalanced xt data?

A

When some xsec units are observed more times than others
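
The balanced/unbalanced distinction can be checked mechanically. The sketch below assumes a long-format table stored as a list of dicts; the `is_balanced` helper and the example rows are hypothetical.

```python
from collections import defaultdict

def is_balanced(rows, unit_key, time_key):
    """Return True if every unit is observed in exactly the same set of time periods."""
    periods_by_unit = defaultdict(set)
    for row in rows:
        periods_by_unit[row[unit_key]].add(row[time_key])
    period_sets = list(periods_by_unit.values())
    return all(periods == period_sets[0] for periods in period_sets)

# Hypothetical panels: firm "B" is missing 2021 in the second one.
balanced_panel = [
    {"firm": "A", "year": 2020}, {"firm": "A", "year": 2021},
    {"firm": "B", "year": 2020}, {"firm": "B", "year": 2021},
]
unbalanced_panel = [
    {"firm": "A", "year": 2020}, {"firm": "A", "year": 2021},
    {"firm": "B", "year": 2020},
]

print(is_balanced(balanced_panel, "firm", "year"))
print(is_balanced(unbalanced_panel, "firm", "year"))
```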

14
Q

Name and explain the other important feature of data

A

Level of Aggregation of Observations.
Data with information on people may have observations at different levels, e.g. age is at the individual level, home location is at the family level, and real estate prices may be available only as averages for zip code areas.

Time series data on transactions may have observations for each transaction, or for transactions aggregated over some time period.
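
As a rough illustration of changing the level of aggregation, the sketch below sums transaction-level observations into monthly totals. The transactions, the ISO date strings and the `monthly_totals` helper are all made up for this example.

```python
from collections import defaultdict

# Hypothetical transaction-level observations (one row per transaction).
transactions = [
    {"date": "2023-01-05", "amount": 20.0},
    {"date": "2023-01-19", "amount": 35.0},
    {"date": "2023-02-02", "amount": 10.0},
]

def monthly_totals(rows):
    """Aggregate transaction amounts to one observation per calendar month."""
    totals = defaultdict(float)
    for row in rows:
        month = row["date"][:7]  # "YYYY-MM" prefix of an ISO date string
        totals[month] += row["amount"]
    return dict(totals)

totals = monthly_totals(transactions)
print(totals)
```

After aggregation the table has one observation per month instead of one per transaction, so within-month detail is no longer available.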

15
Q

Define the “garbage in - garbage out” principle

A

Summarises the prime importance of data quality.
The result of an analysis cannot be better than the data it uses.

16
Q

What are the 6 key aspects of data quality?

A
  1. Content
  2. Validity
  3. Reliability
  4. Comparability
  5. Coverage
  6. Unbiased Selection

VRUCCC:
Real Value Understands Crap Crap Crap

or

Cash Value Comes from Reducing Useable Credit

17
Q

Define “Content”

A

What the variables truly measure

18
Q

Define “Validity”

A

Whether the variables measure what they are supposed to

19
Q

Define “Reliability”

A

Whether the variables would lead to the same value if measured in the same way again.

20
Q

Define “Comparability”

A

The extent to which the variables are measured the same way across different observations

21
Q

Define “Coverage”

A

Is complete if all observations that were intended to be included are in the data

22
Q

Define “Unbiased Selection”

A

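If coverage is incomplete, the observations that are included should be similar to all the observations that were intended to be covered. Otherwise, the data suffers from selection bias.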

23
Q

Define “Selection Bias”

A

Where the observations in the data are systematically different from the full set of observations that were intended to be covered.
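
A toy numeric illustration, with invented numbers: suppose only higher-income households end up in the data. The selection rule and the income values below are assumptions chosen to make the arithmetic easy to follow.

```python
# Hypothetical population of household incomes (in thousands).
population = [20, 30, 40, 50, 60]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Assumed selection rule, for illustration only:
# households with income below 40 never make it into the data.
selected = [income for income in population if income >= 40]

population_mean = mean(population)
selected_mean = mean(selected)
bias = selected_mean - population_mean

print(population_mean, selected_mean, bias)
```

The selected observations overstate average income because selection depends on the very variable being measured.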

24
Q

What is an API?

A

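Application Programming Interface.

A software interface that lets code request data from a source, so the data can be loaded directly into a statistical software package.

Automated data collection of this kind is generally preferable to, and less costly than, manual data collection.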

25
Q

Define Web Scraping

A

Collecting data from the web using code

26
Q

Explain the features of web scraping

A

Well-written web scraping code can load and extract data from multiple web pages.
Some websites are easier to scrape than others, depending on the structure/presentation of info
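
A minimal sketch of the idea using Python's standard `html.parser`. To keep it self-contained, the "pages" are hard-coded HTML strings rather than downloads; the hotel names, prices, and the `TableCellParser` and `scrape_rows` names are made up for illustration.

```python
from html.parser import HTMLParser

class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:  # tolerate a cell that appears outside any row
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data

def scrape_rows(html):
    """Extract table rows from one page as lists of cell strings."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows if row]

# Hypothetical pages that share the same simple table structure.
pages = [
    "<table><tr><td>Hotel A</td><td>120</td></tr><tr><td>Hotel B</td><td>95</td></tr></table>",
    "<table><tr><td>Hotel C</td><td>150</td></tr></table>",
]

all_rows = []
for page in pages:
    all_rows.extend(scrape_rows(page))
print(all_rows)
```

The same parser works on every page only because the pages share one structure; a site that presents its information differently would need different extraction rules.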

27
Q

Why is collecting data from admin sources important?

A

Because the variables they measure are highly reliable and coverage is often complete, which also makes the datasets large.

28
Q

What are the 2 pros of admin sources for data?

A
  1. Low costs (especially low marginal costs)
  2. Many observations
29
Q

What are the 2 cons of admin sources for data?

A
  1. Typically includes few variables and misses many that may be useful for analysis
  2. Important variables may have low validity; their content may be quite different from what analysts would want to measure.
30
Q

Explain what surveys are and the 3 types

A

A process in which people (respondents) are asked questions and their answers are recorded.

  1. Self-administered surveys (e.g. web surveys): respondents answer questions on their own
  2. Interviews (personal, telephone): involve an interviewer and respondents
  3. Mixed-mode surveys: use different modes for different respondents, or for different parts of the survey with the same respondents
31
Q

Define Population

A

The set of all observations relevant for the analysis

32
Q

Define Sampling Frame

A

The list of all observations from which the sample is drawn

33
Q

Define the Sample

A

The subset for which data is collected

34
Q

What is a representative sample?

A

One in which the distributions of all variables are very similar to those in the population.

35
Q

How would you assess if a sample is representative?

A

By benchmarking: comparing statistics that are available for both the sample and the population.

However, a close match is a good indicator, not a guarantee.
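
A sketch of one such comparison, assuming the population mean of a variable (here, age) is known from an outside source. The numbers, the tolerance and the `benchmark` helper are hypothetical, and what counts as "close enough" is a judgment call.

```python
def benchmark(sample_values, population_mean, tolerance):
    """Compare a sample mean with a known population mean.

    Returns (sample_mean, gap, within_tolerance).
    """
    sample_mean = sum(sample_values) / len(sample_values)
    gap = sample_mean - population_mean
    return sample_mean, gap, abs(gap) <= tolerance

# Hypothetical ages in the sample; population mean assumed known from a census.
sample_ages = [30, 40, 50]
sample_mean, gap, close_enough = benchmark(sample_ages, population_mean=41, tolerance=2)
print(sample_mean, gap, close_enough)
```

Passing this check for one variable says nothing about variables that cannot be benchmarked, which is why it is not a guarantee.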

36
Q

Define Benchmarking

A

Comparing the sample with the population on variables for which something is known about the population.

37
Q

Define Random Sampling

A

A sampling process in which observations are selected by chance; it is the process most likely to lead to a representative sample.
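
A minimal sketch of drawing a simple random sample from a sampling frame with Python's standard `random` module. The frame of 100 numbered units, the sample size and the `draw_random_sample` helper are assumptions for illustration; the seed is set only to make the draw repeatable.

```python
import random

def draw_random_sample(sampling_frame, sample_size, seed=None):
    """Draw a simple random sample without replacement from the sampling frame."""
    rng = random.Random(seed)
    return rng.sample(sampling_frame, sample_size)

# Hypothetical sampling frame: unit IDs 1 to 100.
sampling_frame = list(range(1, 101))
sample = draw_random_sample(sampling_frame, sample_size=10, seed=42)
print(sample)
```

Every unit in the frame has the same chance of selection, so whether a unit is drawn does not depend on any of its variables.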

38
Q

What is the four V’s definition of Big Data?

A
  1. Volume (scale of data)
  2. Variety (different forms)
  3. Velocity (real-time data collection)
  4. Veracity (accuracy)

The fourth is often left out because it relates to data quality.

39
Q

What are the 3 features of Big Data?

A
  1. Massive = contains many observations and/or variables
  2. Complex = doesn’t fit into a single data table
  3. Continuous = often automatically and continuously collected and stored.

Often too big or complex to be stored or processed on a single computer with standard software.

40
Q

What are the 2 advantages of Big Data?

A
  1. Cheap to collect, because it comes from existing sources
  2. Coverage is often high, sometimes complete, which reduces or eliminates selection bias.

Typically, Big Data is collected for purposes other than analysis.

41
Q

What are the 4 good practices for data collection?

A
  1. Piloting data collection
  2. Assessing the validity and reliability of the important variables, e.g. through cognitive interviews or test-retest measurement, when feasible and economical
  3. Examining sources of imperfect coverage to assess potential selection bias
  4. Working in teams with experts to design data collection
42
Q

Explain the legal and ethical issues of data collection

A
  • Legal and ethical rules need to be fully observed.
  • Consulting experts before collecting data is good practice.
  • Just because data is online and published doesn’t mean it is free and legal to use.
  • Important rules include ensuring confidentiality and observing ownership rights.