Chapter 1: Origins of Data Flashcards by S E

What are the alternative names for:
1. Data table
2. Observations
3. Variables

1. Data matrix
2. Cases
3. Features

What does a data table consist of?

Rows and columns. Each row is an observation; each column is a variable that holds a specific piece of information about that observation.

What is the common format for data tables, and what does it stand for?

CSV, which stands for “comma-separated values”.

What are csv files?

Text files that store a data table as rows and columns. Rows are separated by end-of-line characters and columns by a delimiter (e.g. a comma or semicolon). They can be imported into virtually all statistical software.
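
A minimal sketch of reading such a file, assuming Python's standard `csv` module. The semicolon-delimited text, the column names and the values are all made up for illustration:

```python
import csv
import io

# Hypothetical CSV content: each row ends with a newline, columns are split by ";".
raw_text = "family_id;size;income\n1;4;52000\n2;2;31000\n3;5;47000\n"

# csv.DictReader treats the first row as the variable (column) names.
reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
rows = list(reader)

# Each row is one observation; each key is one variable.
variables = reader.fieldnames
n_observations = len(rows)
```

Values are read as text, so numeric variables such as `income` would still need converting before analysis.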

What are the 3 types of observation structures?

Cross-sectional
Time series
Multi dimensional

What are the 5 features of xsec data?

Observations come from the same time period and refer to different units, e.g. different families.
Ideally, all observations in an xsec dataset are observed at exactly the same time; in practice this means a particular time interval.
When that interval is narrow, it is treated as a single point in time.
In most xsec data, the ordering of observations in the dataset doesn’t matter.
It is the simplest data structure.

What are the 2 features of tseries data?

Observations refer to a single unit observed multiple times, e.g. a shop’s monthly sales.
There is a natural ordering of the observations

What is an alternative name for multi-dimensional data?

Panel data

What is the common type of panel data?

Longitudinal data, also called cross-sectional time series data (xt data): it has many units, each observed multiple times.

What are 2 examples of xt data?

Countries observed repeatedly over several years; employees of a firm observed on a monthly basis.

How can multi-dimensional datasets be represented in table formats for xt data? Explain.

The most convenient format has one observation represent one unit observed at one time (e.g. country-year observations), so that each unit (country) is represented by multiple observations.

In xt data tables, observations are identified by two ID variables: one for the xsec unit and one for time.
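
The long format described above can be sketched in Python as follows. The countries, years and the `gdp_growth` variable are hypothetical; the point is only that the pair of ID variables identifies each observation:

```python
from collections import defaultdict

# Hypothetical country-year panel in long format: one row = one unit at one time.
panel = [
    {"country": "A", "year": 2019, "gdp_growth": 1.8},
    {"country": "A", "year": 2020, "gdp_growth": -3.1},
    {"country": "B", "year": 2019, "gdp_growth": 2.4},
    {"country": "B", "year": 2020, "gdp_growth": -5.0},
]

# The two ID variables (unit and time) together should identify each observation.
ids = [(row["country"], row["year"]) for row in panel]
ids_are_unique = len(ids) == len(set(ids))

# One unit is represented by multiple observations, one per time period.
years_by_country = defaultdict(list)
for row in panel:
    years_by_country[row["country"]].append(row["year"])
```

Checking that the ID pairs are unique is a simple first sanity check on a table of this shape.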

What is balanced xt data?

When all xsec units have observations for the very same time periods.

What is unbalanced xt data?

When some xsec units are observed more times than others
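
One way to check whether xt data is balanced is sketched below. The function name `is_balanced` and the example panels are hypothetical; the check follows the definition of balanced data (every unit observed in the very same time periods), which is slightly stricter than counting observations per unit:

```python
from collections import defaultdict

def is_balanced(observations):
    """Return True if every unit is observed in exactly the same time periods."""
    periods_by_unit = defaultdict(set)
    for unit, period in observations:
        periods_by_unit[unit].add(period)
    period_sets = list(periods_by_unit.values())
    return all(periods == period_sets[0] for periods in period_sets)

# Hypothetical (unit, period) ID pairs.
balanced_panel = [("A", 2019), ("A", 2020), ("B", 2019), ("B", 2020)]
unbalanced_panel = [("A", 2019), ("A", 2020), ("B", 2020)]
```

Here `is_balanced(balanced_panel)` is true, while `is_balanced(unbalanced_panel)` is false because unit B is missing 2019.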

Name and explain the other important feature of data

Level of aggregation of observations.
Data with information on people may have observations at different levels, e.g. age is at the individual level, home location is at the family level, and real estate prices may be available only as averages for zip code areas.

Time series data on transactions may have observations for each transaction, or for transactions aggregated over some time period.
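
The transaction example can be sketched as follows, assuming made-up transaction records and aggregation to monthly totals:

```python
from collections import defaultdict

# Hypothetical transaction-level observations: (date as YYYY-MM-DD, amount).
transactions = [
    ("2023-01-05", 20.0),
    ("2023-01-17", 35.5),
    ("2023-02-02", 10.0),
    ("2023-02-20", 4.5),
    ("2023-02-28", 30.0),
]

# Aggregate to the monthly level: one observation per month instead of per transaction.
monthly_sales = defaultdict(float)
for date, amount in transactions:
    month = date[:7]  # keep only the YYYY-MM part
    monthly_sales[month] += amount
```

The five transaction-level observations become two monthly observations; the detail within each month is lost in the aggregated version.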

Define the “garbage in - garbage out” principle

Summarises the prime importance of data quality.
The result of an analysis cannot be better than the data it uses.

What are the 6 key aspects of data quality?

Content
Validity
Reliability
Comparability
Coverage
Unbiased Selection

VRUCCC:
Real Value Understands Crap Crap Crap

OR/

Cash Value Comes from Reducing Useable Credit

Define “Content”

What the variables truly measure

Define “Validity”

Whether the variables measure what they are supposed to

Define “Reliability”

Whether the variables would lead to the same value if measured in the same way again.

Define “Comparability”

The extent to which the variables are measured the same way across different observations

Define “Coverage”

Coverage is complete if all observations that were intended to be included are in the data.

Define “Unbiased Selection”

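Matters when coverage is incomplete: the observations that are included should be similar to all the observations that were intended to be covered. If they are not, the data suffers from selection bias.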

Define “Selection Bias”

Where the observations in the data are systematically different from the full set of observations that were intended to be covered.
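
A small made-up illustration of selection bias: suppose lower-income units are less likely to end up in the data. The incomes and inclusion flags below are hypothetical:

```python
from statistics import mean

# Hypothetical population: (monthly income, whether the unit ended up in the data).
population = [
    (1000, False), (1500, False), (2000, True),
    (2500, True), (3000, True), (5000, True),
]

# Mean over everyone who was intended to be covered.
population_mean = mean(income for income, _ in population)

# Mean over the units actually observed, which over-represent higher incomes.
observed_mean = mean(income for income, included in population if included)
```

The observed mean (3125) overstates the population mean (2500) because inclusion in the data is related to income.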

What is API?

Application Programming Interface.

An API lets code request data from a provider and load it directly into a statistical software package.

Automated data collection is generally preferable to, and less costly than, manual data collection.

Define Web Scraping

Collecting data from the web using code

Explain the feature of web scraping

Well-written web scraping code can load and extract data from multiple web pages. Some websites are easier to scrape than others, depending on the structure/presentation of info
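
A minimal sketch of the extraction step, using Python's standard `html.parser`. The page contents, the `price` class and the `PriceParser` name are hypothetical, and a real scraper would first download the pages; how easy this is depends on how consistently the site marks up the information:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the number inside every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data))

# Hypothetical contents of two already-downloaded pages.
pages = [
    '<ul><li>Hotel A <span class="price">89.0</span></li></ul>',
    '<ul><li>Hotel B <span class="price">120.5</span></li>'
    '<li>Hotel C <span class="price">75.0</span></li></ul>',
]

# The same code extracts data from every page with the same structure.
prices = []
for html in pages:
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    prices.extend(parser.prices)
```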

Why is collecting data from admin sources important?

Because the variables they measure tend to be highly reliable and coverage is often complete, which also makes the datasets large.

What are the 2 pros of admin sources for data?

1. Low costs (especially low marginal costs)
2. Many observations

What are the 2 cons of admin sources for data?

1. Typically include few variables and miss many that may be useful for analysis
2. Important variables may have low validity: their content may be quite different from what analysts want to measure

Explain what surveys are and the 3 types

A process in which people (respondents) are asked questions and their answers are recorded.
1. Self-administered surveys (e.g. web surveys): respondents answer the questions on their own
2. Interviews (personal or telephone): involve an interviewer and respondents
3. Mixed-mode surveys: use different modes for different respondents, or for different parts of the survey with the same respondents

Define Population

The set of all observations relevant for the analysis

Define Sampling Frame

The list of all observations from which the sample is drawn

Define the Sample

The subset for which data is collected

What is a representative sample?

One in which the distributions of all variables are very similar to those in the population.

How would you assess if a sample is representative?

Benchmark statistics that are available for both the sample and the population. Agreement is a good indicator of representativeness, but not a guarantee.

Define Benchmarking

Comparing sample statistics with population statistics for variables about which something is known in the population.
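
A sketch of benchmarking on one variable, assuming the population shares of three age groups are known (for example from a census). All numbers here are made up:

```python
# Hypothetical benchmark: share of each age group known for the population.
population_share = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

# Hypothetical sample: one age group recorded per respondent.
sample = ["18-34"] * 45 + ["35-64"] * 45 + ["65+"] * 10

# Share of each age group in the sample.
sample_share = {
    group: sample.count(group) / len(sample) for group in population_share
}

# Gap between sample and population shares, rounded for readability.
gaps = {
    group: round(sample_share[group] - population_share[group], 2)
    for group in population_share
}
```

In this made-up case the youngest group is over-represented by 15 percentage points. Small gaps would be a good sign, but only for the variables that can be benchmarked.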

Define Random Sampling

The sampling process that is most likely to lead to a representative sample.
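
A minimal sketch of drawing a simple random sample from a sampling frame with Python's standard `random` module. The frame of 1,000 unit IDs, the sample size and the seed are arbitrary assumptions:

```python
import random

# Hypothetical sampling frame: IDs of all units from which the sample may be drawn.
sampling_frame = list(range(1, 1001))

# A fixed seed makes the draw reproducible; the value 42 is arbitrary.
rng = random.Random(42)

# Draw 100 distinct units, each equally likely to be selected.
sample = rng.sample(sampling_frame, k=100)
```

Because selection does not depend on any characteristic of the units, the sample is likely, though not guaranteed, to resemble the frame.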

What is the four V's definition of Big Data?

1. Volume (scale of data)
2. Variety (different forms)
3. Velocity (real-time data collection)
4. Veracity (accuracy)
The fourth V is often left out because it relates to data quality.

What are the 3 features of Big Data?

1. Massive: contains many observations and/or variables
2. Complex: doesn't fit into a single data table
3. Continuous: often collected and stored automatically and continuously
It is often too big or too complex to be stored on a single ordinary drive.

What are the 2 advantages of Big Data?

1. Cheaper to collect, because it comes from existing sources
2. Coverage is often high, sometimes complete, which reduces or eliminates selection bias
Note that Big Data is typically collected for purposes other than analysis.

What are the 4 good practices for data collection?

1. Piloting data collection
2. Assessing the validity and reliability of the important variables, e.g. by cognitive interviews or test-retest measurement, when feasible and economical
3. Examining sources of imperfect coverage to assess potential selection bias
4. Working in teams with experts to design data collection

Explain the legal and ethical issues of data collection

- Legal and ethical rules need to be fully observed.
- Consulting experts before collecting data is good practice.
- Just because data is online and published doesn't mean it is free and legal to use.
- Important rules include ensuring confidentiality and observing ownership rights.

(42 cards)