Quantitative data analysis Flashcards

1
Q

What is statistics and what are two types?

A

The science of collecting and analyzing data for drawing conclusions and making decisions
1. Descriptive statistics
2. Inferential statistics

2
Q

What is descriptive statistics?

A

It is a method of organizing, summarizing and presenting data in a convenient and informative way
- For example through graphs or numbers

3
Q

What is inferential statistics?

A
  • A branch of statistics that allows us to make predictions, estimates or generalizations about a population based on a sample
  • Statistical inference is the process by which we acquire information about populations from samples
4
Q

What is the difference between probability and statistics?

A

Probability is deductive: given what is in the box (the population), you can work out what is likely to end up in your hand (the sample)

Statistics is inductive: given what is in your hand (the sample), you can infer what is likely to be in the box (the population)

5
Q

Qualitative (categorical) data representation vs quantitative data representation

A

Qualitative data representation means data is grouped into non-numerical and descriptive categories, and is then used to compare categories or proportions. Ex: Car colors

Quantitative data representation involves data that includes numbers and measurable quantities, and is used to analyze distributions, patterns or correlations. Ex: Car speeds

6
Q

Common tools for qualitative data representation

A
  • Summary table
  • Bar chart
  • Pie chart
  • Pareto diagram (contains both bars and lines)
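
A minimal sketch of how a summary table and the numbers behind a Pareto diagram could be computed. The car colors are made-up illustration data. The bars of a Pareto diagram are the counts sorted from largest to smallest, and the line is the cumulative percentage.

```python
from collections import Counter

# Hypothetical observations of a qualitative variable (car colors).
colors = ["red", "blue", "red", "white", "red",
          "blue", "black", "red", "white", "blue"]

counts = Counter(colors)          # summary table: category -> frequency
total = sum(counts.values())

# Pareto rows: categories sorted by frequency, with share and cumulative share.
pareto_rows = []
cumulative = 0
for color, count in counts.most_common():
    cumulative += count
    pareto_rows.append(
        (color, count, 100 * count / total, 100 * cumulative / total)
    )

for color, count, share, cum_share in pareto_rows:
    print(f"{color:<6} {count:>2} {share:5.1f}% {cum_share:6.1f}%")
```

The same counts could feed a bar chart or pie chart. Only the sorting and the cumulative column are specific to the Pareto diagram.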
7
Q

Common tools for quantitative data representation

A
  • Scatter plot
  • Histogram
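
A small sketch of what a histogram does with quantitative data: it groups values into equal-width bins and counts them. The speeds, the bin width and the helper name `histogram_counts` are assumptions for illustration.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        # Left edge of the bin that contains v.
        left = start + bin_width * int((v - start) // bin_width)
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical car speeds in km/h.
speeds = [62, 70, 70, 75, 80, 85, 90, 108]
bins = histogram_counts(speeds, bin_width=10, start=60)

for left, count in bins.items():
    print(f"{left:>3}-{left + 10:<3} {'#' * count}")
```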
8
Q

What are three good practices when presenting data?

A
  1. Clearly labeled with title, labeled variables and specified units
  2. Source of data is identified
  3. Data have a date
9
Q

What are 10 good practices when visualizing data?

A
  1. Identify target audience
  2. Make sure the data is clean
  3. Select the right chart
  4. Label the chart effectively
  5. Emphasize the important points
  6. Choose the best dashboard
  7. Format your chart for accessibility
  8. Make use of color
  9. Ensure data is readable in all formats
  10. Accept feedback
10
Q

What are two key concepts in numerical representation of data?

A
  1. Measure of location: Describe where the data is centered or positioned, key measures to do this are mean, median, mode and quartiles
  2. Measure of variability and dispersion: Describe how spread out or dispersed the data is around the center, key measures to do this are range, variance, standard deviation, coefficient of variation and box plots
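
A short sketch of how the measures listed above could be computed with Python's standard `statistics` module. The speed values are invented for illustration, and the quartiles use the module's default method (other quartile conventions give slightly different values).

```python
import statistics

# Hypothetical sample: car speeds in km/h.
speeds = [62, 70, 70, 75, 80, 85, 90, 108]

# Measures of location
mean = statistics.mean(speeds)
median = statistics.median(speeds)
mode = statistics.mode(speeds)
q1, q2, q3 = statistics.quantiles(speeds, n=4)   # quartile cut points

# Measures of variability and dispersion
data_range = max(speeds) - min(speeds)
variance = statistics.variance(speeds)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(speeds)       # sample standard deviation
coeff_var = std_dev / mean               # coefficient of variation (unitless)

print("mean:", mean, "median:", median, "mode:", mode)
print("quartiles:", q1, q2, q3)
print("range:", data_range, "variance:", variance)
print("std dev:", std_dev, "coefficient of variation:", coeff_var)
```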
11
Q

What are box plots and why are they useful?

A

Shows the median, quartiles and outliers (important!)
- A graphical representation of dispersion, skewness, outliers and other prominent features in data using quartiles

They are useful because if the median is closer to the bottom or top of the box, it suggests skewness

12
Q

How do you compare boxplots?

A
  1. Median: Compare the position of the median lines, higher median = higher central tendency (determine how typical values (medians) differ between datasets)
  2. IQR (box length): Compare the size of the box, longer box = greater variability in the middle 50% of data (the middle half of the values covers a wider range -> bigger box)
  3. Whiskers: Compare length of whiskers, longer whiskers = greater spread in tails (the values outside the box reach further out -> longer whiskers)
  4. Outliers: Compare number and position of outliers, more outliers = presence of extreme values
13
Q

How to construct boxplots

A
  1. Organize the n observations in the data set from smallest to largest
  2. Separate the smallest half and the largest half
  3. Find the lower fourth (median of the smaller data half)
  4. Find the upper fourth (median of the larger data half)
  5. Find the fourth spread fs = upper fourth - lower fourth. An observation is an outlier if it lies more than 1.5fs from the nearest fourth, and an extreme outlier if it lies more than 3fs from the nearest fourth
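
The steps above can be sketched as follows. Two assumptions not stated on the card: when n is odd the median is included in both halves, and outliers are measured by their distance from the nearest fourth. The function names and the sample are hypothetical.

```python
import statistics

def fourths(data):
    """Return (lower fourth, median, upper fourth).

    Assumption: when n is odd, the median belongs to both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2            # size of each half
    lower_half = xs[:half]
    upper_half = xs[n - half:]
    return (statistics.median(lower_half),
            statistics.median(xs),
            statistics.median(upper_half))

def classify_outliers(data):
    """Return (fourth spread, mild outliers, extreme outliers)."""
    lower, _, upper = fourths(data)
    fs = upper - lower
    mild, extreme = [], []
    for x in data:
        # Distance beyond the nearest fourth (0 if x lies inside the box).
        distance = max(lower - x, x - upper, 0)
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return fs, mild, extreme

sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 20, 40]
print(fourths(sample))
print(classify_outliers(sample))
```

In this sample the box runs from 5.5 to 10.5, so fs = 5. The value 20 lies 9.5 above the upper fourth (more than 1.5fs) and 40 lies 29.5 above it (more than 3fs).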
14
Q

What are main concepts in inferential statistics?

A
  • Sampling: Taking a subset (sample) of a population to estimate the characteristics of the whole population
  • Estimation: Using sample data to estimate population parameters like mean, proportion, etc
  • Hypothesis testing: Testing assumptions (hypotheses) about a population using sample data
  • Confidence intervals: A range of values used to estimate the true value of a population parameter with a given level of confidence (e.g. 95%)
15
Q

What is a sample?

A

An observed subset of a population

16
Q

What does statistical inference include?

A
  • Estimation: Determine the value of a population parameter based on sample statistics
  • Hypothesis testing: To determine whether there is enough statistical evidence in favor of a certain belief about a parameter
17
Q

What does 95% confidence level mean?

A

If we repeat the sampling process many times and build an interval from each sample, about 95% of those intervals would contain the true population mean
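
A small simulation sketch of this interpretation. All numbers (true mean, standard deviation, sample size, number of repetitions) are made up, and the interval assumes a normal population with known standard deviation, which is a simplification.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEAN, SIGMA, N, REPEATS = 50.0, 10.0, 30, 2000
z = statistics.NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% level

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    sample_mean = statistics.mean(sample)
    margin = z * SIGMA / N ** 0.5            # half-width of the interval
    if sample_mean - margin <= TRUE_MEAN <= sample_mean + margin:
        covered += 1

coverage = covered / REPEATS
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95. Any single interval either contains the true mean or it does not, so the 95% describes the procedure rather than one interval.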

18
Q

What is the purpose of hypothesis testing and how do you formulate a hypothesis?

A

The purpose is to determine whether there is enough statistical evidence in favor of a certain belief about a parameter.

Method
- A null hypothesis, H0, is formulated and is assumed to be true
- The alternative hypothesis, Ha, is a claim contrary to H0
- The possible conclusions from a hypothesis test are then to REJECT H0 (if there is enough statistical evidence that it is not true) or FAIL TO REJECT H0 (if there is not enough statistical evidence to conclude that H0 is not true)
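
A minimal sketch of the method, using a two-sided z-test for a mean. The test type, the known standard deviation, the 5% significance level and the data are all assumptions chosen for illustration, not something the card prescribes.

```python
import statistics

def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0 (sigma assumed known)."""
    n = len(sample)
    z = (statistics.mean(sample) - mu0) / (sigma / n ** 0.5)
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision

# Hypothetical measurements with sample mean 54.
sample = [52, 55, 49, 58, 54, 56, 53, 57, 51, 55]

print(one_sample_z_test(sample, mu0=50, sigma=3))   # H0: mean is 50
print(one_sample_z_test(sample, mu0=54, sigma=3))   # H0: mean is 54
```

With H0: mean = 50 the sample mean is far from the hypothesized value relative to its standard error, so H0 is rejected. With H0: mean = 54 the test statistic is 0 and H0 is not rejected, which is not the same as proving H0 true.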

19
Q

What are two concepts of modeling analysis?

A
  1. Linear regression/regression analysis
  2. Machine learning
20
Q

Describe the steps in model development and two types of models

A
  1. Specification
  2. Estimation
  3. Validation
  4. Application

Two types
1. Linear (positive or negative linear relationship)
2. Non-linear

21
Q

Regression analysis

A

A simple supervised learning method that models the relationship between variables (often interpreted as cause and effect) and provides predictions
- Explains the effect of the independent variable X on the dependent variable Y

Linear regression is a type of regression analysis and has both a deterministic component and a probabilistic (error) component.
- Assumes that the dependence of Y on X is linear

22
Q

What three steps determine the estimated coefficients in linear regression?

A
  1. Drawing a sample from the population of interest
  2. Calculating sample statistics
  3. Producing a straight line that cuts through the data
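
A sketch of these three steps for one independent variable, assuming the line is fitted by ordinary least squares (the usual choice, though the card does not name the method). The data and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n                       # sample statistics
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Step 1: a (made-up) sample from the population of interest.
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

# Steps 2 and 3: compute the statistics and the fitted line.
b0, b1 = fit_line(xs, ys)
print(f"fitted line: y = {b0:.2f} + {b1:.2f} * x")
```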
23
Q

Error term assumptions in regression models

A

For a valid model, some assumptions on error terms must be fulfilled
- Independent and identically distributed (IID): Errors are independent of each other and have the same distribution across all observations
- Normally distributed (N): Errors follow a normal distribution with mean 0 and constant variance

24
Q

What are key considerations for variable selection in regression models?

A

Explanatory power: Variables should significantly explain the variation in the dependent variable
- Can be assessed using measures like R2 (coefficient of determination, more explanatory power if R2 is higher) or hypothesis test on coefficients

Explanatory power includes
- Causality: Causal relationship between independent and dependent variable
- Model performance: Evaluate the model's fit and accuracy using adjusted R2, residual analysis or other performance metrics
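
A short sketch of how R2 and adjusted R2 could be computed from observed values and model predictions. The numbers are invented, and the function names are hypothetical.

```python
def r_squared(ys, predictions):
    """Share of the variation in ys explained by the predictions."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # unexplained
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                   # total
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalize R2 for the number of independent variables k (n observations)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical observed values and predictions from a one-variable model.
ys = [3, 5, 7, 9, 11]
predictions = [3.5, 4.5, 7, 9.5, 10.5]

r2 = r_squared(ys, predictions)
adj_r2 = adjusted_r_squared(r2, n=len(ys), k=1)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")
```

Adjusted R2 is lower than R2 here because it discounts the fit for each variable used, which makes it more suitable for comparing models with different numbers of variables.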

25
Q

Describe independence/no multicollinearity

A

It means that independent variables should not be highly correlated with each other
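
One simple way to screen for this is to compute pairwise correlations between the independent variables. The sketch below does that by hand. The variable names, the data and the 0.8 cutoff are assumptions for illustration (the cutoff is a rule of thumb, not a fixed standard).

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5

# Hypothetical independent variables for a car-price model.
predictors = {
    "engine_size": [1, 2, 3, 4, 5],
    "weight": [2.1, 3.9, 6.2, 8.0, 9.9],   # moves almost in step with engine_size
    "age": [5, 1, 4, 2, 3],
}

THRESHOLD = 0.8  # assumed rule-of-thumb cutoff
flagged = [
    (a, b)
    for a, b in combinations(predictors, 2)
    if abs(pearson(predictors[a], predictors[b])) > THRESHOLD
]
print("Highly correlated pairs:", flagged)
```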

26
Q

What is important to remember with sample data and outliers?

A

Data need to be sufficient in size, be representative and free of bias

Larger samples are likely to lead to better estimates, but data collection is expensive. This means you should collect as much data as needed to provide the required level of statistical confidence.

It is also important to identify and handle outliers in sample data, for example using diagnostics like residual plots or boxplots

27
Q

What is the regression model building procedure?

A
  1. Theoretical analysis -> Hypothesized model specification: Form a theory or hypothesis about how the variables are related
  2. Initial examination prior to model building: Explore and prepare data before building the regression model, select a sample and check for outliers, linearity and multicollinearity
  3. Model building with regression equation: Specify the regression model, estimate coefficients, check the t-values, assess model fit (e.g. with R2), test IID assumptions on error terms
  4. Make predictions: Only within the range of the data the model was fitted on
28
Q

What are the four types of machine learning?

A
  • Supervised learning, learns from labeled data
  • Semi-supervised learning, learns from labeled and unlabeled data
  • Unsupervised learning, learns from unlabeled data
  • Reinforcement learning, learns through interactions with environment
29
Q

What are three types of machine learning models?

A
  1. Conventional machine learning (e.g. linear regression)
  2. Neural networks
  3. Deep neural networks
    3.1 Neural network structure with feature extraction layers and a classification/regression layer
    3.2 Applications for text, image and video