Quantitative data analysis Flashcards by Julia Engel

What is statistics and what are two types?

The science of collecting and analyzing data for drawing conclusions and making decisions
1. Descriptive statistics
2. Inferential statistics

What is descriptive statistics?

It is a method of organizing, summarizing and presenting data in a convenient and informative way
- For example through graphs or numbers

What is inferential statistics?

A branch of statistics that allows us to make predictions, estimates or generalizations about a population based on a sample
Statistical inference is the process by which we acquire information about populations from samples

What is the difference between probability and statistics?

Probability is deductive: given what is in the box (the population), you reason about what is likely to end up in your hand (the sample)

Statistics is inductive: given what is in your hand (the sample), you infer what is in the box (the population)

Qualitative (categorical) data representation vs quantitative data representation

Qualitative data representation means data is grouped into non-numerical and descriptive categories, and is then used to compare categories or proportions. Ex: Car colors

Quantitative data representation involves data that includes numbers and measurable quantities, and is used to analyze distributions, patterns or correlations. Ex: Car speeds

Common tools for qualitative data representation

Summary table
Bar chart
Pie chart
Pareto diagram (contains both bars and lines)
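
The summary table behind a Pareto diagram can be sketched in a few lines: the bars are the category counts sorted from most to least frequent, and the line is the cumulative share. The car color data and the function name `pareto_table` are made up for illustration and are not part of the original cards.

```python
from collections import Counter

# Hypothetical sample of car colors (illustrative data, not from the source).
colors = ["red", "blue", "red", "white", "black",
          "red", "blue", "white", "red", "blue"]


def pareto_table(values):
    """Return rows of (category, count, share, cumulative share), most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category, count in counts.most_common():
        cumulative += count
        rows.append((category, count, count / total, cumulative / total))
    return rows


# Counts give the bars, the cumulative share gives the line.
for category, count, share, cum_share in pareto_table(colors):
    print(f"{category:<6} {count:>3} {share:>6.0%} {cum_share:>6.0%}")
```

The same rows also serve as a plain summary table, or as the input to a bar or pie chart.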

Common tools for quantitative data representation

Scatter plot
Histogram

What are three good practices when presenting data?

Clearly labeled with title, labeled variables and specified units
Source of data is identified
Data have a date

What are 10 good practices when visualizing data?

Identify target audience
Make sure the data is clean
Select the right chart
Label the chart effectively
Emphasize the important points
Choose the best dashboard
Format your chart for accessibility
Make use of color
Ensure data is readable in all formats
Accept feedback

What are two key concepts in numerical representation of data?

Measure of location: Describes where the data is centered or positioned; key measures are mean, median, mode and quartiles
Measure of variability and dispersion: Describes how spread out the data is around the center; key measures are range, variance, standard deviation and coefficient of variation, with box plots as the graphical counterpart
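
As a rough sketch, the measures named above can be computed with Python's standard `statistics` module. The car speeds are invented for illustration, and the quartiles use the module's default interpolation method, which is only one of several conventions.

```python
import statistics

# Hypothetical car speeds in km/h (illustrative data, not from the source).
speeds = [52, 55, 55, 58, 60, 61, 64, 70, 75]

# Measures of location
mean = statistics.mean(speeds)
median = statistics.median(speeds)
mode = statistics.mode(speeds)
q1, q2, q3 = statistics.quantiles(speeds, n=4)  # quartile cut points

# Measures of variability and dispersion
data_range = max(speeds) - min(speeds)
variance = statistics.variance(speeds)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(speeds)      # sample standard deviation
coeff_var = std_dev / mean              # coefficient of variation (unitless)

print(f"mean={mean:.2f} median={median} mode={mode} quartiles=({q1}, {q2}, {q3})")
print(f"range={data_range} variance={variance:.2f} sd={std_dev:.2f} cv={coeff_var:.2%}")
```

The coefficient of variation divides the standard deviation by the mean, which makes the spread of datasets with different units or scales comparable.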

What are box plots and why are they useful?

Shows the median, quartiles and outliers (important!)
- A graphical representation of dispersion, skewness, outliers and other prominent features in data using quartiles

They are useful because if the median is closer to the bottom or top of the box, it suggests skewness

How do you compare boxplots?

Median: Compare the position of the median lines, higher median = higher central tendency (determine how typical values (medians) differ between datasets)
IQR (box length): Compare the size of the boxes, longer box = greater variability in the middle 50% of data (the middle values span a wider range -> bigger box)
Whiskers: Compare length of whiskers, longer whiskers = greater spread in tails (values outside the box reach further from it -> longer whiskers)
Outliers: Compare number and position of outliers, more outliers = presence of extreme values

How to construct boxplots

Organize the n observations in the data set from smallest to largest
Separate the smallest half and the largest half
Find the lower fourth (median of the smaller half of the data)
Find the upper fourth (median of the larger half of the data)
Find the fourth spread fs = upper fourth - lower fourth; an observation more than 1.5fs from the nearest fourth is an outlier, and one more than 3fs away is an extreme outlier
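
A minimal sketch of the construction steps, under two assumptions that the card does not spell out: when n is odd the median is counted in both halves, and an observation's distance is measured from its nearest fourth. The function names and the sample data are illustrative only.

```python
import statistics


def fourths(data):
    """Return (lower fourth, upper fourth).

    Assumption: for odd n the median belongs to both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2        # size of each half; includes the median when n is odd
    lower_half = xs[:half]
    upper_half = xs[n - half:]
    return statistics.median(lower_half), statistics.median(upper_half)


def classify_outliers(data):
    """Return (fourth spread, mild outliers, extreme outliers)."""
    lower, upper = fourths(data)
    fs = upper - lower
    mild, extreme = [], []
    for x in sorted(data):
        # Distance from the nearest fourth; zero for values inside the box.
        distance = max(lower - x, x - upper, 0)
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return fs, mild, extreme


sample = [2, 4, 5, 6, 7, 8, 9, 16, 30]  # illustrative data
fs, mild, extreme = classify_outliers(sample)
print(f"fs={fs} mild={mild} extreme={extreme}")
```

In a drawn boxplot the whiskers would then extend to the most extreme observations that are not outliers, and the outliers are plotted as individual points.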

What are main concepts in inferential statistics?

Sampling: Taking a subset (sample) of a population to estimate the characteristics of the whole population
Estimation: Using sample data to estimate population parameters like mean, proportion, etc
Hypothesis testing: Testing assumptions (hypotheses) about a population using sample data
Confidence intervals: A range of values used to estimate the true value of a population parameter with a given level of confidence (e.g. 95%)

What is a sample?

An observed subset of a population

What does statistical inference include?

Estimation: Determine the approximate value of a population parameter based on sample statistics
Hypothesis testing: To determine whether there is enough statistical evidence in favor of a certain belief about a parameter

What does 95% confidence level mean?

If we repeat the sampling process many times and build an interval from each sample, about 95% of those intervals would contain the true mean
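
The repeated-sampling reading of a confidence level can be checked with a small simulation. This sketch assumes a normally distributed population with a known standard deviation, so that a simple z-interval applies; the population values, sample size and seed are arbitrary choices for illustration.

```python
import random
import statistics


def coverage(true_mean=50.0, sigma=10.0, n=30, repeats=2000, seed=1):
    """Fraction of 95% z-intervals (known sigma) that contain the true mean."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)  # critical value, about 1.96
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(repeats):
        # Draw a fresh sample and build its interval around the sample mean.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = statistics.fmean(sample)
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / repeats


print(f"share of intervals containing the true mean: {coverage():.3f}")
```

The printed share should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure over many samples.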

What is the purpose of hypothesis testing and how do you formulate a hypothesis?

The purpose is to determine whether there is enough statistical evidence in favor of a certain belief about a parameter.

Method
- A null hypothesis, H0, is formulated and is assumed to be true
- The alternative hypothesis, Ha, is a claim contrary to H0
- The possible conclusions from a hypothesis test are then to REJECT H0 (if there is enough statistical evidence that it is not true) or FAIL TO REJECT H0 (if there is not enough statistical evidence to conclude that H0 is not true)
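
The method can be illustrated with one concrete test. The sketch below uses a two-sided one-sample z-test, which assumes the population standard deviation is known; the test choice, the significance level of 0.05 and the numbers are assumptions made for illustration, not something the card prescribes.

```python
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample z-test of H0: mu = mu0, assuming sigma is known."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Probability of a test statistic at least this extreme if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision


# H0: mu = 50, Ha: mu != 50, with hypothetical sample results.
print(z_test(sample_mean=52.5, mu0=50, sigma=10, n=64))
print(z_test(sample_mean=51.0, mu0=50, sigma=10, n=64))
```

Note that the second outcome is "fail to reject H0", not "accept H0": a lack of evidence against H0 is not evidence that H0 is true.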

What are two concepts of modeling analysis?

Linear regression/regression analysis
Machine learning

Describe the steps in model development and two types of models

Specification
Estimation
Validation
Application

Two types
1. Linear (positive or negative linear relationship)
2. Non-linear

Regression analysis

A simple method of supervised learning that models causality and provides prediction
- Explains the effect of the independent variable X on the dependent variable Y

Linear regression is a type of regression analysis and has a deterministic and probabilistic component.
- Assumes that the dependence of Y on X is linear

What three steps determine the estimated coefficients in linear regression?

Drawing a sample from population of interest
Calculating sample statistics
Producing a straight line that cuts through the data
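
One common way to carry out these three steps is ordinary least squares, which picks the line that minimizes the sum of squared vertical distances to the data. The card does not name the criterion, so least squares is an assumption here, and the sample values are invented.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    # Sample statistics: means, co-variation of x and y, variation of x.
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical sample drawn from the population of interest.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```

The fitted line always passes through the point of sample means (x_bar, y_bar), which is why the intercept follows directly from the slope.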

Error term assumptions in regression models

For a valid model, some assumptions on error terms must be fulfilled
- Independent Identically Distributed (IID): Errors are independent from each other and have the same distribution across all observations
- Normally distributed (N): Errors follow a normal distribution with mean 0 and constant variance

What are key considerations for variable selection in regression models?

Explanatory power: Variables should significantly explain the variation in the dependent variable
- Can be assessed using measures like R² (coefficient of determination; a higher R² means more explanatory power) or hypothesis tests on coefficients

Explanatory power includes
- Causality: Causal relationship between independent and dependent variable
- Model performance: Evaluate the model's fit and accuracy using adjusted R², residual analysis or other performance metrics
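
For reference, the two fit measures mentioned above can be written out directly from their usual definitions. The function names and the example numbers are illustrative; `k` stands for the number of independent variables in the model.

```python
def r_squared(ys, fitted):
    """Coefficient of determination: share of the variation in y explained by the model."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    ss_tot = sum((y - y_bar) ** 2 for y in ys)              # total variation
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Penalize R^2 for the number of independent variables k, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


observed = [2.0, 4.1, 5.9, 8.2]   # hypothetical observations
predicted = [2.1, 4.0, 6.0, 8.0]  # hypothetical model predictions
r2 = r_squared(observed, predicted)
print(f"R2={r2:.4f} adjusted R2={adjusted_r_squared(r2, n=4, k=1):.4f}")
```

Plain R² never decreases when a variable is added, so adjusted R² is the safer figure when comparing models with different numbers of independent variables.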

Describe independence/no multicollinearity

It means that independent variables should not be highly correlated with each other
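
A simple first screen for multicollinearity is to compute the pairwise correlation between the independent variables and flag pairs that are strongly correlated. The 0.8 cutoff, the variable names and the data below are arbitrary choices for illustration; a pairwise check will also miss dependence that involves three or more variables at once.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlated_pairs(variables, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical independent variables for a car price model.
predictors = {
    "engine_size": [1.0, 1.4, 1.6, 2.0, 2.5, 3.0],
    "horsepower": [70, 95, 110, 140, 180, 220],
    "age_years": [3, 9, 1, 7, 4, 6],
}

for name_a, name_b, r in correlated_pairs(predictors):
    print(f"{name_a} and {name_b} are highly correlated (r = {r:.3f})")
```

When a pair is flagged, a typical response is to drop or combine one of the two variables before estimating the model.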

What is important to remember with sample data and outliers?

Data need to be sufficient in size, representative and free of bias. Larger samples are likely to lead to better estimates, but data collection is expensive. This means you should collect as much data as needed to provide the required level of statistical confidence. It is also important to identify and handle outliers in sample data, for example using diagnostics like residual plots or boxplots

What is the regression model building procedure?

1. Theoretical analysis -> Hypothesized model specification: Form a theory or hypothesis about the relationship between variables to understand them better
2. Initial examination prior to model building: Explore and prepare data before building the regression model, select a sample and check for outliers, linearity and multicollinearity
3. Model building with regression equation: Specify the regression model, estimate coefficients, look at t-values, assess model fit (e.g. with R2), test IID assumptions on error terms
4. Make predictions: Within the range of data the model was trained on

What are the four types of machine learning?

- Supervised learning, learns from labeled data
- Semi-supervised learning, learns from labeled and unlabeled data
- Unsupervised learning, learns from unlabeled data
- Reinforcement learning, learns through interactions with an environment

What are three types of machine learning models?

1. Conventional machine learning (e.g. linear regression)
2. Neural networks
3. Deep neural networks
3.1 Neural network structure with feature extraction layer or classification/regression layer
3.2 Applications for text, image and video

Quantitative data analysis Flashcards

(29 cards)