Quantitative data analysis Flashcards
What is statistics and what are two types?
The science of collecting and analyzing data for drawing conclusions and making decisions
1. Descriptive statistics
2. Inferential statistics
What is descriptive statistics?
It is a method of organizing, summarizing and presenting data in a convenient and informative way
- For example through graphs or numbers
What is inferential statistics?
- A branch of statistics that allows us to make predictions, estimates or generalizations about a population based on a sample
- Statistical inference is the process by which we acquire information about populations from samples
What is the difference between probability and statistics?
Probability is deductive, meaning that given what is in the box (the population), you can work out what is likely to end up in your hand (the sample)
Statistics is inductive, meaning that given what is in your hand (the sample), you can infer what is likely to be in the box (the population)
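A minimal sketch of the two directions, assuming a hypothetical box of red and blue balls and a hypothetical hand of ten draws (neither comes from the cards):

```python
from fractions import Fraction

# Probability (deductive): the box contents are known, so the chance
# of what ends up in the hand can be computed exactly.
box = {"red": 30, "blue": 70}  # hypothetical known box
p_red = Fraction(box["red"], sum(box.values()))

# Statistics (inductive): only the hand (sample) is known, so the
# composition of the box is estimated from it.
hand = ["red", "blue", "blue", "red", "blue",
        "blue", "blue", "blue", "red", "blue"]  # hypothetical sample
estimated_share_red = hand.count("red") / len(hand)

print(p_red)                # exact probability of drawing red
print(estimated_share_red)  # sample-based estimate of the red share
```

The first number is a certainty about the box. The second is only an estimate and would change with a different hand.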
Qualitative (categorical) data representation vs quantitative data representation
Qualitative data representation means data is grouped into non-numerical and descriptive categories, and is then used to compare categories or proportions. Ex: Car colors
Quantitative data representation involves data that includes numbers and measurable quantities, and is used to analyze distributions, patterns or correlations. Ex: Car speeds
Common tools for qualitative data representation
- Summary table
- Bar chart
- Pie chart
- Pareto diagram (contains both bars and lines)
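As a rough illustration, a summary table and the numbers behind a Pareto diagram can be built from raw categories. The car colors below are made-up data, and the layout is just one possible sketch:

```python
from collections import Counter

# Hypothetical observations of a qualitative variable
colors = ["red", "black", "white", "black", "black",
          "white", "red", "black", "blue", "white"]

counts = Counter(colors)  # summary table: category -> frequency
total = sum(counts.values())

# Pareto ordering: categories sorted by descending frequency (the bars)
# with a running cumulative percentage (the line).
pareto_rows = []
cumulative = 0
for color, n in counts.most_common():
    cumulative += n
    pareto_rows.append((color, n, 100 * n / total, 100 * cumulative / total))

for color, n, pct, cum_pct in pareto_rows:
    print(f"{color:<6} {n:>3} {pct:>6.1f}% {cum_pct:>6.1f}%")
```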
Common tools for quantitative data representation
- Scatter plot
- Histogram
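A histogram groups numeric values into equal-width bins and counts the values in each bin. A small text-only sketch, assuming hypothetical car speeds and an arbitrarily chosen bin width of 10:

```python
speeds = [42, 47, 51, 53, 55, 58, 61, 64, 66, 72, 75, 88]  # hypothetical, km/h
bin_width = 10  # assumed bin width
lowest = 40     # assumed left edge of the first bin

bins = {}
for s in speeds:
    # Left edge of the bin that contains s
    start = lowest + ((s - lowest) // bin_width) * bin_width
    bins[start] = bins.get(start, 0) + 1

for start in sorted(bins):
    print(f"{start}-{start + bin_width - 1}: {'#' * bins[start]}")
```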
What are three good practices when presenting data?
- Clearly labeled with title, labeled variables and specified units
- Source of data is identified
- Data have a date
What are 10 good practices when visualizing data?
- Identify target audience
- Make sure the data is clean
- Select the right chart
- Label the chart effectively
- Emphasize the important points
- Choose the best dashboard
- Format your chart for accessibility
- Make use of color
- Ensure data is readable in all formats
- Accept feedback
What are two key concepts in numerical representation of data?
- Measures of location: Describe where the data is centered or positioned. Key measures are mean, median, mode and quartiles
- Measures of variability and dispersion: Describe how spread out the data is around the center. Key measures are range, variance, standard deviation and coefficient of variation, often visualized with box plots
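These measures can be computed with Python's standard `statistics` module. The data set is hypothetical, and quartile values depend on the convention used (the module's default is only one of several):

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

# Measures of location
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
quartiles = statistics.quantiles(data, n=4)  # three cut points, default convention

# Measures of variability
data_range = max(data) - min(data)
variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # sample standard deviation
coeff_var = std_dev / mean            # coefficient of variation

print(mean, median, mode, quartiles)
print(data_range, variance, round(std_dev, 3), round(coeff_var, 3))
```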
What are box plots and why are they useful?
Shows the median, quartiles and outliers (important!)
- A graphical representation of dispersion, skewness, outliers and other prominent features in data using quartiles
They are useful because if the median is closer to the bottom or top of the box, it suggests skewness
How do you compare boxplots?
- Median: Compare the position of the median lines, higher median = higher central tendency (determine how typical values (medians) differ between datasets)
- IQR (box length): Compare the size of the boxes, longer box = greater variability in the middle 50% of data (the middle values are more spread out -> bigger box)
- Whiskers: Compare length of whiskers, longer whiskers = greater spread in tails (the non-outlier values reach further from the box -> longer whiskers)
- Outliers: Compare number and position of outliers, more outliers = presence of extreme values
How to construct boxplots
- Organize the n observations in the data set from smallest to largest
- Separate the smallest half and the largest half
- Find the lower fourth (median of the smaller data half)
- Find the upper fourth (median of the larger data half)
- Find the fourth spread fs = upper fourth - lower fourth. Outliers lie more than 1.5 fs from the nearest fourth, extreme outliers lie more than 3 fs from the nearest fourth
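The construction steps above can be sketched as follows. The data are hypothetical, and the sketch assumes one common convention that the cards do not spell out: when n is odd, the median is counted in both halves.

```python
import statistics

def fourths(values):
    """Return (lower fourth, upper fourth) of the data."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2        # assumption: median belongs to both halves if n is odd
    lower_half = xs[:half]
    upper_half = xs[n - half:]
    return statistics.median(lower_half), statistics.median(upper_half)

def classify_outliers(values):
    """Return the fourth spread plus mild and extreme outliers."""
    lower, upper = fourths(values)
    fs = upper - lower
    mild, extreme = [], []
    for x in values:
        # Distance to the nearest fourth (0 for values inside the box)
        distance = max(lower - x, x - upper, 0)
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return fs, mild, extreme

data = [10, 12, 13, 14, 15, 16, 18, 27, 40]  # hypothetical observations
fs, mild, extreme = classify_outliers(data)
print(fourths(data), fs, mild, extreme)
```

Here the fourths are 13 and 18, so fs = 5. The value 27 is 9 away from the upper fourth (more than 7.5) and counts as an outlier, while 40 is 22 away (more than 15) and counts as an extreme outlier.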
What are main concepts in inferential statistics?
- Sampling: Taking a subset (sample) of a population to estimate the characteristics of the whole population
- Estimation: Using sample data to estimate population parameters like mean, proportion, etc
- Hypothesis testing: Testing assumptions (hypotheses) about a population using sample data
- Confidence intervals: A range of values used to estimate the true value of a population parameter with a given level of confidence (e.g. 95%)
What is a sample?
An observed subset of a population
What does statistical inference include?
- Estimation: Determine the value of a population parameter based on sample statistics
- Hypothesis testing: To determine whether there is enough statistical evidence in favor of a certain belief about a parameter
What does 95% confidence level mean?
If we repeat the sampling process many times and build an interval each time, about 95% of those intervals would contain the true population mean
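This repeated-sampling interpretation can be checked with a small simulation. The population values, sample size and number of repetitions below are arbitrary assumptions, and the sketch uses the simplest case of a normal population with known standard deviation:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEAN, SIGMA, N, REPEATS = 50.0, 10.0, 30, 2000  # hypothetical setup
z = statistics.NormalDist().inv_cdf(0.975)           # about 1.96 for a 95% interval

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    center = statistics.mean(sample)
    margin = z * SIGMA / N ** 0.5  # assumes sigma is known
    if center - margin <= TRUE_MEAN <= center + margin:
        covered += 1

coverage = covered / REPEATS
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95. Any single interval either contains the true mean or does not; the 95% describes the procedure, not one interval.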
What is the purpose of hypothesis testing and how do you formulate a hypothesis?
The purpose is to determine whether there is enough statistical evidence in favor of a certain belief about a parameter.
Method
- A null hypothesis, H0, is formulated and is assumed to be true
- The alternative hypothesis, Ha, is a claim contrary to H0
- The possible conclusions from a hypothesis test are then to REJECT H0 (if there is enough statistical evidence that it is not true) or FAIL TO REJECT H0 (if there is not enough statistical evidence to conclude that H0 is not true)
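A minimal sketch of the method, using a two-sided one-sample z-test as the example. The choice of test, the significance level of 0.05, the known sigma and the sample values are all assumptions made for illustration:

```python
import statistics

def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test of H0: population mean == mu0."""
    n = len(sample)
    z = (statistics.mean(sample) - mu0) / (sigma / n ** 0.5)
    # Probability of a result at least this extreme if H0 is true
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision

sample = [52, 55, 49, 58, 54, 53, 57, 56, 51, 55]  # hypothetical measurements

# H0: mean = 50, Ha: mean != 50, assumed known sigma = 4
z, p_value, decision = z_test(sample, mu0=50, sigma=4)
print(round(z, 3), round(p_value, 4), decision)
```

Note that the second outcome is phrased "fail to reject H0" rather than "accept H0": a lack of evidence against H0 is not evidence that H0 is true.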
What are two concepts of modeling analysis?
- Linear regression/regression analysis
- Machine learning
Describe the steps in model development and two types of models
- Specification
- Estimation
- Validation
- Application
Two types
1. Linear (positive or negative linear relationship)
2. Non-linear
Regression analysis
A simple method of supervised learning that models the (assumed causal) relationship between variables and provides predictions
- Explains the effect of the independent variable X on the dependent variable Y
Linear regression is a type of regression analysis and has a deterministic and probabilistic component.
- Assumes that the dependence of Y on X is linear
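For the simple case with one independent variable, the two components can be written out as below. The symbols (beta_0 for the intercept, beta_1 for the slope, epsilon for the error term) are standard textbook notation assumed here, not taken from the cards:

```latex
Y = \underbrace{\beta_0 + \beta_1 X}_{\text{deterministic component}}
  + \underbrace{\varepsilon}_{\text{probabilistic component}}
```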
What three steps determine the estimated coefficients in linear regression?
- Drawing a sample from population of interest
- Calculating sample statistics
- Producing a straight line that cuts through the data
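The cards do not say how the line is produced. A common choice, assumed here, is ordinary least squares, which picks the line that minimizes the sum of squared vertical distances to the data points. A sketch with made-up sample data:

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n  # sample statistics
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical sample drawn from the population of interest
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
print(f"y = {b0:.2f} + {b1:.2f} * x")
```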
Error term assumptions in regression models
For a valid model, some assumptions on error terms must be fulfilled
- Independent and identically distributed (IID): Errors are independent of each other and have the same distribution across all observations
- Normally distributed (N): Errors follow a normal distribution with mean 0 and constant variance
What are key considerations for variable selection in regression models?
Explanatory power: Variables should significantly explain the variation in the dependent variable
- Can be assessed using measures like R² (coefficient of determination, more explanatory power if R² is higher) or hypothesis tests on coefficients
Explanatory power includes
- Causality: Causal relationship between independent and dependent variable
- Model performance: Evaluate the model's fit and accuracy using adjusted R², residual analysis or other performance metrics
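A short sketch of how R² and adjusted R² are computed from observed and fitted values. The numbers are hypothetical, and the adjusted version uses the usual textbook formula with n observations and k independent variables:

```python
def r_squared(ys, predictions):
    """Share of the variation in y explained by the model."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # unexplained
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                   # total
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalize R² for the number of independent variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

ys = [3, 5, 7, 9, 12]                     # hypothetical observed values
predictions = [3.2, 5.1, 7.0, 8.9, 10.8]  # hypothetical fitted values

r2 = r_squared(ys, predictions)
adj_r2 = adjusted_r_squared(r2, n=len(ys), k=1)
print(round(r2, 4), round(adj_r2, 4))
```

Adjusted R² is never larger than R², which is why it is preferred when comparing models with different numbers of variables.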
Describe independence/no multicollinearity
It means that independent variables should not be highly correlated with each other
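One simple screening approach, sketched here as an assumption rather than the cards' prescribed method, is to compute the pairwise Pearson correlation between independent variables and flag pairs above a threshold. The variables, values and the 0.8 cut-off are all hypothetical:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical independent variables for a car-price model
predictors = {
    "engine_size": [1.2, 1.6, 2.0, 2.4, 3.0],
    "horsepower": [75, 105, 140, 170, 220],
    "car_age": [3, 9, 1, 7, 4],
}

THRESHOLD = 0.8  # assumed rule-of-thumb cut-off
names = list(predictors)
flagged = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(predictors[a], predictors[b])
        if abs(r) > THRESHOLD:
            flagged.append((a, b))

print(flagged)  # pairs that are candidates for dropping one variable
```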
What is important to remember with sample data and outliers?
Data need to be sufficient in size, be representative and free of bias
Larger samples are likely to lead to better estimates, but data collection is expensive. This means you should collect as much data as needed to provide the required level of statistical confidence.
It is also important to identify and handle outliers in sample data, for example using diagnostics like residual plots or boxplots
What is the regression model building procedure?
- Theoretical analysis -> Hypothesized model specification: Form a theory or hypothesis about variable relationship to understand them better
- Initial examination prior to model building: Explore and prepare data before building regression model, select a sample and check for outliers, linearity and multicollinearity
- Model building with regression equation: Specify the regression model, estimate coefficients, look at t-values, assess model fit (e.g. with R²), test IID assumptions on error terms
- Make predictions: Within range of data it was trained on
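The last step can be guarded in code so that the model is not used to extrapolate. A minimal sketch, where the function name, coefficients and range are hypothetical:

```python
def predict_within_range(x, intercept, slope, x_min, x_max):
    """Predict y for x, refusing values outside the range of the training data."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"x={x} is outside the training range [{x_min}, {x_max}]"
        )
    return intercept + slope * x

# Hypothetical fitted model y = 1.0 + 2.0 * x, estimated on x between 0 and 10
print(predict_within_range(3, 1.0, 2.0, x_min=0, x_max=10))
```

Outside the observed range there is no data to confirm that the linear relationship still holds, so the sketch raises an error instead of returning a number.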
What are the four types of machine learning?
- Supervised learning, learns from labeled data
- Semi-supervised learning, learns from labeled and unlabeled data
- Unsupervised learning, learns from unlabeled data
- Reinforcement learning, learns through interactions with environment
What are three types of machine learning models?
- Conventional machine learning (e.g. linear regression)
- Neural networks
- Deep neural networks
3.1 Neural network structure with feature extraction layer or classification/regression layer
3.2 Applications for text, image and video