Statistics Flashcards
What is a P value?
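A number describing how likely it is that data at least as extreme as yours would have occurred by random chance alone, assuming there is no real effect (the null hypothesis). We want to know this to help us judge whether the difference we observe between groups is statistically significant.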
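One way to make this concrete is a permutation test, sketched below with the standard library only. This is an illustration rather than the only way to get a p-value: the function name `permutation_p_value` and the two groups of measurements are made up for the example.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels: this is what "random chance" looks like
        # if the groups do not really differ.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Share of shuffles that were at least as extreme as the real data.
    return extreme / n_permutations


# Hypothetical measurements for two groups (made up for illustration).
group_a = [12.1, 11.8, 12.5, 12.0, 12.3]
group_b = [10.9, 11.2, 10.7, 11.0, 11.4]
p = permutation_p_value(group_a, group_b)
print(f"p-value: {p:.4f}")
```

A small p-value means that shuffled labels rarely produce a gap as large as the one observed, so chance alone is an unlikely explanation.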
What is R-squared?
R-squared is a goodness-of-fit measure for linear regression models. It describes how well the model fits the data. It essentially looks at the scatter of the data points around the fitted regression line. R-squared is always between 0 and 100%. 0 percent represents a model that does not explain any of the variation in the response variable around its mean. 100% represents a model that explains all the variation in the response variable around its mean. Usually the larger the R-squared, the better your regression model fits your observations.
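As a rough sketch, R-squared can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper names `fit_line` and `r_squared` and the x/y values are assumptions made up for this example.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(observed, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = statistics.mean(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


# Hypothetical data (made up for illustration).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]
r2 = r_squared(ys, predicted)
print(f"R-squared: {r2:.4f}")
```

Predicting the mean for every point gives an R-squared of 0, and predicting every point exactly gives 1 (100%).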
How do we assess the accuracy of a model?
Statisticians say a regression model fits the data well if the differences between the observations and the predicted values are small and unbiased. Unbiased means that the fitted values are not systematically too high or too low anywhere in the observation space.
What is a model?
A model is just a simple, mathematical way of approximating reality.
What are the steps to data analysis?
PPDAC
Problem: Identify a problem and ask the research question for solving it.
Plan: Create a plan to address the problem. Decide what tools to use and how much time is needed.
Data: What existing data do we have or what data should we collect? Do we have missing data and need to merge data from other sources?
Analysis: Collect the data, study it, use the data to make conclusions. Sometimes this is an iterative process. Collect more data.
Conclusion: What conclusions can we draw and what claims can we make based on the data? Present those to stakeholders.
What scripting/programming languages do you know?
Most comfortable with Python. Have used R for statistics and have learned Stata and SQL on a case-by-case basis. I prefer Python.
What is quantitative analysis?
Quantitative analysis just means analyzing data that is numbers-based or can be easily converted into numbers without losing meaning.
What is statistics?
The practice or science of collecting and analysing numerical data in LARGE quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.
What is quantitative analysis used for?
It is used to measure differences between groups, to assess relationships between variables, or to test hypotheses scientifically.
What is qualitative analysis?
Qualitative analysis differs from quantitative analysis in that its data cannot be reduced to numbers. It is used to capture differences in perceptions and feelings.
What is quantitative analysis powered by?
Statistics
What are the two branches of quantitative analysis?
Descriptive statistics and inferential statistics.
What is a population?
The entire group of people you are interested in studying. Example: the entire group of Tesla owners in the US.
What is a sample?
A smaller subset of the population that you can actually get access to. It is extremely unlikely that you can survey every Tesla owner in the US, so you study the owners you can reach.
What does descriptive statistics do?
Descriptive statistics focuses on describing the contents of the sample. It analyzes the slice of cake (the sample).
What does inferential statistics do?
Inferential statistics aims to make predictions about the population based on the findings within the sample. It draws conclusions about the entire chocolate cake based on what you learned from one slice.
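The slice-versus-cake idea can be sketched with a simulation. Everything here is hypothetical: the population is 10,000 simulated heights, and in real work the population mean would not be available to check against.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 10,000 simulated heights in cm (made-up numbers).
population = [rng.gauss(170, 10) for _ in range(10_000)]

# The sample is the smaller subset we can actually measure.
sample = rng.sample(population, 500)

# Descriptive statistics: summarize the sample itself.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential statistics: use the sample mean as an estimate of the
# population mean, with a standard error describing its uncertainty.
standard_error = sample_sd / len(sample) ** 0.5

# Only possible here because the population is simulated.
population_mean = statistics.mean(population)

print(f"sample mean:     {sample_mean:.2f} (standard error {standard_error:.2f})")
print(f"population mean: {population_mean:.2f}")
```

The sample mean should land close to the population mean, and the standard error gives a sense of how close to expect.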
What is the goal of descriptive statistics?
Descriptive statistics helps you describe your sample. You’re just understanding the details of that sample. You are not trying to make inferences about the entire population. This is the first step and may be the only step depending on your research question.
What is the mean?
The average.
What is the median?
The median is the middle value when the numbers in a set are lined up in order. With an even count of numbers, it is the average of the two middle values.
What is the mode?
The most frequent number in a data set.
What is the standard deviation?
This metric indicates how dispersed a set of numbers is, that is, how close the numbers are to the mean (the average). When the numbers are close to the average, the standard deviation is low. Conversely, when numbers are scattered all over the place, the standard deviation is high.
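A quick sketch of the mean, median, mode, and standard deviation using Python's built-in `statistics` module. The `scores` list is made up for illustration.

```python
import statistics

# Hypothetical data set (made up for illustration).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(scores)      # the average
median = statistics.median(scores)  # middle value of the sorted numbers
mode = statistics.mode(scores)      # most frequent value

# pstdev treats the data as the whole population (divides by n);
# stdev treats it as a sample (divides by n - 1), so it is slightly larger.
population_sd = statistics.pstdev(scores)
sample_sd = statistics.stdev(scores)

print(mean, median, mode, population_sd, round(sample_sd, 3))
```

When the data are a sample drawn from a larger population, the sample version (`stdev`) is usually the one to report.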
What is skewness?
Skewness indicates how symmetrical a range of numbers is. Do the numbers cluster into a smooth, symmetrical bell curve shape on the graph (as in a normal distribution), or do they lean to the left or right (a skewed, non-normal distribution)?
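A minimal sketch of one common skewness formula, the moment-based coefficient (third central moment divided by the second central moment to the power 1.5). Statistical software often applies a small-sample adjustment, so other tools may report slightly different values. The function name and the three data sets are made up for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (no small-sample adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data sets (made up for illustration).
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 10]  # long tail to the right: positive skew
left_tail = [1, 8, 9, 9, 10]   # long tail to the left: negative skew

for name, data in [("symmetric", symmetric),
                   ("right tail", right_tail),
                   ("left tail", left_tail)]:
    print(f"{name}: {skewness(data):.2f}")
```

A value near 0 means roughly symmetrical, a positive value means a longer tail on the right, and a negative value means a longer tail on the left.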
If the mean (72) and median (74) are quite similar, what does this suggest?
This suggests the data has a relatively symmetrical distribution: a relatively smooth distribution of rates clustered near the center.
What does the standard deviation tell you?
A high standard deviation of 10.6 tells you there is a wide spread of numbers. If you look at the data, you can see that the numbers range from 55 to 90, whereas the average/mean was 72. That is pretty far spread out. Look at a graph to see this.
What does a skewness of -0.2 tell you?
It tells you that the data is very slightly negatively skewed: a very slight lean. This makes sense because the mean and median differ only slightly (72 vs. 74), with the mean a little below the median.
Google graphs to see the difference between negative and positive skew.
Why do descriptive stats matter?
- They give you a macro and micro view of the data.
- They also help you identify errors and anomalies in the data. If the average is way higher than you expect, this is a warning sign to double-check your data.
- Descriptive stats also inform which inferential statistics you can use.
Summary: Descriptive statistics are really important, even though the methods used are quite basic.
So, to review, descriptive stats is about the details of your sample, and inferential stats aims to make inferences about the entire population. You are trying to make predictions about the entire population.
What are the common uses of inferential statistics?
- You are trying to make predictions about differences between two or more groups. Ex. height differences between groups of children who play different sports.
- You are trying to make predictions about the relationship between two or more variables. Ex. the link between body weight and doing yoga regularly.
Summary: Inferential statistics allows you to connect the dots and make inferences about what you expect to see in the real-world population based on what you observed in the sample.
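For the second use above, a relationship between two variables, a common starting point is the Pearson correlation coefficient. The sketch below computes it by hand. The function name `pearson_r` and the yoga and body weight numbers are made up for illustration and are not real findings.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: ranges from -1 (perfect negative) to 1 (perfect positive)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sample (made-up numbers, not real findings).
hours_of_yoga_per_week = [0, 1, 2, 3, 4, 5]
body_weight_kg = [82, 80, 79, 75, 74, 70]

r = pearson_r(hours_of_yoga_per_week, body_weight_kg)
print(f"correlation in the sample: {r:.2f}")
```

This number only describes the sample. Deciding whether the relationship also holds in the population is the job of a hypothesis test.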
What is inferential statistics used for?
Hypothesis testing