Statistics Flashcards

1
Q

What is a P value?

A

A number describing how likely it is that you would see data at least as extreme as yours if only random chance were at work (that is, if there were no real difference). We want to know this to help us judge whether the difference we observe between groups is statistically significant.
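One way to make the idea concrete is a permutation test, sketched below. This is an illustrative technique chosen for the example, not something the card prescribes. The function name `permutation_p_value` and the blood-pressure style numbers are made up. The sketch shuffles the group labels many times and counts how often chance alone produces a difference in means at least as large as the observed one.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Hypothetical helper for illustration: it reshuffles the pooled data
    and counts how often a random split is at least as extreme as the
    split that was actually observed.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Made-up measurements for two groups.
treated = [118, 121, 115, 119, 117, 120]
control = [128, 131, 125, 127, 133, 129]

p = permutation_p_value(treated, control)
print(f"estimated p-value: {p}")
```

A small p-value here means that random relabeling rarely produces a gap as large as the one observed.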

2
Q

What is R-squared?

A

R-squared is a goodness-of-fit measure for linear regression models. It describes how well the model fits the data. It essentially looks at the scatter of the data points around the fitted regression line. For a least-squares fit with an intercept, R-squared is between 0 and 100%. 0% represents a model that does not explain any of the variation in the response variable around its mean. 100% represents a model that explains all the variation in the response variable around its mean. Usually, the larger the R-squared, the better your regression model fits your observations.
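A minimal sketch of the calculation, assuming you already have the observed values and the model's fitted values. The function name `r_squared` and the numbers are made up for illustration. It uses the standard definition: 1 minus the residual sum of squares divided by the total sum of squares around the mean.

```python
from statistics import mean


def r_squared(observed, predicted):
    """Share of the variation around the mean that the predictions explain."""
    y_bar = mean(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


observed = [1, 2, 3, 4]

# Predictions that match the data exactly explain all the variation.
print(r_squared(observed, [1, 2, 3, 4]))          # 1.0

# Predicting the mean everywhere explains none of the variation.
print(r_squared(observed, [2.5, 2.5, 2.5, 2.5]))  # 0.0
```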

3
Q

How do we assess the accuracy of a model?

A

Statisticians say a regression model fits the data well if the differences between the observations and the predicted values are small and unbiased. Unbiased means that the fitted values are not systematically too high or too low anywhere in the observation space.

4
Q

What is a model?

A

A model is just a simple, mathematical way of approximating reality.

5
Q

What are the steps to data analysis?

A

PPDAC
Problem: Identify a problem and ask the research question for solving it.
Plan: Create a plan to address the problem. What tools to use, how much time.
Data: What existing data do we have or what data should we collect? Do we have missing data and need to merge data from other sources?
Analysis: Collect the data, study it, use the data to make conclusions. Sometimes this is an iterative process. Collect more data.
Conclusion: What conclusions can we draw and what claims can we make based on the data. Go present those to stakeholders.

6
Q

What scripting/programming languages do you know?

A

Most comfortable with Python. Have used R for statistics and have learned Stata and SQL on a case-by-case basis. I prefer Python.

7
Q

What is quantitative analysis?

A

Quantitative analysis just means analyzing data that is numbers-based or can be easily converted into numbers without losing meaning.

8
Q

What is statistics?

A

The practice or science of collecting and analysing numerical data in LARGE quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.

9
Q

What is quantitative analysis used for?

A

It is used to measure differences between groups, to assess relationships between variables, or to test hypotheses scientifically.

10
Q

What is qualitative analysis?

A

Qualitative analysis differs from quant in that it can’t be reduced to numbers, but is used to capture differences in perceptions and feelings.

11
Q

What is quantitative analysis powered by?

A

Statistics

12
Q

What are the two branches of quantitative analysis?

A

Descriptive statistics and inferential statistics.

13
Q

What is a population?

A

The entire group of people you’re interested in sampling. Example: entire group of Tesla owners in US.

14
Q

What is a sample?

A

It is extremely unlikely that you can survey every Tesla owner in the US, so a sample is the smaller subset of the population that you can actually get access to.

15
Q

What does descriptive statistics do?

A

Descriptive statistics focuses on describing the contents of the sample. It analyzes the slice of cake (the sample).

16
Q

What does inferential statistics do?

A

Inferential statistics aims to make predictions about the population based on the findings within the sample. You make predictions and draw conclusions about the entire chocolate cake based on what you learned from the one slice you sampled.

17
Q

What is the goal of descriptive statistics?

A

Descriptive statistics helps you describe your sample. You’re just understanding the details of that sample. You are not trying to make inferences about the entire population. This is the first step and may be the only step depending on your research question.

18
Q

What is the mean?

A

The average.

19
Q

What is the median?

A

Median is the midpoint when numbers are all lined up in a set.

20
Q

What is the mode?

A

The most frequent number in a data set.
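A quick sketch of the mean, median, and mode together, using Python's standard `statistics` module. The numbers are made up, with one large outlier included on purpose to show how it pulls the mean but not the median or mode.

```python
from statistics import mean, median, mode

# Made-up values with one large outlier at the end.
values = [30, 32, 35, 35, 40, 200]

print(mean(values))    # 62: the average, pulled up by the outlier
print(median(values))  # 35.0: the midpoint of the sorted values
print(mode(values))    # 35: the most frequent value
```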

21
Q

What is the standard deviation?

A

This metric indicates how dispersed a set of numbers is, that is, how close the numbers are to the mean (the average). When the numbers are close to the average, the standard deviation is low. Conversely, when numbers are scattered all over the place, the standard deviation is high.
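A small sketch comparing a tightly clustered set with a widely scattered one, both with the same mean. The data are made up. `pstdev` treats the numbers as a whole population, while `stdev` treats them as a sample (it divides by n - 1).

```python
from statistics import mean, pstdev, stdev

# Two made-up data sets with the same mean of 5.
tight = [4, 5, 5, 5, 6]
spread = [1, 3, 5, 7, 9]

print(mean(tight), mean(spread))    # both means are 5
print(stdev(tight), stdev(spread))  # the scattered set has the larger value

# A classic textbook set: mean 5, population standard deviation 2.
classic = [2, 4, 4, 4, 5, 5, 7, 9]
print(pstdev(classic))              # 2.0
```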

22
Q

What is skewness?

A

Skewness indicates how symmetrical a set of numbers is. Do they cluster into a smooth, symmetric bell curve on the graph? That shape is called a normal distribution. Or do they lean to the left or right? That is a skewed, non-normal distribution.
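A minimal sketch of one common skewness formula (the third central moment divided by the second central moment raised to the power 1.5). Statistical packages may apply a small-sample correction, so their numbers can differ slightly. The function name `skewness` and the data are made up for illustration.

```python
from statistics import mean


def skewness(values):
    """Moment-based skewness: 0 for symmetric data, negative when the
    long tail is on the left, positive when it is on the right."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


print(skewness([1, 2, 3, 4, 5]))   # 0.0, perfectly symmetric
print(skewness([1, 2, 2, 3, 10]))  # positive, long tail on the right
print(skewness([1, 8, 9, 9, 10]))  # negative, long tail on the left
```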

23
Q

If the mean (72) and median (74) are quite similar, what does this suggest?

A

This suggests the data has a relatively symmetrical distribution: a relatively smooth distribution of rates clustered near the center.

24
Q

What does the standard deviation tell you?

A

A high standard deviation of 10.6 tells you there is a wide spread of numbers. If you look at the data, you can see that the numbers range from 55 to 90, whereas remember, the average/ mean was 72. That’s pretty far spread out. Look at a graph to see this.

25
Q

What does skewness of -.2 tell you?

A

It tells you that the data is very slightly negatively skewed. A very slight lean. This makes sense because the mean (72) is only slightly below the median (74).

Google graphs and see difference between negative and positive skew.

26
Q

Why does descriptive stats matter?

A
  1. It gives you a macro and micro view of the data.
  2. It also helps you identify errors and anomalies in the data. If the average is way higher than you expect, this is a warning sign to double check your data.
  3. Descriptive stats also informs which inferential statistics you can use.

Summary: Descriptive statistics are really important, even though the methods used are quite basic.

27
Q

So, to review, descriptive stats is about the details of your sample, and inferential stats aims to make inferences about the entire population. You are trying to make predictions about the population based on your sample. What are the common uses of inferential statistics?

A
  1. You are trying to make predictions about differences between two or more groups. Ex. height differences between groups of children who play different sports.
  2. You are trying to make predictions about the relationship between two or more variables. Ex. Link between body weight and people who do yoga regularly.

Summary: Inferential statistics allows you to connect the dots and make inferences about what you expect to see in the real-world population based on what you observed in the sample.

28
Q

What is inferential statistics used for?

A

Hypothesis testing

29
Q

What is hypothesis testing?

A

To test hypotheses that predict changes or differences.

30
Q

Why is sample important for inferential statistics?

A

If your sample does not accurately represent the population you’re researching, then your findings won’t necessarily be very useful.

For example, if your population is 50% male and 50% female, you cannot make reliable inferences about it from a sample that is 80% male and 20% female, since that sample is not representative. This area of statistics is called sampling.

31
Q

What are most common methods for inferential statistics?

A

First, t tests.

32
Q

What is a t test?

A

T tests compare the means/averages of two groups of data to assess whether they are different to a statistically significant extent. In other words, the test checks whether the gap between the two means is larger than you would expect from chance, given how spread out each group is and how many observations it has. Say you want to compare the mean blood pressure between two groups of people, one that has taken a new medication and one that hasn’t, to assess whether they are significantly different. Looking at the means is not enough to draw a conclusion. You need to assess if the differences are statistically significant and that’s what t tests allow you to do.
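A rough sketch of the t statistic behind that comparison, using Welch's version (which does not assume the two groups have equal variance). The function name `welch_t_statistic` and the blood-pressure numbers are made up. Turning the statistic into a p-value requires the t distribution, which is not in Python's standard library; a statistics package such as SciPy (`scipy.stats.ttest_ind`) is normally used for that step.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_statistic(group_a, group_b):
    """Difference in means divided by its estimated standard error."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Made-up blood pressure readings.
medication = [118, 121, 115, 119, 117, 120]
no_medication = [128, 131, 125, 127, 133, 129]

t = welch_t_statistic(medication, no_medication)
print(f"t statistic: {t:.2f}")  # far from 0, so the means differ by many standard errors
```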

33
Q

What is ANOVA?

A

ANOVA is analysis of variance. This allows you to assess the differences between multiple groups, not just 2. Basically a t test, but on steroids.
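A minimal sketch of the one-way ANOVA F statistic, assuming three or more independent groups. The function name `one_way_anova_f` and the data are made up. The F statistic compares the variation between the group means with the variation inside the groups; a large F suggests that at least one group mean differs.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    k = len(groups)      # number of groups
    n = len(all_values)  # total number of observations
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up measurements for three groups.
f_stat = one_way_anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f_stat)  # 3.0
```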

34
Q

What is correlation analysis?

A

This assesses the relationship between two variables. For example, if variable A goes up, what happens to variable B? If the average temperature goes up, do ice cream sales go up too? We might expect this, and correlation analysis gives us a way to assess this scientifically.
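A small sketch of the Pearson correlation coefficient, the most common measure of linear correlation. The function name `pearson_r` and the temperature and sales figures are made up. The result runs from -1 (move in opposite directions) through 0 (no linear relationship) to +1 (move together).

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Made-up data: average temperature and ice cream sales.
temperature = [15, 18, 21, 24, 27, 30]
sales = [110, 135, 160, 170, 210, 240]

r = pearson_r(temperature, sales)
print(f"r = {r:.3f}")  # close to +1: sales rise with temperature
```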

35
Q

What is regression analysis?

A

Regression analysis is similar to correlation in that it assesses the relationship between variables, but it goes a step further: it models how a dependent variable changes as one or more independent variables change, not just whether they move together. This helps you investigate cause and effect, but a regression on its own does not prove it. Does variable A actually cause variable B’s movement? Or do they just move together naturally due to another force? Just because two variables correlate, it does not mean one causes the other.

For example, we’d expect there to be a relationship between weight and height. The more tightly results cluster together to form a line, the more correlated they are, and the stronger the relationship between the variables.

36
Q

Those were just some inferential methods. ANOVA, t test, correlation analysis, regression analysis. Each method has its own assumptions and limitations.

A

Some methods only work with normally distributed data, while others are designed specifically for data that are not normally distributed. And that’s why descriptive stats is so important. Descriptive stats is the first step to knowing which inferential methods you can and can’t use.

37
Q

How do you choose statistical methods?

A

First you consider data type & shape.

Second you consider your research questions and hypotheses.

38
Q

How to choose an inferential method?

A

First you look at the type of data collected, or the type of data you will collect. By data types, we mean the four levels of measurement: nominal, ordinal, interval, and ratio. This matters because different statistical methods require different types of data. Every method has its assumptions based on the data. For example, answers to yes/no questions are categorical data, while measurements like weight and age are numerical data. If you try to use a statistical method that doesn’t support the data type you have, your results will be largely meaningless.

39
Q

You can check which statistical methods support your data types, or you can reverse engineer the process to look at which statistical methods would give you the best insights and design your data collection strategy around this and decide which data types you need.

A
40
Q

Why is the shape of your data important to consider?

A

Does your data have a normal distribution? Is it a bell-shaped curve, or is it skewed to the left or right? Some methods are better for skewed data. This is why descriptive statistics are so important. They tell you all about the shape of your data.

41
Q

Why do your research questions and hypotheses shape which statistical method you use?

A

If you are just interested in the attributes of your sample, descriptive statistics is all you need. But if you want to understand differences between groups or relationships between variables, you’ll likely need both descriptive and inferential statistics. Your choice of methods must align with all the factors covered: shape and type of data, research questions and hypotheses.

42
Q

Why do we use qualitative research?

A

We use it to make hypotheses and theories from the ground up (it is inductive), while quant is about the hard numbers to describe differences between groups or relationships between variables (it is deductive)

43
Q

What is the relationship between quant and qual research?

A

First you can do qual research to develop hypotheses, and then you can do quant research to test them. Or you could do the reverse. You could use quant research to get the bigger picture (the what), and qual research to understand the underlying reasons (the why) for a specific trend or observation in the data. While these are distinctly different, they are not at odds with each other. It is not a competition. They can be used together. This is called mixed methods, and it can produce a high-quality piece of research.

44
Q

What are qual and quant used for?

A

Explore and understand is used for qual.

Test and measure is used for quant.

They have a different purpose and are not interchangeable. Each approach has its purpose.

45
Q

What is a research design?

A

This is basically your justification for every design choice you make. This will help you choose your approach: quant, qual, or mixed methods.

46
Q

What are the three factors you should consider?

A
  1. The nature of your research aims and questions.
  2. The methodological approaches in previous studies and literature.
  3. Practicalities and constraints.
47
Q

What are the three types of research aims?

A
  1. Exploratory (understand a situation or issue). Tends to be qual.
  2. Confirmatory (measure or quantify something/test a set of hypotheses) Tends to be quant.
  3. A mix of both. Developing set of hypotheses and testing them. Tends to be mixed methods. A mono methods approach done well is much better than a mixed methods approach done poorly.
48
Q

Why do we review research methodologies?

A

To use tried and tested methods / stand on the shoulders of giants, but don’t just fall into the norms of other studies.

There are always tradeoffs between the theoretical and practical.

Constraints (data, time, money, equipment & software, knowledge & skills)

49
Q

What is statistics?

A

Statistics is the study of how to collect, organize, analyze, and interpret numerical information and data. Statistics is the science of uncertainty and extracting information from data. We use it to make decisions. Statistics is better than just randomly guessing.

50
Q

In statistics, what are individuals and what are variables?

A

Individuals are people or objects included in a study. 5 people, 5 reports, 5 records. Individuals can also be geographic locations.

A variable is a characteristic of the individual to be measured or observed. Examples: the individual’s age, the time the individual’s record was collected, the individual’s diagnosis.

51
Q

What is a population?

A

A population is a group of people or objects with a common theme. When every member of that group is considered, it is a population.

Theme: Nurses who work at MGH (Massachusetts General Hospital).

Population: List from human resources of every nurse employed at MGH. The population is every single individual.

52
Q

What is a sample?

A

A sample is just a small portion of the population. It can be representative, meaning it reflects the makeup of the whole population, or it can be biased. If you only survey ICU nurses, that is a sample, but not a representative one. It’d be more representative to survey nurses from each department.

53
Q

What is population data?

A

In population data, data from every individual in the entire population is available. That’s a census.

In sample data, data are only available from some individuals in the population. That’s called a sample. You don’t have to study every single patient, only a sample.

54
Q

Notation for population and sample.

A
N = population size
n = sample size
55
Q

What is a parameter v. a statistic?

A

A parameter is a measure that describes the entire population. A statistic describes only a sample of a population.

Use the term parameter if it’s from a population and the term statistic if it’s from a sample.
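A quick sketch of the distinction, using a simulated population so that both numbers can be computed. Everything here is made up for illustration: the heights, the seed, and the sizes N = 10,000 and n = 100. In real studies the parameter is usually unknown, and the statistic is used to estimate it.

```python
import random
from statistics import mean

rng = random.Random(42)

# Simulated population of N = 10,000 heights in cm (made-up numbers).
population = [rng.gauss(170, 10) for _ in range(10_000)]

# A simple random sample of n = 100 individuals from that population.
sample = rng.sample(population, 100)

parameter = mean(population)  # describes the whole population
statistic = mean(sample)      # describes only the sample

print(f"N = {len(population)}, population mean (parameter) = {parameter:.1f}")
print(f"n = {len(sample)}, sample mean (statistic) = {statistic:.1f}")
```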

56
Q

What is the difference between descriptive statistics and inferential statistics?

A

To infer means to get a hint about something indirectly. Descriptive statistics are pretty easy. You can do it for samples and populations. Descriptive statistics is just the practice of organizing, picturing, and summarizing information from samples and populations. Inferential statistics, by contrast, uses a sample to draw conclusions about a population. The sample gives us a hint (lets us infer) about what the population is like.

57
Q

What is a quantitative measurement?

A

That is a numerical measurement of something. A quick check: can you make a mean out of it? If so, it is quantitative.

58
Q

What is a qualitative measurement?

A

It refers to a quality or characteristic of something. You can’t make a mean out of it. For example, you can’t make a mean out of a stage of cancer or a country. It’s just a category.

59
Q

Usually the most interesting questions cannot be answered with statistics.

A

We can’t know why people eat fast food. Instead, we can know if people who eat fast food tended to work more than 80 hrs / week.

60
Q

What is descriptive stats?

A

Descriptive stats is about looking at the past.

61
Q

What is predictive analytics?

A

Predictive analytics is about predicting the future.

62
Q

What is a model?

A

A simplified description of a system or process to assist calculations and predictions. A model is just our version of the world. A model makes a prediction to mimic what might happen in the real world. For example, the likelihood that a car insurance customer will get into an accident in the next year.

63
Q

What are the stages of modeling?

A

Exploratory analysis.
Variable selection.
Model selection.
Model evaluation: how effective is the model.
Model deployment: are we prepared to launch the model.

64
Q

What are the objectives of linear regression?

A

1) Establish if there is a relationship between two variables. As one increases, does the other increase? Is the relationship statistically significant? Is there a relationship between wage and gender?
2) Forecast new observations. Can we understand the relationship to forecast new values? What will the value of sales be over the next quarter? What will the ROI be on the new store contingent on store attributes?

65
Q

What is the dependent variable?

A

This is the variable we want to explain or forecast. We call it the dependent variable because its values depend on something. We denote it as y.

66
Q

What is the independent variable?

A

This is the variable that explains the other one. We say its values are independent. We denote it as x.

67
Q

Linear regression:

A

y = mx + b
In stats, we use a different notation: the constant comes first, and then the coefficient of x.
We call it linear because the relationship plots as a straight line on a two-dimensional plot.
y = 4 + 2x
If x has a value of 0, then y is 4, so the line crosses the vertical axis at 4.
Slope is 2. For every unit increase in x, y increases by 2 units.

68
Q

What happens if you change the slope?

A

You’re changing how sensitive y is to x, i.e. how fast y will change when x changes by one unit.

For example, y = 4 + 2x vs. y = 4 + 5x: with a slope of 5, y changes much faster. With a slope of 0, it doesn’t matter what value x has; y will always be 4. A slope of -3 is a downward slope: y decreases by 3 units for every unit increase in x.
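The slopes discussed above can be checked with a tiny sketch. The function name `line` is made up for illustration; the intercept of 4 and the slopes 2, 5, 0, and -3 come from the cards.

```python
def line(x, intercept=4.0, slope=2.0):
    """Return y on the line y = intercept + slope * x."""
    return intercept + slope * x


for slope in (2.0, 5.0, 0.0, -3.0):
    y_at_zero = line(0, slope=slope)
    change_per_unit = line(1, slope=slope) - line(0, slope=slope)
    print(f"slope {slope:+}: y at x=0 is {y_at_zero}, "
          f"y changes by {change_per_unit:+} per unit of x")
```

Every line crosses the vertical axis at 4; only the change in y per unit of x differs.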

69
Q

Data in the real world is not always linear. There are going to be errors. We try to draw a line that minimizes these errors.

A
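y = b0 + b1x + e
y is the dependent variable
x is the independent variable
b0 is the constant or intercept
b1 is x's slope or coefficient
e is the error term

The same model for observation i, written with Greek letters (\alpha for b0, \beta for b1, \varepsilon_i for e):

y_i = \alpha + \beta x_i + \varepsilon_i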
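A minimal sketch of how the line that minimizes the errors can be found, using the ordinary least squares formulas for one independent variable. The function name `fit_line` and the data are made up; the data were built by taking y = 4 + 2x and adding small errors by hand. In the code, `b0` and `b1` play the same roles as the intercept and slope in the equation above.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1*x + e."""
    mean_x, mean_y = mean(xs), mean(ys)
    b1 = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up data: y = 4 + 2x plus hand-picked errors.
xs = [0, 1, 2, 3, 4]
ys = [4.5, 5.5, 8.0, 10.5, 11.5]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # estimated error terms

print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
print("residuals:", [round(e, 2) for e in residuals])
```

The fitted intercept and slope land near 4 and 2 without matching them exactly, because the errors shift the best-fitting line slightly.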

70
Q

What explains a family’s consumption of a product?

A

Maybe how large a family is. Income and consumption. 40 observations of different families. Trying to explain consumption based on income.