Statistics Flashcards

Question 1

Q

What are measures of central tendency?

Answer

A

Median, Mean, and Mode

Median = the middle value when the values are in numerical order (the average of the two middle values when the count is even). It is more resistant to outliers than the mean

Mean = the average of the values (sum all values and divide by the number of values in the set). The mean is the least resistant to outliers

Mode = the value that occurs most often
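
A minimal sketch of the three measures using Python's standard statistics module. The values are made up for illustration; the second list adds one outlier to show that the mean shifts a lot while the median barely moves.

```python
import statistics

values = [2, 3, 3, 5, 7]         # illustrative data, not from the cards
with_outlier = values + [100]    # same data plus one extreme value

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mode_before = statistics.mode(values)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("before:", mean_before, median_before, mode_before)
print("after: ", mean_after, median_after)
```

With this data the mean moves from 4 to 20, while the median only moves from 3 to 4.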

Question 2

Q

What is variance? What is it used for?

Answer

A

Variance tells us how much the values in a dataset differ from the mean value.

A large variance indicates large variability in the data
A small variance indicates small variability in the data

Population Variance = the sum of (x - mean)^2 / n (aka the average of the squared deviations)

Sample Variance = the sum of (x - sample mean)^2 / (n - 1). Dividing by n - 1 instead of n corrects for the fact that deviations measured from the sample mean tend to understate the variability in the population

If the same constant is added to every value, the variance doesn’t change even though the mean does
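
A small sketch of both variance formulas, assuming made-up data. statistics.pvariance divides by n and statistics.variance divides by n - 1; the shifted list checks that adding a constant changes the mean but not the variance.

```python
import statistics

data = [4, 8, 6, 2]                  # illustrative values
shifted = [x + 10 for x in data]     # add the same constant to every value

pop_var = statistics.pvariance(data)      # sum of squared deviations / n
sample_var = statistics.variance(data)    # sum of squared deviations / (n - 1)

print(pop_var, sample_var)
print(statistics.mean(data), statistics.mean(shifted))
print(statistics.pvariance(shifted) == pop_var)
```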

Question 3

Q

What is the standard deviation and why do we need it?

Answer

A

Variance is built from squared deviations, so it comes out in squared units (for example cm^2), which are hard to interpret next to the original data. So we take the sq rt of the variance to get the standard deviation, which is in the same units as the data
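
A short sketch of the unit argument, assuming hypothetical heights in centimetres: the variance is in cm^2, and its square root is back in cm.

```python
import math
import statistics

heights_cm = [150, 160, 170, 180]    # illustrative values in centimetres

variance_cm2 = statistics.pvariance(heights_cm)   # units: cm^2
std_dev_cm = math.sqrt(variance_cm2)              # units: cm

print(variance_cm2, round(std_dev_cm, 2))
```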

Question 4

Q

things to review:
t-test
z-test
ANOVA
confidence intervals (inferential)

regression analysis (inferential)
correlation
r squared and RMSD

Answer

A

https://www.linkedin.com/learning/excel-statistics-essential-training-1/the-z-test-for-independent-samples?resume=false&u=36492188

https://www.khanacademy.org/math/ap-statistics/density-curves-normal-distribution-ap

https://www.datacamp.com/blog/statistics-interview-questions - make this into flashcards

Question 5

Q

what does Right skewed mean?

Answer

A

the data has a positive skew, i.e. a longer tail to the right. this usually means that the mean is greater than the median

Question 6

Q

what does Left skewed mean?

Answer

A

the data has a negative skew, i.e. a longer tail to the left. this usually means that the median is greater than the mean

Question 7

Q

what does skewness measure?

Answer

A

asymmetry of a dataset around its mean
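
One common way to put a number on this is the moment coefficient of skewness, m3 / m2^1.5. The sketch below assumes that definition (other definitions exist) and uses made-up data: a positive result means a right tail, a negative result means a left tail.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # long tail to the right
left_skewed = [-x for x in right_skewed]    # mirror image, tail to the left

print(skewness(right_skewed), skewness(left_skewed))
print(statistics.fmean(right_skewed), statistics.median(right_skewed))
```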

Question 8

Q

what is a histogram? what does it show?

Answer

A

a graphical representation of a distribution of data. it divides the data into bins or intervals and shows the frequency or count of datapoints within each bin. histograms are used for continuous data and help to identify patterns including skewness, modes, and outliers
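
A rough text-only sketch of the binning step, assuming made-up ages and a bin width of 10. Each value is mapped to the lower edge of its bin and the counts are printed as bars.

```python
from collections import Counter

ages = [21, 22, 25, 31, 34, 35, 38, 42, 47, 58]    # illustrative data
bin_width = 10

# Map each value to the lower edge of its bin, e.g. 34 -> 30.
counts = Counter((age // bin_width) * bin_width for age in ages)

for lower in sorted(counts):
    label = f"{lower}-{lower + bin_width - 1}"
    print(f"{label:>7} | {'#' * counts[lower]}")
```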

Question 9

Q

what is inferential statistics?

Answer

A

it is the use of statistics to make inferences or predictions about a population based on a random sample from that population. we use the sample data to draw conclusions about large populations

Question 10

Q

what is descriptive statistics?

Answer

A

it is the use of statistics to summarize and describe the features of the dataset including measures of central tendency.

Question 11

Q

what are the 4 main sampling methods?

Answer

A

random sampling = every member has an equal chance of being selected
systematic sampling = selecting every k-th member of the population, starting at a randomly determined point
stratified sampling = divides the population into subgroups (strata), with a random sample taken from each
cluster sampling = divides the population into clusters and randomly selects some clusters, then samples all members in them
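
The four methods can be sketched with Python's random module. The population of 100 members, the "group" field, and the sample sizes are all assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
population = [{"id": i, "group": "A" if i < 60 else "B"} for i in range(100)]

# Random sampling: every member has the same chance of being picked.
simple = random.sample(population, 10)

# Systematic sampling: every k-th member from a random starting point.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: a random sample from every subgroup.
strata = {
    "A": [p for p in population if p["group"] == "A"],
    "B": [p for p in population if p["group"] == "B"],
}
stratified = [p for members in strata.values() for p in random.sample(members, 5)]

# Cluster sampling: split into clusters, pick some, keep all of their members.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
cluster_sample = [p for cluster in random.sample(clusters, 2) for p in cluster]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))
```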

Question 12

Q

what is the central limit theorem?

Answer

A

the central limit theorem states that the sampling distribution of the sample mean will approach a normal distribution as the sample size increases, regardless of the population’s distribution, provided that the samples are independent and identically distributed (and the population has a finite variance)
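
A small simulation sketch of the theorem. The exponential population (mean 1, strongly right skewed), the sample sizes, and the 2000 repetitions are arbitrary choices for illustration. As n grows, the sample means cluster around 1 and their spread shrinks roughly like 1 / sqrt(n).

```python
import random
import statistics

random.seed(1)

def sample_mean(n):
    """Mean of n draws from a right-skewed exponential distribution (mean 1)."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

for n in (2, 30, 200):
    means = [sample_mean(n) for _ in range(2000)]
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```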

Question 13

Q

What are the differences: joint, marginal and conditional probabilities?

Answer

A

marginal = the prob of a single event occurring regardless of other events
joint = prob of 2 events occurring together at the same time
conditional = probability of an event occurring given that another event has occurred; P(A given B) = P(A and B) / P(B)
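
A worked sketch of the three probabilities from a small table of counts. The categories and numbers are invented for illustration.

```python
# Illustrative counts of 100 people by smoking status and exercise habit.
counts = {
    ("smoker", "exercises"): 10,
    ("smoker", "sedentary"): 20,
    ("non-smoker", "exercises"): 40,
    ("non-smoker", "sedentary"): 30,
}
total = sum(counts.values())

# Joint: P(smoker and exercises)
p_joint = counts[("smoker", "exercises")] / total

# Marginal: P(smoker), summing over the other variable
p_smoker = sum(c for (smoke, _), c in counts.items() if smoke == "smoker") / total

# Conditional: P(exercises given smoker) = P(smoker and exercises) / P(smoker)
p_exercise_given_smoker = p_joint / p_smoker

print(p_joint, p_smoker, round(p_exercise_given_smoker, 3))
```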

Question 14

Q

what is probability distribution? & what are the two types?

Answer

A

it describes how the values of a random variable are distributed, i.e. how likely each value or range of values is.

The two main types are:
1. discrete = ex: binomial distribution
2. continuous = ex: normal distribution

Question 15

Q

what is a normal distribution?

Answer

A

a bell shaped curve that is symmetric around the mean, so the mean, median, and mode are equal

approx 68% of the data is within 1 st dev of the mean
95% is within 2 st dev
99.7% is within 3 st dev
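
The 68 / 95 / 99.7 figures can be checked with statistics.NormalDist from the standard library. This sketch uses the standard normal (mean 0, st dev 1), but the shares are the same for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} st dev: {share:.1%}")
```

This prints roughly 68.3%, 95.4%, and 99.7%.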

Question 16

Q

what is a binomial distribution?

Answer

A

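it describes the number of successes in a fixed number of independent trials, where each trial has only 2 possible outcomes (success or failure) and the same probability of success. ex: the number of heads in 10 coin flips. (a distribution with two distinct peaks is called bimodal, which is a different concept)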
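
A minimal sketch of the binomial probability formula, P(k successes) = C(n, k) * p^k * (1 - p)^(n - k), applied to the coin flip example. The choice of 10 fair flips is an assumption for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5    # ten fair coin flips
for k in range(n + 1):
    print(k, round(binomial_pmf(k, n, p), 4))
```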

Question 17

Q

what is a p-value?

Answer

A

the p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. we use it to decide if a result is statistically significant: 0.05 is the usual significance level, and a p-value below it is considered significant
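
A sketch of how a p-value is computed for a z statistic, assuming a two-sided test and a standard normal null distribution. The observed statistic of 2.3 is hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a z statistic at least as extreme as |z| under the null."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05         # conventional significance level
z_observed = 2.3     # hypothetical test statistic
p = two_sided_p_value(z_observed)

print(round(p, 4), p < alpha)
```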

Question 18

Q

what’s the difference between type I and type II errors?

Answer

A

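type 1 error is a false positive: rejecting a null hypothesis that is actually true
type 2 error is a false negative: failing to reject a null hypothesis that is actually false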

Question 19

Q

what are examples of parametric tests?

Answer

A

t-test
z-test
ANOVA

Question 20

Q

what is a regression analysis?

Answer

A

it is a method to examine the relationship between independent and dependent variables, i.e. how the dependent variable changes as the indep variable changes

Question 21

Q

what is a residual?

Answer

A

it is the difference between the true (observed) value and the value predicted by the regression model. residuals help to determine the fit of our model

Question 22

Q

what is a coefficient in a linear regression model?

Answer

A

it is basically the slope: it describes the amount of change expected in the dependent variable for a one-unit change in the indep variable
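
A hand-rolled sketch of simple linear regression by least squares, showing the slope (the coefficient) and the residuals. The study-hours data is invented for illustration, and this covers only the single-predictor case.

```python
import statistics

hours = [1, 2, 3, 4, 5]           # independent variable (illustrative)
scores = [52, 55, 61, 64, 68]     # dependent variable (illustrative)

x_bar, y_bar = statistics.fmean(hours), statistics.fmean(scores)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
         / sum((x - x_bar) ** 2 for x in hours))
intercept = y_bar - slope * x_bar

# Residual = observed value minus the value predicted by the fitted line.
predicted = [intercept + slope * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

print(round(slope, 2), round(intercept, 2))
print([round(r, 2) for r in residuals])
```

Here the slope is about 4.1, i.e. each extra hour is associated with about 4.1 more points in this made-up data.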

Question 23

Q

what is a confidence interval?

Answer

A

a 95% confidence interval is a range computed from a sample such that, if we were to take many different samples and build an interval from each, about 95% of those intervals would contain the true population parameter.

informally: we are 95% confident the population parameter lies in the estimated interval
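
A sketch of a 95% confidence interval for a mean, computed as mean +/- critical value * standard error. The measurements are made up, and the normal critical value (about 1.96) is an assumption; for a sample this small a t critical value from a table would give a slightly wider interval.

```python
import math
import statistics
from statistics import NormalDist

sample = [98, 102, 101, 97, 103, 99, 100, 104, 96, 100]    # illustrative data
n = len(sample)
mean = statistics.fmean(sample)
std_err = statistics.stdev(sample) / math.sqrt(n)

z_crit = NormalDist().inv_cdf(0.975)    # about 1.96 for a 95% interval
lower, upper = mean - z_crit * std_err, mean + z_crit * std_err

print(round(lower, 2), round(upper, 2))
```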

Question 24

Q

what is a t-test?

Answer

A

t-test is used to compare the means of 2 groups, usually when there is a small sample size and we don’t know the population st dev. compare the t statistic to the critical value from a t table
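
A sketch of one variant, the pooled (equal variance) two-sample t-test, on invented measurements. The critical value 2.306 is the two-sided 5% value for 8 degrees of freedom read from a t table; the group data and the equal-variance assumption are illustrative.

```python
import math
import statistics

group_a = [12.1, 11.8, 12.5, 12.0, 12.3]    # illustrative measurements
group_b = [11.2, 11.5, 11.0, 11.4, 11.1]

n_a, n_b = len(group_a), len(group_b)
df = n_a + n_b - 2

# Pooled variance assumes both groups share the same population variance.
pooled_var = ((n_a - 1) * statistics.variance(group_a)
              + (n_b - 1) * statistics.variance(group_b)) / df
t_stat = ((statistics.fmean(group_a) - statistics.fmean(group_b))
          / math.sqrt(pooled_var * (1 / n_a + 1 / n_b)))

t_critical = 2.306    # two-sided 5% critical value for df = 8, from a t table
reject_null = abs(t_stat) > t_critical

print(round(t_stat, 2), reject_null)
```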

Question 25

Q

what is a z-test?

Answer

A

a test to compare the sample and population means OR 2 sample means. use when we know the standard deviation of the population and the sample size is large

use a z-score to determine if the null hypothesis is rejected
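
A sketch of a one-sample z-test, z = (sample mean - population mean) / (population st dev / sqrt(n)). All of the numbers are hypothetical, and the test is assumed to be two-sided at the 0.05 level.

```python
import math
from statistics import NormalDist

sample_mean = 103.0        # hypothetical sample mean
population_mean = 100.0    # value claimed by the null hypothesis
population_sd = 15.0       # assumed known
n = 100                    # large sample

z = (sample_mean - population_mean) / (population_sd / math.sqrt(n))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided
reject_null = p_value < 0.05

print(z, round(p_value, 4), reject_null)
```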

Question 26

Q

ANOVA = Analysis of Variance - when to use

Answer

A

this tests for significant differences between the means of 3 or more groups

compare the F statistic to the critical value from the F distribution table
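
A sketch of the one-way ANOVA F statistic, the ratio of between-group to within-group variance. The three groups are made up; the critical value 5.14 is the 5% value for (2, 6) degrees of freedom read from an F table.

```python
import statistics

groups = {                 # illustrative scores for three groups
    "A": [4, 5, 6],
    "B": [6, 7, 8],
    "C": [9, 10, 11],
}
all_values = [x for g in groups.values() for x in g]
grand_mean = statistics.fmean(all_values)
k, n = len(groups), len(all_values)

ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                 for g in groups.values())
ss_within = sum((x - statistics.fmean(g)) ** 2
                for g in groups.values() for x in g)

f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
f_critical = 5.14    # 5% critical value for (2, 6) degrees of freedom, from a table

print(round(f_stat, 2), f_stat > f_critical)
```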

Question 27

Q

what value are we looking at to determine correlation? what is correlation?

Answer

A

correlation measures the strength and direction of a relationship between 2 variables. We are looking at Pearson’s r (or the correlation coefficient)

Correlation ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation); 0 means no linear relationship
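
A minimal sketch of Pearson's r computed from its definition (the co-variation of x and y divided by the product of their spreads). The data is illustrative and chosen to hit the two extremes of the range.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    co_variation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - x_bar) ** 2 for x in xs)
                       * sum((y - y_bar) ** 2 for y in ys))
    return co_variation / spread

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))    # perfectly increasing line
print(pearson_r(x, [10, 8, 6, 4, 2]))    # perfectly decreasing line
```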

Question 28

Q

what are the cut offs for type of correlation?

Answer

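A

|r| of 0.7 or more = strong correlation
|r| from 0.4 up to 0.7 = moderate correlation
|r| below 0.4 = weak correlation

these are rules of thumb; use the absolute value of r, since the sign only gives the direction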

Question 29

Q

what is the r squared value?

Answer

A

it measures how well a regression model fits the data: it is the proportion of the variance in the dependent variable that the model explains
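
A sketch of the calculation, r squared = 1 - (residual sum of squares / total sum of squares). The observed values and the model predictions are hypothetical.

```python
import statistics

observed = [52, 55, 61, 64, 68]                # illustrative true values
predicted = [51.8, 55.9, 60.0, 64.1, 68.2]     # hypothetical model output

mean_observed = statistics.fmean(observed)
ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
ss_total = sum((y - mean_observed) ** 2 for y in observed)

r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 4))
```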

it has similar cut offs to correlation:
r-squared above 0.7 = strong fit
0.4 to 0.7 = moderate fit
below 0.4 = weak fit

Question 30

Q

what is RMSD - root mean square deviation?

Answer

A

this measures the average magnitude of the prediction error in a model (the square root of the mean of the squared errors). we want this number to be small
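
A short sketch of the calculation on hypothetical values: square each error, average the squares, then take the square root.

```python
import math

observed = [3.0, 5.0, 7.0, 9.0]       # illustrative true values
predicted = [2.0, 5.0, 8.0, 11.0]     # hypothetical model output

squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
rmsd = math.sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmsd, 3))
```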

Question 31

Q

what is HDEIP? what is important when starting a project?

Answer

A

hypothesis, data collection, exploration, insights, plan

first create a hypothesis based on your data
next create an issue tree based on that hypothesis or problem statement
then ensure your data is ready for exploration. clean up the data as necessary to effectively run tests including correlation analysis or investigative visualizations. the goal is to identify patterns and trends
ensure you are doing macro analysis before micro analysis

based on your findings create actionable insights and summarize work in a way that will best meet your audience’s needs
develop a plan to implement changes or make recommendations based on those insights

Question 32

Q

what are 3 essential questions to creating visuals?

Answer

A

1. type of data, 2. what needs to be communicated, 3. who is the end user.

for example, categorical data should be visualized using a type of bar chart or pie chart, while a continuous time series is a better fit for a line chart. if we are communicating a comparison a regular bar chart would do, but if we need to communicate composition a 100% stacked bar chart would be best.

lastly, we need to consider the audience and the level of detail they need to see. an executive likely wants high level KPIs, whereas a departmental manager may need more specifics in certain subject areas

Question 33

Q

what charts show distribution?

Answer

A

histogram & box and whiskers plot

Question 34

Q

what charts show composition?

Answer

A

pie chart, donut chart, stacked bar chart, stacked area chart

Question 35

Q

what charts show comparison?

Answer

A

bar chart, line chart, scatter plot

Question 36

Q

what are the 4 types of data for data visualization?

Answer

A

categorical data, numerical data, time series data, relational data

Question 37

Q

what charts are best for categorical data?

Answer

A

bar chart, pie chart

Question 38

Q

what charts are best for numerical data?

Answer

A

histogram, line chart, scatterplot, box plot

Question 39

Q

what charts are best for time series data?

Answer

A

line chart

Question 40

Q

what charts are best for relational data?

Answer

A

scatter plots