Statistics Flashcards
A/B testing
A/B testing is a way to compare two versions of something to find out which version performs better
Why do companies use A/B testing?
- to optimize product performance
- to improve customer experience
Descriptive Stats
- describe or summarize the main features of a dataset.
- Descriptive stats are very useful because they let you quickly understand a large amount of data.
- mean, median, etc
Summary Stats
summarize your data using a single number
2 main types of summary stats
- measures of central tendency
- measures of dispersion
Measures of central tendency
Measures of central tendency, like the mean, let you describe the center of your dataset.
Measures of dispersion
- Measures of dispersion, like standard deviation, let you describe the spread of your dataset, or the amount of variation in your data points.
- standard deviation
Inferential Stats
- allow data professionals to make inferences about a dataset based on a sample of the data.
- use samples to make inferences about populations.
2 Statistical Methods
- Descriptive
- Inferential
Population
- Population includes every possible element that you are interested in measuring.
- A parameter is a characteristic of a population.
- ex. The average height of the entire population of giraffes is a parameter.
Sample
- A sample is a subset of a population.
- A statistic is a characteristic of a sample.
- ex. The average height of a random sample of 100 giraffes is a statistic.
Parameter vs Statistic
- A parameter is a characteristic of a population (ex. the average height of all giraffes).
- A statistic is a characteristic of a sample (ex. the average height of 100 sampled giraffes).
Name 3 measures of central tendency
Mean, Median, Mode
Median
- The median is the middle value in a sorted dataset.
- This means half the values in the dataset are larger than the median and half are smaller.
Mode
- most frequently occurring value in the dataset.
- A dataset can have
- no mode,
- one mode or
- more than one mode.
When to use the mean, the median, and the mode?
Mean: numeric data with no significant outliers
Median: numeric data with outliers
Mode: categorical data
1 main disadvantage of Mean
sensitive to outliers
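To see why the mean is sensitive to outliers while the median is not, here is a minimal Python sketch. The salary figures are made up for illustration, and only the standard-library `statistics` module is used.

```python
from statistics import mean, median

# Hypothetical salaries in thousands (illustrative data, not from the deck).
salaries = [40, 42, 45, 47, 50]
with_outlier = salaries + [500]  # add one extreme value

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# The mean jumps from 44.8 to roughly 120.7; the median only moves from 45 to 46.
print(mean_before, median_before)
print(round(mean_after, 1), median_after)
```

One extreme value nearly triples the mean but barely shifts the median, which is why the median is preferred when outliers are present.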
Why use mode for categorical data?
because it clearly shows you which category occurs most frequently.
2 Measures of dispersion
- Range
- standard deviation
Range
- range is the difference between the largest and smallest value in a dataset.
- quick understanding of the overall spread of your dataset.
What does Variance measure?
- Variance is a measure of spread.
- Variance is a way to measure how spread out a set of numbers is. It tells you how much the numbers “vary” from the average (or mean).
- It is the average of the squared differences between each data point and the mean.
- It equals the standard deviation squared.
What does Standard Deviation measure and what does a larger value indicate?
- Standard deviation measures how spread out your values are from the mean of your dataset.
- The larger the standard deviation, the more spread out your values are from the mean.
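The relationship between variance and standard deviation can be sketched in Python. The dataset is made up for illustration, and the population formulas (divide by n) are assumed here; sample variance divides by n - 1 instead.

```python
from math import sqrt
from statistics import pstdev, pvariance

# Illustrative data with mean 5 (not from the deck).
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = sum(data) / len(data)

# Variance: average of the squared differences from the mean.
variance_manual = sum((x - mean_value) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance.
std_manual = sqrt(variance_manual)

print(variance_manual, std_manual)     # 4.0 2.0
print(pvariance(data), pstdev(data))   # library equivalents (population versions)
```

The manual calculation and the library's population functions agree: the variance is 4 and the standard deviation is 2.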
How are Measures of Position helpful?
They help you determine the position of a value in relation to other values in a dataset.
3 Measures of Position
- percentiles
- quartiles
- interquartile range
Percentiles
- A percentile is the measure that tells you what percentage of values in a dataset are less than or equal to a particular value.
- Percentiles show the relative position or rank of a particular value in a dataset.
- If you’re in the 75th percentile for height, about 75% of people are your height or shorter, and about 25% are taller.
- Percentiles are often used to rank test scores on school exams.
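As a rough illustration of the "less than or equal to" definition, the sketch below computes the percentile rank of one value. The function name `percentile_rank` and the test scores are hypothetical, chosen only for this example.

```python
def percentile_rank(values, x):
    """Return the percentage of values that are less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)

# Hypothetical exam scores (illustrative data).
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

rank_85 = percentile_rank(scores, 85)
print(rank_85)  # 70.0
```

Seven of the ten scores are at or below 85, so a score of 85 sits at the 70th percentile of this small dataset.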
Quartiles
- A quartile divides the values in a dataset into four equal parts.
- Quartiles let you compare values relative to the four quarters of data.
Q1
The first quartile, Q1, is the middle value in the first half of the dataset. Q1 refers to the 25th percentile. 25% of the values in the entire dataset are below Q1, and 75% are above it.
Q2
The second quartile, Q2, is the median of the dataset.
- Q2 refers to the 50th percentile. 50% of the values in the entire dataset are below Q2, and 50% are above it.
Q3
The third quartile, Q3, is the middle value in the second half of the dataset. Q3 refers to the 75th percentile. 75% of the values in the entire dataset are below Q3, and 25% are above it.
Interquartile range (IQR = Q3 - Q1)
- The IQR is the distance between the first quartile, Q1, and the third quartile, Q3.
- It is a measure of dispersion because it measures the spread of the middle half, or middle 50 percent, of your data.
- IQR is also useful for determining the relative position of your data values.
5 Number Summary
- The minimum
- The first quartile (Q1)
- The median, or second quartile (Q2)
- The third quartile (Q3)
- The maximum
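A five number summary and the IQR can be sketched with the standard-library `statistics.quantiles` function. The data is illustrative, and the `"inclusive"` method is an assumption made for this example: quartile conventions differ between textbooks and tools, so other methods can give slightly different Q1 and Q3 values.

```python
from statistics import quantiles

# Illustrative dataset (not from the deck).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1  # spread of the middle 50% of the data

print(five_number_summary)  # (1, 3.0, 5.0, 7.0, 9)
print(iqr)                  # 4.0
```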
Visualize 5 Number Summary
with a box plot
What does it suggest if the mean is close to the median?
Few or no outliers
2 main types of probability
- objective
- subjective
Objective probability
Objective probability is based on statistics, experiments, and mathematical measurements.
- 2 types
  - classical
  - empirical
Classical Probability
Classical probability is based on formal reasoning about events with equally likely outcomes.
Example: toss a fair coin. The probability of getting heads is always 1/2 = 50%.
Empirical Probability
Empirical probability is based on experimental or historical data; it represents the likelihood of an event occurring based on the previous results of an experiment or past events.
Empirical Probability & AB Testing
Data professionals rely on empirical probability to help them make accurate predictions based on sample data.
- For example, in an A/B test of a website, you test a sample of users to make a prediction about the future behavior of all users. Say the sample of users prefers a green add-to-cart button over a blue one. You may infer from this data that the larger population of future users will probably share their preference. An A/B test lets you make a reasonable prediction about future users based on empirical probability.
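A minimal sketch of the empirical probabilities behind such a test is shown here. The visitor and click counts are hypothetical numbers invented for illustration, and a real analysis would also check whether the difference is statistically significant.

```python
# Hypothetical A/B test counts (made up for illustration).
visitors = {"green": 1000, "blue": 1000}
clicks = {"green": 120, "blue": 90}

# Empirical probability = observed successes / number of trials.
p_click = {variant: clicks[variant] / visitors[variant] for variant in visitors}

print(p_click)  # {'green': 0.12, 'blue': 0.09}
```

In this made-up sample the green button has the higher observed click probability, which is the kind of evidence used to predict the behavior of future users.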
Subjective probability
Subjective probability is based on personal feelings, experience, or judgment.
3 foundational concepts of probability theory
- Random experiment
- Outcome
- Event
Random experiment
A process whose outcome cannot be predicted with certainty. For example, before tossing a coin or rolling a die, you can’t know the result of the toss or the roll. The result of the coin toss might be heads or tails. The result of the die roll might be 3 or 6.
All random experiments have three things in common:
- The experiment can have more than one possible outcome.
- You can represent each possible outcome in advance.
- The outcome of the experiment depends on chance.
Outcome
The result of a random experiment. For example, if you roll a die, there are six possible outcomes: 1, 2, 3, 4, 5, 6.
Event
A set of one or more outcomes. Using the example of rolling a die, an event might be rolling an even number. The event of rolling an even number consists of the outcomes 2, 4, 6. Or, the event of rolling an odd number consists of the outcomes 1, 3, 5.
probability of an event
The probability that an event will occur is expressed as a number between 0 and 1. Probability can also be expressed as a percent.
- If the probability of an event equals 0, there is a 0% chance that the event will occur.
- If the probability of an event equals 1, there is a 100% chance that the event will occur.
Calculate the probability of an event
# of desired outcomes ÷ total # of possible outcomes (assuming all outcomes are equally likely)
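The formula can be sketched for the event "rolling an even number" on one fair die. The variable names are illustrative, and `Fraction` is used so the result stays exact.

```python
from fractions import Fraction

# All equally likely outcomes of one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# Event: rolling an even number.
even = {outcome for outcome in sample_space if outcome % 2 == 0}

# Probability = number of desired outcomes / total number of possible outcomes.
p_even = Fraction(len(even), len(sample_space))

print(p_even)  # 1/2
```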
P(A)
The probability of event A
P(B)
The probability of event B
For any event A, 0 ≤ P(A) ≤ 1
the probability of any event A is always between 0 and 1.
P(A) > P(B)
then event A has a higher chance of occurring than event B.
P(A) = P(B)
event A and event B are equally likely to occur.
Mutually exclusive events
Two events are mutually exclusive if they cannot occur at the same time.
For example, you can’t be on the Earth and on the moon at the same time, or be sitting down and standing up at the same time.
Independent events
Two events are independent if the occurrence of one event does not change the probability of the other event. This means that one event does not affect the outcome of the other event.
For example, watching a movie in the morning does not affect the weather in the afternoon.
Three basic rules of probability
- Complement rule (mutually exclusive events)
- Addition rule (mutually exclusive events)
- Multiplication rule (independent events)
Complement rule P(A’)
The complement rule deals with mutually exclusive events. In statistics, the complement of an event is the event not occurring. The complement rule states that the probability that event A does not occur is 1 minus the probability of A.
P(A’) = 1 - P(A)
P(A’)
The probability of not A, or the probability of event A NOT occurring.
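As a quick check of the complement rule, the sketch below compares 1 minus P(rolling a 6) with a direct count of the outcomes that are not 6. The die example is an illustration chosen here.

```python
from fractions import Fraction

outcomes = range(1, 7)  # one roll of a fair six-sided die

p_six = Fraction(1, 6)

# Complement rule: probability that the event does NOT occur.
p_not_six = 1 - p_six

# Direct count of outcomes other than 6, for comparison.
p_not_six_counted = Fraction(sum(1 for o in outcomes if o != 6), 6)

print(p_not_six, p_not_six_counted)  # 5/6 5/6
```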
Addition rule
If events A and B are mutually exclusive, then the probability of A or B occurring is the sum of the probabilities of A and B.
P(A or B) = P(A) + P(B)
P(rolling 2 or rolling 4) = P(rolling 2) + P(rolling 4) = ⅙ + ⅙ = ⅓
So, the probability of rolling either a 2 or a 4 is one out of three, or about 33%.
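The die example can be verified by counting outcomes directly. This is a small sketch; the helper `prob` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction

die = range(1, 7)  # outcomes of one fair six-sided die

def prob(event):
    """Classical probability of an event (a set of outcomes) on one roll."""
    return Fraction(sum(1 for outcome in die if outcome in event), 6)

# Rolling a 2 and rolling a 4 are mutually exclusive on a single roll.
p_sum = prob({2}) + prob({4})
p_either = prob({2, 4})

print(p_sum, p_either)  # 1/3 1/3
```

Adding the two probabilities gives the same answer as counting the combined event, which only holds because the events cannot happen together.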
Multiplication rule
If events A and B are independent, then the probability of both A and B occurring is the probability of A multiplied by the probability of B.
P(A and B) = P(A)×P(B)
P(rolling 1 on the first roll and rolling 6 on the second roll) = P(rolling 1 on the first roll)×P(rolling 6 on the second roll) = ⅙×⅙ = 1/36
So, the probability of rolling a 1 and then a 6 is one out of thirty-six, or about 2.8%.
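The same result can be checked by listing every possible pair of rolls, as in this minimal sketch (the variable names are illustrative).

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first roll, second roll) pairs.
rolls = list(product(range(1, 7), repeat=2))

# Count the pairs where the first roll is 1 and the second roll is 6.
favorable = [pair for pair in rolls if pair == (1, 6)]
p_counted = Fraction(len(favorable), len(rolls))

# Multiplication rule for independent events.
p_rule = Fraction(1, 6) * Fraction(1, 6)

print(p_counted, p_rule)                 # 1/36 1/36
print(round(float(p_rule) * 100, 1))     # 2.8
```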
Conditional probability
Conditional probability applies to two or more dependent events.
P(A and B) = P(A) × P(B|A)
The vertical bar between the letters B and A indicates dependence, or that the occurrence of event B depends on the occurrence of event A. You can say this as “the probability of B given A.”
Dependent events P(B|A)
two events are dependent if the occurrence of one event changes the probability of the other event. This means that the first event affects the outcome of the second event.
For instance, if you want to get a good grade on an exam, you first need to study the course material. Getting a good grade depends on studying.
P(B|A)
the probability of B given A.
P(B|A) = P(A and B) / P(A)
probability of event B given event A equals the probability that both A and B occur divided by the probability of A.
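To make the formula concrete, here is a sketch using an example not in the deck: drawing two cards from a standard 52-card deck without replacement, where A is "the first card is an ace" and B is "the second card is an ace". The probabilities are found by counting every ordered two-card draw.

```python
from fractions import Fraction
from itertools import permutations

# A 52-card deck reduced to what matters here: 4 aces and 48 other cards.
deck = ["ace"] * 4 + ["other"] * 48

# Every ordered draw of two different cards (no replacement).
draws = list(permutations(range(52), 2))

first_ace = [d for d in draws if deck[d[0]] == "ace"]          # event A
both_aces = [d for d in first_ace if deck[d[1]] == "ace"]      # event A and B

p_a = Fraction(len(first_ace), len(draws))        # 1/13
p_a_and_b = Fraction(len(both_aces), len(draws))  # 1/221
p_b_given_a = p_a_and_b / p_a                     # 1/17, the same as 3/51

print(p_a, p_a_and_b, p_b_given_a)
```

After one ace is removed, 3 aces remain among 51 cards, so P(B|A) = 3/51 = 1/17, which matches P(A and B) / P(A).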