Statistics Flashcards

1
Q

A/B testing

A

A/B testing is a way to compare two versions of something to find out which version performs better.

2
Q

Why do companies use A/B testing?

A
  • to optimize product performance
  • to improve customer experience
3
Q

Descriptive Stats

A
  • describe or summarize the main features of a dataset.
  • Descriptive stats are very useful because they let you quickly understand a large amount of data.
  • mean, median, etc
4
Q

Summary Stats

A

summarize your data using a single number

5
Q

2 main types of summary stats

A
  1. measures of central tendency
  2. measures of dispersion
6
Q

Measures of central tendency

A

Measures of central tendency, like the mean, let you describe the center of your dataset.

7
Q

measures of dispersion

A
  • measures of dispersion, like standard deviation, let you describe the spread of your dataset, or the amount of variation in your data points.
  • examples: range, standard deviation
8
Q

Inferential Stats

A
  • allow data professionals to make inferences about a dataset based on a sample of the data.
  • use samples to make inferences about populations.
9
Q

2 Statistical Methods

A
  1. Descriptive
  2. Inferential
10
Q

Population

A
  • Population includes every possible element that you are interested in measuring.
    • A parameter is a characteristic of a population.
    • ex. The average height of the entire population of giraffes is a parameter.
11
Q

Sample

A
  • A sample is a subset of a population.
    • A statistic is a characteristic of a sample.
    • ex. The average height of a random sample of 100 giraffes is a statistic.
12
Q

Parameter vs Statistic

A
  • A parameter is a characteristic of a population (ex. the average height of all giraffes).
  • A statistic is a characteristic of a sample (ex. the average height of a random sample of 100 giraffes).
13
Q

Name 3 measures of central tendency

A

Mean, Median, Mode

14
Q

Median

A
  • The median is the middle value in a dataset.
  • This means half the values in the dataset are larger than the median and half are smaller.
15
Q

Mode

A
  • most frequently occurring value in the dataset.
  • A dataset can have
    • no mode,
    • one mode or
    • more than one mode.
16
Q

When to use the mean, the median, and the mode?

A

Mean: no outliers
Median: have outliers
Mode: categorical
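
The rule of thumb above can be illustrated with a minimal sketch using Python's standard `statistics` module. The salary figures and color labels are made up for illustration; they are not from the course material.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is an outlier.
salaries = [40, 42, 45, 47, 50, 400]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # stays near the typical value

# The mode also works on categorical data, where a mean or median is undefined.
colors = ["red", "blue", "red", "green", "red"]
most_common_color = statistics.mode(colors)

print(mean_salary, median_salary, most_common_color)
```

Here a single extreme value moves the mean far above every typical salary, while the median barely changes, which is why the median is preferred when outliers are present.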

17
Q

1 main disadvantage of Mean

A

sensitive to outliers

18
Q

Why use mode for categorical data?

A

because it clearly shows you which category occurs most frequently.

19
Q

2 Measures of dispersion

A
  • Range
  • standard deviation
20
Q

Range

A
  • The range is the difference between the largest and smallest value in a dataset.
  • It gives a quick understanding of the overall spread of your dataset.
21
Q

What does Variance measure?

A
  • A Measure of Spread
  • Variance is a way to measure how spread out a set of numbers is. It tells you how much the numbers “vary” from the average (or mean).
  • average of the squared difference of each data point from the mean.
  • standard deviation squared
22
Q

What does Standard Deviation measure and what does a larger value indicate?

A
  • Standard deviation measures how spread out your values are from the mean of your dataset.
  • The larger the standard deviation, the more spread out your values are from the mean.
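
As a rough sketch of how variance and standard deviation relate, the snippet below computes both by hand and cross-checks them against Python's `statistics` module. The data values are illustrative. It uses the population formulas (divide by n), which match the "average of the squared differences" wording; the sample versions (`statistics.variance`, `statistics.stdev`) divide by n - 1 instead.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values

mean = statistics.mean(data)

# Population variance: the average squared distance from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)
```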
23
Q

How are Measures of Position helpful?

A

help you determine the position of a value in relation to other values in a dataset.

24
Q

3 Measures of Position

A
  • percentiles
  • quartiles
  • interquartile range
25
Q

Percentiles

A
  • A percentile is the measure that tells you what percentage of values in a dataset are less than or equal to a particular value.
  • Percentiles show the relative position or rank of a particular value in a dataset.
  • If you’re in the 75th percentile for height, it means 75% of people are shorter than you, and 25% are taller.
  • (percentiles used to rank test scores on school exams.)
26
Q

Quartiles

A
  • A quartile divides the values in a dataset into four equal parts.
  • Quartiles let you compare values relative to the four quarters of data.
27
Q

Q1

A

The first quartile, Q1, is the middle value in the first half of the dataset. Q1 refers to the 25th percentile. 25% of the values in the entire dataset are below Q1, and 75% are above it.

28
Q

Q2

A

The second quartile, Q2, is the median of the dataset. Q2 refers to the 50th percentile. 50% of the values in the entire dataset are below Q2, and 50% are above it.

29
Q

Q3

A

The third quartile, Q3, is the middle value in the second half of the dataset. Q3 refers to the 75th percentile. 75% of the values in the entire dataset are below Q3, and 25% are above it.

30
Q

Interquartile range (Q3 - Q1)

A
  • is the distance between the first quartile, Q1, and the third quartile, Q3.
  • is a measure of dispersion because it measures the spread of the middle half, or middle 50 percent, of your data.
  • IQR is also useful for determining the relative position of your data values.
31
Q

5 Number Summary

A
  • The minimum
  • The first quartile (Q1)
  • The median, or second quartile (Q2)
  • The third quartile (Q3)
  • The maximum
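
The five-number summary and the interquartile range can be sketched with `statistics.quantiles` from the Python standard library. The dataset is illustrative, and the `"inclusive"` method is one of several quantile conventions; other methods can give slightly different quartile values on small datasets.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # illustrative values

# n=4 asks for the three cut points that split the data into four quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data

print(five_number_summary, iqr)
```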
32
Q

Visualize 5 Number Summary

A

with boxplot

33
Q

What does it indicate if the mean is close to the median?

A

The dataset likely has few or no outliers.

34
Q

2 main types of probability

A
  • objective
  • subjective
35
Q

Objective probability

A

Objective probability is based on statistics, experiments, and mathematical measurements.

There are 2 types:
  • classical
  • empirical

36
Q

Classical Probability

A

Classical probability is based on formal reasoning about events with equally likely outcomes.

Example: toss a fair coin. The probability of getting heads is always 1/2 = 50%.

37
Q

Empirical Probability

A

based on experimental or historical data; it represents the likelihood of an event occurring based on the previous results of an experiment or past events.

38
Q

Empirical Probability & AB Testing

A

Data professionals rely on empirical probability to help them make accurate predictions based on sample data

  • For example, in an A/B test of a website, you test a sample of users to make a prediction about the future behavior of all users. Say the sample of users prefer a green add-to-cart button over a blue one. You may infer from this data that the larger population of future users will probably share their preference. An A/B test lets you make a reasonable prediction about future users based on empirical probability.
39
Q

Subjective probability

A

Subjective probability is based on personal feelings, experience, or judgment.

40
Q

foundation of probability theory:

A
  • Random experiment
  • Outcome
  • Event
41
Q

Random experiment

A

process whose outcome cannot be predicted with certainty. For example, before tossing a coin or rolling a die, you can’t know the result of the toss or the roll. The result of the coin toss might be heads or tails. The result of the die roll might be 3 or 6.

42
Q

All random experiments have three things in common:

A
  • The experiment can have more than one possible outcome.
  • You can represent each possible outcome in advance.
  • The outcome of the experiment depends on chance.
43
Q

outcome

A

the result of a random experiment. For example, if you roll a die, there are six possible outcomes: 1, 2, 3, 4, 5, 6.

44
Q

event

A

a set of one or more outcomes. Using the example of rolling a die, an event might be rolling an even number. The event of rolling an even number consists of the outcomes 2, 4, 6. Or, the event of rolling an odd number consists of the outcomes 1, 3, 5.

45
Q

probability of an event

A

The probability that an event will occur is expressed as a number between 0 and 1. Probability can also be expressed as a percent.

  • If the probability of an event equals 0, there is a 0% chance that the event will occur.
  • If the probability of an event equals 1, there is a 100% chance that the event will occur.
46
Q

Calculate the probability of an event

A

Number of desired outcomes ÷ total number of possible outcomes (assuming all outcomes are equally likely)
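
A small sketch of this calculation, assuming a fair six-sided die so that every outcome is equally likely. The event chosen here (rolling an even number) is just an example.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}                # all possible results of one roll
even = {o for o in outcomes if o % 2 == 0}   # event: rolling an even number

# Desired outcomes divided by total possible outcomes.
p_even = Fraction(len(even), len(outcomes))

print(p_even, float(p_even))
```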

47
Q

P(A)

A

The probability of event A

48
Q

P(B)

A

The probability of event B

49
Q

For any event A, 0 ≤ P(A) ≤ 1

A

the probability of any event A is always between 0 and 1.

50
Q

P(A) > P(B)

A

then event A has a higher chance of occurring than event B.

51
Q

P(A) = P(B)

A

event A and event B are equally likely to occur.

52
Q

Mutually exclusive events

A

Two events are mutually exclusive if they cannot occur at the same time.

For example, you can’t be on the Earth and on the moon at the same time, or be sitting down and standing up at the same time.

53
Q

Independent events

A

Two events are independent if the occurrence of one event does not change the probability of the other event. This means that one event does not affect the outcome of the other event.

For example, watching a movie in the morning does not affect the weather in the afternoon.

54
Q

Three basic rules of probability

A
  • Complement rule (mutually exclusive events)
  • Addition rule (mutually exclusive events)
  • Multiplication rule (independent events)
55
Q

Complement rule P(A’)

A

The complement rule deals with mutually exclusive events. In statistics, the complement of an event is the event not occurring. The complement rule states that the probability that event A does not occur is 1 minus the probability of A.

P(A’) = 1 - P(A)

56
Q

P(A’)

A

the probability of not A, that is, the probability of event A NOT occurring.

57
Q

Addition rule

A

if events A and B are mutually exclusive, then the probability of A or B occurring is the sum of the probabilities of A and B.

P(A or B) = P(A) + P(B)

P(rolling 2 or rolling 4) = P(rolling 2) + P(rolling 4) = ⅙ + ⅙ = ⅓
So, the probability of rolling either a 2 or a 4 is one out of three, or 33%.
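
The die example above can be checked with a short sketch that applies the addition rule and then cross-checks it by counting outcomes directly. It assumes a fair six-sided die.

```python
from fractions import Fraction

p_two = Fraction(1, 6)
p_four = Fraction(1, 6)

# Addition rule: valid here because a single roll cannot be both 2 and 4.
p_two_or_four = p_two + p_four

# Cross-check by counting the favorable outcomes among all six faces.
faces = range(1, 7)
p_counted = Fraction(sum(1 for f in faces if f in (2, 4)), len(faces))

print(p_two_or_four, p_counted)
```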

58
Q

Multiplication rule

A

if events A and B are independent, then the probability of both A and B occurring is the probability of A multiplied by the probability of B.

P(A and B) = P(A)×P(B)
P(rolling 1 on the first roll and rolling 6 on the second roll) = P(rolling 1 on the first roll)×P(rolling 6 on the second roll) = ⅙×⅙ = 1/36

So, the probability of rolling a 1 and then a 6 is one out of thirty-six, or about 2.8%.
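
As a sanity check on the two-roll example, the sketch below enumerates all 36 equally likely pairs of rolls and compares the counted probability with the multiplication rule. A fair die and independent rolls are assumed.

```python
from fractions import Fraction
from itertools import product

# Every (first roll, second roll) pair: 6 x 6 = 36 equally likely outcomes.
rolls = list(product(range(1, 7), repeat=2))

favorable = [r for r in rolls if r == (1, 6)]
p_counted = Fraction(len(favorable), len(rolls))

# Multiplication rule for independent events.
p_rule = Fraction(1, 6) * Fraction(1, 6)

print(p_counted, p_rule, round(float(p_rule) * 100, 1))
```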

59
Q

Conditional probability

A

applies to two or more dependent events.

P(A and B) = P(A) * P(B|A)

the vertical bar between the letters B and A indicates dependence, or that the occurrence of event B depends on the occurrence of event A. You can say this as “the probability of B given A.”

60
Q

Dependent events P(B|A)

A

two events are dependent if the occurrence of one event changes the probability of the other event. This means that the first event affects the outcome of the second event.

For instance, if you want to get a good grade on an exam, you first need to study the course material. Getting a good grade depends on studying.

61
Q

P(B|A)

A

the probability of B given A.

P(B|A) = P(A and B) / P(A)
probability of event B given event A equals the probability that both A and B occur divided by the probability of A.
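
A minimal sketch of this formula, using a die roll as a made-up example: A is "the roll is even" and B is "the roll is greater than 3". These events are dependent, because knowing the roll is even changes the chance that it is greater than 3.

```python
from fractions import Fraction

outcomes = set(range(1, 7))
A = {o for o in outcomes if o % 2 == 0}  # even roll: {2, 4, 6}
B = {o for o in outcomes if o > 3}       # roll greater than 3: {4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

# P(B|A) = P(A and B) / P(A)
p_b_given_a = prob(A & B) / prob(A)

print(prob(B), p_b_given_a)
```

P(B) alone is 1/2, but P(B|A) is 2/3, so the occurrence of A changes the probability of B.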

62
Q

Bayes’s theorem

A

is a math formula for determining conditional probability.

For example, let’s say a medical condition is related to age. You can use Bayes’s theorem to more accurately determine the probability that a person has the condition based on age.

The prior probability would be the probability of a person having the condition.
The posterior, or updated, probability would be the probability of a person having the condition if they are in a certain age group.

63
Q

prior probability

A

refers to the probability of an event before new data is collected.

64
Q

Posterior probability

A

updated probability of an event based on new data.

65
Q

Bayes’s theorem Formula

A

P(A|B) = [ P(B|A) * P(A) ] / P(B)

P(A): Prior probability, probability of event A.
P(A|B): Posterior probability, probability of event A given event B.
P(B|A): Likelihood, probability of event B given event A,
P(B): Evidence, probability of event B.

Example (spam filter): you want to find the posterior probability, given the other three values:

  • P(Spam | Money), or posterior probability: the probability that an email is spam given that the word “money” appears in the email
  • P(Spam), or prior probability: the probability of an email being spam = 0.2, or 20%
  • P(Money), or evidence: the probability that the word “money” appears in an email = 0.15, or 15%
  • P(Money | Spam), or likelihood: the probability that the word “money” appears in an email given that the email is spam = 0.4, or 40%

P (Spam | Money) = P(Money | Spam) * P(Spam) / P(Money) = 0.4 * 0.2 / 0.15 = 0.53333, or about 53.3%.

So, the probability that an email is spam given that the email contains the word “money” is 53.3%.
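
The spam calculation above can be reproduced with a few lines of Python. The function name `bayes_posterior` is a hypothetical helper introduced here for illustration; the three input values come from the example.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Bayes's theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_spam_given_money = bayes_posterior(
    likelihood=0.4,  # P(Money | Spam)
    prior=0.2,       # P(Spam)
    evidence=0.15,   # P(Money)
)

print(round(p_spam_given_money, 3))
```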

66
Q

Discrete probability distributions

A

represent discrete random variables, or discrete events. Often, the outcomes of discrete events are expressed as whole numbers that can be counted. For example, rolling a die can result in a 2 or a 3, but not a decimal value such as 2.575 or 3.184.

67
Q

probability distribution

A

describes the likelihood of the possible outcomes of a random event.

68
Q

4 discrete probability distributions:

A
  • Uniform
  • Binomial
  • Bernoulli
  • Poisson
69
Q

Uniform distribution

A

describes events whose outcomes are all equally likely, or have equal probability. For example, rolling a die can result in six outcomes: 1, 2, 3, 4, 5, or 6. The probability of each outcome is the same: 1 out of 6, or about 16.7%.

applies to both discrete and continuous random variables.

There is no skewness present in uniform distribution graphs.

70
Q

Examples of Uniform Distribution in Machine learning and Data Science

A

Random initialization: In many machine learning algorithms, such as neural networks and k-means clustering, the initial values of the parameters can have a significant impact on the final result. Uniform distribution is often used to randomly initialize the parameters, as it ensures that all values in the range have an equal probability of being selected.
Sampling: Uniform distribution can also be used for sampling. For example, if you have a dataset with an equal number of samples from each class, you can use uniform distribution to randomly select a subset of the data that is representative of all the classes.
Data augmentation: In some cases, you may want to artificially increase the size of your dataset by generating new examples that are similar to the original data. Uniform distribution can be used to generate new data points that are within a specified range of the original data.
Hyperparameter tuning: Uniform distribution can also be used in hyperparameter tuning, where you need to search for the best combination of hyperparameters for a machine learning model. By defining a uniform prior distribution for each hyperparameter, you can sample from the distribution to explore the hyperparameter space.

71
Q

Binomial distribution

A

models the probability of events with only two possible outcomes: success (1) or failure (0). These outcomes are mutually exclusive and cannot occur at the same time.

This definition assumes the following:
* Each event is independent, or does not affect the probability of the others.
* Each event has the same probability of success.
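
The card does not give a formula, so the sketch below uses the standard binomial probability mass function, P(X = k) = C(n, k) · p^k · (1 - p)^(n - k), for k successes in n independent trials. The coin-toss numbers are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Illustrative case: exactly 3 heads in 10 tosses of a fair coin.
p_three_heads = binomial_pmf(k=3, n=10, p=0.5)

# The probabilities over all possible counts add up to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

print(p_three_heads, total)
```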

72
Q

Data professionals might use the binomial distribution to model the probability that:

A
  • a new medication generates side effects
  • a credit card transaction is fraudulent
  • a stock price rises in value

In machine learning, the binomial distribution is often used to classify data.

73
Q

Bernoulli distribution

A

The Bernoulli distribution is similar to the binomial distribution as it also models events that have only two possible outcomes (success or failure). The only difference is that the Bernoulli distribution refers to only a single trial of an experiment, while the binomial refers to repeated trials. A classic example of a Bernoulli trial is a single coin toss.

74
Q

Poisson distribution

A

models the probability that a certain number of events will occur during a specific time period.
* The number of events in the experiment can be counted.
* The mean number of events that occur during a specific time period is known.
* Each event is independent.
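
The card does not state a formula, so this sketch uses the standard Poisson probability mass function, P(X = k) = λ^k · e^(-λ) / k!, where λ is the known mean number of events per time period. The call-center rate of 2 calls per hour is a made-up value.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events in a period with mean rate lam."""
    return lam**k * exp(-lam) / factorial(k)

mean_calls_per_hour = 2  # hypothetical known mean

p_no_calls = poisson_pmf(0, mean_calls_per_hour)
p_three_calls = poisson_pmf(3, mean_calls_per_hour)

print(round(p_no_calls, 4), round(p_three_calls, 4))
```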

75
Q

Data professionals use the Poisson distribution to model data such as:

A
  • Calls per hour for a customer service call center
  • Customers per day at a shop
  • Thunderstorms per month in a city
  • Financial transactions per second at a bank
76
Q

Continuous probability distributions

A

On a continuous distribution, the x-axis refers to the value of the variable you’re measuring (for example, cherry tree height). The y-axis refers to probability density. Note that probability density is not the same thing as probability: the probability of a range of values is the area under the curve over that range.

77
Q

probability function

A

mathematical function that provides probabilities for the possible outcomes of a random variable.

78
Q

2 types of probability functions:

A
  • Probability Mass Functions (PMFs) represent discrete Random variables
  • Probability Density Functions (PDFs) represent continuous Random variables
79
Q

The normal (gaussian) distribution

A

is a continuous probability distribution that is symmetric about the mean and bell-shaped.

  • The shape is a bell curve
  • The mean is located at the center of the curve
  • The curve is symmetrical on both sides of the mean
  • The total area under the curve equals 1
80
Q

The empirical rule

A

values on a normal curve are distributed in a regular pattern, based on their distance from the mean.

  • 68% of values fall within 1 standard deviation of the mean
  • 95% of values fall within 2 standard deviations of the mean
  • 99.7% of values fall within 3 standard deviations of the mean
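
These percentages can be checked against the standard normal distribution with `statistics.NormalDist` from the Python standard library. The sketch below is only a verification aid; the 68/95/99.7 figures in the rule are rounded.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of values within k standard deviations of the mean, for k = 1, 2, 3.
shares = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```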
81
Q

z-score or standard score

A

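is a measure of how many standard deviations below or above the population mean a data point is. A z-score gives you an idea of how far from the mean a data point is.

Z-scores typically range from -3 to +3, because about 99.7% of values in a normal distribution fall within 3 standard deviations of the mean; more extreme values are possible but rare.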
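
A z-score is computed as (value - mean) / standard deviation. The sketch below uses made-up exam numbers, and the cutoff of 3 for flagging an outlier is a common convention assumed here, not a rule from the cards.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Hypothetical exam with mean 70 and standard deviation 10.
exam_z = z_score(x=85, mean=70, std_dev=10)
extreme_z = z_score(x=105, mean=70, std_dev=10)

# Assumed convention: flag values more than 3 standard deviations from the mean.
is_outlier = abs(extreme_z) > 3

print(exam_z, extreme_z, is_outlier)
```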

82
Q

Standardization

A

standardization is the process of putting different variables on the same scale.

83
Q

standard normal distribution

A

is just a normal distribution with a mean of 0 and a standard deviation of 1. Standardization is useful because it lets you compare scores from different data sets that may have different units, mean values and standard deviations.

84
Q

Z-score for outliers

A

use z-scores for anomaly detection, which finds outliers in datasets.

Applications of anomaly detection include finding fraud in financial transactions, flaws in manufacturing products, intrusions in computer networks and more.

85
Q

Transformation on Probability Distribution :

A

process of applying mathematical functions to data to change its underlying distribution. Transformations can be critical in statistics and machine learning when you need to work with algorithms that assume a normal distribution. Many statistical methods and machine learning algorithms perform best when the data follows a normal distribution, owing to properties like symmetry, defined mean and standard deviation, and consistent spread.

86
Q

Why transformations are used in Data Science

A
  1. Statistical Assumptions
    Statistical tests like t-tests, ANOVA, and many regression models assume that the underlying data or residuals (errors) are normally distributed. When the data doesn’t meet this assumption, the results can be biased or misleading. Transformations can help ensure that data fits these assumptions.
  2. Improving Algorithm Performance
    Machine learning algorithms, particularly linear regression and logistic regression, may perform better when the data or residuals are normally distributed. This is because the assumptions underlying these algorithms are closely related to normality. Making the data more normally distributed through transformation can improve the algorithm’s predictive accuracy and reduce bias.
  3. Stabilizing Variance
    When data has unstable variance (heteroscedasticity), it can lead to errors in modeling and reduce the effectiveness of algorithms that expect consistent variance. Transformations can help stabilize variance, making it more constant across different ranges of the data.
  4. Reducing Skewness
    Skewed data can lead to inaccurate conclusions and complicate the interpretation of results. Algorithms that expect symmetric data may perform poorly with skewed inputs. Transformations like log transformation can reduce skewness, bringing data closer to a normal distribution.
87
Q

Common Transformations for Achieving Normality

A

Log Transformation: Converts data by taking the natural logarithm, reducing positive skewness. Useful for data with exponential growth or a long right tail.

Square Root Transformation: Converts data by taking the square root to reduce skewness. Often used for count data or data whose variance increases with the mean.

Box-Cox Transformation: A flexible power transformation that can turn a range of non-normal data into a more normal distribution. It requires strictly positive data and determines the best power transformation parameter (λ) to achieve normality. It can be mathematically expressed as: y(λ) = (x^λ - 1) / λ if λ ≠ 0, and y(λ) = ln(x) if λ = 0.

Reciprocal Transformation: Involves taking the reciprocal (1/x) to transform the data, reducing positive skewness.
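
To see the effect of the log and square root transformations on skewness, here is a minimal sketch in plain Python. The `skewness` helper is a hypothetical function written for this example (it computes the moment-based skewness, the third central moment divided by the cubed standard deviation), and the data is a made-up series with exponential growth.

```python
import math

def skewness(values):
    """Moment-based skewness: positive means a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2**1.5

raw = [2**k for k in range(8)]        # 1, 2, 4, ..., 128: long right tail
logged = [math.log(v) for v in raw]   # evenly spaced after the log transform
rooted = [math.sqrt(v) for v in raw]  # milder correction than the log

print(skewness(raw), skewness(rooted), skewness(logged))
```

On this data the square root reduces the positive skew and the log removes it almost entirely, which fits the note that the log transformation suits data with exponential growth.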

88
Q

relationship between PDF and CDF

A

The PDF is the derivative of the CDF; equivalently, the CDF is the integral of the PDF.
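
This relationship can be checked numerically. The sketch below, using the standard normal distribution from Python's `statistics.NormalDist`, approximates the slope of the CDF at one point and compares it with the PDF there. The point `x` and step size `h` are arbitrary choices.

```python
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)
x, h = 0.5, 1e-5  # evaluation point and small step (arbitrary choices)

# Central-difference approximation of the derivative of the CDF at x.
numeric_derivative = (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)

print(numeric_derivative, dist.pdf(x))
```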

89
Q

sample

A

a subset of a population.

90
Q

(Target) Population

A

includes every possible element that you are interested in measuring, or the entire dataset that you want to draw conclusions about.

91
Q

Sampling

A

The process of selecting a subset of data from a population. Sampling can help you make valid inferences about the population as a whole.

92
Q

Data professionals use sampling because:

A
  • It’s often impossible or impractical to collect data on the whole population due to size, complexity, or lack of accessibility
  • It’s easier, faster, and more efficient to collect data from a sample
  • Using a sample saves money and resources
  • Storing, organizing, and analyzing smaller datasets is usually easier, faster, and more reliable than dealing with extremely large datasets
93
Q

representative sample

A

accurately reflects the characteristics of a population.

94
Q

representative sample importance

A

the quality of your sample helps determine the quality of the insights you share with stakeholders. To make reliable inferences about a population, make sure your sample is representative of the population.

95
Q

probability sampling

A
  • help ensure your sample is representative by collecting random samples from the various groups within a population.
  • reduce sampling bias and
  • increase the validity of your results.
96
Q

Stages of sampling process

A
  1. Identify the target population
  2. Select the sampling frame
  3. Choose the sampling method
  4. Determine the sample size
  5. Collect the sample data
97
Q

Sampling Process: 1. Identify the target population

A

The target population is the complete set of elements that you’re interested in knowing more about.

98
Q

Sampling Process: 2. Select the sampling frame

A

sampling frame is a list of all the individuals or items in your target population.

  • So, if your target population is all the customers who purchased the refrigerator,
  • your sampling frame could be an alphabetical list of the names of all these customers. The customers in your sample will be selected from this list.
  • some customers may have changed their contact information since their purchase, and you may be unable to locate or contact them.
  • Your sampling frame is the accessible part of your target population.
99
Q

Sampling Process: 3. Choose the sampling method

A

There are two main types of sampling methods:

  1. probability sampling (preferred)
  2. non-probability sampling
100
Q

Sampling Process: 4. Determine the sample size

A

Sample size helps determine the precision of the predictions you make about the population.

In general, the larger the sample size, the more precise your predictions.

However, using larger samples typically requires more resources.

101
Q

Sampling Process: 5. Collect the sample data

A

Once the sampling method and sample size are set, you are ready to collect your sample data, which is the final step in the sampling process.

102
Q

4 probability sampling methods

A
  • Simple random sampling
  • Stratified random sampling
  • Cluster random sampling
  • Systematic random sampling
103
Q

What’s the difference between a target population and a sampling frame?

A

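The population is general and the frame is specific: the target population is the complete set of elements you want to know about, while the sampling frame is the accessible list from which the sample is actually selected.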

104
Q

probability sampling

A

uses random selection to generate a sample. Because probability sampling methods are based on random selection, every element in the population has an equal chance of being included in the sample. This gives you the best chance to get a representative sample, as your results are more likely to accurately reflect the overall population.

105
Q

Non-probability sampling

A

often based on convenience, or the personal preferences of the researcher, rather than random selection. Often, probability sampling methods require more time and resources than non-probability sampling methods.

106
Q

Simple random sampling

A

every member of a population is selected randomly and has an equal chance of being chosen. You can randomly select members using a random number generator, or by another method of random selection.
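
A minimal sketch of simple random sampling with Python's `random` module. The population of 1,000 numbered customers is hypothetical, and the fixed seed is there only so the draw can be repeated.

```python
import random

population = list(range(1, 1001))  # hypothetical IDs for 1,000 customers

rng = random.Random(42)  # fixed seed so the same sample is drawn each run

# Every member has the same chance of selection; no member is picked twice.
sample = rng.sample(population, k=100)

print(sample[:5])
```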

107
Q

Simple random samples PROs

A
  1. fairly representative, since every member of the population has an equal chance of being chosen.
  2. avoid bias, and surveys like these give you more reliable results.
108
Q

Simple random samples CONs

A
  • often expensive and time-consuming to collect large simple random samples.
  • And if your sample size is not large enough, a specific group of people in the population may be underrepresented in your sample.
109
Q

Stratified random sampling

A

divide a population into groups, and randomly select some members from each group to be in the sample.

Strata can be organized by age, gender, income, or whatever category you’re interested in studying.
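
Stratified random sampling can be sketched as: group the population by stratum, then draw a random sample from every group. The population, the age groups, and the choice of 10 members per stratum below are all made up for illustration.

```python
import random

rng = random.Random(0)  # fixed seed for a repeatable draw

# Hypothetical population: (person_id, age_group) pairs.
age_groups = ["18-34", "35-54", "55+"]
population = [(person_id, age_groups[person_id % 3]) for person_id in range(300)]

# Step 1: divide the population into strata.
strata = {}
for person_id, group in population:
    strata.setdefault(group, []).append(person_id)

# Step 2: randomly select some members from each stratum.
sample = {group: rng.sample(members, k=10) for group, members in strata.items()}

print({group: len(chosen) for group, chosen in sample.items()})
```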

110
Q

Stratified random sampling CON

A

can be difficult to identify appropriate strata for a study if you lack knowledge of a population.

111
Q

Stratified random sampling PROs

A

ensure that members from each group in the population are included in the survey. This method helps provide equal representation for underrepresented groups, and allows you to draw more precise conclusions about each of the strata. There may be significant differences in the purchasing habits of a 21-year-old and a 51-year-old. Stratified sampling helps ensure that both perspectives are captured in the sample.

112
Q

Cluster random sampling

A

you divide a population into clusters, randomly select certain clusters, and include all members from the chosen clusters in the sample.

Clusters are divided using identifying details, such as age, gender, location, or whatever you want to study.
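
Cluster random sampling can be sketched as: pick whole clusters at random, then keep every member of the chosen clusters. The store and employee names below are hypothetical, as is the choice of 3 clusters.

```python
import random

rng = random.Random(1)  # fixed seed for a repeatable draw

# Hypothetical population: 10 store locations (clusters) with 20 employees each.
clusters = {
    f"store_{c}": [f"store_{c}_emp_{e}" for e in range(20)]
    for c in range(10)
}

# Step 1: randomly select certain clusters.
chosen_clusters = rng.sample(sorted(clusters), k=3)

# Step 2: include all members of the chosen clusters in the sample.
sample = [emp for store in chosen_clusters for emp in clusters[store]]

print(chosen_clusters, len(sample))
```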

113
Q

Difference between cluster sampling and stratified random sampling

A

In stratified sampling, you randomly choose some members from every group to be in the sample. In cluster sampling, you randomly choose whole groups and include all of their members in the sample.

114
Q

Cluster random sampling PRO

A

gets every member from a particular cluster, which is useful when each cluster reflects the population as a whole. This method is helpful when dealing with large and diverse populations that have clearly defined subgroups. If researchers want to learn more about home ownership in the suburbs of Auckland, New Zealand, they can use several well-chosen suburbs as a representative sample of all the suburbs in the city.

115
Q

Cluster random sampling CON

A

difficult to create clusters that accurately reflect the overall population. For example, for practical reasons, you may only have access to restaurants in England when the franchise has locations all over the world. And employees in England may have different characteristics and values than employees in other countries.

116
Q

Systematic random sampling

A

put every member of a population into an ordered sequence. Then, you choose a random starting point in the sequence and select members for your sample at regular intervals.

Ex. Starting with number 4, you select every 10th name on the list (4, 14, 24, 34, … ), until you have a sample of 100 students.
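
The every-10th-name procedure can be sketched with list slicing. The list of 1,000 student names is hypothetical; the interval is derived from the population size and the desired sample size, which is why the population size must be known in advance.

```python
import random

names = [f"student_{i}" for i in range(1, 1001)]  # hypothetical ordered list
sample_size = 100

interval = len(names) // sample_size  # select every 10th name

rng = random.Random(7)           # fixed seed for a repeatable draw
start = rng.randrange(interval)  # random starting point within the first interval

# Take the starting name, then every interval-th name after it.
sample = names[start::interval]

print(start, interval, sample[:3])
```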

117
Q

Systematic random sampling CON

A
  • need to know the size of the population that you want to study before you begin. If you don’t have this information, it’s difficult to choose consistent intervals.
  • Plus, if there’s a hidden pattern in the sequence, you might not get a representative sample. For example, if every 10th name on your list happens to be an honor student, you may only get feedback on the study habits of honor students – and not all students.
118
Q

Systematic random sampling PRO

A
  • representative of the population, since every member has an equal chance of being included in the sample
  • quick and convenient when you have a complete list of the members of your population.
119
Q

Non-probability sampling methods

A

use nonrandom methods of selection, so not all members of a population have an equal chance of being selected. This is why nonprobability methods have a high risk of sampling bias.

120
Q

Non-probability sampling methods CON

A

high risk of sampling bias.

121
Q

Non-probability sampling methods PRO

A
  • less expensive and more convenient for researchers to conduct.
  • can be useful for exploratory studies, which seek to develop an initial understanding of a population, rather than make inferences about the population as a whole.
122
Q

4 methods of Non-probability sampling

A
  • Convenience
  • Voluntary response sampling
  • Snowball sampling
  • Purposive sampling
123
Q

Convenience sampling

A

choose members of a population that are easy to contact or reach. For example, to conduct an opinion poll, a researcher might stand at the entrance of a shopping mall during the day and poll people that happen to walk by.

124
Q

Convenience sampling CON

A
  • not reliable
  • convenience samples often suffer from undercoverage bias.
    Undercoverage bias occurs when some members of a population are inadequately represented in the sample.
125
Q

Convenience sampling PRO

A

quick and inexpensive

126
Q

Voluntary response sampling

A

A voluntary response sample consists of members of a population who volunteer to participate in a study

127
Q

Voluntary response sampling CON

A

Voluntary response samples tend to suffer from nonresponse bias, which occurs when certain groups of people are less likely to provide responses. People who voluntarily respond will likely have stronger opinions, either positive or negative, than the rest of the population. For example, in a student survey about school food, only students who really like or really dislike the food may be motivated to fill out the survey.

128
Q

Snowball sampling

A

researchers recruit initial participants to be in a study and then ask them to recruit other people to participate in the study. Like a snowball, the sample size gets bigger and bigger as more participants join in.

129
Q

Snowball sampling PRO

A

researchers often use snowball sampling when the population they want to study is difficult to access.

130
Q

Snowball sampling CON

A

Snowball sampling can take a lot of time, and researchers must rely on participants to successfully continue the recruiting process and build up the “snowball.” This type of recruiting can also lead to sampling bias. Because initial participants recruit additional participants on their own, it’s likely that most of them will share similar characteristics, and these characteristics might be unrepresentative of the total population under study.

131
Q

Purposive sampling

A

researchers select participants based on the purpose of their study. Because participants are selected for the sample according to the needs of the study, applicants who do not fit the profile are rejected.

For example, imagine a game development company wants to conduct market research on a new video game before its public release. The research team only wants to include gaming experts in their sample. So, they survey a group of professional gamers to provide feedback on potential improvements.

132
Q

Purposive sampling PRO

A

Purposive sampling is often used when a researcher wants to gain detailed knowledge about a specific part of a population, or where the population is very small and its members all have similar characteristics.

133
Q

Purposive sampling CON

A

not effective for making inferences about a large and diverse population.

134
Q

Point Estimate

A
  • A point estimate uses a single value to estimate a population parameter.
  • ex. A data professional might use the mean weight of the sample of 100 penguins to estimate the mean weight of the population.
135
Q

Parameter

A

characteristic of a population. Ex. the mean weight of the total population of 10,000 penguins.

136
Q

statistic

A

characteristic of a sample. Ex. the mean weight of a random sample of 100 penguins.

137
Q

Sampling Distribution

A
  • is a probability distribution of a sample statistic.
  • a sampling distribution represents the possible outcomes for a sample statistic
  • Sample statistics are based on randomly sampled data, and their outcomes cannot be predicted with certainty. You can use a sampling distribution to represent statistics such as the mean, median, standard deviation, range, and more.
138
Q

Sampling variability

A

refers to how much an estimate varies between samples. If your sample is large enough, your sample mean will roughly equal the population mean.

  • population mean = 3 lbs
    • sample mean = 3.3 lbs
    • sample mean = 2.8 lbs
    • sample mean = 2.4 lbs
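
A small simulation can illustrate this. The sketch below assumes a made-up population of 10,000 weights centered near 3 lbs; the numbers it prints are simulated, not taken from the card.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 10,000 weights with a true mean of about 3 lbs.
population = [rng.gauss(3.0, 0.5) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Draw five separate random samples of 25 and record each sample mean.
sample_means = [statistics.mean(rng.sample(population, 25)) for _ in range(5)]

print(f"population mean: {population_mean:.2f}")
for m in sample_means:
    print(f"sample mean:     {m:.2f}")
```

Each sample gives a slightly different mean, and that sample-to-sample spread is the sampling variability.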
139
Q

Standard Error (SE)

A
  • the standard deviation of a sample statistic
  • The standard error of the mean measures variability among all your sample means.
  • larger SE: sample means are more spread out, so there is more variability
  • smaller SE: sample means are closer together, so there is less variability
  • The smaller the SE, the more likely it is that your sample mean is an accurate estimate of the population mean.
  • as sample size increases, SE decreases
140
Q

High Sampling Variability means?

A
  • The more variability in your sample data, the less likely it is that the sample mean is an accurate estimate of the population mean.
  • use the standard deviation of the sample means to measure this variability
141
Q

Standard Error (SE) Calculation example

A

SE = s ÷ √n

s = sample standard deviation
n = sample size

Ex. with s = 2 inches and n = 100: SE = 2 ÷ √100 = 2 ÷ 10 = 0.2 inches

This means you should expect that the mean length from one sample to the next will vary with a standard deviation of about 0.2 inches.
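
The same arithmetic as a Python sketch, using s = 2 and n = 100 from the card. The function name `standard_error` is an illustrative choice, not a library function.

```python
import math


def standard_error(sample_sd, n):
    """Standard error of the mean: SE = s / sqrt(n)."""
    return sample_sd / math.sqrt(n)


se_100 = standard_error(2, 100)  # s = 2, n = 100
se_400 = standard_error(2, 400)  # same s, four times the sample size

print(se_100)
print(se_400)
```

Quadrupling the sample size halves the standard error, because n sits under a square root.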

142
Q

central limit theorem

A

The central limit theorem states that the sampling distribution of the mean approaches a normal distribution as the sample size increases. And, as you sample more observations from a population, the sample mean gets closer to the population mean.

The central limit theorem can help you infer population parameters like the mean even if you only have available data on a portion of the population. The larger your sample size, the more precise your estimate of the population mean is likely to be.

143
Q

Does the central limit theorem hold true for any population?

A

You don’t need to know the shape of your population distribution in advance to apply the theorem: the distribution could be bell-shaped, skewed, or have another shape. If you collect a large enough sample, the shape of your sampling distribution will be approximately normal.
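
A hedged simulation sketch of this claim. It assumes a strongly skewed population (exponential with mean 1 and standard deviation 1), and the sample size of 50 and 2,000 repeats are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(7)
n = 50           # observations per sample
repeats = 2_000  # number of samples drawn

# Skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(repeats)
]

center = statistics.fmean(sample_means)  # should sit near the population mean, 1
spread = statistics.stdev(sample_means)  # should sit near 1 / sqrt(n)
expected_spread = 1 / n ** 0.5

# For a normal curve, roughly 68% of values fall within one SD of the center.
within_one = sum(abs(m - center) <= spread for m in sample_means) / repeats

print(f"center of sample means: {center:.3f}")
print(f"spread of sample means: {spread:.3f} (theory: {expected_spread:.3f})")
print(f"share within one SD:    {within_one:.2f}")
```

Even though the individual observations are heavily skewed, the sample means cluster symmetrically around the population mean.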

144
Q

central limit theorem conditions

A

In order to apply the central limit theorem, the following conditions must be met:

  • Randomization
  • Independence
  • 10%
  • Sample size
145
Q

Randomization

A

Your sample data must be the result of random selection. Random selection means that every member in the population has an equal chance of being chosen for the sample.

146
Q

Independence

A

Your sample values must be independent of each other. Independence means that the value of one observation does not affect the value of another observation. Typically, if you know that the individuals or items in your dataset were selected randomly, you can also assume independence.

147
Q

10% population size

A

Your sample size should be no larger than 10% of the total population. This applies when the sample is drawn without replacement, which is usually the case.
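
A tiny sketch of this check in Python. The function name `meets_ten_percent_condition` is hypothetical, and the penguin numbers reuse the example from the Parameter and statistic cards.

```python
def meets_ten_percent_condition(sample_size, population_size):
    """True when the sample is no larger than 10% of the population."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return 10 * sample_size <= population_size


ok_small = meets_ten_percent_condition(100, 10_000)    # 1% of the population
ok_large = meets_ten_percent_condition(1_500, 10_000)  # 15% of the population

print(ok_small, ok_large)
```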

148
Q

(CLT) Sample size

A

The sample size needs to be sufficiently large.

  • Requirements for precision: The larger the sample size, the more closely your sampling distribution will resemble a normal distribution, and the more precise your estimate of the population mean will be.
  • The shape of the population: If your population distribution is roughly bell-shaped and already resembles a normal distribution, the sampling distribution of the sample mean will be close to a normal distribution even with a small sample size.
  • In general, many statisticians and data professionals consider a sample size of 30 to be sufficient when the population distribution is roughly bell-shaped, or approximately normal.
  • However, if the original population is not normal (for example, if it’s extremely skewed or has lots of outliers), data professionals often prefer the sample size to be a bit larger. Exploratory data analysis can help you determine how large a sample is necessary for a given dataset.
149
Q

Population proportion

A
  • refers to the percentage of individuals or elements in a population that share a certain characteristic.
  • Proportions measure percentages or parts of a whole.
150
Q

sampling distribution of the proportion

A
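  • the probability distribution of a sample proportion; used to estimate a population proportion, such as:
  • the proportion of all visitors to a website who make a purchase before leaving
  • the proportion of assembly line products that meet quality control standards
  • the proportion of voters who support a candidate in an upcoming election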
151
Q

Central Limit Theorem (sample proportions)

A
  • As with the sample means, the central limit theorem also applies to sample proportions
  • As your sample size increases, the distribution of the sample proportion will be approximately normal.
  • The overall average or mean proportion is located in the center of the curve.
  • If you take a sufficiently large sample of teenagers, the sample proportion will be an accurate estimate of the true population proportion.
  • If you survey 1000 teenagers and find that 10% prefer slip-on sneakers, this means that your best estimate for the proportion of all teenagers who prefer slip-ons is also 10%.
152
Q

Standard Error (sample proportion)

A
  • As with the sample mean, you can use the standard error of the proportion to measure sampling variability.
  • This tells you how much a particular sample proportion is likely to differ from the true population proportion.
  • This is useful to know, because the proportion varies from sample to sample. And any given sample proportion probably won’t be exactly equal to the true population proportion.
  • The true proportion of teenagers who prefer slip-on sneakers might be 10%, but the proportion of any given sample might be 12%, 9%, 7% and so on.
  • The more variability in your sample data, the less likely it is that the sample proportion is an accurate estimate of the population proportion.
  • It’s important to understand the accuracy of your estimate, because stakeholder decisions are often based on the estimates you provide.
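
The card does not give a formula, so the sketch below assumes the standard one for a sample proportion, SE = √(p(1 − p) ÷ n), applied to the teenager example (p = 10%, n = 1000). The function name `standard_error_proportion` is an illustrative choice.

```python
import math


def standard_error_proportion(p_hat, n):
    """Standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


# 10% of 1,000 surveyed teenagers prefer slip-on sneakers.
se_prop = standard_error_proportion(0.10, 1000)
print(f"{se_prop:.4f}")
```

The result is a little under 0.01, so sample proportions from surveys of this size would typically land within about one percentage point of the true proportion.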
153
Q

What is the difference between standard error and standard deviation?

A
  • Standard Deviation tells you how spread out your data is.
  • Standard Error focuses on the reliability of the sample mean as an estimate of the population mean.
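
A minimal simulation sketch of the difference, assuming made-up data drawn from one population with a mean of 3 and a standard deviation of 0.5. The helper name `sd_and_se` is hypothetical.

```python
import math
import random
import statistics

rng = random.Random(3)


def sd_and_se(sample):
    """Return (standard deviation, standard error of the mean) for a sample."""
    sd = statistics.stdev(sample)
    return sd, sd / math.sqrt(len(sample))


# Two samples from the same hypothetical population, one 100 times larger.
small = [rng.gauss(3.0, 0.5) for _ in range(25)]
large = [rng.gauss(3.0, 0.5) for _ in range(2_500)]

sd_small, se_small = sd_and_se(small)
sd_large, se_large = sd_and_se(large)

print(f"n=25:   SD={sd_small:.3f}  SE={se_small:.3f}")
print(f"n=2500: SD={sd_large:.3f}  SE={se_large:.3f}")
```

The standard deviation stays in the same neighborhood because it describes the data itself, while the standard error shrinks as the sample grows because it describes how precise the mean estimate is.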