Statistics Flashcards

Question

Percentiles

Answer 1

- A percentile is a measure that tells you what percentage of values in a dataset are less than or equal to a particular value.
- Percentiles show the **relative position** or rank of a particular value in a dataset.
- If you're in the 75th percentile for height, it means 75% of people are shorter than you, and 25% are taller.
- Percentiles are often used to rank test scores on school exams.

Answer 2

* A quartile divides the values in a dataset into **four equal parts.**
* Quartiles let you **compare values** relative to the four quarters of data.

Answer 3

The first quartile, Q1, is the middle value in the first half of the dataset. Q1 refers to the 25th percentile. 25% of the values in the entire dataset are below Q1, and 75% are above it.

Answer 4

The second quartile, Q2, is the **median** of the dataset. Q2 refers to the **50th percentile**. 50% of the values in the entire dataset are below Q2, and 50% are above it.

Answer 5

The third quartile, Q3, is the middle value in the second half of the dataset. Q3 refers to the 75th percentile. 75% of the values in the entire dataset are below Q3, and 25% are above it.

Answer 6

- is the distance between the first quartile, Q1, and the third quartile, Q3.
- is a **measure of dispersion** because it measures the **spread of the middle half**, or middle 50 percent, of your data.
- IQR is also useful for determining the **relative position** of your **data values.**

Answer 7

* The minimum
* The first quartile (Q1)
* The median, or second quartile (Q2)
* The third quartile (Q3)
* The maximum
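
The five values above, plus the IQR, can be computed with Python's standard library. This is a minimal sketch: the data values are made up, and `method="inclusive"` is only one of several quartile conventions, so Q1 and Q3 may differ slightly from the "middle value of each half" rule used in these cards.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # n=4 splits the data into quarters and returns the three cut points.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

data = [7, 1, 9, 3, 5, 2, 8, 4, 6]  # hypothetical dataset
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # spread of the middle 50% of the data

print(minimum, q1, median, q3, maximum)  # 1 3.0 5.0 7.0 9
print(iqr)                               # 4.0
```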

Answer 8

with boxplot

Answer 9

Low/No outliers

Answer 10

* objective
* subjective

Answer 11

Objective probability is based on statistics, experiments, and mathematical measurements. It has 2 types:
* classical
* empirical

Answer 12

Classical probability is based on **formal reasoning** about events with **equally likely outcomes.** Example: toss a fair coin. The probability of getting heads is always 1/2 = 50%.

Answer 13

based on **experimental or historical data**; it represents the likelihood of an event occurring based on the **previous results** of an experiment or **past events.**

Answer 14

Data professionals rely on empirical probability to help them make accurate predictions based on sample data. For example, in an A/B test of a website, you test a sample of users to make a prediction about the future behavior of all users. Say the sample of users prefer a green add-to-cart button over a blue one. You may infer from this data that the larger population of future users will probably share their preference. An A/B test lets you make a reasonable prediction about future users based on empirical probability.

Answer 15

Subjective probability is based on personal feelings, experience, or judgment.

Answer 16

* Random experiment
* Outcome
* Event

Answer 17

process whose **outcome cannot be predicted** with certainty. For example, before tossing a coin or rolling a die, you can’t know the result of the toss or the roll. The result of the coin toss might be heads or tails. The result of the die roll might be 3 or 6.

Answer 18

* The experiment can have more than one possible outcome.
* You can represent each possible outcome in advance.
* The outcome of the experiment depends on chance.

Answer 19

the result of a random experiment. For example, if you roll a die, there are six possible outcomes: 1, 2, 3, 4, 5, 6.

Answer 20

a set of one or more outcomes. Using the example of rolling a die, an event might be rolling an even number. The event of rolling an even number consists of the outcomes 2, 4, 6. Or, the event of rolling an odd number consists of the outcomes 1, 3, 5.

Answer 21

The probability that an event will occur is expressed as a number between 0 and 1. Probability can also be expressed as a percent.
* If the probability of an event equals 0, there is a 0% chance that the event will occur.
* If the probability of an event equals 1, there is a 100% chance that the event will occur.

Answer 22

# of desired outcomes ÷ total # of possible outcomes

Answer 23

The probability of event A

Answer 24

The probability of event B

Answer 25

the probability of any event A is always between 0 and 1.

Answer 26

then event A has a higher chance of occurring than event B.

Answer 27

event A and event B are equally likely to occur.

Answer 28

Two events are **mutually exclusive** if they cannot occur at the same time. For example, you can’t be on the Earth and on the moon at the same time, or be sitting down and standing up at the same time.

Answer 29

Two events are **independent** if the occurrence of one event does not change the probability of the other event. This means that one event does not affect the outcome of the other event. For example, watching a movie in the morning does not affect the weather in the afternoon.

Answer 30

* Complement rule (mutually exclusive events)
* Addition rule (mutually exclusive events)
* Multiplication rule (independent events)

Answer 31

The complement rule deals with mutually exclusive events. In statistics, the complement of an event is the event not occurring. The complement rule states that the probability that event A does not occur is 1 minus the probability of A.

P(A’) = 1 - P(A)

Answer 32

the probability of not A, or the probability of event A NOT occurring.

Answer 33

if events A and B are mutually exclusive, then the probability of A or B occurring is the sum of the probabilities of A and B.

P(A or B) = P(A) + P(B)

P(rolling 2 or rolling 4) = P(rolling 2) + P(rolling 4) = ⅙ + ⅙ = ⅓

So, the probability of rolling either a 2 or a 4 is one out of three, or about 33%.
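
The addition rule for the die example, along with the complement rule, can be checked by counting outcomes. This is a small sketch using exact fractions; the helper name `probability` is just an illustrative choice.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # one roll of a fair six-sided die

def probability(event):
    """Classical probability: desired outcomes / total possible outcomes."""
    return Fraction(len(event & outcomes), len(outcomes))

rolling_2 = {2}
rolling_4 = {4}

# Addition rule: valid here because the two events share no outcomes.
p_2_or_4 = probability(rolling_2) + probability(rolling_4)

# Counting the combined event directly gives the same answer.
p_direct = probability(rolling_2 | rolling_4)

# Complement rule: probability of NOT rolling a 2.
p_not_2 = 1 - probability(rolling_2)

print(p_2_or_4, p_direct, p_not_2)  # 1/3 1/3 5/6
```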

Answer 34

if events A and B are independent, then the probability of both A and B occurring is the probability of A multiplied by the probability of B.

P(A and B) = P(A) × P(B)

P(rolling 1 on the first roll and rolling 6 on the second roll) = P(rolling 1 on the first roll) × P(rolling 6 on the second roll) = ⅙ × ⅙ = 1/36

So, the probability of rolling a 1 and then a 6 is one out of thirty-six, or about 2.8%.
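
One way to convince yourself of the multiplication rule is to list all 36 equally likely pairs of rolls and count. The sketch below does that; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
rolls = list(product(faces, repeat=2))  # all 36 (first, second) pairs

def share(condition):
    """Fraction of the 36 pairs that satisfy the condition."""
    return Fraction(sum(1 for pair in rolls if condition(pair)), len(rolls))

p_first_is_1 = share(lambda pair: pair[0] == 1)
p_second_is_6 = share(lambda pair: pair[1] == 6)
p_both = share(lambda pair: pair == (1, 6))

print(p_first_is_1, p_second_is_6, p_both)  # 1/6 1/6 1/36
```

Because the two rolls are independent, `p_both` equals `p_first_is_1 * p_second_is_6`.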

Answer 35

applies to two or more dependent events.

P(A and B) = P(A) × P(B|A)

The vertical bar between the letters B and A indicates dependence, or that the occurrence of event B depends on the occurrence of event A. You can say this as “the probability of B given A.”

Answer 36

two events are dependent if the occurrence of one event changes the probability of the other event. This means that the first event affects the outcome of the second event. For instance, if you want to get a good grade on an exam, you first need to study the course material. Getting a good grade depends on studying.

Answer 37

the probability of B given A.

P(B|A) = P(A and B) / P(A)

The probability of event B given event A equals the probability that both A and B occur divided by the probability of A.
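
A worked instance of the conditional probability formula, using a single die roll. The two events are hypothetical examples chosen for illustration (they are not from the cards): A is "the roll is even" and B is "the roll is greater than 3".

```python
from fractions import Fraction

outcomes = set(range(1, 7))
event_a = {2, 4, 6}  # A: the roll is even (illustrative choice)
event_b = {4, 5, 6}  # B: the roll is greater than 3 (illustrative choice)

def probability(event):
    return Fraction(len(event), len(outcomes))

p_a = probability(event_a)
p_a_and_b = probability(event_a & event_b)  # outcomes 4 and 6

# P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a

print(p_a, p_a_and_b, p_b_given_a)  # 1/2 1/3 2/3
```

Knowing that the roll is even raises the probability of B from 1/2 to 2/3, so A and B are dependent events.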

Answer 38

is a math formula for determining conditional probability. For example, let’s say a medical condition is related to age. You can use Bayes’s theorem to more accurately determine the probability that a person has the condition based on age. The prior probability would be the probability of a person having the condition. The posterior, or updated, probability would be the probability of a person having the condition if they are in a certain age group.

Answer 39

refers to the probability of an event before new data is collected.

Answer 40

updated probability of an event based on new data.

Answer 41

P(A|B) = [ P(B|A) × P(A) ] / P(B)

* P(A): Prior probability, the probability of event A.
* P(A|B): Posterior probability, the probability of event A given event B.
* P(B|A): Likelihood, the probability of event B given event A.
* P(B): Evidence, the probability of event B.

Example: you want to find P(Spam | Money), given the following:

* P(Spam | Money), or posterior probability: the probability that an email is spam given that the word “money” appears in the email
* P(Spam), or prior probability: the probability of an email being spam = 0.2, or 20%
* P(Money), or evidence: the probability that the word “money” appears in an email = 0.15, or 15%
* P(Money | Spam), or likelihood: the probability that the word “money” appears in an email given that the email is spam = 0.4, or 40%

P(Spam | Money) = P(Money | Spam) × P(Spam) / P(Money) = 0.4 × 0.2 / 0.15 = 0.53333, or about 53.3%.

So, the probability that an email is spam given that the email contains the word “money” is about 53.3%.
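
The spam calculation can be reproduced in a few lines. This is a sketch with a hypothetical helper name, `posterior`; the three input probabilities are the ones given in this card.

```python
def posterior(likelihood, prior, evidence):
    """Bayes's theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_spam_given_money = posterior(
    likelihood=0.4,  # P(Money | Spam)
    prior=0.2,       # P(Spam)
    evidence=0.15,   # P(Money)
)

print(round(p_spam_given_money, 3))  # 0.533
```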

Answer 42

represent discrete random variables, or discrete events. Often, the outcomes of discrete events are expressed as whole numbers that can be counted. For example, rolling a die can result in a 2 or a 3, but not a decimal value such as 2.575 or 3.184.

Answer 43

describes the likelihood of the possible outcomes of a random event.

Answer 44

* Uniform * Binomial * Bernoulli * Poisson

Answer 45

describes events whose outcomes are all equally likely, or have equal probability. For example, rolling a die can result in six outcomes: 1, 2, 3, 4, 5, or 6. The probability of each outcome is the same: 1 out of 6, or about 16.7%. It applies to both discrete and continuous random variables. There is no skewness present in uniform distribution graphs.

Answer 46

* Random initialization: In many machine learning algorithms, such as neural networks and k-means clustering, the initial values of the parameters can have a significant impact on the final result. Uniform distribution is often used to randomly initialize the parameters, as it ensures that all values in the range have an equal probability of being selected.
* Sampling: Uniform distribution can also be used for sampling. For example, if you have a dataset with an equal number of samples from each class, you can use uniform distribution to randomly select a subset of the data that is representative of all the classes.
* Data augmentation: In some cases, you may want to artificially increase the size of your dataset by generating new examples that are similar to the original data. Uniform distribution can be used to generate new data points that are within a specified range of the original data.
* Hyperparameter tuning: Uniform distribution can also be used in hyperparameter tuning, where you need to search for the best combination of hyperparameters for a machine learning model. By defining a uniform prior distribution for each hyperparameter, you can sample from the distribution to explore the hyperparameter space.

Answer 47

models the probability of events with only two possible outcomes: success (1) or failure (0). These outcomes are mutually exclusive and cannot occur at the same time. This definition assumes the following:
* Each event is independent, or does not affect the probability of the others.
* Each event has the same probability of success.

Answer 48

* a new medication generates side effects
* a credit card transaction is fraudulent
* a stock price rises in value

In machine learning, the binomial distribution is often used to classify data.

Answer 49

The Bernoulli distribution is similar to the binomial distribution as it also models events that have only two possible outcomes (success or failure). The only difference is that the Bernoulli distribution refers to only a single trial of an experiment, while the binomial refers to repeated trials. A classic example of a Bernoulli trial is a single coin toss.
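
The relationship between the two distributions shows up directly in the binomial probability mass function, where a Bernoulli trial is the special case of one trial. The formula below is the standard one (it is not spelled out in these cards), and the coin-toss numbers are illustrative.

```python
from math import comb

def binomial_pmf(successes, trials, p_success):
    """Probability of exactly `successes` successes in `trials` independent trials."""
    failures = trials - successes
    return comb(trials, successes) * p_success**successes * (1 - p_success)**failures

# Binomial: exactly 5 heads in 10 tosses of a fair coin.
p_five_heads = binomial_pmf(5, 10, 0.5)

# Bernoulli: the same function with a single trial.
p_heads_once = binomial_pmf(1, 1, 0.5)

# The probabilities of all possible counts add up to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

print(p_five_heads, p_heads_once)
```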

Answer 50

models the probability that a certain number of events will occur during a specific time period.
* The number of events in the experiment can be counted.
* The mean number of events that occur during a specific time period is known.
* Each event is independent.

Answer 51

* Calls per hour for a customer service call center
* Customers per day at a shop
* Thunderstorms per month in a city
* Financial transactions per second at a bank
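
For count data like the examples above, the Poisson probability mass function gives the chance of seeing exactly k events when the mean number per period is known. The sketch uses the standard formula (not stated in these cards) and an assumed mean of 4 calls per hour.

```python
from math import exp, factorial

def poisson_pmf(k, mean_rate):
    """Probability of exactly k events when the mean per period is mean_rate."""
    return mean_rate**k * exp(-mean_rate) / factorial(k)

mean_calls_per_hour = 4  # assumed value for illustration

p_two_calls = poisson_pmf(2, mean_calls_per_hour)

# Probabilities over all possible counts add up to (almost exactly) 1.
total = sum(poisson_pmf(k, mean_calls_per_hour) for k in range(50))

print(round(p_two_calls, 4))
```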

Answer 52

On a continuous distribution, the x-axis refers to the value of the variable you’re measuring (in this case, cherry tree height). The y-axis refers to probability density. Note that probability density is not the same thing as probability.

Answer 53

mathematical function that provides probabilities for the possible outcomes of a random variable.

Answer 54

* Probability Mass Functions (PMFs) represent discrete random variables
* Probability Density Functions (PDFs) represent continuous random variables

Answer 55

is a continuous probability distribution that is symmetric about the mean and bell-shaped.
* The shape is a bell curve
* The mean is located at the center of the curve
* The curve is symmetrical on both sides of the mean
* The total area under the curve equals 1

Answer 56

values on a normal curve are distributed in a regular pattern, based on their distance from the mean.
* About 68% of values fall within 1 standard deviation of the mean
* About 95% of values fall within 2 standard deviations of the mean
* About 99.7% of values fall within 3 standard deviations of the mean
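
These three percentages can be recovered from the normal cumulative distribution function. A short sketch using `statistics.NormalDist` from the standard library:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of values within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The printed shares are close to 0.68, 0.95, and 0.997, which is where the rounded figures in the rule come from.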

Answer 57

is a measure of how many standard deviations below or above the population mean a data point is. A z-score gives you an idea of how far from the mean a data point is. For normally distributed data, almost all z-scores fall between -3 and +3.
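
A z-score is computed as (value - mean) / standard deviation. The sketch below applies that to a small made-up dataset, treating it as the whole population.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical population

mean = statistics.mean(values)    # 5
sd = statistics.pstdev(values)    # population standard deviation: 2.0

def z_score(x):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_scores = [z_score(x) for x in values]
print(z_score(9))  # 2.0
```

After standardizing, the z-scores themselves have a mean of 0 and a standard deviation of 1.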

Answer 58

standardization is the process of putting different variables on the same scale.

Answer 59

is just a normal distribution with a mean of 0 and a standard deviation of 1. Standardization is useful because it lets you compare scores from different data sets that may have different units, mean values and standard deviations.

Answer 60

use z-scores for anomaly detection, which finds outliers in datasets. Applications of anomaly detection include finding fraud in financial transactions, flaws in manufacturing products, intrusions in computer networks and more.

Answer 61

process of applying mathematical functions to data to change its underlying distribution. Transformations can be critical in statistics and machine learning when you need to work with algorithms that assume a normal distribution. Many statistical methods and machine learning algorithms perform best when the data follows a normal distribution, owing to properties like symmetry, defined mean and standard deviation, and consistent spread.

Answer 62

1. Statistical Assumptions: Statistical tests like t-tests, ANOVA, and many regression models assume that the underlying data or residuals (errors) are normally distributed. When the data doesn’t meet this assumption, the results can be biased or misleading. Transformations can help ensure that data fits these assumptions.
2. Improving Algorithm Performance: Machine learning algorithms, particularly linear regression and logistic regression, may perform better when the data or residuals are normally distributed. This is because the assumptions underlying these algorithms are closely related to normality. Making the data more normally distributed through transformation can improve the algorithm’s predictive accuracy and reduce bias.
3. Stabilizing Variance: When data has unstable variance (heteroscedasticity), it can lead to errors in modeling and reduce the effectiveness of algorithms that expect consistent variance. Transformations can help stabilize variance, making it more constant across different ranges of the data.
4. Reducing Skewness: Skewed data can lead to inaccurate conclusions and complicate the interpretation of results. Algorithms that expect symmetric data may perform poorly with skewed inputs. Transformations like log transformation can reduce skewness, bringing data closer to a normal distribution.

Answer 63

* Log Transformation: Converts data by taking the natural logarithm, reducing positive skewness. Useful for data with exponential growth or a long right tail.
* Square Root Transformation: Converts data by taking the square root to reduce skewness, often used for count data or data with variance increasing with the mean.
* Box-Cox Transformation: A flexible power transformation that can turn a range of non-normal data into a more normal distribution. It requires strictly positive data and determines the best power transformation parameter (λ) to achieve normality. It can be mathematically expressed as: y(λ) = (x^λ - 1) / λ when λ ≠ 0, and y(λ) = ln(x) when λ = 0.
* Reciprocal Transformation: Involves taking the reciprocal (1/x) to transform the data, reducing positive skewness.
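
To see the effect of these transformations, the sketch below measures skewness before and after on a small, made-up right-skewed dataset. The `skewness` helper is the population third standardized moment, and `box_cox` applies the standard Box-Cox formula for a fixed λ chosen by hand; libraries normally estimate λ from the data, which is not shown here.

```python
import math

def skewness(values):
    """Population skewness: third central moment / (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value for a given lambda."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]  # hypothetical data

logged = [math.log(v) for v in right_skewed]
rooted = [math.sqrt(v) for v in right_skewed]
reciprocal = [1 / v for v in right_skewed]

print(skewness(right_skewed), skewness(rooted), skewness(logged))
```

On this dataset the square root reduces the skewness somewhat and the log reduces it further, although neither removes it completely.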

Answer 64

The PDF is the derivative of the CDF.

Answer 65

a subset of a population.

Answer 66

includes every possible element that you are interested in measuring, or the entire dataset that you want to draw conclusions about.

Answer 67

process of selecting a subset of data from a population. Sampling can help you make valid inferences about the population as a whole.

Answer 68

* It’s often impossible or impractical to collect data on the whole population due to size, complexity, or lack of accessibility
* It’s easier, faster, and more efficient to collect data from a sample
* Using a sample saves money and resources
* Storing, organizing, and analyzing smaller datasets is usually easier, faster, and more reliable than dealing with extremely large datasets

Answer 69

accurately reflects the characteristics of a population.

Answer 70

the quality of your sample helps determine the quality of the insights you share with stakeholders. To make reliable inferences about a population, make sure your sample is representative of the population.

Answer 71

* help ensure your sample is representative by collecting random samples from the various groups within a population.
* reduce sampling bias
* increase the validity of your results.

Answer 72

1. Identify the target population
2. Select the sampling frame
3. Choose the sampling method
4. Determine the sample size
5. Collect the sample data

Answer 73

The target population is the complete set of elements that you’re interested in knowing more about.

Answer 74

A sampling frame is a list of all the individuals or items in your target population.
* So, if your target population is all the customers who purchased the refrigerator, your sampling frame could be an alphabetical list of the names of all these customers. The customers in your sample will be selected from this list.
* Some customers may have changed their contact information since their purchase, and you may be unable to locate or contact them.
* Your sampling frame is the accessible part of your target population.

Answer 75

There are two main types of sampling methods:
1. probability sampling (preferred)
2. non-probability sampling

Answer 76

Sample size helps determine the precision of the predictions you make about the population. In general, the larger the sample size, the more precise your predictions. However, using larger samples typically requires more resources.

Answer 77

ready to collect your sample data, which is the final step in the sampling process.

Answer 78

* Simple random sampling
* Stratified random sampling
* Cluster random sampling
* Systematic random sampling

Answer 79

the population is general and the frame is specific.

Answer 80

uses random selection to generate a sample. Because probability sampling methods are based on random selection, every element in the population has an equal chance of being included in the sample. This gives you the best chance to get a representative sample, as your results are more likely to accurately reflect the overall population.

Answer 81

often based on convenience, or the personal preferences of the researcher, rather than random selection. Often, probability sampling methods require more time and resources than non-probability sampling methods.

Answer 82

every member of a population is selected randomly and has an equal chance of being chosen. You can randomly select members using a random number generator, or by another method of random selection.
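
A minimal sketch of simple random sampling with Python's random number generator. The population of member IDs is hypothetical, and the seed is fixed only so the example is repeatable.

```python
import random

rng = random.Random(7)  # fixed seed for a repeatable illustration

population = list(range(1, 101))  # hypothetical member IDs 1..100

# Every member has the same chance of being chosen; no member is picked twice.
sample = rng.sample(population, k=10)

print(sorted(sample))
```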

Answer 83

1. fairly representative, since every member of the population has an equal chance of being chosen.
2. avoid bias, and surveys like these give you more reliable results.

Answer 84

* often expensive and time-consuming to collect large simple random samples.
* if your sample size is not large enough, a specific group of people in the population may be underrepresented in your sample.

Answer 85

divide a population into groups, and randomly select some members from each group to be in the sample. Strata can be organized by age, gender, income, or whatever category you’re interested in studying.

Answer 86

can be difficult to identify appropriate strata for a study if you lack knowledge of a population.

Answer 87

ensure that members from each group in the population are included in the survey. This method helps provide equal representation for underrepresented groups, and allows you to draw more precise conclusions about each of the strata. There may be significant differences in the purchasing habits of a 21-year-old and a 51-year-old. Stratified sampling helps ensure that both perspectives are captured in the sample.

Answer 88

you divide a population into clusters, randomly select certain clusters, and include all members from the chosen clusters in the sample. Clusters are divided using identifying details, such as age, gender, location, or whatever you want to study.

Answer 89

in stratified sampling, you randomly choose some members from each group to be in the sample. In cluster sampling, you choose all members from a group to be in the sample.
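
The difference between the two methods is easiest to see side by side. In this sketch the population, group labels, and function names are all hypothetical: the stratified function takes some members from every group, while the cluster function takes every member from some groups.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed for a repeatable illustration

# Hypothetical population: 30 members spread evenly over groups A, B, and C.
population = [(member_id, "ABC"[member_id % 3]) for member_id in range(30)]

def group_members(pop):
    """Map each group label to the list of its member IDs."""
    groups = defaultdict(list)
    for member_id, group in pop:
        groups[group].append(member_id)
    return groups

def stratified_sample(pop, per_group):
    """Randomly pick some members from every group."""
    groups = group_members(pop)
    return {g: rng.sample(members, per_group) for g, members in groups.items()}

def cluster_sample(pop, n_clusters):
    """Randomly pick whole groups and keep all of their members."""
    groups = group_members(pop)
    chosen = rng.sample(sorted(groups), n_clusters)
    return {g: groups[g] for g in chosen}

strata = stratified_sample(population, per_group=2)
clusters = cluster_sample(population, n_clusters=1)

print(strata)
print(clusters)
```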

Answer 90

gets every member from a particular cluster, which is useful when each cluster reflects the population as a whole. This method is helpful when dealing with large and diverse populations that have clearly defined subgroups. If researchers want to learn more about home ownership in the suburbs of Auckland, New Zealand, they can use several well-chosen suburbs as a representative sample of all the suburbs in the city.

Answer 91

difficult to create clusters that accurately reflect the overall population. For example, for practical reasons, you may only have access to restaurants in England when the franchise has locations all over the world. And employees in England may have different characteristics and values than employees in other countries.

Answer 92

put every member of a population into an ordered sequence. Then, you choose a random starting point in the sequence and select members for your sample at regular intervals. Ex. Starting with number 4, you select every 10th name on the list (4, 14, 24, 34, … ), until you have a sample of 100 students.
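
The student example can be sketched with list slicing. The list of 1,000 numbered students is an assumption made for illustration; the start of 4 and interval of 10 come from the example above.

```python
def systematic_sample(ordered_population, start, step, size):
    """Take every `step`-th member, beginning with the `start`-th (1-based)."""
    picked = ordered_population[start - 1::step]
    return picked[:size]

students = list(range(1, 1001))  # hypothetical list positions 1..1000

sample = systematic_sample(students, start=4, step=10, size=100)

print(sample[:4])   # [4, 14, 24, 34]
print(len(sample))  # 100
```

In practice the starting point would be chosen at random, for example with `random.randint(1, step)`, rather than fixed at 4.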

Answer 93

* need to know the size of the population that you want to study before you begin. If you don’t have this information, it’s difficult to choose consistent intervals.
* if there’s a hidden pattern in the sequence, you might not get a representative sample. For example, if every 10th name on your list happens to be an honor student, you may only get feedback on the study habits of honor students, and not all students.

Answer 94

* representative of the population, since every member has an equal chance of being included in the sample
* quick and convenient when you have a complete list of the members of your population.

Answer 95

use non-random methods of selection, so not all members of a population have an equal chance of being selected. This is why non-probability methods have a high risk of sampling bias.

Answer 96

high risk of sampling bias.

Answer 97

* less expensive and more convenient for researchers to conduct.
* can be useful for exploratory studies, which seek to develop an initial understanding of a population, rather than make inferences about the population as a whole.

Answer 98

* Convenience sampling
* Voluntary response sampling
* Snowball sampling
* Purposive sampling

Answer 99

choose members of a population that are easy to contact or reach. For example, to conduct an opinion poll, a researcher might stand at the entrance of a shopping mall during the day and poll people that happen to walk by.

Answer 100

* not reliable
* convenience samples often suffer from undercoverage bias. Undercoverage bias occurs when some members of a population are inadequately represented in the sample.

Answer 101

quick and inexpensive.

Answer 102

A voluntary response sample consists of members of a population who volunteer to participate in a study

Answer 103

Voluntary response samples tend to suffer from nonresponse bias, which occurs when certain groups of people are less likely to provide responses. People who voluntarily respond will likely have stronger opinions, either positive or negative, than the rest of the population. In this case, only students who really like or really dislike the food may be motivated to fill out the survey.

Answer 104

researchers recruit initial participants to be in a study and then ask them to recruit other people to participate in the study. Like a snowball, the sample size gets bigger and bigger as more participants join in.

Answer 105

use snowball sampling when the population they want to study is difficult to access.

Answer 106

Snowball sampling can take a lot of time, and researchers must rely on participants to successfully continue the recruiting process and build up the “snowball.” This type of recruiting can also lead to sampling bias. Because initial participants recruit additional participants on their own, it’s likely that most of them will share similar characteristics, and these characteristics might be unrepresentative of the total population under study.

Answer 107

researchers select participants based on the purpose of their study. Because participants are selected for the sample according to the needs of the study, applicants who do not fit the profile are rejected. For example, imagine a game development company wants to conduct market research on a new video game before its public release. The research team only wants to include gaming experts in their sample. So, they survey a group of professional gamers to provide feedback on potential improvements.

Answer 108

Purposive sampling is often used when a researcher wants to gain detailed knowledge about a specific part of a population, or where the population is very small and its members all have similar characteristics.

Answer 109

not effective for making inferences about a large and diverse population.

Answer 110

* A point estimate uses a single value to estimate a population parameter.
* Ex. A data professional might use the mean weight of the sample of 100 penguins to estimate the mean weight of the population.

Answer 111

characteristic of a population. The mean weight of the total population of 10,000 penguins.

Answer 112

characteristic of a sample. the mean weight of a random sample of 100 penguins

Answer 113

* is a probability distribution of a sample statistic.
* a sampling distribution represents the possible outcomes for a sample statistic
* Sample statistics are based on randomly sampled data, and their outcomes cannot be predicted with certainty. You can use a sampling distribution to represent statistics such as the mean, median, standard deviation, range, and more.

Answer 114

refers to how much an estimate varies between samples. If your sample is large enough, your sample mean will roughly equal the population mean.
* population mean = 3 lbs
* sample mean = 3.3 lbs
* sample mean = 2.8 lbs
* sample mean = 2.4 lbs

Answer 115

* the standard deviation of a sample statistic
* The standard error of the mean measures variability among all your sample means.
* larger SE → sample means are more spread out → more variability
* smaller SE → sample means are closer together → less variability
* The smaller the SE, the more likely it is that your sample mean is an accurate estimate of the population mean.
* As sample size increases, SE decreases.

Answer 116

* The more variability in your sample data, the less likely it is that the sample mean is an accurate estimate of the population mean.
* use the standard deviation of the sample means to measure this variability

Answer 117

SE = s ÷ √n

* s = sample standard deviation
* n = sample size

2 ÷ √100 = 2 ÷ 10 = 0.2

This means you should expect that the mean length from one sample to the next will vary with a standard deviation of about 0.2 inches.
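
The same calculation in code, first with the numbers from the card and then from raw data. The list of lengths is made up for illustration.

```python
import statistics
from math import sqrt

def standard_error(sample_sd, n):
    """Standard error of the mean: s / sqrt(n)."""
    return sample_sd / sqrt(n)

# Numbers from the card: s = 2, n = 100.
se = standard_error(2, 100)
print(se)  # 0.2

# Starting from raw sample values (hypothetical lengths in inches).
lengths = [9.8, 10.4, 10.1, 9.6, 10.3, 9.9, 10.2, 9.7]
se_from_data = standard_error(statistics.stdev(lengths), len(lengths))
print(round(se_from_data, 3))
```

`statistics.stdev` is the sample standard deviation (it divides by n - 1), which matches the `s` in the formula.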

Answer 118

The central limit theorem states that the sampling distribution of the mean approaches a normal distribution as the sample size increases. And, as you sample more observations from a population, the sample mean gets closer to the population mean. The central limit theorem can help you infer population parameters like the mean even if you only have available data on a portion of the population. The larger your sample size, the more precise your estimate of the population mean is likely to be.

Answer 119

You don’t need to know the shape of your population distribution in advance to apply the theorem: the distribution could be bell-shaped, skewed, or have another shape. If you collect a large enough sample, the shape of your sampling distribution will approximately follow a normal distribution.
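
A small simulation can illustrate the theorem. The sketch below assumes a strongly right-skewed population (exponential, with mean 1 and standard deviation 1), draws many samples of size 50, and looks at how the sample means behave. The population, sample size, and number of repetitions are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from a right-skewed exponential population (mean 1)."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))

sample_size = 50
sample_means = [sample_mean(sample_size) for _ in range(2000)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = 1.0 / sample_size ** 0.5  # population sd / sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_se, 3))
```

Even though individual values are heavily skewed, the sample means cluster around the population mean of 1, with a spread close to the standard error. A histogram of `sample_means` would look roughly bell-shaped.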

Answer 120

In order to apply the central limit theorem, the following conditions must be met:
* Randomization
* Independence
* 10%
* Sample size

Answer 121

Your sample data must be the result of random selection. Random selection means that every member in the population has an equal chance of being chosen for the sample.

Answer 122

Your sample values must be independent of each other. Independence means that the value of one observation does not affect the value of another observation. Typically, if you know that the individuals or items in your dataset were selected randomly, you can also assume independence.

Answer 123

Your sample size should be no larger than 10% of the total population. This applies when the sample is drawn without replacement, which is usually the case.

Answer 124

The sample size needs to be sufficiently large.
* Requirements for precision: The larger the sample size, the more closely your sampling distribution will resemble a normal distribution, and the more precise your estimate of the population mean will be.
* The shape of the population: If your population distribution is roughly bell-shaped and already resembles a normal distribution, the sampling distribution of the sample mean will be close to a normal distribution even with a small sample size.
* In general, many statisticians and data professionals consider a sample size of 30 to be sufficient when the population distribution is roughly bell-shaped, or approximately normal.
* However, if the original population is not normal (for example, if it’s extremely skewed or has lots of outliers), data professionals often prefer the sample size to be a bit larger. Exploratory data analysis can help you determine how large of a sample is necessary for a given dataset.

Answer 125

* refers to the percentage of individuals or elements in a population that share a certain characteristic.
* Proportions measure percentages or parts of a whole.

Answer 126

* to estimate the proportion of all visitors to a website who make a purchase before leaving.
* assembly line products that meet quality control standards
* voters who support a candidate in an upcoming election.

Answer 127

* As with the sample means, the central limit theorem also applies to sample proportions.
* As your sample size increases, the distribution of the sample proportion will be approximately normal.
* The overall average or mean proportion is located in the center of the curve.
* If you take a sufficiently large sample of teenagers, the sample proportion will be an accurate estimate of the true population proportion.
* If you survey 1000 teenagers and find that 10% prefer slip-on sneakers, this means that your best estimate for the proportion of all teenagers who prefer slip-ons is also 10%.

Answer 128

* As with the sample mean, you can use the standard error of the proportion to measure sampling variability.
* This tells you how much a particular sample proportion is likely to differ from the true population proportion.
* This is useful to know, because the proportion varies from sample to sample. And any given sample proportion probably won't be exactly equal to the true population proportion.
* The true proportion of teenagers who prefer slip-on sneakers might be 10%, but the proportion of any given sample might be 12%, 9%, 7% and so on.
* The more variability in your sample data, the less likely it is that the sample proportion is an accurate estimate of the population proportion.
* It's important to understand the accuracy of your estimate, because stakeholder decisions are often based on the estimates you provide.
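
The cards do not give a formula for the standard error of the proportion. A commonly used one is the square root of p(1 - p) / n, where p is the sample proportion and n is the sample size. The sketch below applies it to the sneaker survey figures (10% of 1000 teenagers) mentioned in these cards.

```python
from math import sqrt

def standard_error_of_proportion(p_hat, n):
    """Standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)."""
    return sqrt(p_hat * (1 - p_hat) / n)

se = standard_error_of_proportion(0.10, 1000)
print(round(se, 4))
```

The result is a little under one percentage point, so sample proportions such as 9% or 11% would not be surprising if the true proportion is 10%.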

Answer 129

* Standard Deviation tells you how spread out your data is.
* Standard Error focuses on the reliability of the sample mean as an estimate of the population mean.

(153 cards)