Statistics Flashcards

- Sources: OpenStax Introductory Statistics; Statistics, 4th ed., Freedman

1
Q

Average

A

A number that describes the central tendency of the data

average = sum of entries / number of entries
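As a minimal Python sketch of the formula above (the data values are made up):

```python
# average = sum of entries / number of entries
entries = [2, 4, 6, 8]  # hypothetical data
average = sum(entries) / len(entries)
print(average)  # → 5.0
```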

2
Q

Blinding

A

Not telling a subject which treatment they are receiving

3
Q

Categorical Variable

A

Variables that take on values that are names or labels

4
Q

Cluster Sampling

A

A method for selecting a random sample: divide the population into groups (clusters), then use simple random sampling to select a set of clusters. Every individual in the chosen clusters is included in the sample.

5
Q

Continuous Random Variable

A

A random variable (RV) whose outcomes are measured; the height of trees in the forest is a continuous RV.

6
Q

Control Group

A

A group in a randomized experiment that receives an inactive treatment but is otherwise managed exactly as the other groups

7
Q

Convenience Sampling

A

A nonrandom method of selecting a sample; this method selects individuals that are easily accessible and may result in biased data.

8
Q

Cumulative Relative Frequency

A

The term applies to an ordered set of observations from smallest to largest. The cumulative relative frequency is the sum of the relative frequencies for all values that are less than or equal to the given value.

9
Q

Data

A

A set of observations (a set of possible outcomes); most data can be put into two groups: qualitative (an attribute whose value is indicated by a label) or quantitative (an attribute whose value is indicated by a number). Quantitative data can be separated into two subgroups: discrete and continuous. Data is discrete if it is the result of counting (such as the number of students of a given ethnic group in a class or the number of books on a shelf). Data is continuous if it is the result of measuring (such as distance traveled or weight of luggage)

10
Q

Double-blind experiment

A

An experiment in which both the subjects of an experiment and the researchers who work with the subjects are blinded

11
Q

Triple-blind experiment

A

An experiment in which the subjects, the researchers who work with the subjects, and the analysts who analyze the data are all blinded

12
Q

Experimental Unit

A

Any individual or object to be measured

13
Q

Explanatory Variable

A

The independent variable in an experiment; the value controlled by researchers

14
Q

Frequency

A

The number of times a value of the data occurs

15
Q

Institutional Review Board

A

A committee tasked with oversight of research programs that involve human subjects

16
Q

Informed Consent

A

Any human subject in a research study must be cognizant of any risks or costs associated with the study. The subject has the right to know the nature of the treatments included in the study, their potential risks, and their potential benefits. Consent must be given freely by an informed, fit participant.

17
Q

Lurking Variable

A

A variable, not included in the experiment, that has an effect on a study even though it is neither an explanatory variable nor a response variable

18
Q

Confounding Variable

A

A difference between the treatment and control groups - other than the treatment itself - that affects the responses being studied. A third variable, associated with both the explanatory and response variables.

“The idea is a bit subtle: a gene that causes cancer but is unrelated to smoking is not a confounder and is sideways to the argument”

The gene would need to A) cause cancer AND B) make people more likely to smoke

Sometimes controlled for by cross-tabulation

19
Q

How is a Lurking Variable different from a Confounding Variable?

A

Lurking = Unknown or unconsidered
Confounding = Known but not controlled for

20
Q

Nonsampling Error/Systematic Error/Bias

A

An issue that affects the reliability of sampling data other than natural variation; it includes a variety of human errors including poor study design, biased sampling methods, inaccurate information provided by study participants, data entry errors, and poor analysis.

21
Q

Numerical Variable

A

Variables that take on values that are indicated by numbers

22
Q

Population Parameter

A

A number that is used to represent a population characteristic and that generally cannot be determined easily

23
Q

Placebo

A

An inactive treatment that has no real effect on the explanatory variable

24
Q

Population

A

All individuals, objects, or measurements whose properties are being studied

25
Q

Probability

A

A number between zero and one, inclusive, that gives the likelihood that a specific event will occur

26
Q

Proportion

A

The number of successes divided by the total number in the sample

27
Q

Qualitative Data

A

Data that has an attribute whose value is indicated by a label

28
Q

Quantitative Data

A

Quantitative (an attribute whose value is indicated by a number) data can be separated into two subgroups: discrete and continuous.
Data is discrete if it is the result of counting (such as the number of students of a given ethnic group in a class or the number of books on a shelf).
Data is continuous if it is the result of measuring (such as distance traveled or weight of luggage).

29
Q

Random Assignment

A

The act of organizing experimental units into treatment groups using random methods

30
Q

Random Sampling

A

A method of selecting a sample that gives every member of the population an equal chance of being selected.

31
Q

Relative Frequency

A

The ratio of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes

32
Q

Representative Sample

A

A subset of the population that has the same characteristics as the population

33
Q

Response Variable

A

The dependent variable in an experiment; the value that is measured for change at the end of an experiment

34
Q

Sample

A

A subset of the population studied

35
Q

Sampling Error (Chance Variation)

A

The natural variation that results from selecting a sample to represent a larger population; this variation decreases as the sample size increases, so selecting larger samples reduces sampling error.

36
Q

Sampling with Replacement

A

Once a member of the population is selected for inclusion in a sample, that member is returned to the population for the selection of the next individual.

37
Q

Sampling without Replacement

A

A member of the population may be chosen for inclusion in a sample only once. If chosen, the member is not returned to the population before the next selection.

38
Q

Simple Random Sampling

A

A straightforward method for selecting a random sample; give each member of the population a number. Use a random number generator to select a set of labels. These randomly selected labels identify the members of your sample.
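The procedure above can be sketched with Python's standard library (the population and seed are made up for illustration):

```python
import random

# Give each member of the population a numeric label, then use a random
# number generator to select a set of labels (without replacement).
population = list(range(1, 101))  # hypothetical population of 100 labeled members
random.seed(0)                    # fixed seed so the sketch is reproducible
sample = random.sample(population, k=10)  # the 10 selected labels
```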

39
Q

Sample Statistic

A

A numerical characteristic of the sample; a statistic estimates the corresponding population parameter.

sample estimate = parameter + bias + chance error/sampling error

40
Q

Stratified Sampling

A

A method for selecting a random sample used to ensure that subgroups of the population are represented adequately; divide the population into groups (strata). Use simple random sampling to identify a proportionate number of individuals from each stratum.

41
Q

Systematic Sampling

A

A method for selecting a random sample; list the members of the population. Use simple random sampling to select a starting point in the population. Let k = (number of individuals in the population)/(number of individuals needed in the sample). Choose every kth individual in the list starting with the one that was randomly selected. If necessary, return to the beginning of the population list to complete your sample.
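A sketch of the same steps in Python (the population size and sample size are hypothetical):

```python
import random

population = list(range(1000))   # hypothetical ordered list of members
n_needed = 50
k = len(population) // n_needed  # k = population size / sample size
random.seed(1)
start = random.randrange(len(population))  # randomly selected starting point
# choose every kth individual, wrapping to the beginning of the list if needed
sample = [population[(start + i * k) % len(population)] for i in range(n_needed)]
```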

42
Q

Treatments

A

Different values or components of the explanatory variable applied in an experiment

43
Q

Variable

A

A characteristic of interest for each person or object in a population.

44
Q

Box plot

A

A graph that gives a quick picture of the middle 50% of the data plus the minimum and maximum value.

45
Q

First Quartile

A

The value that is the median of the lower half of the ordered data set

46
Q

Frequency Polygon

A

A graph that shows the frequencies of grouped data. It is a type of frequency diagram that plots the midpoints of the class intervals against the frequencies and then joins up the points with straight lines.

47
Q

Frequency Table

A

A data representation in which grouped data is displayed along with the corresponding frequencies

48
Q

Histogram

A

A graphical representation in x-y form of the distribution of data in a data set; x represents the data and y represents the frequency, or relative frequency. The graph consists of contiguous rectangles - the area of each representing the percent of the data that block/class represents.

49
Q

Interquartile Range (IQR)

A

The range of the middle 50 percent of the data values; the IQR is found by subtracting the first quartile from the third quartile.
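As a sketch using Python's statistics module (the data values are made up; note that different texts use slightly different quartile conventions):

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15]            # hypothetical ordered data
q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
iqr = q3 - q1                                 # third quartile minus first quartile
```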

50
Q

Interval

A

Also called a class interval; an interval represents a range of data and is used when displaying large data sets

51
Q

Mean

A

The sum of all values in the population divided by the number of values

52
Q

Median

A

A number that separates ordered data into halves; half the values are the same number or smaller than the median and half the values are the same number or larger than the median. The median may or may not be part of the data.

53
Q

Midpoint

A

The mean of an interval in a frequency table

54
Q

Mode

A

The value that appears most frequently in a set of data

55
Q

Outlier

A

An observation that does not fit the rest of the data.

Standard test: a value is flagged as an outlier if it lies 2 or more standard deviations from the mean

56
Q

Paired Data Set

A

Two data sets that have a one-to-one relationship so that 1) each data set is the same size and 2) each data point in one data set is matched with exactly one data point in the other data set.

57
Q

Percentile

A

A number that divides ordered data into hundredths; percentiles may or may not be part of the data. The median of the data is the second quartile and the 50th percentile. The first and third quartiles are the 25th and the 75th percentiles, respectively.

FORMULA 1: Percentile of a given value x

P(x) = (n/N) × 100, where n = number of values below x and N = total number of values

FORMULA 2: Rank (position in the ordered data) of a given percentile P

n = (P × N)/100
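The two formulas above, as hypothetical Python helpers (the names and data are made up):

```python
# FORMULA 1: percentile of a given value x
def percentile_of(data, x):
    n = sum(1 for v in data if v < x)  # number of values below x
    N = len(data)                      # total number of values
    return (n / N) * 100

# FORMULA 2: rank (position in the ordered data) of a given percentile P
def rank_of_percentile(P, N):
    return (P * N) / 100

scores = [40, 50, 55, 60, 65, 70, 75, 80, 90, 100]
print(percentile_of(scores, 70))  # 5 of 10 values are below 70 → 50.0
```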

58
Q

Quartiles

A

The numbers that separate the data into quarters; quartiles may or may not be part of the data. The second quartile is the median of the data.

59
Q

Skewed

A

Used to describe data that is not symmetrical; when the right side of a graph looks “chopped off” compared to the left side, we say it is “skewed to the left.” When the left side of the graph looks “chopped off” compared to the right side, we say the data is “skewed to the right.” Alternatively: when the lower values of the data are more spread out, we say the data are skewed to the left. When the greater values are more spread out, the data are skewed to the right.

Outliers in the tail pull the mean from the center towards the longer tail.

60
Q

Standard Deviation

A

A number that is equal to the square root of the variance and measures how far data values are from their mean; notation: s for sample standard deviation and σ for population standard deviation.

SD = the rms (root-mean-square) size of the deviations from the average; it measures how far values are from the mean, on average.

Deviation = Value - Average
SD = sqrt(avg(deviations^2)) (population form; the sample SD divides by n - 1 instead of n)
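A numeric sketch of the formula above, in its population (rms) form; the data values are made up:

```python
import math

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with average 5.0
avg = sum(values) / len(values)
deviations = [v - avg for v in values]  # deviation = value - average
sd = math.sqrt(sum(d * d for d in deviations) / len(values))  # rms of deviations
print(sd)  # → 2.0
```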

61
Q

Variance

A

Mean of the squared deviations from the mean, or the square of the standard deviation; for a set of data, a deviation can be represented as x – x¯ where x is a value of the data and x¯ is the sample mean. The sample variance is equal to the sum of the squares of the deviations divided by the difference of the sample size and one.

62
Q

AND Event

A

An outcome is in the event A AND B if the outcome is in both A AND B at the same time.

63
Q

Complement Event

A

The complement of event A consists of all outcomes that are NOT in A.

64
Q

Contingency Table

A

The method of displaying a frequency distribution as a table with rows and columns to show how two variables may be dependent (contingent) upon each other; the table provides an easy way to calculate conditional probabilities.

65
Q

Conditional Probability

A

The likelihood that an event will occur given that another event has already occurred

66
Q

Dependent Events

A

If two events are NOT independent, then we say that they are dependent.

67
Q

Equally Likely

A

Each outcome of an experiment has the same probability.

68
Q

Event

A

A subset of the set of all outcomes of an experiment; the set of all outcomes of an experiment is called a sample space and is usually denoted by S. An event is an arbitrary subset in S. It can contain one outcome, two outcomes, no outcomes (empty subset), the entire sample space, and the like. Standard notations for events are capital letters such as A, B, C, and so on.

69
Q

Experiment

A

A planned activity carried out under controlled conditions

70
Q

Independent Events

A

The occurrence of one event has no effect on the probability of the occurrence of another event. Events A and B are independent if one of the following is true:

P(A|B) = P(A)
P(B|A) = P(B)
P(A AND B) = P(A)P(B)
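A quick check of the third condition on a two-dice sample space (an illustrative sketch; the events are chosen to be independent):

```python
from itertools import product

S = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes for two dice

def prob(event):
    return len(event) / len(S)

A = {o for o in S if o[0] == 6}      # event A: first die shows 6
B = {o for o in S if o[1] % 2 == 0}  # event B: second die is even
# independence: P(A AND B) equals P(A) * P(B)
print(abs(prob(A & B) - prob(A) * prob(B)) < 1e-12)  # → True
```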

71
Q

Mutually Exclusive

A

Two events are mutually exclusive if the probability that they both happen at the same time is zero. If events A and B are mutually exclusive, then P(A AND B) = 0.

72
Q

Or Event

A

An outcome is in the event A OR B if the outcome is in A or is in B or is in both A and B.

73
Q

Outcome

A

A particular result of an experiment

74
Q

Sample Space

A

The set of all possible outcomes of an experiment

75
Q

Tree Diagram

A

The useful visual representation of a sample space and events in the form of a “tree” with branches marked by possible outcomes together with associated probabilities (frequencies, relative frequencies)

76
Q

Venn Diagram

A

The visual representation of a sample space and events in the form of circles or ovals showing their intersections

77
Q

Bernoulli Trials

A

An experiment with the following characteristics:

  1. There are only two possible outcomes called “success” and “failure” for each trial.
  2. The probability p of a success is the same for any trial (so the probability q = 1 − p of a failure is the same for any trial).
78
Q

Binomial Experiment

A

A statistical experiment that satisfies the following three conditions:

  1. There are a fixed number of trials, n.
  2. There are only two possible outcomes, called “success” and, “failure,” for each trial. The letter p denotes the probability of a success on one trial, and q denotes the probability of a failure on one trial.
  3. The n trials are independent and are repeated using identical conditions.
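Under these conditions the probability of x successes in n trials can be computed directly (a sketch; the values of n, p, and x are made up):

```python
from math import comb

# P(X = x) = C(n, x) * p^x * q^(n - x) for a binomial experiment
def binomial_pmf(x, n, p):
    q = 1 - p  # probability of failure on one trial
    return comb(n, x) * p**x * q**(n - x)

print(binomial_pmf(2, 4, 0.5))  # 2 successes in 4 fair-coin trials → 0.375
```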
79
Q

Expected Value

A

Expected arithmetic average when an experiment is repeated many times.

The average value of values generated by a chance process
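For example, the expected value of one roll of a fair six-sided die (a standard illustration, not from the source):

```python
# expected value = sum over outcomes of (value × probability)
outcomes = [1, 2, 3, 4, 5, 6]
ev = sum(x * (1 / 6) for x in outcomes)  # each outcome has probability 1/6
# ev ≈ 3.5
```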

80
Q

Binomial Probability Distribution

A

A discrete random variable (RV) that arises from Bernoulli trials; there are a fixed number, n, of independent trials. “Independent” means that the result of any trial (for example, trial one) does not affect the results of the following trials, and all trials are conducted under the same conditions. Under these circumstances the binomial RV X is defined as the number of successes in n trials. The notation is: X ~ B(n, p)

81
Q

Geometric Distribution

A

A discrete random variable (RV) that arises from the Bernoulli trials; the trials are repeated until the first success. The geometric variable X is defined as the number of trials until the first success.

82
Q

Geometric Experiment

A

A statistical experiment with the following properties:

  1. There are one or more Bernoulli trials with all failures except the last one, which is a success.
  2. In theory, the number of trials could go on forever. There must be at least one trial.
  3. The probability, p, of a success and the probability, q, of a failure do not change from trial to trial.
83
Q

Hypergeometric Experiment

A

A statistical experiment with the following properties:

  1. You take samples from two groups.
  2. You are concerned with a group of interest, called the first group.
  3. You sample without replacement from the combined groups.
  4. Each pick is not independent, since sampling is without replacement.
  5. You are not dealing with Bernoulli Trials.
84
Q

Hypergeometric Probability

A

A discrete random variable (RV) that is characterized by:

  1. A fixed number of trials.
  2. The probability of success is not the same from trial to trial.

We sample from two groups of items when we are interested in only one group. X is defined as the number of successes out of the total number of items chosen.

85
Q

Probability Distribution Function (PDF)

A

A mathematical description of a discrete random variable (RV), given either in the form of an equation (formula) or in the form of a table listing all the possible outcomes of an experiment and the probability associated with each outcome.

86
Q

Poisson Probability Distribution

A

A discrete random variable (RV) that counts the number of times a certain event will occur in a specific interval; characteristics of the variable:

  1. The probability that the event occurs in a given interval is the same for all intervals.
  2. The events occur with a known mean and independently of the time since the last event.

The Poisson distribution is often used to approximate the binomial distribution, when n is “large” and p is “small” (a general rule is that n should be greater than or equal to 20 and p should be less than or equal to 0.05).
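The approximation described above can be checked numerically (the choices of n, p, and tolerance are illustrative):

```python
from math import comb, exp, factorial

n, p = 100, 0.02  # "large" n, "small" p
lam = n * p       # matching Poisson mean, λ = np

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

# the two pmfs agree closely for small k
for k in range(5):
    assert abs(binom_pmf(k) - poisson_pmf(k)) < 0.01
```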

87
Q

Random Variable (RV)

A

A characteristic of interest in a population being studied; common notation for variables is upper-case Latin letters X, Y, Z,…; common notation for a specific value from the domain (set of all possible values of a variable) is lower-case Latin letters x, y, and z. For example, if X is the number of children in a family, then x represents a specific integer 0, 1, 2, 3,…. Variables in statistics differ from variables in intermediate algebra in the following two ways.

  1. The domain of the random variable (RV) is not necessarily a numerical set; the domain may be expressed in words; for example, if X = hair color then the domain is {black, blond, gray, green, orange}.
  2. We can tell what specific value x the random variable X takes only after performing the experiment.
88
Q

The Law of Large Numbers

A

Sample mean approaches population mean as sample size increases.

As the number of trials in a probability experiment increases, the difference between the theoretical probability of an event and the relative frequency probability approaches zero.

89
Q

Decay parameter

A

The decay parameter describes the rate at which probabilities decay to zero for increasing values of x.

90
Q

Exponential Distribution

A

A continuous random variable (RV) that appears when we are interested in the intervals of time between some random events, for example, the length of time between emergency arrivals at a hospital.

91
Q

Memoryless property

A

For an exponential random variable X, the memoryless property is the statement that knowledge of what has occurred in the past has no effect on future probabilities.

92
Q

Poisson Distribution

A

If there is a known average of λ events occurring per unit time, and these events are independent of each other, then the number of events X occurring in one unit of time has the Poisson distribution.

93
Q

Uniform Distribution

A

A discrete or continuous random variable (RV) that has equally likely outcomes over the domain, a < x < b.

94
Q

Normal Distribution

A

Also known as the Gaussian distribution; visualized as a symmetrical bell-shaped curve. The most frequent values cluster around the center, and the probability of finding values far away from the center tapers off gradually in both directions.

95
Q

Standard Normal Distribution / Z-Distribution

A

A normal distribution with mean 0 and standard deviation of 1

96
Q

Z-score

A

Statistical value that describes a specific data point’s relative position within a standard normal distribution (aka Z-Distribution). It tells you how many standard deviations a particular point is away from the mean (average) of the data, expressed in standard deviation units.
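As a one-line sketch (the numbers are made up):

```python
# z = (x - mean) / standard deviation
x, mean, sd = 85, 70, 10
z = (x - mean) / sd
print(z)  # → 1.5, i.e. 85 lies 1.5 standard deviations above the mean
```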

97
Q

Central Limit Theorem (CLT)

A

The sampling distribution of the mean for a variable approximates a normal distribution with increasing sample size.

Given a random variable (RV) with known mean μ, known standard deviation σ, and known sample size n - if n is sufficiently large, then the distribution of the sample means and the distribution of the sample sums will approximate a normal distribution regardless of the shape of the population. The mean of the sample means will equal the population mean, and the mean of the sample sums will equal n times the population mean.

As sample size, n, increases, the standard deviation of the sampling distribution becomes smaller, because the square root of the sample size is in the denominator. In other words, the sampling distribution clusters more tightly around the mean as sample size increases.

Why is it important?

1. Normality Assumption: allows us to use hypothesis tests that rely on normally distributed data even when the underlying data are not normally distributed.
2. Precision of Estimates: we can make our estimates more precise by increasing our sample size.
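The theorem can be illustrated by simulation with the standard library (the population, sample size, and seed are all made up):

```python
import random
import statistics

random.seed(42)

# A strongly skewed (non-normal) population: exponential with mean 1
def draw():
    return random.expovariate(1.0)

n = 50  # sample size
sample_means = [statistics.fmean(draw() for _ in range(n)) for _ in range(2000)]
# The 2000 sample means pile up in a roughly normal shape centered near the
# population mean (1.0), with spread near sigma / sqrt(n) = 1 / sqrt(50) ≈ 0.14
```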

98
Q

Sampling Distribution of the Mean

A

Describes the probability distribution of all possible means you could get if you draw multiple random samples of size n from a population.

Given simple random samples of size n from a given population with a measured characteristic such as mean, proportion, or standard deviation for each sample, the probability distribution of all the measured characteristics is called a sampling distribution.

99
Q

Standard Error of the Mean

A

The standard deviation of the distribution of the sample means around the population mean.

100
Q

Binomial Distribution

A

A discrete random variable (RV) which arises from Bernoulli trials; there are a fixed number, n, of independent trials. “Independent” means that the result of any trial (for example, trial 1) does not affect the results of the following trials, and all trials are conducted under the same conditions. Under these circumstances the binomial RV X is defined as the number of successes in n trials.

101
Q

Confidence Interval (CI)

A

An interval estimate for an unknown population parameter. This depends on:

1. the desired confidence level,
2. information that is known about the distribution (for example, known standard deviation),
3. the sample and its size.
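For a mean with known population standard deviation, the interval can be sketched as follows (all numbers are hypothetical):

```python
import math
from statistics import NormalDist

xbar, sigma, n = 68.0, 3.0, 36  # sample mean, known sigma, sample size
confidence = 0.95
z_star = NormalDist().inv_cdf((1 + confidence) / 2)  # ≈ 1.96 for 95% confidence
margin = z_star * sigma / math.sqrt(n)               # error bound for the mean
ci = (xbar - margin, xbar + margin)
# ci ≈ (67.02, 68.98)
```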

102
Q

Confidence Level (CL)

A

The percent expression for the probability that the confidence interval contains the true population parameter; for example, if the CL = 90%, then in 90 out of 100 samples the interval estimate will enclose the true population parameter.

Confidence Level = 1 − α

103
Q

Degrees of Freedom (df)

A

The number of objects in a sample that are free to vary. Usually, df = n − 1

104
Q

Error Bound for a Population Mean (EBM) / Margin of Error

A

The margin of error; depends on the confidence level, sample size, and known or estimated population standard deviation.

105
Q

Inferential Statistics

A

Also called statistical inference or inductive statistics; this facet of statistics deals with estimating a population parameter based on a sample statistic. For example, if four out of the 100 calculators sampled are defective we might infer that four percent of the production is defective.

106
Q

Point Estimate

A

A single number computed from a sample and used to estimate a population parameter

107
Q

Student’s t-Distribution

A

Investigated and reported by William S. Gosset in 1908 and published under the pseudonym Student; the major characteristics of the random variable (RV) are:

  1. It is continuous and assumes any real value.
  2. The pdf is symmetrical about its mean of zero. However, it is more spread out and flatter at the apex than the normal distribution.
  3. It approaches the standard normal distribution as n gets larger (for n ≥ 30 it closely follows the normal distribution).
  4. There is a “family” of t-distributions: each member of the family is completely defined by its number of degrees of freedom, which is one less than the number of data values.
108
Q

Hypothesis

A

A statement about the value of a population parameter; in the case of two hypotheses, the statement assumed to be true is called the null hypothesis (notation H0) and the contradictory statement is called the alternative hypothesis (notation Ha).

109
Q

Hypothesis Testing

A

Statistical analysis that uses sample data to assess two mutually exclusive theories about the properties of a population.

Based on sample evidence, a procedure for determining whether the hypothesis stated is a reasonable statement and should not be rejected or is unreasonable and should be rejected.

  1. Understand Sampling Distribution
  2. Understand test statistic given sampling distribution
  3. Run Test and receive p-value
  4. Evaluate statistical significance and decision

Hypothesis tests work by taking the observed test statistic from a sample and using the sampling distribution to calculate the probability of obtaining that test statistic if the null hypothesis is correct.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
110
Q

Level of Significance of the Test

A

Probability of a Type I error (reject the null hypothesis when it is true). Notation: α. In hypothesis testing, the Level of Significance is called the preconceived α or the preset α.

Q: Is this true? See Hypothesis Testing by Jim Frost.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
111
Q

p-value

A

The probability that an event will happen purely by chance assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence is against the null hypothesis.

Probability of observing a sample statistic that is at least as extreme as our sample statistic when we assume that the null hypothesis is correct

Indicates the strength of the sample evidence against the null hypothesis. Probability we would obtain the observed effect, or larger, if the null hypothesis is correct.

If the p-value is less than or equal to the significance level, you reject the null hypothesis, in favor of your alternative hypothesis, and our results are statistically significant.

  • When the p-value is low, the null must go
  • If the p-value is high, the null will fly

Whenever we see a p-value, we know we are looking at the results of a hypothesis test.

Determines statistical significance for a hypothesis test

P-values are NOT an error rate!

The chance of getting a test statistic assuming the null hypothesis is true. Not the chance of the null hypothesis being correct.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
112
Q

Type I Error

A

Alpha (α)

The decision is to reject the null hypothesis when, in fact, the null hypothesis is true.

Ex. Saying there is a change when there is not

Ex. Fire alarm going off when there is no fire

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
113
Q

Type II Error

A

Beta (β)

The decision is not to reject the null hypothesis when, in fact, the null hypothesis is false.

Ex. Saying there is not a change when there is.

Ex. Fire alarm does NOT ring when there is a fire

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
114
Q

Descriptive Statistics

A

Describes/Summarizes a dataset for a particular group of objects, observations, or people. No attempt to generalize beyond the set of observations.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
115
Q

Statistics

A

Science concerned with collecting, organizing, analyzing, interpreting, and presenting data. It equips us with the tools and methods to extract meaningful insights from information.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
116
Q

Continuous Data

A

TODO

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
117
Q

Discrete Data

A

TODO

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
118
Q

Interval Scale Data

A

TODO

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
119
Q

Ratio Scale Data

A

TODO

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
120
Q

Binary Data

A

TODO

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
121
Q

Ordinal Data

A

TODO

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
122
Q

Multimodal Distribution

A

A distribution that has multiple peaks

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
123
Q

4 Categories of Descriptive/Summary Statistics

A
  1. Central Tendency
  2. Spread/Dispersion
  3. Correlation/Dependency
  4. Shape of Distribution
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
124
Q

Second Quartile

A

aka the Median of the data. Splits the entire ordered dataset into 2 equal parts

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
125
Q

Third Quartile

A

The value that is the median of the upper half of the ordered data set

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
126
Q

Central Tendency

A

Measure of the center point or typical value of a dataset

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
127
Q

Variability

A

Measure of the amount of dispersion in a dataset. How spread out the values in the dataset are.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
128
Q

Range

A

The difference between the MAX and MIN value in a dataset.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
129
Q

Spurious Correlation

A

Situation where two variables appear to have an association, but in reality, a third factor (confounding/lurking variable) causes that association.

130
Q

Probability Distribution

A

Mathematical function that describes the probabilities for all possible outcomes of a random variable

131
Q

Correlation

A

Indicates how much one variable (the dependent variable) changes in response to change in another variable (the independent variable). Value between -1 and 1.

Strength of association between 2 variables.

Strength:
- 1 = perfect relationship
- 0.8 = strong
- 0.6 = moderate
- 0.4 = weak
- 0 = no relationship

Direction:
- Positive = upward slope
- Negative = downward slope

132
Q

Discrete Probability Distribution / Probability Mass Functions

A

Probability distributions for discrete data (assume a set of distinct values)

133
Q

Continuous Probability Distribution / Probability Density Functions

A

Probability distribution for continuous data (which can assume an infinite number of values between any two values)

134
Q

Negative Binomial Distribution

A

Discrete probability distribution used to calculate the number of trials that are required to observe an event a specific number of times. In other words, given a known probability of an event occurring and the number of events that we specify, this distribution calculates the probability for observing that number of events within N trials.

135
Q

Empirical Rule

A

Can be used to determine the proportion of values that fall within a specific number of standard deviations from the mean for data that is normally distributed.

STDDEV | PERCENTAGE OF DATA
1 | 68%
2 | 95%
3 | 99.7%
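The rule above is easy to verify by simulation. This is an illustrative Python sketch (not from the source texts): draw from a normal distribution and count the share of values within 1, 2, and 3 SDs of the mean.

```python
import random
import statistics

random.seed(42)
draws = [random.gauss(0, 1) for _ in range(100_000)]
mean = statistics.fmean(draws)
sd = statistics.pstdev(draws)

def share_within(k):
    """Fraction of draws within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in draws) / len(draws)

for k, expected in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    print(f"within {k} SD: {share_within(k):.3f} (rule says ~{expected})")
```

The simulated shares land close to 68%, 95%, and 99.7%, as the rule predicts for normally distributed data.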

136
Q

Standardization

A

Data transformation that creates a standard normal distribution (Z-distribution) with a mean of 0 and standard deviation of 1. Allows for easier comparison between features with different scales.

137
Q

Normalization

A

Data transformation that rescales the data to a specific range, often between 0 and 1 or -1 and 1 (min-max normalization). Ensures that all features are on a similar scale, preventing features with larger ranges from dominating the model during training.
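The two transformations on these cards can be sketched in a few lines of Python (illustrative only; the sample data are made up):

```python
import statistics

def standardize(values):
    """Z-score transform: result has mean 0 and SD 1 (population SD)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def normalize(values):
    """Min-max transform: rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

data = [10, 20, 30, 40, 50]
print(standardize(data))
print(normalize(data))   # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Standardization preserves the shape of the data while putting it in SD units; normalization squeezes it into a fixed range.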

138
Q

Tools for Inferential statistics:

A
  1. Hypothesis Tests
  2. Confidence Intervals
  3. Regression Analysis
139
Q

Precision

A

TODO

140
Q

Random Sampling Methodologies

A
  1. Simple Random Sampling
  2. Stratified Sampling
  3. Cluster Sampling
141
Q

Simple Random Sampling

A

TODO

142
Q

Stratified Sampling

A

TODO

143
Q

Cluster Sampling

A

TODO

144
Q

Independent Variable

A

Variables that we include in the experiment to explain or predict changes in the dependent/response variable.

145
Q

Dependent Variable / Response Variable

A

Variable in the experiment that we want to explain or predict. Values of this variable are dependent on other variables

146
Q

Causality

A

One event (cause) brings about another event (effect)

147
Q

Parametric Statistics

A

Branch of statistics that assumes sample data come from populations that are adequately modeled by probability distributions with a set of parameters.

148
Q

Nonparametric Statistics

A

Branch of statistics that does not assume sample data come from populations that are adequately modeled by probability distributions with a set of parameters.

149
Q

Null Hypothesis

A

H0; One of the two mutually exclusive theories about the population’s properties. Typically states that there is no effect.

150
Q

Alternative Hypothesis

A

HA; The other of the two mutually exclusive theories about the population’s properties. Typically states that the population parameter does not equal the null hypothesis value. In other words, there is a non-zero effect.

151
Q

Effect / Population Effect / Difference

A

The difference between the population value and the null hypothesis value.

Represents the ‘signal’ in the data

152
Q

Significance Level (Alpha, α)

A

Defines how strong the sample evidence must be to conclude an effect exists in the population. Evidentiary standard set before study begins. Specifies how strongly the sample evidence must contradict the null hypothesis before we can reject the null hypothesis. Standard defined by the probability of rejecting a true null hypothesis. In other words, the probability that we say there is an effect when there is no effect.

Evidentiary standard set to determine whether our sample is strong enough to reject the null hypothesis.

Confidence Level = 1 - Alpha(α)

153
Q

Critical Region

A

Defines sample values on the sampling distribution that are improbable enough to warrant rejecting the null hypothesis and therefore represent statistically significant results.

154
Q

Student’s t-Distribution

A

TODO

155
Q

t-test

A

Type of hypothesis test for the mean and uses the Student’s t-distribution sampling distribution and t-value to determine statistical significance

156
Q

t-statistic / t-value

A

Test statistic for t-test hypothesis test. Used alongside Student’s t-Distribution

157
Q

1-Sample t-Test

A

Hypothesis test where we examine a single population and compare it to a base, hypothesized value.

Null: The population mean equals the hypothesized mean
Alternative: The population mean does not equal the hypothesized mean

Assumptions:
1. Representative, random sample
2. Continuous Data
3. Sample data is normally distributed or Sample Size >= 20
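The t-value for this test is easy to compute by hand. A minimal Python sketch (the sample and hypothesized mean are made up for illustration):

```python
import math
import statistics

def one_sample_t(sample, hypothesized_mean):
    """t = (sample mean - hypothesized mean) / (s / sqrt(n))."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)          # sample SD (n - 1 denominator)
    return (mean - hypothesized_mean) / (s / math.sqrt(n))

t = one_sample_t([5, 6, 7, 8, 9], hypothesized_mean=5)
print(f"t = {t:.3f}")  # compare against a t-distribution with n - 1 = 4 df
```

The resulting t-value is then compared against the Student's t-distribution with n - 1 degrees of freedom to obtain a p-value.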

158
Q

Two-Tailed 1-Sample t-Test

A

aka nondirectional or two-sided test because we are testing for effects in both directions. When we perform a two-tailed test, we split the significance level percentage between both tails of the distribution.

Null: The effect equals zero
Alternative: The effect does not equal zero (nothing about direction)

We can detect both positive and negative results. Standard in scientific research where discovering ANY type of effect is usually of interest to researchers.

Default choice

159
Q

One-Tailed 1-Sample t-Test

A

aka directional or one-sided test because we can test for effects in ONLY one direction. When we perform a one-tailed test, the entire significance level percentage goes into one tail of the distribution.

Null: The effect is less than or equal to 0
Alternative: the effect is greater than zero

One-tailed tests have more statistical power to detect an effect in one direction than a two-tailed test with the same design and significance level.

One-tailed tests occur most frequently for studies where one of the following is true:
1. Effects can exist in only one direction
2. Effects can exist in both directions but we only care about an effect in one direction

160
Q

2-Sample t-Test

A

Hypothesis test where we examine two populations and compare them to each other.

Null: The means for the two populations are equal
Alternative: The means for the two populations are not equal

Assumptions:
1. Representative, random sample
2. Continuous Data
3. Sample data is normally distributed or each groups size >= 15
4. Groups are independent
5a. Groups have equal variances
5b. Groups have unequal variances –> Use Welch’s t-test
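Since the card mentions Welch's t-test for unequal variances, here is a rough Python sketch of the Welch t-statistic (sample data made up; the degrees-of-freedom adjustment Welch's method also makes is omitted):

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    va = statistics.variance(a)   # sample variance (n - 1 denominator)
    vb = statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        va / len(a) + vb / len(b)
    )

print(welch_t([1, 2, 3], [4, 5, 6]))
```

Unlike the pooled two-sample t-test, the denominator here keeps each group's variance separate rather than averaging them.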

161
Q

Paired t-Test

A

Hypothesis test where we assess dependent samples, which are two measurements on the same person or item.

A Paired t-Test is a 1-Sample t-Test where the hypothesized value is a zero effect for the difference between the pre- and post-observations.

Null: The mean difference between pre- and post-observations equals zero
Alternative: The mean difference between pre- and post-observations does not equal zero

Assumptions:
1. Representative, random sample
2. Independent observations
3. Continuous data
4. Data is normally distributed or sample size >= 20

162
Q

Why not ACCEPT the null hypothesis?

A

If our test fails to detect an effect, that’s not proof it doesn’t exist. It just means our sample contains an insufficient amount of evidence to conclude that an effect exists.

Lack of proof doesn’t represent proof that something doesn’t exist

163
Q

Test Statistic

A

A value that hypothesis tests calculate from our sample data. This value boils our data down into a single number, which measures the overall DIFFERENCE between our sample data and our null hypothesis.

Represents the signal-to-noise ratio for a particular sample = signal/noise

Measures the difference between the data and what is expected by the null hypothesis

Ex. (hypothesis test, test statistics) = (t-test, t-value)

164
Q

Power

A

Hypothesis Test’s ability to detect an effect that actually exists. The test correctly rejects a false null hypothesis.

80% is the standard benchmark for studies

Power = 1 - Beta(β) = Opposite of Type II errors
Beta(β) = Type II Error

Factors that affect power:
1. Sample Size - Larger = higher power
2. Variability - Lower = higher power
3. Effect Size - Larger = higher power

165
Q

Chi-Square Distribution

A

TODO

166
Q

Goodness-of-Fit Test

A

To assess whether a data set fits a specific distribution, we can apply the goodness-of-fit hypothesis test that uses the chi-square distribution.

Null hypothesis states that the data comes from the assumed distribution.

Test compares observed values against the values you would expect to have if the data followed the assumed distribution.

Test is right-tailed. Each observation or cell category must have an expected value of at least 5.
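The comparison of observed against expected counts boils down to one formula. An illustrative Python sketch (the die-roll counts are made up):

```python
def chi_square_stat(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical: 60 rolls of a die, expected 10 per face if the die is fair
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6
print(chi_square_stat(observed, expected))  # compare to chi-square with 5 df
```

A large statistic means the observed counts sit far from what the assumed distribution predicts, pushing the right-tailed test toward rejection.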

167
Q

Test of Independence

A

To assess whether two factors are independent, we can apply the test of independence that uses the chi-square distribution.

Null hypothesis states that the two factors are independent.

Test compares observed values to expected values. Test is right-tailed. Each observation or cell category must have an expected value of at least 5.

168
Q

Test of Homogeneity

A

To assess whether two datasets are derived from the same distribution, which can be unknown, we can apply the test of homogeneity that uses the chi-square distribution.

Null hypothesis states that the populations of the two data sets come from the same distribution.

Test compares the observed values against the expected values if the two populations followed the same distribution.

Test is right-tailed. Each observation or cell category must have an expected value of at least 5.

169
Q

Test of Single Variance

A

TODO

170
Q

Z-test

A

Hypothesis test used if you know the population standard deviation (companion of the t-test which is used when you only have an estimate of the population standard deviation)

Argument by contradiction, designed to show that the null hypothesis will lead to an absurd conclusion and must therefore be rejected.

171
Q

Statistical Significance

A

Indicates that our sample provides enough evidence to conclude that the effect exists in the population (via p-value).

Factors that influence statistical significance:
1. Effect Size
2. Sample Size
3. Variability

172
Q

Practical Significance

A

Asking whether a statistically significant effect is substantial enough to be meaningful.

173
Q

Power Analysis

A

Managing the tradeoffs between effect size, sample size, and variability to settle on a sample size needed for a hypothesis test.

Our goal is to collect a large enough sample to have sufficient power to detect a meaningful effect but not too large to be wasteful.

174
Q

Linear Equation

A

y = mx+b

y = dependent variable
m = slope
x = independent variable
b = y-intercept

175
Q

Slope (Linear Equation)

A

Tells us how much the dependent variable (y) changes for every one unit increase in the independent variable (x), on average.

176
Q

y-Intercept (Linear Equation)

A

Used to describe the dependent variable when independent variable equals zero.

177
Q

Residual

A

TODO

178
Q

Least-Squares Regression

A

TODO

179
Q

Sum of Squared Errors

A

TODO

180
Q

Correlation Coefficient, r

A

Measures the strength of the linear relationship between x and y.

Value between -1 and 1

181
Q

Coefficient of Determination, r^2

A

Equal to the square of the correlation coefficient, r. When expressed as a percent, r^2 represents the percent of variation in the dependent variable, y, that can be explained by variation in the independent variable x using the regression line.

182
Q

Linear Regression Assumptions

A
  1. Linear
  2. Independent
  3. Normal
  4. Equal Variance
  5. Random

TODO - expand

183
Q

ANOVA (Analysis of Variance)

A

Method of testing whether or not the means of three or more populations are equal

TODO - expand

184
Q

One-Way ANOVA

A

Hypothesis test that allows us to compare multiple group means (3 or more).

Requires one categorical factor for the independent variable and a continuous variable for the dependent variable. The values of the categorical factor divide the continuous data into groups. The test determines whether the mean differences between these groups are statistically significant.

Null: All group means are equal (F-value = 1)
Alternative: Not all group means are equal (F-value != 1)

Assumptions:
1. Random Samples
2. Independent Groups
3. Dependent variable is continuous
4. Independent variable is categorical
5. Sample data is normally distributed or each group has 15 or more observations
6. Groups should have roughly equal variances, or use Welch’s ANOVA
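The F-value this test produces is a ratio of two mean squares. A minimal Python sketch (the three small groups are made up):

```python
import statistics

def one_way_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum(
        sum((v - statistics.fmean(g)) ** 2 for v in g) for g in groups
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

print(one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```

When the group means are all equal, the between-group variation looks like the within-group variation and F sits near 1; spread-out group means push F above 1.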

185
Q

Two-Way ANOVA

A

Used to assess differences between group means that are defined by two categorical factors.

More closely resembles Linear Regression. (Just use linear regression)

Assumptions:
1. Dependent variable is continuous
2. The 2 independent variables are categorical
3. Random residuals with constant variance

186
Q

F-Distribution

A

Distribution used in an F-test.

TODO

187
Q

F-statistic / F-value / F-ratio

A

Test statistic used in an F-test alongside an F-Distribution.

Ratio of two variances, or technically, two mean squares.

188
Q

Test of Two Variances

A

TODO

189
Q

F-test

A

Type of hypothesis test that uses the F-distribution as the sampling distribution and the F-statistic to determine statistical significance (e.g., ANOVA for comparing means, or the test of two variances)

190
Q

Post Hoc Tests with ANOVA

A

Used to explore differences between multiple group means while controlling the experiment-wise error rate, as ANOVA will only tell us IF there exists a difference between ANY of the group means.

Types:
- Tukey’s Method
- Dunnett’s Method
- Hsu’s MCB

191
Q

The statistical lesson: the treatment and control groups should be as _______ as possible, except for the _________.

A

similar, treatment

192
Q

The treatment and control groups must be drawn from the same ________ population in order to remain _______.

A

eligible, unbiased

193
Q

Historical Controls

A

Patients that were part of a control group in the past. These observations are not contemporaneous with the current treatment group

194
Q

Method of Comparison

A

Fundamental statistical method where we understand the effect of a treatment by establishing a CONTROL and a TREATMENT group where the “only” difference is the treatment. Any difference then observed between these two groups is associated with/caused by the treatment.

Caution: Confounding variables

195
Q

Controlled Experiment

A

There is a test and control group in the experiment, but the members of the groups are not randomly assigned from an eligible group.

196
Q

Randomized Controlled Experiment

A

Experiment where members of control and treatment group are randomly drawn from an “eligible” group

197
Q

Contemporaneous Controls

A

Control groups that are drawn at the same time as the treatment group. Opposite of “Historical Controls”.

198
Q

Observational Study

A

A study where researchers do not assign experimental units into the treatment and control groups but rather observe experimental units that assign themselves to the different groups.

Some subjects have the condition whose effects are being studied; this is the treatment group. The other subjects are the controls.

Can only establish association.

199
Q

Association

A

A formal relationship between two variables measured by correlation.

Circumstantial evidence for causation.

200
Q

Difference between Association and Causation?

A

Association = two variables (X, Y) have a correlation. When one variable moves, the other moves predictably. Unknown if variable X causes the move in Y or not; we just know Y responds predictably.

Causation = same as association but we have performed a randomized-controlled experiment to confirm causation. X causes the move in Y.

201
Q

Simpson’s Paradox

A

A statistical phenomenon where a trend appears in several groups of data but disappears or reverses when the groups are combined. This can occur when there is a confounding variable that distorts the relationship between the variables of interest.

Ex. Admissions rates between men and women at UC Berkeley

202
Q

In a histogram, the _____ of the blocks represents percentages

A

Area

203
Q

Distribution Table

A

A table that shows the percentage of data found in each class interval

204
Q

To figure out the height of a block, in a histogram, over a class interval, divide the percentage the block represents by the ______ of the class interval

A

Length

205
Q

Density Scale

A

Y-axis on a histogram representing “percent per” horizontal unit

Ex. Percent per Thousand Dollars

206
Q

In a histogram, the height of a block represents ________ - percentage per horizontal unit

A

Crowding

207
Q

With the density scale on the vertical axis, the areas of the blocks come out as a _______. The area under the histogram over an interval equals the percentage of cases in that ________. The total area under the histogram = ___%

A

Percent
Interval
100

208
Q

Root Mean Square (rms)

A

Measures how big the entries of a list are, neglecting signs

rms size of a list = sqrt(avg(entries^2))

Name in reverse to perform rms:
1. SQUARE the Entries
2. Take their MEAN
3. ROOT the average
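The three steps above translate directly to Python (the example list is made up to show that signs don't matter):

```python
import math

def rms(entries):
    """Root mean square: SQUARE the entries, take their MEAN, then ROOT it."""
    return math.sqrt(sum(e ** 2 for e in entries) / len(entries))

print(rms([0, 5, -8, 7, -3]))   # negatives contribute the same as positives
```

Because every entry is squared first, positive and negative values cannot cancel, which is exactly why rms is preferred over a plain average for measuring size (see the next card).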

209
Q

Why do statisticians use the rms instead of the average when trying to understand the average size of values in a list?

A

Average is not good if the list entries are both positive AND negative because the positives and the negatives can cancel each other out leaving the average value around 0.

rms gets around this problem by squaring all entries first

210
Q

The Standard Deviation (SD) is the ____ size of the deviations from the _______.

A

rms
average

211
Q

Cross-Sectional Study

A

Study where subjects are compared to each other at one point in time

212
Q

Longitudinal Study

A

Study where subjects are compared to themselves at different points in time

213
Q

What is the difference between a “Cross-Sectional” and “Longitudinal” Study?

A

Cross-Sectional compares many subjects at the same time and longitudinal looks at the same subjects but over many periods of time.

214
Q

When your data is skewed, it is better to use the ______ instead of the _______ as a measure of central tendency for the data.

A

Median
Average

215
Q

__________ can be used to summarize data that does not follow the normal distribution.

A

Percentiles

216
Q

If you ADD the same number, n, to every entry in a list, the average increases by ________

A

n

217
Q

If you ADD the same number, n, to every entry in a list, the standard deviation increases by ________

A
Does not change.
218
Q

If you MULTIPLY every entry in a list by the same number, n, the average becomes ________

A

AVG * n (the average is multiplied by n)

219
Q

If you MULTIPLY every entry in a list by the same number, n, the standard deviation becomes ________

A

SD * |n| (the SD is multiplied by the absolute value of n, since an SD is never negative)

220
Q

Change of Scale

A

When you change from one unit to another by performing a constant operation on all entries in a list

221
Q

In an ideal world, if the same thing is measured several times, the same result would be obtained each time. In practice, there are differences, and each result is thrown off by _____________, and the error changes from measurement to measurement.

A

Chance Error

222
Q

The SD of a series of repeated measurements estimates the likely size of the _______________ in a single measurement

A

Chance Error

223
Q

Chance Error

A

Refers to the random fluctuations that occur in data due to sampling variability.

The likely size of this value being the SD of a sequence of repeated measurements made under the same conditions.

For a chance model, the chance error is measured by the standard error

224
Q

Measurement Error

A

Occurs when the value obtained from a measurement differs from the true value of the quantity being measured. This error can arise due to various factors, including chance error and systematic error (or bias)

225
Q

Individual Measurement = exact value + _______ + _______

A

Bias
Chance Error

226
Q

_________ affects all measurements the same way, pushing them in the same direction. _____________ changes from measurement to measurement, sometimes up and sometimes down.

A

Bias
Chance Error

227
Q

Slope

A

Rise/Run

The rate at which y increases with x

228
Q

Intercept

A

The height of the line (y-value) at x=0; the y-value of the line where it intersects the y-axis.

229
Q

If there is a ______ association between two variables, then knowing one helps in predicting the other. However, when there is a _______ association, information about one variable does not help much in determining the other.

A

Strong
Weak

230
Q

Point of Averages

A

The dot on a scatter plot that represents a value with (x-average, y-average)

231
Q

5 Summary Statistics of Scatter Plot

A
  1. Average of x-values
  2. SD of x-values
  3. Average of y-values
  4. SD of y-values
  5. Correlation (r)
232
Q

Correlation Coefficient

A

r = avg( (x in standard units) * (y in standard units) )

Convert each variable to standard units and then take the average product

Measurement of LINEAR association between two variables.
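The formula on this card translates directly to Python (population SDs, per the standard-units formula; the data are made up):

```python
import statistics

def standard_units(values):
    """Convert each value to standard units: (value - mean) / SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)    # population SD
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """r = average of the products of x and y in standard units."""
    return statistics.fmean(
        zx * zy for zx, zy in zip(standard_units(xs), standard_units(ys))
    )

print(correlation([1, 2, 3], [2, 4, 6]))   # perfectly linear data
```

Perfectly linear data with an upward slope gives r = 1; flipping the direction of y flips the sign of r.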

233
Q

SD Line

A

Line on scatter plot that goes through all (x,y) points that are equal number of SDs away from the average, for both variables

Will pass through point of Averages (x=0 SD, y = 0 SD)

slope = (SD of y)/(SD of x)

234
Q

How do you convert to standard units / standardize?

A

Subtract mean from value and then divide by the standard deviation.

z = ( x - avg ) / SD

235
Q

The correlation coefficient is a pure number, without units. It is not affected by:
1. ___________ the two variables
2. ________ the same number to all the variables of one variable
3. _________ all the values of one variable by a constant positive number

A

Interchanging
Adding
Multiplying

236
Q

The correlation coefficient, r, measures clustering not in absolute terms but in relative terms - relative to the ________

A

Standard Deviations

237
Q

r measures ________ association, not association __________.

A

Linear
In General

238
Q

Regression

A

Technique that models the relationship between a dependent variable and one or more independent variables.

Way of using the correlation coefficient to estimate the average value of y for each value of x

Associated with each increase of one SD in x, there is an increase of r * SDs in y, on average
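The "r SDs of y per SD of x" rule can be sketched as a prediction function in Python (illustrative only; data made up):

```python
import statistics

def regression_estimate(x, xs, ys):
    """Estimate the average y for a given x: each SD of change in x is
    associated with r SDs of change in y, on average."""
    x_avg, y_avg = statistics.fmean(xs), statistics.fmean(ys)
    x_sd, y_sd = statistics.pstdev(xs), statistics.pstdev(ys)
    r = statistics.fmean(
        ((a - x_avg) / x_sd) * ((b - y_avg) / y_sd) for a, b in zip(xs, ys)
    )
    z_x = (x - x_avg) / x_sd           # x in standard units
    return y_avg + r * z_x * y_sd      # move r * z_x SDs away from the y average

print(regression_estimate(4, [1, 2, 3], [2, 4, 6]))
```

With r < 1, the predicted y sits fewer SDs from its average than x does from its own, which is the regression effect described a few cards down.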

239
Q

Regression Line

A

Estimates the average value of y for a corresponding value of x

240
Q

Regression Effect

A

In virtually all test-retest situations, the bottom group on the first test will, on average, show some improvement on the second test and the top group will, on average, fall back.

241
Q

Regression Fallacy

A

Thinking that the Regression Effect is due to something important and not just spread around the regression line

242
Q

Error / Residual

A

The distance of the predicted value from the actual value.

Error = actual - predicted

243
Q

rms Error for Regression

A

Tells you how much error you can expect from your predictions vs the actual, on average.

Measures spread around the regression line in absolute terms

244
Q

The units for the rms error are the same as the units for the variable ______________

A

Being predicted

245
Q

Homoscedasticity

A

TODO

246
Q

Heteroscedasticity

A

TODO

247
Q

Least Squares Line

A

Among all lines, the one that makes the smallest rms error in predicting y from x

aka the regression line

248
Q

Method of Least Squares

A

The process of finding the least squares estimates (slope and intercept) that minimize the sum of squared errors, producing the least squares line

249
Q

Least Squares Estimates

A

Slope and intercept of a regression line. The pair that minimizes the rms error defines the least squares line

250
Q

Frequency Theory

A

TODO

251
Q

Addition Rule

A

To find the chance that AT LEAST ONE OF TWO things will happen, check to see if they are mutually exclusive. If they are, add the chances.

252
Q

Multiplication Rule

A

The chance that two events will both happen equals the chance that the first will happen, multiplied by the chance that the second will happen given that the first has happened. For independent events, this is just the product of their unconditional chances.

253
Q

What is the difference between independent and mutually exclusive?

A

Independent means that the probability of event A is not affected by whether event B occurs, while mutually exclusive is the opposite extreme: if event B occurs, the chance of event A drops to 0.

254
Q

If you want to find the chance that AT LEAST ONE OF TWO events will occur, and the events are NOT mutually exclusive, ___________ add the chances: the sum will be ___________.

A

Do not
Too Big

255
Q

If you are having trouble working out the chance of an event, try to figure out the chance of its ____________; then subtract from 100%

A

Opposite

256
Q

Chance Variability

A

Refers to the random fluctuations that occur in data due to sampling or measurement errors. It’s the inherent uncertainty that exists in any data set, even when collected and analyzed carefully.

257
Q

Box Model

A

A simple model of a chance process that can be used to represent a population

You have a box with a series of tickets in it that you are drawing at random. An operation is then performed on the draws (sum, average, etc.)

258
Q

Chance Process

A

TODO

259
Q

What is the difference between the standard deviation and the standard error of a box model?

A

Standard Deviation = rms of value deviations from the average

Standard Error = likely size of the chance error; how far the observed value (e.g., the sample sum or average) is likely to be from the expected value

260
Q

Square Root Law

A

Principle that states that the standard error of a sample proportion, or mean, decreases in proportion to the square root of the sample size.

In simpler terms, this means that as you increase the sample size, the accuracy of your estimate (whether it’s a proportion or a mean) improves, but not at a linear rate. Instead, the improvement gets smaller and smaller as the sample size gets larger.

Part of a formula used to compute the SE for a sum of draws made at random with replacement (independent) from a box.

The square root law is the mathematical explanation for the law of averages.

261
Q

When should you use a z-Test vs a t-Test?

A

TODO

262
Q

Box Model: Expected Value for the Sum

A

= number of draws * avg of box

263
Q

Box Model: Sum

A

= expected value + chance error

In general, the sum is likely to be around its expected value, give or take the chance error (measured with standard error)

264
Q

Box Model: Standard Error (SE) for the Sum

A

How big is the chance error around the expected value of the sum likely to be?

= sqrt(num draws) * SD_of_box

Each draw adds some extra variability to the sum, because you don’t know how it is going to turn out. As the number of draws goes up, the sum gets harder to predict, the chance error (absolute) gets bigger, and so does the standard error

Here we can see the square root law in effect
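The formula can be checked by simulation. An illustrative Python sketch (the box models a fair die; all numbers are made up for the example):

```python
import math
import random
import statistics

# Box with tickets 1-6 (a fair die). Per the square root law, the SE for
# the sum of n draws should be sqrt(n) * SD_of_box.
box = [1, 2, 3, 4, 5, 6]
sd_of_box = statistics.pstdev(box)
n_draws = 25
predicted_se = math.sqrt(n_draws) * sd_of_box

random.seed(7)
sums = [sum(random.choice(box) for _ in range(n_draws)) for _ in range(20_000)]
observed_se = statistics.pstdev(sums)  # spread of the sums around their EV

print(f"predicted SE = {predicted_se:.2f}, observed SE = {observed_se:.2f}")
```

The simulated spread of the sums comes out close to sqrt(25) * SD of the box, confirming that the chance error grows with the square root of the number of draws.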

265
Q

For a box model, the sum of draws is likely to be around the ___________, give or take a __________ or so.

A

Expected value for the sum
Standard Error for the sum

266
Q

Observed Values

A

Values that are returned by the chance process / box model

= expected value + chance error

267
Q

Box Model Attributes

A

Number of draws
Standard Deviation (SD)
Expected Value of the sum (EV)
Standard Error of the sum (SE)

268
Q

What is the difference between standardization and normalization?

A

Standardization converts values to standard units: subtract the average and divide by the SD, so the result has average 0 and SD 1. Normalization rescales values to a fixed range, typically 0 to 1 (min-max scaling).

269
Q

Box Model: Counting

A

Instead of the chance process producing a value between A and B, it will produce either a 0 or a 1.

Our same formulas from before can be used.

270
Q

Provided the number of draws is sufficiently large the _____________can be used to figure chances for the sum of draws.

How large is sufficient?

A

normal curve

30 observations

271
Q

Probability Histogram

A

A new kind of graph that represents chance rather than data. Each rectangle represents the chance that the value falls in a given interval. The total area sums to 100%.

x-Axis = Sum of Box Model
y-Axis = Percent per standard unit

With enough draws, the probability histogram for the sum will be close to the normal curve

272
Q

Empirical

A

Experimentally Observed

273
Q

Converge

A

Gets closer and closer to

274
Q

As the number of samples increases, the probability histogram will get closer and closer to the ________________.

A

Normal Curve

275
Q

Selection/Sample Bias

A

The systematic tendency on the part of the sampling procedure to exclude one kind of person or another from the sample.

Taking a larger sample does not help this problem, it just repeats the basic mistake on a larger scale

276
Q

Non-Response Bias

A

The tendency for the sample to consist only of those who respond, rather than all the members originally selected for the sample

Non-respondents can be very different from respondents.

277
Q

Quota Sampling

A

A nonrandom method in which interviewers fill fixed quotas for certain subgroups (e.g., by age, sex, or income) but choose the individual subjects themselves; this discretion can introduce bias.

278
Q

With a simple random sample, the expected value for the sample percentage equals the ____________________. However, the sample percentage will be off by a chance error.

A

Population Percentage

279
Q

Standard Error (SE) for Sample Percentage

A

= (SE for # / Size of sample)*100%

First get the SE for the corresponding number; then convert to a percentage, relative to the size of the sample
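
A numerical sketch of the two-step recipe, with hypothetical numbers (400 draws from a 0-1 box that is 60% 1's):

```python
n = 400
p = 0.60                              # assumed fraction of 1's in the box
sd_box = (p * (1 - p)) ** 0.5         # SD of a 0-1 box
se_count = n ** 0.5 * sd_box          # step 1: SE for the number of 1's
se_pct = se_count / n * 100           # step 2: convert to a percentage
print(f"{se_count:.2f} ones, {se_pct:.2f}%")   # about 9.80 ones, 2.45%
```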

280
Q

Multiplying the size of the sample by some factor divides the SE for a percentage not by the whole factor - but by its ________________.

A

Square Root

281
Q

The SE for the sample number goes _____ like the ______________ of the sample size

A

Up
Square Root

282
Q

The SE for the sample percentage goes ______ like the _____________ of the sample size.

A

Down
Square Root

283
Q

When drawing at random from a box of 0’s and 1’s, the percentage of 1’s among the draws is likely to be around ___________, give or take _____________ or so.

A

Expected Value
SE for the Sample Percentage

284
Q

Box Model Classifying & Counting: Standard Deviation

A

SD = sqrt( p * (1-p) )
SD = sqrt(percentage of 1s * percentage of 0s)
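
The shortcut can be checked against the general SD definition (hypothetical box of three 1's and seven 0's):

```python
import statistics

box = [1] * 3 + [0] * 7              # hypothetical 0-1 box, 30% ones
p = box.count(1) / len(box)          # fraction of 1's
shortcut = (p * (1 - p)) ** 0.5      # sqrt(fraction of 1s * fraction of 0s)
print(shortcut, statistics.pstdev(box))   # both are about 0.458
```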

285
Q

When estimating percentages, it is the _______________ of the sample which determines accuracy, not the ________________ to the population. This is true only if the sample is a small part of the population (which is the usual case).

A

Absolute Size
Size Relative

286
Q

Chance error is not affected by the _______________ but rather the _______________ of the sample and the _________ of the box.

A

Population Size
Absolute Size of the Sample
SD
(20.4)

287
Q

Box Model: Population Percentage Coverage

A

= Sample Percentage ± N SEs

288
Q

A ____________________ is used when estimating an unknown parameter from sample data. The interval gives a range for the parameter and a confidence level that the range covers the true value.

A

Confidence Interval

289
Q

_______________ are used when you reason forward, from the box to the draws; ________________ are used when reasoning backwards, from draws to the box.

A

Probabilities
Confidence levels

290
Q

A sample percentage will be off the ____________________, due to chance error. The _______ tells you the likely size of the amount off. Confidence levels were introduced to make this idea more quantitative.

A

Population Percentage
Standard Error (SE)

291
Q

With a simple random sample, the ___________ percentage is used to estimate the ____________ percentage

A

Sample
Population

292
Q

A __________________ for the population percentage is obtained by going the right number of ______________ either way from the sample percentage. The _______________ is read off the normal curve.

A

Confidence Interval
SEs
Confidence Level

293
Q

When ___________ operates more or less evenly across the sample, it cannot be detected just by looking at the data.

A

Bias

294
Q

Box Model: Expected Value of Average

A

= Average of Box

295
Q

Box Model: Standard Error (SE) for the Average

A

= SE for sum / # of draws

Tells you how far the sample average is likely to be from its expected value (the population average)

296
Q

Q: What is the difference between the SD of the sample and the SE for the sample average? (23.2)

A
  • The SD says how far values are from the average - on average
  • The SE says how far sample averages are from the population average - on average

297
Q

Q: Why is it OK to use the normal curve in figuring confidence levels? (23.2)

A

Because with simple random sampling and a large number of draws, the probability histogram for the sample average follows the normal curve (the central limit theorem).

298
Q

Box Model: Expected Value of Count

A

= # of draws * fraction of 1s in the box

299
Q

Box Model: Standard Error (SE) of Count

A

SE for sum, from a 0-1 box

300
Q

Box Model: Expected Value of Percentage

A

= percentage of 1s in the box (the population percentage)

301
Q

Box Model: Standard Error (SE) of Percentage

A

= (SE for count / # of draws)*100%

302
Q

With a simple random sample, the __________ of the sample can be used to estimate the SD of the box. A __________ for the average of the box can be found by going the right number of SEs either way from the average of the draws. The __________ is read off the normal curve.

A

SD
Confidence Interval
Confidence Level

303
Q

Q: How should an experiment be designed to test the effectiveness of a treatment? (1)

A

A randomized-controlled double/triple blind experiment

304
Q

Q: How do we measure the “center” of the data?

A

Average

305
Q

Q: How do we measure the “spread” of a dataset?

A

Standard Deviation

= rms of deviations from average

306
Q

Q: How big is the chance error likely to be? (6)

A

We can estimate it using the SD of a sequence of repeated measurements made under the same conditions

307
Q

Q: How much chance error is likely to cancel out in the average? (24)

A

The average of the measurements is more precise than any individual measurement, by a factor equal to the square root of the number of measurements.

308
Q

Q: Why is the correlation coefficient a pure number?

A

Because the first step to calculating it is to convert to standard units

309
Q

Q: How do we determine if the results of an experiment are true or just because of chance?

A

We can run tests of significance to understand how likely our results are given the assumption of “no effect” (null hypothesis)

310
Q

The __________ corresponds to the idea that an observed difference is due to chance. To make a test of significance, the null hypothesis has to be set up as a box model for the data. The __________ is another statement about the box, corresponding to the idea that the observed difference is real.

A

Null Hypothesis
Alternative Hypothesis

311
Q

Observed Significance Level (P-value)

A

Chance of getting a test statistic as extreme as, or more extreme than, the observed one. The chance is computed on the basis that the null hypothesis is true. The smaller this chance is, the stronger the evidence against the null.

312
Q

Test of Significance

A

Procedure used to determine whether an observed effect or difference is statistically significant or due to chance. It helps us decide if our findings are meaningful or if they could have occurred by chance.

Steps:
1. Setup Null & Alternative Hypothesis
2. Pick a test statistic, to measure the difference between the observed data and what is expected by the null hypothesis.
3. Compute the observed significance level P
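
The three steps sketched with hypothetical numbers: 100 die rolls averaging 3.9, against the null that the die is fair (box average 3.5, SD about 1.71):

```python
from math import erf, sqrt

n = 100
observed_avg = 3.9        # hypothetical observed average of the draws
null_avg = 3.5            # step 1: null hypothesis - box average is 3.5
sd_box = 1.71             # SD of the box under the null (fair die)

se_avg = sd_box / sqrt(n)                 # SE for the average
z = (observed_avg - null_avg) / se_avg    # step 2: the test statistic

# step 3: observed significance level, one tail of the normal curve
p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
print(f"z = {z:.2f}, P = {p_value:.4f}")
```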

313
Q

Q: How do we find the expected value and the standard error for the difference between two sample averages?

A
  • Compute the EV and SE for each sample average independently
  • Combine them: EV of the difference = EV_A - EV_B; SE of the difference = sqrt(a^2 + b^2)

314
Q

Standard Error for the difference between two independent quantities

A

= sqrt(a^2 + b^2)
a = SE for average_A
b = SE for average_B
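
For example (hypothetical SEs): combining independent SEs of 0.5 and 1.2 gives 1.3, like the hypotenuse of a right triangle.

```python
from math import sqrt

se_a = 0.5                              # hypothetical SE for average_A
se_b = 1.2                              # hypothetical SE for average_B
se_diff = sqrt(se_a ** 2 + se_b ** 2)   # combine like a hypotenuse
print(round(se_diff, 6))                # 1.3
```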

315
Q

The two-sample z-statistic is computed from:

A
  1. the sizes of the two samples
  2. the averages of the two samples
  3. the SDs of the two samples

Assumes two independent random samples.

316
Q

Expected value for the difference between two quantities

A

= EV_A - EV_B

317
Q

Chi-Square Test (X^2-test)

A

Statistical hypothesis test used to determine if there is a significant difference between observed and expected frequencies in one or more categories. It’s commonly used to analyze categorical data.

a) Categorical Data
b) Chance model
c) Frequency Table
d) X^2 Statistic
e) Degrees of Freedom
f) Observed Significance Level (P)

318
Q

X^2-Statistic

A

sum of ( (observed freq. - expected freq)^2 / expected freq. )
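
Applied to a hypothetical die rolled 60 times, so 10 rolls per face are expected under the chance model:

```python
observed = [4, 6, 17, 16, 8, 9]       # hypothetical observed frequencies
expected = [10] * 6                   # expected under the chance model

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 1))               # 14.2
```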

319
Q

The __________ test says whether the data are like the result of drawing at random from a box whose CONTENTS are given

A

X^2

320
Q

The __________ says whether the data are like the result of drawing at random from a box whose AVERAGE is given.

A

z-test

321
Q

The P-value of a test depends on the _________. With a __________ sample, even a small difference can be “statistically significant”. Conversely, an important difference may be statistically insignificant if the sample is too __________.

A

Sample Size
Large
Small