Statistics Flashcards
- Sources: OpenStax Introductory Statistics; Statistics, 4th Edition (Freedman, Pisani, and Purves)
Average
A number that describes the central tendency of the data
average = sum of entries / number of entries
Blinding
Not telling a subject which treatment they are receiving
Categorical Variable
Variables that take on values that are names or labels
Cluster Sampling
A method for selecting a random sample; divide the population into groups (clusters), then use simple random sampling to select a set of clusters. Every individual in the chosen clusters is included in the sample.
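A minimal Python sketch of the procedure above; the household/block grouping and the `cluster_of` helper name are illustrative, not from the source:

```python
import random
from collections import defaultdict

def cluster_sample(population, cluster_of, n_clusters, seed=None):
    """Group members into clusters, pick clusters at random, then keep
    every member of each chosen cluster."""
    clusters = defaultdict(list)
    for member in population:
        clusters[cluster_of(member)].append(member)
    chosen = random.Random(seed).sample(sorted(clusters), n_clusters)
    return [m for c in chosen for m in clusters[c]]

households = list(range(100))
# Illustrative: ten "city blocks" of ten households each; sample 3 blocks.
sample = cluster_sample(households, lambda h: h // 10, 3, seed=0)
```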
Continuous Random Variable
A random variable (RV) whose outcomes are measured; the height of trees in the forest is a continuous RV.
Control Group
A group in a randomized experiment that receives an inactive treatment but is otherwise managed exactly as the other groups
Convenience Sampling
A nonrandom method of selecting a sample; this method selects individuals that are easily accessible and may result in biased data.
Cumulative Relative Frequency
The term applies to an ordered set of observations from smallest to largest. The cumulative relative frequency is the sum of the relative frequencies for all values that are less than or equal to the given value.
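A small Python sketch of this definition; the function name and sample data are illustrative:

```python
from collections import Counter

def cumulative_relative_freq(data):
    """Return [(value, cumulative relative frequency)] for the sorted
    distinct values: the running sum of relative frequencies."""
    n = len(data)
    counts = Counter(data)
    cum = 0
    out = []
    for value in sorted(counts):
        cum += counts[value]
        out.append((value, cum / n))
    return out

data = [2, 3, 3, 4, 4, 4, 5]
print(cumulative_relative_freq(data))
# the last entry is always (max value, 1.0)
```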
Data
A set of observations (a set of possible outcomes); most data can be put into two groups: qualitative (an attribute whose value is indicated by a label) or quantitative (an attribute whose value is indicated by a number). Quantitative data can be separated into two subgroups: discrete and continuous. Data is discrete if it is the result of counting (such as the number of students of a given ethnic group in a class or the number of books on a shelf). Data is continuous if it is the result of measuring (such as distance traveled or weight of luggage)
Double-blind experiment
An experiment in which both the subjects of an experiment and the researchers who work with the subjects are blinded
Triple-blind experiment
An experiment in which the subjects, the researchers who work with the subjects, and the analysts who analyze the data are all blinded
Experimental Unit
Any individual or object to be measured
Explanatory Variable
The independent variable in an experiment; the value controlled by researchers
Frequency
The number of times a value of the data occurs
Institutional Review Board
A committee tasked with oversight of research programs that involve human subjects
Informed Consent
Any human subject in a research study must be cognizant of any risks or costs associated with the study. The subject has the right to know the nature of the treatments included in the study, their potential risks, and their potential benefits. Consent must be given freely by an informed, fit participant.
Lurking Variable
A variable, not included in the experiment, that has an effect on a study even though it is neither an explanatory variable nor a response variable
Confounding Variable
Difference between the treatment and control groups - other than the treatment - which affects the responses being studied. A third variable, associated with both the explanatory and response variables.
“The idea is a bit subtle: a gene that causes cancer but is unrelated to smoking is not a confounder and is sideways to the argument”
Gene needs to A) cause cancer AND B) get people to smoke
Sometimes controlled for by cross-tabulation
How is a Lurking Variable different from a Confounding Variable?
Lurking = Unknown or unconsidered
Confounding = Known but not controlled for
Nonsampling Error/Systematic Error/Bias
An issue that affects the reliability of sampling data other than natural variation; it includes a variety of human errors including poor study design, biased sampling methods, inaccurate information provided by study participants, data entry errors, and poor analysis.
Numerical Variable
Variables that take on values that are indicated by numbers
Population Parameter
A number that is used to represent a population characteristic and that generally cannot be determined easily
Placebo
An inactive treatment that has no real effect on the response variable
Population
All individuals, objects, or measurements whose properties are being studied
Probability
A number between zero and one, inclusive, that gives the likelihood that a specific event will occur
Proportion
The number of successes divided by the total number in the sample
Qualitative Data
Data that has an attribute whose value is indicated by a label
Quantitative Data
Data with an attribute whose value is indicated by a number; quantitative data can be separated into two subgroups: discrete and continuous.
Data is discrete if it is the result of counting (such as the number of students of a given ethnic group in a class or the number of books on a shelf).
Data is continuous if it is the result of measuring (such as distance traveled or weight of luggage).
Random Assignment
The act of organizing experimental units into treatment groups using random methods
Random Sampling
A method of selecting a sample that gives every member of the population an equal chance of being selected.
Relative Frequency
The ratio of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes
Representative Sample
A subset of the population that has the same characteristics as the population
Response Variable
The dependent variable in an experiment; the value that is measured for change at the end of an experiment
Sample
A subset of the population studied
Sampling Error (Chance Variation)
The natural variation that results from selecting a sample to represent a larger population; this variation decreases as the sample size increases, so selecting larger samples reduces sampling error.
Sampling with Replacement
Once a member of the population is selected for inclusion in a sample, that member is returned to the population for the selection of the next individual.
Sampling without Replacement
A member of the population may be chosen for inclusion in a sample only once. If chosen, the member is not returned to the population before the next selection.
Simple Random Sampling
A straightforward method for selecting a random sample; give each member of the population a number. Use a random number generator to select a set of labels. These randomly selected labels identify the members of your sample.
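The labeling-and-drawing procedure above can be sketched in Python; names are illustrative:

```python
import random

def simple_random_sample(population, n, seed=None):
    """Number each member by position, then draw n labels at random
    (sampling without replacement)."""
    rng = random.Random(seed)
    return rng.sample(population, n)

members = [f"person_{i}" for i in range(100)]  # illustrative labels
chosen = simple_random_sample(members, 10, seed=42)
print(chosen)
```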
Sample Statistic
A numerical characteristic of the sample; a statistic estimates the corresponding population parameter.
sample estimate = parameter + bias + chance error/sampling error
Stratified Sampling
A method for selecting a random sample used to ensure that subgroups of the population are represented adequately; divide the population into groups (strata). Use simple random sampling to identify a proportionate number of individuals from each stratum.
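A proportionate version of the procedure above, sketched in Python; the `stratum_of` helper and the example grouping are illustrative assumptions:

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, fraction, seed=None):
    """Proportionate stratified sample: draw the same fraction from each
    stratum using simple random sampling within the stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)
    sample = []
    for members in strata.values():
        n = max(1, round(fraction * len(members)))  # proportionate count
        sample.extend(rng.sample(members, n))
    return sample

people = list(range(100))
# Illustrative: four strata of 25 members each; take 20% of each.
sample = stratified_sample(people, lambda p: p % 4, 0.2, seed=0)
```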
Systematic Sampling
A method for selecting a random sample; list the members of the population. Use simple random sampling to select a starting point in the population. Let k = (number of individuals in the population)/(number of individuals needed in the sample). Choose every kth individual in the list starting with the one that was randomly selected. If necessary, return to the beginning of the population list to complete your sample.
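The steps above (random start, every kth member, wrap to the beginning if needed) can be sketched as:

```python
import random

def systematic_sample(population, n, seed=None):
    """Random starting point, then every kth member, wrapping around
    the list if the end is reached before n members are collected."""
    k = len(population) // n                       # k = N / n
    start = random.Random(seed).randrange(len(population))
    return [population[(start + i * k) % len(population)] for i in range(n)]

population = list(range(100))
chosen = systematic_sample(population, 10, seed=1)
```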
Treatments
Different values or components of the explanatory variable applied in an experiment
Variable
A characteristic of interest for each person or object in a population.
Box plot
A graph that gives a quick picture of the middle 50% of the data plus the minimum and maximum value.
First Quartile
The value that is the median of the lower half of the ordered data set
Frequency Polygon
A graph that shows the frequencies of grouped data. It is a type of frequency diagram that plots the midpoints of the class intervals against the frequencies and then joins up the points with straight lines.
Frequency Table
A data representation in which grouped data is displayed along with the corresponding frequencies
Histogram
A graphical representation in x-y form of the distribution of data in a data set; x represents the data and y represents the frequency, or relative frequency. The graph consists of contiguous rectangles - the area of each representing the percent of the data that block/class represents.
Interquartile Range (IQR)
The range of the middle 50 percent of the data values; the IQR is found by subtracting the first quartile from the third quartile.
Interval
Also called a class interval; an interval represents a range of data and is used when displaying large data sets
Mean
The sum of all values in the population divided by the number of values
Median
A number that separates ordered data into halves; half the values are the same number or smaller than the median and half the values are the same number or larger than the median. The median may or may not be part of the data.
Midpoint
The mean of an interval in a frequency table
Mode
The value that appears most frequently in a set of data
Outlier
An observation that does not fit the rest of the data.
Common rule of thumb: flag values that lie 2 or more standard deviations from the mean
Paired Data Set
Two data sets that have a one-to-one relationship so that 1) each data set is the same size and 2) each data point in one data set is matched with exactly one data point in the other data set.
Percentile
A number that divides ordered data into hundredths; percentiles may or may not be part of the data. The median of the data is the second quartile and the 50th percentile. The first and third quartiles are the 25th and the 75th percentiles, respectively.
FORMULA 1: Percentile rank of a value x
P(x) = (n/N)*100, where n = number of values below x and N = total number of values
FORMULA 2: Position in the ordered data for a given percentile P
n = (P*N)/100
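Both formulas in Python; function names and the example data are illustrative (this is the simple flashcard version, not the interpolated percentile used by some libraries):

```python
def percentile_rank(x, data):
    """FORMULA 1: P(x) = (n / N) * 100, n = number of values below x."""
    n = sum(1 for v in data if v < x)
    return n / len(data) * 100

def position_for_percentile(p, data):
    """FORMULA 2: n = (P * N) / 100, a position in the ordered data."""
    return p * len(data) / 100

scores = list(range(1, 101))              # 1..100
print(percentile_rank(51, scores))        # 50.0
print(position_for_percentile(25, scores))  # 25.0
```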
Quartiles
The numbers that separate the data into quarters; quartiles may or may not be part of the data. The second quartile is the median of the data.
Skewed
Used to describe data that is not symmetrical; when the right side of a graph looks “chopped off” compared to the left side, we say it is “skewed to the left.” When the left side of the graph looks “chopped off” compared to the right side, we say the data is “skewed to the right.” Alternatively: when the lower values of the data are more spread out, we say the data are skewed to the left. When the greater values are more spread out, the data are skewed to the right.
Outliers in the tail pull the mean from the center towards the longer tail.
Standard Deviation
A number that is equal to the square root of the variance and measures how far data values are from their mean; notation: s for sample standard deviation and σ for population standard deviation.
Measures how far the values are from the mean on average
SD = the rms (root-mean-square) size of the deviations from the average
Deviation = Value - Average
SD = sqrt(avg(deviations^2))
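The rms formula above, as a Python sketch (this is the population SD, dividing by n; the sample SD divides by n - 1 instead):

```python
import math

def standard_deviation(values):
    """Population SD: rms size of the deviations from the average."""
    avg = sum(values) / len(values)
    deviations = [v - avg for v in values]
    return math.sqrt(sum(d * d for d in deviations) / len(values))

print(standard_deviation([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```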
Variance
Mean of the squared deviations from the mean, or the square of the standard deviation; for a set of data, a deviation can be represented as x – x¯ where x is a value of the data and x¯ is the sample mean. The sample variance is equal to the sum of the squares of the deviations divided by the difference of the sample size and one.
AND Event
An outcome is in the event A AND B if the outcome is in both A AND B at the same time.
Complement Event
The complement of event A consists of all outcomes that are NOT in A.
Contingency Table
The method of displaying a frequency distribution as a table with rows and columns to show how two variables may be dependent (contingent) upon each other; the table provides an easy way to calculate conditional probabilities.
Conditional Probability
The likelihood that an event will occur given that another event has already occurred
Dependent Events
If two events are NOT independent, then we say that they are dependent.
Equally Likely
Each outcome of an experiment has the same probability.
Event
A subset of the set of all outcomes of an experiment; the set of all outcomes of an experiment is called a sample space and is usually denoted by S. An event is an arbitrary subset in S. It can contain one outcome, two outcomes, no outcomes (empty subset), the entire sample space, and the like. Standard notations for events are capital letters such as A, B, C, and so on.
Experiment
A planned activity carried out under controlled conditions
Independent Events
The occurrence of one event has no effect on the probability of the occurrence of another event. Events A and B are independent if one of the following is true:
P(A|B) = P(A)
P(B|A) = P(B)
P(A AND B) = P(A)P(B)
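The multiplication check above can be verified exactly on a small sample space; the two-dice events here are illustrative:

```python
from itertools import product

# Sample space: two fair dice.
space = list(product(range(1, 7), repeat=2))
A = {o for o in space if o[0] == 6}          # first die shows 6
B = {o for o in space if sum(o) % 2 == 0}    # sum is even

def p(event):
    return len(event) / len(space)

# P(A AND B) equals P(A) * P(B), so A and B are independent.
print(p(A & B), p(A) * p(B))
```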
Mutually Exclusive
Two events are mutually exclusive if the probability that they both happen at the same time is zero. If events A and B are mutually exclusive, then P(A AND B) = 0.
Or Event
An outcome is in the event A OR B if the outcome is in A or is in B or is in both A and B.
Outcome
A particular result of an experiment
Sample Space
The set of all possible outcomes of an experiment
Tree Diagram
A useful visual representation of a sample space and events in the form of a “tree” with branches marked by possible outcomes together with associated probabilities (frequencies, relative frequencies)
Venn Diagram
The visual representation of a sample space and events in the form of circles or ovals showing their intersections
Bernoulli Trials
an experiment with the following characteristics:
- There are only two possible outcomes called “success” and “failure” for each trial.
- The probability p of a success is the same for any trial (so the probability q = 1 − p of a failure is the same for any trial).
Binomial Experiment
A statistical experiment that satisfies the following three conditions:
- There are a fixed number of trials, n.
- There are only two possible outcomes, called “success” and, “failure,” for each trial. The letter p denotes the probability of a success on one trial, and q denotes the probability of a failure on one trial.
- The n trials are independent and are repeated using identical conditions.
Expected Value
Expected arithmetic average when an experiment is repeated many times.
The average value of values generated by a chance process
Binomial Probability Distribution
A discrete random variable (RV) that arises from Bernoulli trials; there are a fixed number, n, of independent trials. “Independent” means that the result of any trial (for example, trial one) does not affect the results of the following trials, and all trials are conducted under the same conditions. Under these circumstances the binomial RV X is defined as the number of successes in n trials. The notation is: X ~ B(n, p)
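The binomial probability mass function for X ~ B(n, p), sketched in Python (the function name is mine):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(binomial_pmf(2, 4, 0.5))  # 0.375
```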
Geometric Distribution
A discrete random variable (RV) that arises from the Bernoulli trials; the trials are repeated until the first success. The geometric variable X is defined as the number of trials until the first success.
Geometric Experiment
A statistical experiment with the following properties:
- There are one or more Bernoulli trials with all failures except the last one, which is a success.
- In theory, the number of trials could go on forever. There must be at least one trial.
- The probability, p, of a success and the probability, q, of a failure do not change from trial to trial.
Hypergeometric Experiment
A statistical experiment with the following properties:
- You take samples from two groups.
- You are concerned with a group of interest, called the first group.
- You sample without replacement from the combined groups.
- Each pick is not independent, since sampling is without replacement.
- You are not dealing with Bernoulli Trials.
Hypergeometric Probability
A discrete random variable (RV) that is characterized by:
- A fixed number of trials.
- The probability of success is not the same from trial to trial.
We sample from two groups of items when we are interested in only one group. X is defined as the number of successes out of the total number of items chosen.
Probability Distribution Function (PDF)
A mathematical description of a discrete random variable (RV), given either in the form of an equation (formula) or in the form of a table listing all the possible outcomes of an experiment and the probability associated with each outcome.
Poisson Probability Distribution
A discrete random variable (RV) that counts the number of times a certain event will occur in a specific interval; characteristics of the variable:
- The probability that the event occurs in a given interval is the same for all intervals.
- The events occur with a known mean and independently of the time since the last event.
The Poisson distribution is often used to approximate the binomial distribution, when n is “large” and p is “small” (a general rule is that n should be greater than or equal to 20 and p should be less than or equal to 0.05).
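The n >= 20, p <= 0.05 rule above can be checked numerically; the example numbers are illustrative:

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

# n = 100 >= 20 and p = 0.02 <= 0.05, so Poisson(lam = n * p) should be close.
n, p = 100, 0.02
for k in range(5):
    print(k, binomial_pmf(k, n, p), poisson_pmf(k, n * p))
```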
Random Variable (RV)
A characteristic of interest in a population being studied; common notation for variables is upper case Latin letters X, Y, Z,…; common notation for a specific value from the domain (set of all possible values of a variable) is lower case Latin letters x, y, and z. For example, if X is the number of children in a family, then x represents a specific integer 0, 1, 2, 3,…. Variables in statistics differ from variables in intermediate algebra in the two following ways.
- The domain of the random variable (RV) is not necessarily a numerical set; the domain may be expressed in words; for example, if X = hair color then the domain is {black, blond, gray, green, orange}.
- We can tell what specific value x the random variable X takes only after performing the experiment.
The Law of Large Numbers
Sample mean approaches population mean as sample size increases.
As the number of trials in a probability experiment increases, the difference between the theoretical probability of an event and the relative frequency probability approaches zero.
Decay parameter
The decay parameter describes the rate at which probabilities decay to zero for increasing values of x.
Exponential Distribution
A continuous random variable (RV) that appears when we are interested in the intervals of time between some random events, for example, the length of time between emergency arrivals at a hospital.
Memoryless property
For an exponential random variable X, the memoryless property is the statement that knowledge of what has occurred in the past has no effect on future probabilities.
Poisson Distribution
If there is a known average of λ events occurring per unit time, and these events are independent of each other, then the number of events X occurring in one unit of time has the Poisson distribution.
Uniform Distribution
A discrete or continuous random variable (RV) that has equally likely outcomes over the domain, a < x < b.
Normal Distribution
aka the Gaussian distribution; visualized as a symmetrical bell-shaped curve. The most frequent values cluster around the center, and the probability of finding values far away from the center tapers off gradually in both directions.
Standard Normal Distribution / Z-Distribution
A normal distribution with mean 0 and standard deviation of 1
Z-score
Statistical value that describes a specific data point’s relative position within a standard normal distribution (aka Z-Distribution). It tells you how many standard deviations a particular point is away from the mean (average) of the data, expressed in standard deviation units.
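The z-score formula, sketched in Python; the IQ-style mean and SD in the example are illustrative numbers:

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

# Illustrative: a scale with mean 100 and SD 15.
print(z_score(130, 100, 15))  # 2.0
```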
Central Limit Theorem (CLT)
The sampling distribution of the mean for a variable approximates a normal distribution as sample size increases.
Given a random variable (RV) with known mean μ, known standard deviation σ and known sample size of n - if n is sufficiently large, then the distribution of the sample means and the distribution of the sample sums will approximate a normal distribution regardless of the shape of the population. The mean of the sample means will equal the population mean, and the mean of the sample sums will equal n times the population mean.
As sample size, n, increases, the standard deviation of the sampling distribution becomes smaller because the square root of the sample size is in the denominator. In other words, the sampling distribution clusters more tightly around the mean as sample size increases.
Why is it important?
1. Normality Assumption
Allows us to use hypothesis tests that rely on normally distributed data even when the underlying data are not normally distributed
2. Precision of Estimates
We can make our estimates more precise by increasing our sample size
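A simulation sketch of the CLT claims above, using a skewed (exponential) population; the seed, sample counts, and function name are illustrative choices:

```python
import random
import statistics

rng = random.Random(0)

def sd_of_sample_means(n, n_samples=2000):
    """SD of the sampling distribution of the mean for sample size n,
    estimated by drawing many samples from an exponential population."""
    means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
             for _ in range(n_samples)]
    return statistics.stdev(means)

sd_small, sd_large = sd_of_sample_means(4), sd_of_sample_means(64)
print(sd_small, sd_large)  # larger n -> tighter sampling distribution
```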
Sampling Distribution of the Mean
Describes the probability distribution of all possible means you could get if you draw multiple random samples of size n from a population.
Given simple random samples of size n from a given population with a measured characteristic such as mean, proportion, or standard deviation for each sample, the probability distribution of all the measured characteristics is called a sampling distribution.
Standard Error of the Mean
The standard deviation of the distribution of the sample means around the population mean.
Binomial Distribution
A discrete random variable (RV) which arises from Bernoulli trials; there are a fixed number, n, of independent trials. “Independent” means that the result of any trial (for example, trial 1) does not affect the results of the following trials, and all trials are conducted under the same conditions. Under these circumstances the binomial RV X is defined as the number of successes in n trials.
Confidence Interval (CI)
An interval estimate for an unknown population parameter. This depends on:
1. the desired confidence level,
2. information that is known about the distribution (for example, known standard deviation),
3. the sample and its size.
Confidence Level (CL)
The percent expression for the probability that the confidence interval contains the true population parameter; for example, if the CL = 90%, then in 90 out of 100 samples the interval estimate will enclose the true population parameter.
Confidence Level = 1 - Alpha (α)
Degrees of Freedom (df)
The number of objects in a sample that are free to vary. Usually, df = n - 1
Error Bound for a Population Mean (EBM) / Margin of Error
The margin of error; depends on the confidence level, sample size, and known or estimated population standard deviation.
Inferential Statistics
Also called statistical inference or inductive statistics; this facet of statistics deals with estimating a population parameter based on a sample statistic. For example, if four out of the 100 calculators sampled are defective we might infer that four percent of the production is defective.
Point Estimate
A single number computed from a sample and used to estimate a population parameter
Student’s t-Distribution
Investigated and reported by William S. Gosset in 1908 and published under the pseudonym Student; the major characteristics of the random variable (RV) are:
- It is continuous and assumes any real values.
- The pdf is symmetrical about its mean of zero. However, it is more spread out and flatter at the apex than the normal distribution.
- It approaches the standard normal distribution as n gets larger. (For n >= 30 it closely follows the normal distribution.)
- There is a “family” of t–distributions: each representative of the family is completely defined by the number of degrees of freedom, which is one less than the number of data.
Hypothesis
A statement about the value of a population parameter; in the case of two hypotheses, the statement assumed to be true is called the null hypothesis (notation H0) and the contradictory statement is called the alternative hypothesis (notation Ha).
Hypothesis Testing
Statistical analysis that uses sample data to assess two mutually exclusive theories about the properties of a population.
Based on sample evidence, a procedure for determining whether the hypothesis stated is a reasonable statement and should not be rejected or is unreasonable and should be rejected.
- Understand Sampling Distribution
- Understand test statistic given sampling distribution
- Run Test and receive p-value
- Evaluate statistical significance and decision
Hypothesis tests work by taking the observed test statistic from a sample and using the sampling distribution to calculate the probability of obtaining that test statistic if the null hypothesis is correct.
Level of Significance of the Test
Probability of a Type I error (reject the null hypothesis when it is true). Notation: α. In hypothesis testing, the Level of Significance is called the preconceived α or the preset α.
Q: Is this true? See Hypothesis Testing by Jim Frost.
p-value
The probability that an event will happen purely by chance assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence is against the null hypothesis.
Probability of observing a sample statistic that is at least as extreme as our sample statistic when we assume that the null hypothesis is correct
Indicates the strength of the sample evidence against the null hypothesis. Probability we would obtain the observed effect, or larger, if the null hypothesis is correct.
If the p-value is less than or equal to the significance level, you reject the null hypothesis in favor of your alternative hypothesis, and your results are statistically significant.
- When the p-value is low, the null must go
- If the p-value is high, the null will fly
Whenever we see a p-value, we know we are looking at the results of a hypothesis test.
Determines statistical significance for a hypothesis test
P-values are NOT an error rate!
The chance of getting a test statistic assuming the null hypothesis is true. Not the chance of the null hypothesis being correct.
Type I Error
Alpha (α)
The decision is to reject the null hypothesis when, in fact, the null hypothesis is true.
Ex. Saying there is a change when there is not
Ex. Fire alarm going off when there is no fire
Type II Error
Beta (β)
The decision is not to reject the null hypothesis when, in fact, the null hypothesis is false.
Ex. Saying there is not a change when there is.
Ex. Fire alarm does NOT ring when there is a fire
Descriptive Statistics
Describes/Summarizes a dataset for a particular group of objects, observations, or people. No attempt to generalize beyond the set of observations.
Statistics
Science concerned with collecting, organizing, analyzing, interpreting, and presenting data. It equips us with the tools and methods to extract meaningful insights from information.
Continuous Data
Quantitative data that is the result of measuring and can take any value within a range (such as distance traveled or weight of luggage)
Discrete Data
Quantitative data that is the result of counting and can take only certain values (such as the number of books on a shelf)
Interval Scale Data
Numerical data with meaningful differences between values but no true zero, so ratios are not meaningful (such as temperature in degrees Celsius)
Ratio Scale Data
Numerical data with meaningful differences between values and a true zero point, so ratios are meaningful (such as height or weight)
Binary Data
Categorical data that can take only two possible values (such as yes/no or success/failure)
Ordinal Data
Categorical data whose values can be ordered or ranked, although the differences between values are not meaningful (such as survey ratings from poor to excellent)
Multimodal Distribution
A distribution that has multiple peaks
4 Categories of Descriptive/Summary Statistics
- Central Tendency
- Spread/Dispersion
- Correlation/Dependency
- Shape of Distribution
Second Quartile
aka the Median of the data. Splits the entire ordered dataset into 2 equal parts
Third Quartile
The value that is the median of the upper half of the ordered data set
Central Tendency
Measure of the center point or typical value of a dataset
Variability
Measure of the amount of dispersion in a dataset. How spread out the values in the dataset are.
Range
The difference between the MAX and MIN value in a dataset.