Exam 1 Terms Flashcards
What is a case?
An individual unit that is often a person, place, or thing. A row of data usually represents a case.
What are variables?
A variable is a characteristic or measurement that describes the cases. Typically, a column of data represents a variable. Examples are height, weight, age, temperature, and time.
What are the two main types of variables?
categorical and quantitative
What is a categorical variable?
A variable that places each case into one of two or more categories. (ex. gender)
What is a quantitative variable?
A variable that measures a numerical quantity. (ex. GPA, pulse rate, height)
What are the two subsets of quantitative variables?
Continuous and discrete
What is a continuous variable?
A type of quantitative variable that can take on an infinite set of values within some range. (ex. temperature, life expectancy, food calories)
What is a discrete variable?
A type of quantitative variable that can take only distinct, countable values, often whole-number counts. (ex. number of babies born in a pregnancy, number of courses you are taking next semester)
What is a population?
The entire set of cases.
What is a sample?
A subset of the population. We collect data for the sample.
What is a parameter?
A number that describes the population. (ex. population mean, mean GPA for an entire class)
What is a statistic?
A number that describes the sample. (ex. sample mean, mean GPA for a selection of students in a class)
What is statistical inference?
The process of using data from a sample to gain information about the population.
Why should we take random samples?
A sample that is not selected at random may be prone to bias. Selecting cases at random gives the best chance of obtaining a sample that is representative of the population.
What is a representative sample?
A subset of the population from which data are collected that accurately reflects the population.
What is bias?
The systematic favoring of certain outcomes.
What is sampling bias?
Systematic favoring of certain outcomes due to the methods employed to obtain the sample.
What is simple random sampling? Why do we do it?
A method of obtaining a sample where every member of the population has an equal chance of being selected (similar to drawing names from a hat). Samples are selected without replacement.
SRS is done to avoid sampling bias and to obtain a sample that’s representative of a population.
ex. if we wanted to research how long PSU students sleep at night, it would be best to randomly select students for the sample rather than only surveying students in an 8 AM class.
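A minimal sketch of drawing a simple random sample in Python, assuming the population can be listed in full. The population of 500 student ID numbers, the sample size of 25, and the seed are made up for illustration.

```python
import random

# Hypothetical population: 500 student ID numbers (made up for illustration).
population = list(range(1, 501))

# A fixed seed makes the draw repeatable; drop it for a fresh sample each run.
rng = random.Random(42)

# sample() draws without replacement, so no student can be picked twice,
# and every student has the same chance of being selected.
srs = rng.sample(population, k=25)

print(sorted(srs))
```

Surveying only the first 25 IDs on the list would be the convenience-sample version of this and could introduce sampling bias.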
What is a convenience sample?
A method of obtaining a sample by ease of accessibility. These samples are NOT random and they may NOT represent the intended population.
Besides convenience sampling, what are other sources of bias?
- non-response bias
- response bias
What is non-response bias?
Bias that arises when individuals who do not participate in a study differ from those who do participate. Reasons for non-response:
- inability to contact individual
- individual chose not to participate
What is response bias?
Bias that arises when individuals participate but do not respond truthfully.
- may do so to align with social norms
- may do so to appease the researcher
What is a confounding variable?
A third variable that may explain the association between two other variables.
Ex. when ice cream sales increase, so do shark attacks. This is an association only, not causation. Temperature is a confounding variable here: as it rises, more people buy ice cream and more people go to the beach, where shark attacks happen.
What are the two main types of studies?
observational and experimental
What is an observational study?
Researchers simply observe the data as they occur. We cannot say that there is a cause and effect based on this type of study because there can be confounding variables.
These studies almost always have confounding variables.
Observational studies can almost never be used to establish causation.
ex. Question: Does coffee cause hyperactivity in college students?
A researcher randomly samples students and surveys them about their coffee intake and hyperactivity
What is an experimental study?
Researchers actively control one or more of the variables of interest. Because the researchers set the value of a variable themselves rather than just observing it, a well-designed (randomized) experiment can be used to establish cause and effect.
Ex. Question: Does coffee cause hyperactivity in college students?
A researcher randomly samples students and randomly assigns them to drink coffee with or without caffeine.
How can confounding variables be avoided?
By using a randomized experiment.
What is a randomized experiment?
When the treatment for each case is randomly assigned.
What are the two types of randomized experiments?
Comparative experiments and matched pair experiments
What is a comparative experiment?
Cases are randomly assigned to different treatment groups
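A minimal sketch of random assignment for a two-group comparative experiment. The function name assign_groups, the group labels, and the participant IDs are hypothetical and only illustrate the idea.

```python
import random

def assign_groups(cases, seed=None):
    """Randomly split cases into two treatment groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)          # random order, so chance alone decides the groups
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs 1 through 10.
caffeine, decaf = assign_groups(range(1, 11), seed=1)
print("caffeine:", sorted(caffeine))
print("decaf:   ", sorted(decaf))
```

Because chance decides who lands in which group, potential confounding variables should be balanced across the groups on average.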
What is a matched pair experiment?
Each case gets BOTH treatments, in a random order
What is a control group?
A group of cases that do not receive treatment; serve as a comparison group
What is a placebo?
A fake treatment; used to control for the placebo effect
What is a single-blind study?
When participants do not know to which group they belong
What is a double-blind study?
When neither the participants nor the researchers interacting with them know which participants were assigned to which group.
How can we summarize one categorical variable?
- can use a frequency table
- can take a proportion (relative frequency)
- can make a relative frequency table (does not include counts)
- bar chart
- pie chart
What is a proportion?
A relative frequency
Proportion = (count for category of interest) / (total count in sample)
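A short sketch of a frequency table and the matching proportions, using Python's standard library. The survey responses are made up for illustration.

```python
from collections import Counter

# Hypothetical answers to a yes/no/unsure survey question.
responses = ["yes", "no", "yes", "yes", "no", "unsure", "yes", "no"]

counts = Counter(responses)   # frequency table: category -> count
n = len(responses)            # total count in the sample

# Relative frequency table: each count divided by the sample size.
proportions = {category: count / n for category, count in counts.items()}

for category in counts:
    print(category, counts[category], proportions[category])
```

The proportions always add up to 1, which is a quick check on a relative frequency table.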
How can we summarize two categorical variables?
- use a two way table
- use a segmented (stacked) bar chart
- use a side-by-side bar chart
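A two-way table can be sketched by counting pairs of categories. The class-year and answer data below are made up for illustration.

```python
from collections import Counter

# Hypothetical cases: (class year, owns a car?) for six students.
cases = [
    ("first-year", "yes"), ("first-year", "no"), ("first-year", "no"),
    ("senior", "yes"), ("senior", "yes"), ("senior", "no"),
]

table = Counter(cases)                      # (row, column) -> count
rows = sorted({year for year, _ in cases})
cols = sorted({answer for _, answer in cases})

print("year", cols)
for year in rows:
    # Counter returns 0 for any combination that never occurs.
    print(year, [table[(year, answer)] for answer in cols])
```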
How can we summarize 1 quantitative variable?
Can use a …
- dotplot
- histogram
When are histograms ideal?
This is the ideal graph when there are 30 or more cases.
What shapes can histograms be?
- bell shaped/symmetric
- left-skewed
- right-skewed
What is the mean?
The mean, or average, is the sum of the data values divided by the number of values.
What is the median?
The middle value when the data are ordered. With an even number of values, it is the average of the two middle values.
Describe the mean and median when the data is symmetric.
mean roughly equals median
Describe the mean and median when the data is right skewed.
Mean > median
The right tail pulls the mean in that direction.
Describe the mean and median when the data is skewed to the left.
Mean < median
When is the mean meaningless?
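When the data are strongly skewed or contain outliers. The mean is pulled toward the tail, so it can be a misleading measure of center; the median is usually more informative.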
What is an outlier?
A data point that is notably distant from the other values in a data set.
What is resistance?
A statistic is resistant if it is relatively unaffected by extreme values such as outliers.
Is the median resistant to outliers?
yes
Is the mean resistant to outliers?
no
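A small sketch of resistance using Python's statistics module. The data values are made up: one extreme value is added to a small data set to see how much the mean and the median each move.

```python
import statistics

data = [2, 3, 3, 4, 5]          # made-up values
with_outlier = data + [50]      # same data plus one extreme value

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)      # jumps from 3.4 to about 11.2
print("median:", median_before, "->", median_after)  # moves only from 3 to 3.5
```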
What is standard deviation?
A measure of how spread out the data are.
Notated by “s.”
What does a larger standard deviation mean?
The larger the standard deviation, the more variability there is, and the more spread out the data are.
Is standard deviation resistant to outliers?
No, because it uses the mean in its calculation.
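A sketch of how the mean enters the standard deviation, assuming the usual sample formula with n - 1 in the denominator (the cards above do not state the formula). The data values are made up.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values
n = len(data)
x_bar = sum(data) / n              # the mean is computed first

# Sum of squared distances from the mean, divided by n - 1, then square-rooted.
squared_deviations = [(x - x_bar) ** 2 for x in data]
s = math.sqrt(sum(squared_deviations) / (n - 1))

print(s)
print(statistics.stdev(data))      # the library's sample standard deviation
```

Every term measures distance from the mean, so an outlier affects s twice: it shifts the mean and it contributes a large squared deviation.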
What is the 95% rule?
For a bell-shaped distribution, about 95% of the data fall within two standard deviations of the mean (i.e., between x bar - 2s and x bar + 2s).
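A sketch that checks the 95% rule on simulated bell-shaped data. The "heights" are randomly generated for illustration (center 70, spread 3, fixed seed), not real measurements.

```python
import random
import statistics

rng = random.Random(0)
# Simulated bell-shaped data: 1000 made-up heights in inches.
heights = [rng.gauss(70, 3) for _ in range(1000)]

x_bar = statistics.mean(heights)
s = statistics.stdev(heights)
low, high = x_bar - 2 * s, x_bar + 2 * s

# Share of values that land between x bar - 2s and x bar + 2s.
share_within = sum(low <= h <= high for h in heights) / len(heights)
print(round(low, 1), round(high, 1), share_within)
```

The share should come out close to 0.95. For a strongly skewed data set the rule is not expected to hold.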
What is a z-score?
The number of standard deviations a value is from the mean: z = (x - x bar) / s. A z-score with a larger magnitude means the data point is farther from the mean; a negative z-score means the value is below the mean.
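A small sketch of the z-score calculation. The function name z_score and the example numbers (mean 70, standard deviation 5) are made up for illustration.

```python
def z_score(x, mean, s):
    """Number of standard deviations that x lies from the mean."""
    if s <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / s

print(z_score(80, 70, 5))     # 2.0: two standard deviations above the mean
print(z_score(62.5, 70, 5))   # -1.5: one and a half standard deviations below
```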
How can we estimate standard deviation by looking at a histogram?
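Estimate the interval that holds about the middle 95% of the data, subtract its lower end from its upper end, and divide by 4. By the 95% rule, that interval spans about four standard deviations (from x bar - 2s to x bar + 2s).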
What is a percentile?
The pth percentile is the value that is greater than p% of the data.
Ex. if your height is at the 40th percentile, 40% of people are shorter than you
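A sketch of the idea behind a percentile: count the share of values that fall below a given value. The helper percent_below and the heights are made up for illustration; statistical software may use slightly different percentile conventions.

```python
def percent_below(data, value):
    """Percent of data values that are strictly less than value."""
    return 100 * sum(x < value for x in data) / len(data)

# Hypothetical heights in inches for ten people.
heights = [60, 62, 64, 65, 66, 67, 68, 70, 72, 74]

print(percent_below(heights, 66))   # 40.0, so 66 sits at about the 40th percentile
```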
What does the five number summary include?
minimum, Q1, median, Q3, maximum
What is Q1 (first quartile)?
Median of values below the median (25th percentile)
What is Q3 (third quartile)?
Median of values above the median (75th percentile)
What is the range?
Maximum - minimum
Is the range resistant to outliers?
No, because the range is calculated from the minimum and maximum, which are exactly the values an outlier changes.
What is IQR?
Interquartile Range
Q3 - Q1
Is IQR resistant to outliers?
Yes, because it depends only on Q1 and Q3, so extreme values in the tails do not affect it. The IQR captures the spread of the middle 50% of the data.
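A sketch of the five number summary, range, and IQR, using the quartile definitions from these cards (Q1 and Q3 as medians of the lower and upper halves, leaving out the middle value when the count is odd). The function name and data are made up, and software packages often use slightly different quartile conventions.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum); needs at least 2 values."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                    # values below the median
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]   # values above the median
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]    # made-up values
minimum, q1, median, q3, maximum = five_number_summary(data)

print(minimum, q1, median, q3, maximum)   # 1 3.5 7 11.0 15
print("range:", maximum - minimum)        # 14
print("IQR:  ", q3 - q1)                  # 7.5
```

Replacing the 15 with 150 changes the maximum and the range but leaves Q1, Q3, and the IQR unchanged.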
When is the five number summary preferred?
Preferred for skewed distributions (rather than the mean and standard deviation)
What do boxplots display?
Boxplots are used for one quantitative variable and they display the five number summary.
How do we represent data with both quantitative AND categorical variables?
- side-by-side histogram
- side-by-side dotplot
- side-by-side boxplot