Term Test Flashcards
hierarchical scales
simplify process of developing a statistical study design
1. sampling unit
2. sample
3. observation unit
4. statistical population
5. population of interest
sampling unit
unit being selected at random (can be same as observation unit)
sample
collection of sampling units that you randomly selected
observation unit
scale for data collection
- subject of the study
statistical population
collection of all sampling units that could’ve been in your sample
- is defined by your study design
population of interest
collection of sampling units that you hope to draw a conclusion about
- defined by your research question
- can be the same as the statistical population, but the population of interest is often larger
Ex. of hierarchy for design (street address)
pop. of interest: all people of voting age in kingston
statistical population: all addresses in kingston
sampling unit: street address
sample: 100 random street addresses
observation unit: a person
measurement variable: voting intent
measurement unit: none, because the measurement is categorical
measurement variable
what we want to measure about the observation unit (height, age)
measurement unit
scale of measurement variable (cm for height, years for age)
descriptive statistics
characterize data in your sample (quantitative)
- averages, tables & graphs
inferential statistics
uses information from sample to make a probabilistic statement about statistical population (qualitative)
- confidence intervals
***takes uncertainty into account
4 steps to statistical framework
- sampling
- measuring
- calculating descriptive statistics
- calculating inferential statistics
inferential vs descriptive statistics
inferential:use info from data to make statement about STATISTICAL POPULATION
descriptive: use info from data to make statement about OUR SAMPLE
subgroups
divide the population in groups
sampling design
describe how to sample a statistical population in a fair way
4 goals of an ideal sampling design
- all sampling units are selectable
- selection is unbiased
- selection is independent
- all samples are possible
- all sampling units are selectable
every sampling unit has a nonzero probability of being included
- selection is unbiased
probability of selecting certain sampling units cannot depend on any attribute of that sampling unit
- selection is independent
selection of sampling unit must not decrease or increase the probability that any other sampling unit is selected
- all samples are possible
all samples that could be created from statistical population are possible
bias
systematic over- or underestimate of a value in an average sample compared to the statistical population
observational studies
based on observations of a statistical population where researchers do not have any control over the variables which impact our conclusions
- ex. can't control confounding variables, so relationships can't be shown to be causal
goal of observational studies
characterize something about an existing statistical population that allows us to investigate relationships among variables
limitations of observational studies
cannot make statements about whether a factor causes the response you’re interested in
response variable
response you are interested in
- ex. lung cancer
explanatory variable
factor you investigate
- ex. tobacco
confounding variables
unobserved variables that affect the response variable and are related to the explanatory variable
spurious relationship
when relationship between explanatory and response variables is thought to be driven by confounding variable
simple random survey
sampling units are selected at random from the statistical population where each sampling unit has the same probability of being in your sample
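A minimal sketch of a simple random survey, assuming a made-up statistical population of 1,000 numbered street addresses (the names and sizes are illustrative only):

```python
import random

# Hypothetical statistical population: 1,000 street addresses labelled 0..999.
addresses = list(range(1000))

rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# random.sample draws without replacement, and every address has the same
# probability of ending up in the sample.
sample = rng.sample(addresses, k=100)
```

Sampling without replacement means no address can appear twice in the sample.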
stratified survey
researcher creates strata, then takes a sample within each stratum
strata
name given to the subgroups within the statistical population in a stratified survey (singular: stratum)
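A stratified survey can be sketched as a simple random sample taken separately inside each stratum. The neighbourhood strata and sample sizes below are hypothetical:

```python
import random

rng = random.Random(1)  # fixed seed only so the sketch is repeatable

# Hypothetical strata: sampling units grouped by neighbourhood.
strata = {
    "north": [f"north-{i}" for i in range(50)],
    "south": [f"south-{i}" for i in range(30)],
    "west": [f"west-{i}" for i in range(20)],
}

# Take a simple random sample of 5 sampling units within each stratum.
stratified_sample = {
    name: rng.sample(units, k=5) for name, units in strata.items()
}
```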
cluster survey
used to remove diversity in the statistical population that's not relevant to the research question
- cluster = sampling unit
- nested inside the cluster = observation unit
one-stage clusters
data are collected from all observation units in a cluster
two-stage clusters
a subset of observation units are randomly selected within each cluster
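The difference between one-stage and two-stage clusters can be sketched as follows, assuming made-up households as clusters and the people in them as observation units:

```python
import random

rng = random.Random(7)  # fixed seed only so the sketch is repeatable

# Hypothetical clusters: 20 households (sampling units), each holding
# 4 people (observation units).
households = {h: [f"person-{h}-{p}" for p in range(4)] for h in range(20)}

# Randomly select 5 clusters.
chosen = rng.sample(sorted(households), k=5)

# One-stage: collect data from every observation unit in each chosen cluster.
one_stage = [person for h in chosen for person in households[h]]

# Two-stage: randomly select a subset (here 2) of the observation units
# within each chosen cluster.
two_stage = [person for h in chosen for person in rng.sample(households[h], k=2)]
```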
case-control survey
used to compare data between two groups
2 groups:
- case
- control
***strong risk of spurious relationship
case group (first group)
contains sampling units WITH a particular response variable
control group (second group)
contains sampling units WITHOUT the response variable of the case group
cohort survey
sampling units are selected and followed over time
- use simple random survey and then observe their fate over time
retrospective studies
where outcome is already known (increases risk of spurious relationships)
ex. case-control studies
prospective studies
where the outcome is not yet known (require more effort, but decrease risk of spurious relationships)
ex. cohort studies
cross-sectional studies
study a response variable at only a single snapshot in time
longitudinal studies
study a response variable at multiple points in time
experimental studies
based on creating treatments where the researcher controls one or more variables
goals of experimental studies
study effect of one or more manipulated variables on one or more random variables
- establishes cause and effect
factor
a manipulated variable; each factor has two or more levels/groups
replicates
number of times treatment is repeated on randomly selected units
- number of replicates is the number of sampling units in an experimental study
pseudoreplication
an error in the design of an experimental study where the observation units are analyzed rather than the sampling units
levels
different values of the factor
control treatments
contains everything except the treatment
blocking
used to control for variation among sampling units that's not of interest but could alter the experimental variable
***PREDEFINED
blinded
a design where the sampling unit (usually a person) does not know what treatment they are being exposed to
single blind design
sampling unit does not know the treatment they are assigned
double blind design
neither the researcher nor the sampling unit knows which treatment has been assigned
***removes accidental bias
placebo
method used for control treatment that helps accomplish a blinded design
- substance or treatment that has no effect on response variable
sham treatment
aims to account for the effect of delivering a treatment, which is not of interest to the researcher
multiple factors
one factor could be drug type and another is diet type
interaction
when two explanatory variables have effects that are different than the simple sum of each variable in isolation
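A small worked example may help here. The four treatment means are made up, and the interaction is taken as whatever is left over after adding the two separate effects:

```python
# Made-up mean responses for two factors (drug, diet), each with two levels.
mean_control = 10.0     # neither drug nor special diet
mean_drug_only = 12.0
mean_diet_only = 13.0
mean_both = 20.0

drug_effect = mean_drug_only - mean_control   # effect of drug in isolation
diet_effect = mean_diet_only - mean_control   # effect of diet in isolation

# If the effects simply added up, the combined treatment would give this mean.
additive_expectation = mean_control + drug_effect + diet_effect

# A non-zero remainder suggests an interaction between the two factors.
interaction = mean_both - additive_expectation
```

Here the combined treatment gives 20 rather than the additive 15, so the leftover of 5 is the interaction.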
variable
any measurable characteristic of an observation unit (varies among sampling units)
3 pieces of information a variable contains
- what the variable represents
- measurement unit
- description of the observation units
data
value of a variable you measure
continuous numerical variable
can take on continuous numbers (fractional numbers)
ex. weight =107.23kg
discrete numerical variable
can take on only whole numbers (integers)
categorical
data is a qualitative description
- no measurement units
ordinal categorical variable
categorical (qualitative) variables that have ORDERED levels
ex. use emojis to describe how you feel
nominal categorical variable
can take on qualitative values but where values do not have any particular order
central tendency
describes the typical value in your sample (ex. mean)
dispersion
describes the spread of the values (ex. variance)
counts
number of sampling units in each category
proportion
share of the total number of sampling units in each category
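Counts and proportions for a categorical variable can be sketched like this, using made-up voting-intent answers:

```python
from collections import Counter

# Made-up categorical data: voting intent of 8 people.
voting_intent = ["yes", "no", "yes", "undecided", "yes", "no", "yes", "no"]

counts = Counter(voting_intent)   # number of sampling units per category
total = sum(counts.values())

# Proportion = count in a category divided by the total count.
proportions = {category: count / total for category, count in counts.items()}
```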
variance
measure of the amount of variation in your sample
standard deviation
square root of variance
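A short sketch of variance and standard deviation with Python's statistics module, on a made-up sample of heights. It assumes the sample versions (dividing by n - 1) are the ones intended:

```python
import math
import statistics

heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]  # made-up sample

mean = statistics.mean(heights_cm)
variance = statistics.variance(heights_cm)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(heights_cm)      # square root of the sample variance
```

The standard deviation is back on the original scale (cm), while the variance is in squared units.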
quartiles
specific values of the variable that divide your data into ranked groups
median
central tendency is given by the second quartile
dispersion
describes how much variation there is in a sample
interquartile range
range between 1st and 3rd quartiles
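Quartiles, the median and the interquartile range can be sketched as below. The data are made up, and statistics.quantiles uses one of several quartile conventions, so other tools may give slightly different values:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]  # made-up sample

# n=4 asks for the three cut points that split the ranked data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4)

median = statistics.median(data)  # same value as the second quartile
iqr = q3 - q1                     # range between 1st and 3rd quartiles
```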
when are quartiles sensitive?
when data set is small
pros to quartiles
median and IQR are robust to extreme values
cons to quartiles
median and IQR become quite variable for samples with a small number of observations
what are means sensitive to?
outliers
pros to means
mean and standard deviation are more robust when there's a small number of observations
cons to means
mean and standard deviation are sensitive to extreme values
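The sensitivity of the mean to extreme values, compared with the median, can be illustrated with a made-up sample in which one value is replaced by an outlier:

```python
import statistics

sample = [10, 11, 12, 13, 14]
with_outlier = [10, 11, 12, 13, 140]  # same sample with one extreme value

# How far each statistic moves when the extreme value is introduced.
mean_shift = statistics.mean(with_outlier) - statistics.mean(sample)
median_shift = statistics.median(with_outlier) - statistics.median(sample)
```

The median does not move at all, while the mean is pulled strongly toward the extreme value.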
effect size
used to evaluate whether a change in a response variable is meaningful
absolute effect size
simple change in mean value between groups
- can be calculated as a difference or ratio
difference
differences in mean values among groups
- has advantage of retaining original scale
ratio
ratio of mean values among groups
- has advantage of indicating a relative change, but loses the original scale
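The two ways of expressing an absolute effect size can be sketched with two made-up groups (the group names are illustrative):

```python
import statistics

control = [10.0, 12.0, 14.0]     # made-up responses, control group
treatment = [15.0, 18.0, 21.0]   # made-up responses, treatment group

mean_control = statistics.mean(control)
mean_treatment = statistics.mean(treatment)

difference = mean_treatment - mean_control  # keeps the original scale
ratio = mean_treatment / mean_control       # relative change, scale is lost
```

A difference of 6 is in the units of the measurement, while a ratio of 1.5 reads as a 50% increase.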
contingency table
summarizes data from categorical variables
- shows frequency or proportion of sampling units in each level of a categorical variable
frequency
number of sampling units that fall in each level
contingency tables as proportions
help with visualizing the relative distribution of sampling units among levels
one-way contingency tables
observe 1 categorical variable
two-way contingency tables
observe 2 categorical variables
marginal distributions
the row and column totals of a contingency table
- these frequencies show the overall pattern of each variable
row of contingency table
sum frequencies across all columns for each row
column of contingency table
sum frequencies across all rows for each column
distribution
refers to categorical variables rather than the table
conditional distributions
relative frequencies of one categorical variable within the other
- shows interaction between two variables
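A two-way contingency table with its marginal and conditional distributions can be sketched as follows. The observations (age group, voting intent) are made up:

```python
from collections import Counter

# Made-up observations of two categorical variables for 10 people.
observations = [
    ("young", "yes"), ("young", "yes"), ("young", "no"),
    ("old", "yes"), ("old", "no"), ("old", "no"), ("old", "no"),
    ("young", "yes"), ("old", "yes"), ("young", "no"),
]

# Two-way contingency table: frequency of each combination of levels.
table = Counter(observations)

# Marginal distributions: row totals (age group) and column totals (intent).
row_totals = Counter(age for age, _ in observations)
col_totals = Counter(intent for _, intent in observations)

# Conditional distribution: relative frequency of intent within each age group.
conditional = {
    (age, intent): count / row_totals[age]
    for (age, intent), count in table.items()
}
```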
bar graphs
used to visualize both single variable and two variable categorical data
- NOT USED FOR NUMERICAL DATA
- can be vertical or horizontal
vertical vs. horizontal
depends on research question
- most relevant information should be on the HORIZONTAL axis
grouping variable
forms base of the figure
- typically use ordinal categorical variables
grouped bar chart
levels of variable are shown beside each other
- levels of grouping variable are separated by LARGE gap
- levels of other variable are separated by SMALL gap
stacked bar graph
levels of variable are stacked on top of each other
- colour is used to separate levels
histograms
split numerical data into bins and display number of sampling units in each bin
advantage to histogram
provide great way to visualize the pattern
disadvantage to histogram
complicated to display histograms when your dataset also has multiple levels of a categorical variable
what happens when there are too many bins in a histogram
pattern is lost because there's little variation in frequency
what happens when theres too few bins in a histogram
pattern is lost because of excessive aggregation
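The effect of bin number can be sketched by counting the same made-up values with three different bin widths (bin_counts is a hypothetical helper, not a library function):

```python
from collections import Counter

values = [12, 25, 27, 31, 33, 38, 40, 44, 59, 75]  # made-up numerical data

def bin_counts(data, width):
    """Count values per bin; bin k covers [k * width, (k + 1) * width)."""
    return Counter(v // width for v in data)

too_few = bin_counts(values, 100)  # one bin: everything is aggregated
moderate = bin_counts(values, 20)  # a few bins: a peak is visible
too_many = bin_counts(values, 1)   # one bin per value: every count is 1
```

Only the moderate width shows that most values fall between 20 and 40.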
box plots
shows how the median value differs among groups, and how much variation there is in the data
single box plot
based on quartiles and contains…
1. min
2. max
3. median
4. 1st quartile
5. 3rd quartile
therefore IQR
parts of a single box plot
- a box
- solid line
- whiskers
- extreme value
extreme threshold
pair of imaginary lines drawn above and below box
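One common convention, assumed here because the cards do not state a rule, places the extreme thresholds 1.5 times the IQR beyond the first and third quartiles. A sketch on made-up data:

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 30]  # made-up sample with one extreme value

q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: thresholds sit 1.5 * IQR below Q1 and above Q3.
lower_threshold = q1 - 1.5 * iqr
upper_threshold = q3 + 1.5 * iqr

# Values outside the thresholds are plotted as extreme values.
extremes = [x for x in data if x < lower_threshold or x > upper_threshold]
```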
box plots in observational studies
categorical group would be a measured categorical variable
box plot in experimental studies
categorical group would be the treatment factors
grouped box plot
two categorical groups
pros of histograms
- provide richest information about how your data is distributed
- illustrates shape of the distribution
con of histogram
difficult to look at a numerical variable across categorical groups
pro of box plot
it is easy to compare across multiple categorical groups
con of box plot
conveys much less about shape of distribution
scatter plots
used to show pattern between two numerical variables collected from DIFFERENT sampling units
*HR against age for group of winner
line plots
used when data is collected repeatedly from SAME sampling units
- data points are NOT INDEPENDENT of one another
*HR during a run
x-axis
horizontal
- independent variable
y-axis
vertical
- dependent variable
independent variable
experimental treatment that is manipulated
dependent variable
measured response under those treatments
covariates
when both numerical variables are measured quantities from the sampling unit
- evaluating pattern, so not causal
association
correlation between two variables
- typically covariates
prediction
one variable predicts another
- x-axis=predictor variable
- y-axis=response variable
probability
frequency of a particular outcome or event
random trial
any process that has multiple outcomes but the result on any particular trial is unknown
- can be discrete or continuous
sample space
the list or set of all possible outcomes
- shown with {}
an event
outcome you are interested in
- can be single element in sample space
- can be any subset of the sample space
measurement variable
value of any particular measurement is unknown prior to making the observation
law of large numbers
random trial must be repeated many times to estimate probability
Ex. of probability (rolling a one)
- random trial: rolling a die
- sample space: S={1,2,3,4,5,6}
- event: E={1}
- probability: 1/6, because every side has an equal chance
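The die example and the law of large numbers can be sketched with a simulation, assuming a fair six-sided die; the number of trials is arbitrary:

```python
import random

rng = random.Random(0)  # fixed seed only so the sketch is repeatable

sample_space = [1, 2, 3, 4, 5, 6]
n_trials = 60_000

# Repeat the random trial many times.
rolls = [rng.choice(sample_space) for _ in range(n_trials)]

# Relative frequency of the event E = {1}; it approaches 1/6 as trials increase.
estimate = rolls.count(1) / n_trials
```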
probability distributions
functions that describe the probability over a range of events
properties of probability distributions
- describe probability for entire sample space
- area under a probability distribution always sums to one
- are used to describe both continuous and discrete random variables
discrete distributions
prob distributions for discrete random variable
ex. number of times children ask for ice cream on a hot day
continuous distribution
prob distributions for continuous random variables
ex. mass of an ice cream cone in grams
how is a discrete distribution shown
series of vertical bars with no space between them
- vertical axis=probability mass
how is a continuous distribution shown
single curve as a function of continuous event
- vertical axis=probability density
what are distributions used for
estimating a range or calculating a probability
properties of standard normal distribution
- mean of SND is zero
- standard deviation of SND is one
- x-axis is called the z-score
z-score
a scale that measures number of standard deviations from the mean
range vs. probabilities
probability and range are calculated as opposites
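A sketch of the z-score and of going from a range to a probability and back, assuming a normal distribution with a made-up mean and standard deviation:

```python
from statistics import NormalDist

mean, sd = 170.0, 10.0  # hypothetical population mean and sd (cm)
x = 185.0

# z-score: number of standard deviations between x and the mean.
z = (x - mean) / sd

snd = NormalDist()  # standard normal distribution: mean 0, sd 1

# Range -> probability: chance of a value below x.
p_below = snd.cdf(z)

# Probability -> range: the opposite calculation recovers the z-score.
z_back = snd.inv_cdf(p_below)
```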
population parameters
describe attributes of the statistical population
sampling distributions
distribution of some descriptive statistic that only occurs if you repeatedly draw samples from statistical population
bimodal vs. unimodal
bimodal: two peaks
unimodal: one peak
similarity between sampling distribution and stat. pop.
have the same mean value
difference between sampling distribution and stat. pop
sampling distribution is narrower than stat. pop.
characteristics of sampling distribution
- shape of sampling distribution is independent of stat. pop. as long as sample size is large
- variance decreases as number of sampling units increases
what shape is sampling distribution
smooth bell-shaped distribution (symmetrical)
central limit theorem
- sampling distribution tends towards a normal distribution as sample size increases
- mean of a sampling distribution is the same as mean of stat. pop.
- standard error can be calculated from sd of stat. pop. and sample size
standard error
standard deviation of a sampling distribution
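The sampling distribution of the mean can be approximated by simulation. This sketch assumes a made-up, non-normal (uniform) statistical population and uses the usual formula, standard error = population sd / sqrt(sample size):

```python
import math
import random
import statistics

rng = random.Random(3)  # fixed seed only so the sketch is repeatable

# Hypothetical statistical population: 10,000 values, uniform on [0, 1).
population = [rng.random() for _ in range(10_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# Repeatedly draw samples of size n and record each sample mean.
n = 25
sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(2000)
]

# Standard error observed in the simulation vs. predicted from the formula.
observed_se = statistics.stdev(sample_means)
predicted_se = pop_sd / math.sqrt(n)
```

The mean of sample_means sits close to pop_mean, and the spread of the sample means is much narrower than the spread of the population.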
chain of inference adding to shape independence
the descriptive statistics of a sample provide an estimate of stat. pop. parameters and therefore sampling distribution
Student's t distribution
similar to normal distribution but has a shape that depends on the sample size
what happens to t distribution when sample size is small
it has fatter tails than normal distribution to account for uncertainty
- larger size= more certainty= t-distribution looks more like normal distribution
what is observed directly?
sample
what is not observed directly?
statistical population and sampling distribution (inference, not used in practice)
confidence intervals
describe range over x-axis of a sampling distribution that brackets a certain probability of where new samples may be found
purpose of confidence intervals
provide gauge for how much uncertainty there is in a descriptive statistic
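A rough sketch of a 95% confidence interval for a mean, on made-up data. As a simplifying assumption it uses the normal distribution rather than Student's t distribution (the Python standard library has no t distribution), so the interval is slightly too narrow for small samples:

```python
import math
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(11)  # fixed seed only so the sketch is repeatable
sample = [rng.gauss(50.0, 5.0) for _ in range(40)]  # made-up measurements

n = len(sample)
sample_mean = mean(sample)
standard_error = stdev(sample) / math.sqrt(n)

# z-score that leaves 2.5% in each tail (about 1.96) for a 95% interval.
z = NormalDist().inv_cdf(0.975)

ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error
```

A larger sample gives a smaller standard error and therefore a narrower interval.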
what is the difference between experimental and observational studies?
experimental: causal
observational: correlative
standard error
is unavoidable - helps make the statistical inference