KNPE 251 1/2 Flashcards
5 hierarchical scales
-Sampling Unit
-Sample
-Observation Unit
-Statistical Population
-Population of Interest
Sampling Unit
unit being selected at random
Sample
collection of sampling units randomly selected
Observation Unit
scale for data collection
Statistical Population
collection of all sampling units that could have been in sample
Population of interest
collection of sampling units that you hope to draw a conclusion about (Scope of research question)
Measurement variable
what we want to measure about the observation unit
Measurement unit
scale of measurement variable (cm, years etc)
What is the measurement unit if the data is categorical
no measurement unit
what is descriptive statistics used for?
describe the data in the sample
What is inferential statistics used for?
describe the statistical population based on the sample
steps to carry out study
- Sampling
- Measure
- Calculate Descriptive Statistics
- calculate inferential statistics
Goals of ideal sampling designs
-all sampling units must have some probability of being included in sample (p>0)
-Selection of sampling units is unbiased
-selection of sampling units is independent
-Each possible sample has equal chance of being selected
what is an observational study
based on observations of a statistical population
*researchers do not have any control over the variables
Primary goal of observational study
characterize something about an existing population
Limitations of observational study
cannot make statements about whether a factor CAUSES the response you are interested in
Response Variable
response you are interested in
Explanatory variable
factor you are investigating
Confounding Variable
unobserved variables that affect the response variable
Spurious
when the relationship between an explanatory variable and a response variable is thought to be driven by a confounding variable
Simple Random Survey
starts by identifying every sampling unit in the statistical population and then selecting a random subset of those units as the sample
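A minimal sketch of a simple random survey, assuming the statistical population can be listed in full. The population of 100 numbered units and the helper name `simple_random_sample` are hypothetical, made up for illustration.

```python
import random

# Hypothetical statistical population: 100 sampling units identified by ID.
statistical_population = list(range(1, 101))

def simple_random_sample(units, n, seed=None):
    """Select n sampling units without replacement; every unit is equally likely."""
    rng = random.Random(seed)  # seeded so the draw can be repeated
    return rng.sample(units, n)

sample = simple_random_sample(statistical_population, n=10, seed=1)
print(sample)
```

Because `random.Random.sample` draws without replacement, no sampling unit can appear twice in the sample.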
Stratified Survey
-used when there are subgroups within the statistical population that can influence the results
-break statistical population into strata, then sample within each stratum
**strata must be defined ahead of time by researcher
**each stratum has equal weighting in sample
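A rough sketch of a stratified survey, assuming "equal weighting" means drawing the same number of sampling units from every stratum. The strata names, unit IDs, and the function `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw the same number of sampling units at random from each predefined stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

# Hypothetical strata, defined by the researcher before sampling.
strata = {
    "first_year": [f"F{i}" for i in range(50)],
    "upper_year": [f"U{i}" for i in range(30)],
}

sample = stratified_sample(strata, n_per_stratum=5, seed=1)
print(sample)
```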
Cluster Survey
-used to remove diversity in the statistical population that is not relevant to research question
-create groups where the non-relevant diversity is contained within each group
-can be done in one or two stage designs
One stage cluster design
data is collected from ALL observation units in a cluster
Two stage cluster design
a subset of observation units are randomly selected within each cluster
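One way to sketch the one stage and two stage cluster designs. It assumes the clusters themselves are chosen at random, which the cards do not state, and the cluster names and the function `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster=None, seed=None):
    """One stage design when n_per_cluster is None, two stage design otherwise."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # assumption: clusters picked at random
    result = {}
    for name in chosen:
        units = clusters[name]
        if n_per_cluster is None:
            result[name] = list(units)                       # one stage: ALL observation units
        else:
            result[name] = rng.sample(units, n_per_cluster)  # two stage: random subset
    return result

# Hypothetical clusters of observation units.
clusters = {f"school_{c}": [f"{c}{i}" for i in range(6)] for c in "ABCD"}

one_stage = cluster_sample(clusters, n_clusters=2, seed=1)
two_stage = cluster_sample(clusters, n_clusters=2, n_per_cluster=3, seed=1)
```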
Case-Control Survey
-used to compare data between 2 groups
-1st group is the “case” and contains sampling units with a particular response variable
-2nd group is the “control” and contains sampling units without the response variable of the case group
-purposely biased as it aims to select sampling units for the case group based on a measured response variable and compare that to the control group
***high spurious chance
**retrospective
Cohort survey
-follow sampling units over time, looking for development of a particular response variable
-goal is to select a random set of sampling units and observe over time
**outcomes unknown when sampling units selected
**prospective
Retrospective
outcome is already known, looking back in time
*increase risk of spurious relationships
Prospective
outcome is not yet known, looking forward in time
Cross-sectional
studies that examine a response variable at only a single snapshot in time
Longitudinal
studying a response variable at multiple points in time
experimental studies
-treatment only starts once the sampling unit has been assigned to a treatment group
-based on creating treatments where the researcher controls one or more variables
-establish cause-effect among variables
-each manipulated variable is called a factor (each factor has levels)
the 2 steps when sampling units are selected at random in experimental studies
- Selection
- Assignment
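The two random steps can be sketched as below. This is an illustration only: the population, the treatment names, and the function `select_and_assign` are hypothetical, and equal group sizes are an assumption.

```python
import random

def select_and_assign(population, n, treatments, seed=None):
    """Step 1: randomly select n sampling units. Step 2: randomly assign them to treatments."""
    rng = random.Random(seed)
    selected = rng.sample(population, n)  # selection
    rng.shuffle(selected)                 # assignment: random order, then deal out in turn
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(selected)}

assignment = select_and_assign(
    population=list(range(100)), n=12,
    treatments=["control", "low", "high"], seed=1,
)
print(assignment)
```

Dealing the shuffled units out in turn gives each treatment the same number of sampling units when `n` is a multiple of the number of treatments.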
Replication
the idea that a treatment is repeated a number of times to see how repeatable a measured outcome is
Pseudoreplication
where the observation units, rather than the sampling units, are treated as the replicates in the analysis
*this is an error in the design of an experimental study
Types of experimental study designs
-control treatment
-blocking
-blinding
-placebo
-sham treatment
Control treatment
contains everything except the actual treatment; reference to compare treatment levels against
Blocking
predefined groups where treatments are applied within each group; you can randomly allocate sampling units to treatments within a group, but can't allocate them across groups
Blinding
sampling unit does not know which treatment it is receiving
(double blind: researcher does not know either)
Placebo
given substance/treatment that has no effect on the response variable
Sham Treatment
controls for treatments that require handling the sampling unit
(aims to account for effect of delivery of a treatment that is not of interest)
3 pieces of information a variable contains
-what the variable represents
-the measurement unit
-description of the observation unit
4 subtypes of variables
continuous, discrete, ordinal and nominal
continuous variable
can take on continuous numbers (any value including fractions)
discrete variable
can only take on whole numbers (integers)
ordinal categorical variable
can take on qualitative values but where values are from a ranked scale
nominal categorical variables
can take on qualitative values but where values have no particular order
what is central tendency
describes typical values in a sample
*2nd quartile, the median
what is dispersion
describes the spread of values
*range of the inner-most 50% of the data: 3rd quartile minus 1st quartile (IQR)
what do central tendency and dispersion depend on
whether the variable is numerical or categorical
two ways to characterize categorical data
counts and proportions
what are counts
the number of sampling units in each category
what are proportions
the share of the total sampling units in each category
(frequency/ total)
what do counts and proportions indicate
the central tendency of categorical data
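A small sketch of counts and proportions for one categorical variable. The blood-type data are made up for illustration, and `counts_and_proportions` is a hypothetical helper.

```python
from collections import Counter

def counts_and_proportions(values):
    """Return the count and the share of the total for each category."""
    counts = Counter(values)
    total = sum(counts.values())
    proportions = {level: count / total for level, count in counts.items()}
    return counts, proportions

# Made-up categorical data: one value per sampling unit.
blood_types = ["A", "B", "A", "O", "A", "B", "A", "O"]
counts, proportions = counts_and_proportions(blood_types)
print(counts)
print(proportions)
```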
what is range used to indicate
dispersion
what is mean used to describe
central tendency
what is variance used to indicate
dispersion
Steps to calculating the mean
- sum of all values in a sample
- divide by the number of data points in the sample
steps to calculating standard deviation
- calculate the mean for a sample
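- calculate the difference between each data point and the mean, then square each difference
- sum the squared differences and divide by the number of observations (this gives the variance)
- take the square root of the variance to get the standard deviation
*dividing by the number of observations gives the population variance; dividing by n - 1 gives the sample variance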
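The steps for the mean, variance, and standard deviation can be sketched as follows. The standard deviation is the square root of the variance. The data set is made up, and the `population` flag is an illustrative choice for switching between dividing by n and dividing by n - 1.

```python
import math

def mean(values):
    """Sum of all values divided by the number of data points."""
    return sum(values) / len(values)

def variance(values, population=True):
    """Average squared difference from the mean (divide by n, or by n - 1 for a sample)."""
    m = mean(values)
    squared_diffs = [(x - m) ** 2 for x in values]
    divisor = len(values) if population else len(values) - 1
    return sum(squared_diffs) / divisor

def standard_deviation(values, population=True):
    """Square root of the variance, back on the original measurement scale."""
    return math.sqrt(variance(values, population))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values
print(mean(data), variance(data), standard_deviation(data))
```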
what do quartiles show us
central tendency and dispersion
steps to calculating quartiles
- Sort data lowest to highest
- find the second quartile by splitting data in half
- find the first quartile by subsetting the lower-valued half of the observations, then find middle value (the second quartile IS included if the # of observations is odd)
- find the third quartile by repeating step 3 for the upper valued half
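A sketch of the quartile steps above, following the rule that the second quartile is included in both halves when the number of observations is odd. Averaging the two middle values for an even count is an assumption (a common convention), and the function names are hypothetical.

```python
def median(sorted_values):
    """Middle value of already sorted data (mean of the two middle values if the count is even)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the split-in-half method."""
    data = sorted(values)                      # step 1: sort lowest to highest
    n = len(data)
    half = n // 2
    q2 = median(data)                          # step 2: second quartile
    if n % 2 == 1:
        lower, upper = data[:half + 1], data[half:]  # odd count: median in both halves
    else:
        lower, upper = data[:half], data[half:]
    return median(lower), q2, median(upper)    # steps 3 and 4

q1, q2, q3 = quartiles([7, 1, 9, 3, 5])
iqr = q3 - q1  # dispersion: range of the inner-most 50% of the data
print(q1, q2, q3, iqr)
```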
when to use quartiles over mean
-for larger data sets
**because quartiles are sensitive when a dataset is small, since the median can change a lot
When to use mean over quartiles
for smaller data sets
**but the mean is sensitive to outliers, which can change it a lot
What is effect size
the change in mean value of response variable among groups
*used to evaluate whether change in response variables is meaningful
what does effect size allow for
allows us to put study results into context and look at change across groups
two types of effect size
absolute and relative
how can effect size be calculated
as either a difference or a ratio
**depends on study
calculating effect size using difference method advantage
retains original scale
calculating effect size using ratio method advantage
indicates relative change but loses original scale
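A short sketch comparing the two ways of expressing effect size. The group means (resting heart rates of 60 and 50) are made-up numbers, and the function names are hypothetical.

```python
def effect_size_difference(mean_a, mean_b):
    """Absolute effect size: keeps the original measurement unit."""
    return mean_a - mean_b

def effect_size_ratio(mean_a, mean_b):
    """Relative effect size: unitless, so the original scale is lost."""
    return mean_a / mean_b

# Made-up group means in beats per minute.
control_mean = 50
treatment_mean = 60

difference = effect_size_difference(treatment_mean, control_mean)
ratio = effect_size_ratio(treatment_mean, control_mean)
print(difference, ratio)
```

Here the difference reads as "10 beats per minute higher", while the ratio reads as "1.2 times the control value".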
what are contingency tables
tables of data frequencies or proportions within different levels of categorical variable
what do contingency tables show
frequency or proportion of sampling units in each level of a categorical variable
what is frequency
number of sampling units that falls in each level
one way vs two way categorical variables
one way: one categorical variable
two way: two categorical variables
what are marginal distributions
-allow you to see patterns in contingency tables
-the row and column sums of a two-way contingency table
what are conditional distributions
two way tables that show the interaction between the two variables in a contingency table
difference between marginal and conditional distributions
conditional distributions look at the relative proportion of sampling units across the levels of one variable but within a single level of another variable
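A sketch of a two way contingency table with its marginal and conditional distributions. The (group, outcome) pairs are made up, and the helper names are hypothetical.

```python
from collections import Counter

def two_way_table(pairs):
    """Frequency of sampling units for each (row level, column level) combination."""
    return Counter(pairs)

def marginal(table, axis):
    """Row sums (axis=0) or column sums (axis=1) of the two way table."""
    totals = Counter()
    for key, count in table.items():
        totals[key[axis]] += count
    return totals

def conditional(table, row_level):
    """Proportions across the column levels, within a single row level."""
    row = {col: count for (r, col), count in table.items() if r == row_level}
    total = sum(row.values())
    return {col: count / total for col, count in row.items()}

# Made-up data: one (group, outcome) pair per sampling unit.
pairs = ([("treatment", "improved")] * 6 + [("treatment", "not improved")] * 4
         + [("control", "improved")] * 3 + [("control", "not improved")] * 7)

table = two_way_table(pairs)
row_totals = marginal(table, axis=0)
col_totals = marginal(table, axis=1)
treatment_dist = conditional(table, "treatment")
print(row_totals, col_totals, treatment_dist)
```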
When to use a bar graph
used to visualize categorical data (NOT for numerical data)
emphasizing vertical vs horizontal bar graphs
vertical: emphasizes categorical variable
horizontal: focuses on the number of sampling units
only time you can use a bar graph for numerical data
when each categorical level has a single numerical value
(data must be statistical in nature because they can't represent a subsample from a larger population)
bar graphs with two categorical variables
-shows if one is impacting the other
*can be stacked or grouped
-1st variable is “grouping” ; usually ordinal
-2nd variable is secondary
Histograms
split numerical data into bins and display the number of sampling units in each bin
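A minimal sketch of the binning behind a histogram, without any plotting. The heights, the bin width of 10, and the function `histogram_counts` are illustrative assumptions.

```python
def histogram_counts(values, bin_width, start):
    """Count sampling units per bin; each bin is labelled by its left edge."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)  # which bin the value falls into
        left_edge = start + index * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

heights_cm = [152, 158, 161, 163, 167, 170, 171, 174, 179, 183]  # made-up data
bins = histogram_counts(heights_cm, bin_width=10, start=150)
print(bins)
```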
drawback of histograms
if we have a dataset that has a numerical variable and also has a categorical variable with many levels, it can be cumbersome to show histograms for each level of the other categorical variable
what are Box plots used for
visualizing numerical data across groups
5 descriptive statistics that box plots show
-min
-max
-median
-1st quartile
-3rd quartile
Box plots vs histograms
box plots: easy to compare across multiple categorical groups but mask the shape of the distribution
histograms: richest info on the data and the shape of the distribution, but difficult to look at numerical variable across categorical groups
Scatter Plot
shows the pattern between 2 numerical variables, each measured once on many different sampling units
*each point is a sampling unit
axis naming conventions based on descriptive statistics
depend on whether the data are from experimental or observational studies, and whether the treatment variable is displayed.
axis naming conventions based on inferential statistics
depend on whether the statistical analysis is looking at association between the variables or prediction.
Axis naming descriptive statistics: when figure is intended to showcase sample data and experimental study showing treatment
x axis is IV, y axis is DV
Axis naming descriptive statistics: when figure is intended to showcase sample data and experimental study not showing treatment or observational study
x-axis and y-axis are covariates (no causation)
Axis naming for inferential statistics: when figure is intended to showcase inference and association (correlation test)
x-axis and y-axis are covariates
Axis naming for inferential statistics: when figure is intended to showcase inference and prediction (regression test)
x-axis is predictor variable and y-axis is response variable
Line plot
data is collected repeatedly from the same sampling unit (2 numerical variables)
*each line represents a sampling unit
what is probability
the proportion of times an event would occur if a random trial was repeated many times
*used to describe confidence in an outcome or anticipated frequency
what is a random trial
any process with multiple outcomes but where the outcome on any particular trial is unknown
what is sample space
a list of all possible outcomes
what is event
the outcome of interest
Frequentist statistics
random trial must be repeated many times to estimate probability (depends on how accurate you want probability to be)
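A rough simulation of the frequentist idea: repeat a random trial many times and take the proportion of trials in which the event occurs. The die-rolling trial, the event "roll a 6", and the helper names are hypothetical choices for illustration.

```python
import random

def estimate_probability(trial, event, n_trials, seed=None):
    """Proportion of repeated random trials in which the event of interest occurs."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_trials) if event(trial(rng)))
    return hits / n_trials

def roll_die(rng):
    """One random trial; the sample space is {1, 2, 3, 4, 5, 6}."""
    return rng.randint(1, 6)

p_six = estimate_probability(roll_die, lambda outcome: outcome == 6,
                             n_trials=100_000, seed=1)
print(p_six)  # close to 1/6; more trials give a more accurate estimate
```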
Probability distributions
are functions that describe the probability of all events and serve as a tool for calculating probabilities
where can probability be found on a graph
the area under the function
three properties of probability distributions
- describe the probability for the entire sample space
- area under entire curve always sums to 1
- are for both continuous and discrete variables
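As an illustration of the first two properties for a discrete variable, the sketch below builds the exact distribution for the sum of two fair dice. The dice example and the function name are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def two_dice_distribution():
    """Probability mass for every outcome in the sample space (sums 2 to 12)."""
    totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(count, 36) for total, count in sorted(totals.items())}

distribution = two_dice_distribution()
total_probability = sum(distribution.values())  # covers the entire sample space
p_seven = distribution[7]
p_at_least_ten = sum(p for total, p in distribution.items() if total >= 10)
print(total_probability, p_seven, p_at_least_ten)
```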
2 types of probability distributions
discrete and continuous
discrete probability distributions
-typically shown as vertical bars with no space
-y axis is probability mass
continuous probability distributions
-typically shown as a line graph
-y axis is probability density
what is the probability of a single event in continuous distribution
zero
term when a continuous distribution has two peaks
bimodal
standard normal distribution
used to answer any question that is based on probabilities from a normal distribution
*must convert to standard form
when to use forwards and backwards equation of conversion to standard form
forwards: to estimate probabilities when given a range
backwards: to estimate ranges from probability
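A sketch of the forwards and backwards conversions, assuming the usual standard form z = (x - mean) / sd and its inverse x = mean + z * sd. The mean of 170 and standard deviation of 10 are made-up values, and the helper names are hypothetical; `statistics.NormalDist` is part of the Python standard library.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0, sigma=1)

def to_z(x, mean, sd):
    """Forwards: convert a raw value to standard form."""
    return (x - mean) / sd

def from_z(z, mean, sd):
    """Backwards: convert a standard value to the original scale."""
    return mean + z * sd

def probability_between(low, high, mean, sd):
    """Forwards use: probability that a value falls within a given range."""
    return STANDARD_NORMAL.cdf(to_z(high, mean, sd)) - STANDARD_NORMAL.cdf(to_z(low, mean, sd))

def central_range(probability, mean, sd):
    """Backwards use: range that brackets the centre-most probability."""
    z = STANDARD_NORMAL.inv_cdf(1 - (1 - probability) / 2)
    return from_z(-z, mean, sd), from_z(z, mean, sd)

p = probability_between(160, 180, mean=170, sd=10)
low, high = central_range(0.95, mean=170, sd=10)
print(p, low, high)
```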
descriptive statistics
-used to describe attributes of a sample
-quantifiable characteristic of a sample
-values are NOT fixed (can change each time a sample is taken)
Population parameters
-any quantifiable characteristic of a statistical population
-each measurement variable has its own set of population parameters
-values are FIXED (consistent each time a sample is taken)
what is estimation in sampling distributions
descriptive statistics provide an estimate of population parameter
what is sampling distribution
the probability distribution of a descriptive statistic that would emerge if a statistical population were sampled repeatedly a large number of times
is the shape of the sampling distribution influenced by the shape of a statistical population
No. shape of a sampling distribution is independent and does not rely on the shape of the statistical population
what does variance depend on?
the variance of a sampling distribution depends on sample size: the larger the sample, the smaller the variance (inverse relationship)
Central Limit theorem
the development of the principles behind the two key characteristics of sampling distributions
- a sampling distribution has a bell shape; independent of statistical population
- the variance of a sampling distribution decreases as sample size increases
what does the central limit theorem add to shape independence?
-sampling distribution becomes a normal distribution
-mean of sampling distribution is the same as the mean of the statistical population
-standard error can be calculated using standard deviation and sample size
standard deviation of sampling distribution is called…
standard error
calculated by: standard deviation of statistical population divided by the square root of sample size
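A simulation sketch of the standard error. It assumes a made-up, flat (not bell-shaped) statistical population of the integers 1 to 100 and samples with replacement so the draws are independent; the helper names are hypothetical.

```python
import math
import random
import statistics

def standard_error(population_sd, sample_size):
    """Standard deviation of the sampling distribution of the mean."""
    return population_sd / math.sqrt(sample_size)

def simulate_sample_means(population, sample_size, n_samples, seed=None):
    """Sample the statistical population repeatedly and record each sample mean."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=sample_size))
            for _ in range(n_samples)]

population = list(range(1, 101))               # made-up, flat population
population_sd = statistics.pstdev(population)  # treated as known here

means = simulate_sample_means(population, sample_size=25, n_samples=5000, seed=1)
predicted = standard_error(population_sd, sample_size=25)
observed = statistics.stdev(means)
print(predicted, observed)  # the two values should be close
```

The mean of the simulated sample means should also sit close to the population mean of 50.5.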
the chain of inference reinforces that…
statistical population and sampling distribution are not directly observed
steps in chain of inference
- sample
- estimate statistical population
- calculate sampling distribution
**statistical population and sampling distribution are both based off an estimate
what does the central limit theorem assume
that we know statistical population perfectly
solution to uncertainty in estimation
Student's t-distribution
what is Student's t-distribution?
used to describe the sampling distribution when the parameters of the statistical population are estimated from a sample
attributes of t-distribution
-looks like normal distribution but tails are a bit fatter to account for uncertainty in estimate
- sample size influences shape (larger the sample size, the better the estimate and the more it looks like a normal distribution)
confidence intervals
the range over a sampling distribution that brackets the centre-most probability of interest
what is the purpose of confidence intervals
-used to convey uncertainty in descriptive statistics of a sample
*derived from sampling distributions
**the range over the x-axis of a sampling distribution that brackets where new samples may be found with a certain probability
interpreting confidence intervals
-they are an estimate
-give a sense of variation from sampling error
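A simplified sketch of a confidence interval around a sample mean. It assumes the population standard deviation is known, so the normal distribution is used; when the standard deviation is estimated from the sample, the t-distribution would apply and the interval would be somewhat wider. The numbers and the function name are made up for illustration.

```python
import math
from statistics import NormalDist

def confidence_interval(sample_mean, population_sd, sample_size, level=0.95):
    """Range that brackets the centre-most `level` probability of the sampling distribution."""
    se = population_sd / math.sqrt(sample_size)        # standard error
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)      # assumption: normal, not t
    return sample_mean - z * se, sample_mean + z * se

low, high = confidence_interval(sample_mean=170, population_sd=10, sample_size=25)
print(low, high)
```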