Midterm Flashcards
data
observations collected from field notes, surveys, experiments, etc
what is the backbone of statistical investigation
data
statistics
study of how best to collect, analyze, and draw conclusions from data
classic challenge in statistics
evaluating the efficacy of medical treatment
summary statistic
a single number summarizing a large amount of data
variables
characteristics recorded for each case (eg height, number of siblings)
data matrix
a way to organize data: each row is a case (observational unit) and each column is a variable
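A minimal sketch of a data matrix, assuming a small made-up dataset. The cases, the variable names and the helper `column` are all hypothetical and only illustrate the row/column idea.

```python
# Hypothetical data matrix: each dict is one case (row),
# each key is one variable (column). All values are made up.
data_matrix = [
    {"name": "A", "siblings": 2, "height_cm": 170.5, "favourite_ice_cream": "vanilla"},
    {"name": "B", "siblings": 0, "height_cm": 158.0, "favourite_ice_cream": "mint"},
    {"name": "C", "siblings": 1, "height_cm": 181.2, "favourite_ice_cream": "vanilla"},
]

def column(matrix, variable):
    """Return every observed value of one variable (one column)."""
    return [case[variable] for case in matrix]

n_cases = len(data_matrix)                 # number of rows
variables = list(data_matrix[0].keys())    # column names
siblings = column(data_matrix, "siblings")
```

Here `siblings` is a discrete numerical variable, `height_cm` is continuous, and `favourite_ice_cream` is nominal.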
numerical variable
wide range of numerical values, sensible to add/subtract/take averages
types of numerical variables
discrete, continuous
discrete
can only take numerical values with jumps (eg number of siblings)
continuous
can take numerical values without jumps (eg height)
categorical
responses are categories
types of categorical
ordinal, nominal
ordinal variable
categorical, but with a natural ordering (eg Likert scale)
nominal variable
categorical and no natural ordering (eg favourite ice cream)
negative, positive, independent association
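positive: as one variable increases, the other tends to increase
negative: as one variable increases, the other tends to decrease
independent: the two variables show no evident relationship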
population vs sample
population: the entire group of interest
sample: the subset of the population that is actually observed
anecdotal evidence
data collected in haphazard fashion from individual cases, usually composed of unusual cases that we recall based on their striking characteristics
random sampling
selecting cases at random so that the selection itself doesn’t add bias
simple random sample
most basic random sample, equivalent to a raffle; every case in the population has an equal chance of being included
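A small sketch of the raffle idea, assuming the population fits in a list. The function name `simple_random_sample` and the numbered population are hypothetical; `random.sample` draws without replacement, so every case has the same chance of being picked.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct cases, each with an equal chance of inclusion."""
    rng = random.Random(seed)  # seed only makes the draw repeatable
    return rng.sample(population, n)

# Hypothetical population: 100 cases labelled 1..100
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=1)
```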
non response bias
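when only some of the sampled individuals respond, the respondents may not be representative of the population; a low response rate can bias even a random sample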
convenience sample
individuals who are easily accessible are more likely to be included in the sample
explanatory variables
independent variable
response variables
dependent variable
observational studies
collection of data in a way that doesn’t directly interfere with how the data arises
eg: collecting surveys, ethnography, etc
randomized experiment
when individuals are randomly assigned to a group
confounding variable
variable correlated with both the explanatory and response variables
aka: lurking variable, confounding factor, confounder
prospective study
identifies individuals and collects information as events unfold
eg: medical researchers may identify and follow a group of similar individuals over many years
retrospective study
collects data after events have taken place
eg: researchers may review past events in medical records
simple random sampling
every case in population has equal chance of being included
stratified sampling
divide-and-conquer; population is divided into strata (which are chosen so similar cases are grouped together), then a second sampling method (usually simple random) is employed within each stratum
eg: who in Canada goes to theme parks? intentionally oversampling PEI because if we didn’t, most of the respondents would probably be from other provinces like Ontario, and PEI might be skipped entirely
when is stratified sampling useful?
when cases in each stratum are very similar with respect to the outcome of interest
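A rough sketch of stratified sampling, assuming each case can be mapped to a stratum by a function. The names `stratified_sample`, `stratum_of` and `n_per_stratum`, and the made-up province data, are hypothetical. Taking the same number from each stratum is one way to oversample a small stratum such as PEI.

```python
import random
from collections import defaultdict

def stratified_sample(cases, stratum_of, n_per_stratum, seed=None):
    """Group cases into strata, then take a simple random sample in each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[stratum_of(case)].append(case)
    sample = []
    for members in strata.values():
        k = min(n_per_stratum, len(members))  # a stratum may be small
        sample.extend(rng.sample(members, k))
    return sample

# Made-up population: 50 respondents from ON, only 5 from PEI
people = [("ON", i) for i in range(50)] + [("PEI", i) for i in range(5)]
picked = stratified_sample(people, stratum_of=lambda p: p[0],
                           n_per_stratum=3, seed=0)
```

With a plain simple random sample of 6 from these 55 people, PEI could easily be missed; here it is guaranteed 3 of the 6 slots.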
cluster sample
break up population into clusters, then sample a fixed number of clusters and include all observations from each of the sampled clusters
eg: surveying Saskatchewan children by sampling Saskatchewan schools randomly, then including every kid from the selected schools
multistage sample
like a cluster sample, but a random sample is collected within each selected cluster instead of keeping every observation
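A sketch contrasting the two designs, assuming clusters are stored as a dict from cluster name to its list of cases. The function names and the made-up schools are hypothetical. The only difference is the second stage: a cluster sample keeps everyone in the chosen clusters, a multistage sample draws randomly within them.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick clusters at random and keep every case in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [case for name in chosen for case in clusters[name]]

def multistage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Pick clusters at random, then randomly sample within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        k = min(n_per_cluster, len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Made-up population: 10 schools with 20 kids each
schools = {f"school_{i}": [f"s{i}_kid{j}" for j in range(20)]
           for i in range(10)}
all_kids = cluster_sample(schools, n_clusters=3, seed=0)
some_kids = multistage_sample(schools, n_clusters=3, n_per_cluster=5, seed=0)
```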
pros and cons of cluster and multistage sample
+cluster/multistage can be more economical than alternative sampling techniques
+most useful when there’s a lot of case-to-case variability within cluster but clusters themselves don’t look very different from one another
eg: neighbourhoods when they are very diverse
-more advanced analysis techniques are typically required
scatter plots (and their strength)
provide a case-by-case view of two numerical variables
+helpful for quickly spotting associations between variables, trends, etc
dot plots
provide the most basic of displays for one variable; like a one-variable scatter plot
mean
common way to measure centre of distribution of data
- add up and divide by n
- often labeled as x-bar
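The "add up and divide by n" recipe as a short sketch. The function name `sample_mean` and the sibling counts are made up for illustration; the standard library's `statistics.mean` does the same calculation.

```python
from statistics import mean

def sample_mean(values):
    """x-bar: add up all observations and divide by n."""
    return sum(values) / len(values)

siblings = [2, 0, 1, 3, 4]       # made-up observations, n = 5
x_bar = sample_mean(siblings)    # (2 + 0 + 1 + 3 + 4) / 5
```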
μ
population mean
μx
the subscript shows which variable the population mean refers to (here, x)
histograms
don’t show the value of each observation
each value belongs to a bin
binned counts are plotted as bars on the histogram
provide a view of data density
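A sketch of the binning step, assuming equal-width bins that include their left edge and exclude their right edge. The function name `bin_counts` and the heights are hypothetical. The printed bars are a crude text histogram.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Map each bin's left edge to the number of values in that bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up heights in cm
heights_cm = [152, 158, 161, 163, 165, 167, 168, 171, 174, 183]
counts = bin_counts(heights_cm, bin_width=10)

# One bar per bin; taller bars mean higher data density
for left_edge, count in counts.items():
    print(f"{left_edge}-{left_edge + 10}: {'#' * count}")
```

The individual heights are no longer visible in `counts`, only how many fall in each bin.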
pros and cons of histogram
+convenient for describing shape of data distribution
-doesn’t show the exact value of each observation, so the exact mode can’t be read off (only the tallest bin)
skewness
right skew (longer right tail), left skew (longer left tail), symmetric (equal tails)
one, two, three prominent peaks
unimodal, bimodal, multimodal
two measures of variability
variance, standard deviation
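A sketch of both measures, assuming the usual sample versions that divide by n - 1 (the cards don't say which divisor is intended). The function name `sample_variance` and the data are made up; `statistics.variance` and `statistics.stdev` use the same n - 1 convention.

```python
import math
from statistics import variance, stdev

def sample_variance(values):
    """Average squared distance from the mean, dividing by n - 1 (assumed)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

values = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up observations, mean is 5
s2 = sample_variance(values)         # variance: squared deviations sum to 32, divided by 7
s = math.sqrt(s2)                    # standard deviation: square root of the variance
```

The standard deviation is in the same units as the data, which is why it is usually the easier of the two to interpret.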