Midterm Flashcards
data
observations collected from field notes, surveys, experiments, etc
what is the backbone of statistical investigation
data
statistics
study of how to collect, analyze, and draw conclusions from data
classic challenge in statistics
evaluating the efficacy of medical treatment
summary statistic
a single number summarizing a large amount of data
variables
a characteristic recorded for each case
data matrix
a way to organize data: each row is a case (observation), each column is a variable
numerical variable
wide range of numerical values, sensible to add/subtract/take averages
types of numerical variables
discrete, continuous
discrete
can only take numerical values with jumps (eg number of siblings)
continuous
can take numerical values without jumps (eg height)
categorical
responses are categories
types of categorical
ordinal, nominal
ordinal variable
categorical but have a natural ordering (eg Likert scale)
nominal variable
categorical and no natural ordering (eg favourite ice cream)
negative, positive, independent association
positive: both variables tend to increase together; negative: one tends to decrease as the other increases; independent: the variables show no association
population vs sample
group we want to make a generalization about vs the group we actually have information about
anecdotal evidence
data collected in haphazard fashion from individual cases, usually composed of unusual cases that we recall based on their striking characteristics
random sampling
avoid adding bias
simple random sample
most basic random sample, using raffle; every case in population has equal chance of being included
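As a sketch of the raffle idea (the population of case IDs here is made up), Python's `random.sample` draws without replacement, so every case has an equal chance of being included:

```python
import random

random.seed(1)  # fixed seed so the draw is reproducible

# hypothetical population of 100 case IDs
population = list(range(1, 101))

# simple random sample of 10 cases, drawn without replacement
srs = random.sample(population, k=10)
```

Because sampling is without replacement, the 10 selected IDs are guaranteed to be distinct.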
non response bias
low response rates can bias even a random sample, because those who respond may differ systematically from those who do not
convenience sample
individuals who are easily accessible are more likely to be included in the sample
explanatory variables
independent variable
response variables
dependent variable
observational studies
collection of data in way that doesn’t directly interfere with how the data arises
eg: collecting surveys, ethnography, etc
randomized experiment
when individuals are randomly assigned to a group
confounding variable
variable correlated with both the explanatory and response variables
aka: lurking variable, confounding factor, confounder
prospective study
identifies individuals and collects information as events unfold
eg: medical researchers may identify and follow a group of similar individuals over many years
retrospective study
collects data after events have taken place
eg: researchers may review past events in medical records
simple random sampling
every case in population has equal chance of being included
stratified sampling
divide-and-conquer; population is divided into strata (which are chosen so similar cases are grouped together), then a second sampling method (usually simple random) is employed within each stratum
eg: who in Canada goes to theme parks? intentionally oversampling PEI because if we didn’t, most of the respondents would probably be from other provinces like Ontario, and PEI might be skipped entirely
when is stratified sampling useful?
when cases in each stratum are very similar with respect to the outcome of interest
cluster sample
break up population into clusters, then sample a fixed number of clusters and include all observations from each sampled cluster
eg: surveying Saskatchewan children by sampling Saskatchewan schools randomly, then simple random sampling kids from the selected schools
multistage sample
like cluster sample, but collect random sample within each selected cluster
pros and cons of cluster and multistage sample
+cluster/multistage can be more economical than alternative sampling techniques
+most useful when there’s a lot of case-to-case variability within cluster but clusters themselves don’t look very different from one another
eg: neighbourhoods when they are very diverse
-more advanced analysis techniques are typically required
scatter plots (and their strengths)
provides case by case view of two numerical variables
+helpful in quickly spotting associations relating variables, trends, etc
dot plots
provides the most basic display for one variable; like a one-variable scatter plot
mean
common way to measure centre of distribution of data
- add up and divide by n
- often labeled as x-bar
μ
population mean
μx
used to represent which variable to population mean refers to
histograms
doesn’t show value of each observation
each value belongs to a bin
binned counts are plotted as bars on histogram
provide view of data density
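The binning step can be sketched in plain Python (the observations and bin width are made up for illustration); the per-bin counts are what the histogram plots as bar heights:

```python
# bin a small set of observations into width-10 bins,
# then count observations per bin (the bar heights)
observations = [3, 7, 12, 15, 18, 24, 25, 31, 44, 47]
bin_width = 10

counts = {}
for x in observations:
    lo = (x // bin_width) * bin_width  # left edge of x's bin
    counts[lo] = counts.get(lo, 0) + 1
```

Note that once binned, the individual values (e.g. 12 vs 18) are no longer visible, only the density per bin.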
pros and cons of histogram
convenient for describing shape of data distribution
doesn’t show the exact value of each observation
skewness
right skew (longer right tail) left skew (longer left tail) symmetric (equal tails)
one, two, three prominent peaks
unimodal, bimodal, multimodal
two measures of variability
variance, standard deviation
variance
the average squared deviation
σ², the standard deviation squared
standard deviation
σ
describes how far away the typical observation is from the mean
deviation
distance of an observation from its mean
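The three cards above fit together in a few lines; this sketch (made-up data) computes the sample variance as the average squared deviation and cross-checks it against the standard library:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)           # x-bar
deviations = [x - mean for x in data]  # distance of each observation from its mean
# sample variance: average squared deviation (dividing by n - 1)
variance = sum(d ** 2 for d in deviations) / (len(data) - 1)
sd = variance ** 0.5                   # standard deviation

# cross-check against the standard library
assert abs(variance - statistics.variance(data)) < 1e-9
```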
box plots
•summarizes data set using five statistics while also plotting unusual observations
•step 1: draw dark line denoting the median, which splits data in half
•step 2: draw rectangle to represent the middle 50% of the data
⁃aka interquartile range aka IQR
⁃measure of variability in data
⁃the more variable the data, the larger the standard deviation and IQR
⁃two boundaries are called first quartile and third quartile
⁃Q1 and Q3 respectively
⁃IQR = Q3 − Q1
•step 3: whiskers attempt to capture data outside of the box
⁃reach is never allowed to be more than 1.5 x IQR
•step 4: any observations beyond the whiskers are identified as outliers
•robust estimates: extreme observations have little effect on value
⁃median and IQR are robust estimates
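The box-plot steps above can be sketched numerically (data values made up, with one suspiciously large observation); `statistics.quantiles` supplies Q1, the median, and Q3, and the 1.5 × IQR fences flag outliers:

```python
import statistics

data = [1, 3, 4, 5, 6, 7, 8, 9, 10, 40]  # 40 is a suspected outlier

q1, median, q3 = statistics.quantiles(data, n=4)  # the three quartiles
iqr = q3 - q1
# whiskers reach at most 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```

Only 40 lands beyond the fences, matching step 4 of the card.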
Mapping Data
colours are used to show higher and lower values of a variable
not helpful for getting precise values
helpful for seeing geographic trends and generating interesting research questions
contingency tables
summarized data for two categorical variables
-each value in table represents number of times a particular combination of variable outcomes occurred
row totals
total counts across each row
column totals
total counts down each column
relative frequency table
replace counts with percentages or proportions
row proportions
computed as counts divided by row totals
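A small sketch of a contingency table with its row totals and row proportions (the counts are invented for illustration):

```python
# hypothetical contingency table: rows = group, columns = response
table = {
    "treatment": {"improved": 30, "not improved": 20},
    "control":   {"improved": 15, "not improved": 35},
}

# row totals: total counts across each row
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
# row proportions: each count divided by its row total
row_props = {
    row: {col: count / row_totals[row] for col, count in cells.items()}
    for row, cells in table.items()
}
```

Each row of `row_props` sums to 1, which is what makes rows of different sizes comparable.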
segmented bar plot
graphical display of contingency table information
mosaic plot
graphical display of contingency table information
-use areas to represent number of observations
probability
proportion of times the outcome would occur if we observed the random process an infinite number of times
law of large numbers
as more observations are collected, the proportion p̂ₙ of occurrences with a particular outcome converges to the probability p of that outcome
disjoint outcomes
aka mutually exclusive
when two outcomes cannot happen at the same time
probability distributions
table of all disjoint outcomes and their associated probabilities
complement of an event
all outcomes not in the event
sample space
set of all possible outcomes
independence
when knowing the outcome of one process provides no useful information about the outcome of the other
marginal probability
if a probability is based on a single variable
joint probability
probability of outcomes is based on two or more variables
defining conditional probability
two parts: outcome of interest and condition
condition
information we know to be true
conditional probability
the probability of the outcome of interest A given condition B, written P(A | B)
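The marginal/joint/conditional distinction can be computed from a table of counts; this sketch uses an invented smoking-and-disease table and the identity P(A | B) = P(A and B) / P(B):

```python
# hypothetical joint counts for two variables:
# smoker status (condition B) and disease status (outcome A)
counts = {
    ("smoker", "disease"): 20,
    ("smoker", "no disease"): 80,
    ("non-smoker", "disease"): 10,
    ("non-smoker", "no disease"): 190,
}
total = sum(counts.values())

# joint probability P(A and B): based on both variables
p_joint = counts[("smoker", "disease")] / total
# marginal probability P(B): based on the single variable "smoker status"
p_smoker = (counts[("smoker", "disease")]
            + counts[("smoker", "no disease")]) / total
# conditional probability P(A | B) = P(A and B) / P(B)
p_disease_given_smoker = p_joint / p_smoker
```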
tree diagrams
organize outcomes and probabilities around the structure of data
when are tree diagrams most useful?
when two or more processes occur in a sequence and each process is conditioned on its predecessors
expected value of X
average outcome of X
denoted by E(X)
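E(X) is each outcome weighted by its probability; a one-line sketch with a made-up distribution:

```python
# hypothetical distribution of X = number of books a student buys
distribution = {0: 0.20, 1: 0.55, 2: 0.25}

# E(X) = sum over outcomes of x * P(X = x)
expected_value = sum(x * p for x, p in distribution.items())
```

Here E(X) = 0(0.20) + 1(0.55) + 2(0.25) = 1.05, the average outcome over many repetitions.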
deductive
reasoning from general theory down to specific hypotheses and observations
inductive
reasoning from specific observations and experience up to general theory
wheel of science
theory -> (deduction) -> hypotheses -> observations -> (induction) -> empirical generalizations -> back to theory
measurement
downward part of wheel of science
conceptualization vs operationalize
“lack of money” vs “lack of opportunity” are two conceptualizations of poverty
“do you have enough money to feed your family?” operationalizes the conceptualization of poverty
different conceptualizations often require different operationalizations
quantitative vs qualitative
a little about a lot of people vs a lot about a few people
administrative data
growing source
digital data that is collected in the process of administering other social goals
everything from information attached to social health number to credit card number
hard to make generalizations beyond the population
eg using database dealing with health cards is hard to generalize to all of Canada because people who didn’t use health cards would be completely ignored
survey research
designed to ask research questions
responses distilled into data that we work with
measurement necessitates some simplification because we need to compare across different groups of people
population vs sample
group we want to make a generalization about vs the group we actually have information about
census
rare kind of sample that covers an entire population; can be very expensive
basically the opposite of an anecdote
snowball sampling is often used for?
vulnerable communities like illegal immigrant workers in America
experiments
typically create artificial situations that are designed to isolate variables of interest and their effects
R
increasingly popular open-source statistical software
accessible because it’s free
SPSS
popular for undergrads and certain fields
designed for doing experiment research
Stata
popular among sociologists and economists
stacked dot plot
higher bars represent areas where there are more observations
makes it easier to judge the centre and shape of the distribution
questionnaire
contains actual phrasing of question and options for the responses
codebook
summarizes the data set; tells us what the variable names mean, like a dictionary
CANSIM
micro data, summary statistics (overall estimates)
ODESI
contains confidential information
we can use the public-use parts of ODESI, in which everything is anonymized and variables have been “tweaked” a little in order to make sure that information can’t be traced back to respondents
Research Data Centres
stuff you can’t find on PUMFs
measures of central tendency
mode, median, mean
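All three measures of central tendency are in Python's `statistics` module; this sketch (ages invented, echoing the "53 is the mode" example below) also shows the outlier pulling the mean above the median:

```python
import statistics

ages = [21, 22, 22, 23, 24, 24, 24, 25, 53]  # 53 is an outlier

mode = statistics.mode(ages)      # most common value
median = statistics.median(ages)  # middle observation (n is odd here)
mean = statistics.mean(ages)      # add up and divide by n
```

The mean (about 26.4) sits well above the median (24) because it is susceptible to the outlier, while the median and mode are not.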
pros and cons of mode
+can be used for all types of measures, relatively quick/simple measure
-doesn’t use much information; most common doesn’t necessarily mean typical (eg: if the modal age is 53, plenty of people are still other ages)
how to calculate median
odd: middle observation
even: average of two middle observations
pros and cons of median
+captures the actual centre of the distribution, less susceptible to outliers
-computationally awkward, cannot be estimated for unordered categorical variables
percentiles
general concept, closely related to median (median = 50th percentile)
there are 100 percentiles in total
interquartile range
the interval between the 25th and 75th percentiles
90th percentile
90% of observations are lower, 10% are higher
25th percentile
25% of observations are lower, 75% are higher
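`statistics.quantiles` with n=100 gives the 99 cut points between percentiles; a sketch with made-up scores of 1 through 100:

```python
import statistics

scores = list(range(1, 101))  # made-up scores 1..100

# 99 cut points splitting the data into 100 groups;
# the 25th, 50th, and 90th percentiles are entries 24, 49, and 89
pct = statistics.quantiles(scores, n=100)
p25, p50, p90 = pct[24], pct[49], pct[89]
```

As the cards describe, about 25% of scores fall below p25 and about 90% fall below p90; p50 equals the median.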
mean cons
more susceptible to outliers
measures of dispersion
aim to give us a sense of the breadth of the distribution
range
interval between smallest and largest values
pros and cons for range
+good for quick check
-only takes into account two observations, very sensitive, only useful for numeric variables
pros and cons of SD
+variance and SD take into account all scores, accurately describes “typical” deviation, easily interpreted
-sensitive to outliers, can only be calculated for numerical variables
proportions
raw frequencies make comparisons between groups of different sizes difficult, so proportions standardize frequencies by the number of cases
frequency cons
working with them is tough when trying to conceptualize comparisons
-this can be fixed by changing them into percentages
cumulative percentage
the percentage in the category + the category under it
only works for ordinal variables
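A sketch of the running-total idea for an ordinal variable (the Likert frequencies are invented); each category's cumulative percentage is its own percentage plus everything below it:

```python
# frequencies for an ordinal variable (Likert-style), listed in order
freq = {"disagree": 10, "neutral": 15, "agree": 25}
total = sum(freq.values())

cumulative = {}
running = 0
for category, count in freq.items():  # dicts preserve insertion order
    running += count
    # percentage in this category plus all categories under it
    cumulative[category] = 100 * running / total
```

The ordering matters, which is why this only makes sense for ordinal (not nominal) variables; the last category always reaches 100%.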
random process
a process where we know what outcomes can happen, but we don’t know which particular outcome will happen
rules for probability distribution
- outcomes listed are disjoint
- each probability must be between 0 and 1
- all probabilities must total 1
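The two numeric rules above can be written as a small checker (the function name is made up; the fair-die example is standard):

```python
def is_valid_distribution(probs):
    """Check the numeric rules for a probability distribution:
    each probability is between 0 and 1, and they total 1."""
    return (all(0 <= p <= 1 for p in probs)
            and abs(sum(probs) - 1) < 1e-9)

fair_die = [1 / 6] * 6  # six disjoint outcomes, equal probability
```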
algebra of probability
if an event is made up of disjoint outcomes whose probabilities we know, we can find the probability of the event by adding them
continuous distribution
another way of summarizing information
-more advanced mathematical concept than bar graph
the line is called probability density function
-describes information in graph
-has interesting properties
-can be used to infer probability of any outcome
-never loops back (the curve only moves from left to right)
-never dips below zero
-the area under the curve adds up to 1
the area equals P
the area under the curve gives the probability of people falling in that range
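A sketch of "area under the curve = probability", using a made-up uniform density on [0, 2] (height 0.5, so the total area is 1) and a simple Riemann-sum approximation:

```python
# uniform density on [0, 2]: f(x) = 0.5 there, 0 elsewhere
def density(x):
    return 0.5 if 0 <= x <= 2 else 0.0

def area(f, lo, hi, steps=10_000):
    """Midpoint Riemann-sum approximation of the area under f on [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

total_area = area(density, 0, 2)     # whole curve: should be 1
p_between = area(density, 0.5, 1.0)  # P(0.5 < X < 1.0) = 0.25
```

The whole curve encloses area 1 (the distribution rule), and the area over any sub-range gives the probability of an outcome falling in that range.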