Final Flashcards

1
Q

data

A

observations collected from field notes, surveys, experiments, etc

2
Q

what is the backbone of statistical investigation

A

data

3
Q

statistics

A

the study of how best to collect data, analyze it, and draw conclusions from it

4
Q

classic challenge in statistics

A

evaluating the efficacy of medical treatment

5
Q

summary statistic

A

a single number summarizing a large amount of data

6
Q

variables

A

a characteristic measured or recorded for each case

7
Q

data matrix

A

a way to organize data in which each row is a case and each column is a variable

8
Q

numerical variable

A

takes a wide range of numerical values, and it is sensible to add, subtract, or take averages of those values

9
Q

types of numerical variables

A

discrete, continuous

10
Q

discrete

A

can only take numerical values with jumps (eg number of siblings)

11
Q

continuous

A

can take numerical values without jumps (eg height)

12
Q

categorical

A

responses are categories

13
Q

types of categorical

A

ordinal, nominal

14
Q

ordinal variable

A

categorical but have a natural ordering (eg Likert scale)

15
Q

nominal variable

A

categorical and no natural ordering (eg favourite ice cream)

16
Q

negative, positive, independent association

A

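positive: as one variable increases, the other tends to increase
negative: as one variable increases, the other tends to decrease
independent: the two variables show no association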

17
Q

population vs sample

A

group we want to make a generalization about vs the group we actually have information about

18
Q

anecdotal evidence

A

data collected in haphazard fashion from individual cases, usually composed of unusual cases that we recall based on their striking characteristics

19
Q

random sampling

A

selecting cases at random so that we avoid introducing bias into the sample

20
Q

simple random sampling

A

the most basic random sample, equivalent to drawing names in a raffle; every case in the population has an equal chance of being included

21
Q

non response bias

A

bias that can arise even in a random sample when the response rate is low, because the people who respond may differ from those who don’t

22
Q

convenience sampling

A

individuals who are easily accessible are more likely to be included in the sample

23
Q

independent variable

A

explanatory variables

24
Q

response variables

A

dependent variable

25
Q

observational studies

A

collection of data in a way that doesn’t directly interfere with how the data arise
eg: collecting surveys, ethnography, etc

26
Q

randomized experiment

A

when individuals are randomly assigned to a group

27
Q

confounding variable

A

variable correlated with both the explanatory and response variables
aka: lurking variable, confounding factor, confounder

28
Q

prospective study

A

identifies individuals and collects information as events unfold
eg: medical researchers may identify and follow a group of similar individuals over many years

29
Q

retrospective study

A

collects data after events have taken place

eg: researchers may review past events in medical records

30
Q

simple random sampling

A

every case in population has equal chance of being included

31
Q

stratified sampling

A

divide-and-conquer; population is divided into strata (which are chosen so similar cases are grouped together), then a second sampling method (usually simple random) is employed within each stratum
eg: who in Canada goes to theme parks? intentionally oversampling PEI because if we didn’t, most of the respondents would probably be from other provinces like Ontario, and PEI might be skipped entirely

32
Q

when is stratified sampling useful?

A

when cases in each stratum are very similar with respect to the outcome of interest

33
Q

cluster sampling

A

break up population into clusters, then sample a fixed number of clusters and include all observations from each of the selected clusters
eg: surveying Saskatchewan children by sampling Saskatchewan schools randomly, then including every kid from the selected schools

34
Q

multistage sampling

A

like cluster sample, but collect random sample within each selected cluster

35
Q

pros and cons of multistage sampling

A

+cluster/multistage can be more economical than alternative sampling techniques
+most useful when there’s a lot of case-to-case variability within cluster but clusters themselves don’t look very different from one another
eg: neighbourhoods when they are very diverse
-more advanced analysis techniques are typically required

36
Q

scatter plots and their strength

A

provides case by case view of two numerical variables

+helpful in quickly spotting associations relating variables, trends, etc

37
Q

dot plots

A

provides the most basic of displays for one variable; like a one-variable scatter plot

38
Q

mean

A

common way to measure centre of distribution of data

  • add up and divide by n
  • often labeled as x-bar
39
Q

μ

A

population mean

40
Q

μx

A

the subscript shows which variable the population mean refers to (here, x)

41
Q

histograms

A

doesn’t show value of each observation
each value belongs to a bin
binned counts are plotted as bars on histogram
provide view of data density

42
Q

pros and cons of histogram

A

convenient for describing shape of data distribution

43
Q

skewness

A
right skew (longer right tail)
left skew (longer left tail) 
symmetric (equal tails)
44
Q

one, two, three prominent peaks

A

unimodal, bimodal, multimodal

45
Q

two measures of variability

A

variance, standard deviation

46
Q

variance

A

the average squared deviation

σ², the standard deviation squared

47
Q

standard deviation

A

σ

describes how far away the typical observation is from the mean
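
The deviation, variance, and standard deviation cards fit together as one short calculation. The sketch below is only an illustration: the scores are made up, and it uses the population form (divide by n) to match "average squared deviation". Sample formulas usually divide by n - 1 instead.

```python
import math


def population_variance(values):
    """Average squared deviation from the mean (divides by n)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def population_sd(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))


# Hypothetical scores, chosen so the arithmetic is easy to check by hand.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
variance = population_variance(scores)
sd = population_sd(scores)
print(variance, sd)
```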

48
Q

deviation

A

distance of an observation from its mean

49
Q

box plots

A

•summarizes data set using five statistics while also plotting unusual observations
•step 1: draw dark line denoting the median, which splits data in half
•step 2: draw rectangle to represent the middle 50% of the data
⁃aka interquartile range aka IQR
⁃measure of variability in data
⁃the more variable the data, the larger the standard deviation and IQR
⁃two boundaries are called first quartile and third quartile
⁃Q1 and Q3 respectively
⁃IQR = Q3 - Q1
•step 3: whiskers attempt to capture data outside of the box
⁃reach is never allowed to be more than 1.5 x IQR beyond the box
•step 4: any observations beyond the whiskers are identified as outliers
•robust estimates: extreme observations have little effect on value
⁃median and IQR are robust estimates
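
The box plot steps above can be sketched in a few lines. This is an illustration under assumptions: the data are made up, and quartiles are computed with Python's `statistics.quantiles` using the "inclusive" method. Textbooks and software use slightly different quartile conventions, so Q1 and Q3 may differ a little elsewhere.

```python
import statistics


def box_plot_summary(values):
    """Return the pieces needed to draw a box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Whiskers may reach at most 1.5 x IQR beyond the box.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


# Hypothetical data with one extreme observation.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

The extreme value 30 is flagged as an outlier, while the median and IQR barely notice it, which is what "robust" means here.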

50
Q

mapping data

A

colours are used to show higher and lower values of a variable
not helpful for getting precise values
helpful for seeing geographic trends and generating interesting research questions

51
Q

contingency tables

A

summarizes data for two categorical variables

-each value in table represents number of times a particular combination of variable outcomes occurred

52
Q

row totals

A

total counts across each row

53
Q

column totals

A

total counts down each column

54
Q

relative frequency table

A

replace counts with percentages or proportions

55
Q

row proportions

A

computed as counts divided by row totals
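
Row totals, column totals, and row proportions can all be computed from one small contingency table. The sketch below is illustrative only: the counts and the category names ("treatment", "control", "improved") are invented.

```python
# Hypothetical contingency table: rows are groups, columns are outcomes.
counts = {
    "treatment": {"improved": 30, "not improved": 10},
    "control": {"improved": 20, "not improved": 40},
}

# Total counts across each row.
row_totals = {row: sum(cells.values()) for row, cells in counts.items()}

# Total counts down each column.
column_totals = {}
for cells in counts.values():
    for column, n in cells.items():
        column_totals[column] = column_totals.get(column, 0) + n

# Row proportions: each count divided by its row total.
row_proportions = {
    row: {column: n / row_totals[row] for column, n in cells.items()}
    for row, cells in counts.items()
}
print(row_totals, column_totals, row_proportions)
```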

56
Q

segmented bar plots

A

graphical display of contingency table information

57
Q

mosaic plot

A

graphical display of contingency table information

-use areas to represent number of observations

58
Q

probability

A

proportion of times the outcome would occur if we observed the random process an infinite number of times

59
Q

law of large numbers

A

as more observations are collected, the proportion p̂_n of occurrences with a particular outcome converges to the probability p of that outcome
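
A quick simulation makes the law of large numbers concrete. This is a sketch, assuming a fair six-sided die and a fixed random seed (both are choices made for illustration): the proportion of ones should drift toward 1/6 as the number of rolls grows.

```python
import random


def proportion_of_ones(n_rolls, seed=0):
    """Roll a simulated fair die n_rolls times; return the share of ones."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 1)
    return hits / n_rolls


# The proportion wobbles for small n and settles near 1/6 for large n.
for n in (10, 1_000, 100_000):
    print(n, proportion_of_ones(n))
```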

60
Q

disjoint outcomes

A

aka mutually exclusive

when two outcomes cannot happen at the same time

61
Q

probability distributions

A

table of all disjoint outcomes and their associated probabilities

62
Q

complement of event

A

all outcomes not in the event

63
Q

sample space

A

set of all possible outcomes

64
Q

independence

A

when knowing the outcome of one process provides no useful information about the outcome of the other

65
Q

marginal probability

A

a probability based on a single variable

66
Q

joint probability

A

probability of outcomes is based on two or more variables

67
Q

defining conditional probability

A

two parts: outcome of interest and condition

68
Q

condition

A

information we know to be true

69
Q

conditional probability

A

the probability of the outcome of interest A given the condition B, written P(A|B)
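
Conditional probability is usually computed as P(A|B) = P(A and B) / P(B). A minimal sketch with made-up counts: out of 100 hypothetical students, 30 own a bike (the condition B), and 12 of those also commute by bike (A and B).

```python
def conditional_probability(count_a_and_b, count_b):
    """P(A | B) from counts: the share of the B cases in which A also occurs."""
    return count_a_and_b / count_b


# Hypothetical counts, not real data.
students = 100
own_bike = 30            # condition B
own_bike_and_commute = 12  # A and B together

p_commute_given_bike = conditional_probability(own_bike_and_commute, own_bike)

# Same answer from the joint and marginal probabilities.
joint = own_bike_and_commute / students
marginal = own_bike / students
print(p_commute_given_bike, joint / marginal)
```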

70
Q

tree diagrams

A

organize outcomes and probabilities around the structure of data

71
Q

when are tree diagrams most useful?

A

when two or more processes occur in a sequence and each process is conditioned on its predecessors

72
Q

expected value of X

A

average outcome of X

denoted by E(X)
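
For a discrete variable, E(X) is each outcome multiplied by its probability, summed over all outcomes. The sketch below assumes a fair six-sided die as the example and uses exact fractions so the check that probabilities total 1 is not disturbed by rounding.

```python
from fractions import Fraction


def is_valid_distribution(dist):
    """Each probability lies between 0 and 1, and together they total 1.

    Dictionary keys are distinct, so the listed outcomes are disjoint.
    """
    probs = list(dist.values())
    return all(0 <= p <= 1 for p in probs) and sum(probs) == 1


def expected_value(dist):
    """E(X): sum of outcome times probability."""
    return sum(x * p for x, p in dist.items())


# Assumed example: a fair die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(is_valid_distribution(die), expected_value(die))
```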

73
Q

deductive

A

reasoning from theory down to specific hypotheses and observations

74
Q

inductive

A

experience and reasoning: moving from observations up to generalizations and theory

75
Q

wheel of science

A

a cycle with four stops: theory, hypothesis, observations, empirical generalizations
deduction: theory -> hypothesis -> observations
induction: observations -> empirical generalizations -> theory

76
Q

measurement

A

downward part of wheel of science

77
Q

conceptualization vs operationalize

A

“lack of money” vs “lack of opportunity” are two conceptualizations of poverty
“do you have enough money to feed your family?” operationalizes the conceptualization of poverty
different conceptualizations often require different operationalizations

78
Q

quantitative vs qualitative

A

a little about a lot of people vs a lot about a few people

79
Q

administrative data

A

growing source
digital data that is collected in the process of administering other social goals
everything from information attached to social health number to credit card number
hard to make generalizations beyond the population
eg using database dealing with health cards is hard to generalize to all of Canada because people who didn’t use health cards would be completely ignored

80
Q

survey research

A

designed to ask research questions
responses distilled into data that we work with
measurement necessitates some simplification because we need to compare across different groups of people

81
Q

population vs sample

A

group we want to make a generalization about vs the group we actually have information about

82
Q

census

A

rare kind of sample that covers an entire population, can be very expensive
basically the opposite of an anecdote

83
Q

what is snowball sampling often used for?

A

vulnerable communities like illegal immigrant workers in America

84
Q

complex random sampling

A

sample is still random, but we tweak things so that some cases are less/more likely to be selected

85
Q

three sources of bias

A

non-response
voluntary response
convenience sampling

86
Q

experiments

A

typically create artificial situations that are designed to isolate variables of interest and their effects

87
Q

pros and cons of observational studies?

A

+make meaningful connection

-hard to draw conclusions about causation

88
Q

R

A

increasingly popular open source statistical software

accessible because it’s free

89
Q

SPSS

A

popular for undergrads and certain fields

designed for doing experimental research

90
Q

Stata

A

popular among sociologists and economists

91
Q

stacked dot plot

A

higher stacks represent areas where there are more observations
makes it easier to judge the centre and shape of the distribution

92
Q

shape of distribution is determined by….

A

modality (how many humps?)
skewness (one side of distribution looks very different from other side)
outliers (one or two observations are unusual)

93
Q

questionnaire

A

contains actual phrasing of question and options for the responses

94
Q

codebook

A

summarize the data set; tells us what the dataset names mean like dictionary

95
Q

CANSIM

A

micro data, summary statistics (overall estimates)

96
Q

ODESI

A

contains confidential information
we can use the public-use parts of ODESI, in which everything is anonymized and variables have been “tweaked” a little in order to make sure that information can’t be traced back to respondents

97
Q

RDC

A

Research Data Centre; holds data you can’t find in public use microdata files (PUMFs)

98
Q

measures of central tendency

A

mode, median, mean; ie where does the data tend to accumulate?

99
Q

pros and cons of mode

A

+can be used for all types of measures, relatively quick/simple measure
-doesn’t use much information, most common doesn’t necessarily mean typical (eg: 53 years old is the mode, but there are plenty of people who are other ages)

100
Q

how to calculate median

A

odd: middle observation
even: average of two middle observations
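
The odd/even rule translates directly into code. A minimal sketch (the function name and example numbers are illustrative):

```python
def median(values):
    """Middle observation if n is odd, average of the two middle ones if even."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


print(median([3, 1, 2]))           # odd number of observations
print(median([4, 1, 3, 2]))        # even number of observations
print(median([1, 2, 3, 4, 100]))   # one extreme value barely moves the median
```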

101
Q

pros and cons of median

A

+captures actual centre of distribution, less susceptible to outliers
-computationally awkward, cannot be estimated for unordered categorical variables

102
Q

percentiles

A

general concept, closely related to median (median = 50th percentile)
percentiles divide the ordered data into 100 equal parts

103
Q

interquartile range

A

the distance between the 25th and 75th percentiles

104
Q

90th percentile

A

90% of observations are lower, 10% are higher

105
Q

25th percentile

A

25% of observation are lower, 75% are higher

106
Q

mean cons

A

more susceptible to outliers

107
Q

measures of dispersion

A

aim to give us a sense of the breadth of a distribution

e.g. compare temperature in Saskatoon vs Vancouver

108
Q

range

A

interval between smallest and largest values

109
Q

pros and cons of range

A

+good for quick check

-only takes into account two observations, very sensitive, only useful for numeric variables

110
Q

pros and cons of standard deviation

A

+variance and SD take into account all scores, accurately describes “typical” deviation, easily interpreted
-sensitive to outliers, can only be calculated for numerical variables

111
Q

proportions

A

frequencies are convoluted, make comparisons difficult, so proportions standardize frequency by number of cases

112
Q

frequency cons

A

working with them is tough when trying to conceptualize comparisons
-this can be fixed by changing them into percentages

113
Q

cumulative percentage

A

the percentage in a category plus the percentages in all the categories below it

only works for variables with a natural ordering (ordinal or numerical)

114
Q

random process

A

a process where we know what outcomes can happen, but we don’t know which particular outcome will happen

115
Q

rules for probability distribution

A
  1. outcomes listed are disjoint
  2. each probability must be between 0 and 1
  3. all probabilities must total 1
116
Q

algebra of possibility

A

if we know the probabilities of the component outcomes, we can work out the probability of events that combine them

117
Q

continuous distribution

A

another way of summarizing information
-more advanced mathematical concept than bar graph
the line is called probability density function
-describes information in graph
-has interesting properties
-can be used to find the probability of any range of outcomes
-never loops back (line only moves from left to right)
-never negative (its height can exceed 1 when values are tightly concentrated)
-the area under the curve adds up to 1

118
Q

area equals p

A

the area under the curve gives the probability of people falling in that range

119
Q

frequency table (and a disadvantage)

A

lists all the qualities variables can take on and how many people answered to each quality
-impractical for continuous variables because data gets too unwieldy

120
Q

pie charts

A
they suck
don't use pie charts
they're misleading
only really great for visualness and public information 
only work for things that sum to 100
121
Q

bar charts

A

display simple information well
can chart frequencies and proportions
information doesn’t need to sum to 100

122
Q

law of large numbers

A

as more observations of a random process are collected, the proportion of occurrences with a particular outcome converges to the probability of that outcome

123
Q

normal distribution

A

unimodal, symmetric, bell shaped curve

many variables are nearly normal, but none are exactly normal

124
Q

what are normal distributions defined by?

A

the mean (where they sit on the number line) and SD (how spread out or peaked they are)

125
Q

z scores

A

how many standard deviations does x fall from the mean?

every z score corresponds to a specific percentile
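
A z score is computed as z = (x - mean) / SD, and under a normal model it maps to a percentile. The sketch below uses invented numbers (a score of 1800 on a test with mean 1500 and SD 300) and Python's `statistics.NormalDist` for the normal curve.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Number of standard deviations that x falls from the mean."""
    return (x - mean) / sd


def percentile_from_z(z):
    """Share of a normal distribution lying below z."""
    return NormalDist().cdf(z)


# Hypothetical test score, mean, and SD.
z = z_score(1800, mean=1500, sd=300)
percentile = percentile_from_z(z)
print(z, percentile)
```

Here the score sits one SD above the mean, which is roughly the 84th percentile.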

126
Q

inferential statistics

A

saying things about society as a whole without futile attempt to examine the whole society

127
Q

parameters

A

hypothetical number that exists somewhere

any characteristic of a population can be defined by a parameter

128
Q

sampling error

A

the difference between estimate and actual parameter

unless we survey every case in the population, we will always have sampling error

129
Q

sampling distribution

A

hypothetical distribution of an estimate if we could sample our population an infinite number of times

130
Q

standard error

A

typical or expected error (standard deviation) based on sampling distribution
aka standard deviation of sampling distribution
-no obvious way to estimate SE from a single sample, but theory helps: SE = σ/√n, with the sample standard deviation s standing in for σ

131
Q

central limit theorem

A

if a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model
as n becomes large, the sampling distribution approaches normality and it has less and less error in it
standard error will be bigger if the population has a larger standard deviation
we can decrease our standard error if we make a bigger sample
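
The central limit theorem claims can be checked by simulation. The sketch below is an illustration under assumptions: the population is a right-skewed exponential distribution with mean 1 and SD 1 (an arbitrary choice), and the seed and sample counts are fixed for repeatability. The spread of many sample means should land close to σ/√n.

```python
import math
import random
import statistics


def simulate_sample_means(n, n_samples=5000, seed=0):
    """Draw n_samples samples of size n from a skewed population; return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]


n = 50
means = simulate_sample_means(n)
observed_se = statistics.stdev(means)   # spread of the sampling distribution
theoretical_se = 1.0 / math.sqrt(n)     # sigma / sqrt(n), with sigma = 1
print(observed_se, theoretical_se)
```

Raising `n` shrinks both numbers, which is the "bigger sample, smaller standard error" point from the card.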

132
Q

recipe for statistical inference

A

estimate
standard error
desired confidence level

133
Q

confidence intervals

A

a plausible range of values for the population parameter
a 95% confidence level means that about 95% of intervals built this way would capture the population mean
width trades off with confidence: higher confidence requires a wider interval

134
Q

narrowing intervals

A

we can narrow confidence interval without reducing confidence by reducing our standard error
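
The recipe of estimate, standard error, and confidence level combines as estimate ± z* × SE. The sketch below assumes a normal model and uses made-up numbers (sample mean 50, sample SD 10). It also shows that a larger sample narrows the interval at the same confidence level.

```python
import math
from statistics import NormalDist


def confidence_interval(estimate, sd, n, confidence=0.95):
    """Normal-model interval: estimate plus or minus z* times the standard error."""
    se = sd / math.sqrt(n)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * se
    return estimate - margin, estimate + margin


# Hypothetical sample mean and SD; only the sample size changes.
small_sample = confidence_interval(50, sd=10, n=100)
large_sample = confidence_interval(50, sd=10, n=400)
print(small_sample, large_sample)
```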

135
Q

p values

A

probability of observing data at least as favourable to the alternative hypothesis as our current data, if the null is true
p values are controversial
the smaller the p value, the stronger the evidence against the null; it is not the probability that the null is true
isn’t a quantifier, only a probability

136
Q

hypothesis testing

A

comparing world we actually observe to what we think the world should be like
if our evidence looks nothing like the null, we can reject the null

137
Q

why null?

A

we don’t want to say how certain we are because we can never collect all the information, therefore there is always a possibility of one case out there proving us wrong. So we try to improve our chances that hypothesis is right. A type of process of elimination

138
Q

why double negatives?

A

because we accept the hypothesis conditionally, with some probability, but not absolute certainty

139
Q

alpha level

A

expresses same information as confidence level, except alpha level shows how unconfident you are. e.g. if confidence level is 95%, alpha level is 0.05

140
Q

single tail tests

A

how far away does x-bar distribution need to be? if we get a z-score of <1.29
when we test whether X-bar is greater than or less than population mean, but not both
only common in psychology

141
Q

why don’t we use single tail tests that often?

A

because there’s a way of framing single tail tests that makes it accidentally easier to reject the null, which makes positive research findings more likely and lowers the quality of the results

142
Q

hypothesis testing framework

A

(1) write the hypothesis in plain language, then in mathematical notation
(2) identify an appropriate point estimate of the parameter of interest (mean)
(3) verify conditions to ensure the standard error estimate is reasonable and the point estimate is nearly normal and unbiased
(4) compute standard error. draw a picture depicting the distribution of the estimate under the idea that H0 is true
- shade areas representing the p-value
(5) using the picture, compute the test statistic i.e. Z-score and identify the p-value to evaluate hypothesis
(6) write conclusion in plain language
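
Steps (4) and (5) of the framework can be sketched as a two-sided z test for a mean. Everything numeric here is invented for illustration (H0: μ = 100, sample mean 103, sample SD 15, n = 100), and the code assumes the normal-model conditions in step (3) already hold.

```python
import math
from statistics import NormalDist


def two_sided_z_test(x_bar, mu_0, sd, n):
    """Return the Z-score and two-sided p-value for H0: mu = mu_0."""
    se = sd / math.sqrt(n)          # step 4: standard error
    z = (x_bar - mu_0) / se         # step 5: test statistic
    # Two-sided: count both tails beyond |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical study values.
z, p_value = two_sided_z_test(x_bar=103, mu_0=100, sd=15, n=100)
alpha = 0.05
reject_null = p_value < alpha
print(z, p_value, reject_null)
```

With these numbers the p-value is a little under 0.05, so the null would be rejected at the 0.05 alpha level. Step (6) would then restate that conclusion in plain language.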

143
Q

two tail tests

A

we distribute critical region, we don’t assume whether sampling distribution is above or below, just about whether it falls outside or inside
we have to have a larger z-score (in absolute value) to reject the null hypothesis

144
Q

type 1 vs type 2 error

A

type 1: falsely rejecting the null

type 2: failing to reject the null when the alternative is actually true

145
Q

writing null vs writing alternative

A

H0 = null hypothesis
-skeptical perspective or claim to be tested
-always write the null hypothesis as an equality
HA = alternative hypothesis
-alternative or new claim under consideration

146
Q

testing appropriateness of normal model

A

(1) overlay a normal curve on a simple histogram of the data

(2) examine normal probability plot

147
Q

bin size

A

adding more bins provides greater detail
when sample is large, smaller bins still work well
with smaller sample sizes, small bins are very volatile