Midterm Flashcards

1
Q

data

A

observations collected from field notes, surveys, experiments, etc

2
Q

what is the backbone of statistical investigation

A

data

3
Q

statistics

A

the study of how best to collect data, analyze it, and draw conclusions from it

4
Q

classic challenge in statistics

A

evaluating the efficacy of medical treatment

5
Q

summary statistic

A

a single number summarizing a large amount of data

6
Q

variables

A

a characteristic measured or recorded for each case

7
Q

data matrix

A

a way to organize data: each row is a case (observation) and each column is a variable

8
Q

numerical variable

A

can take a wide range of numerical values, and it is sensible to add, subtract, or take averages of those values

9
Q

types of numerical variables

A

discrete, continuous

10
Q

discrete

A

can only take numerical values with jumps (eg number of siblings)

11
Q

continuous

A

can take numerical values without jumps (eg height)

12
Q

categorical

A

responses are categories

13
Q

types of categorical

A

ordinal, nominal

14
Q

ordinal variable

A

categorical but have a natural ordering (eg Likert scale)

15
Q

nominal variable

A

categorical with no natural ordering (eg favourite ice cream)

16
Q

negative, positive, independent association

A

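negative: as one variable increases, the other tends to decrease
positive: as one variable increases, the other tends to increase
independent: the two variables show no evident relationship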

17
Q

population vs sample

A

the group we want to make a generalization about vs the group we actually have information about

18
Q

anecdotal evidence

A

data collected in haphazard fashion from individual cases, usually composed of unusual cases that we recall based on their striking characteristics

19
Q

random sampling

A

picking cases at random, which helps avoid introducing bias into the sample

20
Q

simple random sample

A

most basic random sample, like drawing names in a raffle; every case in the population has an equal chance of being included
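
a minimal sketch of the raffle idea (not from the course notes): the population of 100 case IDs and the sample size of 10 are made up for illustration.

```python
import random

# hypothetical population of 100 case IDs (made up for illustration)
population = list(range(1, 101))

rng = random.Random(42)  # fixed seed so the draw is repeatable
# sample() draws without replacement; every case is equally likely to be picked
sample = rng.sample(population, k=10)
print(sample)
```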

21
Q

non response bias

A

bias that arises when many of the sampled individuals don’t respond; a low response rate can make even a random sample unrepresentative

22
Q

convenience sample

A

individuals who are easily accessible are more likely to be included in the sample

23
Q

explanatory variables

A

independent variable

24
Q

response variables

A

dependent variable

25
Q

observational studies

A

collection of data in a way that doesn’t directly interfere with how the data arises
eg: collecting surveys, ethnography, etc

26
Q

randomized experiment

A

when individuals are randomly assigned to a group

27
Q

confounding variable

A

variable correlated with both the explanatory and response variables
aka: lurking variable, confounding factor, confounder

28
Q

prospective study

A

identifies individuals and collects information as events unfold
eg: medical researchers may identify and follow a group of similar individuals over many years

29
Q

retrospective study

A

collects data after events have taken place

eg: researchers may review past events in medical records

30
Q

simple random sampling

A

every case in population has equal chance of being included

31
Q

stratified sampling

A

divide-and-conquer; population is divided into strata (which are chosen so similar cases are grouped together), then a second sampling method (usually simple random) is employed within each stratum
eg: who in Canada goes to theme parks? intentionally oversampling PEI because if we didn’t, most of the respondents would probably be from other provinces like Ontario, and PEI might be skipped entirely
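
a rough sketch of the two steps described above, assuming made-up case counts for two provinces; the function name stratified_sample and the per-stratum size of 20 are hypothetical.

```python
import random
from collections import defaultdict

# hypothetical cases: (person_id, province); the counts are made up
cases = [(i, "ON") for i in range(900)] + [(i, "PEI") for i in range(900, 1000)]

def stratified_sample(cases, per_stratum, seed=0):
    """Group cases into strata by province, then simple-random-sample within each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person_id, province in cases:
        strata[province].append(person_id)
    return {prov: rng.sample(ids, per_stratum) for prov, ids in strata.items()}

sample = stratified_sample(cases, per_stratum=20)
print({prov: len(ids) for prov, ids in sample.items()})  # 20 from each province
```

taking the same number from each stratum oversamples the small province on purpose, so it cannot be skipped.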

32
Q

when is stratified sampling useful?

A

when cases in each stratum are very similar with respect to the outcome of interest

33
Q

cluster sample

A

break up population into clusters, then sample a fixed number of clusters and include all observations from each of the sampled clusters
eg: surveying Saskatchewan children by sampling Saskatchewan schools randomly, then surveying every kid in the selected schools (simple random sampling kids within the selected schools would make it a multistage sample)

34
Q

multistage sample

A

like cluster sample, but collect random sample within each selected cluster
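
a small sketch contrasting the two designs, assuming ten made-up schools of 30 kids each; the function names are hypothetical.

```python
import random

# hypothetical schools (clusters), each with a list of student IDs
schools = {f"school_{s}": [f"s{s}_kid{k}" for k in range(30)] for s in range(10)}

def cluster_sample(clusters, n_clusters, rng):
    """Pick clusters at random, then keep every case in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [kid for name in chosen for kid in clusters[name]]

def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Pick clusters at random, then take a simple random sample within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [kid for name in chosen for kid in rng.sample(clusters[name], n_per_cluster)]

rng = random.Random(1)
print(len(cluster_sample(schools, 3, rng)))        # 3 schools x 30 kids = 90
print(len(multistage_sample(schools, 3, 5, rng)))  # 3 schools x 5 kids = 15
```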

35
Q

pros and cons of cluster and multistage sample

A

+cluster/multistage can be more economical than alternative sampling techniques
+most useful when there’s a lot of case-to-case variability within cluster but clusters themselves don’t look very different from one another
eg: neighbourhoods when they are very diverse
-more advanced analysis techniques are typically required

36
Q

scatter plots (and their strength)

A

provides case by case view of two numerical variables

+helpful in quickly spotting associations between variables, trends, etc

37
Q

dot plots

A

provides the most basic of displays for one variable; like a one-variable scatter plot

38
Q

mean

A

common way to measure centre of distribution of data

  • add up and divide by n
  • often labeled as x-bar
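
in symbols, "add up and divide by n" can be written as below; the labels x_1, ..., x_n for the individual observations are notation assumed here, not taken from the notes.

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i
```
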
39
Q

μ

A

population mean

40
Q

μx

A

the subscript shows which variable the population mean refers to (here, x)

41
Q

histograms

A

doesn’t show value of each observation
each value belongs to a bin
binned counts are plotted as bars on histogram
provide view of data density

42
Q

pros and cons of histogram

A

+convenient for describing shape of data distribution

-doesn’t show the exact mode or individual values, because observations are grouped into bins

43
Q

skewness

A
right skew (longer right tail)
left skew (longer left tail) 
symmetric (equal tails)
44
Q

one, two, three prominent peaks

A

unimodal, bimodal, multimodal

45
Q

two measures of variability

A

variance, standard deviation

46
Q

variance

A

the average squared deviation

σ², the standard deviation squared

47
Q

standard deviation

A

σ

describes how far away the typical observation is from the mean
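
a worked sketch tying deviation, variance and standard deviation together, using made-up observations. it uses the population form (divide by the number of observations) to match the σ notation; the sample version divides by n - 1 instead.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations

mean = sum(data) / len(data)
deviations = [x - mean for x in data]                   # each observation minus the mean
variance = sum(d ** 2 for d in deviations) / len(data)  # average squared deviation
sd = math.sqrt(variance)                                # square root of the variance

print(mean, variance, sd)  # 5.0 4.0 2.0
```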

48
Q

deviation

A

distance of an observation from its mean

49
Q

box plots

A

• summarizes a data set using five statistics while also plotting unusual observations
• step 1: draw a dark line denoting the median, which splits the data in half
• step 2: draw a rectangle to represent the middle 50% of the data
⁃ the length of the box is the interquartile range (IQR)
⁃ measure of variability in data
⁃ the more variable the data, the larger the standard deviation and IQR
⁃ the two boundaries of the box are the first quartile (Q1) and third quartile (Q3)
⁃ IQR = Q3 - Q1
• step 3: whiskers attempt to capture data outside of the box
⁃ their reach is never allowed to be more than 1.5 × IQR beyond the box
• step 4: any observations beyond the whiskers are identified as outliers
• robust estimates: extreme observations have little effect on their value
⁃ median and IQR are robust estimates
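
the steps above can be sketched as follows, with made-up observations. quartile conventions differ between textbooks and software, so the exact Q1 and Q3 may not match a hand calculation; the names lower_fence and upper_fence are just labels used here.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up observations

q1, median, q3 = statistics.quantiles(data, n=4)  # quartile conventions vary between tools
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# whiskers stop at the most extreme observations still inside the fences
inside = [x for x in data if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(median, iqr, whiskers, outliers)
```

here 30 falls beyond the upper whisker, so it is plotted as an outlier, while the median and IQR barely move because of it.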

50
Q

Mapping Data

A

colours are used to show higher and lower values of a variable
not helpful for getting precise values
helpful for seeing geographic trends and generating interesting research questions

51
Q

contingency tables

A

summarizes data for two categorical variables

-each value in table represents number of times a particular combination of variable outcomes occurred

52
Q

row totals

A

total counts across each row

53
Q

column totals

A

total counts down each column

54
Q

relative frequency table

A

replace counts with percentages or proportions

55
Q

row proportions

A

computed as counts divided by row totals
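
a small sketch of row totals, column totals and row proportions for a contingency table; the variables and counts are made up for illustration.

```python
# made-up counts for two categorical variables: smoking status (rows) x exercises (columns)
table = {
    "smoker":     {"yes": 10, "no": 30},
    "non-smoker": {"yes": 45, "no": 15},
}

row_totals = {row: sum(cols.values()) for row, cols in table.items()}
column_totals = {col: sum(table[row][col] for row in table) for col in ["yes", "no"]}

# row proportion = cell count / row total, so each row sums to 1
row_props = {row: {col: n / row_totals[row] for col, n in cols.items()}
             for row, cols in table.items()}

print(row_totals)            # {'smoker': 40, 'non-smoker': 60}
print(column_totals)         # {'yes': 55, 'no': 45}
print(row_props["smoker"])   # {'yes': 0.25, 'no': 0.75}
```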

56
Q

segmented bar plot

A

graphical display of contingency table information

57
Q

mosaic plot

A

graphical display of contingency table information

-use areas to represent number of observations

58
Q

probability

A

proportion of times the outcome would occur if we observed the random process an infinite number of times

59
Q

law of large numbers

A

as more observations are collected, the proportion p̂_n of occurrences with a particular outcome converges to the probability p of that outcome
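
a quick simulation sketch, assuming a fair six-sided die as the random process (the die is an example chosen here, not from the notes).

```python
import random

rng = random.Random(0)  # fixed seed for repeatability

def proportion_of_ones(n_rolls):
    """Roll a fair six-sided die n_rolls times and return the share of 1s."""
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 1)
    return hits / n_rolls

for n in (10, 100, 10_000, 100_000):
    print(n, proportion_of_ones(n))  # tends to settle near 1/6 as n grows
```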

60
Q

disjoint outcomes

A

aka mutually exclusive

when two outcomes cannot happen at the same time

61
Q

probability distributions

A

table of all disjoint outcomes and their associated probabilities

62
Q

complement of an event

A

all outcomes not in the event

63
Q

sample space

A

set of all possible outcomes

64
Q

independence

A

when knowing the outcome of one process provides no useful information about the outcome of the other

65
Q

marginal probability

A

a probability based on a single variable

66
Q

joint probability

A

probability of outcomes is based on two or more variables

67
Q

defining conditional probability

A

two parts: outcome of interest and condition

68
Q

condition

A

information we know to be true

69
Q

conditional probability

A

the probability of the outcome of interest A given the condition B, written P(A | B)
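
the standard formula, added here as a reminder rather than taken from the notes; it only applies when P(B) > 0.

```latex
P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}
```
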

70
Q

tree diagrams

A

organize outcomes and probabilities around the structure of data

71
Q

when are tree diagrams most useful?

A

when two or more processes occur in a sequence and each process is conditioned on its predecessors
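
a sketch of how a two-step tree is evaluated, assuming made-up pass/fail probabilities: multiply along each branch to get the joint probability of that path.

```python
# made-up probabilities for a two-step process
p_first = {"pass": 0.8, "fail": 0.2}      # P(first outcome)
p_second_given_first = {                  # P(second outcome | first outcome)
    "pass": {"pass": 0.9, "fail": 0.1},
    "fail": {"pass": 0.4, "fail": 0.6},
}

# multiply along each branch: P(first and second) = P(first) * P(second | first)
branches = {
    (first, second): p_first[first] * p
    for first, seconds in p_second_given_first.items()
    for second, p in seconds.items()
}

for path, prob in branches.items():
    print(path, round(prob, 2))
print(round(sum(branches.values()), 2))  # the branch probabilities total 1.0
```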

72
Q

expected value of X

A

average outcome of X

denoted by E(X)
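
for a discrete X with possible values x_1, ..., x_k (notation assumed here), the average outcome weights each value by its probability:

```latex
E(X) = x_1 P(X = x_1) + \cdots + x_k P(X = x_k) = \sum_{i=1}^{k} x_i \, P(X = x_i)
```
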

73
Q

deductive

A

reasoning from a general theory down to specific hypotheses and observations

74
Q

inductive

A

experience and reasoning

75
Q

wheel of science

A

a cycle: theory → hypothesis → observations → empirical generalizations → theory
deduction (downward side): theory → hypothesis → observations
induction (upward side): observations → empirical generalizations → theory

76
Q

measurement

A

downward part of wheel of science

77
Q

conceptualization vs operationalize

A

“lack of money” vs “lack of opportunity” are two conceptualizations of poverty
“do you have enough money to feed your family?” operationalizes the conceptualization of poverty
different conceptualizations often require different operationalizations

78
Q

quantitative vs qualitative

A

a little about a lot of people vs a lot about a few people

79
Q

administrative data

A

growing source
digital data that is collected in the process of administering other social goals
everything from information attached to a social health number to a credit card number
hard to make generalizations beyond the population covered
eg a database of health card records is hard to generalize to all of Canada because people who didn’t use health cards would be completely ignored

80
Q

survey research

A

designed to ask research questions
responses distilled into data that we work with
measurement necessitates some simplification because we need to compare across different groups of people

81
Q

population vs sample

A

group we want to make a generalization about vs the group we actually have information about

82
Q

census

A

rare kind of sample that covers an entire population, can be very expensive
basically the opposite of an anecdote

83
Q

snowball sampling is often used for?

A

hard-to-reach or vulnerable communities, eg undocumented immigrant workers in America

84
Q

experiments

A

typically create artificial situations that are designed to isolate variables of interest and their effects

85
Q

R

A

increasingly popular open-source statistical software

accessible because it’s free

86
Q

SPSS

A

popular for undergrads and certain fields

designed for doing experimental research

87
Q

Stata

A

popular among sociologists and economists

88
Q

stacked dot plot

A

higher bars represent areas where there are more observations
makes it easier to judge the centre and shape of the distribution

89
Q

questionnaire

A

contains actual phrasing of question and options for the responses

90
Q

codebook

A

summarizes the data set; tells us what the variable names in the dataset mean, like a dictionary

91
Q

CANSIM

A

micro data, summary statistics (overall estimates)

92
Q

ODESI

A

contains confidential information
we can use the public-use parts of ODESI, in which everything is anonymized and variables have been “tweaked” a little in order to make sure that information can’t be traced back to respondents

93
Q

Research Data Centres

A

detailed data you can’t find in PUMFs (public-use microdata files)

94
Q

measures of central tendency

A

mode, median, mean

95
Q

pros and cons of mode

A

+can be used for all types of measures, relatively quick/simple measure
-doesn’t use much information, most common doesn’t necessarily mean typical (eg: 53 years old is the mode, but there are plenty of people of other ages)

96
Q

how to calculate median

A

odd: middle observation
even: average of two middle observations
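
a minimal sketch of the odd/even rule; the function name median and the example values are made up for illustration.

```python
def median(values):
    """Middle observation if n is odd; average of the two middle ones if n is even."""
    ordered = sorted(values)  # the rule only works on ordered data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0
```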

97
Q

pros and cons of median

A

+captures actual centre of distribution, less susceptible to outliers
-computationally awkward, cannot be estimated for unordered categorical variables

98
Q

percentiles

A

general concept, closely related to median (median = 50th percentile)
100 percentiles

99
Q

interquartile range

A

the range between the 25th and 75th percentiles

100
Q

90th percentile

A

90% of observations are lower, 10% are higher

101
Q

25th percentile

A

25% of observations are lower, 75% are higher

102
Q

mean cons

A

more susceptible to outliers

103
Q

measures of dispersion

A

aim to give us a sense of the breadth of a distribution

104
Q

range

A

interval between smallest and largest values

105
Q

pros and cons for range

A

+good for quick check

-only takes into account two observations, very sensitive, only useful for numeric variables

106
Q

pros and cons of SD

A

+variance and SD take into account all scores, accurately describes “typical” deviation, easily interpreted
-sensitive to outliers, can only be calculated for numerical variables

107
Q

proportions

A

raw frequencies are convoluted and make comparisons difficult, so proportions standardize frequency by the number of cases (proportion = frequency / number of cases)

108
Q

frequency cons

A

working with them is tough when trying to conceptualize comparisons
-this can be fixed by changing them into percentages

109
Q

cumulative percentage

A

the percentage in the category plus the percentages in all the categories under it

only works for variables with ordered categories (ordinal or numerical)
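
a short sketch using made-up frequencies for a Likert-type item, listed from lowest to highest category.

```python
from itertools import accumulate

# made-up frequencies for an ordinal variable, lowest category first
categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
counts = [5, 15, 30, 35, 15]

total = sum(counts)
percentages = [100 * c / total for c in counts]
cumulative = list(accumulate(percentages))  # each entry adds every category below it

for cat, pct, cum in zip(categories, percentages, cumulative):
    print(f"{cat:18s} {pct:5.1f} {cum:6.1f}")
```

the last cumulative percentage is always 100, which is a handy check.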

110
Q

random process

A

a process where we know what outcomes can happen, but we don’t know which particular outcome will happen

111
Q

rules for probability distribution

A
  1. outcomes listed are disjoint
  2. each probability must be between 0 and 1
  3. all probabilities must total 1
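
a rough checker for the rules above; the function name is hypothetical, and rule 1 is only assumed to hold (distinct dictionary keys stand in for disjoint outcomes).

```python
import math

def is_valid_distribution(probs):
    """Check rules 2 and 3; rule 1 (disjoint outcomes) is assumed, not verified."""
    each_in_range = all(0 <= p <= 1 for p in probs.values())
    totals_one = math.isclose(sum(probs.values()), 1.0)
    return each_in_range and totals_one

print(is_valid_distribution({"heads": 0.5, "tails": 0.5}))     # True
print(is_valid_distribution({"a": 0.7, "b": 0.5, "c": -0.2}))  # False (negative probability)
```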
112
Q

algebra of probability

A

if we know the probabilities of their component outcomes, we can work out the probability of two events together

113
Q

continuous distribution

A

another way of summarizing information
-more advanced mathematical concept than bar graph
the line is called the probability density function
-describes information in graph
-has interesting properties
-can be used to infer the probability of any range of outcomes
-never loops back (line only moves from left to right)
-never negative, and the probability of any range is at most one
-the total area under the curve adds up to 1

114
Q

the area equals P

A

the area under the curve over a range gives the probability of an observation falling in that range
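
a numerical sketch of "area = probability", assuming a standard normal curve as the example density (the notes don't name a specific curve) and a simple trapezoid rule for the area.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution, used here only as an example curve."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def area_under_curve(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the trapezoid rule."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * width) for i in range(1, steps))
    return total * width

print(round(area_under_curve(normal_pdf, -1, 1), 3))  # about 0.683
print(round(area_under_curve(normal_pdf, -8, 8), 3))  # about 1.0 (total area)
```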