Final Flashcards
data
observations collected from field notes, surveys, experiments, etc
what is the backbone of statistical investigation
data
statistics
study of how to collect, analyze, and draw conclusions from data
classic challenge in statistics
evaluating the efficacy of medical treatment
summary statistic
a single number summarizing a large amount of data
variables
characteristic
data matrix
a way to organize data
numerical variable
wide range of numerical values, sensible to add/subtract/take averages
types of numerical variables
discrete, continuous
discrete
can only take numerical values with jumps (eg number of siblings)
continuous
can take numerical values without jumps (eg height)
categorical
responses are categories
types of categorical
ordinal, nominal
ordinal variable
categorical but have a natural ordering (eg Likert scale)
nominal variable
categorical and no natural ordering (eg favourite ice cream)
negative, positive, independent association
positive: both variables tend to move in the same direction; negative: one tends to decrease as the other increases; independent: the variables show no association
population vs sample
the group we want to make a generalization about vs the group we actually have information about
anecdotal evidence
data collected in haphazard fashion from individual cases, usually composed of unusual cases that we recall based on their striking characteristics
random sampling
avoid adding bias
simple random sampling
most basic random sample, using raffle; every case in population has equal chance of being included
non response bias
low response rates can bias results from a random sample when non-respondents differ systematically from respondents
convenience sampling
individuals who are easily accessible are more likely to be included in the sample
independent variable
another name for the explanatory variable
response variable
another name for the dependent variable
observational studies
collection of data in way that doesn’t directly interfere with how the data arises
eg: collecting surveys, ethnography, etc
randomized experiment
when individuals are randomly assigned to a group
confounding variable
variable correlated with both the explanatory and response variables
aka: lurking variable, confounding factor, confounder
prospective study
identifies individuals and collects information as events unfold
eg: medical researchers may identify and follow a group of similar individuals over many years
retrospective study
collects data after events have taken place
eg: researchers may review past events in medical records
simple random sampling
every case in population has equal chance of being included
stratified sampling
divide-and-conquer; population is divided into strata (which are chosen so similar cases are grouped together), then a second sampling method (usually simple random) is employed within each stratum
eg: who in Canada goes to theme parks? intentionally oversampling PEI because if we didn’t, most of the respondents would probably be from other provinces like Ontario, and PEI might be skipped entirely
when is stratified sampling useful?
when cases in each stratum are very similar with respect to the outcome of interest
cluster sampling
break up population into clusters, then sample a fixed number of clusters and include all observations from each of the sampled clusters
eg: surveying Saskatchewan children by sampling Saskatchewan schools randomly, then simple random sampling kids from the selected schools
multistage sampling
like cluster sample, but collect random sample within each selected cluster
pros and cons of multistage sampling
+cluster/multistage can be more economical than alternative sampling techniques
+most useful when there’s a lot of case-to-case variability within cluster but clusters themselves don’t look very different from one another
eg: neighbourhoods when they are very diverse
-more advanced analysis techniques are typically required
scatter plots and its strength
provides case by case view of two numerical variables
+helpful in quickly spotting associations relating variables, trends, etc
dot plots
provides the most basic display for one variable; like a one-variable scatterplot
mean
common way to measure centre of distribution of data
- add up and divide by n
- often labeled as x-bar
μ
population mean
μx
indicates which variable the population mean refers to
histograms
doesn’t show value of each observation
each value belongs to a bin
binned counts are plotted as bars on histogram
provide view of data density
pros and cons of histogram
convenient for describing shape of data distribution
skewness
right skew (longer right tail) left skew (longer left tail) symmetric (equal tails)
one, two, three prominent peaks
unimodal, bimodal, multimodal
two measures of variability
variance, standard deviation
variance
the average squared deviation
σ², the standard deviation squared
standard deviation
σ
describes how far away the typical observation is from the mean
deviation
distance of an observation from its mean
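The deviation/variance/SD cards can be sketched quickly in Python (the numbers are made up for illustration; note that the sample variance conventionally divides by n − 1 rather than n):

```python
# Deviation, variance, and standard deviation for a tiny made-up sample.
data = [2.0, 4.0, 6.0, 8.0]
n = len(data)
mean = sum(data) / n

# deviation: distance of each observation from the mean
deviations = [x - mean for x in data]

# sample variance: (almost) the average squared deviation, dividing by n - 1
variance = sum(d ** 2 for d in deviations) / (n - 1)

# standard deviation: square root of the variance
std_dev = variance ** 0.5
```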
box plots
•summarizes data set using five statistics while also plotting unusual observations
•step 1: draw dark line denoting the median, which splits data in half
•step 2: draw rectangle to represent the middle 50% of the data
⁃aka interquartile range aka IQR
⁃measure of variability in data
⁃the more variable the data, the larger the standard deviation and IQR
⁃two boundaries are called first quartile and third quartile
⁃Q1 and Q3 respectively
⁃IQR = Q3 − Q1
•step 3: whiskers attempt to capture data outside of the box
⁃reach is never allowed to be more than 1.5 x IQR
•step 4: any observations beyond the whiskers are identified as outliers
•robust estimates: extreme observations have little effect on value
⁃median and IQR are robust estimates
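Steps 1–4 can be checked with the standard library (the data are made up; statistics.quantiles uses one of several quartile conventions):

```python
# Quartiles, IQR, and whisker fences for a box plot.
import statistics

data = [1, 3, 4, 5, 7, 8, 9, 20]
q1, median, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                     # spans the middle 50% of the data

# whiskers may reach at most 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# anything beyond the whiskers is flagged as an outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```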
mapping data
colours are used to show higher and lower values of a variable
not helpful for getting precise values
helpful for seeing geographic trends and generating interesting research questions
contingency tables
summarized data for two categorical variables
-each value in table represents number of times a particular combination of variable outcomes occurred
row totals
total counts across each row
column totals
total counts down each column
relative frequency table
replace counts with percentages or proportions
row proportions
computed as counts divided by row totals
segmented bar plots
graphical display of contingency table information
mosaic plot
graphical display of contingency table information
-use areas to represent number of observations
probability
proportion of times the outcome would occur if we observed the random process an infinite number of times
law of large numbers
as more observations are collected, the proportion p̂ₙ of occurrences with a particular outcome converges to the probability p of that outcome
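A small simulation illustrates the law of large numbers (p = 0.5 is an assumed fair-coin probability):

```python
# Law of large numbers: the running proportion of heads converges to p.
import random

random.seed(1)  # fixed seed so the run is reproducible
p = 0.5
heads = 0
for n in range(1, 100_001):
    heads += random.random() < p
    p_hat = heads / n  # proportion of heads after n flips

# after 100,000 flips, p_hat sits very close to p
```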
disjoint outcomes
aka mutually exclusive
when two outcomes cannot happen at the same time
probability distributions
table of all disjoint outcomes and their associated probabilities
complement of event
all outcomes not in the event
sample space
set of all possible outcomes
independence
when knowing the outcome of one process provides no useful information about the outcome of the other
marginal probability
if a probability is based on a single variable
joint probability
probability of outcomes is based on two or more variables
defining conditional probability
two parts: outcome of interest and condition
condition
information we know to be true
conditional probability
the probability of the outcome of interest A given condition B
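As a minimal sketch (the probabilities are made up), conditional probability is the joint probability divided by the probability of the condition:

```python
# P(A | B) = P(A and B) / P(B)
p_a_and_b = 0.12  # joint probability of A and B (hypothetical)
p_b = 0.30        # marginal probability of the condition B (hypothetical)

p_a_given_b = p_a_and_b / p_b
```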
tree diagrams
organize outcomes and probabilities around the structure of data
when are tree diagrams most useful?
when two or more processes occur in a sequence and each process is conditioned on its predecessors
expected value of X
average outcome of X
denoted by E(X)
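E(X) can be sketched as a probability-weighted sum (the distribution below is made up):

```python
# Expected value: each outcome weighted by its probability.
outcomes = [0, 1, 2, 3]
probs = [0.1, 0.4, 0.3, 0.2]

expected_x = sum(x * p for x, p in zip(outcomes, probs))
```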
deductive
reasoning
inductive
experience and reasoning
wheel of science
a cycle: theory → hypotheses → observations → empirical generalizations → back to theory
the downward half (theory to observations) is deduction
the upward half (observations to theory) is induction
measurement
downward part of wheel of science
conceptualization vs operationalize
“lack of money” vs “lack of opportunity” are two conceptualizations of poverty
“do you have enough money to feed your family?” operationalizes the conceptualization of poverty
different conceptualizations often require different operationalizations
quantitative vs qualitative
a little about a lot of people vs a lot about a few people
administrative data
growing source
digital data that is collected in the process of administering other social goals
everything from information attached to a health card number to credit card records
hard to make generalizations beyond the covered population
eg: a database of health card records is hard to generalize to all of Canada, because people who never used their health cards are ignored entirely
survey research
designed to ask research questions
responses distilled into data that we work with
measurement necessitates some simplification because we need to compare across different groups of people
population vs sample
group we want to make a generalization about vs the group we actually have information about
census
rare kind of sample that covers an entire population, can be very expensive
basically the opposite of an anecdote
what is snowball sampling often used for?
vulnerable communities like illegal immigrant workers in America
complex random sampling
sample is still random, but we tweak things so that some cases are less/more likely to be selected
three sources of bias
non-response
voluntary response
convenience response
experiments
typically create artificial situations that are designed to isolate variables of interest and their effects
pros and cons of observational studies?
+can reveal meaningful associations
-hard to draw conclusions about causation
R
increasingly popular open source client
accessible because it’s free
SPSS
popular for undergrads and certain fields
designed for doing experiment research
Stata
popular among sociologists and economists
stacked dot plot
higher bars represent areas where there are more observations
makes it easier to judge the centre and shape of the distribution
shape of distribution is determined by….
modality (how many humps?)
skewness (one side of distribution looks very different from other side)
outliers (one or two observations are unusual)
questionnaire
contains actual phrasing of question and options for the responses
codebook
summarizes the data set; tells us what the names in the data mean, like a dictionary
CANSIM
micro data, summary statistics (overall estimates)
ODESI
contains confidential information
we can use the public-use parts of ODESI, in which everything is anonymized and variables have been “tweaked” a little in order to make sure that information can’t be traced back to respondents
RDC
Research Data Centre; stuff you can’t find on PUMFs
measures of central tendency
mode, median, mean; ie where does the modality tend to accumulate?
pros and cons of mode
+can be used for all types of measures, relatively quick/simple measure
-doesn’t use much information; the most common value isn’t necessarily typical (eg: 53 may be the modal age even though plenty of people are other ages)
how to calculate median
odd: middle observation
even: average of two middle observations
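The odd/even rule above, as a small Python function:

```python
# Median: middle observation (odd n), or the average of the two middle ones (even n).
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                 # odd: the single middle observation
    return (s[mid - 1] + s[mid]) / 2  # even: average of the two middle observations
```

e.g. median([3, 1, 2]) gives 2, and median([1, 2, 3, 4]) gives 2.5.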
pros and cons of median
+captures the actual centre of the distribution, less susceptible to outliers
-computationally awkward, cannot be estimated for unordered categorical variables
percentiles
general concept, closely related to median (median = 50th percentile)
there are 100 percentiles in total
interquartile range
between the 25th and 75th percentiles
90th percentile
90% of observations are lower, 10% are higher
25th percentile
25% of observations are lower, 75% are higher
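One simple way to compute a percentile (several interpolation conventions exist; this sketch uses the nearest-rank rule):

```python
# Nearest-rank percentile: smallest value with at least p% of observations at or below it.
import math

def percentile(values, p):
    s = sorted(values)
    rank = math.ceil(p / 100 * len(s))
    return s[max(rank - 1, 0)]

data = list(range(1, 101))  # the values 1..100
q1 = percentile(data, 25)   # 25th percentile
p90 = percentile(data, 90)  # 90th percentile
```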
mean cons
more susceptible to outliers
measures of dispersion
aim to give us a sense of the breadth of a distribution
e.g. compare temperature in Saskatoon vs Vancouver
range
interval between smallest and largest values
pros and cons of range
+good for quick check
-only takes into account two observations, very sensitive, only useful for numeric variables
pros and cons of standard deviation
+variance and SD take into account all scores, accurately describes “typical” deviation, easily interpreted
-sensitive to outliers, can only be calculated for numerical variables
proportions
raw frequencies make comparisons across groups difficult, so proportions standardize frequency by the number of cases
frequency cons
working with them is tough when trying to make comparisons
-this can be fixed by converting them into percentages
cumulative percentage
the percentage in the category + the category under it
only works for ordinal variables
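The running total can be built with itertools.accumulate (the counts below are a made-up Likert item):

```python
# Cumulative percentage: each category's percentage plus everything below it.
from itertools import accumulate

counts = [10, 20, 40, 20, 10]  # strongly disagree ... strongly agree (hypothetical)
total = sum(counts)
percentages = [100 * c / total for c in counts]
cumulative = list(accumulate(percentages))  # ends at 100 for the top category
```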
random process
a process where we know what outcomes can happen, but we don’t know which particular outcome will happen
rules for probability distribution
- outcomes listed are disjoint
- each probability must be between 0 and 1
- all probabilities must total 1
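The three rules can be checked mechanically (the distribution is made up):

```python
# Validating a probability distribution against the rules above.
probs = [0.2, 0.5, 0.3]  # hypothetical probabilities for three disjoint outcomes

each_in_range = all(0 <= p <= 1 for p in probs)  # each probability between 0 and 1
totals_to_one = abs(sum(probs) - 1.0) < 1e-9     # all probabilities total 1
```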
algebra of probability
if we know the probabilities of the component outcomes, we can work out the probability of events built from them
continuous distribution
another way of summarizing information
-more advanced mathematical concept than bar graph
the line is called probability density function
-describes information in graph
-has interesting properties
-can be used to infer probability of any outcome
-never loops back (the line only moves from left to right)
-never dips below zero
-the area under the curve adds up to 1
area equals p
the area under the curve gives the probability of people falling in that range
frequency table (and a disadvantage)
lists all the qualities variables can take on and how many people answered to each quality
-impractical for continuous variables because data gets too unwieldy
pie charts
they suck; don't use pie charts. they're misleading, only really good for visual appeal and public-facing information, and only work for things that sum to 100
bar charts
display simple information well
can chart frequencies and proportions
information doesn’t need to sum to 100
law of large numbers
as more observations of a random process are collected, the proportion of occurrences with a particular outcome converges to the probability of that outcome
normal distribution
unimodal, symmetric, bell shaped curve
many variables are nearly normal, but none are exactly normal
what are normal distributions defined by?
the mean (where the curve sits on the number line) and SD (how peaked/spread out it is)
z scores
how many standard deviations does x fall from the mean?
every z score corresponds to a specific percentile
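A z-score and its percentile under a normal model, using only the standard library (the mean and SD are made up, on an IQ-style scale):

```python
# z-score: how many SDs x falls from the mean; percentile via the normal CDF.
import math

mean, sd = 100.0, 15.0  # hypothetical population mean and SD
x = 130.0

z = (x - mean) / sd
percentile = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z
```

Here z = 2, and about 97.7% of observations fall below x.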
inferential statistics
saying things about society as a whole without futile attempt to examine the whole society
parameters
hypothetical number that exists somewhere
any characteristic of a population can be defined by a parameter
sampling error
the difference between estimate and actual parameter
unless we survey every case in the population, we will always have sampling error
sampling distribution
the hypothetical distribution we would get if we could sample our population an infinite number of times
standard error
typical or expected error (standard deviation) based on sampling distribution
aka standard deviation of sampling distribution
-no obvious way to estimate SE from single sample
central limit theorem
if a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model
as n becomes large, the sampling distribution approaches normality and it has less and less error in it
standard error will be bigger if the population has a larger standard deviation
we can decrease our standard error if we make a bigger sample
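The last two cards follow from the standard-error formula SE = σ/√n (σ and the sample sizes below are made up):

```python
# Standard error shrinks as n grows and grows with the population SD.
import math

sigma = 10.0  # hypothetical population standard deviation

se_n25 = sigma / math.sqrt(25)    # n = 25
se_n100 = sigma / math.sqrt(100)  # n = 100: quadrupling n halves the SE
```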
recipe for statistical inference
estimate
standard error
desired confidence level
confidence intervals
a plausible range of values for the population parameter
“what is the probability that the population mean falls within a certain range?”
width trades off with confidence: more confidence means a wider interval
narrowing intervals
we can narrow confidence interval without reducing confidence by reducing our standard error
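A 95% confidence interval from the recipe above (estimate ± 1.96 × SE; all numbers are made up):

```python
# 95% confidence interval for a mean.
import math

x_bar = 50.0  # the estimate (sample mean)
s = 8.0       # sample standard deviation
n = 64        # sample size

se = s / math.sqrt(n)      # standard error
lower = x_bar - 1.96 * se  # 1.96 is the z* for 95% confidence
upper = x_bar + 1.96 * se
```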
p values
probability of observing data favourable to the alternative hypothesis if null is true
p values are controversial
the greater the p value, the more consistent the data are with the null (it does not prove the null is true)
it is only a probability about the observed data, not a measure of how big an effect is
hypothesis testing
comparing world we actually observe to what we think the world should be like
if our evidence looks nothing like the null, we can reject the null
why null?
we don’t want to say how certain we are because we can never collect all the information, therefore there is always a possibility of one case out there proving us wrong. So we try to improve our chances that hypothesis is right. A type of process of elimination
why double negatives?
because we accept the hypothesis conditionally, with some probability, but not absolute certainty
alpha level
expresses same information as confidence level, except alpha level shows how unconfident you are. e.g. if confidence level is 95%, alpha level is 0.05
single tail tests
asks how far away the x̄ distribution must fall from the hypothesized mean before we reject, using a single critical value
when we test whether X-bar is greater than or less than the population mean, but not both
only common in psychology
why don’t we use single tail tests that often?
because there’s a way of framing single tail tests that makes it accidentally easier to reject the null, which makes positive research findings more likely and lowers the quality of the results
hypothesis testing framework
(1) write the hypothesis in plain language, then in mathematical notation
(2) identify an appropriate point estimate of the parameter of interest (mean)
(3) verify conditions to ensure the standard error estimate is reasonable and the point estimate is nearly normal and unbiased
(4) compute standard error. draw a picture depicting the distribution of the estimate under the idea that H0 is true
- shade areas representing the p-value
(5) using the picture, compute the test statistic i.e. Z-score and identify the p-value to evaluate the hypothesis
(6) write conclusion in plain language
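Steps (1)–(6) as a one-sample z-test sketch in Python (all numbers are hypothetical):

```python
# Hypothesis testing recipe: H0: mu = 100 vs HA: mu != 100 (two-tailed).
import math

mu0 = 100.0                    # (1) null value, in mathematical notation
x_bar, s, n = 104.0, 12.0, 36  # (2) point estimate and sample info (hypothetical)

se = s / math.sqrt(n)          # (4) standard error of the mean

z = (x_bar - mu0) / se         # (5) test statistic
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-tailed p-value

reject_null = p_value < 0.05   # (6) compare to alpha = 0.05
```

Here z = 2 and the p-value is about 0.046, so we would reject H0 at the 0.05 level.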
two tail tests
we distribute critical region, we don’t assume whether sampling distribution is above or below, just about whether it falls outside or inside
we need a more extreme x̄ value to reject the hypothesis
type 1 vs type 2 error
type 1: falsely rejecting the null
type 2: falsely accepting the null
writing null vs writing alternative
H0 = null hypothesis
-skeptical perspective or claim to be tested
-always write the null hypothesis as an equality
HA = alternative hypothesis
-alternative or new claim under consideration
testing appropriateness of normal model
(1) fit simple histogram over normal curve
(2) examine normal probability plot
bin size
adding more bins provides greater detail
when the sample is large, smaller bins still work well
with smaller sample sizes, small bins are very volatile