Midterm 1 Flashcards

1
Q

population
sample
producing data

A

–pop. - the entire group of individuals that is the target of our interest

–sample - a subgroup of the pop.

–producing data - choosing a sample and analyzing data

2
Q

individual

variable

A

–indiv. - an entity that is observed (human, classroom, mouse)

–var. - characteristic that is measured on each individual (height)

3
Q

quantitative variable
categorical variable
measurement

A

quant. - var. whose possible values are meaningful numbers (cost, height)
cat. - var. whose possible values are non-quantitative categories (gender, opinion)

measurement - value of a variable for an individual (textbook cost for Nathan)

4
Q

single variable pattern distribution of a variable

A

summary of data one variable at a time

5
Q

sample survey

A

observational study where indiv. report variable values themselves, freq. opinions - make (uncertain) inferences about the pop. from the sample

  1. explicitly describe pop.
  2. explicitly describe variable
  3. select representative sample

larger samples have LESS uncertainty
–sample facts only approx. pop. (uncertainty)

6
Q

Parameter

statistic

A

Parameter - numerical fact about the var. in the POPULATION

Statistic - the corresponding numerical fact in the SAMPLE (rarely exactly equal to the parameter but should be a good representation of it)

7
Q
  1. BAD SAMPLING
A
  1. Convenience sampling
    - -easiest way (stop ppl in mall, pick the 1st truck, 1st 25 chickens)
  2. Volunteer response sampling - (television polls, online, rate prof.) - indiv. select themselves
  3. quota sampling - force sample to meet specified quotas
    - -ex. recruit 200 females and 300 males btw 45-65
    - -not random, not good representation

bias - sample systematically favors certain outcomes, so it is not a good representation of the pop.

8
Q
  1. SRS
A

GOOD sampling

  1. SRS - probability sample - simple random sample
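    - -probability sample: each indiv. has a known probability of being selected; in an SRS every indiv. (and every possible sample of the same size) is equally likely to be chosen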
    - -names in a hat
    - -random digit table
    - -random # generator
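
The "random # generator" option can be sketched in a few lines of Python. The population of 100 numbered individuals and the sample size of 10 are made up for illustration; `random.sample` draws without replacement so that every subset of the requested size is equally likely.

```python
import random

# Hypothetical sampling frame: 100 numbered individuals (illustrative only).
population = list(range(1, 101))

# Fixed seed so the same draw can be reproduced; drop it for a fresh sample.
rng = random.Random(42)

# Draw 10 individuals without replacement: a simple random sample.
srs = rng.sample(population, k=10)
print(sorted(srs))
```

This is the software version of "names in a hat": each name goes in once and nobody can be drawn twice.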
9
Q
  1. Cluster sample
A

GOOD sampling

  1. cluster - “all of some”
    - –used when pop. is naturally divided into groups called clusters (YSA wards, households in city blocks)
    - -each cluster is rep. of pop. as a whole
    - -random sample of clusters taken - all indiv. inside clusters are included in the sample
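
A minimal sketch of "all of some", assuming a made-up population of 6 wards with 5 members each (the ward and member names are hypothetical). Clusters are chosen at random, then everyone inside the chosen clusters is kept.

```python
import random

# Hypothetical clusters: 6 wards, 5 members per ward (names are invented).
wards = {
    f"ward_{w}": [f"ward_{w}_member_{m}" for m in range(1, 6)]
    for w in range(1, 7)
}

rng = random.Random(0)

# Step 1: take a random sample of clusters ("some").
chosen_wards = rng.sample(sorted(wards), k=2)

# Step 2: include every individual inside each chosen cluster ("all").
cluster_sample = [person for w in chosen_wards for person in wards[w]]
print(chosen_wards, len(cluster_sample))
```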
10
Q
  1. Stratified random sample
A

GOOD sampling

  1. Stratified random sample - “some of all”
    - –classify pop. into groups (Strata) that are diff. from each other (Age, gender)
    - -indiv. within a group (Stratum) share a similar characteristic (all males, all children)
    - -select SRS from EVERY group - then combine SRSs
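
A minimal sketch of "some of all", assuming two invented strata (children and adults) and an arbitrary 5 draws per stratum. An SRS is taken inside every stratum and the pieces are combined.

```python
import random

# Hypothetical strata; group names and sizes are made up for illustration.
strata = {
    "children": [f"child_{i}" for i in range(40)],
    "adults": [f"adult_{i}" for i in range(60)],
}

rng = random.Random(1)
per_stratum = 5  # assumed number of draws from each stratum

stratified_sample = []
for name, members in strata.items():
    # SRS within this stratum, then add it to the combined sample.
    stratified_sample.extend(rng.sample(members, k=per_stratum))

print(stratified_sample)
```

Unlike the cluster sample, every group is guaranteed to appear in the final sample.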
11
Q
  1. Multistage sample
A

GOOD sampling

  1. Multistage sample - “some of some of some”
    - -U.S. = states (SRS and choose 5) -> counties (SRS choose 5) -> people (SRS choose 5)
    - -church - areas - stakes - wards - members (SRS randomly choose certain number each time break it down)
12
Q

Samples have problems due to BIAS

  • -under coverage
  • -non-response
  • -misleading response
  • -interviewer influences response
  • -question ordering
  • -question structure
  • -wording of ?
A

under coverage
–ind. with no chance of being selected (homeless, phone less)

non-response
–selected ind. refuse to answer (hangups, on vacation, don’t mail back)

misleading response
–ind. give inaccurate answer - have you cheated? do you wash hands? (private surveys avoid this better)

interviewer influences response
–rude, intimidating, subtle clues

question ordering
–answers can change with order (happiness question precedes debt question vs. vice versa)

question structure
–open ended (unlimited answers - what is your fav. music?), closed question (limits responses - What is your fav. music btw country and rap?)

wording of ?
–leading phrases, loaded words, ambiguities that influence response

13
Q

problems with observational studies 2

A
  • -subjects choose which treatment to rec. or which group to belong to
  • -lurking variables - influence the response variable
  • -passive data collection: observing, measuring, counting, subjects are undisturbed
  • -media often improperly attribute cause-effect conclusions to these
14
Q

Experiment vocabulary

–subject

A

experiment - impose treatments on subjects rather than just observing them
–we determine if treatments cause change in response

subject - being tested - indiv. to which treatment is applied

15
Q

response variable

explanatory variable

A

response - characteristic measured on each subject (whether has cancer or not)

explanatory - used to predict or explain changes in the response variable (drug being tested to see if it works on cancer patients)

16
Q

factor

treatment

A

factor - planned explanatory variable

treatment - experimental condition applied to subject = value of factor

17
Q

lurking variables
control
confounding

A

lurking variables - variables that are not among the factors being studied but still affect the response variable

control - effort to REDUCE effects of lurking variables

confounding - situation in which effects of lurking variables cannot be distinguished from effects of factors (if there is an uncontrolled lurking variable = confounding exists)

18
Q

Historical comparison experiment

A
  1. a study involves only one treatment
  2. treated subjects compared to untreated subjects from another study
    - -not good bc LOTS of lurking variables and diff. time periods
19
Q

Unreplicated experiment

A
  1. Study assigns 1 subject to treatment A and 1 to treatment B (only ONE in A and ONE in B)
  2. variation among subjects confounded with treatments; can’t evaluate magnitude of variation (confounding problem)
20
Q

confounded experiment

A

group A treated diff. than group B - but multiple subjects in each group

21
Q

4 principles of valid experiments

A
  1. control/comparison
    - -control lurking variables - homogeneous subjects - comparison group used to measure placebo effect (ex. vaccine and placebo)
    - -need at least 2 groups
  2. Randomization
    - -neutralize lurking variables by randomly assigning
  3. Replication
    - -multiple subjects in each group treated
  4. Double blinding
    - -neither subjects nor ppl who evaluate know which treatment the group received
22
Q

diagnostic bias

lack of realism

A

diagnostic bias
–diagnosis of subjects is influenced by what the evaluator expects or wants (ex. a doctor who created the pill wants to believe it worked)

lack of realism

  • -realism is often compromised by controlled study conditions
  • -solution: awareness of hidden bias and admit limitations of experiments
23
Q

Hawthorne effect

non-compliance

A

Hawthorne effect
–ppl behave diff. in an experiment bc they are aware they are being watched

non-compliance

  • -fail to submit to assigned treatment
  • -refusal to follow protocol
24
Q

Valid experimental design

A
  1. randomized controlled (RCE)
    - -all subjects randomly assigned to treatments (any of treatments)
    - -names in a hat
  2. randomized block design (RBD)
    - -create blocks (ind. with same characteristics) - then within each block randomly assign treatments (like stratified sample but with experiments)
    - -good bc reduces variation
    - -STANDARD FOR GOOD EXPERIMENT: Comparison (2 groups), randomization, replication, double-blinding
  3. MATCHED PAIRS
    - -special case of block
    - -block: pair of ind. and pair of measurements
    - -ex. twins - each rec. a treatment, 2 treatments on each ind., measure before and after treatment for each ind.
    - -ex. test diet with exercise and diet without exercise
    - —–block: pair of identical twins, exp. var. - whether dieting includes exercise, response variable: cholesterol level
    - —–comparison: 2 treatments, random, replication (20 sets of twins)
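
The twin example can be sketched as random assignment inside each block. This is only an illustration: the twin labels are invented, and a simulated coin flip stands in for "names in a hat".

```python
import random

# Hypothetical blocks: 20 pairs of identical twins (labels are made up).
pairs = [(f"twin_{i}a", f"twin_{i}b") for i in range(1, 21)]

rng = random.Random(7)
assignment = {}

for first, second in pairs:
    # Within each block, a coin flip decides which twin gets which treatment.
    if rng.random() < 0.5:
        first, second = second, first
    assignment[first] = "diet + exercise"
    assignment[second] = "diet only"

for twin_a, twin_b in pairs[:3]:
    print(twin_a, assignment[twin_a], "|", twin_b, assignment[twin_b])
```

Randomizing inside the pair, rather than across all 40 people, is what makes this a block design: each treatment is compared between two subjects who are as alike as possible.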
25
Q

Histogram

A
  • -horizontal line to cover range of data
  • -divide range into classes of equal width
  • -count # ind. in each class - construct bar over each class with HEIGHT being the count (frequency) or percentage of total (relative frequency)

Advantage over stem plots? data can be any SIZE

26
Q

Stem and leaf plot

dot plot

A

Stem (all but last digit) listed vertically, leaf (last digit) to the right

dot plot

  • -have x axis of values and dot by frequency
  • -usually used for discrete quantitative
27
Q

To interpret visual displays:

A

Shape

  • –symmetric and bell shaped
  • –right/left skewed (left skewed has long tail to left and hump on the right)
  • –bimodal (two peaks)
  • –flat or uniform (flat and same across)

center

  • –the median (half both sides) (not influenced by outliers)
  • –mean
  • -mode

spread
–how varied is data? - RANGE - look at min. and max.

28
Q

Center measurements

A

summarize QUANTITATIVE variables
center
—mode = value at “peak” - value with HIGHEST freq.
—median = the middle value, denote by M, 1/2 area to right and 1/2 to left
—mean = the center of gravity

consumer alert - on news either median or mean can be called the “average”

  • -if symmetrical they are approx. equal
  • -median is “resistant” to outliers and long tails
  • -mean has desirable properties for inference
  • -use median if skewed or outliers are present and use mean if roughly symmetrical

notation to find mean
x(bar) = 1/n*(sum of xi)
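
A small check of the "resistant" claim, using made-up numbers: adding one extreme value moves the mean a lot and the median only a little.

```python
from statistics import mean, median

# Made-up, roughly symmetric data set.
data = [2, 3, 3, 4, 5]

# Same values plus one extreme observation.
with_outlier = data + [40]

print(mean(data), median(data))                  # 3.4 3
print(mean(with_outlier), median(with_outlier))  # 9.5 3.5
```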

29
Q

Spread measurements

A

Range = max. - min. - HIGHLY affected by outliers
IQR - range occupied by middle 50% of data (3rd Q - 1st Q)
—highly clustered if small - if large compared to range it is less clustered (Q3 - Q1)
–IQR is resistant to outliers

IQR

  • -1st Quartile - with approx. 25% of observations below and 75% above
  • -2nd quartile = the median
  • -3rd quartile - with approx. 75% of observations below and 25% above
30
Q

flag outliers - reasons

A
  1. if distribution is long tailed and value is legitimate: keep outlier
  2. if values produced under diff. conditions than rest of data set: remove outlier
  3. if value is mistake or typo:
    - -correct if possible, otherwise remove

1.5 x IQR rule - compute 1.5 x IQR, then flag any value below Q1 - (1.5 x IQR) or above Q3 + (1.5 x IQR) as a possible outlier
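
The flagging rule can be sketched with the standard library. The data are made up, and `statistics.quantiles` uses its default quartile method, which may differ slightly from the method taught in a given course.

```python
from statistics import quantiles

# Made-up data with one suspicious value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Fences sit 1.5 IQRs beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, outliers)
```

Flagging only marks a value for a closer look; whether to keep, correct, or remove it follows the three reasons listed on the card.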

31
Q

box plot

A
5 # summary
--min.
--Q1
--median
--Q3
--Max
(if Median to Q1 is SMALLER than median to Q3 can know it is right skewed)

box plot is made from 5 # summary

  • -whiskers go to non-flagged values (no outliers)
  • -flagged = outliers

advantage of box plot is can be used to compare several distributions next to each other easily

32
Q

St. Dev.

A

measures for both overall spread and clustering

  • –quantifies spread by measuring how far from mean
  • -NOT resistant to outliers
  • -should be paired with the mean

s = sqrt( (sum of (xi - x(bar))^2) / (n - 1) )
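
The formula can be checked by computing it step by step and comparing with `statistics.stdev`, which also divides by n - 1. The data values are made up.

```python
from math import sqrt
from statistics import mean, stdev

# Made-up data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = mean(data)

# Sum of squared distances from the mean, divided by n - 1, then square root.
squared_deviations = [(x - x_bar) ** 2 for x in data]
s_manual = sqrt(sum(squared_deviations) / (len(data) - 1))

print(s_manual, stdev(data))
```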

33
Q

Normal distributions

A

within 1 s = 68%
within 2 s = 95%
within 3 s = 99.7%

so the 8 regions, from below -3 s to above +3 s, hold these % of the data:

.15, 2.35, 13.5, 34, 34, 13.5, 2.35, .15
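
The 68 / 95 / 99.7 figures are rounded. A quick check with the standard library's `NormalDist` shows the values behind them (a sketch using the standard normal curve, mean 0 and st. dev. 1):

```python
from statistics import NormalDist

# Standard normal curve: mean 0, st. dev. 1.
z = NormalDist()

# Area within k st. dev. of the mean = cdf(k) - cdf(-k).
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} s: {100 * area:.2f}%")
```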

34
Q

Probability

A
  • -use prob. to take sample data and make inference

- -game with car and goats - 2x more likely to win if switch (1st pick is the car only 1/3 of the time, so switching wins the other 2/3)
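
The car-and-goats game can be checked by simulation. This sketch assumes the usual rules: three doors, and the host always opens a goat door that is not the player's pick. The number of trials is arbitrary.

```python
import random


def play(switch, rng):
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Host opens a goat door that is not the player's current pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


rng = random.Random(3)
trials = 20_000
stay_rate = sum(play(False, rng) for _ in range(trials)) / trials
switch_rate = sum(play(True, rng) for _ in range(trials)) / trials
print(stay_rate, switch_rate)
```

Staying should win about 1/3 of the time and switching about 2/3, which is where "2x more likely" comes from.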

35
Q

random phenomenon

A

ind. outcome unpredictable but outcome from large # of reps follows a regular pattern (rolling a die)

36
Q

sample space

event

A

sample space
–set of all possible outcomes (for # of dots on die)

event
–collection of possible outcomes (ex. we can write the event “rolling an odd number” as {1, 3, 5})

37
Q

probability of outcome

A

proportion of times that an outcome occurs in many, many repetitions of random phenomenon

P(A) = 0 = will not happen
P(A) = .5 = 1/2 chance will, 1/2 chance won't
P(A) = 1 = will happen
38
Q

Empirical probability

law of large numbers

A

Empirical

  • -approximate by playing the game many times and OBSERVING the frequency of occurrence
  • –find by DOING

law of large numbers
–as # trials (Repetitions) of experiment/game increase, the relative frequency gets closer to the theoretical prob. of the event
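
A simulated die shows the law of large numbers at work: the relative frequency of "odd" drifts toward the theoretical .5 as the number of rolls grows. The roll counts and the seed are arbitrary choices for this sketch.

```python
import random

rng = random.Random(5)


def relative_frequency_of_odd(n_rolls):
    """Roll a fair die n_rolls times; return the proportion of odd results."""
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    return sum(r % 2 == 1 for r in rolls) / n_rolls


# Empirical probability for increasingly many repetitions.
freqs = {n: relative_frequency_of_odd(n) for n in (10, 100, 10_000, 100_000)}

for n, freq in freqs.items():
    print(n, freq)
```

Small runs can land far from .5; the guarantee is only about what happens in the long run.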

39
Q

Probability distribution

A

set of possible outcomes in sample space and prob. associated with each outcome (as a proportion or percent)

  • -prob. must sum to 1
  • -can be represented by table, formula or graph
40
Q

random variable
continuous random variable
discrete random variable

A

random variable
–characteristics measured on each indiv. (cost, height, gender)

cont. random var.
- -variable that can take on any value in an interval so that all possible values cannot be listed (time, height, temp.)

discrete random var.
–var. whose possible values are a list of distinct values (gender, opinion, # arrests, shoe size)

41
Q

2 types of prob. distributions for discrete variables

A
  1. discrete categorical
    —random var. = major
    –distribution table - bar graphs also used
    (list the majors and the prob. of grad students in them underneath like in math 118)
    –or in a bar graph with list of categorical majors on x axis and prob. on y axis
    –DO compare percentages for outcomes…DON’T calc. measures of center/spread (no mean, med., st. dev. IQR)
  2. discrete quantitative variable (distributions)
    - -random variable = household size
    - -same in table but with numbers instead of categories and percentage prob. on the second line
    - -or histogram - x axis is number persons in households and y axis is a percentage prob.
    - -compare % and CAN calc. measures of center or spread
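
For a discrete quantitative variable the table can be stored as outcome -> prob., and the mean is the prob.-weighted sum of the outcomes. The household-size probabilities below are invented for illustration, not real census figures.

```python
# Hypothetical probability distribution for household size (made-up numbers).
household_size = {1: 0.27, 2: 0.34, 3: 0.16, 4: 0.14, 5: 0.09}

# A valid distribution has probabilities that sum to 1.
total_prob = sum(household_size.values())

# Mean of a discrete random variable: sum of (value x prob.).
mean_size = sum(size * prob for size, prob in household_size.items())

print(total_prob, mean_size)
```

The same calculation makes no sense for the categorical case (majors), since the outcomes are not numbers.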
42
Q

distribution for cont. random variable

A
  • -can take on any value within range of variable with no gaps
  • -focus on prob. that value is in a specific interval
  • -(Ex. prob. height is btw 67.5 in. and 68.5 in.)
  • -histogram with intervals on x axis and % prob. rate on Y axis
  • -can compare % and calc. measures of center and spread
43
Q

Prob. density curve

A
  • -model prob. distribution with smooth curve
  • -curve is always on or above the horizontal x axis
  • -area under curve is 1
  • -where curve is HIGH - data values are more dense
  • -more accurate estimates of prob. than using histogram of sample data

prob. that x occurs in any interval is equal to AREA under curve for that interval
- -LOOK AT NOTES AGAIN

median of density curve - the value that divides the area of the density curve in half
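
A sketch of "prob. = area under the curve" using a normal density curve. The mean of 68 in. and st. dev. of 3 in. are assumed values chosen only so the height example can be computed.

```python
from statistics import NormalDist

# Hypothetical model for height in inches (mean and st. dev. are assumptions).
heights = NormalDist(mu=68, sigma=3)

# Area under the curve between 67.5 and 68.5 = cdf(upper) - cdf(lower).
p_interval = heights.cdf(68.5) - heights.cdf(67.5)

# Median of the density curve: the value with half the area to its left.
curve_median = heights.inv_cdf(0.5)

print(round(p_interval, 4), curve_median)
```

For a symmetric curve like this one the median equals the mean; for a skewed curve the two would differ.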

44
Q

normal distribution names and notation

A

center

  • -name - mean
  • -for a density curve - mu (μ)
  • -histogram notation is - x(bar)

spread

  • -name - st. dev.
  • -for density curve - sigma (σ)
  • -for histogram notations is - s
45
Q

standardization

A

allows us to compare diff. normal distributions
–math conversion of normally distributed variable to STANDARD normal variable

z = (x - mean) / st. dev.
–z score gives # of st. dev. above or below the mean of normal distribution

ex. z = (4122.5 - 3485) / 425 = 1.5 (means 1.5 st. dev. above mean)

-> higher z score means better score on the test, relative to its own distribution
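
The worked example can be reproduced with a small helper. The function name `z_score` is just an illustrative choice; the numbers come from the example on the card.

```python
def z_score(x, mean, st_dev):
    """Return how many st. dev. the value x lies above (+) or below (-) the mean."""
    return (x - mean) / st_dev


z = z_score(4122.5, 3485, 425)
print(z)  # 1.5
```

Note the parentheses around the subtraction: without them the division happens first and the answer is wrong.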

46
Q

4 big steps in Big Picture of stats

A
  1. Producing data
  2. exploratory data analysis
  3. Probability
  4. inference