Midterm 1 Flashcards
population
sample
producing data
–pop. - the entire group of individuals that is the target of our interest
–sample - a subgroup of the pop.
–producing data - choosing a sample and analyzing data
individual
variable
–indiv. - an entity that is observed (human, classroom, mouse)
–var. - characteristic that is measured on each individual (height)
quantitative variable
categorical variable
measurement
quant. - var. whose possible values are meaningful numbers (cost, height)
cat. - var. whose possible values are non-quantitative categories (gender, opinion)
measurement - value of a variable for an individual (textbook cost for Nathan)
single variable pattern distribution of a variable
summary of data one variable at a time
sample survey
observational study where indiv. report variable values themselves, freq. opinions - make (uncertain) inferences about the pop. from sample
- explicitly describe pop.
- explicitly describe variable
- select representative sample
larger samples have LESS uncertainty
–sample facts only approx. pop. (uncertainty)
Parameter
statistic
Parameter - numerical fact about the var. in the POP. (usually unknown)
Statistic - the corresponding numerical fact in the SAMPLE (never exactly equal to the parameter but should be a good estimate of it)
- BAD SAMPLING
- Convenience sampling
- -easiest way (stop ppl in mall, pick the 1st truck, 1st 25 chickens)
- Volunteer response sampling
- -(television polls, online, rate prof.) - indiv. select themselves
- quota sampling - force sample to meet specified quotas
- -ex. recruit 200 females and 300 males btw 45-65
- -not random, not good representation
bias - sample systematically favors certain outcomes and is not a good representation of the pop.
- SRS
GOOD sampling
- SRS - probability sample - simple random sample
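- -probability sample: each indiv. has a known probability of being selected; in an SRS every indiv. (and every possible sample of the same size) has an equal chance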
- -names in a hat
- -random digit table
- -random # generator
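A minimal sketch of the "random # generator" route to an SRS. The population of 100 numbered individuals and the sample size of 10 are made up for illustration; the fixed seed is only there so the draw can be repeated.

```python
import random

# Hypothetical population: 100 individuals labeled 1..100 (the "names in the hat").
population = list(range(1, 101))

# Seeded generator so the sketch is reproducible; drop the seed for a fresh draw.
rng = random.Random(42)

# random.sample draws without replacement, so every individual (and every
# possible group of 10) has the same chance of being chosen.
srs = rng.sample(population, k=10)
print(srs)
```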
- Cluster sample
GOOD sampling
- cluster - “all of some”
- –used when pop. is naturally divided into groups called clusters (YSA wards, households in city blocks)
- -each cluster is rep. of pop. as a whole
- -random sample of clusters taken - all indiv. inside clusters are included in the sample
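A small sketch of "all of some", assuming a made-up set of city blocks as clusters: take an SRS of clusters, then keep every household inside the chosen ones.

```python
import random

# Hypothetical clusters: city blocks, each listing its households.
clusters = {
    "block_A": ["A1", "A2", "A3"],
    "block_B": ["B1", "B2"],
    "block_C": ["C1", "C2", "C3", "C4"],
    "block_D": ["D1", "D2"],
}

rng = random.Random(0)  # seeded only for reproducibility

# Step 1: SRS of clusters ("some" of the clusters).
chosen_blocks = rng.sample(sorted(clusters), k=2)

# Step 2: include ALL individuals inside each chosen cluster.
cluster_sample = [h for block in chosen_blocks for h in clusters[block]]
print(chosen_blocks, cluster_sample)
```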
- Stratified random sample
GOOD sampling
- Stratified random sample - “some of all”
- –classify pop. into groups (Strata) that are diff. from each other (Age, gender)
- -indiv. within a group (Stratum) share a similar characteristic (all males, all children)
- -select SRS from EVERY group - then combine SRSs
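A small sketch of "some of all", assuming three made-up age strata and an arbitrary 5 individuals per stratum: take an SRS inside every stratum, then combine the SRSs.

```python
import random

# Hypothetical strata: groups that differ from each other (here, by age).
strata = {
    "children": [f"child_{i}" for i in range(30)],
    "adults": [f"adult_{i}" for i in range(50)],
    "seniors": [f"senior_{i}" for i in range(20)],
}

rng = random.Random(1)  # seeded only for reproducibility
per_stratum = 5         # assumed sample size per stratum

# SRS from EVERY stratum, then combine.
stratified_sample = []
for name, members in strata.items():
    stratified_sample.extend(rng.sample(members, k=per_stratum))
print(stratified_sample)
```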
- Multistage sample
GOOD sampling
- Multistage sample - “some of some of some”
- -U.S. -> states (SRS, choose 5) -> counties (SRS, choose 5 in each) -> people (SRS, choose 5 in each)
- -church -> areas -> stakes -> wards -> members (take an SRS of a certain number at each stage)
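A rough sketch of "some of some of some" with three stages. The nested frame (5 states, 4 counties each, 6 people each) and the per-stage sample sizes are invented for illustration.

```python
import random

rng = random.Random(2)  # seeded only for reproducibility

# Hypothetical nested sampling frame: state -> county -> people.
frame = {
    f"state_{s}": {
        f"county_{s}_{c}": [f"person_{s}_{c}_{p}" for p in range(6)]
        for c in range(4)
    }
    for s in range(5)
}

multistage_sample = []
for state in rng.sample(sorted(frame), k=2):                   # stage 1: SRS of states
    for county in rng.sample(sorted(frame[state]), k=2):       # stage 2: SRS of counties
        multistage_sample.extend(
            rng.sample(frame[state][county], k=3)              # stage 3: SRS of people
        )
print(multistage_sample)
```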
Samples have problems due to BIAS
- -undercoverage
- -non-response
- -misleading response
- -interviewer influences response
- -question ordering
- -question structure
- -wording of ?
undercoverage
–ind. with no chance of being selected (homeless, no phone)
non-response
–selected ind. refuse to answer (hangups, on vacation, don’t mail back)
misleading response
–ind. give inaccurate answer - have you cheated? do you wash hands? (private surveys avoid this better)
interviewer influences response
–rude, intimidating, subtle clues
question ordering
–answers change depending on order - ex. happiness question precedes debt question or vice versa
question structure
–open ended (unlimited answers - what is your fav. music?), closed question (limits responses - What is your fav. music btw country and rap?)
wording of ?
–leading phrases, loaded words, ambiguities that influence response
problems with observational studies
- -subjects choose which treatment to rec. or which group to belong to
- -lurking variables - influence the response variable
- -passive data collection: observing, measuring, counting, subjects are undisturbed
- -media often improperly attribute cause-effect conclusions to these
Experiment vocabulary
–subject
experiment - impose treatments on subjects rather than just observing them
–we determine if treatments cause change in response
subject - being tested - indiv. to which treatment is applied
response variable
explanatory variable
response - characteristic measured on each subject (whether has cancer or not)
explanatory - used to predict or explain changes in the response variable (drug being tested to see if it works on cancer patients)
factor
treatment
factor - planned explanatory variable
treatment - experimental condition applied to subject = value of factor
lurking variables
control
confounding
lurking variables - variables not included in the study design that affect the response variable
control - effort to REDUCE effects of lurking variables
confounding - situation in which effects of lurking variables cannot be distinguished from effects of factors (an uncontrolled lurking variable tied to the treatment groups = confounding exists)
Historical comparison experiment
- a study involves only one treatment
- treated subjects compared to untreated subjects from another study
- -not good bc LOTS of lurking variables and diff. time periods
Unreplicated experiment
- Study assigns 1 subject to treatment A and 1 to treatment B (only ONE in A and ONE in B)
- variation among subjects confounded with treatments; can’t evaluate magnitude of variation (confounding problem)
confounded experiment
group A treated diff. than group B in ways other than the treatment - even with multiple subjects in each group
4 principles of valid experiments
- control/comparison
- -control lurking variables - homogeneous subjects - control group used to measure placebo effect (ex. vaccine and placebo)
- -need at least 2 groups
- Randomization
- -neutralize lurking variables by randomly assigning subjects to treatments
- Replication
- -multiple subjects in each treatment group
- Double blinding
- -neither subjects nor ppl who evaluate them know which treatment each subject received
diagnostic bias
lack of realism
diagnostic bias
–evaluator's diagnosis of subjects is biased - ex. if doctor created the pill, doctor wants to believe it worked
lack of realism
- -realism is often compromised by controlled study conditions
- -solution: awareness of hidden bias and admit limitations of experiments
Hawthorne effect
non-compliance
Hawthorne effect
–ppl behave diff. in an experiment bc they are aware they are being watched
non-compliance
- -fail to submit to assigned treatment
- -refusal to follow protocol
Valid experimental design
- randomized controlled (RCE)
- -all subjects randomly assigned to treatments (any of treatments)
- -names in a hat
- randomized block design (RBD)
- -create blocks (ind. with same characteristics) - then within each block randomly assign treatments (like stratified sample but with experiments)
- -good bc reduces variation
- -STANDARD FOR GOOD EXPERIMENT: Comparison (2 groups), randomization, replication, double-blinding
- MATCHED PAIRS
- -special case of block
- -block: pair of ind. or pair of measurements on one ind.
- -ex. twins - each rec. one treatment; or 2 treatments on each ind.; or measure before and after treatment for each ind.
- -ex. test diet with exercise and diet without exercise
- --block: pair of identical twins, exp. var. - whether dieting includes exercise, response variable: cholesterol level
- --comparison: 2 treatments, randomization, replication (20 sets of twins)
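A minimal sketch of random assignment inside blocks, as in a randomized block design. The two blocks, the subject labels, and the even split are assumptions for illustration; the treatment names follow the diet example above.

```python
import random

rng = random.Random(3)  # seeded only for reproducibility

# Hypothetical blocks: subjects grouped by a shared characteristic.
blocks = {
    "male": ["m1", "m2", "m3", "m4"],
    "female": ["f1", "f2", "f3", "f4"],
}
treatments = ["diet_with_exercise", "diet_without_exercise"]

assignment = {}
for block, subjects in blocks.items():
    shuffled = subjects[:]      # copy so the original block list is untouched
    rng.shuffle(shuffled)       # randomization happens WITHIN each block
    half = len(shuffled) // 2
    for subject in shuffled[:half]:
        assignment[subject] = treatments[0]
    for subject in shuffled[half:]:
        assignment[subject] = treatments[1]

print(assignment)
```

With blocks of size 2 (a pair of twins per block), the same loop gives a matched pairs assignment.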
Histogram
- -horizontal line to cover range of data
- -divide range into classes of equal width
- -count # ind. in each class - construct bar over each class with HEIGHT being the count (frequency) or percentage of total (relative frequency)
Advantage over stem plots? data set can be any SIZE
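The histogram steps can be sketched as a text plot. The 14 data values and the class width of 10 are made up; the class labels assume whole-number data.

```python
# Hypothetical whole-number data and an assumed class width.
data = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 78, 82, 85, 91]
width = 10

# First class starts at the multiple of `width` at or below the minimum.
start = (min(data) // width) * width

# Count the individuals falling in each class.
counts = {}
for x in data:
    lower = start + ((x - start) // width) * width
    counts[lower] = counts.get(lower, 0) + 1

# Bar height as a percentage of the total (relative frequency).
percents = {lower: 100 * c / len(data) for lower, c in counts.items()}

for lower in sorted(counts):
    bar = "#" * counts[lower]
    print(f"{lower}-{lower + width - 1}: {bar} ({percents[lower]:.1f}%)")
```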
Stem and leaf plot
dot plot
Stem (all but last digit) listed vertically, leaf (last digit) to the right
dot plot
- -have x axis of values and dot by frequency
- -usually used for discrete quantitative
To interpret visual displays:
Shape
- –symmetric and bell shaped
- –right/left skewed (left skewed has long tail to left and hump on the right)
- –bimodal (two peaks)
- –flat or uniform (flat and same across)
center
- –the median (half both sides) (not influenced by outliers)
- –mean
- -mode
spread
–how varied is data? - RANGE - look at min. and max.
Center measurements
summarize QUANTITATIVE variables
center
–mode = value at “peak” - value with HIGHEST freq.
–median = the middle value, denoted by M, 1/2 area to right and 1/2 to left
–mean = the center of gravity
consumer alert - on news either median or mean can be called the “average”
- -if symmetrical they are approx. equal
- -median is “resistant” to outliers and long tails
- -mean has desirable properties for inference
- -use median if skewed or outliers are present and use mean if roughly symmetrical
notation to find mean
x(bar) = 1/n*(sum of xi)
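A quick sketch of the three center measures on made-up textbook costs with one large outlier, to show the mean being pulled toward the outlier while the median stays put.

```python
import statistics

# Hypothetical textbook costs; 300 is a deliberate outlier.
costs = [20, 25, 25, 30, 35, 40, 300]

mean = sum(costs) / len(costs)       # x(bar) = 1/n * (sum of xi)
median = statistics.median(costs)    # middle value of the sorted data
mode = statistics.mode(costs)        # value with the highest frequency

print(f"mean={mean:.2f}, median={median}, mode={mode}")
```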
Spread measurements
Range = max. - min. - HIGHLY affected by outliers
IQR - range occupied by middle 50% of data (3rd Q - 1st Q)
—highly clustered if small - if large compared to range it is less clustered (Q3 - Q1)
–IQR is resistant to outliers
IQR
- -1st Quartile - value with approx. 25% of observations below and 75% above
- -2nd quartile = the median
flag outliers - reasons
- if distribution is long tailed and value is legitimate: keep outlier
- if values produced under diff. conditions than rest of data set: remove outlier
- if value is mistake or typo:
- -correct if possible, otherwise remove
compute 1.5 x IQR - then flag values below Q1 - (1.5 x IQR) or above Q3 + (1.5 x IQR) as outliers
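A sketch of the 1.5 x IQR flagging rule on made-up data. Quartile conventions vary between textbooks and software; this one assumes Q1 and Q3 are the medians of the lower and upper halves (excluding the overall median when n is odd).

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention (an assumption)."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

data = [1, 3, 4, 5, 6, 7, 8, 9, 30]   # hypothetical data; 30 looks suspicious

q1, med, q3 = quartiles(data)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, med, q3, iqr, low_fence, high_fence, outliers)
```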
box plot
5 # summary --min. --Q1 --median --Q3 --Max (if Median to Q1 is SMALLER than median to Q3 can know it is right skewed)
box plot is made from 5 # summary
- -whiskers go to the most extreme non-flagged values (not to outliers)
- -flagged = outliers
advantage of box plot is can be used to compare several distributions next to each other easily
St. Dev.
measures for both overall spread and clustering
- –quantifies spread by measuring how far from mean
- -NOT resistant to outliers
- -should be paired with the mean
s = sqrt( (sum of (xi - x(bar))^2) / (n - 1) )
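A short sketch of the sample st. dev. formula done step by step on made-up data, checked against the standard library's `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements

n = len(data)
mean = sum(data) / n                                  # x(bar)
squared_devs = [(x - mean) ** 2 for x in data]        # (xi - x(bar))^2
s = math.sqrt(sum(squared_devs) / (n - 1))            # divide by n - 1, then root

print(f"mean={mean}, s={s:.4f}, library={statistics.stdev(data):.4f}")
```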
Normal distributions
within 1 s = 68%
within 2 s = 95%
within 3 s = 99.7%
so % of data in each band, from below -3 s up to above +3 s (one st. dev. per band):
.15, 2.35, 13.5, 34, 34, 13.5, 2.35, .15
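The 68-95-99.7 rule is a rounding of exact normal areas. A quick check using the standard normal from Python's `statistics.NormalDist` (Python 3.8+):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, st. dev. 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} s: {100 * area:.2f}%")
```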
Probability
- -use prob. to take sample data and make inference
- -game with car and goats - 2x more likely to win if switch (bc first pick is right only 1/3 of the time, so switching wins the other 2/3)
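A simulation sketch of the car-and-goats game, assuming the usual rules (3 doors, the host always opens a goat door that is not the player's pick). The function names and the 10,000 trials are illustrative choices.

```python
import random

def play(switch, rng):
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Host opens a goat door that is not the player's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, trials=10_000, seed=0):
    rng = random.Random(seed)  # seeded only for reproducibility
    return sum(play(switch, rng) for _ in range(trials)) / trials

stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay: {stay_rate:.3f}, switch: {switch_rate:.3f}")
```

The stay rate should land near 1/3 and the switch rate near 2/3, which is an empirical probability in the sense used later in these cards.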
random phenomenon
ind. outcome unpredictable but outcome from large # of reps follows a regular pattern (rolling a die)
sample space
event
sample space
–set of all possible outcomes (for # of dots on die)
event
–collection of possible outcomes (ex. we can write the event “rolling an odd number” as {1, 3, 5})
probability of outcome
proportion of times that an outcome occurs in many, many repetitions of random phenomenon
P(A) = 0 - will not happen
P(A) = .5 - 1/2 chance will, 1/2 chance won't
P(A) = 1 - will happen
Empirical probability
law of large numbers
Empirical
- -approximate by playing the game many times and OBSERVING the frequency of occurrence
- –find by DOING
law of large numbers
–as # trials (Repetitions) of experiment/game increase, the relative frequency gets closer to the theoretical prob. of the event
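A small simulation sketch of the law of large numbers: roll a fair die many times and track the relative frequency of "odd number", whose theoretical prob. is .5. The checkpoints and seed are arbitrary choices.

```python
import random

rng = random.Random(0)  # seeded only for reproducibility
checkpoints = [10, 100, 1_000, 10_000, 100_000]

odd_count = 0
rel_freq = {}
for n in range(1, checkpoints[-1] + 1):
    if rng.randint(1, 6) % 2 == 1:   # the event "rolling an odd number"
        odd_count += 1
    if n in checkpoints:
        rel_freq[n] = odd_count / n  # empirical probability after n rolls

for n, freq in rel_freq.items():
    print(f"{n:>7} rolls: relative frequency of odd = {freq:.4f}")
```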
Probability distribution
set of possible outcomes in sample space and prob. associated with each outcome (as a proportion or percent)
- -prob. must sum to 1
- -can be represented by table, formula or graph
random variable
continuous random variable
discrete random variable
random variable
–characteristics measured on each indiv. (cost, height, gender)
cont. random var.
- -variable that can take on any value in an interval so that all possible values cannot be listed (time, height, temp.)
discrete random var.
–var. whose possible values are a list of distinct values (gender, opinion, # arrests, shoe size)
2 types of prob. distributions for discrete variables
- discrete categorical
- -random var. = major
- -distribution table - bar graphs also used
- -(list the majors and the prob. of grad students in them underneath like in math 118)
- -or in a bar graph with list of categorical majors on x axis and prob. on y axis
- -DO compare percentages for outcomes…DON’T calc. measures of center/spread (no mean, med., st. dev., IQR)
- discrete quantitative variable (distributions)
- -random variable = household size
- -same in table but with numbers instead of categories and percentage prob. on the second line
- -or histogram - x axis is number persons in households and y axis is a percentage prob.
- -compare % and CAN calc. measures of center or spread
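A sketch of center and spread for a discrete quantitative distribution. The household-size probabilities below are invented for illustration (they are not real census figures); the mean is the prob.-weighted sum of the values.

```python
import math

# Hypothetical distribution table: household size -> probability.
household = {1: 0.25, 2: 0.35, 3: 0.20, 4: 0.15, 5: 0.05}

total = sum(household.values())                                   # must sum to 1
mean = sum(x * p for x, p in household.items())                   # weighted mean
variance = sum((x - mean) ** 2 * p for x, p in household.items())
st_dev = math.sqrt(variance)

print(f"total={total:.2f}, mean={mean:.2f}, st. dev.={st_dev:.3f}")
```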
distribution for cont. random variable
- -can take on any value within range of variable with no gaps
- -focus on prob. that value is in a specific interval
- -(Ex. prob. height is btw 67.5 in. and 68.5 in.)
- -histogram with intervals on x axis and % prob. rate on Y axis
- -can compare % for intervals and CAN calc. measures of center and spread
Prob. density curve
- -model prob. distribution with smooth curve
- -smooth curve is a model on or above the horizontal x axis
- -area under curve is 1
- -where curve is HIGH - data values are more dense
- -more accurate estimates of prob. than using histogram of sample data
prob. that x occurs in any interval is equal to AREA under curve for that interval
- -LOOK AT NOTES AGAIN
median of density curve - the value that divides the area of the density curve in half
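A sketch of "prob. = area under the curve" for the height interval from the earlier card, assuming heights follow a normal density with mean 68 in. and st. dev. 3 in. (both numbers are made up for illustration).

```python
from statistics import NormalDist

# Assumed model for height in inches (hypothetical parameters).
height = NormalDist(mu=68, sigma=3)

# Area under the curve between 67.5 and 68.5 = cdf(upper) - cdf(lower).
p_interval = height.cdf(68.5) - height.cdf(67.5)

# The median splits the area in half; for a symmetric normal curve it equals the mean.
area_left_of_mean = height.cdf(68)

print(f"P(67.5 < height < 68.5) = {p_interval:.4f}")
print(f"area to the left of 68 = {area_left_of_mean:.2f}")
```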
normal distribution names and notation
center
- -name - mean
- -for a density curve - μ (mu)
- -histogram notation is - x(bar)
spread
- -name - st. dev.
- -for density curve - σ (sigma)
- -for histogram notation is - s
standardization
allows us to compare diff. normal distributions
–math conversion of normally distributed variable to STANDARD normal variable
z = (x - mean) / st. dev.
–z score gives # of st. dev. above or below the mean of normal distribution
ex. z = (4122.5 - 3485) / 425 = 1.5 (means 1.5 st. dev. above mean)
–> higher z score means a better score on the test relative to the other test takers
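The standardization formula as a tiny function, using the numbers from the example card. The function name `z_score` is just an illustrative choice.

```python
def z_score(x, mean, st_dev):
    """Number of st. dev. that x lies above (+) or below (-) the mean."""
    return (x - mean) / st_dev

z = z_score(4122.5, mean=3485, st_dev=425)
print(z)  # 1.5 st. dev. above the mean
```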
4 big steps in Big Picture of stats
- Producing data
- exploratory data analysis
- Probability
- inference