Exam 1 Flashcards
A Geometric model
good for when we’re interested in # of Bernoulli trials until next success
Bernoulli trials
2 possible outcomes (success and failure), probability of success p is constant, the trials are independent
examples of Bernoulli trials
tossing a coin, shooting hoops
a Binomial model
when we’re interested in the number of successes in a certain number of Bernoulli trials
a Normal model
to approximate a Binomial when we expect at least 10 successes and at least 10 failures
Poisson model
when n is large and p is small. good approximation if n >= 20 with p <= 0.05, or n >= 100 with p <= 0.10
lambda
only parameter within Poisson model
good model for number of occurrences over given period of time
Poisson (its one parameter, lambda, is the mean of the distribution)
Exponential model
can model the time between two random events
mean time between two events
1 / lambda
when p is small for a large # of cases
Poisson model
when checking the probability of this many successes in a fixed number of trials
Binomial for Bernoulli
when asked how many trials until this happens
Geometric for Bernoulli
3 tips for sketching good Normal curve
(1) bell-shaped and symmetric around its mean: start at the middle and sketch outward to the right and left, (2) only draw out to 3 standard deviations on either side of the mean, (3) the point where the curve changes from curving downward to curving back up is called the inflection point and is one standard deviation away from the mean
tells how many standard devs a value is from mean
z score
Let y represent value corresponding to
outlying value indicated by a certain z score (e.g. high IQ example)
Table of standard normal distribution
Use it when you’re given a z score and need the proportion of values below it, or (in reverse) when you’re given a cutoff percentage and need the z score
Finding a percentile
Use the z-table to find the z score that has the given percentage of values below it, then solve for y if needed (y = mean + z * s.d.)
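A minimal sketch of the percentile steps above, assuming Python's standard-library `statistics.NormalDist` stands in for the printed z-table. The mean of 100 and s.d. of 15 are illustrative values only (in the spirit of the IQ example), not numbers from the notes.

```python
from statistics import NormalDist

def value_at_percentile(percent, mu, sigma):
    """Return y such that `percent` % of a Normal(mu, sigma) model lies below y."""
    z = NormalDist().inv_cdf(percent / 100)  # z score with that area below it
    return mu + z * sigma                    # un-standardize: y = mu + z * sigma

# Hypothetical model: mean 100, s.d. 15
print(value_at_percentile(90, 100, 15))
```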
Formula for IQR
Q3 - Q1
example of random phenomenon
flipping a coin, two possible outcomes; one toss of coin will consist of a ‘trial’
term for result of one ‘trial’
‘outcome’
term for collection of all possible ‘outcomes’
‘sample space’
definition of empirical probability
the single value that the long-run relative frequency of repeated independent events (with identical probs.) gets closer and closer to; the Law of Large Numbers says this settling-down happens
formula for empirical probability
P(A) = # of times A occurs / # of trials = relative frequency of occurrence of A in the long run [ex. red light/green light: after many days, P(green) = 0.35]
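A small simulation sketch of the long-run relative frequency idea. The true probability 0.35 echoes the green-light example; the trial counts, the seed, and the function name are arbitrary choices made for illustration.

```python
import random

def relative_frequency(p_true, n_trials, seed=0):
    """Simulate n_trials independent trials and return (# times A occurs) / (# trials)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_trials) if rng.random() < p_true)
    return hits / n_trials

# The relative frequency should settle near 0.35 as the number of trials grows
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(0.35, n))
```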
When can you NOT do empirical probability assignment
when you cannot repeat events
definition of theoretical probability
for when you cannot repeat events. comes from a mathematical model, not from observations or repetitions. (Ex: American roulette, if you bet on red what is the prob of winning? [18/38])
formula for theoretical probability
P(A) = # of outcomes in A / # of possible outcomes (when all outcomes are equally likely)
personal probability
subjective sense based on personal experience and guesswork
definition of formal probability
based on a set of axioms (=a statement we assume to be true) on how probability works
Rule 1
For any event A, 0 <= P(A) <= 1
Rule 2
P(S)=1, the probability of all possible outcomes of a trial must be 1
Rule 3: The complement rule
- the set of outcomes that are not in the event A is called the complement of A, denoted A^C
P(A^C) = 1 - P(A)
Addition Rule
P(A or B) = P(A) + P(B) - P(A and B)
Disjoint events
events that have no outcomes in common (and thus cannot occur together). Also called ‘mutually exclusive’
If events are disjoint, then P(A and B) = ??
0 (ONLY if the two events are disjoint)
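The complement rule, addition rule, and disjoint-events fact can be checked on a tiny sample space. This sketch assumes one roll of a fair six-sided die with equally likely outcomes; the events A, B, and C are made up for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed example)

def prob(event):
    """Theoretical probability: # outcomes in the event / # possible outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}  # roll is even
B = {5, 6}     # roll is greater than 4
C = {1, 3}     # disjoint from A

print(prob(sample_space - A), 1 - prob(A))          # complement rule
print(prob(A | B), prob(A) + prob(B) - prob(A & B)) # addition rule
print(prob(A & C))                                  # disjoint events
```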
Pause and do practice exercise on being in a relationship and in sports
Either on powerpoint or in textbook
Conditional probability [of B given A]
P(B | A) = P(A and B) / P(A)
Contingency table
used for conditional probability, comes up often
Pause and do practice exercise on being a girl and popular
Either in book or slideshow
Shortcut for finding probability of two independent events A and B
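P(A and B) = P(A) * P(B), because P(B | A) = P(B) when A and B are independent [general rule, always true: P(A and B) = P(A) * P(B | A)]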
If A and B are disjoint, P(A | B) = ??
0, and definitely not equal to P(A) (when P(A) > 0), so disjoint events cannot be independent
marginal probabilities on contingency table
the row and column totals on the edges of the table (not the grand TOTAL), each divided by the grand total
joint probability on contingency table
the interior cells of the table, where a row category and a column category meet: P(A and B)
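A sketch of joint, marginal, and conditional probabilities read off a contingency table. The counts below are invented for illustration (they are not the numbers from the practice exercise), and the function names are hypothetical.

```python
# Hypothetical counts: (sex, status) -> number of students
counts = {
    ("girl", "popular"): 30, ("girl", "not popular"): 20,
    ("boy", "popular"): 20,  ("boy", "not popular"): 30,
}
total = sum(counts.values())  # grand total

def joint(sex, status):
    """Joint probability: one interior cell divided by the grand total."""
    return counts[(sex, status)] / total

def marginal_sex(sex):
    """Marginal probability: a row total divided by the grand total."""
    return sum(n for (s, _), n in counts.items() if s == sex) / total

def marginal_status(status):
    """Marginal probability: a column total divided by the grand total."""
    return sum(n for (_, st), n in counts.items() if st == status) / total

def status_given_sex(status, sex):
    """Conditional probability: P(status | sex) = P(sex and status) / P(sex)."""
    return joint(sex, status) / marginal_sex(sex)

print(status_given_sex("popular", "girl"), marginal_status("popular"))
```

If the conditional probability differs from the marginal one, as it does with these made-up counts, the two variables are not independent.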
best diagram for working with conditional probability (probs. with ‘givens’ in them)
tree diagrams
conditioning event on a tree diagram
would be the earlier event, i.e. the one on the first set of branches
intersectional probability
** check out textbook! **
Reversing the conditioning
This means finding the probability of something working backwards through tree diagram; e.g. knowing something already happened, find the probability that they were this (binge drinker)
Bayes rule
for two branches: P(B | A) = P(A | B) * P(B) / [P(A | B) * P(B) + P(A | B^C) * P(B^C)]. a bit complicated, so it is probably best to set up a tree diagram
Reverse the condition [ctd]
Hard part is recognizing the regular probabilities and conditional probabilities from the problem. Basically you divide the probability of both (e.g. on tree diagram) by the probability of one (which might take some investigating)
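Reversing the conditioning can be sketched as "probability of both, divided by the total probability of the observed event", which is what the two-branch tree diagram computes. The three input probabilities below are hypothetical placeholders, and the function name is made up.

```python
def reverse_condition(p_b, p_a_given_b, p_a_given_not_b):
    """Return P(B | A) from the branch probabilities of a two-branch tree."""
    both = p_b * p_a_given_b              # P(B and A): follow the B branch
    other = (1 - p_b) * p_a_given_not_b   # P(not B and A): follow the other branch
    return both / (both + other)          # divide by P(A), the sum of both paths

# Hypothetical inputs: P(B) = 0.44, P(A | B) = 0.17, P(A | not B) = 0.09
print(reverse_condition(0.44, 0.17, 0.09))
```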
Pause and do a practice problem on reversing the conditioning
Binge drinker or from textbook
Random variable (denoted ‘X’)
Two types, discrete and continuous.
specific value (denoted ‘x’)
e.g. P(X=x)
discrete random variables
can take one of a countable number of distinct outcomes (ex. number of credit hours)
continuous random variables
can take any numeric value within a range of values (ex. cost of books this term)
A probability model for a random variable consists of what two things?
(1) collection of all possible values, (2) the probabilities that the values occur
Expected value
E(X) of a discrete random variable can be found by summing the products of each possible value and the prob. that it occurs: E(X) = Σ x * P(x) (write this by hand)
Variance, formula
Var(X) = Σ (x - μ)^2 * P(x), where μ = E(X) (write this by hand)
Standard deviation, formula
SD(X) = sqrt(Var(X))
Which formulas apply only to discrete random variables?
Expected value = Σ x * P(x) // variance = Σ (x - μ)^2 * P(x) // standard deviation = sqrt(variance)
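A short sketch of the three discrete formulas applied to one probability model. The model itself (values 0, 1, 2 with probabilities 0.5, 0.3, 0.2) is made up for illustration.

```python
from math import sqrt

def expected_value(model):
    """E(X) = sum of x * P(x) over all possible values."""
    return sum(x * p for x, p in model.items())

def variance(model):
    """Var(X) = sum of (x - mu)^2 * P(x), with mu = E(X)."""
    mu = expected_value(model)
    return sum((x - mu) ** 2 * p for x, p in model.items())

def standard_deviation(model):
    """SD(X) = square root of Var(X)."""
    return sqrt(variance(model))

model = {0: 0.5, 1: 0.3, 2: 0.2}  # hypothetical model: value -> probability
print(expected_value(model), variance(model), standard_deviation(model))
```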
Shifting and rescaling random variables by a constant
Adding or subtracting a constant from the data shifts the mean but doesn’t change the variance or standard deviation
Rescaling random variables by a constant
Multiplies the mean by that constant and the variance by the square of that constant
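The shifting and rescaling rules can be checked numerically. This sketch builds the model of a*X + c from a made-up model for X; the constants a = 3 and c = 10 are arbitrary.

```python
def mean(model):
    return sum(x * p for x, p in model.items())

def var(model):
    mu = mean(model)
    return sum((x - mu) ** 2 * p for x, p in model.items())

def transform(model, a, c):
    """Probability model of a*X + c (a nonzero): same probabilities, new values."""
    return {a * x + c: p for x, p in model.items()}

X = {0: 0.5, 1: 0.3, 2: 0.2}  # hypothetical model
shifted = transform(X, 1, 10)  # X + 10: mean shifts, variance unchanged
scaled = transform(X, 3, 0)    # 3X: mean times 3, variance times 3^2

print(mean(shifted), var(shifted))
print(mean(scaled), var(scaled))
```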
Combining random variables
- The Sum or the difference of two random variables is also a random variable
- the mean of the sum of two random variables is the sum of the means
- the mean of the difference of two random variables is the difference of the means
- If the random variables are independent, the variance of their sum or difference is always the sum of the variances *** this is important
Now, what if the two random variables are not independent?
-can you still compute the variance
yes, however you have one additional term here known as covariance
(in this course given definition but not asked to calculate)
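A sketch that checks the combining rules for two independent random variables by building the full model of X + Y and X - Y. Both models are made up, and independence is assumed when the probabilities are multiplied.

```python
from itertools import product

def mean(model):
    return sum(v * p for v, p in model.items())

def var(model):
    mu = mean(model)
    return sum((v - mu) ** 2 * p for v, p in model.items())

def combine(mx, my, op):
    """Model of op(X, Y) for INDEPENDENT X and Y."""
    out = {}
    for (x, px), (y, py) in product(mx.items(), my.items()):
        v = op(x, y)
        out[v] = out.get(v, 0.0) + px * py  # independence: P(x and y) = P(x) * P(y)
    return out

X = {0: 0.5, 1: 0.3, 2: 0.2}  # hypothetical models
Y = {1: 0.5, 3: 0.5}
total = combine(X, Y, lambda x, y: x + y)
diff = combine(X, Y, lambda x, y: x - y)

print(mean(total), mean(diff))  # sum of means, difference of means
print(var(total), var(diff))    # both equal Var(X) + Var(Y)
```

Note that the variances add even for the difference; with dependent variables the extra covariance term would be needed.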
if see new random variable
A random variable assumes any of several different numeric values as a result of some random event. Random variables are denoted by a capital letter such as X.
Random variable
assumes a value based on the outcome of a random event
Pause and do screenshot at 1.46 pm
** in class worksheet #1 **
Pause and do screenshot 4/30
** q2 mail-order **
Bernoulli trials – what are 3 conditions?
- there are two possible outcomes (success and failure)
- the prob of success, p, is constant
- the trials are independent
Examples of Bernoulli trials?
- tossing a coin
- looking for defective products rolling off an assembly line
- shooting free throws in a basketball game
- and many more…
First thing you need to identify in a problem involving Bernoulli
‘what is a trial’
Geometric model – tells us what in terms of Bernoulli trials?
# of Bernoulli trials until the first success
What single parameter is specified in the Geometric model
p (probability of success)
Pause and do example of Geometric model [biased coin]
*** screenshot p433
Helpful rules for solving problems on Geometric model
multiplication rule, complement rule
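A minimal sketch of how the multiplication rule and complement rule give Geometric probabilities. The success probability 0.25 is a placeholder, not the biased-coin value from the example, and the function names are made up.

```python
def geometric_pmf(k, p):
    """P(X = k): the first success happens on trial k (k = 1, 2, 3, ...)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    q = 1 - p
    return q ** (k - 1) * p  # multiplication rule: k-1 failures, then a success

def geometric_at_most(k, p):
    """P(X <= k): at least one success within the first k trials."""
    return 1 - (1 - p) ** k  # complement rule: 1 - P(no success in k trials)

p = 0.25  # hypothetical probability of success
print(geometric_pmf(3, p), geometric_at_most(3, p))
```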
Independence – the 10% condition
Rule that allows us to proceed if we don’t have independent trials in Bernoulli. Still OK to proceed as long as the sample is smaller than 10% of the population [ex universal blood donors]
Pause to do blood donor practice problem
** screenshot at 4.36pm **
Binomial probability model
No longer counts the trials but counts the number of successes within the fixed number of Bernoulli trials.
Two parameters that define the Binomial model
n, number of trials // p, probability of success
Formulas for mean and s.d. in Binomial model
μ = np // s.d. = sqrt(npq), where q = 1 - p
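A sketch of the Binomial model: the probability of exactly k successes, plus the mean and s.d. formulas. The values n = 20 and p = 0.06 are illustrative assumptions, not taken from the screenshot.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(X = k): exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_mean_sd(n, p):
    """Mean np and standard deviation sqrt(npq), with q = 1 - p."""
    q = 1 - p
    return n * p, sqrt(n * p * q)

n, p = 20, 0.06  # hypothetical number of trials and success probability
print(binomial_pmf(2, n, p))
print(binomial_mean_sd(n, p))
```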
Pause and do example on blood donors
** 4.41 screenshot **
Poisson model
when n is very large and when p is small. has only one parameter, lambda
What do we mean by n large p small
n >= 20 with p <= 0.05, or n >= 100 with p <= 0.10
Pause and do example about leukemia clusters
** screenshot 4.46 ***
Steps for calculating using Poisson
steps:
- Verify Bernoulli
- Be aware of whether the population is finite or not
- Check whether sampling is with replacement or without
- Check the 10 percent condition
- Then recognize this is actually a Binomial model setup, but because p is small and n is large you would want to use Poisson (with lambda = np)
- The result will be quite close to the exact answer you would get under independence
When adding up Poisson probabilities on exams (e.g. for P(X <= k)), people often forget which term?
X=0
Be sure to know 0 factorial (0! = 1)
do some practice probs from hw 2
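A sketch of the Poisson calculation, including the X = 0 term that is easy to forget, compared against the exact Binomial answer. The values n = 1000 and p = 0.002 are invented; lambda = np is used as the Poisson parameter.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = e^(-lambda) * lambda^k / k!   (note 0! = 1)."""
    return exp(-lam) * lam ** k / factorial(k)

def poisson_at_most(k, lam):
    """P(X <= k): the sum must start at X = 0."""
    return sum(poisson_pmf(i, lam) for i in range(0, k + 1))

def binomial_at_most(k, n, p):
    """Exact Binomial P(X <= k), for comparison."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(0, k + 1))

n, p = 1000, 0.002  # hypothetical: n large, p small
lam = n * p         # Poisson parameter lambda = np
print(poisson_at_most(2, lam), binomial_at_most(2, n, p))
```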
The Normal model
Can also say continuous random variable, bell shaped, unimodal, symmetric
t/f: there is NOT a Normal model for every mean and s.d.
false – there IS a normal model for every mean and standard deviation
Why do we use Greek letters for mean and standard deviation
because these values of the mean and s.d. do not come from data, they are numbers (called parameters) that specify the model
Summaries of data, like the sample mean and s.d., are written with Latin letters. Such summaries are called,
statistics
Parameters, “what connects a model to,
the real world”
Pause and watch video on parameter versus statistic
What does the procedure called ‘standardization’ refer to??
change to the standard Normal model (mean 0, s.d. 1) using z = (y - mean) / s.d., then go back to the original units if needed
68-95-99.7 rule
In a Normal model, about 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3
If z score is positive
tells us y is above the mean
If z score is negative
tells us y is below the mean
what does z score tell us
how many s.d.’s away from the mean an observation is
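A sketch of standardizing a value and of checking the 68-95-99.7 rule, assuming Python's `statistics.NormalDist` for Normal areas. The mean 100 and s.d. 15 are illustrative only.

```python
from statistics import NormalDist

def z_score(y, mu, sigma):
    """How many standard deviations y is from the mean (sign gives direction)."""
    return (y - mu) / sigma

def area_within(k):
    """Proportion of a Normal model within k standard deviations of the mean."""
    std = NormalDist()  # standard Normal: mean 0, s.d. 1
    return std.cdf(k) - std.cdf(-k)

print(z_score(130, 100, 15))  # positive: above the mean
print(z_score(85, 100, 15))   # negative: below the mean
for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```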
The Normal probability plot is nearly straight, so
Normal model applies
Histogram is unimodal and somewhat symmetric, so
Normal model applies
Binomial is discrete, giving probabilities for specific counts, whereas Normal is
a continuous random variable that can take on any value
How to use the Normal model to approximate P(X=10) when X has the binomial model?
Solution: Use continuity correction, i.e. approximate P(X = 10) by the Normal probability P(9.5 < X < 10.5)
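A sketch of the continuity correction, comparing the exact Binomial P(X = 10) with the Normal area between 9.5 and 10.5. The values n = 50 and p = 0.2 are assumptions chosen so that np and nq are both at least 10.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.2  # hypothetical Binomial parameters (np = 10, nq = 40)
q = 1 - p

# Exact Binomial probability of exactly 10 successes
exact = comb(n, 10) * p ** 10 * q ** (n - 10)

# Normal model with the same mean and s.d. as the Binomial
approx_model = NormalDist(mu=n * p, sigma=sqrt(n * p * q))

# Continuity correction: treat the count 10 as the interval 9.5 to 10.5
approx = approx_model.cdf(10.5) - approx_model.cdf(9.5)

print(exact, approx)
```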
The Uniform distribution
probability is spread evenly over the region a to b; the area under the prob density curve is 1, and the density outside a to b is 0
Value of straight line at top of Uniform Distribution curve is calculated by
1 / (b-a)
Lambda
only parameter of exponential distribution
When your values go from zero to infinity on the non-negative real line
exponential distribution
Parameters of uniform model
endpoints a and b
Connection between exponential distribution and Poisson distribution
ex: exponential models the time between arrivals (continuous); Poisson models the number of individuals who arrive in a given period (discrete) [identify which variable is discrete, which is continuous, go from there]
What is a variable
One piece of information that you’re measuring for every observation and that can be different for every observation
Pause and do some practice problems that involve using the Z table
** see book chapter [SAT scores] ** also could watch a video on it
A Probability involving “in between”
Find the z scores of both endpoints, then subtract the area below the smaller z from the area below the larger z
Binomial approximations work in two cases, what are they?
(1) n is large, p is small and use Poisson, (2) n is large, p is not necessarily small and use Normal model [when using Normal, can ask the Q: which Normal model?]
Pause and do practice problem with Tennessee red cross
* middle of lecture notes*
The Success/Failure condition
A Binomial distribution can be approximated by an appropriate Normal distribution if we expect at least 10 successes and 10 failures (i.e. np >= 10 and nq >= 10); then standardize, get z score
Pause and redo 16.51
You got it wrong, idk what you were doing lol ***
Do practice problem on continuity correction
Uniform distribution and sub intervals
any sub interval of same length has equal probabilities
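A sketch of the Uniform model facts: the density height 1 / (b - a) and the equal probability of equal-length subintervals. The endpoints a = 0 and b = 10 are arbitrary, and the function names are made up.

```python
def uniform_density(a, b):
    """Height of the flat density curve on [a, b]; total area is 1."""
    return 1 / (b - a)

def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniform on [a, b]; density is 0 outside [a, b]."""
    overlap = max(0.0, min(hi, b) - max(lo, a))  # part of [lo, hi] inside [a, b]
    return overlap * uniform_density(a, b)       # area = width * height

a, b = 0, 10  # hypothetical endpoints
print(uniform_density(a, b))
print(uniform_prob(2, 5, a, b), uniform_prob(6, 9, a, b))  # same length, same probability
```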
Trend for exponential
small values are more favored; as values increase, the prob. of observing them decreases exponentially
Difference between exponential and Poisson
- so in hw and in the exams; could have q like average # of customers arriving is equal to five per hour
- then next q is what is the prob. that the next customer will arrive within ten minutes
- information is related to Poisson random variable
- in that case you would want to use the exponential distribution to answer the probability question
More on exponential and Poisson
- both models are like complementary to each other
- one is modeling the #s ; other is modeling the time between those numbers
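A sketch of the customer-arrival question above: arrivals average five per hour (the Poisson side), and the Exponential model with the same lambda gives the chance that the next arrival comes within ten minutes. It uses the standard Exponential formula P(T <= t) = 1 - e^(-lambda * t), which the notes do not spell out.

```python
from math import exp

def exponential_cdf(t, lam):
    """P(T <= t) for waiting time T with rate lam (events per unit time)."""
    if t < 0:
        return 0.0
    return 1 - exp(-lam * t)

rate_per_hour = 5                    # average # of customers per hour
mean_wait_hours = 1 / rate_per_hour  # mean time between arrivals = 1 / lambda
ten_minutes = 10 / 60                # same time unit as the rate (hours)

print(mean_wait_hours)
print(exponential_cdf(ten_minutes, rate_per_hour))
```

The main thing to watch is units: the time must be expressed in the same unit as the rate.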
Relative frequency
relates a count to the whole of the data you’re looking at: the frequency divided by the total
Frequency table
a table whose first column displays each distinct outcome and second column displays that outcome’s frequency
relative frequency table
first column displays each distinct outcome and second column displays that outcome’s relative frequency [just shows relative frequencies instead of the counts in a freq table]
Area principle
The human eye automatically compares areas, so the area taken up by each part of a display should be in proportion to the value (e.g. relative frequency) it represents
Bar chart
displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison [all bars must have the same width and stay true to the area principle]; gaps are deliberately placed between the bars
Pie chart presents,
each category as a slice of a circle so that each slice has a size that is proportional to the fraction of the whole in that category
Contingency table
columns refer to values of one categorical variable, rows refer to values of other categorical variable
Cell of a contingency table
each cell gives the count for a combination of values of the two variables. ex: there were 528 third-class ticket holders who died
bottom row is totals and represents the marginal distribution of one variable, whereas right column,
represents marginal distribution of other variable
column percentage
percentage of each column total (conditions on the column variable)
row percentage
percentage of each row total (conditions on the row variable)
Conditional distributions
help us decide if there is any ASSOCIATION between variables we’re looking for, or none
‘Exploring the relationship between two categorical variables’
Pause and watch a video on this, now
Important point on Titanic: ‘survival’ has two categories
“condition on the other variable” [ex. conditional distribution of variable “class” conditioned on variable “survival” = alive]
Association is not between ‘class and alive’ and ‘class and dead’ – association is between class and variable survival
another note: we avoid using the word correlation, if there is association say ‘dependent’
Pause and watch video on conditional distributions
yeah
Difference between correlation and association
association: for categorical variables, also for any type of relationship that might not be linear // correlation: a quantity that measures only one type of association, linear (between quantitative variables)
Segmented bar chart
Like pie chart but in a bar
Be careful not to confuse side by side bar charts
not the same; in side-by-side blue bars represent one distribution, red bars represent another
Is conditional distribution a variable or a number?
- it is neither a variable nor a number
- what it is, is a collection of percentages
- these three percentages together are what make up a conditional distribution
if this were asked on a test, for example, you would first state the condition [“we are looking at the male group”], then under this condition state the three numbers [you would circle all three of these values]
How to check if it’s a side by side or Not
- to see if that is the case, can look at percentages and see if they add up to 100%
- if not—that’s your clue that you are not looking at three separate distributions here
- notice blue bars together add up to 100; red bars together add up to 100—so you have actually two distributions here
Bins or classes
what we divide our range of numerical data into for histograms
Choosing bin width
the same data can be presented with different bin widths (the slides feature two versions); a gap means no occurrences in that range
Relative frequency histogram
vertical axis is relative frequency, the freq divided by total [will not be asked to draw but yes to interpret]
Stem and leaf displays
first column (the stem) is the leftmost digit(s), second column (the leaves) shows the remaining digit of each value
Uniform distribution
histogram where all the bins have the same frequency or close; will be flat
Things to think about before you draw:
- is the variable quantitative?
- is the answer to the survey question or result of the experiment a number whose units are known?
Displays for quantitative data
histograms, stem-and-leaf diagrams, dotplots
Displays for categorical data
bar and pie charts
Understanding the spread of a graph. What is a mode?
A mode is a hump or high-frequency bin. One mode = unimodal, two = bimodal, three or more = multimodal.
If data is unimodal and symmetric,
can treat as normally distributed
Once you have identified the mode, next step is,
to identify the symmetry of the data
A histogram is skewed right if
the longer tail is on the right side of the mode
A histogram is skewed left if
the longer tail is on the left side of the mode [also sometimes called negatively skewed but doesn’t mean values= negative]
Identifying unusual features
might tell us something interesting or exciting about the data. Should always mention any straggler or outliers that stand away from body of the distribution
Outliers
data points that are further from remaining bulk of data set. Will later classify as mild outliers and far extreme outliers
Examples of outliers
Income of a CEO, temperature of a person with a high fever, elevation at Death Valley
Median
the center of the data values; half of the data values are to the left of the median and half are to the right of the median [for symmetric right in the middle]
How do we compute the median
Sort the data first. If n is odd, the median is the value in position (n+1)/2. If n is even, split the data in half and take the avg of the two middle values
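A minimal sketch of the median rule for odd and even sample sizes. The data lists in the usage lines are made up.

```python
def median(values):
    """Median: sort first, then take the middle value (or average the two middle values)."""
    data = sorted(values)  # always sort before finding the median
    n = len(data)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                    # odd n: position (n+1)/2, counting from 1
    return (data[mid - 1] + data[mid]) / 2  # even n: average of the two middle values

print(median([7, 1, 4]))
print(median([8, 2, 6, 4]))
```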
Percentiles
Divide the data in one hundred groups. The n’th percentile is the data value such that n percent of the data lies below that value
Range
the difference between max and minimum values [it is sensitive to outliers]
The Interquartile Range
Q1 is the 25th percentile and Q3 is the 75th percentile. IQR = Q3 - Q1, the difference between upper quartile and lower
Middle fifty percent of the data, its spread
IQR
5-number summary
minimum, Q1, median, Q3, and maximum
Benefits and drawbacks of the IQR
not sensitive to outliers (like range is). Reasonable summary of spread. Shows where typical values are, except for the case of a bimodal distribution. Not great for a general audience since most do not know what it is.
Anytime you use the IQR, you also in addition want to use
the median (the IQR pairs with the median; the standard deviation pairs with the mean)
Pause and do practice problem
on the IQR, from hw
Boxplot
a chart that displays the 5-number summary and the outliers. Shows the IQR (middle yellow box). Dashed lines are called fences, outside the fences lie the outliers. Above and below the box are whiskers that extend to the most extreme data values within the fences. Line inside the box shows the median
Finding the fences of a boxplot
lower fence = Q1 - 1.5 * IQR, upper fence = Q3 + 1.5 * IQR. Any data beyond the fences (past the whiskers) we call outliers
Far outliers
Farther than 3 IQRs from the quartiles
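A sketch of the fence rules, taking Q1 and Q3 as given so that it does not depend on any particular quartile method. The quartile values and the function names are illustrative assumptions.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper fences: k IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def classify(value, q1, q3):
    """Label a value as a far outlier, an outlier, or neither."""
    lower, upper = fences(q1, q3, 1.5)          # ordinary fences
    far_lower, far_upper = fences(q1, q3, 3.0)  # far outliers: more than 3 IQRs out
    if value < far_lower or value > far_upper:
        return "far outlier"
    if value < lower or value > upper:
        return "outlier"
    return "not an outlier"

q1, q3 = 10, 20  # hypothetical quartiles, so IQR = 10
print(fences(q1, q3))
for v in (15, 40, 60):
    print(v, classify(v, q1, q3))
```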
If you have odd number of data points,
include median in both upper half and lower half of data (?)
Five W’s for flight cancellations
Who? Months, What? Percentage of flights cancelled at US airports, when? 1994-2013, where? US, how? bureau of transportation statistics data
Analyzing data for flight cancellations, steps
Identify the variable: percent of flight cancellations at US airports (quantitative, units are percentages) // data will be summarized in a histogram, numerical summary, and boxplot
Pause and watch extra vid
on IQR and stuff
Analyzing flight cancellations [ctd]
Describe shape, center, and spread of distribution. Report on symmetry, number of modes, and any gaps or outliers. Should also mention any CONCERNS about the data
Flight cancellations: The Mean
what most people think of as the average. Add up all the numbers and divide by number of them. The mean is the “balancing point” [where the histogram will balance perfectly like a metronome]
Mean vs. Median
For symmetric distributions, the mean and the median are equal; the balancing point is at the center. For skewed distributions, the tail “pulls” the mean towards it more than it pulls the median. The mean is more sensitive to outliers than the median.
T/f: The mean is less sensitive to outliers than the median
F [the mean is MORE sensitive to outliers, so for skewed data the median is a better measure of center]
If you choose median as measure of center
You also want to report mean and standard deviation; if it’s symmetric you just have to report those two though
What is the variance a measure of?
How far the data is spread out from the mean. Difference from the mean is y - y-bar. To make it positive square it. Will mostly be used to find the standard deviation which is the square root of variance
If your data is closely packed around the mean, the SD turns out to be,
small; if very spread apart, then SD turns out to be large
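A sketch of the variance and standard deviation of data, assuming the usual sample formula that divides the sum of squared deviations by n - 1 (the notes do not state the divisor). Both data sets are made up to show a tightly packed and a spread-out case with the same mean.

```python
from math import sqrt

def sample_variance(data):
    """Sum of squared deviations (y - y_bar)^2, divided by n - 1 (assumed divisor)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data values")
    y_bar = sum(data) / n
    return sum((y - y_bar) ** 2 for y in data) / (n - 1)

def sample_sd(data):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(data))

tight = [9, 10, 10, 11]   # closely packed around the mean of 10
spread = [1, 5, 15, 19]   # same mean, much more spread out
print(sample_sd(tight), sample_sd(spread))
```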
Thinking about variation
Statistics is about variation so spread is an important fundamental concept; measures of spread tell us what we DON’T know about the data.
Choose the right tool
Use histograms to compare two or three groups. Use boxplots to compare many groups.
Pause and find an example of when,
it’s appropriate to use a histogram, and boxplot respectively
Treat outliers with attention and care
Local or global [outliers], especially in a time series. Investigate if the outliers are errors or remarkable.
Use a timeplot to track,
trends over time
Re-express or transform data for better understanding
Can transform skewed data distributions to symmetric ones, can help to compare spreads of different groups
When to use segmented bar charts
when writing journal articles better to use bar charts than pie
What to tell if encounter histogram, stem-and-leaf, boxplot
Describe modality, symmetry, outliers
What to tell re: center and spread
Median and IQR if not symmetric. Mean and standard deviation if symmetric. For unimodal symmetric data, expect IQR > s; if not, check for errors, skewness, outliers.
What to tell re: unusual features
For multiple modes, possibly split the data into groups. When there are outliers, report the mean and s.d. with and without the outliers. Note any gaps in the data set.
If asked to summarize data [specific example: fuel efficiency]
Plan: summarize the distribution of the car’s fuel efficiency. // Variable: mpg for 100 fill ups, quantitative // Mechanics: show a histogram (described as fairly symmetric, low outlier)
Which to report mean or median [specific example: fuel efficiency]
the mean and median are close. report the mean and standard deviation
Conclusions in summary of data [specific example: fuel efficiency]
Distribution is unimodal and symmetric. Mean is 22.4 mpg. Low outlier may be investigated, but limited effect on the mean. s= 2.45; from one filling to the next, fuel efficiency differs from the mean by an avg of about 2.45 mpg
What can go wrong in analyzing data
(1) if you have categorical data do not make a histogram, (2) if you have categorical data make a bar chart, but it would not be appropriate to talk about its shape because the categories can be rearranged in any order, (3) choose an appropriate bin width
What can go wrong in analyzing data [ctd, 2]
(1) don’t report too many decimal places, (2) don’t round in the middle of a calculation, (3) for multiple modes think about separating groups (4) beware of outliers, the mean and standard deviation are sensitive to outliers
What can go wrong in analyzing data [ctd, 3]
(1) Do a reality check, don’t blindly trust calculator, (2) Sort before finding median and percentiles, (3) don’t worry about small differences in quartile calculation, (4) don’t compute numerical summaries for a categorical variable!! [mean of a SS# is meaningless]
What can go wrong [ctd,4]
Make a picture, make a picture, make a picture
Pause and watch a video on identifying the variables
Like in that ch 1 homework problem
Pause and watch a video on being wary of outliers
If they are errors, remove them