Statistics and Distributions Flashcards
Distributions
- Representation of the way values tend to vary across a single attribute
- Usually presented as a histogram
- Where is the data concentrated? Which values are less likely? Which is most likely?
Which single value best represents the data?
Central Tendency
Context dependent
- On a histogram: affects the location on the x-axis
Mean
arithmetic mean:
sum of values/number of values
Median
Middle value of sorted data
- Resistant to outliers and skew
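A small sketch of why the median is called resistant to outliers. The data values are hypothetical, chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical data: five typical values, then the same values plus one outlier.
values = [2, 3, 3, 4, 5]
with_outlier = values + [100]

print(mean(values), median(values))              # 3.4 3
print(mean(with_outlier), median(with_outlier))  # 19.5 3.5
```

The single extreme value pulls the mean from 3.4 to 19.5, while the median barely moves.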
Variability
How far does the data spread away from the mean?
Affects the width of the histogram
Standard Deviation
Roughly the typical distance of a value from the mean (formally, the square root of the average squared distance from the mean)
If we pick a random value from the data, how far should we expect it to be from the mean?
sd = sqrt(sum((x - mu)^2) / N)
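The population formula can be sketched in a few lines. The data set is a hypothetical one picked so the numbers come out round, and the mean absolute deviation is included only for contrast with the standard deviation.

```python
import math
from statistics import pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, chosen so the mean is 5

mu = sum(data) / len(data)
# Square each deviation first, then sum, divide by N, and take the square root.
sd = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))
# Plain average of absolute distances from the mean, for comparison.
mean_abs_dev = sum(abs(x - mu) for x in data) / len(data)

print(mu, sd, mean_abs_dev)   # 5.0 2.0 1.5
print(pstdev(data))           # 2.0, the library's population standard deviation
```

The standard deviation (2.0) is larger than the plain average distance (1.5) because squaring gives more weight to values far from the mean.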
Percentiles and Quartiles
25th Percentile : 1st Quartile
50th Percentile : 2nd Quartile
75th Percentile : 3rd Quartile
IQR and Outliers
Interquartile Range : Q3-Q1
Lower/Upper Fences: [Q1 - (3/2) * IQR, Q3 + (3/2) * IQR]
Outlier: A value that falls outside of the fences.
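A minimal sketch of the fence calculation on hypothetical data. Quartile conventions differ between textbooks and tools; this example assumes the "inclusive" interpolation method of Python's `statistics.quantiles`, so other tools may report slightly different quartiles for the same data.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]   # hypothetical data with one extreme value

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)                # 3.25 7.75 4.5
print(lower_fence, upper_fence)   # -3.5 14.5
print(outliers)                   # [50]
```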
Boxplots
Excellent tool to display and compare measures of variability
They display:
- Median
- IQR
- Fences
- Outliers
- Range
Normal Distribution
- Gaussian Distribution or Bell Curve
Fundamental to statistics
Countless occurrences in nature
Has a number of useful properties
Normal Distribution Properties
- Symmetric
- Mean = Median = Mode
- 68-95-99.7 Rule
- Foundation of the Central Limit Theorem
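The rule in the list above (about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean) can be checked against the normal CDF. This sketch uses a standard normal (mean 0, sd 1); the percentages are the same for any normal distribution.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)

# Probability of landing within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sd: {p:.4f}")
```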
Random Experiment
A process that results in an outcome
Outcome
The value of the result of a single experiment
Sample Space
The set of all possible outcomes for an experiment
Event
A subset of the sample space
Probability
A number between 0 and 1 that measures the chance of an event occurring
Sample Space
A sample space of an experiment is the set of all possible outcomes
Ex: Sample Space of a single die roll is: {1, 2, 3, 4, 5, 6}
Event
An event, usually denoted by a single capital letter, is a subset of the sample space.
Ex: If you roll two dice, some possible single-outcome events include:
- (1,1), (1,2), (2,1), (1,6), (6,6)
Probability
For a single event A, when all outcomes are equally likely, the probability of A occurring, P(A), is calculated as:
P(A) = number of outcomes in which A occurs / total number of possible outcomes
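As an illustration, the counting formula can be applied by enumerating the sample space for two fair dice. The event "the dice sum to 7" is a hypothetical example, and the calculation assumes every pair is equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair dice: 36 equally likely (first, second) pairs.
sample_space = list(product(range(1, 7), repeat=2))

# Event A (hypothetical example): the two dice sum to 7.
event_a = [outcome for outcome in sample_space if sum(outcome) == 7]

p_a = Fraction(len(event_a), len(sample_space))
print(len(event_a), len(sample_space), p_a)   # 6 36 1/6
```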
Addition Rule
Addition Rule states:
P(A or B) = P(A) + P(B) - P(A and B)
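A quick check of the rule, P(A or B) = P(A) + P(B) - P(A and B), on a single fair die. The events A and B and the helper `prob` are hypothetical, introduced only for this sketch.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one fair die

def prob(event):
    """Probability of an event (a set of outcomes), assuming equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

a = {2, 4, 6}   # A: the roll is even
b = {4, 5, 6}   # B: the roll is greater than 3

direct = prob(a | b)                        # P(A or B), counted directly
by_rule = prob(a) + prob(b) - prob(a & b)   # addition rule

print(direct, by_rule)   # 2/3 2/3
```

Subtracting P(A and B) removes the outcomes 4 and 6, which would otherwise be counted twice.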
Multiplication Rule
Two events are said to be independent if the outcome of one does not depend on the outcome of the other. Otherwise, they are dependent.
The multiplication rule states:
P(A and B) = P(A) * P(B, given that A occurred) = P(A) * P(B|A)
For independent events, this is simply:
P(A and B) = P(A) * P(B)
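Both forms of the rule can be sketched with exact fractions and confirmed by counting. The card and dice scenarios are hypothetical examples, and the deck is simplified to "ace" versus "other".

```python
from fractions import Fraction
from itertools import permutations, product

# Dependent events: draw two cards without replacement, both are aces.
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)   # one ace is already gone
p_two_aces = p_first_ace * p_second_ace_given_first

# Check by counting ordered pairs of distinct cards.
deck = ["ace"] * 4 + ["other"] * 48
pairs = list(permutations(deck, 2))
counted_aces = Fraction(sum(1 for c1, c2 in pairs if c1 == c2 == "ace"), len(pairs))

# Independent events: two fair dice both show a 6.
p_two_sixes = Fraction(1, 6) * Fraction(1, 6)
rolls = list(product(range(1, 7), repeat=2))
counted_sixes = Fraction(sum(1 for r in rolls if r == (6, 6)), len(rolls))

print(p_two_aces, counted_aces)      # 1/221 1/221
print(p_two_sixes, counted_sixes)    # 1/36 1/36
```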
Complements
P(A) + P(not A) = 1
Deterministic Sampling
Rather than randomizing, you choose people by a fixed rule, such as taking the first people who walk by
Uniform Random Sampling
Use software to pick a group of n people at random, so that every person is equally likely to be chosen
Random Sampling
Select members of the population by chance rather than by a fixed rule
From random sampling, what do we know about the sample mean?
The sample mean is the mean of the data sampled, and approximates the true mean.
Probability Distribution
the calculated likelihood of each possible event occurring without simulation or conducting the experiment
Empirical Distribution
the proportion of times each value is observed in a simulation or experiment, relative to the total number of observations
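The contrast between the two distributions can be sketched with a fair die. The seed and the number of rolls are arbitrary choices for this example.

```python
import random
from collections import Counter
from fractions import Fraction

faces = range(1, 7)

# Probability distribution: computed from the model of a fair die, no rolling needed.
probability = {face: Fraction(1, 6) for face in faces}

# Empirical distribution: proportion of simulated rolls that landed on each face.
rng = random.Random(0)   # fixed seed (arbitrary) so the run is repeatable
n_rolls = 10_000
counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
empirical = {face: counts[face] / n_rolls for face in faces}

for face in faces:
    print(face, round(float(probability[face]), 4), empirical[face])
```

Each empirical proportion is the count for that face divided by the total number of rolls, so the proportions sum to 1 and should land close to 1/6.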
Law of large numbers
As our sample size grows larger, the sample mean gets closer to the population mean, so the data represents the population more accurately
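A small simulation sketch of this idea: the mean of simulated fair die rolls should drift toward the expected value of 3.5 as the number of rolls grows. The seed and sample sizes are arbitrary choices.

```python
import random

rng = random.Random(1)   # fixed seed (arbitrary) so the run is repeatable
population_mean = 3.5    # expected value of a fair six-sided die

sample_means = {}
for n in (10, 1_000, 100_000):
    rolls = [rng.randint(1, 6) for _ in range(n)]
    sample_means[n] = sum(rolls) / n
    # Print the sample size, the sample mean, and its distance from 3.5.
    print(n, sample_means[n], abs(sample_means[n] - population_mean))
```

Any single small sample can still land far from 3.5; the law only describes what happens as the sample size keeps growing.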
statistic
a calculated number which describes a characteristic of a sample
parameter
a value that describes a characteristic of a population; it is usually unknown and is estimated by a statistic
statistical inference
a conclusion about a population made based on data from one or more random samples.
Central Limit Theorem
This theorem states:
Upon taking sufficiently large samples, the distribution of the sample means will approximate a normal distribution, regardless of the distribution sampled from (provided that distribution has a finite variance).
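A simulation sketch of the theorem, assuming an exponential population (rate 1, so mean 1 and sd 1), which is strongly skewed and far from normal. The sample size, number of samples, and seed are arbitrary choices.

```python
import random
from statistics import mean, pstdev

rng = random.Random(2)   # fixed seed (arbitrary) so the run is repeatable
sample_size = 50
n_samples = 2_000

# Draw many samples from the skewed population and record each sample's mean.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
    for _ in range(n_samples)
]

center = mean(sample_means)     # expected to be near the population mean, 1
spread = pstdev(sample_means)   # expected to be near 1 / sqrt(50), about 0.141
# For a normal distribution, roughly 68% of values lie within one sd of the mean.
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / n_samples

print(center, spread, within_one_sd)
```

If the sample means are approximately normal, `within_one_sd` should come out near 0.68 even though the individual exponential values are heavily skewed.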