Chapter 3 (Summarizing Distributions) Flashcards
Mode
the most frequently occurring value of a variable
Central tendency
values that are central in the distribution of a variable; describes what is typical
Variation
describes how dispersed the data are across the range of possible values; describes what is atypical
When a curve is bell-shaped (normal distribution), where does the mean, median, and mode lie?
They are all equal and lie in the middle of the distribution
Sample (arithmetic) mean or average
most common measure of centrality; applies only to data where adding and dividing the values makes sense (so not to nominal data); minimizes the sum of squared residuals (replacing it with any other number would increase the total squared deviation)
Sample mean formula
xbar = (1/n) * sum of xi from i=1 to n: the sum of all values of x, divided by the number of observations or sample size (n)
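The formula above can be sketched directly in Python; the data values are hypothetical:

```python
# Sample mean: sum of all x_i divided by the sample size n.
def sample_mean(xs):
    """Return the arithmetic mean (1/n) * sum(x_i)."""
    return sum(xs) / len(xs)

data = [2, 4, 6, 8]        # hypothetical sample
print(sample_mean(data))   # 5.0
```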
Weighted average
each observation gets a weight of 1/n, the proportion of the sample that it represents
Dummy or binary variable
a qualitative variable that indicates the presence or absence of an attribute; must be coded as 1=present and 0=absent; also has a mean despite being qualitative
Mean of a dummy variable
the proportion of the sample with the associated attribute
How do you describe central tendency for qualitative variables?
(1) Create a dummy variable for each level of the qualitative variable (2) Summarize the mean of the dummy variable
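The two steps above can be sketched as follows; the color data are a hypothetical sample:

```python
# One 0/1 dummy variable per level of the qualitative variable;
# the mean of each dummy is the proportion of the sample with that level.
colors = ["red", "blue", "red", "green", "red"]   # hypothetical sample

for level in sorted(set(colors)):
    dummy = [1 if c == level else 0 for c in colors]
    proportion = sum(dummy) / len(dummy)   # mean of the dummy
    print(level, proportion)               # blue 0.2, green 0.2, red 0.6
```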
Percentiles
a way of describing how extreme a particular observation is (median is not extreme); the s-th percentile is the value of x such that s% of the data lies below it
How do you get the median?
(1) Order x from smallest to largest (2) If n is odd, the median is the middle-most value. If n is even, the median is the average of the two middle-most values.
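The two-step recipe above, sketched in Python:

```python
def median(xs):
    """Sort the data, then take the middle value (n odd)
    or the average of the two middle values (n even)."""
    s = sorted(xs)           # step (1): order x smallest to largest
    n = len(s)
    mid = n // 2
    if n % 2 == 1:           # step (2): n odd
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2   # step (2): n even

print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```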
Centrality of the median
the value that lies between two halves of all possible values
Residual (ei)
a measure of variation; the difference between an actual value and the proposed “typical” value (ei = xi - xbar, where xbar is the sample mean)
Centrality of the sample mean
the sample mean is the value that is, on average, as close as possible to the rest of the data; it is subject to leverage by very large or very small values (i.e. outliers)
How can you deal with outliers?
(1) Remove them from the dataset (2) Choose statistics that are robust to outliers like the median instead of the mean
How is the residual a measure of variation?
it measures how dispersed each value xi is about the center xbar
Bessel’s correction (n-1)
a statistical adjustment to make the sample variance and standard deviation more accurate or unbiased estimators of the population variance and standard deviation, particularly for small values of n
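Bessel's correction shows up as the n - 1 divisor in the sample variance; a minimal sketch with a hypothetical dataset:

```python
import math

def sample_variance(xs):
    """Sum of squared residuals divided by n - 1 (Bessel's correction)."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical sample, mean 5
print(sample_variance(data))             # 32/7 = 4.571...
print(math.sqrt(sample_variance(data)))  # sample standard deviation
```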
Interquartile range (IQR)
IQR = x75 - x25; robust to outliers because, like the median, it is a percentile-based measure
Sample standard deviation
square root of the sample variance
Range
R= x100 - x0 = max(xi) - min(xi); not robust to outliers
Percentile quintets
x0, x25, x50, x75, x100 (minimum, first quartile, median, third quartile, maximum)
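The quintets, IQR, and range can be computed with the standard library; exact quartile values depend on the interpolation method, and the data here are hypothetical:

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 200]   # hypothetical sample; 200 is an outlier

# quartiles x25, x50, x75 (the "inclusive" interpolation method)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                      # robust: the outlier barely matters
rng = max(data) - min(data)        # not robust: dominated by the 200
print(min(data), q1, q2, q3, max(data))  # 1 4.0 7.0 10.0 200
print(iqr, rng)                          # 6.0 199
```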
Covariance
Similar to variance but measures how two variables vary together (instead of one); has no equivalent of the standard deviation
Correlation or Pearson’s Correlation Coefficient (r)
a unitless statistic (just a number, good for interpretation) that is always between -1 and 1; the covariance between x and y divided by the product of the standard deviations of x and y
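A minimal sketch of the formula above; the 1/(n - 1) factors in the covariance and standard deviations cancel, so they can be left out:

```python
import math

def pearson_r(xs, ys):
    """Covariance of x and y divided by the product of their
    standard deviations (the n - 1 factors cancel)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # perfect positive correlation
print(pearson_r([1, 2, 3], [6, 4, 2]))  # perfect negative correlation
```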
Positive correlation
when r > 0, higher values of x tend to occur with higher values of y, and vice versa
Negative correlation
when r < 0, higher values of x tend to occur with lower values of y, and vice versa
No correlation
when r = 0, there is no linear relationship between the values of x and y
Perfect correlation
when r = +/- 1, the values of x and y can be perfectly predicted from one another
How are correlation (r) and dependence related?
the r value is a numerical measurement of dependence in the data: close to -1 means strong negative dependence, close to +1 means strong positive dependence, close to 0 means a lack of linear dependence (approximately “independent,” though r = 0 does not rule out nonlinear dependence)
Population parameters
properties of the population distribution of which the sample statistics are analogues; the population counterparts of the sample statistics, with the same interpretation
Correspondences between sample statistics and population parameters
sample mean and population mean, sample variance and population variance, sample covariance and population covariance, sample correlation and population correlation
Sampling distribution
shows every possible result a statistic can take in every possible (hypothetical) sample from a population and how often each result occurs; observations are unique samples and variables are statistics
Empirical distribution
the distribution of the observed data; a very good estimate of the population distribution that gets closer as the data becomes more representative and n grows
Bootstrapping
simulating samples using the empirical distribution
How do you implement the bootstrap?
(1) Randomly draw a new sample of the same size from the existing sample, with replacement (2) Repeat this hundreds or thousands of times to create a collection of bootstrap samples (3) Compute your statistic in each bootstrap sample; the distribution of these values approximates the sampling distribution
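The three steps above can be sketched with the standard library; the original sample is hypothetical and the seed is fixed only for reproducibility:

```python
import random
import statistics

random.seed(0)                     # reproducible sketch

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical original sample, mean 5
B = 2000                           # number of bootstrap samples

# Steps (1)-(2): resample with replacement, same size, many times.
boot_means = [
    statistics.mean(random.choices(data, k=len(data)))
    for _ in range(B)
]

# Step (3): boot_means approximates the sampling distribution of the mean.
print(statistics.mean(boot_means))   # close to the sample mean, 5.0
print(statistics.stdev(boot_means))  # bootstrap estimate of the standard error
```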
Asymptotic behavior
how samples behave when n is large
Asymptotic behavior of bootstrap
it gives us a sense of what the sampling distribution might look like, though it is centered around the sample statistic rather than the population parameter
Law of Large Numbers (LLN)
if the observations are independent draws from the population (a representative sample), then as n becomes large, the sample mean becomes a very close approximation of the population mean
Central limit theorem (CLT)
given a population with a finite mean µ and a finite non-zero variance σ², the sampling distribution of the sample mean approaches a normal distribution with mean µ and variance σ²/n as the sample size n increases
Useful properties of the normal or gaussian distribution
(1) symmetrical about the mean (2) the mean, median, and mode coincide (3) quantiles are closely related to the standard deviations (empirical rule): 68% of data within 1 sd, 95% within 2 sd, 99.7% within 3 sd
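The empirical-rule percentages in property (3) can be checked with the standard library's normal distribution:

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Probability mass within k standard deviations of the mean.
for k in (1, 2, 3):
    p = z.cdf(k) - z.cdf(-k)
    print(k, round(p, 4))   # 1 0.6827, 2 0.9545, 3 0.9973
```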
Standard normal distribution
a normal distribution with a mean of 0 and a standard deviation of 1
What is the difference between standard deviation and standard error (CLT)?
SD measures variation within a sample, while SE measures variation of a statistic (e.g. the sample mean) between samples; since SE = SD/sqrt(n), the SD is always bigger (for n > 1)
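The SD/SE relationship from the CLT, sketched on a hypothetical sample:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical sample
sd = statistics.stdev(data)        # variation within the sample
se = sd / math.sqrt(len(data))     # CLT: variation of the mean between samples
print(sd, se)                      # SE is smaller than SD for n > 1
```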