Chapter 3 (Summarizing Distributions) Flashcards
Mode
the value of a variable that occurs most often
Central tendency
values that are central in the distribution of a variable; describes what is typical
Variation
describes how dispersed the data is over the range of possible values and what is atypical
When a curve is bell-shaped (normal distribution), where do the mean, median, and mode lie?
They are all equal and lie in the middle of the distribution
Sample (arithmetic) mean or average
most common measure of centrality; applies only to data where adding and dividing the values makes sense (numeric, not nominal); minimizes the sum of squared deviations (if the mean were replaced with any other number as the center, the variance computed around it would increase)
Sample mean formula
the sum of the values x_i for i = 1 to n, divided by the number of observations or sample size (n)
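As a minimal sketch of the formula on this card, the following computes a sample mean by summing and dividing. The function name `sample_mean` and the data are made up for illustration.

```python
def sample_mean(xs):
    """Sum the observations and divide by the sample size n."""
    n = len(xs)
    if n == 0:
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(xs) / n


scores = [2, 4, 4, 5, 10]          # hypothetical sample, n = 5
mean_scores = sample_mean(scores)  # (2 + 4 + 4 + 5 + 10) / 5
print(mean_scores)                 # prints 5.0
```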
Weighted average
a sum in which each observation is multiplied by a weight; in the sample mean, each observation gets a weight of 1/n, the proportion of the sample that it represents
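The 1/n weighting can be checked with a short sketch. The helper `weighted_average` is a hypothetical name, and the code assumes the weights already sum to 1.

```python
import math


def weighted_average(xs, weights):
    """Multiply each observation by its weight and add up the products.

    Assumes the weights sum to 1 (an assumption of this sketch).
    """
    if len(xs) != len(weights):
        raise ValueError("need exactly one weight per observation")
    return sum(w * x for x, w in zip(xs, weights))


scores = [2, 4, 4, 5, 10]      # hypothetical sample
n = len(scores)
equal_weights = [1 / n] * n    # every observation represents 1/n of the sample

weighted = weighted_average(scores, equal_weights)
plain = sum(scores) / n

# With equal weights the weighted average matches the ordinary sample mean
# (compared with a tolerance because of floating-point rounding).
print(math.isclose(weighted, plain))
```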
Dummy or binary variable
a qualitative variable that indicates the presence or absence of an attribute; must be coded as 1=present and 0=absent; also has a mean despite being qualitative
Mean of a dummy variable
the proportion of the sample with the associated attribute
How do you describe central tendency for qualitative variables?
(1) Create a dummy variable for each level of the qualitative variable (2) Summarize the mean of the dummy variable
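The two steps above can be sketched as follows. The variable `colors` and the helper `dummy` are invented for illustration; the mean of each dummy is the share of the sample at that level.

```python
colors = ["red", "blue", "red", "green", "red"]  # hypothetical qualitative variable


def dummy(values, level):
    """Step 1: code 1 where the attribute is present and 0 where it is absent."""
    return [1 if v == level else 0 for v in values]


# Step 2: the mean of each dummy is the proportion of the sample at that level.
proportions = {}
for level in sorted(set(colors)):
    d = dummy(colors, level)
    proportions[level] = sum(d) / len(d)

print(proportions)
```

Here 3 of the 5 observations are "red", so the mean of the "red" dummy is 3/5.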
Percentiles
a way of describing how extreme a particular observation is (the median, the 50th percentile, is not extreme); the s-th percentile is the value of x such that s% of the data lie below it
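A rough sketch of the idea: count the share of the data lying strictly below a value. The helper `percent_below` is a hypothetical name, and using "strictly below" is an assumption; textbooks and software differ on ties and interpolation.

```python
def percent_below(xs, value):
    """Return the percentage of observations strictly below `value`."""
    if not xs:
        raise ValueError("need at least one observation")
    return 100 * sum(1 for x in xs if x < value) / len(xs)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical sample
share = percent_below(data, 9)          # 8 of the 10 values are below 9
print(share)                            # prints 80.0
```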
How do you get the median?
(1) Order x from smallest to largest (2) If n is odd, the median is the middle-most value. If n is even, the median is the average of the two middle-most values.
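The procedure above can be sketched directly. The function name `median` and the sample values are chosen for illustration.

```python
def median(xs):
    """Order the data, then take the middle value (odd n) or the
    average of the two middle values (even n)."""
    ordered = sorted(xs)           # step 1: smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median is undefined for an empty sample")
    mid = n // 2
    if n % 2 == 1:                 # odd n: the middle-most value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two


odd_median = median([7, 1, 3])      # ordered: 1, 3, 7
even_median = median([7, 1, 3, 5])  # ordered: 1, 3, 5, 7
print(odd_median, even_median)      # prints 3 4.0
```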
Centrality of the median
the value that splits the ordered data into two halves: half of the observations lie at or below it and half lie at or above it
Residual (ei)
a measure of variation; the difference between an actual value and the proposed “typical” value (e.g. the sample mean)
Centrality of the sample mean
the sample mean is the value that is, on average, as close to the rest of the data as possible (in terms of squared distance); it is subject to leverage by unusually large or small values (i.e. outliers)
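A small sketch of residuals around a proposed center, assuming the sign convention e_i = x_i minus the center (the cards do not fix a sign). The helper names and data are made up. It illustrates that residuals around the mean sum to zero and that the mean gives a smaller sum of squared residuals than another candidate center.

```python
def residuals(xs, center):
    """Residual for each observation: actual value minus the proposed center."""
    return [x - center for x in xs]


def sum_of_squares(xs, center):
    """Total squared distance of the data from the proposed center."""
    return sum(e * e for e in residuals(xs, center))


data = [2, 4, 4, 5, 10]       # hypothetical sample
mean = sum(data) / len(data)  # 5.0

res = residuals(data, mean)   # [-3.0, -1.0, -1.0, 0.0, 5.0]
print(sum(res))               # prints 0.0

ss_mean = sum_of_squares(data, mean)  # 9 + 1 + 1 + 0 + 25
ss_other = sum_of_squares(data, 4)    # 4 + 0 + 0 + 1 + 36, using the median as center
print(ss_mean < ss_other)             # prints True
```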
How can you deal with outliers?
(1) Remove them from the dataset (2) Choose statistics that are robust to outliers like the median instead of the mean
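To illustrate option (2), this sketch adds one extreme value to a made-up sample and compares how far the mean and the median move. It uses `statistics.mean` and `statistics.median` from the Python standard library.

```python
import statistics

incomes = [30, 32, 35, 38, 40]   # hypothetical sample
with_outlier = incomes + [1000]  # the same sample plus one extreme value

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mean is pulled far toward the outlier; the median barely moves.
print(mean_before, mean_after)
print(median_before, median_after)
```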