2: Exploratory Data Analysis: Single Variable Flashcards
cases
objects described by a set of data (companies, subjects, customers)
label
variable used in some data sets to distinguish different cases
variable
characteristic of a case
distribution
of a variable tells us what values it takes and how often it takes these values
distribution of categorical variable
lists the categories and gives either the count or the percent of cases that fall in each category
stemplot
stem-and-leaf plot. gives a quick picture of the distribution's shape while including the actual numerical values in the graph. separate each observation into a stem (all but the final, rightmost digit) and a leaf (the final digit). write the stems in a vertical column with the smallest at the top and draw a vertical line to the right of this column. write each leaf in the row to the right of its stem, in increasing order out from the stem
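The construction steps in this card can be sketched in Python. This is a minimal illustration that assumes non-negative whole-number observations; the function name `stemplot` and the scores are made up for the example.

```python
from collections import defaultdict

def stemplot(values):
    """Return the text rows of a stemplot for non-negative integers."""
    rows = defaultdict(list)
    for v in sorted(values):             # sorting puts leaves in increasing order
        stem, leaf = divmod(v, 10)       # stem = all but the final digit, leaf = final digit
        rows[stem].append(leaf)
    lines = []
    for stem in range(min(rows), max(rows) + 1):   # keep stems that have no leaves
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

scores = [23, 25, 31, 38, 38, 42, 47, 59]   # made-up data
plot = stemplot(scores)
print("\n".join(plot))
```

Each row shows one stem, the vertical bar, then that stem's leaves from smallest to largest.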
histogram
breaks range of values of a variable into classes and displays only the count or percent of the observations that fall into each class. classes = equal width.
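A small Python sketch of the counting behind a histogram. It assumes equal-width classes that include their left endpoint and exclude their right endpoint; the function name, the class boundaries, and the data are all illustrative.

```python
def histogram_counts(values, start, width, n_classes):
    """Count the observations falling in each equal-width class.

    Class i covers start + i*width up to (but not including) start + (i+1)*width.
    """
    counts = [0] * n_classes
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_classes:         # ignore values outside the covered range
            counts[idx] += 1
    return counts

data = [3, 7, 12, 15, 18, 21, 22, 29, 35, 38]   # made-up data
counts = histogram_counts(data, start=0, width=10, n_classes=4)

# Text version of the bars: one '#' per observation in the class
for i, c in enumerate(counts):
    print(f"{i * 10:>2}-{i * 10 + 10:<2} {'#' * c}")
```

Only the class counts survive; the individual values are no longer visible, which is the main difference from a stemplot.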
tails
extreme values of a distribution
modes
major peaks in a distribution
time plot
of a variable plots each observation against the time at which it was measured. time is on the horiz. scale of the plot and the variable measured is on the vert. scale
mean vs. median
mean is average value.
(x1 + x2 + ... + xn) / n
median is middle value.
(1) if number of observations is odd – median's LOCATION can be found by counting (n+1)/2 observations up from bottom of the ordered list
(2) if even – median is the mean of the two center observations in the ordered list. location (n+1)/2 falls halfway between these two observations
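A minimal Python sketch of the mean and of the (n+1)/2 location rule for the median. The function names and the sample data are illustrative assumptions, not part of the card.

```python
def mean(xs):
    """Sum of the observations divided by how many there are."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value, found via the location (n + 1) / 2 in the ordered list."""
    ordered = sorted(xs)
    n = len(ordered)
    location = (n + 1) / 2               # 1-based position counted from the bottom
    if location.is_integer():            # odd n: a single middle observation
        return ordered[int(location) - 1]
    lo = int(location)                   # even n: location falls between two observations
    return (ordered[lo - 1] + ordered[lo]) / 2

odd_data = [9, 2, 30, 4, 7]    # n = 5, location = 3
even_data = [9, 2, 4, 7]       # n = 4, location = 2.5

print(mean(odd_data), median(odd_data))
print(median(even_data))
```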
quartile
upper quartile = median of the upper half of the data. lower quartile = median of lower half of the data
pth percentile
the value that has p percent of the observations fall at or below it
five number summary
of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written from smallest to largest.
Min Q1 M Q3 Max
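A hedged Python sketch of the five-number summary using the "median of each half" definition of the quartiles. It assumes that when n is odd the overall median is left out of both halves; that is one common convention, and software packages often use slightly different quartile rules. Function names and data are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(xs):
    """Return (Min, Q1, M, Q3, Max)."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]               # observations below the median's location
    upper = ordered[half + n % 2:]       # observations above it (skips the median if n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

data = [1, 3, 4, 6, 8, 9, 12, 15, 20]   # made-up data
summary = five_number_summary(data)
print(summary)
```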
boxplot
graph of five-number summary
interquartile range IQR
distance b/w first and third quartiles. IQR = Q3-Q1
1.5 X IQR rule for outliers
observation = outlier if it falls MORE than 1.5 X IQR above third quartile or below first quartile
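The rule can be sketched in a few lines of Python. The quartiles are passed in as arguments here; the values used below come from the median-of-each-half definition applied to made-up data, and the function name is an assumption for illustration.

```python
def iqr_outliers(xs, q1, q3):
    """Flag observations more than 1.5 x IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    flagged = [x for x in xs if x < low_fence or x > high_fence]
    return flagged, (low_fence, high_fence)

data = [1, 3, 4, 6, 8, 9, 12, 15, 40]   # made-up data with Q1 = 3.5, Q3 = 13.5
flagged, fences = iqr_outliers(data, q1=3.5, q3=13.5)
print(fences, flagged)
```

Observations exactly on a fence are not flagged, because the rule says MORE than 1.5 X IQR.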
standard deviation
measures spread by looking at how far the observations are from their mean.
variance
s^2 of a set of observations is the average of the squares of the deviations from their mean, where the "average" divides the sum of squared deviations by n-1 rather than n.
(1) Work out the Mean (the simple average of the numbers)
(2) Then for each number: subtract the Mean and square the result (the squared difference).
(3) Then add up those squared differences and divide by n-1
standard deviation
= square root of the variance
degrees of freedom
the number n-1 is called the degrees of freedom of the variance or standard deviation
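A short Python sketch of the variance and standard deviation with the n-1 divisor. The function name and data are illustrative; the standard library's `statistics.variance` also divides by n-1 and is used only as a cross-check.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n                              # step 1: the mean
    squared_devs = [(x - xbar) ** 2 for x in xs]    # step 2: squared differences
    return sum(squared_devs) / (n - 1)              # step 3: divide by degrees of freedom

data = [2, 4, 4, 4, 5, 5, 7, 9]    # made-up data, mean = 5
s2 = sample_variance(data)
s = math.sqrt(s2)                  # standard deviation = square root of the variance

print(s2, s)
print(statistics.variance(data))   # cross-check against the standard library
```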
properties of standard deviation
(1) s measures spread about the mean and should be used only when the mean is chosen as the measure of center
(2) s = 0 only when there is no spread. this happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger
(3) s, like the mean, is not resistant. a few outliers can make s very large.
Which is better for describing a skewed distribution or a distribution with strong outliers: five number summary, mean, or std deviation?
five number summary
linear transformation
changes the original variable x into the new variable xnew given by this equation:
xnew = a + bx
they don’t change the shape of the distribution
effects of linear transformation
(1) multiplying each observation by a positive number b multiplies both measures of center (mean and median) and measures of spread (interquartile range and std dev) by b
(2) adding same number a (pos or neg) to each observation adds a to measures of center and to quartiles and other percentiles – but does NOT change measures of spread
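Both effects can be checked numerically. The sketch below uses made-up data with a = 32 and b = 1.8 (the Celsius-to-Fahrenheit conversion, chosen only as a familiar example) and the standard library's `statistics` functions.

```python
import math
import statistics

x = [12.0, 15.0, 19.0, 21.0, 30.0]   # made-up observations
a, b = 32.0, 1.8                      # illustrative constants for xnew = a + b*x

scaled = [b * v for v in x]           # multiply each observation by b
shifted = [a + v for v in x]          # add a to each observation

# Multiplying by b multiplies center and spread by b
print(statistics.mean(scaled), b * statistics.mean(x))
print(statistics.stdev(scaled), b * statistics.stdev(x))

# Adding a shifts the center but leaves the spread unchanged
print(statistics.median(shifted), a + statistics.median(x))
print(statistics.stdev(shifted), statistics.stdev(x))
```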
density curve
overall pattern of a distribution. has a total area of 1 underneath it.
normal distributions
are described by bell-shaped, symmetric, unimodal density curves. the mean μ and std dev σ completely specify the Normal distribution.
Mean vs. std dev in normal distribution
mean = center of symmetry
std dev = distance from mean to the change-of-curvature points on either side
68-95-99.7 rule
In the Normal distribution with mean μ and std dev σ:
Approx 68% of the observations fall within 1 x (std dev) of the mean
Approx 95% of the observations fall within 2 x (std dev) of the mean
Approx 99.7% of the observations fall within 3 x (std dev) of the mean
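The three percentages can be recovered from the Normal curve itself. This sketch relies on the standard identity that the area within k standard deviations of the mean equals erf(k / sqrt(2)); the function name is an illustrative choice.

```python
import math

def prob_within(k):
    """Area under a Normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 1))   # percent within k std devs
```

The result does not depend on the particular mean and std dev, which is why one rule covers every Normal distribution.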
z-score
standardized value: subtract the mean of the distribution and then divide by the std dev
z = (x - μ) / σ
tells us how many standard devs the original observation falls away from the mean (and in which direction)
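A minimal Python sketch of standardizing a value. The mean of 100 and std dev of 15 are made-up numbers for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean (sign gives direction)."""
    return (x - mu) / sigma

above = z_score(130, mu=100, sigma=15)   # two std devs above the mean
below = z_score(85, mu=100, sigma=15)    # one std dev below the mean
print(above, below)
```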
frequency distribution table
- frequency (f): number of times we observe an event
- relative frequency (f/n): # of times event takes place / total events
- cumulative freq: running count of the frequencies of a particular value and all preceding values (sum of the frequencies)
- cum. relative freq: cumulative freq for a particular value in relation to the total (sum rel freqs)
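The four columns of a frequency distribution table can be built with a short Python sketch. The ratings are made-up data and the function name is an assumption for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Rows of (value, f, f/n, cumulative f, cumulative f/n), ordered by value."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        f = counts[value]
        cumulative += f                  # running total of the frequencies
        rows.append((value, f, f / n, cumulative, cumulative / n))
    return rows

ratings = [1, 1, 1, 1, 2, 2, 3, 3]   # made-up data, n = 8
table = frequency_table(ratings)
for row in table:
    print(row)
```

The last row's cumulative frequency equals n and its cumulative relative frequency equals 1.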
measures of central tendency for cat. variables
median (if cat variables can be ranked)
mode
measures of central tendency for quant variables
mean
median if lots of outliers
calculate median
- order data from low to high
- look at location - (n+1)/2
- if at 5.5, then average 5th and 6th values
median provides a ______ reasonable measure of central tendency when distributions are skewed or have outliers
median provides a MORE reasonable measure of central tendency when distributions are skewed or have outliers
mean is _____ sensitive to outliers
mean is MORE sensitive to outliers than the median
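A quick numerical illustration using Python's standard `statistics` module. The data and the added extreme value are made up.

```python
import statistics

data = [10, 12, 13, 15, 20]     # made-up data
with_outlier = data + [200]     # same data plus one extreme observation

mean_before, mean_after = statistics.mean(data), statistics.mean(with_outlier)
median_before, median_after = statistics.median(data), statistics.median(with_outlier)

print(mean_before, mean_after)       # the mean is pulled far toward the outlier
print(median_before, median_after)   # the median barely moves
```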
if distribution is exactly symmetric, then mean and median
are the same
IQR as a measure of spread is ____ useful to describe skewed distributions
NOT
2 “sides” of a skewed distribution have different spreads
standard deviation is _____ a good measure when the distribution is highly skewed
NOT