2: Exploratory Data Analysis: Single Variable Flashcards by Caitlin O'Donnell

cases

objects described by a set of data (companies, subjects, customers)

label

variable used in some data sets to distinguish different cases

variable

characteristic of a case

distribution

of a variable tells us what values it takes and how often it takes these values

distribution of categorical variable

lists the categories and gives either the count or the percent of cases that fall in each category

stemplot

stem-and-leaf plot. Gives a quick picture of the shape of a distribution while including the actual numerical values in the graph. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
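The construction steps above can be sketched in a few lines of Python. This is only an illustration: the function name `stemplot` and the sample scores are made up, and the sketch assumes non-negative whole numbers so that the leaf is simply the last digit.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot for non-negative integers (hypothetical helper)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # stem: all but the final digit; leaf: final digit
        leaves[stem].append(leaf)   # sorted input keeps leaves in increasing order
    rows = []
    # Walk every stem from smallest to largest so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row}")
    return rows

scores = [12, 15, 21, 23, 23, 38, 40, 41]  # made-up data
for row in stemplot(scores):
    print(row)
```

Stems with no leaves are still printed, which is one reasonable choice because an empty row shows a gap in the distribution.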

histogram

breaks the range of values of a variable into classes of equal width and displays only the count or percent of the observations that fall into each class.
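As a rough sketch of the counting behind a histogram, the snippet below sorts observations into equal-width classes. The function name `class_counts`, its parameters, and the data are all assumptions chosen for illustration; each class is taken to include its lower edge and exclude its upper edge.

```python
def class_counts(values, low, width, n_classes):
    """Count observations in n_classes equal-width classes starting at low.

    Class k covers low + k*width up to (but not including) low + (k+1)*width.
    Values outside all classes are ignored in this sketch.
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - low) // width)  # index of the class that contains v
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

data = [3, 7, 8, 12, 14, 15, 15, 21, 29]  # made-up data
counts = class_counts(data, low=0, width=10, n_classes=3)
for k, c in enumerate(counts):
    # One star per observation gives a crude text histogram.
    print(f"{k * 10:>2}-{k * 10 + 9:<2} {'*' * c}")
```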

tails

extreme values of a distribution

modes

major peaks in a distribution

time plot

of a variable plots each observation against the time at which it was measured. Time goes on the horizontal scale of the plot and the variable being measured goes on the vertical scale.

mean vs. median

mean is the average value:
mean = (x1 + x2 + ... + xn) / n

median is the middle value of the ordered list.

(1) if the number of observations n is odd: the median is the center observation; its location is found by counting (n+1)/2 observations up from the bottom of the list
(2) if n is even: the median is the mean of the two center observations in the ordered list; the location (n+1)/2 falls halfway between them
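A minimal sketch of both calculations, with made-up data and hand-written helper functions (`mean` and `median` here are illustrative, not taken from any particular course material):

```python
def mean(xs):
    """Add the observations and divide by how many there are."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the ordered list, following the odd/even rule."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 0-based index mid is the (n+1)/2-th observation counting from 1.
        return ordered[mid]
    # Even n: (n+1)/2 lands between two observations, so average them.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [2, 9, 4, 7, 5]   # made-up data, n = 5
even_data = [2, 9, 4, 7]     # made-up data, n = 4
print(mean(odd_data), median(odd_data))
print(mean(even_data), median(even_data))
```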

quartile

upper quartile = median of the upper half of the data. lower quartile = median of lower half of the data
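One possible way to compute the quartiles as medians of the two halves is sketched below. It assumes a common textbook convention that the card does not spell out: when the number of observations is odd, the overall median is left out of both halves. Other conventions exist and software may give slightly different answers.

```python
from statistics import median

def quartiles(xs):
    """Return (lower quartile, upper quartile) as medians of the two halves.

    Assumption: with an odd number of observations, the overall median
    is excluded from both halves.
    """
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(upper_half)

data = [1, 3, 4, 6, 8, 9, 12]  # made-up data
q1, q3 = quartiles(data)
print(q1, q3)
```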

pth percentile

the value such that p percent of the observations fall at or below it

five number summary

of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest.

Min Q1 M Q3 Max

boxplot

graph of five-number summary

interquartile range IQR

distance between the first and third quartiles. IQR = Q3 - Q1

1.5 X IQR rule for outliers

observation = outlier if it falls MORE than 1.5 X IQR above third quartile or below first quartile
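The rule can be sketched as a small filter. The function name `outliers` and the data are made up for illustration, and the quartiles are passed in as arguments so the sketch does not depend on any particular way of computing them.

```python
def outliers(values, q1, q3):
    """Return the observations flagged by the 1.5 x IQR rule."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Strict inequalities: an observation must fall MORE than 1.5 x IQR away.
    return [v for v in values if v < low_fence or v > high_fence]

data = [1, 3, 4, 6, 8, 9, 30]  # made-up data with Q1 = 3 and Q3 = 9
# IQR = 6, so the fences are 3 - 9 = -6 and 9 + 9 = 18.
print(outliers(data, q1=3, q3=9))
```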

standard deviation

measures spread by looking at how far the observations are from their mean.

variance

s^2 of a set of observations is the average of the squares of the deviations of the observations from their mean. For the sample variance s^2, this "average" divides the sum of the squared deviations by n-1 (the degrees of freedom) rather than by n.

(1) Work out the mean (the simple average of the numbers)
(2) Then for each number: subtract the mean and square the result (the squared difference)
(3) Then add up those squared differences and divide by n-1
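The three steps can be sketched as follows. This assumes the sample variance s^2, so the final "average" divides by n-1 (the degrees of freedom mentioned elsewhere in this deck) rather than by n; the function name and the data are made up for illustration.

```python
import math

def sample_variance(xs):
    """Sample variance s^2 (assumes the n-1 divisor)."""
    n = len(xs)
    m = sum(xs) / n                            # step 1: the mean
    squared_devs = [(x - m) ** 2 for x in xs]  # step 2: squared differences
    return sum(squared_devs) / (n - 1)         # step 3: divide the total by n-1

data = [2, 4, 4, 6, 9]  # made-up data with mean 5
s2 = sample_variance(data)
s = math.sqrt(s2)       # the standard deviation is the square root of the variance
print(s2, s)
```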

standard deviation

= square root of the variance

degrees of freedom

the number n-1 is called the degrees of freedom of the variance or standard deviation

properties of standard deviation

(1) s measures spread about the mean and should be used only when the mean is chosen as the measure of center
(2) s = 0 only when there is no spread. this happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger
(3) s, like the mean, is not resistant. a few outliers can make s very large.

Which is better for describing a skewed distribution or a distribution with strong outliers: five number summary, mean, or std deviation?

five number summary

linear transformation

changes the original variable x into the new variable xnew given by this equation:

xnew = a + bx

they don’t change the shape of the distribution

effects of linear transformation

(1) multiplying each observation by a positive number b multiplies both the measures of center (mean and median) and the measures of spread (interquartile range and std dev) by b
(2) adding the same number a (positive or negative) to each observation adds a to the measures of center and to the quartiles and other percentiles, but does NOT change the measures of spread
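Both effects can be checked numerically. The sketch below uses a Celsius-to-Fahrenheit conversion (a = 32, b = 1.8) as a hypothetical example of xnew = a + bx; the temperatures are made up.

```python
from statistics import mean, median, stdev

def transform(xs, a, b):
    """Apply the linear transformation xnew = a + b*x to every observation."""
    return [a + b * x for x in xs]

celsius = [10, 20, 20, 30, 45]                 # made-up data
fahrenheit = transform(celsius, a=32, b=1.8)

# Center: multiplied by b, then shifted by a.
print(mean(fahrenheit), 32 + 1.8 * mean(celsius))
print(median(fahrenheit), 32 + 1.8 * median(celsius))
# Spread: multiplied by b only; adding a has no effect.
print(stdev(fahrenheit), 1.8 * stdev(celsius))
```

Each printed pair should agree up to floating-point rounding.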

density curve

overall pattern of a distribution. has a total area of 1 underneath it.

normal distributions

are described by bell-shaped, symmetric, unimodal density curves. The mean μ and std dev σ completely specify a Normal distribution.

Mean vs. std dev in normal distribution

mean = center of symmetry
std dev = distance from the mean to the change-of-curvature points on either side

68-95-99.7 rule

In the Normal distribution with mean μ and std dev σ:
Approx 68% of the observations fall within 1 std dev of the mean
Approx 95% of the observations fall within 2 x (std dev) of the mean
Approx 99.7% of the observations fall within 3 x (std dev) of the mean
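The three percentages can be checked against the exact areas under a Normal curve. The sketch below relies on a standard identity that the card itself does not mention: the area within k standard deviations of the mean equals erf(k / sqrt(2)), whatever the mean and std dev are. The function name is made up.

```python
import math

def fraction_within(k):
    """Area under a Normal density curve within k std devs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    # The printed percentages come out close to 68, 95 and 99.7.
    print(k, round(100 * fraction_within(k), 2))
```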

z-score

standardized value: subtract the mean of the distribution and then divide by the std dev
z = (x - μ) / σ, where μ is the mean and σ is the std dev
tells us how many std devs the original observation falls away from the mean (and in which direction)
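A short sketch of standardizing a value, using a made-up distribution with mean 100 and std dev 15 (the function and parameter names are illustrative):

```python
def z_score(x, mu, sigma):
    """Standardized value: how many std devs x lies from the mean mu."""
    return (x - mu) / sigma

print(z_score(130, mu=100, sigma=15))  # positive: above the mean
print(z_score(85, mu=100, sigma=15))   # negative: below the mean
```

The sign of z gives the direction and its size gives the distance in std devs.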

frequency distribution table

1. frequency (f): number of times we observe an event
2. relative frequency (f/n): number of times the event takes place / total number of events
3. cumulative freq: running count of the frequencies of a particular value and all preceding values (sum of the raw freqs)
4. cum. relative freq: cumulative freq for a particular value in relation to the total (sum of the relative freqs)
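The four columns can be built in one pass over the sorted values. This is a sketch with made-up ratings data; the function name `frequency_table` and the tuple layout are assumptions for illustration. Here f/n is the share of all observations, and the cumulative columns are running totals.

```python
from collections import Counter

def frequency_table(values):
    """Rows of (value, f, f/n, cumulative f, cumulative f/n), in value order."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        f = counts[value]
        cumulative += f  # running count up to and including this value
        rows.append((value, f, f / n, cumulative, cumulative / n))
    return rows

ratings = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5]  # made-up data, n = 10
for row in frequency_table(ratings):
    print(row)
```

The last row always ends with a cumulative frequency of n and a cumulative share of 1.0.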

measures of central tendency for cat. variables

median (if cat variables can be ranked) | mode

measures of central tendency for quant variables

mean | median if lots of outliers

calculate median

1. order the data from low to high
2. find the location: (n+1)/2
3. if the location is at 5.5, then average the 5th and 6th values

median provides a ______ reasonable measure of central tendency when distributions are skewed or have outliers

median provides a MORE reasonable measure of central tendency when distributions are skewed or have outliers

mean is _____ sensitive to outliers

mean is sensitive to outliers

if distribution is exactly symmetric, then mean and median

are the same

IQR as a measure of spread is ____ useful to describe skewed distributions

NOT. The 2 "sides" of a skewed distribution have different spreads, so one number cannot describe both.

standard deviation is _____ a good measure when the distribution is highly skewed

NOT

(39 cards)