2: Exploratory Data Analysis: Single Variable Flashcards

1
Q

cases

A

objects described by a set of data (companies, subjects, customers)

2
Q

label

A

variable used in some data sets to distinguish different cases

3
Q

variable

A

characteristic of a case

4
Q

distribution

A

of a variable tells us what values it takes and how often it takes these values

5
Q

distribution of categorical variable

A

lists the categories and gives either the count or the percent of cases that fall in each category

6
Q

stemplot

A

stem-and-leaf plot. gives a quick picture of the shape of a distribution while including the actual numerical values in the graph. separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column. write each leaf in the row to the right of its stem, in increasing order out from the stem.
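
The steps above can be sketched in Python. This is a minimal illustration that assumes non-negative whole-number observations; the function name `stemplot` and the sample data are made up for the example.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: stem = all but the last digit, leaf = last digit.

    Assumes non-negative integers (an assumption of this sketch).
    """
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting keeps leaves in increasing order
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    rows = []
    # Include every stem between the smallest and largest, even empty ones,
    # so gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem:>2} | " + "".join(str(d) for d in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

for row in stemplot([12, 15, 21, 23, 23, 38, 40]):
    print(row)
```

Empty stems are kept on purpose: dropping them would hide gaps and distort the shape.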

7
Q

histogram

A

breaks the range of values of a variable into classes and displays only the count or percent of the observations that fall into each class. the classes have equal width.
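
A rough Python sketch of the counting behind a histogram. The function name `histogram_counts`, its parameters, and the sample data are assumptions made for illustration.

```python
def histogram_counts(values, low, width, n_classes):
    """Count observations in equal-width classes.

    Class k covers low + k*width <= x < low + (k+1)*width.
    Values outside all classes are ignored in this sketch.
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - low) // width)
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

data = [3, 7, 8, 12, 14, 15, 21, 29]
counts = histogram_counts(data, low=0, width=10, n_classes=3)
for k, c in enumerate(counts):
    print(f"[{k * 10}, {(k + 1) * 10}): {'*' * c}")
```

Unlike a stemplot, only the class counts survive; the individual values are no longer visible.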

8
Q

tails

A

extreme values of a distribution

9
Q

modes

A

major peaks in a distribution

10
Q

time plot

A

of a variable plots each observation against the time at which it was measured. time is on the horizontal scale of the plot and the variable being measured is on the vertical scale

11
Q

mean vs. median

A

mean is the average value:
mean = (x1 + x2 + ... + xn) / n

median is the middle value of the ordered list.

(1) if the number of observations n is odd: the median is the center observation; its LOCATION is found by counting (n+1)/2 observations up from the bottom of the list
(2) if n is even: the median is the mean of the two center observations in the ordered list; the location (n+1)/2 falls halfway between them
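
Both definitions can be written out in a few lines of Python. This is a sketch of the location rule above, with function names chosen for the example; Python's standard `statistics` module offers `mean` and `median` as ready-made alternatives.

```python
def mean(xs):
    """Sum of the observations divided by how many there are."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the ordered list, using the (n+1)/2 location rule."""
    ordered = sorted(xs)
    n = len(ordered)
    if n % 2 == 1:
        location = (n + 1) // 2          # 1-based position of the center value
        return ordered[location - 1]
    # n even: (n+1)/2 falls between positions n/2 and n/2 + 1 (1-based)
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(mean([1, 2, 3, 4]))    # 2.5
print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```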

12
Q

quartile

A

upper quartile = median of the upper half of the data. lower quartile = median of lower half of the data

13
Q

pth percentile

A

the value that has p percent of the observations fall at or below it

14
Q

five number summary

A

of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest.

Min Q1 M Q3 Max
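
A minimal Python sketch of the five-number summary, using the "median of each half" definition of the quartiles. It assumes that when n is odd the overall median is left out of both halves; other conventions exist and give slightly different quartiles. The function name is made up for the example.

```python
from statistics import median

def five_number_summary(xs):
    """Return (Min, Q1, M, Q3, Max) for a list with at least two observations."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n, skip the middle observation when forming the halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

print(five_number_summary([1, 2, 3, 4, 5, 6, 7]))     # (1, 2, 4, 6, 7)
print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8]))  # (1, 2.5, 4.5, 6.5, 8)
```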

15
Q

boxplot

A

graph of five-number summary

16
Q

interquartile range IQR

A

distance between the first and third quartiles. IQR = Q3 - Q1

17
Q

1.5 X IQR rule for outliers

A

observation = outlier if it falls MORE than 1.5 X IQR above third quartile or below first quartile
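
The rule can be sketched in Python as follows. The quartiles are passed in rather than computed, and the function name and sample numbers are assumptions for illustration.

```python
def iqr_outliers(xs, q1, q3):
    """Return observations lying more than 1.5 x IQR beyond the quartiles."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Strict inequalities: a value exactly on a fence is not flagged.
    return [x for x in xs if x < low_fence or x > high_fence]

# With Q1 = 10 and Q3 = 18, IQR = 8, so the fences are -2 and 30.
print(iqr_outliers([1, 10, 12, 14, 16, 18, 40], q1=10, q3=18))  # [40]
```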

18
Q

standard deviation

A

measures spread by looking at how far the observations are from their mean.

19
Q

variance

A

s^2 of a set of observations is the average of the squares of the deviations of the observations from their mean, where "average" means dividing the sum by n-1 rather than n.

(1) Work out the mean (the simple average of the numbers)
(2) Then for each number: subtract the mean and square the result (the squared difference)
(3) Then add up those squared differences and divide by n-1
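
The three steps translate directly into Python. This sketch divides by n-1 (the degrees of freedom), the usual convention for the sample variance s^2; the function names are chosen for the example.

```python
from math import sqrt

def sample_variance(xs):
    """s^2: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    m = sum(xs) / n                              # step 1: the mean
    squared_diffs = [(x - m) ** 2 for x in xs]   # step 2: squared differences
    return sum(squared_diffs) / (n - 1)          # step 3: divide by n - 1

def sample_std_dev(xs):
    """s: square root of the sample variance."""
    return sqrt(sample_variance(xs))

print(sample_variance([1, 2, 3, 4, 5]))  # 2.5
```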

20
Q

standard deviation

A

= square root of the variance

21
Q

degrees of freedom

A

the number n-1 is called the degrees of freedom of the variance or standard deviation

22
Q

properties of standard deviation

A

(1) s measures spread about the mean and should be used only when the mean is chosen as the measure of center
(2) s = 0 only when there is no spread. this happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger
(3) s, like the mean, is not resistant. a few outliers can make s very large.

23
Q

Which is better for describing a skewed distribution or a distribution with strong outliers: five number summary, mean, or std deviation?

A

five number summary

24
Q

linear transformation

A

changes the original variable x into the new variable xnew given by this equation:

xnew = a + bx

they don’t change the shape of the distribution

25
Q

effects of linear transformation

A

(1) multiplying each observation by a positive number b multiplies both measures of center (mean and median) and measures of spread (interquartile range and std dev) by b
(2) adding the same number a (positive or negative) to each observation adds a to measures of center and to quartiles and other percentiles, but does NOT change measures of spread
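
A quick Python check of both effects, using the standard `statistics` module. The data and the choices a = 5, b = 2 are arbitrary illustrative values.

```python
from statistics import mean, median, stdev

def linear_transform(xs, a, b):
    """Apply xnew = a + b*x to every observation."""
    return [a + b * x for x in xs]

x = [1, 2, 3, 4, 10]
x_new = linear_transform(x, a=5, b=2)

# Center: multiplied by b, then shifted by a.
print(mean(x), mean(x_new))      # 4 and 13
print(median(x), median(x_new))  # 3 and 11
# Spread: multiplied by b only; adding a has no effect.
print(stdev(x_new) / stdev(x))
```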

26
Q

density curve

A

overall pattern of a distribution. has a total area of 1 underneath it.

27
Q

normal distributions

A

are described by bell-shaped, symmetric, unimodal density curves. the mean μ and std dev σ completely specify the Normal distribution.

28
Q

Mean vs. std dev in normal distribution

A

mean = center of symmetry

std dev = distance from mean to the change-of-curvature points on either side

29
Q

68-95-99.7 rule

A

In the Normal distribution with mean μ and std dev σ:

Approx 68% of the observations fall within 1 std dev (σ) of the mean

Approx 95% of the observations fall within 2 x (std dev) of the mean

Approx 99.7% of the observations fall within 3 x (std dev) of the mean
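
The three percentages can be checked numerically with `statistics.NormalDist` (Python 3.8+). The helper name `prob_within` is made up for this sketch.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Proportion of a Normal distribution within k std devs of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The result does not depend on the particular mean and std dev, which is why one rule covers every Normal distribution.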

30
Q

z-score

A

standardized value: subtract the mean of the distribution and then divide by the std dev

z = (x - μ) / σ, where μ is the mean and σ is the std dev

tells us how many standard devs the original observation falls away from the mean (and in which direction)
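
As a small Python sketch (the function name and the example numbers, mean 100 and std dev 15, are hypothetical):

```python
def z_score(x, mu, sigma):
    """Standardize x: subtract the mean, then divide by the std dev."""
    return (x - mu) / sigma

print(z_score(130, mu=100, sigma=15))  # 2.0  (two std devs above the mean)
print(z_score(85, mu=100, sigma=15))   # -1.0 (one std dev below the mean)
```

The sign gives the direction: positive z-scores lie above the mean, negative ones below.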

31
Q

frequency distribution table

A
  1. frequency (f): number of times we observe an event
  2. relative frequency (f/n): # of times event takes place / total events
  3. cumulative freq: running count of the frequencies of a particular value and all preceding values (sum raw freq)
  4. cum. relative freq: cumulative freq for a particular value in relation to the total (sum rel freqs)
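
The four columns can be built in one pass over the sorted values. This is a minimal Python sketch; the function name `frequency_table` and the sample data are assumptions for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, f, f/n, cumulative f, cumulative f/n)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        f = counts[value]
        cumulative += f   # running count of this value and all preceding ones
        rows.append((value, f, f / n, cumulative, cumulative / n))
    return rows

for row in frequency_table([1, 1, 2, 3, 3, 3, 4, 4]):
    print(row)
```

The last row always ends with a cumulative frequency of n and a cumulative relative frequency of 1.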
32
Q

measures of central tendency for cat. variables

A

median (if cat variables can be ranked)

mode

33
Q

measures of central tendency for quant variables

A

mean

median (if the distribution is skewed or has strong outliers)

34
Q

calculate median

A
  1. order data from low to high
  2. look at location: (n+1)/2
  3. if at 5.5, then average 5th and 6th values
35
Q

median provides a ______ reasonable measure of central tendency when distributions are skewed or have outliers

A

median provides a MORE reasonable measure of central tendency when distributions are skewed or have outliers

36
Q

mean is _____ sensitive to outliers

A

mean is sensitive to outliers
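
A small Python illustration of the difference, using made-up data: adding one extreme value moves the mean a long way but barely moves the median.

```python
from statistics import mean, median

data = [10, 12, 13, 15, 16]
with_outlier = data + [100]   # one extreme observation added

print(mean(data), median(data))                  # 13.2 and 13
print(mean(with_outlier), median(with_outlier))  # about 27.67 and 14.0
```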

37
Q

if distribution is exactly symmetric, then mean and median

A

are the same

38
Q

IQR as a measure of spread is ____ useful to describe skewed distributions

A

NOT

the 2 “sides” of a skewed distribution have different spreads, so a single number cannot describe the spread well

39
Q

standard deviation is _____ a good measure when the distribution is highly skewed

A

NOT