Lesson 3 Flashcards

1
Q

scatterplots

A

used to visualize the relationship between two numerical variables, e.g. how an explanatory variable relates to a response variable

2
Q

Which axis should have response or explanatory variable?

A

response= y axis
explanatory = x axis
this is only a convention for keeping track of which variable we suspect is which

3
Q

What factors do you use to evaluate the relationship of variables on a scatterplot?

A

direction (neg/pos)
shape (linear/curved)
strength (strong/weak)
outliers

4
Q

histogram

A

data are binned into intervals and heights represent number of cases that fall into each interval
provides a view of the data density
useful for describing shape of distribution

5
Q

data density

A

where the data is relatively more common

6
Q

considerations for outliers

A

don’t handle “naively” by discarding outliers without careful consideration
check data for data entry mistakes
outliers can be very interesting

7
Q

left skewed

A

-data sets with longer tail on left
-negative end tail

8
Q

right skewed

A

-data sets with longer tail on the right
-data trails off to the right
-skewed to the high end
-skewed to the positive end

9
Q

symmetric

A

data sets with roughly equal trailing off in both directions are called symmetric

10
Q

long tail

A

when data trail off in one direction the distribution has a long tail

11
Q

define modality and state examples

A

a mode is represented by a prominent peak in the distribution; if a peak is less prominent or only represented by a few data points, it's not counted
1-unimodal
2-bimodal
0-uniform (no prominent peaks)
3+ -multimodal

12
Q

how does bin width alter the story the data is telling in a histogram?

A

too wide= might lose interesting details
too narrow= might be difficult to get an overall sense of distribution
-ideal width depends on data you are working with
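
A minimal sketch of the bin-width effect, using made-up values and a hypothetical helper `bin_counts` (neither comes from the lesson). The same data are binned twice; the wide bins hide the gap between two clusters that the narrow bins reveal.

```python
from collections import Counter

def bin_counts(values, width):
    """Count how many values fall in each bin [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical data with two clusters (around 2 and around 12).
values = [1, 2, 2, 3, 11, 12, 12, 13, 14, 21]

wide = bin_counts(values, 20)    # keys are the left edge of each bin
narrow = bin_counts(values, 5)

print("width 20:", wide)
print("width 5: ", narrow)
```

With a width of 20, nine of the ten values land in a single bar, so the two clusters are indistinguishable.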

13
Q

dot plot

A

especially useful when individual observations are of interest, but might get too busy as the sample size increases

14
Q

box plot

A

useful for displaying outliers, median, interquartile range IQR, not good for showing modality

15
Q

intensity map

A

showing data in context, often geographical, using different layers

16
Q

center of distribution: name and define the 3 types

A

mean: arithmetic average
mode: most frequent observation
median: midpoint of the distribution(50th percentile)
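
As a quick illustration, the three measures of center can be computed with Python's standard `statistics` module. The sample values below are invented for the example.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # arithmetic average: 30 / 6
median = statistics.median(values)  # even count, so average of 3 and 5
mode = statistics.mode(values)      # most frequent observation

print(mean, median, mode)
```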

17
Q

sample statistic

A

-when mean, median and mode are calculated from a sample
-use letters from latin alphabet

18
Q

point estimate

A

involves the use of sample data to calculate a single value which is to serve as a “best guess” or “best estimate” of an unknown population parameter (for example, the population mean).

19
Q

are greek or latin letters used to describe population parameters?

A

greek letters

20
Q

what is a helpful way to think of the mean?

A

think of the mean as the balancing point of the distribution

21
Q

mode

A

most common value in distribution

22
Q

median

A

the middle value of the sorted observations when there is an odd number of values, or the average of the middle two when there is an even number

23
Q

range

A

difference between min and max
not as reliable because uses two most extreme points

24
Q

what are the four measures of spread

A

range
variance
standard deviation
inter-quartile range

25
Q

variance- describe roughly what this is

A

roughly the average squared deviation from the mean

26
Q

formula for calculating sample variance (s^2)

A

1)difference between mean and each observation
2)square each deviation and add them up
3)then divide the sum by n-1 (the sample size minus one)
–the result is in squared units
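
The steps above can be sketched in Python as follows. The data set is hypothetical, and the result is checked against `statistics.variance`, which also divides by n-1.

```python
import math
import statistics

data = [1, 2, 3, 4, 5, 6]  # hypothetical sample

n = len(data)
mean = sum(data) / n

# 1) deviation of each observation from the mean
deviations = [x - mean for x in data]

# 2) square each deviation and add them up
sum_of_squares = sum(d ** 2 for d in deviations)

# 3) divide by n - 1
sample_variance = sum_of_squares / (n - 1)

print(sample_variance, statistics.variance(data))
```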

27
Q

why do we divide by n-1, rather than n, when computing a sample’s variance?

A

doing this makes the statistics slightly more reliable and useful

28
Q

letters used to describe sample variance and population variance

A

s^2 = sample variance
sigma ^2= population variance

29
Q

why do we square the differences when calculating variance?

A

-get rid of negatives so that negatives and positives don’t cancel each other out when added together
-increase larger deviations more than smaller ones so that they are weighted more heavily

30
Q

deviation

A

distance of an observation from its mean

31
Q

what is the symbol for standard deviation?

A

lowercase sigma (σ) for the population standard deviation; s for the sample standard deviation

32
Q

standard deviation

A

roughly the average deviation around the mean and has the same units as the data
formula= square root of the variance

33
Q

Which distribution is more variable: one where more observations are clustered around the center, or one where fewer are clustered around the center?

A

Distributions where fewer observations are clustered around the center are more variable.

34
Q

interquartile range

A

-range of the middle 50% of the data, distance between first quartile(25th percentile) and third quartile (75th percentile)
IQR = Q3-Q1
-best to use box plot to visualize
-IQR is more reliable in looking at spread because it doesn’t look at values which could be outliers
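
A small sketch of IQR = Q3-Q1 using `statistics.quantiles` on made-up data. Quartile conventions differ between textbooks and software; this example relies on the module's default ("exclusive") method, so other tools may give slightly different quartiles for the same data.

```python
import statistics

data = list(range(1, 12))  # hypothetical data: 1, 2, ..., 11

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(q1, q3, iqr)
```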

35
Q

robust statistics

A

measures on which extreme observations have little effect
e.g.
data               mean    median
1,2,3,4,5,6        3.5     3.5
1,2,3,4,5,1000     169     3.5
Here the median is more robust.
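
The comparison in this card can be reproduced with a few lines of Python. The two data sets are the ones from the example; the variable names are only illustrative.

```python
import statistics

without_outlier = [1, 2, 3, 4, 5, 6]
with_outlier = [1, 2, 3, 4, 5, 1000]

for data in (without_outlier, with_outlier):
    # The mean moves a lot when 6 becomes 1000; the median does not move.
    print(data, statistics.mean(data), statistics.median(data))

mean_shift = statistics.mean(with_outlier) - statistics.mean(without_outlier)
median_shift = statistics.median(with_outlier) - statistics.median(without_outlier)
```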

36
Q

describe which: median, mean, IQR and SD are robust or non-robust and when are they better for describing data

A

           robust    non-robust
center     median    mean
spread     IQR       SD, range

Median and IQR= more robust; better for skewed data or data with extreme observations
Mean and SD= best for describing symmetric distributions

37
Q

transformation

A

-rescaling of the data using a function
-when data are very strongly skewed, we sometimes transform them so they are easier to model

38
Q

(natural) log transformation

A

-often applied when much of the data cluster near zero(relative to the larger values in the data set) and all values are positive
-can be easier to analyze because outliers become less extreme, data is more symmetric, less skewed
-make the relationship between the variables more linear and easier to model with simple methods
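
A rough illustration of how the natural log pulls in a large value, using invented positive data. The ratio of the maximum to the median is only one informal way to show how extreme the largest value is; it is not a standard skewness measure.

```python
import math
import statistics

# Hypothetical positive data: most values near zero, one very large value.
values = [1, 2, 4, 8, 16, 1024]
logged = [math.log(v) for v in values]  # natural log of each value

def max_to_median(xs):
    """Informal measure of how far the largest value sits above the middle."""
    return max(xs) / statistics.median(xs)

ratio_before = max_to_median(values)
ratio_after = max_to_median(logged)

print(ratio_before, ratio_after)
```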

39
Q

square root transformation

A

plot the square root or the inverse square root

40
Q

goals of transformation

A

-see data structure differently
-reduce skew to assist in modeling
-straighten a nonlinear relationship in a scatterplot

41
Q

what greek letter represents the population mean?

A

mu (μ)

42
Q

what letter represents sample mean?

A

x bar

43
Q

contingency table

A

-summarizes data for two categorical variables
-shows the number of times a particular combination of variable outcomes occurred, along with column totals and row totals
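
A minimal sketch of building the counts behind a contingency table with `collections.Counter`. The records and the variable names (homeownership, application type) are hypothetical.

```python
from collections import Counter

# Hypothetical records: (homeownership, application type).
records = [
    ("rent", "individual"), ("rent", "joint"), ("own", "individual"),
    ("rent", "individual"), ("own", "joint"), ("own", "individual"),
]

cell_counts = Counter(records)                     # one count per combination
row_totals = Counter(home for home, _ in records)  # totals per homeownership
col_totals = Counter(app for _, app in records)    # totals per application type

print(cell_counts)
print(row_totals, col_totals)
```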

44
Q

bar plot

A

way to display single categorical variable
-x-axis shows categories

45
Q

difference between histogram and bar plot

A

bar plot
-shows discrete or categorical variables
-x-axis shows categories
-bars can be rearranged

histogram-depicts the frequency distribution of variables in a dataset
-x-axis shows numbers
-the bars cannot be rearranged

46
Q

row or column proportions

A

-counts divided by their row (or column) totals
e.g. 3496 renters/8505 total = 0.411
-can be displayed in a contingency table
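
The arithmetic can be checked with a short sketch. Only the 3496 renters and the 8505 total come from the card; the "not rent" count is simply the remainder, and the dictionary layout is an assumption made for illustration.

```python
# One row of a contingency table, as counts per category.
row_total = 8505
row = {"rent": 3496, "not rent": row_total - 3496}

# Row proportions: each count divided by the row total.
row_proportions = {category: count / row_total for category, count in row.items()}

print(round(row_proportions["rent"], 3))
```

The proportions in a row always add up to 1, which is a handy check on the table.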

47
Q

types of bar plots

A

stacked

standardized version of stacked

side-by-side

48
Q

when is standardized stacked bar most useful and what is the downside

A

-useful if the primary variable in the stacked bar plot is relatively unbalanced
-downside is that we lose all sense of how many cases each of the bars represents

49
Q

when is side-by-side most useful and what is the downside

A

-agnostic in their display about which variable, if any, represents the explanatory and which the response variable
-easy to see the number of cases in the group combinations
-downside is that it can require more horizontal space
-downside is that it can be difficult to discern whether there is an association between two variables if two groups are of very different sizes

50
Q

when is a stacked bar graph the most useful

A

when one variable is the explanatory variable and the other is the response

51
Q

mosaic plot

A

plot suitable for contingency tables that resembles a standardized stacked bar plot with the benefit that we still see the relative group sizes of the primary variable as well
-column widths along the x-axis are proportional to the relative sizes of the groups of the primary variable
-the y-axis can be split into different variables

52
Q

why is a pie chart less useful than a bar plot

A

it can be difficult to see details in a pie chart which are more obvious in a bar plot, especially for comparing groups

53
Q

side-by-side box plot

A

good for comparing across groups

54
Q

hollow histograms

A

compare numerical data across groups
-outlines of histograms of each group put on the same plot

55
Q

independence model

A

the model in which the variables are independent and any observed difference is due to chance

56
Q

alternative model

A

the model in which the variables are not independent, so the observed difference is not due to chance alone

57
Q

simulation

A

testing whether a different randomization would affect the result
e.g., take 20 notecards to represent 20 subjects
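
The notecard idea can be sketched as a shuffle-and-deal simulation. Everything below is an assumption for illustration: the outcomes for the 20 subjects, the two equal groups of 10, the helper `shuffled_diff`, and the choice of 1000 shuffles.

```python
import random

# Hypothetical outcomes for 20 subjects (1 = success, 0 = failure).
treatment = [1] * 8 + [0] * 2   # 8 of 10 succeeded
control = [1] * 4 + [0] * 6     # 4 of 10 succeeded
observed_diff = sum(treatment) - sum(control)  # difference in success counts

def shuffled_diff(cards, group_size, rng):
    """Shuffle the notecards, deal two groups, return the count difference."""
    deck = list(cards)
    rng.shuffle(deck)
    return sum(deck[:group_size]) - sum(deck[group_size:])

rng = random.Random(0)  # fixed seed so the run is repeatable
cards = treatment + control
sims = [shuffled_diff(cards, len(treatment), rng) for _ in range(1000)]

# Share of shuffles with a difference at least as large as the observed one.
share_as_extreme = sum(d >= observed_diff for d in sims) / len(sims)
print(observed_diff, share_as_extreme)
```

If shuffles rarely produce a difference as large as the observed one, chance alone is a less convincing explanation for the result.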

58
Q

statistical inference

A

one field of statistics, evaluating whether differences are due to chance