Lesson 3 Flashcards

1
Q

scatterplots

A

used to show the relationship between two numerical variables (explanatory vs. response)

2
Q

Which axis should have response or explanatory variable?

A

response= y axis
explanatory = x axis
this is only a convention for keeping track of which variable we suspect is which

3
Q

What factors do you use to evaluate the relationship of variables on a scatterplot?

A

direction (neg/pos)
shape (linear/curved)
strength (strong/weak)
outliers

4
Q

histogram

A

data are binned into intervals and heights represent number of cases that fall into each interval
provides a view of the data density
useful for describing shape of distribution

5
Q

data density

A

where the data is relatively more common

6
Q

considerations for outliers

A

don’t handle “naively” by discarding outliers without careful consideration
check data for data entry mistakes
outliers can be very interesting

7
Q

left skewed

A

-data sets with longer tail on left
-negative end tail

8
Q

right skewed

A

-data sets with longer tail on the right
-data trails off to the right
-skewed to the high end
-skewed to the positive end

9
Q

symmetric

A

data sets with roughly equal trailing off in both directions are called symmetric

10
Q

long tail

A

when data trail off in one direction the distribution has a long tail

11
Q

define modality and state examples

A

a mode is represented by a prominent peak in the distribution; if a peak is less prominent or only represented by a few data points, it's not counted
1-unimodal
2-bimodal
0-uniform (no prominent peaks)
3+ -multimodal

12
Q

how does bin width alter the story the data is telling in a histogram?

A

too wide= might lose interesting details
too narrow= might be difficult to get an overall sense of distribution
-ideal width depends on data you are working with
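The effect of bin width can be sketched with a few lines of Python. This is a minimal illustration, not a plotting recipe: the data values are made up, and bin_counts is a hypothetical helper that counts how many cases fall into each interval.

```python
from collections import Counter

def bin_counts(values, width):
    """Count how many values fall into each interval [k*width, (k+1)*width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Made-up data with two clusters (one near 1-4, one near 10-13).
data = [1, 2, 2, 3, 3, 3, 4, 10, 11, 11, 12, 12, 12, 13]

wide = bin_counts(data, 20)    # a single bin hides the two clusters
medium = bin_counts(data, 5)   # the two clusters show up as two bars
narrow = bin_counts(data, 1)   # one bar per distinct value, harder to summarize

for label, counts in [("wide", wide), ("medium", medium), ("narrow", narrow)]:
    print(label, counts)
```

Each dictionary key is the left edge of a bin and each value is the bar height for that bin.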

13
Q

dot plot

A

especially useful when individual samples are of interest, but can get too busy as the sample size grows

14
Q

box plot

A

useful for displaying outliers, median, interquartile range IQR, not good for showing modality

15
Q

intensity map

A

showing data in context, often geographical, using different layers

16
Q

center of distribution: name and define the 3 types

A

mean: arithmetic average
mode: most frequent observation
median: midpoint of the distribution(50th percentile)
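As a quick illustration, the three measures of center can be computed with Python's standard statistics module. The sample values are made up and chosen to be right skewed, so the three measures come out different.

```python
import statistics

# Made-up right-skewed sample: most values are small, one is large.
data = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # midpoint of the sorted data (50th percentile)
mode = statistics.mode(data)      # most frequent observation

print(mean, median, mode)
```

The large value pulls the mean upward while the median and mode stay near the bulk of the data.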

17
Q

sample statistic

A

-when mean, median and mode are calculated from a sample
-use letters from latin alphabet

18
Q

point estimate

A

involves the use of sample data to calculate a single value which is to serve as a “best guess” or “best estimate” of an unknown population parameter (for example, the population mean).

19
Q

are greek or latin letters used to describe population parameters?

A

greek letters

20
Q

what is a helpful way to think of the mean?

A

think of the mean as the balancing point of the distribution
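One way to check the balancing-point idea: the deviations below the mean and the deviations above it cancel exactly. A small sketch with made-up values:

```python
data = [1, 2, 3, 4, 10]  # made-up values

mean = sum(data) / len(data)
deviations = [x - mean for x in data]

# Total "pull" on each side of the mean.
below = sum(-d for d in deviations if d < 0)
above = sum(d for d in deviations if d > 0)
total = sum(deviations)

print(mean, below, above, total)
```

The pull below the mean equals the pull above it, so the deviations sum to zero.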

21
Q

mode

A

most common value in distribution

22
Q

median

A

middle value of the sorted data when the number of observations is odd; average of the two middle values when it is even
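The odd/even rule can be written out by hand. This is a minimal sketch; the function name median and the example values are chosen for illustration.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    if not values:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: take the single middle value.
        return ordered[mid]
    # Even number of observations: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = median([5, 1, 3])
even_result = median([1, 2, 3, 4, 5, 6])
print(odd_result, even_result)
```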

23
Q

range

A

difference between min and max
not as reliable because it uses only the two most extreme points

24
Q

what are the four measures of spread?

A

range
variance
standard deviation
inter-quartile range

25
variance- describe roughly what this is
roughly the average squared deviation from the mean
26
formula for calculating sample variance (s^2)
1) find the difference between each observation and the mean (the deviation)
2) square each deviation and add them up
3) divide the sum by the sample size minus one (n-1)
-the answer is in squared units: s^2 = sum of (x_i - x-bar)^2 / (n-1)
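The steps can be sketched directly in Python and checked against the standard library. The function name sample_variance and the data values are chosen for illustration.

```python
import statistics

def sample_variance(values):
    """Sample variance s^2: sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    # Steps 1 and 2: deviation of each observation from the mean, squared.
    squared_deviations = [(x - mean) ** 2 for x in values]
    # Step 3: add them up and divide by n - 1.
    return sum(squared_deviations) / (n - 1)

data = [1, 2, 3, 4, 5, 6]
s2 = sample_variance(data)
print(s2, statistics.variance(data))
```

statistics.variance also divides by n-1, so the two results should agree.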
27
why do we divide by n-1, rather than n, when computing a sample's variance?
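dividing by n-1 makes s^2 a slightly more reliable estimate of the population variance: deviations are measured from the sample mean, which sits closer to the sample than the population mean does, so dividing by n would tend to underestimate the variance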
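A small simulation can make the n-1 choice concrete. This is a sketch under stated assumptions: the population is taken to be normal with standard deviation 2 (variance 4), the sample size is 5, and the seed and repetition count are arbitrary.

```python
import random

def variance_estimates(sample):
    """Return (divide-by-n, divide-by-(n-1)) variance estimates for one sample."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    return sum_sq / n, sum_sq / (n - 1)

rng = random.Random(0)   # fixed seed so the run is repeatable
true_variance = 4.0      # population SD of 2, chosen for illustration
reps, n = 20_000, 5

total_n = total_n_minus_1 = 0.0
for _ in range(reps):
    sample = [rng.gauss(0, 2) for _ in range(n)]
    by_n, by_n_minus_1 = variance_estimates(sample)
    total_n += by_n
    total_n_minus_1 += by_n_minus_1

avg_divide_by_n = total_n / reps
avg_divide_by_n_minus_1 = total_n_minus_1 / reps
print(avg_divide_by_n, avg_divide_by_n_minus_1)
```

Averaged over many samples, the divide-by-n estimate should come out noticeably below the true variance of 4, while the divide-by-(n-1) estimate should land close to it.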
28
letters used to describe sample variance and population variance
s^2 = sample variance; sigma^2 = population variance
29
why do we square the differences when calculating variance?
-get rid of negatives so that negatives and positives don't cancel each other out when added together
-increase larger deviations more than smaller ones so that they are weighted more heavily
30
deviation
distance of an observation from its mean
31
what is the symbol for standard deviation?
lowercase sigma for a population; s for a sample
32
standard deviation
roughly the average deviation around the mean; has the same units as the data
formula = square root of the variance
33
Which distribution is more variable: one where more observations are clustered around the center, or one where fewer are clustered around the center?
Distributions where fewer observations are clustered around the center are more variable.
34
interquartile range
-range of the middle 50% of the data: distance between the first quartile (25th percentile) and the third quartile (75th percentile)
-IQR = Q3-Q1
-best to use box plot to visualize
-IQR is more reliable in looking at spread because it doesn't look at values which could be outliers
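A short sketch of IQR = Q3-Q1 using statistics.quantiles from the Python standard library. Textbooks and software compute quartiles in slightly different ways; the "inclusive" method used here is one choice among several, and the data values are made up.

```python
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]  # largest value replaced by an extreme one

iqr_clean = iqr(data)
iqr_outlier = iqr(with_outlier)
range_clean = max(data) - min(data)
range_outlier = max(with_outlier) - min(with_outlier)

print(iqr_clean, iqr_outlier, range_clean, range_outlier)
```

The extreme value changes the range a great deal but leaves the IQR where it was.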
35
robust statistics
measures on which extreme observations have little effect
eg.
data              mean    median
1,2,3,4,5,6       3.5     3.5
1,2,3,4,5,1000    169     3.5
Here the median is more robust.
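The example on this card can be reproduced with the standard statistics module; the two data sets are the ones given above.

```python
import statistics

clean = [1, 2, 3, 4, 5, 6]
extreme = [1, 2, 3, 4, 5, 1000]  # same data with one extreme observation

mean_clean = statistics.mean(clean)
mean_extreme = statistics.mean(extreme)
median_clean = statistics.median(clean)
median_extreme = statistics.median(extreme)

# The mean moves a lot; the median does not move at all.
print(mean_clean, mean_extreme)
print(median_clean, median_extreme)
```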
36
describe which: median, mean, IQR and SD are robust or non-robust and when are they better for describing data
            robust     non-robust
center      median     mean
spread      IQR        SD, range
Median and IQR= more robust; better for skewed data or data with extreme observations
Mean and SD= best for symmetric distributions
37
transformation
-rescaling of the data using a function -when data are very strongly skewed, we sometimes transform them so they are easier to model
38
(natural) log transformation
-often applied when much of the data cluster near zero (relative to the larger values in the data set) and all values are positive
-can be easier to analyze because outliers become less extreme and the data are more symmetric, less skewed
-can make the relationship between the variables more linear and easier to model with simple methods
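A rough sketch of how a natural log transformation reduces skew. The data are made up (powers of 2, all positive), and skewness is a hypothetical helper computing the usual moment-based skewness coefficient, which is not part of this deck and is used here only as a yardstick.

```python
import math

def skewness(values):
    """Moment-based skewness: positive for a long right tail, near 0 if symmetric."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Made-up, strongly right-skewed data; all values must be positive to take logs.
data = [1, 2, 4, 8, 16, 32, 64, 128, 256]
logged = [math.log(x) for x in data]

skew_raw = skewness(data)
skew_log = skewness(logged)
print(skew_raw, skew_log)
```

On the log scale these values are evenly spaced, so the long right tail disappears and the skewness drops to roughly zero.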
39
square root transformation
transform each value by taking its square root (or inverse square root), then plot the transformed data
40
goals of transformation
-see data structure differently -reduce skew to assist in modeling -straighten a nonlinear relationship in a scatterplot
41
what greek letter represents the population mean?
mu (μ, the same symbol as the prefix micro-)
42
what letter represents sample mean?
x bar
43
contingency table
-summarizes data for two categorical variables
-shows the number of times a particular combination of variable outcomes occurred, along with column totals and row totals
44
bar plot
way to display single categorical variable -x-axis shows categories
45
difference between histogram and bar plot
bar plot
-shows discrete or categorical variables
-x-axis shows categories
-bars can be rearranged
histogram
-depicts the frequency distribution of a numerical variable
-x-axis shows numbers
-the bars cannot be rearranged
46
row or column proportions
-counts divided by their row (or column) totals, eg. 3496 renters/8505 total = 0.411
-can be displayed in a contingency table
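Row proportions can be sketched as a small calculation over a contingency table. Only the 3496 renters and the row total of 8505 come from the card; the group names, the "other" column, and the whole second row are made up for illustration.

```python
# Counts by row. Only 3496 and the row total 8505 come from the card;
# everything else here is hypothetical.
table = {
    "group A": {"rent": 3496, "other": 5009},  # row total 8505
    "group B": {"rent": 400, "other": 600},    # made-up row
}

def row_proportions(counts_by_row):
    """Divide each count by its row total."""
    result = {}
    for row, counts in counts_by_row.items():
        row_total = sum(counts.values())
        result[row] = {col: count / row_total for col, count in counts.items()}
    return result

row_props = row_proportions(table)
print(round(row_props["group A"]["rent"], 3))
```

Column proportions work the same way, with each count divided by its column total instead.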
47
types of bar plots
stacked
standardized version of stacked
side-by-side
48
when is standardized stacked bar most useful and what is the downside
-useful if the primary variable in the stacked bar plot is relatively unbalanced -downside is that we lose all sense of how many cases each of the bars represents
49
when is side-by-side most useful and what is the downside
-agnostic in their display about which variable, if any, represents the explanatory and which the response variable
-easy to see the number of cases in the group combinations
-downside is that it can require more horizontal space
-downside is that it can be difficult to discern if there is an association between two variables if two groups are of very different sizes
50
when is a stacked bar graph the most useful
- when one variable is the explanatory variable and the other is the response
51
mosaic plot
plot suitable for contingency tables that resembles a standardized stacked bar plot, with the benefit that we still see the relative group sizes of the primary variable
-column widths along the x-axis are proportional to the relative size of each group
-each column is split along the y-axis by the levels of the second variable
52
why is a pie chart less useful than a bar plot
it can be difficult to see details in a pie chart which are more obvious in a bar plot, especially for comparing groups
53
side-by-side box plot
good for comparing across groups
54
hollow histograms
compare numerical data across groups -outlines of histograms of each group put on the same plot
55
independence model
the model in which the variables are independent and any observed difference is due to chance
56
alternative model
the model in which the variables are not independent, so the observed difference is not due to chance alone
57
simulation
re-randomizing the data many times to see what results chance alone would produce, eg. take 20 notecards to represent 20 subjects, shuffle them, and deal them into new groups
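The notecard idea can be sketched as a randomization test in Python. The outcomes are made up (20 subjects, 10 per group, coded 1 for success and 0 for failure), and randomization_test is a hypothetical helper, not a library function.

```python
import random

def randomization_test(group_a, group_b, reps=10_000, seed=0):
    """Shuffle the 'notecards' and count how often chance alone matches the observed gap."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    cards = list(group_a) + list(group_b)  # one card per subject
    at_least_as_large = 0
    for _ in range(reps):
        rng.shuffle(cards)  # re-deal the cards into two new groups
        diff = sum(cards[:n_a]) / n_a - sum(cards[n_a:]) / n_b
        if diff >= observed - 1e-12:
            at_least_as_large += 1
    return observed, at_least_as_large / reps

# Made-up outcomes: 8 of 10 successes in one group, 3 of 10 in the other.
treatment = [1] * 8 + [0] * 2
control = [1] * 3 + [0] * 7

observed_diff, share_by_chance = randomization_test(treatment, control)
print(observed_diff, share_by_chance)
```

A small share means the observed difference would be unusual under the independence model, which counts as evidence for the alternative model.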
58
statistical inference
one field of statistics, evaluating whether differences are due to chance