Lesson 3 Flashcards
scatterplots
used for visualizing the relationship between two numerical variables (explanatory and response)
Which axis should have response or explanatory variable?
response= y axis
explanatory = x axis
this is only a convention for keeping track of which variable we suspect is which
What factors do you use to evaluate the relationship of variables on a scatterplot?
direction (neg/pos)
shape (linear/curved)
strength (strong/weak)
outliers
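Direction and strength can also be summarized numerically. The sketch below assumes the Pearson correlation coefficient as the measure, which these cards do not cover, and uses made-up data with hypothetical variable names. It only reflects linear relationships, so shape and outliers still need a look at the plot itself.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, magnitude gives linear strength."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: explanatory variable on x, response on y.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 60, 61, 70, 77]

r = pearson_r(hours_studied, exam_score)
direction = "positive" if r > 0 else "negative"
print(direction, round(r, 2))
```

A value of r near +1 or -1 suggests a strong linear relationship; a value near 0 suggests a weak one.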
histogram
data are binned into intervals, and bar heights represent the number of cases that fall into each interval
provides a view of the data density
useful for describing shape of distribution
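A minimal sketch of the binning step, using made-up data and a hypothetical helper named bin_counts. It prints a text histogram where each bar's length is the number of cases in that interval.

```python
from collections import Counter

def bin_counts(values, bin_width, start=0):
    """Count how many values fall into each interval [left, left + bin_width)."""
    counts = Counter()
    for v in values:
        left = start + ((v - start) // bin_width) * bin_width
        counts[left] += 1
    return dict(sorted(counts.items()))

# Hypothetical data set.
data = [1, 2, 2, 3, 5, 6, 6, 7, 7, 8, 12, 19]

for left, count in bin_counts(data, bin_width=5).items():
    # Each '#' stands for one case, so bar length shows data density.
    print(f"[{left:>2}, {left + 5:>2}) {'#' * count}")
```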
data density
where the data are relatively more common
considerations for outliers
don’t handle outliers “naively” by discarding them without careful consideration
check data for data entry mistakes
outliers can be very interesting
left skewed
-data sets with longer tail on left
-negative end tail
right skewed
-data sets with longer tail on the right
-data trails off to the right
-skewed to the high end
-skewed to the positive end
symmetric
data sets with roughly equal trailing off in both directions are called symmetric
long tail
when data trail off in one direction the distribution has a long tail
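A common numeric hint of skew, not stated on these cards, is that a long tail tends to pull the mean away from the median toward the tail. This is a rule of thumb rather than a guarantee. A sketch with made-up right-skewed data:

```python
import statistics

# Hypothetical right-skewed data: most values are small, a few trail off to the right.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 15, 30]

mean = statistics.mean(right_skewed)
median = statistics.median(right_skewed)

# For this data set the long right tail pulls the mean above the median.
print(f"mean={mean:.2f}, median={median}")
```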
define modality and state examples
a mode is a prominent peak in the distribution; a peak that is less prominent or made up of only a few data points is not counted
0 prominent peaks: uniform
1 peak: unimodal
2 peaks: bimodal
3+ peaks: multimodal
how does bin width alter the story the data is telling in a histogram?
too wide = might lose interesting details
too narrow = might be difficult to get an overall sense of the distribution
-ideal width depends on data you are working with
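The effect of bin width can be seen by counting the same data with different widths. This is a sketch with made-up bimodal data and a hypothetical helper named histogram_counts, which assumes non-negative integer values.

```python
def histogram_counts(values, bin_width):
    """Return counts per bin, where bin k covers [k * bin_width, (k + 1) * bin_width)."""
    n_bins = max(values) // bin_width + 1
    counts = [0] * n_bins
    for v in values:
        counts[v // bin_width] += 1
    return counts

# Hypothetical bimodal data: one cluster near 2-4, another near 12-14.
data = [2, 3, 3, 3, 4, 12, 13, 13, 13, 14]

wide = histogram_counts(data, bin_width=20)   # one bin hides both clusters
medium = histogram_counts(data, bin_width=5)  # two clusters are visible
narrow = histogram_counts(data, bin_width=1)  # many bins, most of them empty

print(wide)
print(medium)
print(narrow)
```

With the widest bins all ten cases land in a single bar, so the two clusters disappear; with the narrowest bins the counts are spread over fifteen bars.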
dot plot
especially useful when individual observations are of interest, but can get too busy as the sample size increases
box plot
useful for displaying the median, interquartile range (IQR), and outliers; not good for showing modality
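The numbers behind a box plot can be sketched with Python's statistics module. The data are made up, and the 1.5 * IQR rule for flagging outliers is an assumed convention that these cards do not state. Quartile values also depend on the interpolation method; "inclusive" is just one choice.

```python
import statistics

# Hypothetical data with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]

# quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr, outliers)
```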
intensity map
showing data in context, often geographical, using different layers
center of distribution: name and define the 3 types
mean: arithmetic average
mode: most frequent observation
median: midpoint of the distribution(50th percentile)
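A short sketch of the three measures using Python's statistics module on a made-up sample:

```python
import statistics

# Hypothetical sample of observations.
sample = [3, 5, 5, 6, 8, 9, 13]

print("mean:  ", statistics.mean(sample))    # arithmetic average
print("median:", statistics.median(sample))  # midpoint (50th percentile)
print("mode:  ", statistics.mode(sample))    # most frequent observation
```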
sample statistic
-a value such as the mean, median, or mode calculated from a sample
-written with letters from the Latin alphabet
point estimate
involves the use of sample data to calculate a single value which is to serve as a “best guess” or “best estimate” of an unknown population parameter (for example, the population mean).
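A sketch of the idea with made-up numbers. Both the population and the sample are hypothetical, and in practice the population mean would be unknown.

```python
import statistics

# Hypothetical population; in practice it is usually too large to measure in full.
population = [12, 15, 11, 18, 20, 14, 16, 13, 17, 14]
mu = statistics.mean(population)  # population parameter

# Hypothetical sample drawn from that population.
sample = [15, 18, 14, 17]
x_bar = statistics.mean(sample)   # sample statistic

# x_bar is a single "best guess" for mu; here it is close but not identical.
print(f"point estimate x_bar={x_bar}, population mean mu={mu}")
```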
are Greek or Latin letters used to describe population parameters?
Greek letters
what is a helpful way to think of the mean?
think of the mean as the balancing point of the distribution
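The balancing-point idea can be checked numerically: deviations below the mean offset deviations above it, so they sum to zero. A sketch with made-up data:

```python
# Hypothetical observations.
data = [2, 4, 4, 5, 10]
mean = sum(data) / len(data)

# Deviations below the mean exactly offset deviations above it.
deviations = [x - mean for x in data]
print(mean, deviations, sum(deviations))
```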
mode
most common value in distribution
median
with the observations sorted: the middle value when the count is odd, or the average of the two middle values when the count is even
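The odd and even cases can be sketched by hand as follows; the function is illustrative, and the standard library's statistics.median does the same job.

```python
def median(values):
    """Middle value for an odd count; average of the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))     # odd count: middle of 1, 5, 7
print(median([7, 1, 5, 3]))  # even count: average of 3 and 5
```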
range
difference between min and max
not as reliable because it depends only on the two most extreme points
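A sketch of why the range is fragile: a single extreme value, here a made-up data-entry mistake, changes it drastically.

```python
def value_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)

# Hypothetical measurements, then the same data with 71 mistyped as 710.
clean = [64, 66, 68, 70, 71]
with_typo = [64, 66, 68, 70, 710]

print(value_range(clean))      # 7
print(value_range(with_typo))  # 646
```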