Lesson 3 Flashcards
scatterplots
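used for displaying the relationship between two numerical variables, e.g. how a suspected explanatory variable relates to a response variable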
Which axis should have response or explanatory variable?
response= y axis
explanatory = x axis
this is only a way to keep track of which one we suspect is which
What factors do you use to evaluate the relationship of variables on a scatterplot?
direction (neg/pos)
shape (linear/curved)
strength (strong/weak)
outliers
histogram
data are binned into intervals and heights represent number of cases that fall into each interval
provides a view of the data density
useful for describing shape of distribution
data density
where the data is relatively more common
considerations for outliers
don’t handle “naively” by discarding outliers without careful consideration
check data for data entry mistakes
outliers can be very interesting
left skewed
-data sets with longer tail on left
-negative end tail
right skewed
-data sets with longer tail on the right
-data trails off to the right
-skewed to the high end
-skewed to the positive end
symmetric
data sets with roughly equal trailing off in both directions are called symmetric
long tail
when data trail off in one direction the distribution has a long tail
define modality and state examples
a mode is represented by a prominent peak in the distribution; if a peak is less prominent or only represented by a few data points, it's not counted
1-unimodal
2-bimodal
0-uniform (no prominent peaks)
3+ -multimodal
how does bin width alter the story the data is telling in a histogram?
too wide= might lose interesting details
too narrow= might be difficult to get an overall sense of distribution
-ideal width depends on data you are working with
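A minimal sketch of the bin-width trade-off. The data values and the helper name histogram_counts are made up for illustration; a plotting library would normally draw the bars from counts like these.

```python
from collections import Counter

def histogram_counts(data, bin_width, start=0):
    """Count observations in bins [start + k*w, start + (k+1)*w)."""
    counts = Counter((x - start) // bin_width for x in data)
    return {start + k * bin_width: counts[k] for k in sorted(counts)}

values = [1, 2, 2, 3, 11, 12, 12, 13, 14]  # made-up data with two clusters

too_wide = histogram_counts(values, bin_width=20)    # one bar hides the two clusters
reasonable = histogram_counts(values, bin_width=5)   # two clusters are visible
too_narrow = histogram_counts(values, bin_width=1)   # many tiny bars, hard to read

print(too_wide)
print(reasonable)
print(too_narrow)
```

Each dictionary maps the left edge of a bin to the number of cases in it, so the same data tell three different stories depending on the width chosen.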
dot plot
especially useful when individual observations are of interest, but can get too busy as the sample size increases
box plot
useful for displaying outliers, median, interquartile range IQR, not good for showing modality
intensity map
showing data in context, often geographical, using different layers
center of distribution: name and define the 3 types
mean: arithmetic average
mode: most frequent observation
median: midpoint of the distribution(50th percentile)
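As a quick illustration, the three measures of center can be computed with Python's standard statistics module. The sample values are made up.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]  # made-up sample

mean = statistics.mean(scores)      # arithmetic average: 30 / 6
median = statistics.median(scores)  # average of the two middle values, 3 and 5
mode = statistics.mode(scores)      # most frequent value

print(mean, median, mode)
```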
sample statistic
-a value such as the mean, median or mode that is calculated from a sample
-written with letters from the Latin alphabet (e.g. x-bar, s)
point estimate
involves the use of sample data to calculate a single value which is to serve as a “best guess” or “best estimate” of an unknown population parameter (for example, the population mean).
are greek or latin letters used to describe population parameters?
greek letters
what is a helpful way to think of the mean?
think of the mean as the balancing point of the distribution
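One way to see the balancing idea: the deviations below the mean exactly offset the deviations above it, so they sum to zero. A small sketch with made-up values:

```python
values = [1, 2, 3, 10]  # made-up data

mean = sum(values) / len(values)
deviations = [x - mean for x in values]  # negative below the mean, positive above
total = sum(deviations)

print(mean, deviations, total)
```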
mode
most common value in distribution
median
with the data sorted: the middle value when the number of observations is odd, or the average of the two middle observations when it is even
range
difference between min and max
not as reliable because it uses only the two most extreme points
what are the four measures of spread
range
variance
standard deviation
inter-quartile range
variance- describe roughly what this is
roughly the average squared deviation from the mean
formula for calculating sample variance (s^2)
1)take the difference between each observation and the mean (the deviations)
2)square each deviation and add them up
3)then divide the sum by n-1 (the sample size minus one)
-the answer is in squared units of the data
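The steps above can be sketched in Python as follows. The sample is made up, and the result is checked against statistics.variance, which also divides by n-1.

```python
import statistics

data = [2, 4, 4, 6, 9]  # made-up sample
n = len(data)
mean = sum(data) / n

deviations = [x - mean for x in data]              # step 1: observation minus mean
sum_of_squares = sum(d ** 2 for d in deviations)   # step 2: square and add up
sample_variance = sum_of_squares / (n - 1)         # step 3: divide by n-1

print(sample_variance)
print(statistics.variance(data))  # library version, for comparison
```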
why do we divide by n-1, rather than n, when computing a sample’s variance?
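deviations measured from the sample mean tend to be a little too small, so dividing by n would underestimate the population variance on average; dividing by n-1 corrects for this and makes the statistic slightly more reliable and useful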
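A small check of this idea, under illustrative assumptions: the population is the six faces of a die, and every possible sample of size 3 (drawn with replacement) is listed. Averaged over all of those samples, dividing by n comes out too low, while dividing by n-1 matches the population variance.

```python
import itertools
import statistics

population = [1, 2, 3, 4, 5, 6]  # illustrative population
pop_variance = statistics.pvariance(population)  # true variance (divides by N)

# Every equally likely sample of size 3, drawn with replacement.
samples = list(itertools.product(population, repeat=3))

# Average estimate when each sample's sum of squares is divided by n.
avg_divide_by_n = statistics.fmean(
    [statistics.pvariance(s) for s in samples]
)
# Average estimate when it is divided by n-1.
avg_divide_by_n_minus_1 = statistics.fmean(
    [statistics.variance(s) for s in samples]
)

print(pop_variance, avg_divide_by_n, avg_divide_by_n_minus_1)
```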
letters used to describe sample variance and population variance
s^2 = sample variance
sigma ^2= population variance
why do we square the differences when calculating variance?
-get rid of negatives so that negatives and positives don’t cancel each other out when added together
-increase larger deviations more than smaller ones so that they are weighted more heavily
deviation
difference between an observation and the mean of its data set (can be negative or positive)
what is the symbol for standard deviation?
lowercase sigma (σ) for a population; s for a sample
standard deviation
roughly the typical distance of observations from the mean; has the same units as the data
formula= square root of the variance
Which distribution is more variable: one where more observations are clustered around the center, or one where fewer are clustered around the center?
Distributions where fewer observations are clustered around the center are more variable.
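To illustrate, here are two made-up samples with the same mean, one tightly clustered and one spread out. The standard deviation (square root of the sample variance) is larger for the spread-out one.

```python
import statistics

clustered = [4, 5, 5, 5, 6]   # made-up: most values sit at the center
spread_out = [1, 3, 5, 7, 9]  # made-up: same mean, values far from the center

sd_clustered = statistics.stdev(clustered)    # sqrt(2 / 4)
sd_spread_out = statistics.stdev(spread_out)  # sqrt(40 / 4)

print(statistics.mean(clustered), statistics.mean(spread_out))
print(sd_clustered, sd_spread_out)
```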
interquartile range
-range of the middle 50% of the data, distance between first quartile(25th percentile) and third quartile (75th percentile)
IQR = Q3-Q1
-best to use box plot to visualize
-IQR is more reliable for describing spread because it ignores the most extreme values, which could be outliers
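A sketch of the IQR calculation using statistics.quantiles from the Python standard library. Note that quartile conventions differ between textbooks and tools; this uses the module's default ("exclusive") method, so other software may give slightly different Q1 and Q3. The data and the helper name iqr are made up.

```python
import statistics

def iqr(data):
    """Q3 - Q1, using the default quartile method of statistics.quantiles."""
    q1, _q2, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # made-up data
with_outlier = clean[:-1] + [1000]           # same data, largest value replaced

print(iqr(clean), max(clean) - min(clean))
print(iqr(with_outlier), max(with_outlier) - min(with_outlier))
```

The range jumps when the extreme value is added, while the IQR does not move, since it only looks at the middle 50% of the data.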
robust statistics
measures on which extreme observations have little effect
e.g.
data              mean    median
1,2,3,4,5,6       3.5     3.5
1,2,3,4,5,1000    169     3.5
Here the median is more robust.
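The two data sets from this card can be checked directly with Python's statistics module:

```python
import statistics

original = [1, 2, 3, 4, 5, 6]
with_extreme = [1, 2, 3, 4, 5, 1000]  # last value replaced by an extreme one

mean_original = statistics.mean(original)
mean_extreme = statistics.mean(with_extreme)      # pulled far upward
median_original = statistics.median(original)
median_extreme = statistics.median(with_extreme)  # unchanged

print(mean_original, median_original)
print(round(mean_extreme, 1), median_extreme)
```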
which of median, mean, IQR and SD are robust, which are non-robust, and when is each better for describing data?
          robust    non-robust
center    median    mean
spread    IQR       SD, range
Median and IQR= better for skewed data or data with extreme observations
Mean and SD= better for roughly symmetric distributions
transformation
-rescaling of the data using a function
-when data are very strongly skewed, we sometimes transform them so they are easier to model
(natural) log transformation
-often applied when much of the data cluster near zero(relative to the larger values in the data set) and all values are positive
-can be easier to analyze because outliers become less extreme, data is more symmetric, less skewed
-make the relationship between the variables more linear and easier to model with simple methods
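A minimal sketch of a natural log transformation on made-up, strongly right-skewed data. The helper tail_ratio is a hypothetical, informal measure invented for this example (upper tail length divided by lower tail length); a value near 1 suggests rough symmetry.

```python
import math
import statistics

raw = [1, 2, 3, 5, 8, 13, 400]       # made-up, all positive, right skewed
logged = [math.log(x) for x in raw]  # natural log of each value

def tail_ratio(values):
    """(max - median) / (median - min): how lopsided the two tails are."""
    middle = statistics.median(values)
    return (max(values) - middle) / (middle - min(values))

print(tail_ratio(raw))     # very large: long right tail
print(tail_ratio(logged))  # much closer to 1: less skewed
```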
square root transformation
apply the square root (or another function, such as the inverse) to each value and plot the transformed data
goals of transformation
-see data structure differently
-reduce skew to assist in modeling
-straighten a nonlinear relationship in a scatterplot
what greek letter represents the population mean?
mu (μ)
what letter represents sample mean?
x-bar (x̄)
contingency table
-summarizes data for two categorical variables
-shows number of times a particular combination of variable outcomes occurred, along with column totals and row totals
bar plot
way to display single categorical variable
-x-axis shows categories
difference between histogram and bar plot
bar plot
-shows discrete or categorical variables
-x-axis shows categories
-bars can be rearranged
histogram
-depicts the frequency distribution of a numerical variable
-x-axis shows numbers
-the bars cannot be rearranged
row or column proportions
-counts divided by their row totals (row proportions) or column totals (column proportions)
e.g. 3496 renters/8505 total = 0.411
-can be displayed in a contingency table
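A sketch of how row proportions could be computed from a contingency table. The table, its labels and the helper name row_proportions are made up for illustration; only the 3496 and 8505 figures come from the card.

```python
def row_proportions(table):
    """Divide each count by the total of its row."""
    result = {}
    for row_label, counts in table.items():
        total = sum(counts.values())
        result[row_label] = {col: n / total for col, n in counts.items()}
    return result

# Made-up counts: rows are groups, columns are outcomes.
made_up_table = {
    "group A": {"yes": 30, "no": 70},
    "group B": {"yes": 45, "no": 5},
}
proportions = row_proportions(made_up_table)
print(proportions)

# The single proportion quoted on the card.
renter_proportion = round(3496 / 8505, 3)
print(renter_proportion)
```

Column proportions work the same way with the roles of rows and columns swapped.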
types of bar plots
stacked
standardized version of stacked
side-by-side
when is standardized stacked bar most useful and what is the downside
-useful if the primary variable in the stacked bar plot is relatively unbalanced
-downside is that we lose all sense of how many cases each of the bars represents
when is side-by-side most useful and what is the downside
- agnostic in their display about which variable if any represents the explanatory and which the response variable
-easy to see the number of cases in the group combinations
-downside is that it can require more horizontal space
-downside is that it can be difficult to discern if there is an association between two variables if two groups are of very different sizes
when is a stacked bar graph the most useful
- when one variable is the explanatory variable and the other is the response
mosaic plot
plot suitable for contingency tables that resembles a standardized stacked bar plot with the benefit that we still see the relative group sizes of the primary variable as well
-column widths along the x-axis are proportional to the relative group sizes of the primary variable
-each column is split along the y-axis according to the second variable
why is a pie chart less useful than a bar plot
it can be difficult to see details in a pie chart which are more obvious in a bar plot, especially for comparing groups
side-by-side box plot
good for comparing across groups
hollow histograms
compare numerical data across groups
-outlines of histograms of each group put on the same plot
independence model
the model in which the variables are independent and any observed difference is due to chance
alternative model
the model in which the variables are not independent, so the observed difference is not due to chance alone
simulation
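re-randomizing the data many times to see how large a difference chance alone typically produces, then comparing that with the observed result
e.g. take 20 notecards to represent 20 subjects, shuffle them and deal them into two groups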
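The notecard idea can be sketched as a simulation. Everything specific here is an assumption for illustration: the 11 successes among 20 subjects, the two groups of 10, the observed split of 8 versus 3, and the helper name shuffle_differences. Differences are kept as counts of successes to avoid rounding issues.

```python
import random

def shuffle_differences(cards, group_size, reps, seed=0):
    """Shuffle the cards repeatedly and record the difference in the
    number of successes between the two dealt groups."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    deck = list(cards)
    differences = []
    for _ in range(reps):
        rng.shuffle(deck)
        group_1 = deck[:group_size]
        group_2 = deck[group_size:]
        differences.append(sum(group_1) - sum(group_2))
    return differences

# 20 notecards: 1 = success, 0 = failure (made-up outcomes).
cards = [1] * 11 + [0] * 9
observed_difference = 8 - 3  # made-up observed result: 8 vs 3 successes

differences = shuffle_differences(cards, group_size=10, reps=10_000)
share_as_extreme = sum(
    d >= observed_difference for d in differences
) / len(differences)
print(share_as_extreme)
```

If only a small share of the shuffles produce a difference at least as large as the observed one, the independence model looks like a poor explanation for the data.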
statistical inference
a field of statistics concerned with evaluating whether observed differences are real or due to chance