Chapter 2 Flashcards
Scatter plots
Graphs that provide case-by-case views of data for two numerical variables
X-axis shows the explanatory variable, y-axis shows the response variable.
Can be helpful to quickly spot associations between variables
Dot plot
One-variable version of a scatter plot that shows the distribution of the data. The mean is usually shown
Histograms
Graph of data density. Groups data into bins displayed as bars, rather than showing individual data points.
When data trail off on one side, the distribution is said to be skewed in that direction, e.g. if the bars trail off to the right, it is right skewed, also described as having a long right tail
Distributions with roughly equal tails on each side are called symmetric
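As a rough sketch of how binning works, the Python below sorts observations into equal-width bins and prints one bar of # marks per bin. The data values and the bin width are made up for illustration.

```python
from collections import Counter

# Hypothetical data, invented for illustration: mostly small values, a few large ones.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 9, 14]
bin_width = 5  # assumed bin width

# Label each observation with the left edge of the bin [edge, edge + bin_width) it falls in.
counts = Counter((x // bin_width) * bin_width for x in data)

# Print one text bar per bin, from the lowest bin to the highest.
for start in sorted(counts):
    print(f"{start:>2}-{start + bin_width:<2} {'#' * counts[start]}")
```

The bars shrink toward the right (8, then 3, then 1 observation), so this made-up sample would be described as right skewed.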
Mode
A mode is a prominent peak in a distribution (histogram)
Unimodal - only one prominent peak
Bimodal - 2 prominent peaks
Multimodal - 3 or more prominent peaks
Deviation & variance
The distance of an observation from the mean = deviation.
x minus the mean of x
Variance is (roughly) the average of the squared deviations. Squaring gets rid of negatives.
(Deviations squared, added together) / (n - 1)
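A small worked example may help here. The sketch below computes the deviations and the sample variance by hand for a made-up data set, then compares the result with Python's built-in statistics.variance, which also divides by n - 1.

```python
import statistics

data = [2, 4, 4, 6, 9]  # hypothetical observations
n = len(data)
mean = sum(data) / n

# Deviation: each observation minus the mean.
deviations = [x - mean for x in data]

# Sample variance: squared deviations, added together, divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)

print(deviations)                 # [-3.0, -1.0, -1.0, 1.0, 4.0]
print(variance)                   # 7.0
print(statistics.variance(data))  # library cross-check of the hand calculation
```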
Standard deviation
The square root of the variance. Represents the typical deviation from the mean.
Usually about 70% of the data falls within 1 standard deviation of the mean and about 95% within 2 (a rule of thumb, not a strict rule)
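To see how the rule of thumb plays out, the sketch below computes the standard deviation of a made-up sample and the share of observations within 1 and 2 standard deviations of the mean. The helper name share_within is hypothetical, and with only ten invented values the shares are not expected to match 70% and 95% closely.

```python
import statistics

data = [3, 5, 5, 6, 6, 7, 7, 8, 9, 14]  # hypothetical observations
mean = statistics.mean(data)
sd = statistics.stdev(data)  # square root of the sample variance (n - 1 divisor)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

print(round(sd, 3))
print(share_within(1))  # 0.8
print(share_within(2))  # 0.9
```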
Box plots
Summarize data using 5 statistics and also plot outliers.
Median - thick line to separate data in half
Box spans the interquartile range (IQR), which measures the variability of the data. (The more variability in the data, the larger the std deviation and IQR).
Q1, the first quartile, is the 25th percentile, meaning 25% of the data falls below the line
Q3, the third quartile, is the 75th percentile
IQR = Q3 - Q1, the range covered by the middle 50% of the data
Whiskers capture data outside the box. They never extend more than 1.5*IQR beyond the box
Upper limit is Q3 + 1.5*IQR. Lower limit is Q1 - 1.5*IQR
Whiskers stop at the highest or lowest data point if the data do not reach this maximum.
Outliers are data points beyond the whiskers. Useful for identifying strong skew in the distribution, possible data collection/entry errors, & gaining insight into interesting properties of the data
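The box plot pieces can be sketched numerically. The example below uses made-up data and Python's statistics.quantiles with method="inclusive"; quartile conventions differ between textbooks and software, so that choice is an assumption and other tools may give slightly different Q1 and Q3.

```python
import statistics

data = [1, 3, 4, 5, 6, 7, 8, 9, 25]  # hypothetical observations

# Quartiles (assumed convention: the "inclusive" method).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers may reach at most 1.5*IQR beyond the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Each whisker stops at the most extreme data point inside the limits.
inside = [x for x in data if lower_limit <= x <= upper_limit]
lower_whisker, upper_whisker = min(inside), max(inside)

# Anything beyond the limits is plotted individually as an outlier.
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(q1, median, q3, iqr)           # 4.0 6.0 8.0 4.0
print(lower_whisker, upper_whisker)  # 1 9
print(outliers)                      # [25]
```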
Robust statistics
Statistics whose values are little affected by outliers, such as the median and IQR
Mean & std dev are highly influenced by extreme observations
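A quick comparison illustrates the difference. In the sketch below, the largest value of a made-up sample is replaced with an extreme one; the helper iqr is hypothetical and again assumes the "inclusive" quartile convention.

```python
import statistics

def iqr(values):
    """IQR using the 'inclusive' quartile method (an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [1, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical observations
with_outlier = data[:-1] + [100]     # same data, largest value made extreme

# Print each statistic before and after introducing the extreme value.
for name, stat in [("median", statistics.median), ("IQR", iqr),
                   ("mean", statistics.mean), ("std dev", statistics.stdev)]:
    print(name, stat(data), stat(with_outlier))
```

The median and IQR are identical for both samples, while the mean and standard deviation both grow noticeably.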
Transformation of data
Rescaling of data using a function - helpful for strongly skewed data where much of the data is clustered near zero
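As an illustration, the sketch below applies a base-10 log transformation to made-up, strongly right-skewed data. The log is only one common choice of function (an assumption here, since the card does not name one), and it requires all values to be positive.

```python
import math

# Hypothetical positive data: most values cluster near zero, a few are very large.
data = [0.5, 1, 2, 3, 5, 8, 40, 900]

# Log transformation: large values are pulled in, small values are spread out.
log_data = [math.log10(x) for x in data]

print([round(v, 2) for v in log_data])
print(max(data) - min(data))          # spread on the original scale
print(max(log_data) - min(log_data))  # spread on the log scale
```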
Intensity maps
Graphic of geographic data using colors to show values of a variable. Not helpful for getting or showing precise values; more useful for seeing trends
Contingency tables & plots
Summarizes data for 2 categorical variables; each value in the table is the number of times a particular combination of variable outcomes occurred
Row or column proportions in contingency tables
Row or column proportions give a fractional breakdown of one variable within another
Row proportions are computed as counts divided by their row totals; each value is the proportion (or percentage) of that row's cases falling in a given column
Column proportions are computed the same way, just using column totals instead
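The calculation can be sketched with a small made-up contingency table. The variable names, category labels, and counts below are all hypothetical.

```python
# Hypothetical contingency table (counts invented for illustration):
# rows = group, columns = outcome.
table = {
    "treatment": {"yes": 30, "no": 70},
    "control":   {"yes": 10, "no": 90},
}

# Row proportions: each count divided by its row total.
row_props = {
    row: {col: n / sum(cols.values()) for col, n in cols.items()}
    for row, cols in table.items()
}

# Column totals, needed for column proportions.
col_totals = {}
for cols in table.values():
    for col, n in cols.items():
        col_totals[col] = col_totals.get(col, 0) + n

# Column proportions: each count divided by its column total.
col_props = {
    row: {col: n / col_totals[col] for col, n in cols.items()}
    for row, cols in table.items()
}

print(row_props["treatment"]["yes"])  # 0.3  (30 of the 100 treatment cases)
print(col_props["treatment"]["yes"])  # 0.75 (30 of the 40 "yes" cases)
```

The same count (30) gives different proportions depending on whether it is compared with its row total or its column total, so it matters which one is reported.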
Stacked bar plots or side-by-side bar plots
Graphical display of contingency table information
Stacked bar plots show two variables in one bar
Mosaic plots
Visualization technique suitable for contingency tables that resembles a standardized stacked bar plot, with the added benefit of still showing the relative group sizes of the primary variable
Comparing numerical data across groups?
Using side-by-side box plots or hollow histograms