Lecture 4 Flashcards
Low-Dimensional Visualization
What are the main reasons for using data visualization?
To explore data and uncover patterns or associations.
To effectively communicate findings.
To detect errors or outliers visually.
Example: Using a histogram to spot an unusually high data value (outlier) in height measurements.
What are the 3 core principles of the Grammar of Graphics?
Separation of data and aesthetics: Define how data is mapped (e.g., color, size).
Plot element definition: Specify visual components like points, lines, or bars.
Layer composition: Combine layers to build a plot.
Extra: This concept, developed by Leland Wilkinson, inspired the ggplot2 package in R.
Name the 6 main layers in ggplot2.
Data: Input data.
Aesthetics (aes): Mapping variables to visual features like color, x/y positions.
Geometric objects (geom): Defines the plot type (e.g., geom_point, geom_bar).
Scales: Control how data values map to visual values (e.g., scale_x_log10).
Facets: Creates subplots for different subsets of data (facet_grid).
Theme: Sets plot styles (e.g., axis labels, grid lines).
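A minimal sketch of how the six layers can be stacked in one plot. It assumes the mpg data frame that ships with ggplot2; the choice of variables is only illustrative.

```r
library(ggplot2)

# Data + aesthetics: mpg is bundled with ggplot2
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +          # geometric object: one point per car
  scale_x_log10() +       # scale: log-transform the x axis
  facet_grid(. ~ drv) +   # facets: one panel per drive type
  theme_minimal()         # theme: non-data styling

p
```

Each `+` adds one layer, so a layer can be removed or swapped without touching the rest.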
How do histograms visualize data?
They show the frequency of data values across intervals (bins).
Example:
Command: geom_histogram(bins=10)
Adjust bins to control the granularity of the plot.
Tip: Histograms are good for showing distributions, but the choice of bins can hide or exaggerate patterns.
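A small sketch of how the bin count changes the picture. The heights are simulated (an assumption, not lecture data), with one implausible value added to play the role of an outlier.

```r
library(ggplot2)

set.seed(1)
# Simulated heights in cm plus one implausible value (hypothetical data)
heights <- data.frame(height_cm = c(rnorm(200, mean = 170, sd = 8), 250))

# Few bins: coarse view of the distribution
p_coarse <- ggplot(heights, aes(x = height_cm)) + geom_histogram(bins = 10)

# Many bins: finer view, the isolated bar near 250 stands out
p_fine <- ggplot(heights, aes(x = height_cm)) + geom_histogram(bins = 40)

p_coarse
p_fine
```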
What are density plots, and how do you control their appearance?
Density plots smooth data distributions using kernel density estimation.
Command Example:
geom_density(bw=0.5) (controls the bandwidth for smoothness).
Tip: Interpret with caution, because bandwidth strongly affects the curve: too large hides modes, too small adds spurious bumps.
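A sketch of the bandwidth effect, assuming the faithful data set that ships with base R (eruption durations, which have two modes). The two bw values are arbitrary choices for illustration.

```r
library(ggplot2)

# Large bandwidth: heavy smoothing
p_smooth <- ggplot(faithful, aes(x = eruptions)) + geom_density(bw = 0.5)

# Small bandwidth: little smoothing, a much bumpier curve
p_wiggly <- ggplot(faithful, aes(x = eruptions)) + geom_density(bw = 0.05)

p_smooth
p_wiggly
```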
What are the key elements of a box plot?
Median: Center line of the box.
Quartiles (Q1 & Q3): Edges of the box (25th and 75th percentiles).
Whiskers: Extend to the most extreme data points that lie within 1.5 times the interquartile range (IQR) of the box edges.
Outliers: Points beyond the whiskers.
Extra: Box plots are not ideal for multimodal or discrete data.
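The elements above can be computed by hand. This is a sketch in base R on a made-up vector, using R's default quantile definition; other quantile definitions give slightly different box edges.

```r
x <- c(2, 3, 4, 5, 6, 7, 8, 30)  # hypothetical data with one extreme value

q <- unname(quantile(x, c(0.25, 0.5, 0.75)))  # Q1, median, Q3
iqr <- q[3] - q[1]

# Fences: 1.5 * IQR beyond the box edges
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Anything outside the fences is drawn as an outlier
outliers <- x[x < lower_fence | x > upper_fence]

print(c(q1 = q[1], median = q[2], q3 = q[3]))
print(c(whisker_low = whisker_low, whisker_high = whisker_high))
print(outliers)
```

Note that the whiskers stop at actual data values (here 2 and 8), not at the fences themselves.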
What do scatter plots show?
They show relationships between two continuous variables.
Example: Comparing life expectancy vs. GDP using:
geom_point(aes(x = gdpPercap, y = lifeExp)).
Enhancements:
Use color or size to add more dimensions (aes(color=continent, size=pop)).
Log scaling helps when values span several orders of magnitude (scale_x_log10()).
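Putting the pieces of this card together, as a sketch. It assumes the gapminder package is installed and that its gapminder data frame is the source of the gdpPercap, lifeExp, continent and pop columns; the alpha value is an arbitrary choice.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop), alpha = 0.5) +  # two extra dimensions
  scale_x_log10()  # GDP per capita spans several orders of magnitude

p
```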
What is the purpose of a Q-Q plot?
To compare the distribution of data to a theoretical distribution (e.g., normal or uniform).
Command: geom_qq(distribution = stats::qunif)
Diagonal line (geom_abline) represents perfect alignment between distributions.
Tip: Use Q-Q plots to check normality before applying statistical tests that assume it.
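A sketch of a Q-Q plot against the uniform distribution. The data are simulated here (an assumption) so that the points are expected to fall near the diagonal; geom_qq() needs the sample aesthetic to know which variable to compare.

```r
library(ggplot2)

set.seed(42)
d <- data.frame(value = runif(500))  # simulated uniform sample (hypothetical)

p <- ggplot(d, aes(sample = value)) +
  geom_qq(distribution = stats::qunif) +   # sample vs. theoretical quantiles
  geom_abline(slope = 1, intercept = 0)    # perfect agreement

p
```

For a normality check, geom_qq() with its default distribution plus geom_qq_line() plays the same role.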
When should line plots be used?
To show connections or trends over time (e.g., unemployment rate over years).
Command: geom_line(aes(x = date, y = unemploy/pop)).
What is a violin plot, and when is it useful?
A combination of a box plot and a density plot: it shows the full shape of the distribution for each group.
Ideal for multimodal data.
Command: geom_violin().
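A sketch of why violins help with multimodal data. Both groups are simulated (hypothetical data): one has a single mode, the other has two. A narrow box plot is drawn inside each violin for comparison.

```r
library(ggplot2)

set.seed(7)
d <- data.frame(
  group = rep(c("unimodal", "bimodal"), each = 300),
  value = c(rnorm(300, mean = 0, sd = 1),       # one mode
            rnorm(150, mean = -2, sd = 0.5),    # two modes ...
            rnorm(150, mean = 2, sd = 0.5))     # ... at -2 and 2
)

p <- ggplot(d, aes(x = group, y = value)) +
  geom_violin() +               # density shape per group
  geom_boxplot(width = 0.1)     # box plot summary on top

p
```

The box alone summarizes the bimodal group with a median near 0, where there is little data; the violin shows the two separate modes.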
What do bar plots show?
Quantitative values per category (e.g., number of countries per continent).
Command: geom_bar() counts rows per category; use geom_bar(stat = 'identity') when the values are already computed.
Tip: Add error bars with geom_errorbar() to show uncertainty.
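A sketch of both uses, assuming the mpg data frame from ggplot2. The summary table (mean highway mileage and its standard error per class) is computed here only for illustration.

```r
library(ggplot2)

# Counts per category: geom_bar() counts rows by default
p_counts <- ggplot(mpg, aes(x = class)) + geom_bar()

# Precomputed values per category: mean and standard error of hwy
m <- tapply(mpg$hwy, mpg$class, mean)
s <- tapply(mpg$hwy, mpg$class, function(v) sd(v) / sqrt(length(v)))
summ <- data.frame(class = names(m), mean = as.numeric(m), se = as.numeric(s))

p_means <- ggplot(summ, aes(x = class, y = mean)) +
  geom_bar(stat = "identity") +                       # plot the values as given
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.2)

p_counts
p_means
```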
What is a scatterplot matrix?
A grid of scatter plots that shows relationships between several variables.
Command: ggpairs(mpg, columns = c('displ', 'cyl', 'cty', 'hwy')) from the GGally package.
What is the purpose of a 2D density plot?
It shows point density across a 2D space, useful for large datasets.
Command: geom_hex().
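A sketch on a large data set, assuming the diamonds data frame from ggplot2 (about 54,000 rows) and that the hexbin package, which geom_hex() relies on, is installed. The bin count is an arbitrary choice.

```r
library(ggplot2)

# Each hexagon is colored by the number of points that fall inside it
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex(bins = 40)

p
```

A plain geom_point() on the same data would overplot heavily, which is the problem binning avoids.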
What are two important principles for data visualization?
Show raw data when possible: Avoid over-smoothing or hiding outliers.
Maximize the data-ink ratio: Present data with minimal visual clutter.
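One way to apply both principles, as a sketch that assumes the mpg data frame from ggplot2: the raw observations are drawn on top of the summary, and a plain theme removes non-data decoration. The jitter width and transparency are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot(outlier.shape = NA) +                    # summary; outliers hidden so they are not drawn twice
  geom_jitter(width = 0.2, height = 0, alpha = 0.4) +   # every raw observation
  theme_minimal()                                       # less non-data ink

p
```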