2 Data Visualizations Flashcards
4 Basic Visualization Technique Categories
- Array plots
- Scatter plots
- Histograms
- Graphs
Array Plots
Rows are data points (instances)
Columns are numerical features
Each Grid element is colored based on the feature value
Uses a color map and color bar (often)
What does it mean when values on an array plot are not colored?
The data contain missing values (here set to negative infinity), which the color map leaves uncolored.
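A minimal text-only sketch of the array-plot idea, assuming missing values are stored as None or as non-finite numbers. The names color_levels and render, and the five text "shades" standing in for a color map, are hypothetical and chosen only for illustration.

```python
import math

def color_levels(rows, n_levels=5):
    """Map each value to a color level 0..n_levels-1; missing values become None."""
    # Assumption: missing values are None or non-finite (e.g. -inf, NaN).
    finite = [v for row in rows for v in row
              if v is not None and math.isfinite(v)]
    lo, hi = min(finite), max(finite)
    span = (hi - lo) or 1.0  # avoid dividing by zero if all values are equal
    grid = []
    for row in rows:
        levels = []
        for v in row:
            if v is None or not math.isfinite(v):
                levels.append(None)  # left uncolored
            else:
                level = int((v - lo) / span * n_levels)
                levels.append(min(level, n_levels - 1))  # max value goes in top level
        grid.append(levels)
    return grid

def render(grid, shades=".:+*#"):
    """Draw the grid as text: one shade per level, a blank for missing values."""
    return "\n".join(
        "".join(" " if lv is None else shades[lv] for lv in row) for row in grid
    )

# Rows are instances, columns are numerical features.
data = [[0.0, 1.0, 2.0],
        [4.0, float("-inf"), 3.0]]
print(render(color_levels(data)))
```

The blank cell in the second row is the missing value. Because the levels are scaled over the whole array, a feature on a much larger scale than the others would occupy the top levels alone, which is how normalization issues show up.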
Strengths of Array Plots
Reveals qualitative information about dataset structure
Helps detect missing values, normalization issues, and batch differences quickly
Limitations of Array Plots
Can become overwhelming with large datasets.
Lacks precise information about value distributions or feature correlations.
How do array plots help detect missing values, normalization issues, and batch differences?
- Missing values: Appear as distinct color gaps or unusual patterns.
- Normalization issues: Features with different scales will have inconsistent color ranges.
- Batch differences: Different groups of data points may show noticeably different color patterns (each group has its own characteristic coloring), indicating batch effects.
Scatter Plots
Consider two features at a time to detect potential correlation, displaying the values of the two features on the x-axis and y-axis respectively.
How to augment scatter plots using transparency
Setting transparency according to density (how many data points fall in a region of the plot) makes it easier to see where most of the data points lie.
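One possible way to derive a per-point opacity from density is sketched below, assuming density is estimated by counting points per square grid cell and that denser cells should be drawn more opaque. The function name density_alphas and the parameters cell and min_alpha are hypothetical. A common alternative is to give every point the same low opacity and let overlapping points darken dense regions.

```python
from collections import Counter

def density_alphas(points, cell=1.0, min_alpha=0.1):
    """Return one opacity in [min_alpha, 1] per point, based on its cell's count."""
    # Assign each (x, y) point to a square grid cell of side length `cell`.
    cells = [(int(x // cell), int(y // cell)) for x, y in points]
    counts = Counter(cells)
    max_count = max(counts.values())
    # Points in the densest cell get opacity 1; sparser cells fade toward min_alpha.
    return [min_alpha + (1.0 - min_alpha) * counts[c] / max_count for c in cells]

points = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.5), (5.0, 5.0)]
print(density_alphas(points))
```

The returned list could be passed as per-point opacities to whatever plotting library draws the scatter plot.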
Histograms
Focus on a single numerical feature in order to extract more information about that feature.
The y-axis gives the number of instances (count) having a particular feature value.
Feature values are indicated on the x-axis.
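The counting behind a histogram can be sketched as follows, assuming equal-width bins between the smallest and largest value. The function name histogram_counts and the sample values are hypothetical.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall in each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        # The largest value sits on the right edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]  # x-axis bin boundaries
    return counts, edges

values = [1, 2, 2, 3, 3, 3, 10]
counts, edges = histogram_counts(values, n_bins=3)
print(counts)  # y-axis: number of instances per bin
print(edges)   # x-axis: feature-value boundaries of the bins
```

In this sample, the single value 10 lands alone in the last bin, far from the rest, which is the kind of outlier a histogram makes visible.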
What are the strengths of histograms in data visualization?
Precisely shows the distribution of feature values (e.g., mean, variance, tailedness, and outliers).
Provides hints for preprocessing to reduce the impact of outliers (e.g., large spenders).
Weaknesses of histograms in data visualization
Does not show correlations between features.
Not suitable for high-dimensional data (requires a separate histogram for each feature).
Graph Visualization
Arrange nodes of the graph in a ring (2D layout) and draw lines between nodes connected by an edge (line).
This works well when the graph is sparse (few edges relative to the number of nodes).
If edge strength is “real-valued”, only draw a line if the edge value is above a certain threshold, or use different transparency levels.
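The ring layout and the thresholding of real-valued edges can be sketched as below. The names ring_layout and edges_to_draw, and the representation of edge strengths as a dictionary keyed by node pairs, are assumptions made for illustration.

```python
import math

def ring_layout(n_nodes, radius=1.0):
    """Place n_nodes at equal angles on a circle; returns a list of (x, y)."""
    return [
        (radius * math.cos(2 * math.pi * k / n_nodes),
         radius * math.sin(2 * math.pi * k / n_nodes))
        for k in range(n_nodes)
    ]

def edges_to_draw(weights, threshold):
    """Keep only the edges whose strength is above the threshold."""
    return sorted((i, j) for (i, j), w in weights.items() if w > threshold)

positions = ring_layout(4)
weights = {(0, 1): 0.9, (1, 2): 0.2, (2, 3): 0.6}  # hypothetical edge strengths
for i, j in edges_to_draw(weights, threshold=0.5):
    print(f"draw line from {positions[i]} to {positions[j]}")
```

Instead of dropping weak edges, the strength could be used as the line's opacity, which corresponds to the transparency option mentioned above.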
What is the purpose of low-dimensional embedding in data visualization?
It aims to construct a scatter plot where the x-axis and y-axis do not carry any specific meaning, but where distances/similarities between points faithfully represent distances/similarities in the original input space, so that clusters in the plot can be interpreted.
What is MDS (Multidimensional Scaling)?
Multidimensional scaling is a popular low-dimensional embedding technique that generates a low-dimensional vector for each instance. These vectors are optimized so that the distances between points in the low-dimensional space replicate the true distances between the corresponding instances.
What is Metric MDS?
Metric MDS is a specific type of MDS that focuses on preserving the distances from the original data as closely as possible. It aims to minimize the difference (or “stress”) between the true distances $d_{ij}$ and the distances in the lower-dimensional representation $\hat{d}_{ij}$.
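The quantity being minimized can be illustrated with a short sketch. It assumes Euclidean distances and takes stress to be the unnormalized sum of squared differences between true and embedded distances over all pairs; this is one common form and not necessarily the exact definition intended here. The names pairwise_distances and raw_stress are hypothetical.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix for a list of coordinate tuples."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def raw_stress(d_true, embedding):
    """Sum over pairs i < j of (true distance - embedded distance) squared."""
    d_low = pairwise_distances(embedding)
    n = len(embedding)
    return sum(
        (d_true[i][j] - d_low[i][j]) ** 2
        for i in range(n) for j in range(i + 1, n)
    )

original = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]    # instances in the input space
d_true = pairwise_distances(original)

good_2d = [(0, 0), (3, 0), (0, 4)]              # preserves every distance
poor_1d = [(0,), (3,), (4,)]                    # distorts one distance
print(raw_stress(d_true, good_2d))
print(raw_stress(d_true, poor_1d))
```

An MDS procedure would adjust the embedded coordinates to drive this value down; the sketch only evaluates it for two fixed candidate embeddings.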