Week 2 - Visualising your data and models Flashcards
Model-in-the-data-space
Assessing model fit by plotting it on the data
Straightforward in low dimensions
However, challenging in high dimensions
Data-in-the-model-space
Plot the data using the model’s perspective to see how well it aligns with predictions
Scatterplot matrix
Showcases:
- Linear association (correlation)
- Clumping (groups of points separated from the rest)
- Clustering (dense concentrations of points that are not separated)
- Outliers
Scatterplot matrix (with supervised data)
When the data is supervised and has a response variable, always include the response in the matrix
This makes clusters and linear relationships clearer
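A minimal sketch using seaborn's pairplot, colouring by the response variable; seaborn, matplotlib and the built-in penguins example data are assumptions, not part of these notes:

```python
# Scatterplot matrix coloured by the response/class variable.
import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset("penguins").dropna()

# Colouring by "species" makes clusters and linear relationships easier to see.
sns.pairplot(penguins, hue="species",
             vars=["bill_length_mm", "bill_depth_mm",
                   "flipper_length_mm", "body_mass_g"])
plt.show()
```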
Scatterplot matrix drawbacks
Difficult to plot large numbers of variables
May miss outliers that are only visible in higher dimensions
Perception
The aspect ratio of scatterplots should be equal (square), because an unequal aspect ratio distorts the perception of correlation and association between variables
Parallel coordinate plot
x-axis is the variables and y-axis is their value
Each line represents an observation
Examine the direction and orientation of the lines to perceive multivariate relationships:
- Crossing lines indicate negative association
- Lines with same slope indicate positive association
- Outliers have different pattern
- Groups of lines with same pattern indicate clustering
Can plot many more variables than a scatter plot matrix
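A minimal sketch using pandas' parallel_coordinates; pandas, matplotlib and the seaborn penguins example data are assumptions:

```python
# Parallel coordinate plot: x-axis is the variables, y-axis their value,
# and each line is one observation.
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

penguins = sns.load_dataset("penguins").dropna()
num_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

# Scale each variable to [0, 1] first, since scaling matters (a drawback noted below).
scaled = penguins[num_cols].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
scaled["species"] = penguins["species"].values

parallel_coordinates(scaled, class_column="species", alpha=0.4)
plt.show()
```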
Parallel coordinate plot drawbacks
Disadvantages:
- Hard to follow lines
- Order of variables matters (perceived patterns can come from the chosen ordering, not the data)
- Scaling of variables matters
Scaling
Used to make variables comparable
Standardising -> (data value - variable mean) / variable standard deviation, giving mean 0 and standard deviation 1
Min-max -> (data value - variable minimum) / (variable maximum - variable minimum), giving values between 0 and 1
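A minimal sketch of both scalings with numpy (numpy and the example values are assumptions):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# Standardising: mean 0, standard deviation 1 (not bounded to [0, 1]).
standardised = (x - x.mean()) / x.std()

# Min-max scaling: maps the variable onto [0, 1].
minmax = (x - x.min()) / (x.max() - x.min())

print(standardised)  # approximately [-1.18, -0.51, 0.17, 1.52]
print(minmax)        # [0.   0.25 0.5  1.  ]
```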
High-dimensions
Each additional dimension adds a new orthogonal axis, and in machine learning data often lives in much higher dimensions
Data Matrix
n × p (n observations, p variables)
Projection Matrix
p × d
d can be chosen freely, but we most often project to d = 2 so the result can be plotted
Projected Data Matrix
n × d (the data in the lower-dimensional space)
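A minimal sketch of the projection step with numpy, using a random orthonormal p × d matrix and random placeholder data (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 100, 5, 2

X = rng.normal(size=(n, p))          # data matrix, n x p

# Random p x d projection matrix with orthonormal columns, built via QR.
A, _ = np.linalg.qr(rng.normal(size=(p, d)))

Y = X @ A                            # projected data matrix, n x d
print(Y.shape)                       # (100, 2)
```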
Dimension reduction
Chooses a projection that is optimal under some criterion (e.g. maximal variance, as in PCA)
Principal component analysis (PCA)
Produces a low-dimensional representation of a dataset
It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated
It is an unsupervised learning method
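A minimal sketch with scikit-learn's PCA; the notes do not name a library, so scikit-learn and the random placeholder data are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))        # placeholder data, n x p

# Standardise each variable first (see the details card below).
Z = StandardScaler().fit_transform(X)

pca = PCA()                          # unsupervised: no response variable used
scores = pca.fit_transform(Z)        # principal component scores

print(pca.components_.shape)         # each row is a linear combination of the p variables
print(pca.explained_variance_)       # variances, in decreasing order
```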
First principal component
The direction (linear combination of the variables) along which the data has maximal variance
It is important in PCA because:
- It is a linear combination of the original variables.
- It represents the axis along which the data is most spread out.
- It helps reduce dimensionality while preserving the most critical information.
Second principal component
The next most important direction in the data after the first principal component
Captures the second-highest variance in the data
It is orthogonal (perpendicular) to PC1, ensuring it provides new information not already explained by PC1
It helps in better understanding the structure of high-dimensional data
Total variance
The sum of the variances of all original features in the dataset. It represents the total amount of information (spread) present in the data before transformation.
PCA redistributes this variance among the principal components (PCs):
- The first principal component (PC1) captures the highest variance
- The second principal component (PC2) captures the next highest variance, and so on
- The sum of variances of all principal components equals the total variance of the original data (assuming no dimensionality reduction)
This is useful for deciding how many principal components to keep: retain those that explain most of the total variance
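A small numeric check of this claim, using numpy and scikit-learn on random placeholder data (both assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise each variable

pca = PCA().fit(X)                          # keep all components

total_variance = X.var(axis=0, ddof=1).sum()
print(total_variance)                       # total variance of the original data
print(pca.explained_variance_.sum())        # same value, redistributed across the PCs
```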
To choose k (the number of principal components)
Select k components that retain the largest proportion of variance in the data
Use the Proportion of Variance Explained (PVE) to measure how much variance each component captures
Examine the scree plot (variance explained vs. number of components) and look for the elbow point, where adding more components gives minimal additional variance
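A minimal sketch of computing PVE and a scree plot with scikit-learn and matplotlib (assumptions; with random placeholder data the elbow is flat, real data shows a clearer one):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA().fit(X)
pve = pca.explained_variance_ratio_          # proportion of variance explained per PC
cumulative = np.cumsum(pve)                  # reaches 1.0 when all PCs are kept

# Scree plot: look for the elbow where extra components add little variance.
plt.plot(range(1, len(pve) + 1), pve, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Proportion of variance explained")
plt.show()

k = int(np.argmax(cumulative >= 0.90)) + 1   # e.g. keep 90% of the total variance
print(k)
```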
Delectable details
PCA summarises linear relationships, and might not see other interesting dependencies.
Projection pursuit is a generalisation that can find other interesting patterns
Outliers can affect results, because directions toward outliers will appear to have larger variance
Scaling of variables matters, and typically you would first standardize each variable to have mean 0 and variance 1
Tour
Explores high-dimensional data by projecting it into lower-dimensional space and animating transitions between these projections.
Helps us understand structure and relationships in high-dimensional data
Projection matrix dimension stays the same, but the values within the projection matrix change over time, creating an animation effect
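A rough sketch of the idea in numpy: naively interpolate between two orthonormal bases and re-orthonormalise each frame. Real tours (e.g. the tourr package in R) use geodesic interpolation between planes, so this is only an illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, d = 200, 5, 2
X = rng.normal(size=(n, p))                             # placeholder data

def random_basis(p, d):
    # Random p x d projection matrix with orthonormal columns.
    Q, _ = np.linalg.qr(rng.normal(size=(p, d)))
    return Q

A_start, A_end = random_basis(p, d), random_basis(p, d)

for t in np.linspace(0, 1, 20):
    # The projection matrix stays p x d, but its values change over time.
    A, _ = np.linalg.qr((1 - t) * A_start + t * A_end)
    Y = X @ A                                           # one frame of the animation
    # ...draw Y here to animate the transition between projections
```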
Eigenvalues
Special numbers associated with a square matrix that tell us how the matrix stretches or shrinks vectors
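A minimal numpy illustration (the matrix values are an arbitrary example):

```python
import numpy as np

M = np.array([[3.0, 1.0],
              [1.0, 3.0]])

values, vectors = np.linalg.eig(M)
print(values)                # eigenvalues 4 and 2 (order may vary)

v = vectors[:, 0]            # eigenvector paired with values[0]
print(M @ v)                 # multiplying by M only stretches v...
print(values[0] * v)         # ...by the eigenvalue, with no rotation
```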