Week 2 - Visualising your data and models Flashcards
Model-in-the-data-space
Assessing model fit by plotting it on the data
Straightforward in low dimensions
However, challenging in high dimensions
Data-in-the-model-space
Plot the data using the model’s perspective to see how well it aligns with predictions
Scatterplot matrix
Showcases:
- Linear association (correlation)
- Clumping (groups of points separated from the rest)
- Clustering (dense concentrations of points that are not separated)
- Outliers
Scatterplot matrix (with supervised data)
When the data is supervised and has a response variable, always include the response in the matrix
This makes clusters and linear relationships clearer
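A minimal sketch using seaborn's pairplot, colouring by the response variable; seaborn, matplotlib and the built-in penguins example data are assumptions, not part of these notes:

```python
# Scatterplot matrix coloured by the response/class variable.
import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset("penguins").dropna()

# Colouring by "species" makes clusters and linear relationships easier to see.
sns.pairplot(penguins, hue="species",
             vars=["bill_length_mm", "bill_depth_mm",
                   "flipper_length_mm", "body_mass_g"])
plt.show()
```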
Scatterplot matrix drawbacks
Difficult to plot large numbers of variables
May miss outliers that are only visible in higher dimensions
Perception
The aspect ratio of scatterplots should be equal (square), because an unequal aspect ratio distorts the perception of correlation and association between variables
Parallel coordinate plot
x-axis is the variables and y-axis is their value
Each line represents an observation
Examine the direction and orientation of the lines to perceive multivariate relationships:
- Crossing lines indicate negative association
- Lines with same slope indicate positive association
- Outliers have different pattern
- Groups of lines with same pattern indicate clustering
Can plot many more variables than a scatter plot matrix
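A minimal sketch using pandas' parallel_coordinates; pandas, matplotlib and the seaborn penguins example data are assumptions:

```python
# Parallel coordinate plot: x-axis is the variables, y-axis their value,
# and each line is one observation.
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

penguins = sns.load_dataset("penguins").dropna()
num_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

# Scale each variable to [0, 1] first, since scaling matters (a drawback noted below).
scaled = penguins[num_cols].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
scaled["species"] = penguins["species"].values

parallel_coordinates(scaled, class_column="species", alpha=0.4)
plt.show()
```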
Parallel coordinate plot drawbacks
Disadvantages:
- Hard to follow lines
- Order of variables matters (perceived patterns can come from the chosen ordering, not the data)
- Scaling of variables matters
Scaling
Used to make variables comparable
Standardising -> (data value - variable mean) / variable standard deviation, giving mean 0 and standard deviation 1
Min-max -> (data value - variable minimum) / (variable maximum - variable minimum), giving values between 0 and 1
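A minimal sketch of both scalings with numpy (numpy and the example values are assumptions):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# Standardising: mean 0, standard deviation 1 (not bounded to [0, 1]).
standardised = (x - x.mean()) / x.std()

# Min-max scaling: maps the variable onto [0, 1].
minmax = (x - x.min()) / (x.max() - x.min())

print(standardised)  # approximately [-1.18, -0.51, 0.17, 1.52]
print(minmax)        # [0.   0.25 0.5  1.  ]
```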
High-dimensions
Each additional dimension adds a new orthogonal axis, and in machine learning data often lives in much higher dimensions
Data Matrix
n × p (n observations, p variables)
Projection Matrix
p × d
d can be chosen freely, but we most often project to d = 2 so the result can be plotted
Projected Data Matrix
n × d (the data in the lower-dimensional space)
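A minimal sketch of the projection step with numpy, using a random orthonormal p × d matrix and random placeholder data (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 100, 5, 2

X = rng.normal(size=(n, p))          # data matrix, n x p

# Random p x d projection matrix with orthonormal columns, built via QR.
A, _ = np.linalg.qr(rng.normal(size=(p, d)))

Y = X @ A                            # projected data matrix, n x d
print(Y.shape)                       # (100, 2)
```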
Dimension reduction
Chooses a projection that is optimal under some criterion (e.g. maximal variance, as in PCA)
Principal component analysis (PCA)
Produces a low-dimensional representation of a dataset
It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated
It is an unsupervised learning method
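A minimal sketch with scikit-learn's PCA; the notes do not name a library, so scikit-learn and the random placeholder data are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))        # placeholder data, n x p

# Standardise each variable first (see the details card below).
Z = StandardScaler().fit_transform(X)

pca = PCA()                          # unsupervised: no response variable used
scores = pca.fit_transform(Z)        # principal component scores

print(pca.components_.shape)         # each row is a linear combination of the p variables
print(pca.explained_variance_)       # variances, in decreasing order
```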
First principal component
The direction (linear combination of the variables) along which the data has maximal variance
It is important in PCA because:
- It is a linear combination of the original variables.
- It represents the axis along which the data is most spread out.
- It helps reduce dimensionality while preserving the most critical information.
Second principal component
The next most important direction in the data after the first principal component
Captures the second-highest variance in the data
It is orthogonal (perpendicular) to PC1, ensuring it provides new information not already explained by PC1
It helps in better understanding the structure of high-dimensional data
Total variance
The sum of the variances of all original features in the dataset. It represents the total amount of information (spread) present in the data before transformation.
PCA redistributes this variance among the principal components (PCs):
- The first principal component (PC1) captures the highest variance
- The second principal component (PC2) captures the next highest variance, and so on
- The sum of variances of all principal components equals the total variance of the original data (assuming no dimensionality reduction)
This is useful for deciding how many principal components to keep: retain those that explain most of the total variance
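A small numeric check of this claim, using numpy and scikit-learn on random placeholder data (both assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise each variable

pca = PCA().fit(X)                          # keep all components

total_variance = X.var(axis=0, ddof=1).sum()
print(total_variance)                       # total variance of the original data
print(pca.explained_variance_.sum())        # same value, redistributed across the PCs
```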
To choose k (the number of principal components)
Select k components that retain the largest proportion of variance in the data
Use the Proportion of Variance Explained (PVE) to measure how much variance each component captures
Examine the scree plot (variance explained vs. number of components) and look for the elbow point, where adding more components gives minimal additional variance
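A minimal sketch of computing PVE and a scree plot with scikit-learn and matplotlib (assumptions; with random placeholder data the elbow is flat, real data shows a clearer one):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA().fit(X)
pve = pca.explained_variance_ratio_          # proportion of variance explained per PC
cumulative = np.cumsum(pve)                  # reaches 1.0 when all PCs are kept

# Scree plot: look for the elbow where extra components add little variance.
plt.plot(range(1, len(pve) + 1), pve, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Proportion of variance explained")
plt.show()

k = int(np.argmax(cumulative >= 0.90)) + 1   # e.g. keep 90% of the total variance
print(k)
```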
Delectable details
PCA summarises linear relationships, and might not see other interesting dependencies.
Projection pursuit is a generalisation that can find other interesting patterns
Outliers can affect results, because directions toward outliers will appear to have larger variance
Scaling of variables matters, and typically you would first standardize each variable to have mean 0 and variance 1
Tour
Explores high-dimensional data by projecting it into lower-dimensional space and animating transitions between these projections.
Helps us understand structure and relationships in high-dimensional data
Projection matrix dimension stays the same, but the values within the projection matrix change over time, creating an animation effect
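A rough sketch of the idea in numpy: naively interpolate between two orthonormal bases and re-orthonormalise each frame. Real tours (e.g. the tourr package in R) use geodesic interpolation between planes, so this is only an illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, d = 200, 5, 2
X = rng.normal(size=(n, p))                             # placeholder data

def random_basis(p, d):
    # Random p x d projection matrix with orthonormal columns.
    Q, _ = np.linalg.qr(rng.normal(size=(p, d)))
    return Q

A_start, A_end = random_basis(p, d), random_basis(p, d)

for t in np.linspace(0, 1, 20):
    # The projection matrix stays p x d, but its values change over time.
    A, _ = np.linalg.qr((1 - t) * A_start + t * A_end)
    Y = X @ A                                           # one frame of the animation
    # ...draw Y here to animate the transition between projections
```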
Eigenvalues
Special numbers associated with a square matrix that tell us how the matrix stretches or shrinks vectors
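A minimal numpy illustration (the matrix values are an arbitrary example):

```python
import numpy as np

M = np.array([[3.0, 1.0],
              [1.0, 3.0]])

values, vectors = np.linalg.eig(M)
print(values)                # eigenvalues 4 and 2 (order may vary)

v = vectors[:, 0]            # eigenvector paired with values[0]
print(M @ v)                 # multiplying by M only stretches v...
print(values[0] * v)         # ...by the eigenvalue, with no rotation
```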