Multivariate Data Flashcards
Multivariate Data
When multiple variables or features are measured for each observation
If data has many features, it may be referred to as ‘high dimensional’ data
Dimensionality Reduction
Tries to find a reduced number of features to represent the dataset while preserving its structure
Defines new axes (latent features) onto which the data are projected, allowing one to understand how the original variables link together
Finds a reduced number of features representing the dataset whilst preserving some of the structure in the dataset
Dimensionality Reduction Usefulness
Allows the visualisation of multivariate data
Analyse/interpret its features
Understanding the structure of the original variables in terms of these latent features
Summary of Multivariate Data
More than one variable/feature of the data for each observation
Widespread in many areas
With many variables/features, dimensionality reduction can help with visualisation/analysis/interpretation
Dimensionality reduction is a form of unsupervised learning
Matrix of Scatterplots
Diagonal elements = distribution of each variable
Off-diagonal below = scatterplot between each variable
Off-diagonal above = correlation coefficient of variables
All datapoints plotted
Working with many variables greatly reduces how easy the data are to visualise this way (see the sketch below)
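A minimal sketch of a scatterplot matrix with pandas; the data and variable names here are made up for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Illustrative dataset: var2 is built to correlate with var1
df = pd.DataFrame({
    "var1": x,
    "var2": 0.8 * x + rng.normal(scale=0.5, size=200),
    "var3": rng.normal(size=200),
})

# Diagonal: distribution of each variable; off-diagonal: pairwise scatterplots
pd.plotting.scatter_matrix(df, diagonal="hist", figsize=(6, 6))
plt.show()
```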
Heat Map
Visualise relationship between many variables
Colour relates to correlation coefficient
Example: representational similarity analysis, where heatmaps revealed distinct representations of value and of where subjects are attending
Structure in data can ‘pop out’
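A minimal sketch of a correlation heatmap with matplotlib, assuming a DataFrame like `df` from the previous sketch:

```python
import matplotlib.pyplot as plt

corr = df.corr()  # correlation coefficient between every pair of variables
fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_yticks(range(len(corr)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)
fig.colorbar(im, label="correlation coefficient")  # colour encodes correlation
plt.show()
```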
Calculating a Covariance/Correlation Matrix
Captures the relationships between all the different variables
Correlation matrix contains all of the correlations between all the different variables
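A minimal sketch of computing both matrices with NumPy, on illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))        # 100 observations x 3 variables

cov = np.cov(X, rowvar=False)        # 3x3 covariance matrix between variables
corr = np.corrcoef(X, rowvar=False)  # 3x3 correlation matrix: covariance of the
                                     # z-scored variables, bounded in [-1, 1]
```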
Summary of Visualisation
Multivariate data commonly visualised to examine how different variables correlate (covary) with each other
Heatmaps provide an intuitive way to examine structure when there are many variables
Covariance matrix is a matrix that contains the covariance of all variables with each other
Principal Component Analysis
Examines the correlations between all of the variables at once
The principal components found are determined by how correlated the different variables are
Regression fits a line by minimising the sum of squared residuals in the y axis, emphasising the difference between observed and predicted y
PCA instead minimises the Euclidean (perpendicular) distance from each data point to the axis, finding an axis that summarises all of the variables together rather than predicting one from the others
Principal Component 1
Axis explaining the most variance in the data
Regression vs PCA
Regression minimises residuals in y
PCA minimises the perpendicular Euclidean distance from each point to the axis
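A small sketch contrasting the two fits on illustrative 2D data; the slopes differ because the methods minimise different distances:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)

# Regression slope: minimises squared vertical residuals in y
slope_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# PC1 slope: minimises perpendicular distance (top eigenvector of covariance)
X = np.column_stack([x, y])
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = evecs[:, np.argmax(evals)]
slope_pca = pc1[1] / pc1[0]

print(slope_ols, slope_pca)  # the PCA slope is typically steeper than the OLS slope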
Variables Correlated
Describes the data using a single vector - this explains how much each of x and y contributes to the component
The defining axis states how much each original variable is contributing to PC1 (its loadings)
Some variance is not explained
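A minimal sketch of inspecting PC1's loadings and the variance left unexplained, using scikit-learn (assumed available; data are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
shared = rng.normal(size=100)
# Two correlated variables driven by a shared latent signal
X = np.column_stack([shared + rng.normal(scale=0.3, size=100),
                     shared + rng.normal(scale=0.3, size=100)])

pca = PCA(n_components=2).fit(X)
print(pca.components_[0])             # loadings: contribution of each variable to PC1
print(pca.explained_variance_ratio_)  # PC1 explains most, but not all, of the variance
```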
Principal Component 2
Vector explaining the most variance after the contribution of PC1 has been removed
Points in a different direction, orthogonal to PC1
Length of vector - amount of variance explained in original data
Final Principal Component
Resulting principal components depend on identifying the axes of covariance between the variables
Scree plot shows the variance explained by each principal component
PCA in Practice
- subtract the mean of each column, and divide by the standard deviation
- calculate the covariance matrix between these columns
- calculate the eigenvectors and eigenvalues of the covariance matrix
- sort the eigenvectors by their eigenvalues (largest first)
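A from-scratch sketch of these steps in NumPy, on illustrative data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4))    # 100 observations x 4 variables

# 1. Subtract each column's mean and divide by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix between the standardised columns
C = np.cov(Z, rowvar=False)

# 3. Eigenvectors and eigenvalues of the (symmetric) covariance matrix
evals, evecs = np.linalg.eigh(C)

# 4. Sort the eigenvectors by their eigenvalues, largest first
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

scores = Z @ evecs               # data projected onto the principal components
explained = evals / evals.sum()  # per-component variance, as shown in a scree plot
```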