Feature Selection & Dimensionality Reduction Flashcards

Notes from Lecture 4 that may help with my exam

1
Q

What are the main aims for applying Feature Selection and Dimensionality Reduction techniques?

A
  • Reduce the impact of the Curse of Dimensionality
  • Remove redundant features to improve performance
  • Increase computational efficiency
  • Reduce the cost of acquiring new data
2
Q

What factors should be considered when using Feature Selection or Dimensionality Reduction methods?

A
  • The target dimension, i.e. the number of dimensions you wish to reduce down to
  • Interpretability (if the features must stay interpretable, use Feature Selection; if not, either approach works)
  • Feature correlations/dependency
  • Feature reliability and repeatability
  • Method choice (different methods result in different features being selected)
3
Q

What are the three popular Feature Selection methods?

A

Wrapper Methods - Search for the optimal feature subset that maximises decision-making performance.

Embedded Methods - Integrate Feature Selection into the model learning process.

Filter-based Methods - Select features based on feature relationships and statistics, rather than model performance.

4
Q

What are some examples of Wrapper Methods, with regard to Feature Selection?

A

Recursive Feature Elimination
Sequential Feature Selection

5
Q

What are some examples of Embedded Methods, with regard to Feature Selection?

A

Ridge Regression / ElasticNet
LASSO
Random Forest (feature ranking)

6
Q

What are some examples of Filter-based Methods, with regard to Feature Selection?

A

Univariate (ANOVA)
Chi-Square
Correlation/Variance

7
Q

How does the Forward Feature Selection method work?

A

Starts with an empty feature set and adds features one by one. The goal is to identify the subset of features that maximises the model's performance on a chosen evaluation metric, such as accuracy, F1 score, or mean squared error.

8
Q

What is the step-by-step breakdown of Forward Feature Selection?

A

Start with an empty feature set.
Then, for each feature:
- If adding it improves the evaluation metric beyond the best value seen so far, add it to the selected set and update that best value.
- If it doesn't, ignore it.
After iterating through all the features, return the selected subset as the features to be used (a minimal sketch follows).
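
A minimal Python sketch of this single-pass greedy loop; the dataset, model, and cross-validated accuracy metric are illustrative assumptions, not from the lecture:

  # Forward Feature Selection: start empty, add a feature only if it improves
  # the chosen evaluation metric (here: 5-fold cross-validated accuracy).
  from sklearn.datasets import load_breast_cancer
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  X, y = load_breast_cancer(return_X_y=True)
  model = LogisticRegression(max_iter=5000)

  selected = []        # start with an empty feature set
  best_score = 0.0     # best metric value seen so far

  for j in range(X.shape[1]):
      candidate = selected + [j]
      score = cross_val_score(model, X[:, candidate], y, cv=5).mean()
      if score > best_score:       # keep the feature only if it helps
          selected, best_score = candidate, score

  print(f"kept {len(selected)} features, CV accuracy = {best_score:.3f}")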

9
Q

What is the difference between Forward Feature Selection and Recursive Feature Elimination?

A

Forward Feature Selection starts with an empty set of features and adds them one by one.
Recursive Feature Elimination starts with the full set of features and removes them one by one if they add no value.
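
Both directions exist as scikit-learn built-ins; a sketch for contrast (the estimator and n_features_to_select=5 are illustrative assumptions):

  from sklearn.datasets import load_breast_cancer
  from sklearn.feature_selection import RFE, SequentialFeatureSelector
  from sklearn.linear_model import LogisticRegression

  X, y = load_breast_cancer(return_X_y=True)
  est = LogisticRegression(max_iter=5000)

  # Forward: grow the feature set from empty, one feature at a time
  forward = SequentialFeatureSelector(est, n_features_to_select=5,
                                      direction="forward").fit(X, y)

  # RFE: start from all features and recursively prune the weakest
  backward = RFE(est, n_features_to_select=5).fit(X, y)

  print("forward picked:", forward.get_support(indices=True))
  print("RFE kept:      ", backward.get_support(indices=True))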

10
Q

What is the LASSO method?

A

LASSO (Least Absolute Shrinkage and Selection Operator) is an embedded method: a regularisation technique used in regression analysis to enhance model performance. It introduces a penalty to the loss function to prevent overfitting and perform feature selection.

11
Q

How does LASSO work?

A

Lasso regression works by adding an L1 regularisation term to reduce the number of effective features in the feature space. This penalty encourages sparsity in the coefficient vector, causing some coefficients to shrink to 0, which removes certain features from the model.
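
A small sketch of this on synthetic data (the data and alpha value are assumptions; scikit-learn's Lasso minimises ||y - Xw||^2 / (2n) + alpha * ||w||_1):

  import numpy as np
  from sklearn.linear_model import Lasso

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 10))
  # only features 0 and 3 actually matter; the rest are noise
  y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

  lasso = Lasso(alpha=0.1).fit(X, y)
  print(np.round(lasso.coef_, 2))  # most coefficients come out exactly 0.0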

12
Q

What does the regularisation parameter in Lasso control?

A

The regularisation parameter (denoted gamma in the lecture; lambda and alpha are also common) controls the degree of regularisation, balancing model accuracy against sparsity. It is the integral part of LASSO, which identifies key features whilst also building a model.

13
Q

What happens if you set the regularisation parameter to a higher value in LASSO?

A

A higher regularisation parameter forces more of the feature weights in the equation to become exactly 0, which reduces the number of dimensions by removing features that are less relevant/important. Set it too high, however, and informative features can be eliminated as well, underfitting the model (see the sweep below).
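
Sweeping the strength (called alpha in scikit-learn's Lasso) on the synthetic data from the previous card illustrates this; the alpha grid is an arbitrary assumption:

  import numpy as np
  from sklearn.linear_model import Lasso

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 10))
  y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

  for alpha in (0.01, 0.1, 1.0, 3.0):
      n_kept = np.sum(Lasso(alpha=alpha).fit(X, y).coef_ != 0)
      # higher alpha -> fewer surviving features; too high removes useful ones
      print(f"alpha={alpha}: {n_kept} non-zero coefficients")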

14
Q

What does a Chi-Square test do?

A

A Chi-Square test assesses the independence of predictor and outcome events.
It is suitable for categorical features with categorical outcomes.
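
A minimal SciPy sketch (the smoker-vs-disease contingency counts are made up for illustration):

  from scipy.stats import chi2_contingency

  #                 disease  no disease
  table = [[30, 70],       # smoker
           [15, 85]]       # non-smoker
  chi2, p, dof, expected = chi2_contingency(table)
  print(f"chi2={chi2:.2f}, p={p:.4f}")  # small p -> evidence of dependence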

15
Q

What does a T-test do?

A

A T-test compares two groups statistically (binary class), and is used for continuous features.
It checks whether the means of the two groups differ from one another.
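
A minimal SciPy sketch (the two synthetic groups are assumptions):

  import numpy as np
  from scipy.stats import ttest_ind

  rng = np.random.default_rng(0)
  group_a = rng.normal(loc=5.0, scale=1.0, size=50)  # feature values, class A
  group_b = rng.normal(loc=5.8, scale=1.0, size=50)  # feature values, class B

  t, p = ttest_ind(group_a, group_b)
  print(f"t={t:.2f}, p={p:.4f}")  # small p -> the group means differ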

16
Q

What does ANOVA do?

A

ANOVA uses variance to test the relationship between categorical predictors and a continuous outcome response, e.g. using gender and age group to predict exam mark.
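
A minimal SciPy sketch of one-way ANOVA (the three age-group samples of exam marks are invented):

  import numpy as np
  from scipy.stats import f_oneway

  rng = np.random.default_rng(0)
  young = rng.normal(62, 8, size=40)    # exam marks per age group
  middle = rng.normal(66, 8, size=40)
  older = rng.normal(71, 8, size=40)

  f, p = f_oneway(young, middle, older)
  print(f"F={f:.2f}, p={p:.4f}")  # small p -> at least one group mean differs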

17
Q

What does a correlation test do?

A

A correlation test works for predictors and outcomes that are both continuous
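
A minimal SciPy sketch using Pearson correlation (the synthetic continuous data is an assumption):

  import numpy as np
  from scipy.stats import pearsonr

  rng = np.random.default_rng(0)
  x = rng.normal(size=100)                        # continuous predictor
  y = 0.7 * x + rng.normal(scale=0.5, size=100)   # continuous outcome

  r, p = pearsonr(x, y)
  print(f"r={r:.2f}, p={p:.4f}")  # small p -> reject "not correlated"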

18
Q

What is the null hypothesis for the Chi-Square test?

A

Null hypothesis - the two categorical variables are independent

19
Q

What is the null hypothesis for the T-test?

A

The means of the two groups are the same

20
Q

What is the null hypothesis for ANOVA?

A

The means across all categories/groups are equal (ANOVA tests this by comparing between-group variance to within-group variance).

21
Q

What is the null hypothesis for Correlation?

A

The two variables are not correlated.

22
Q

How do you disprove a null hypothesis?

A

You calculate the p-value; if it falls below a chosen significance threshold (e.g. 0.01), it can be used to reject the null hypothesis.

23
Q

What are the three popular Dimensionality Reduction Methods?

A

Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Manifold Learning (non-linear)

24
Q

How does PCA work?

A

It works by transforming the original data into a set of orthogonal (uncorrelated) features called Principal Components, which capture the most significant information in the original features.

25
Q

What is the step-by-step process for PCA?

A
  1. Standardise the data: X_std = (X - mean) / standard deviation
  2. Find the covariance matrix of the dataset, which describes how the features vary with respect to each other.
  3. Compute the eigenvalues and eigenvectors of the covariance matrix.
  4. Sort the eigenvectors by eigenvalue in descending order; the first in the order represents the first principal component.
  5. Select the top k principal components.
  6. Project the data onto the selected principal components (a worked sketch follows this list).
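
The sketch, implementing the six steps from scratch in NumPy (the random data and k=2 are assumptions):

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 5))

  # 1. Standardise: X_std = (X - mean) / standard deviation
  X_std = (X - X.mean(axis=0)) / X.std(axis=0)

  # 2. Covariance matrix of the standardised features
  cov = np.cov(X_std, rowvar=False)

  # 3. Eigenvalues and eigenvectors (eigh suits symmetric matrices)
  eigvals, eigvecs = np.linalg.eigh(cov)

  # 4. Sort by eigenvalue, descending; first column = first principal component
  order = np.argsort(eigvals)[::-1]
  eigvals, eigvecs = eigvals[order], eigvecs[:, order]

  # 5. Select the top k principal components
  k = 2
  W = eigvecs[:, :k]

  # 6. Project the data onto the new components
  X_pca = X_std @ W
  print(X_pca.shape)  # (100, 2)
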
26
Q

Why should you use PCA in Machine Learning?

A
  • It reduces the number of dimensions in the dataset
  • Noise reduction, by keeping only the principal components with the largest eigenvalues
  • Feature independence, whereby the new features produced by PCA are uncorrelated, which can be beneficial for algorithms
27
Q

What are some limitations of PCA?

A

PCA assumes that the data is linearly correlated, so it may not perform well for data with complex, non-linear relationships

The new components produced by PCA may not be easy to interpret

PCA is sensitive to the scaling of data, so it’s crucial to standardise the data used when features have different scales.

28
Q

How does LDA (Linear Discriminant Analysis) work?

A

It aims to find a linear combination of features that best separates the different classes in the data, maximising the separation between classes relative to the spread within each class.

29
Q

What is the step-by-step process for performing LDA?

A
  1. Compute the mean vector for each class, then compute the overall mean vector across all classes
  2. Compute the Between-Class Scatter Matrix
  3. Compute the Within-Class Scatter Matrix
  4. Compute the Fisher Criterion, which measures the ratio of between-class variance to within-class variance
  5. Solve for the projection matrix (a sketch follows this list)
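
The sketch, following these steps on a toy two-class problem in NumPy (the synthetic data is an assumption):

  import numpy as np

  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(0, 1, size=(50, 3)),   # class 0
                 rng.normal(2, 1, size=(50, 3))])  # class 1
  y = np.array([0] * 50 + [1] * 50)

  # 1. Per-class mean vectors and the overall mean
  classes = np.unique(y)
  mu = X.mean(axis=0)
  mu_c = {c: X[y == c].mean(axis=0) for c in classes}

  # 2./3. Between-class (S_B) and within-class (S_W) scatter matrices
  d = X.shape[1]
  S_B, S_W = np.zeros((d, d)), np.zeros((d, d))
  for c in classes:
      Xc = X[y == c]
      diff = (mu_c[c] - mu).reshape(-1, 1)
      S_B += len(Xc) * diff @ diff.T
      S_W += (Xc - mu_c[c]).T @ (Xc - mu_c[c])

  # 4./5. The Fisher criterion (between/within ratio) is maximised by the
  # top eigenvectors of S_W^-1 S_B, which form the projection matrix
  eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
  W = eigvecs[:, np.argsort(eigvals.real)[::-1][:1]].real  # top direction

  X_lda = X @ W   # project onto the discriminant axis
  print(X_lda.shape)  # (100, 1)
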
30
Q

What is the main difference between PCA and LDA?

A

PCA is unsupervised (it ignores class labels), whereas LDA is supervised (it uses class labels to maximise class separation).

31
Q

How does Manifold Learning work?

A

Manifold learning aims to learn the latent representation of the original data in lower dimensions.

32
Q

What is an example method of Manifold Learning?

A

PCA is (technically) a linear manifold learning method, but that doesn't work for all cases; popular non-linear methods include Isomap, Locally Linear Embedding (LLE), and t-SNE.
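
A sketch of one non-linear option, Isomap from scikit-learn, run on the bundled digits dataset (both choices are illustrative assumptions):

  from sklearn.datasets import load_digits
  from sklearn.manifold import Isomap

  X, _ = load_digits(return_X_y=True)           # 64-dimensional digit images
  X_2d = Isomap(n_components=2).fit_transform(X)
  print(X_2d.shape)                             # (1797, 2): a 2-D latent map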