Feature Selection & Dimensionality Reduction Flashcards

Notes from Lecture 4 that may help with my exam

1
Q

What are the main aims for applying Feature Selection and Dimensionality Reduction techniques?

A
  • Reduce the impact of the Curse of Dimensionality
  • Remove redundant features to improve performance
  • Increase computational efficiency
  • Reduce the cost of acquiring new data
2
Q

What factors should be considered when using Feature Selection or Dimensionality Reduction methods?

A
  • The target dimension, i.e. the number of dimensions you wish to reduce down to
  • Interpretability (if the features must stay interpretable, use Feature Selection; if not, either approach works)
  • Feature correlations/dependency
  • Feature reliability and repeatability
  • Method choice (different methods result in different features being selected)
3
Q

What are the three popular Feature Selection methods?

A

Wrapper Methods - Search for the optimal feature subset that maximises decision-making performance.

Embedded Methods - Integrate Feature Selection into the model learning process.

Filter-based Methods - Select features based on feature relationships and statistics, rather than model performance.

4
Q

What are some examples of Wrapper Methods, with regard to Feature Selection?

A

Recursive Feature Elimination
Sequential Feature Selection

5
Q

What are some examples of Embedded Methods, with regard to Feature Selection?

A

Ridge Regression / ElasticNet
LASSO
Random Forest (feature ranking)

6
Q

What are some examples of Filter-based Methods, with regard to Feature Selection?

A

Univariate (ANOVA)
Chi-Square
Correlation/Variance

7
Q

How does the Forward Feature Selection method work?

A

Starts with an empty feature set and adds features one by one. The goal is to identify the subset of features that maximises the model's performance on a chosen evaluation metric, such as accuracy, F1 score, or mean squared error.

8
Q

What is the step-by-step breakdown of Forward Feature Selection?

A

Start with an empty feature set.
Then, for each feature:
- If adding it improves the evaluation metric beyond the best value seen so far, add it to the selected set and update that best value.
- If it doesn't, ignore it.
After iterating through all the features, return the selected subset as the features to be used (a minimal sketch follows).
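
A minimal Python sketch of this single-pass greedy loop; the dataset, model, and cross-validated accuracy metric are illustrative assumptions, not from the lecture:

  # Forward Feature Selection: start empty, add a feature only if it improves
  # the chosen evaluation metric (here: 5-fold cross-validated accuracy).
  from sklearn.datasets import load_breast_cancer
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  X, y = load_breast_cancer(return_X_y=True)
  model = LogisticRegression(max_iter=5000)

  selected = []        # start with an empty feature set
  best_score = 0.0     # best metric value seen so far

  for j in range(X.shape[1]):
      candidate = selected + [j]
      score = cross_val_score(model, X[:, candidate], y, cv=5).mean()
      if score > best_score:       # keep the feature only if it helps
          selected, best_score = candidate, score

  print(f"kept {len(selected)} features, CV accuracy = {best_score:.3f}")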

9
Q

What is the difference between Forward Feature Selection and Recursive Feature Elimination?

A

Forward Feature Selection starts with an empty set of features and adds them one by one.
Recursive Feature Elimination starts with the full set of features and removes them one by one if they add no value.
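
Both directions exist as scikit-learn built-ins; a sketch for contrast (the estimator and n_features_to_select=5 are illustrative assumptions):

  from sklearn.datasets import load_breast_cancer
  from sklearn.feature_selection import RFE, SequentialFeatureSelector
  from sklearn.linear_model import LogisticRegression

  X, y = load_breast_cancer(return_X_y=True)
  est = LogisticRegression(max_iter=5000)

  # Forward: grow the feature set from empty, one feature at a time
  forward = SequentialFeatureSelector(est, n_features_to_select=5,
                                      direction="forward").fit(X, y)

  # RFE: start from all features and recursively prune the weakest
  backward = RFE(est, n_features_to_select=5).fit(X, y)

  print("forward picked:", forward.get_support(indices=True))
  print("RFE kept:      ", backward.get_support(indices=True))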

10
Q

What is the LASSO method?

A

LASSO (Least Absolute Shrinkage and Selection Operator) is an embedded method: a regularisation technique used in regression analysis to enhance model performance. It introduces a penalty to the loss function to prevent overfitting and perform feature selection.

11
Q

How does LASSO work?

A

Lasso regression works by adding an L1 regularisation term to reduce the number of effective features in the feature space. This penalty encourages sparsity in the coefficient vector, causing some coefficients to shrink to 0, which removes certain features from the model.
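
A small sketch of this on synthetic data (the data and alpha value are assumptions; scikit-learn's Lasso minimises ||y - Xw||^2 / (2n) + alpha * ||w||_1):

  import numpy as np
  from sklearn.linear_model import Lasso

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 10))
  # only features 0 and 3 actually matter; the rest are noise
  y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

  lasso = Lasso(alpha=0.1).fit(X, y)
  print(np.round(lasso.coef_, 2))  # most coefficients come out exactly 0.0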

12
Q

What does the regularisation parameter in Lasso control?

A

The regularisation parameter (denoted gamma in the lecture; lambda and alpha are also common) controls the degree of regularisation, balancing model accuracy against sparsity. It is the integral part of LASSO, which identifies key features whilst also building a model.

13
Q

What happens if you set the regularisation parameter to a higher value in LASSO?

A

A higher regularisation parameter forces more of the feature weights in the equation to become exactly 0, which reduces the number of dimensions by removing features that are less relevant/important. Set it too high, however, and informative features can be eliminated as well, underfitting the model (see the sweep below).
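
Sweeping the strength (called alpha in scikit-learn's Lasso) on the synthetic data from the previous card illustrates this; the alpha grid is an arbitrary assumption:

  import numpy as np
  from sklearn.linear_model import Lasso

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 10))
  y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

  for alpha in (0.01, 0.1, 1.0, 3.0):
      n_kept = np.sum(Lasso(alpha=alpha).fit(X, y).coef_ != 0)
      # higher alpha -> fewer surviving features; too high removes useful ones
      print(f"alpha={alpha}: {n_kept} non-zero coefficients")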

14
Q

What does a Chi-Square test do?

A

A Chi-Square test assesses the independence of predictor and outcome events.
It is suitable for categorical features with categorical outcomes.
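
A minimal SciPy sketch (the smoker-vs-disease contingency counts are made up for illustration):

  from scipy.stats import chi2_contingency

  #                 disease  no disease
  table = [[30, 70],       # smoker
           [15, 85]]       # non-smoker
  chi2, p, dof, expected = chi2_contingency(table)
  print(f"chi2={chi2:.2f}, p={p:.4f}")  # small p -> evidence of dependence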

15
Q

What does a T-test do?

A

A T-test compares two groups statistically (binary class), and is used for continuous features.
It checks whether the means of the two groups differ from one another.
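
A minimal SciPy sketch (the two synthetic groups are assumptions):

  import numpy as np
  from scipy.stats import ttest_ind

  rng = np.random.default_rng(0)
  group_a = rng.normal(loc=5.0, scale=1.0, size=50)  # feature values, class A
  group_b = rng.normal(loc=5.8, scale=1.0, size=50)  # feature values, class B

  t, p = ttest_ind(group_a, group_b)
  print(f"t={t:.2f}, p={p:.4f}")  # small p -> the group means differ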

16
Q

What does ANOVA do?

A

ANOVA uses variance to test the relationship between categorical predictors and a continuous outcome response, e.g. using gender and age group to predict exam mark.
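
A minimal SciPy sketch of one-way ANOVA (the three age-group samples of exam marks are invented):

  import numpy as np
  from scipy.stats import f_oneway

  rng = np.random.default_rng(0)
  young = rng.normal(62, 8, size=40)    # exam marks per age group
  middle = rng.normal(66, 8, size=40)
  older = rng.normal(71, 8, size=40)

  f, p = f_oneway(young, middle, older)
  print(f"F={f:.2f}, p={p:.4f}")  # small p -> at least one group mean differs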

17
Q

What does a correlation test do?

A

A correlation test works for predictors and outcomes that are both continuous
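
A minimal SciPy sketch using Pearson correlation (the synthetic continuous data is an assumption):

  import numpy as np
  from scipy.stats import pearsonr

  rng = np.random.default_rng(0)
  x = rng.normal(size=100)                        # continuous predictor
  y = 0.7 * x + rng.normal(scale=0.5, size=100)   # continuous outcome

  r, p = pearsonr(x, y)
  print(f"r={r:.2f}, p={p:.4f}")  # small p -> reject "not correlated"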

18
Q

What is the null hypothesis for the Chi-Square test?

A

Null hypothesis - the two categorical variables are independent

19
Q

What is the null hypothesis for the T-test?

A

The means of the two groups are the same

20
Q

What is the null hypothesis for ANOVA?

A

The means across all categories/groups are equal (ANOVA tests this by comparing between-group variance to within-group variance).

21
Q

What is the null hypothesis for Correlation?

A

The two variables are not correlated.

22
Q

How do you disprove a null hypothesis?

A

You calculate the p-value; if it falls below a chosen significance threshold (e.g. 0.01), it can be used to reject the null hypothesis.

23
Q

What are the three popular Dimensionality Reduction Methods?

A

Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Manifold Learning (non-linear)

24
Q

How does PCA work?

A

It works by transforming the original data into a set of orthogonal (uncorrelated) features called Principal Components, which capture the most significant information in the original features.

25
Q

What is the step-by-step process for PCA?

A
  1. Standardise the data: X_std = (X - mean) / standard deviation
  2. Find the covariance matrix of the dataset, which describes how the features vary with respect to each other.
  3. Compute the eigenvalues and eigenvectors of the covariance matrix.
  4. Sort the eigenvectors by eigenvalue in descending order; the first in the order represents the first principal component.
  5. Select the top k principal components.
  6. Project the data onto the selected principal components (a worked sketch follows this list).
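
The sketch, implementing the six steps from scratch in NumPy (the random data and k=2 are assumptions):

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 5))

  # 1. Standardise: X_std = (X - mean) / standard deviation
  X_std = (X - X.mean(axis=0)) / X.std(axis=0)

  # 2. Covariance matrix of the standardised features
  cov = np.cov(X_std, rowvar=False)

  # 3. Eigenvalues and eigenvectors (eigh suits symmetric matrices)
  eigvals, eigvecs = np.linalg.eigh(cov)

  # 4. Sort by eigenvalue, descending; first column = first principal component
  order = np.argsort(eigvals)[::-1]
  eigvals, eigvecs = eigvals[order], eigvecs[:, order]

  # 5. Select the top k principal components
  k = 2
  W = eigvecs[:, :k]

  # 6. Project the data onto the new components
  X_pca = X_std @ W
  print(X_pca.shape)  # (100, 2)
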
26
Q

Why should you use PCA in Machine Learning?

A
  • It reduces the number of dimensions in the dataset
  • Noise reduction, by keeping only the principal components with the largest eigenvalues
  • Feature independence, whereby the new features produced by PCA are uncorrelated, which can be beneficial for algorithms
27
Q

What are some limitations of PCA?

A

PCA assumes that the data is linearly correlated, so it may not perform well for data with complex, non-linear relationships

The new components produced by PCA may not be easy to interpret

PCA is sensitive to the scaling of data, so it’s crucial to standardise the data used when features have different scales.

28
Q

How does LDA (Linear Discriminant Analysis) work?

A

It aims to find a linear combination of features that best separates the different classes in the data, maximising the separation between classes relative to the spread within each class.

29
Q

What is the step-by-step process for performing LDA?

A
  1. Compute the mean vector for each class, then compute the overall mean vector across all classes
  2. Compute the Between-Class Scatter Matrix
  3. Compute the Within-Class Scatter Matrix
  4. Compute the Fisher Criterion, which measures the ratio of between-class variance to within-class variance
  5. Solve for the projection matrix (a sketch follows this list)
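
The sketch, following these steps on a toy two-class problem in NumPy (the synthetic data is an assumption):

  import numpy as np

  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(0, 1, size=(50, 3)),   # class 0
                 rng.normal(2, 1, size=(50, 3))])  # class 1
  y = np.array([0] * 50 + [1] * 50)

  # 1. Per-class mean vectors and the overall mean
  classes = np.unique(y)
  mu = X.mean(axis=0)
  mu_c = {c: X[y == c].mean(axis=0) for c in classes}

  # 2./3. Between-class (S_B) and within-class (S_W) scatter matrices
  d = X.shape[1]
  S_B, S_W = np.zeros((d, d)), np.zeros((d, d))
  for c in classes:
      Xc = X[y == c]
      diff = (mu_c[c] - mu).reshape(-1, 1)
      S_B += len(Xc) * diff @ diff.T
      S_W += (Xc - mu_c[c]).T @ (Xc - mu_c[c])

  # 4./5. The Fisher criterion (between/within ratio) is maximised by the
  # top eigenvectors of S_W^-1 S_B, which form the projection matrix
  eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
  W = eigvecs[:, np.argsort(eigvals.real)[::-1][:1]].real  # top direction

  X_lda = X @ W   # project onto the discriminant axis
  print(X_lda.shape)  # (100, 1)
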
30
Q

What is the main difference between PCA and LDA?

A

PCA is unsupervised (it ignores class labels), whereas LDA is supervised (it uses class labels to maximise class separation).

31
Q

How does Manifold Learning work?

A

Manifold learning aims to learn the latent representation of the original data in lower dimensions.

32
Q

What is an example method of Manifold Learning?

A

PCA is (technically) a linear manifold learning method, but that doesn't work for all cases; popular non-linear methods include Isomap, Locally Linear Embedding (LLE), and t-SNE.
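
A sketch of one non-linear option, Isomap from scikit-learn, run on the bundled digits dataset (both choices are illustrative assumptions):

  from sklearn.datasets import load_digits
  from sklearn.manifold import Isomap

  X, _ = load_digits(return_X_y=True)           # 64-dimensional digit images
  X_2d = Isomap(n_components=2).fit_transform(X)
  print(X_2d.shape)                             # (1797, 2): a 2-D latent map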