Dimensionality Reduction Flashcards
Curse of Dimensionality
Increasing the number of features does not always improve classification accuracy; in fact, it may make it worse
Two main routes to reduce dimensionality
Feature extraction
Feature Selection
Application of dimensionality reduction
Customer relationship management
Text Mining
Image retrieval
Microarray data analysis
Protein classification
Face recognition
Handwritten digit recognition
Intrusion detection
Feature Selection
A process that chooses an optimal subset of features according to an objective function
Objectives: reduce dimensionality and remove noise. Improve speed of learning, predictive accuracy, and simplicity
Think stepwise / forward / backward regressions
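A minimal sketch of forward (stepwise-style) selection, assuming scikit-learn; the dataset, estimator, and number of features are illustrative choices:

    # Forward selection sketch: greedily add the feature that most improves
    # cross-validated accuracy, one at a time (a wrapper-style stepwise search).
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=2,
        direction="forward",   # "backward" would start with all features and remove
        cv=5,
    )
    selector.fit(X, y)
    print(selector.get_support())   # boolean mask over the 4 original features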
Feature Extraction
The mapping of the original high-dimensional data to a lower-dimensional space
Goals can change based on end usage:
Unsupervised learning - minimize information loss (PCA)
Supervised learning - maximize class discrimination (LDA)
Think PCA
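A minimal PCA sketch, assuming scikit-learn (LinearDiscriminantAnalysis would be the supervised LDA counterpart); keeping two components is only for illustration:

    # PCA sketch: project the data onto the directions of maximum variance
    # (unsupervised, aims to minimize information loss).
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)

    pca = PCA(n_components=2)        # keep 2 linear combinations of the 4 features
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                   # (150, 2)
    print(pca.explained_variance_ratio_)     # variance captured by each component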
Pros of feature reduction
All original features are used, although not in their original form; they are combined linearly
In feature selection, by contrast, only a subset of the original features is kept
Feature selection methods
Remove features with missing values
Remove features with low variance
Remove highly correlated features
Univariate feature selection
Feature selection using SelectFromModel
Filter methods
Wrapper methods
Embedded methods
Hybrid methods
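A hedged sketch of the first two checks (missing values and low variance), assuming pandas and scikit-learn; the toy data and thresholds are made up for illustration:

    # Sketch of the two simplest filters; the thresholds are arbitrary examples.
    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    df = pd.DataFrame(np.random.rand(100, 4), columns=list("abcd"))
    df["c"] = 1.0                # constant column (zero variance)
    df.loc[::2, "d"] = np.nan    # column with 50% missing values

    # Drop features with too many missing values
    df = df.loc[:, df.isna().mean() < 0.5]

    # Drop features with (near-)zero variance
    vt = VarianceThreshold(threshold=1e-4)
    df = df.loc[:, vt.fit(df).get_support()]
    print(df.columns.tolist())   # 'c' and 'd' are gone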
Univariate feature selection
Selecting the best features based on univariate statistical tests, e.g. scikit-learn's SelectKBest
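A minimal SelectKBest sketch, assuming scikit-learn; chi2 is just one possible scoring function (it requires non-negative features) and k=2 is arbitrary:

    # Univariate selection sketch: score each feature independently against
    # the target and keep the k best.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)
    X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
    print(X.shape, "->", X_new.shape)   # (150, 4) -> (150, 2)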
Filter Methods for Feature Selection
Filter based on:
Information Gain
Chi-Squared Test
Fisher's Score
Correlation coefficient
Information gain
Measures the reduction in entropy (uncertainty about the target) obtained by splitting the dataset on a feature
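A sketch using scikit-learn's mutual_info_classif, which estimates the mutual information that information gain is based on; the dataset is illustrative:

    # Information gain is closely related to the mutual information between a
    # feature and the target; higher score = larger reduction in uncertainty.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import mutual_info_classif

    X, y = load_iris(return_X_y=True)
    scores = mutual_info_classif(X, y, random_state=0)
    print(scores)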
Fisher Score
Fisher's score is one of the most widely used supervised feature selection methods.
The algorithm returns a ranking of the variables based on their Fisher's scores
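A small NumPy sketch of one common form of the per-feature Fisher score (between-class scatter over within-class scatter); the helper function here is made up for illustration:

    # Per-feature Fisher score: between-class scatter divided by
    # within-class scatter (one common variant of the definition).
    import numpy as np
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)

    def fisher_score(X, y):
        overall_mean = X.mean(axis=0)
        numerator = np.zeros(X.shape[1])
        denominator = np.zeros(X.shape[1])
        for c in np.unique(y):
            Xc = X[y == c]
            numerator += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
            denominator += len(Xc) * Xc.var(axis=0)
        return numerator / denominator

    scores = fisher_score(X, y)
    print(np.argsort(scores)[::-1])   # feature indices ranked by Fisher score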
Correlation Coefficient
Variables should be correlated with the target but uncorrelated with each other (think of the correlation heatmap)
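A pandas sketch of the "correlated with the target, uncorrelated with each other" idea; the thresholds (0.2 and 0.7) are arbitrary illustrative values:

    # Keep features strongly correlated with the target, then drop features
    # that are strongly correlated with one another.
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_diabetes

    data = load_diabetes(as_frame=True)
    df, target = data.data, data.target

    corr_with_target = df.corrwith(target).abs()
    keep = corr_with_target[corr_with_target > 0.2].index.tolist()

    corr = df[keep].corr().abs()   # the "grid map": pairwise correlations
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > 0.7).any()]
    selected = [c for c in keep if c not in redundant]
    print(selected)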
Wrapper Methods
Generally yields better results than filter methods because it can account for feature interactions. Wrapper methods use the predictive model itself to evaluate candidate feature subsets against an evaluation criterion, typically with a greedy search rather than trying every possible combination
Forward selection: start with the best single predictor and keep adding features
Backward selection: start with all features and remove the weakest ones
Exhaustive selection: tries all possible feature combinations
Recursive feature elimination (RFE): selects features by recursively considering smaller and smaller sets of features
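A minimal wrapper-style sketch using scikit-learn's RFE; the estimator and number of features to keep are illustrative:

    # Recursive feature elimination: repeatedly fit the model and discard
    # the weakest features until the requested number remains.
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)   # helps the logistic regression converge

    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=1)
    rfe.fit(X, y)
    print(rfe.support_)    # boolean mask of kept features
    print(rfe.ranking_)    # 1 = selected, larger numbers were eliminated earlier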
Embedded methods
These methods encompass the benefits of both wrapper and filter methods, by including interactions of features but also maintaining a reasonable computational cost.
LASSO, Random Forest Importance
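A minimal embedded-method sketch combining Random Forest importances with scikit-learn's SelectFromModel; the threshold choice is illustrative:

    # Embedded selection sketch: the model's own feature importances drive
    # the selection via the SelectFromModel meta-transformer.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    X, y = load_breast_cancer(return_X_y=True)

    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=200, random_state=0),
        threshold="median",          # keep features above the median importance
    )
    X_reduced = selector.fit_transform(X, y)
    print(X.shape, "->", X_reduced.shape)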
LASSO
Often more accurate than unregularized regression, especially when many features are irrelevant or correlated
Uses shrinkage, where estimates are shrunk toward a central point, such as the mean (for Lasso, coefficients are shrunk toward zero)
Encourages simple, sparse models
Well suited for models showing high levels of multicollinearity
Regularization consists of adding a penalty to the parameters of the machine learning model to reduce the freedom of the model, i.e. to avoid overfitting. In linear model regularization, the penalty is applied over the coefficients. Lasso (L1) is able to shrink some of the coefficients to exactly zero, so those features can be removed from the model.
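A minimal Lasso sketch, assuming scikit-learn; the toy dataset and alpha=2.0 are illustrative (in practice alpha would be tuned, e.g. with LassoCV):

    # L1 (Lasso) sketch: the penalty shrinks coefficients toward zero and sets
    # some of them exactly to zero, which effectively removes those features.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # Toy data: only 3 of the 10 features actually influence the target.
    X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                           noise=10.0, random_state=0)

    lasso = Lasso(alpha=2.0).fit(X, y)
    print(lasso.coef_)                                     # several entries are exactly 0
    print("kept features:", np.flatnonzero(lasso.coef_))   # indices with nonzero coefficients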