Lecture 3 Flashcards
Why dimension reduction?
- Visualisation
- Curse of dimensionality
How dimension reduction?
- Feature selection:
- Filtering strategy
- Wrapper strategy
- Embedding strategy
- Feature extraction:
Linear:
- PCA
- Factor analysis
Non-linear:
- Kernel PCA
- Curves
- Manifolds
How: Feature selection
- Filtering strategy
- Wrapper strategy
- Embedding strategy
How: Feature extraction
Linear:
- PCA (unsupervised learning, data driven, distance, no labels)
- Factor analysis (unsupervised learning, idem)
Non-linear:
- Kernel PCA
- Curves
- Manifolds
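A minimal sketch (assuming scikit-learn; the toy dataset and parameters are just illustrative) of linear PCA versus kernel PCA as feature extraction:

```python
# Illustrative only: linear vs. non-linear feature extraction with scikit-learn.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Toy data: two concentric circles, not linearly separable in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear PCA: unsupervised, data-driven, distance/variance based, ignores the labels y.
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel: a non-linear extension of PCA.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

print(X_pca.shape, X_kpca.shape)  # both (200, 2)
```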
Why visualization?
- Our visual system is incredibly good at detecting patterns in a few dimensions (max 3).
- Often visualizing a dataset in a few dimensions can lead to insights, or at least make a concept easier to convey to other people.
Curse of dimensionality
https://www.youtube.com/watch?v=8CpRLplmdqE
In short:
- The number of training examples needed to generalize accurately grows exponentially with the # of dimensions.
- If the # of training examples stays the same and the # of dimensions grows, there are problems with the distance function.
- Each instance becomes more unique.
- More likely to overfit: accuracy increases on the training dataset (TrDs) but decreases on the test dataset (TeDs).
- The amount of training data needed to obtain the same amount of coverage grows exponentially with the # of dimensions (to generalize accurately).
If we want N training items per unit of “feature space”,
then each additional dimension multiplies the # of training samples needed by N (roughly N^d items for d dimensions).
- The density of training examples becomes lower and lower with each dimension you add. If the # of training examples stays the same while the # of feature dimensions increases, each instance becomes more unique (further apart from the others and surrounded by a lot of ‘empty space’) –> therefore you need more training items.
- Problem for distance functions:
- In high-dimensional space, e.g. for Euclidean distance, there is little difference between different pairs of data points.
- For independent and identically distributed (i.i.d.) data and fixed n, the difference between the maximum and minimum distance from a random reference point Q to a list of n random data points P1,…,Pn becomes negligible compared to the minimum distance itself, so the nearest and farthest neighbours become nearly indistinguishable (see the sketch below).
- More likely to overfit the training data when the # of dimensions is higher:
- This leads to improved accuracy on the training dataset
- But a decrease in accuracy on the test dataset
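A small illustrative sketch (Python/NumPy, not from the lecture) of the distance-concentration point above: with a fixed number of points, the gap between the nearest and farthest point shrinks relative to the nearest distance as the number of dimensions grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # fixed number of i.i.d. training examples

for dim in (2, 10, 100, 1000):
    P = rng.random((n, dim))           # n random data points
    q = rng.random(dim)                # random reference/query point Q
    d = np.linalg.norm(P - q, axis=1)  # Euclidean distances to Q
    relative_contrast = (d.max() - d.min()) / d.min()
    print(f"dim={dim:5d}  relative contrast={relative_contrast:.3f}")
# The printed contrast drops towards 0 as dim grows, so "nearest" loses its meaning.
```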
Support Vector Machine
SVMs can make more complex decision boundaries by using kernels (see next card for more kernel functions, soft margin formulation explained here)
In short:
- Supervised, binary classification
- Can make complex decision boundaries by using kernels
- Allow misclassification and can handle outliers (accepting some bias to reduce variance) –> reduces overfitting
https://www.youtube.com/watch?v=efR1C6CvhmE
https://www.r-bloggers.com/support-vector-machines-with-the-mlr-package/
For linearly-separable data
The SVM algorithm finds an optimal linear hyperplane that separates the classes. For a two-dimensional feature space, such as in the example in figure 1, a hyperplane is simply a straight line. For a three-dimensional feature space, a hyperplane is a surface. The principle is the same: they are surfaces that cut through the feature space.
The margin is the shortest distance between the threshold (the decision boundary) and the observations. Linearly separable data can use a (1) Maximum margin classifier or (2) Soft margin classifier.
(2) We might not want to choose a decision boundary that perfectly separates the data, to avoid overfitting. Allow the SVM to make a certain number of mistakes and keep the margin as wide as possible, so that other points can still be classified correctly.
So a soft margin allows misclassification.
We use cross validation to determine how many misclassifications and observations to allow inside the soft margin to get the best classification.
When we use a soft margin to determine the location of a threshold, we use a soft margin classifier.
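A hedged sketch (scikit-learn; the dataset and C values are illustrative) of using cross-validation to decide how tolerant the soft margin should be, via the penalty C:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Small C -> wide, tolerant margin (more misclassifications allowed);
# large C -> narrow margin that tries to classify every training point correctly.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)  # the C with the best cross-validated accuracy
```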
Kernel functions
A kernel performs a mathematical operation on instances in two classes that are not linearly separable, so that they do become linearly separable.
So , SVMs can fit linear decision boundaries for non-linearly separable problems by using the kernel trick.
Non-linearly separable data
- Linear kernel (equivalent to no kernel)
- Polynomial kernel function
- Radial Basis Function
The algorithm learns linear hyperplanes, which seems like a contradiction. Here’s what makes the SVM algorithm so powerful: it can add an extra dimension to your data to find a linear way to separate non-linear data.
The SVM algorithm adds an extra dimension to the data, such that a linear hyperplane can separate the classes in this new, higher dimensional space.
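An illustrative sketch (scikit-learn; the toy data is my own, not from the lecture) of this “extra dimension” idea: 1-D points that no single threshold can separate become linearly separable once x² is added as a second feature (which is what a degree-2 polynomial kernel does implicitly):

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-4, -3, -2, 2, 3, 4, -0.5, 0.0, 0.5, 1.0, -1.0])
y = (np.abs(x) > 1.5).astype(int)   # outer points = class 1, inner points = class 0

# No single threshold on x separates the classes, but in the (x, x**2) space
# a straight line (a linear hyperplane) does.
X_lifted = np.column_stack([x, x ** 2])
clf = SVC(kernel="linear").fit(X_lifted, y)
print(clf.score(X_lifted, y))  # expected 1.0: separable in the lifted space
```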
Linear kernel functions
==> Soft margins or Maximum margins
Polynomial kernel function
==> Has a parameter, d, that stands for the degree of the polynomial. E.g. when d is 1 it computes the relationship between each pair of observations in 1 dimension; when d is 2 it computes the 2-dimensional relationship between each pair of observations.
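A small sketch (using scikit-learn's polynomial_kernel; the two points and parameter values are illustrative) showing how the degree d changes the similarity computed for the same pair of observations:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

a = np.array([[1.0, 2.0]])
b = np.array([[3.0, 0.5]])

for d in (1, 2, 3):
    k = polynomial_kernel(a, b, degree=d, gamma=1.0, coef0=1.0)
    print(f"d={d}: K(a, b) = {k[0, 0]:.2f}")  # computes (gamma * <a, b> + coef0) ** d
```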
Radial Basis Function
==> Behaves as weighted nearest neighbours. Observations that are close have much influence on the classification, and ones that are far away have very little influence.
Radial Basis Function Kernels
Two hyperparameters:
C -> proportional to the misclassification penalty
Gamma -> range of influence in feature space (a larger Gamma means a shorter range)
Sigma -> 1/Gamma
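A hedged sketch (scikit-learn; the grid values are illustrative) of tuning C and Gamma for an RBF-kernel SVM with cross-validation:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy non-linearly separable data: two concentric circles.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

# C scales the misclassification penalty; gamma sets the range of influence
# of each training point (a larger gamma means a shorter range).
param_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```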
Feature selection
https://www.youtube.com/watch?v=sOWqWmJhQ8A
https://www.youtube.com/watch?v=rA_WFUf2-YM
- When you try to reduce the number of dimensions, don’t add additional dimensions
- Just remove some dimensions
All the action in Feature Selection is in deciding which dimensions to remove.
3 strategies
The filter strategy (feature selection)
In short:
Filters features before handing them to a learning algorithm.
The criterion for knowing whether you did well or not is buried inside the search algorithm itself (there is no feedback from the learner).
Perhaps the most basic method of eliminating features
• It considers each feature dimension separately
• Does not depend on the type of model you are using
- For each feature dimension, consider the relationship between it and the value to predict
- Based on some criteria, determine if that feature should be retained
Example criteria: correlation, mutual information, various significance tests, information gain (as in decision trees), or entropy.
Criteria should be relatively fast to compute.
Advantages:
• Fast
• Independent of the model
Disadvantages:
• Only considers dimensions independently
• Does not actually evaluate the importance of a feature for a model. There is no feedback: the learning algorithm cannot inform the search algorithm (e.g. the search algorithm may take a feature away that the learning algorithm actually needs to perform well, with no way to communicate that)
• Independent of the model
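A minimal sketch (scikit-learn; the synthetic data and k are illustrative) of the filter strategy: score each feature on its own and keep the top k, independent of any model:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Score every feature separately against the target (here: mutual information)
# and keep the 5 best -- no learning algorithm is involved in the decision.
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)

print(X_reduced.shape)                     # (300, 5)
print(selector.get_support(indices=True))  # indices of the retained features
```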
The wrapper strategy (feature selection)
In short:
The search for the features is wrapped around whatever your learning algorithm is
- Treat the set of feature dimensions as a hyperparameter of the model
- Fit the model to training data using a subset of the possible features
- Evaluate the restricted model on a test set of data (this is Cross-Validation)
- Use the set of features that provides the best fit for this model
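A hedged sketch of one common wrapper-style search (recursive feature elimination with cross-validation in scikit-learn; note that RFE is a greedy search, not an exhaustive search over all 2^N subsets):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Each feature subset is evaluated by refitting this specific model and
# scoring it with cross-validation; the weakest features are dropped.
selector = RFECV(SVC(kernel="linear"), step=1, cv=5).fit(X, y)
print(selector.n_features_)  # size of the selected subset
print(selector.support_)     # boolean mask of the selected features
```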
The wrapper strategy (feature selection)
Advantages & Disadvantages
Advantages:
• Actually make decisions about features based on
performance (accuracy, precision, recall, etc.)
• Considers all possible sets of features
• Evaluates the features for this model, not in general
Disadvantages:
• This process can be really slow
• The set of possible feature dimensions to include could be really, really large
• With N dimensions, the number of possible sets is 2^N
• The set of features is completely dependent on the model
The embedded strategy (feature selection)
• Embed the filtering strategy within the wrapper strategy
• These are special cases of wrapper strategies where
part of fitting the model involves filtering out some
feature dimensions
– Canonical example is LASSO regression where as part of the iterative estimation of the parameter weights (a wrapper strategy), the regression weight for many features is set to 0 (a filtering strategy)
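A minimal sketch (scikit-learn; alpha is just an illustrative value) of the LASSO example: as part of fitting the model, many regression weights are driven to exactly zero, which is the filtering step embedded inside the wrapper:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# The L1 penalty sets the weights of many features to exactly 0 during fitting.
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)  # features whose weight survived
print(f"{kept.size} of {X.shape[1]} features kept:", kept)
```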
The embedded strategy (feature selection)
Advantages & Disadvantages
Benefits:
- Possibly faster than a pure wrapper strategy because many fewer sets of parameters need to be considered
- Evaluates combinations of features (better than a pure filtering strategy)
Issues:
• Can be slow relative to filtering strategies
• The selected features are model dependent
• Only reduces dimensions, can’t find any new ones!