Lecture 3 Flashcards

1
Q

Why dimension reduction?

A
  • Visualisation
  • Curse of dimensionality

2
Q

How dimension reduction?

A
  1. Feature selection:
  • Filtering strategy
  • Wrapper strategy
  • Embedding strategy

2. Feature extraction:

Linear:

  • PCA
  • Factor analysis

Non-linear:

  • Kernel PCA
  • Curves
  • Manifolds
3
Q

How: Feature selection

A
  • Filtering strategy
  • Wrapper strategy
  • Embedding strategy
4
Q

How: Feature extraction

A

Linear:

  • PCA (unsupervised learning, data driven, distance, no labels)
  • Factor analysis (unsupervised learning, same as above)

Non-linear:

  • Kernel PCA
  • Curves
  • Manifolds
5
Q

Why visualization?

A
  • Our visual system is incredibly good at detecting
    patterns in a few dimensions (max 3)
  • Often, visualizing a dataset in a few dimensions can
    lead to insights, or at least make a concept easier to convey to other people.
6
Q

Curse of dimensionality

https://www.youtube.com/watch?v=8CpRLplmdqE

In short:
  • The number of training examples needed to generalize accurately grows exponentially with the number of dimensions.
  • If the number of training examples stays the same and the number of dimensions grows, distance functions become problematic.
  • Each instance becomes more unique.
  • More likely to overfit: accuracy on the training set increases, but accuracy on the test set decreases.
A
  1. The amount of training data needed to obtain the same coverage of the feature space grows exponentially with the number of dimensions (if you want to generalize accurately).

If we want N training items per unit of ‘feature space’, then each additional dimension multiplies the number of required training samples by N.

  2. The density of training examples becomes lower and lower with each dimension you add. If the number of training examples stays the same while the number of feature dimensions increases, each instance becomes more unique (further apart from the others and surrounded by a lot of ‘empty space’) –> therefore you need more training items.
  3. Problem for distance functions:
    - In high-dimensional space, e.g. for the Euclidean distance, there is little difference between different pairs of data points.
    - For independent and identically distributed (i.i.d.) data and fixed n, the difference between the minimum and maximum distance from a random reference point Q to a list of n random data points P1,…,Pn becomes negligible compared to the minimum distance itself (distance concentration); see the sketch below.
  4. More likely to overfit the training data when the number of dimensions is higher.
    - This leads to improved accuracy on the training dataset
    - But a decrease in accuracy on the test dataset
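
Below is a minimal sketch of the distance-concentration effect described in point 3 (not from the lecture; it assumes only NumPy and uses illustrative values): as the number of dimensions grows, the gap between the maximum and minimum distance to a random reference point shrinks relative to the minimum distance.

```python
# Illustrative sketch: distance concentration for i.i.d. uniform data.
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

for dim in [1, 2, 10, 100, 1000]:
    P = rng.uniform(size=(n_points, dim))    # n random data points P1,...,Pn
    Q = rng.uniform(size=dim)                # random reference point Q
    dists = np.linalg.norm(P - Q, axis=1)    # Euclidean distances to Q
    gap = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  (max - min) / min distance = {gap:.3f}")
```
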
7
Q

Support Vector Machine

SVMs can make more complex decision boundaries by using kernels (see the next card for kernel functions; the soft margin formulation is explained here)

In short:

  • Supervised, binary classification
  • Can make complex decision boundaries by using kernels
  • Allows misclassification and can handle outliers (accepts some bias to reduce variance) –> reduces overfitting

https://www.youtube.com/watch?v=efR1C6CvhmE
https://www.r-bloggers.com/support-vector-machines-with-the-mlr-package/

A

For linearly-separable data

The SVM algorithm finds an optimal linear hyperplane that separates the classes. For a two-dimensional feature space, a hyperplane is simply a straight line. For a three-dimensional feature space, a hyperplane is a surface. The principle is the same: hyperplanes are surfaces that cut through the feature space.

The margin is the shortest distance between the threshold and the observations. For linearly separable data you can use a (1) maximum margin classifier or (2) soft margin classifier.

(2) To avoid overfitting, we might not want to choose a decision boundary that perfectly separates the data. Instead, we allow the SVM to make a certain number of mistakes and keep the margin as wide as possible, so that other points can still be classified correctly.

So a soft margin allows misclassification.
We use cross-validation to determine how many misclassifications and observations to allow inside the soft margin to get the best classification.

When we use a soft margin to determine the location of the threshold, we are using a soft margin classifier.
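
A minimal sketch of tuning the soft margin, assuming scikit-learn (the blob dataset and the grid of C values are illustrative, not from the lecture): cross-validation is used to pick how strictly misclassifications are penalised.

```python
# Illustrative sketch: choosing the soft-margin penalty C by cross-validation.
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C)           # soft-margin linear SVM
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"C={C:<6}  mean cross-validated accuracy = {scores.mean():.3f}")
# A small C keeps the margin wide (more misclassifications allowed, more bias);
# a large C narrows it (fewer misclassifications allowed, more variance).
```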

8
Q

Kernel functions

A kernel performs a mathematical operation on instances in two classes that are not linearly separable, so that they do become linearly separable.

So, SVMs can fit linear decision boundaries for non-linearly separable problems by using the kernel trick.

Non-linearly separable data

  • Linear kernel (equivalent to no kernel)
  • Polynomial kernel function
  • Radial Basis Function
A

The algorithm learns linear hyperplanes, which seems like a contradiction. Here is what makes the SVM algorithm so powerful: it can add an extra dimension to your data to find a linear way to separate non-linear data.

The SVM algorithm adds an extra dimension to the data, such that a linear hyperplane can separate the classes in this new, higher dimensional space.

Linear kernel functions
==> Soft margins or Maximum margins

Polynomial kernel function
==> Has a parameter, d, that stands for the degree of the polynomial. E.g. when d is 1 it computes the relationship between each pair of observations in 1 dimension. When d is 2 it computes the 2-dimensional relationship between each pair of observations.

Radial Basis Function
==> Behaves like a weighted nearest-neighbour classifier. Observations that are close have a lot of influence on the classification, and the ones that are far away have very little influence.
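
A minimal sketch, assuming scikit-learn, comparing these kernels on data that is not linearly separable (concentric circles; the dataset and degree are illustrative, not from the lecture).

```python
# Illustrative sketch: linear vs. polynomial vs. RBF kernels on non-linearly separable data.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, degree=2)        # 'degree' only affects the poly kernel
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:>6} kernel: mean cross-validated accuracy = {score:.3f}")
# The linear kernel fails here; the polynomial and RBF kernels implicitly work in a
# higher-dimensional space in which the two circles become linearly separable.
```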

9
Q

Radial Basis Function Kernels

A

Two hyperparameters:

C -> proportional to the misclassification penalty
Gamma -> range of influence in feature space
(Sigma -> 1/Gamma, i.e. an alternative parameterization of Gamma)
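
A minimal sketch, assuming scikit-learn, of tuning both RBF hyperparameters by grid search with cross-validation (dataset and grids are illustrative, not from the lecture).

```python
# Illustrative sketch: grid search over C (misclassification penalty) and
# gamma (range of influence) for an RBF-kernel SVM.
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```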

10
Q

Feature selection

https://www.youtube.com/watch?v=sOWqWmJhQ8A
https://www.youtube.com/watch?v=rA_WFUf2-YM

A
  • When you try to reduce the number of dimensions,
    don’t add additional dimensions
  • Just remove some dimensions

All the action in Feature Selection is in how to decide
which dimensions to remove.

3 strategies: filtering, wrapper, and embedded.

11
Q

The filter strategy (feature selection)

In short:
Filters features before handing them to a learning algorithm.

The criterion for knowing whether you did well or not is buried inside the search algorithm itself.

A

Perhaps the most basic method of eliminating features:
• It considers each feature dimension separately
• It does not depend on the type of model you are using

  1. For each feature dimension, consider the
    relationship between it and the value to predict
  2. Based on some criteria, determine if that feature
    should be retained

Example criteria: correlation, mutual information,
various significance tests, information gain (as in decision trees), entropy.
Criteria should be relatively fast to compute.

Advantages:
• Fast
• Independent of the model

Disadvantages:
• Only considers dimensions independently
• Does not actually evaluate the importance of the features for a model. There is no feedback: the learning algorithm cannot inform the search algorithm (e.g. maybe the search algorithm removed a feature without which the learning algorithm performs poorly, but there is no way to communicate that)
• Independent of the model
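
A minimal sketch of the filtering strategy, assuming scikit-learn (dataset, criterion and k are illustrative, not from the lecture): each feature is scored independently of any model and only the top-scoring ones are kept.

```python
# Illustrative sketch: filter feature selection with a per-feature criterion
# (mutual information), with no learning model in the loop.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=mutual_info_classif, k=10)  # score each feature separately
X_reduced = selector.fit_transform(X, y)
print("indices of kept features:", selector.get_support(indices=True))
print("shape before / after:", X.shape, X_reduced.shape)
```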

12
Q

The wrapper strategy (feature selection)

In short:
The search for the features is wrapped around whatever your learning algorithm is

A
  1. Treat the set of feature dimensions as a
    hyperparameter of the model
  2. Fit the model to training data using a subset of
    the possible features
  3. Evaluate the restricted model on a test set of data
    (this is Cross-Validation)
  4. Use the set of features that provide the best fit for
    this model
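
A minimal sketch of the wrapper strategy, assuming scikit-learn (dataset, model and number of features are illustrative, not from the lecture). Instead of trying all 2^N subsets, this uses a greedy forward search, but the key idea is the same: each candidate feature subset is evaluated by cross-validating the actual model.

```python
# Illustrative sketch: wrapper feature selection, evaluating feature subsets
# with the model itself (greedy forward search rather than exhaustive 2^N search).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sfs = SequentialFeatureSelector(model, n_features_to_select=5, cv=5)
sfs.fit(X, y)                         # each candidate subset is scored by cross-validation
print("indices of selected features:", sfs.get_support(indices=True))
```
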
13
Q

The wrapper strategy (feature selection)

Advantages & Disadvantages

A

Advantages:
• Actually make decisions about features based on
performance (accuracy, precision, recall, etc.)
• Considers all possible sets of features
• Evaluates the features for this model, not in general

Disadvantages:
• This process can be really slow
• The set of possible feature dimensions to include could be really, really large
• With N dimensions, the number of possible sets is 2^N
• The set of features is completely dependent on the model

14
Q

The embedded strategy (feature selection)

A

• Embed the filtering strategy within the wrapper strategy

• These are special cases of wrapper strategies where
part of fitting the model involves filtering out some
feature dimensions
– The canonical example is LASSO regression, where, as part of the iterative estimation of the parameter weights (a wrapper strategy), the regression weights for many features are set to 0 (a filtering strategy)
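
A minimal sketch of the embedded (LASSO) example, assuming scikit-learn (dataset and penalty strength are illustrative, not from the lecture): fitting the model drives some regression weights exactly to 0, which de-selects those features as part of the fit itself.

```python
# Illustrative sketch: LASSO as an embedded feature selector.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=10.0).fit(X, y)                   # illustrative penalty strength
print("regression weights:", lasso.coef_.round(2))    # several weights end up exactly 0
print("indices of kept features:", np.flatnonzero(lasso.coef_))
```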

15
Q

The embedded strategy (feature selection)

Advantages & Disadvantages

A

Advantages:

  • Possibly faster than a pure wrapper strategy because many fewer sets of parameters need to be considered
  • Evaluates combinations of features (better than a pure filtering strategy)

Disadvantages:
• Can be slow relative to filtering strategies
• The selected features are model dependent
• Only reduces dimensions, can’t find any new ones!

16
Q

Feature Extraction

A

Feature Extraction is building a set of new and better (and fewer) features

17
Q

Why (feature extraction)?

A

Maybe none of the existing features is a particularly
good feature

• Let’s build some new ones!
• You’ve already partially done this when normalizing
the existing features for regression and classification:
– Subtracting the overall mean
– Dividing by the overall standard deviation
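
A minimal sketch of that normalization step, assuming NumPy (the random feature matrix is illustrative): subtract the per-feature mean and divide by the per-feature spread.

```python
# Illustrative sketch: z-scoring a feature matrix (a simple form of feature extraction).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))         # some raw feature matrix

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)             # subtract mean, divide by spread
print("means after normalizing:", X_norm.mean(axis=0).round(3))               # ~0 per feature
print("standard deviations after normalizing:", X_norm.std(axis=0).round(3))  # ~1 per feature
```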

18
Q

PCA (feature extraction technique)

NOTE
PCA reduces the number of dimensions (features); it does not reduce the intrinsic dimensionality of the data itself.

A

The most common (linear) technique is Principal Component Analysis (PCA).

It can be used to identify patterns in highly complex datasets and it can tell you what variables in your data are the most important

PCA is a way of projecting a high dimensional space into a lower space that has nice properties.

PCA finds the best fitting line by MAXIMIZING the sum of the squared distances from the projected points to the origin. Looking for dimensions of greatest variance.

The steps of PCA are:

  1. Transform the existing features
    1 A. New features are uncorrelated with each other
    1 B. New features are ranked based on ‘importance’
  2. Extract the N most ‘important’ features
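
A minimal sketch of these steps, assuming scikit-learn (dataset and number of retained components are illustrative, not from the lecture): transform the standardized features and keep only the most ‘important’ components.

```python
# Illustrative sketch: PCA as feature extraction, keeping the top components.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)          # PCA is sensitive to feature scales

pca = PCA(n_components=2)                      # keep only the 2 most 'important' new features
X_new = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("shape before / after:", X.shape, X_new.shape)
```
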
19
Q

PCA: Recoding into new features

A

Assuming we have N original features

These N original features can be recombined into N
new features without losing any information, where the new features are:

  • linear combinations of the original features
  • uncorrelated with each other
  • weighted in ‘importance’ (differences along the first PC axis are more important than differences along the second PC axis).
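
A minimal sketch, assuming scikit-learn and NumPy, checking these properties on an illustrative dataset: with all N components kept, the recoding loses no information, the new features are uncorrelated, and their variance decreases from the first PC onwards.

```python
# Illustrative sketch: PCA with all components kept is a lossless, decorrelating recoding.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA()                                    # keep all N components: a pure recoding
Z = pca.fit_transform(X)

print("lossless:", np.allclose(pca.inverse_transform(Z), X))               # True
corr = np.corrcoef(Z, rowvar=False)
print("uncorrelated:", np.allclose(corr, np.eye(Z.shape[1]), atol=1e-8))   # True
print("variance per PC (decreasing):", Z.var(axis=0).round(3))
```
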
20
Q

PCA: in higher dimensions

https://youtu.be/FgakZw6K1QQ

A

• For datasets with N features (N>2), PCA rotates the coordinate system in such a way that:

    • the projection of the data on the first PC (new axis) has the largest variance,
    • the projection of the data on the second PC (new axis) has the second-largest variance,
    • and so forth… (up to N PC’s)
  • Provided that the variation in the data is associated with relevance for classification (or regression), the most relevant features are captured by the first M PC’s (and the rest captures noise)
  • Retaining the first M PC’s and throwing away the rest effectively reduces the dimensionality
  • M is typically much smaller than N, hence “dimensionality reduction”
21
Q

PCA in behavioural sciences

A

In data mining we determine the optimal number of components empirically!
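
A minimal sketch of determining the number of components empirically, assuming scikit-learn (dataset, classifier and candidate values are illustrative, not from the lecture): cross-validate a PCA + classifier pipeline over different numbers of components.

```python
# Illustrative sketch: choosing the number of principal components by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), PCA(), LogisticRegression(max_iter=1000))
search = GridSearchCV(pipe, {"pca__n_components": [2, 5, 10, 20]}, cv=5)
search.fit(X, y)
print("empirically best number of components:", search.best_params_["pca__n_components"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```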

22
Q

Random Decision Forests

A

• The complexity of RDFs is determined by the
number of trees (and their depths)

• In some decision forests trees are induced on the
same complete set of features

• In random decision forests, trees are induced on
randomly selected subsets of features
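
A minimal sketch, assuming scikit-learn (dataset and hyperparameter values are illustrative, not from the lecture): the number of trees and their depth control the complexity, and max_features controls the random feature subsets. Note that scikit-learn samples the feature subset at each split rather than once per tree.

```python
# Illustrative sketch: a random decision forest with its main complexity knobs.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200,     # number of trees
                                max_depth=5,          # depth of each tree
                                max_features="sqrt",  # random subset of features per split
                                random_state=0)
print("mean cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean().round(3))
```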