Lecture 3 Flashcards
Why dimension reduction?
- Visualisation
- Curse of dimensionality
How dimension reduction?
- Feature selection:
- Filtering strategy
- Wrapper strategy
- Embedding strategy
- Feature extraction:
Linear:
- PCA
- Factor analysis
Non-linear:
- Kernel PCA
- Curves
- Manifolds
How: Feature selection
- Filtering strategy
- Wrapper strategy
- Embedding strategy
How: Feature extraction
Linear:
- PCA (unsupervised learning, data driven, distance, no labels)
- Factor analysis (unsupervised learning, idem)
Non-linear:
- Kernel PCA
- Curves
- Manifolds
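A minimal sketch (assuming scikit-learn; the toy dataset and parameters are just illustrative) of linear PCA versus kernel PCA as feature extraction:

```python
# Illustrative only: linear vs. non-linear feature extraction with scikit-learn.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Toy data: two concentric circles, not linearly separable in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear PCA: unsupervised, data-driven, distance/variance based, ignores the labels y.
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel: a non-linear extension of PCA.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

print(X_pca.shape, X_kpca.shape)  # both (200, 2)
```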
Why visualization?
- Our visual system is incredibly good at detecting patterns in a few dimensions (max 3).
- Often visualizing a dataset in a few dimensions can lead to insights, or at least make a concept easier to convey to other people.
Curse of dimensionality
https://www.youtube.com/watch?v=8CpRLplmdqE
In short:
- The number of training examples needed to generalize accurately grows exponentially with the # of dimensions.
- If the # of training examples stays the same and the # of dimensions grows, there are problems with the distance function.
- Each instance becomes more unique.
- More likely to overfit: accuracy increases on the training dataset (TrDs) but decreases on the test dataset (TeDs).
- The amount of training data needed to obtain the same amount of coverage grows exponentially with the # of dimensions (to generalize accurately).
If we want N training items per unit of “feature space”,
then each additional dimension multiplies the # of training samples needed by N (roughly N^d items for d dimensions).
- The density of training examples becomes lower and lower with each dimension you add. If the # of training examples stays the same while the # of feature dimensions increases, each instance becomes more unique (further apart from the others and surrounded by a lot of ‘empty space’) –> therefore you need more training items.
- Problem for distance functions:
- In high-dimensional space, e.g. for Euclidean distance, there is little difference between different pairs of data points.
- For independent and identically distributed (i.i.d.) data and fixed n, the difference between the maximum and minimum distance from a random reference point Q to a list of n random data points P1,…,Pn becomes negligible compared to the minimum distance itself, so the nearest and farthest neighbours become nearly indistinguishable (see the sketch below).
- More likely to overfit the training data when the # of dimensions is higher:
- This leads to improved accuracy on the training dataset
- But a decrease in accuracy on the test dataset
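A small illustrative sketch (Python/NumPy, not from the lecture) of the distance-concentration point above: with a fixed number of points, the gap between the nearest and farthest point shrinks relative to the nearest distance as the number of dimensions grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # fixed number of i.i.d. training examples

for dim in (2, 10, 100, 1000):
    P = rng.random((n, dim))           # n random data points
    q = rng.random(dim)                # random reference/query point Q
    d = np.linalg.norm(P - q, axis=1)  # Euclidean distances to Q
    relative_contrast = (d.max() - d.min()) / d.min()
    print(f"dim={dim:5d}  relative contrast={relative_contrast:.3f}")
# The printed contrast drops towards 0 as dim grows, so "nearest" loses its meaning.
```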
Support Vector Machine
SVMs can make more complex decision boundaries by using kernels (see next card for more kernel functions, soft margin formulation explained here)
In short:
- Supervised, binary classification
- Can make complex decision boundaries by using kernels
- Allow misclassification and can handle outliers (accepting some bias to reduce variance) –> reduces overfitting
https://www.youtube.com/watch?v=efR1C6CvhmE
https://www.r-bloggers.com/support-vector-machines-with-the-mlr-package/
For linearly-separable data
The SVM algorithm finds an optimal linear hyperplane that separates the classes. For a two-dimensional feature space, such as in the example in figure 1, a hyperplane is simply a straight line. For a three-dimensional feature space, a hyperplane is a surface. The principle is the same: they are surfaces that cut through the feature space.
The margin is the shortest distance between the threshold (the decision boundary) and the observations. Linearly separable data can use a (1) Maximum margin classifier or (2) Soft margin classifier.
(2) We might not want to choose a decision boundary that perfectly separates the data, to avoid overfitting. Allow the SVM to make a certain number of mistakes and keep the margin as wide as possible, so that other points can still be classified correctly.
So a soft margin allows misclassification.
We use cross validation to determine how many misclassifications and observations to allow inside the soft margin to get the best classification.
When we use a soft margin to determine the location of a threshold, we use a soft margin classifier.
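A hedged sketch (scikit-learn; the dataset and C values are illustrative) of using cross-validation to decide how tolerant the soft margin should be, via the penalty C:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Small C -> wide, tolerant margin (more misclassifications allowed);
# large C -> narrow margin that tries to classify every training point correctly.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)  # the C with the best cross-validated accuracy
```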
Kernel functions
A kernel performs a mathematical operation on instances in two classes that are not linearly separable, so that they do become linearly separable.
So , SVMs can fit linear decision boundaries for non-linearly separable problems by using the kernel trick.
Non-linearly separable data
- Linear kernel (equivalent to no kernel)
- Polynomial kernel function
- Radial Basis Function
The algorithm learns linear hyperplanes, which seems like a contradiction. Here’s what makes the SVM algorithm so powerful: it can add an extra dimension to your data to find a linear way to separate non-linear data.
The SVM algorithm adds an extra dimension to the data, such that a linear hyperplane can separate the classes in this new, higher dimensional space.
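An illustrative sketch (scikit-learn; the toy data is my own, not from the lecture) of this “extra dimension” idea: 1-D points that no single threshold can separate become linearly separable once x² is added as a second feature (which is what a degree-2 polynomial kernel does implicitly):

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-4, -3, -2, 2, 3, 4, -0.5, 0.0, 0.5, 1.0, -1.0])
y = (np.abs(x) > 1.5).astype(int)   # outer points = class 1, inner points = class 0

# No single threshold on x separates the classes, but in the (x, x**2) space
# a straight line (a linear hyperplane) does.
X_lifted = np.column_stack([x, x ** 2])
clf = SVC(kernel="linear").fit(X_lifted, y)
print(clf.score(X_lifted, y))  # expected 1.0: separable in the lifted space
```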
Linear kernel functions
==> Soft margins or Maximum margins
Polynomial kernel function
==> Has a parameter, d, that stands for the degree of the polynomial. E.g. when d is 1 it computes the relationship between each pair of observations in 1 dimension; when d is 2 it computes the 2-dimensional relationship between each pair of observations.
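A small sketch (using scikit-learn's polynomial_kernel; the two points and parameter values are illustrative) showing how the degree d changes the similarity computed for the same pair of observations:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

a = np.array([[1.0, 2.0]])
b = np.array([[3.0, 0.5]])

for d in (1, 2, 3):
    k = polynomial_kernel(a, b, degree=d, gamma=1.0, coef0=1.0)
    print(f"d={d}: K(a, b) = {k[0, 0]:.2f}")  # computes (gamma * <a, b> + coef0) ** d
```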
Radial Basis Function
==> Behaves as weighted nearest neighbours. Observations that are close have much influence on the classification, and ones that are far away have very little influence.
Radial Basis Function Kernels
Two hyperparameters:
C -> proportional to the misclassification penalty
Gamma -> range of influence in feature space (a larger Gamma means a shorter range)
Sigma -> 1/Gamma
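A hedged sketch (scikit-learn; the grid values are illustrative) of tuning C and Gamma for an RBF-kernel SVM with cross-validation:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy non-linearly separable data: two concentric circles.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

# C scales the misclassification penalty; gamma sets the range of influence
# of each training point (a larger gamma means a shorter range).
param_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```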
Feature selection
https://www.youtube.com/watch?v=sOWqWmJhQ8A
https://www.youtube.com/watch?v=rA_WFUf2-YM
- When you try to reduce the number of dimensions, don’t add additional dimensions
- Just remove some dimensions
All the action in Feature Selection is in deciding which dimensions to remove.
3 strategies
The filter strategy (feature selection)
In short:
Filters features before handing them to a learning algorithm.
The criterion for knowing whether you did well or not is buried inside the search algorithm itself (there is no feedback from the learner).
Perhaps the most basic method of eliminating features
• It considers each feature dimension separately
• Does not depend on the type of model you are using
- For each feature dimension, consider the relationship between it and the value to predict
- Based on some criteria, determine if that feature should be retained
Example criteria: correlation, mutual information, various significance tests, information gain (as in decision trees), or entropy.
Criteria should be relatively fast to compute.
Advantages:
• Fast
• Independent of the model
Disadvantages:
• Only considers dimensions independently
• Does not actually evaluate the importance of a feature for a model. There is no feedback: the learning algorithm cannot inform the search algorithm (e.g. the search algorithm may take a feature away that the learning algorithm actually needs to perform well, with no way to communicate that)
• Independent of the model
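A minimal sketch (scikit-learn; the synthetic data and k are illustrative) of the filter strategy: score each feature on its own and keep the top k, independent of any model:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Score every feature separately against the target (here: mutual information)
# and keep the 5 best -- no learning algorithm is involved in the decision.
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)

print(X_reduced.shape)                     # (300, 5)
print(selector.get_support(indices=True))  # indices of the retained features
```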
The wrapper strategy (feature selection)
In short:
The search for the features is wrapped around whatever your learning algorithm is
- Treat the set of feature dimensions as a hyperparameter of the model
- Fit the model to training data using a subset of the possible features
- Evaluate the restricted model on a test set of data (this is Cross-Validation)
- Use the set of features that provides the best fit for this model
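A hedged sketch of one common wrapper-style search (recursive feature elimination with cross-validation in scikit-learn; note that RFE is a greedy search, not an exhaustive search over all 2^N subsets):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Each feature subset is evaluated by refitting this specific model and
# scoring it with cross-validation; the weakest features are dropped.
selector = RFECV(SVC(kernel="linear"), step=1, cv=5).fit(X, y)
print(selector.n_features_)  # size of the selected subset
print(selector.support_)     # boolean mask of the selected features
```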
The wrapper strategy (feature selection)
Advantages & Disadvantages
Advantages:
• Actually make decisions about features based on
performance (accuracy, precision, recall, etc.)
• Considers all possible sets of features
• Evaluates the features for this model, not in general
Disadvantages:
• This process can be really slow
• The set of possible feature dimensions to include could be really, really large
• With N dimensions, the number of possible sets is 2^N
• The set of features is completely dependent on the model
The embedded strategy (feature selection)
• Embed the filtering strategy within the wrapper strategy
• These are special cases of wrapper strategies where
part of fitting the model involves filtering out some
feature dimensions
– Canonical example is LASSO regression where as part of the iterative estimation of the parameter weights (a wrapper strategy), the regression weight for many features is set to 0 (a filtering strategy)
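A minimal sketch (scikit-learn; alpha is just an illustrative value) of the LASSO example: as part of fitting the model, many regression weights are driven to exactly zero, which is the filtering step embedded inside the wrapper:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# The L1 penalty sets the weights of many features to exactly 0 during fitting.
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)  # features whose weight survived
print(f"{kept.size} of {X.shape[1]} features kept:", kept)
```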
The embedded strategy (feature selection)
Advantages & Disadvantages
Benefits:
- Possibly faster than a pure wrapper strategy because many fewer sets of parameters need to be considered
- Evaluates combinations of features (better than a pure filtering strategy)
Issues:
• Can be slow relative to filtering strategies
• The selected features are model dependent
• Only reduces dimensions, can’t find any new ones!