[4] Feature Manipulation Flashcards
What is feature selection?
The process of selecting a small subset of relevant features from the larger original set
What are the motivations for feature selection?
- Increase classification accuracy
- Speed up processing time
- Improve interpretability
What are the possible overall purposes of feature selection?
- Classical - select m features from the original n while retaining classification accuracy and other performance measures
- Idealised - find the minimal number of features that can fully describe the target
What are the key decisions to make when doing feature selection?
- the overall purpose
- the evaluation approach
- the iterative approach
What are the iterative approaches for feature selection?
Sequential forward starts with an empty set and adds one feature at each step
Sequential backward starts with the full set and removes one feature at each step (see the sketch below)
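A minimal sketch of sequential forward selection, assuming a wrapper-style `score_subset` function (a hypothetical stand-in for whatever evaluation measure is chosen):

```python
# Minimal sketch of sequential forward selection (wrapper-style).
# `score_subset` is a hypothetical stand-in for the chosen evaluation
# measure, e.g. cross-validated accuracy of a classifier.
def sequential_forward_selection(all_features, m, score_subset):
    selected = []
    remaining = list(all_features)
    while remaining and len(selected) < m:
        # Try adding each remaining feature; keep the one that scores best.
        best = max(remaining, key=lambda f: score_subset(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Sequential backward selection is the mirror image: start from the full set and repeatedly drop the feature whose removal hurts the score least.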
What are the evaluation approaches that can be used for feature selection?
- Wrapper - include a learning algorithm in the selection process, i.e. a model is trained each time a candidate feature subset is evaluated
- Filter - no learning algorithm is used; instead, measures based on distance (separability), information (e.g. entropy), dependency (correlation) or consistency are used
  - Single-feature ranking scores each feature independently, sorts them, and picks the top m (see the sketch after this list)
- Embedded - a model is trained once and analysed to see which variables are most important
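A minimal sketch of filter-style single-feature ranking, using absolute Pearson correlation with the target as the (assumed) scoring measure:

```python
import numpy as np

# Filter-style single-feature ranking: score each feature independently
# by |Pearson correlation| with the target, then keep the top m.
# X is an (n_samples, n_features) array, y the target vector.
def rank_features(X, y, m):
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:m]  # indices of the m best features
```

Any other filter measure (mutual information, a distance-based separability score, etc.) could be substituted for the correlation here.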
What are the tradeoffs of the evaluation approaches for feature selection?
Wrapper leads to better results, but it is computationally expensive and doesn't generalise well if a different classification algorithm is later used
Single-feature ranking ignores feature interaction, and there is a risk that the top features may be redundant
What is entropy?
The expected number of bits needed to encode the value of a random variable: H(X) = -Σ p(x) log2 p(x)
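A small numeric illustration of the definition:

```python
import numpy as np

# Entropy of a discrete distribution, in bits. A fair coin
# (p = [0.5, 0.5]) gives 1.0 bit; a certain outcome gives 0.
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))
```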
What is Pearson’s correlation?
A measure of linear correlation between two variables, ranging from -1 (perfect negative correlation) through 0 (no linear correlation) to 1 (perfect positive correlation)
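A direct translation of the definition r = cov(x, y) / (sigma_x * sigma_y), equivalent to what `np.corrcoef` computes:

```python
import numpy as np

# Pearson's correlation: covariance of x and y divided by the
# product of their standard deviations.
def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))
```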
What is feature construction?
New features are created from existing features
What is PCA?
It applies a linear transformation so that the first component captures the most variance, the second the second-most, and so on
It assumes the bigger the variance, the better the feature
It doesn’t take into account classes, so doesn’t ensure good separability
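A minimal PCA sketch via the eigendecomposition of the covariance matrix (one of several equivalent formulations; SVD is more common in practice):

```python
import numpy as np

# PCA: centre the data, eigendecompose the covariance matrix, and
# project onto the k components with the largest variance.
def pca(X, k):
    Xc = X - X.mean(axis=0)              # centre the data
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]   # top-k by explained variance
    return Xc @ vecs[:, order]           # transformed features
```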
What is key to evaluating features during feature manipulation?
Cross-validation, to avoid optimistically biased estimates from a single train/test split
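A hypothetical sketch of scoring a candidate feature subset with cross-validation (the choice of k-nearest-neighbours as the classifier is an assumption, not from the source):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Score a feature subset by 5-fold cross-validated accuracy, so the
# evaluation is not biased by a single lucky train/test split.
def score_subset(X, y, feature_idx):
    clf = KNeighborsClassifier()
    return cross_val_score(clf, X[:, feature_idx], y, cv=5).mean()
```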
How is GP used for feature construction?
Intervals for each class are defined as [mu_c - 3 sigma_c, mu_c + 3 sigma_c], where mu_c and sigma_c are the mean and standard deviation of the constructed feature over that class's examples
The GP is optimised to avoid overlap between the class intervals, i.e. minimising the conditional entropy of the class given the constructed feature
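A sketch of computing the class intervals for a GP-constructed feature, assuming `feature_values` holds the constructed feature evaluated on the training data:

```python
import numpy as np

# For each class c, the interval is [mu_c - 3*sigma_c, mu_c + 3*sigma_c],
# where mu_c and sigma_c are the mean and standard deviation of the
# constructed feature over that class's examples. A GP fitness function
# would then penalise overlap between these intervals.
def class_intervals(feature_values, labels):
    intervals = {}
    for c in np.unique(labels):
        v = feature_values[labels == c]
        mu, sigma = v.mean(), v.std()
        intervals[c] = (mu - 3 * sigma, mu + 3 * sigma)
    return intervals
```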
What are the two general settings of transfer learning?
Domain adaptation is when the feature space remains the same but the probability distributions differ
If the feature space also changes, it is cross-domain adaptation
What are the approaches to transfer learning?
Instance based - re-weight labelled data from the source domain (see the sketch after this list)
Feature based - find a good feature representation which is similar across the domains
Model parameter based - find shared parameters between models
Relational knowledge - map knowledge from the source domain to the target domain
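As one concrete (assumed, not from the source) way to do instance-based transfer, source examples can be re-weighted by an estimated density ratio p_target(x) / p_source(x), obtained from a classifier trained to tell the two domains apart:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of instance re-weighting: train a domain classifier
# (source = 0, target = 1) and convert its probabilities into
# density-ratio weights for the source examples. Assumes the two
# samples are of comparable size.
def source_weights(X_source, X_target):
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p = clf.predict_proba(X_source)[:, 1]   # P(target domain | x)
    return p / (1 - p)                      # estimated density ratio
```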