[4] Feature Manipulation Flashcards
What is feature selection?
The process of selecting a small subset of relevant features from the larger original set
What are the motivations for feature selection?
- Increase classification accuracy
- Speed up processing time
- Improve interpretability
What are the possible overall purposes of feature selection?
- Classical - select m of the original n features while retaining classification accuracy (or other performance criteria)
- Idealised - find the minimal number of features that can fully describe the target
What are the key decisions to make when doing feature selection?
- the overall purpose
- the evaluation approach
- the iterative approach
What are the iterative approaches for feature selection?
Sequential forward selection starts with an empty set and adds the best single feature at each step
Sequential backward selection starts with the full set and removes the least useful feature at each step
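A minimal sketch of sequential forward selection in the wrapper style, assuming `X` is a NumPy feature matrix and `model` is any scikit-learn estimator; the 5-fold scoring and the name `sequential_forward_selection` are illustrative choices:

```python
from sklearn.model_selection import cross_val_score

def sequential_forward_selection(model, X, y, m):
    """Greedily grow a feature set until it holds m features.

    Returns the list of selected column indices of X.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < m:
        # Score each remaining feature when added to the current set
        def score(f):
            return cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Sequential backward selection would be the mirror image: start from all indices and drop the feature whose removal hurts the score least.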
What are the evaluation approaches that can be used for feature selection?
- Wrapper - use a learning algorithm inside the selection process, i.e. a model is trained each time a candidate feature subset is considered
- Filter - no learning algorithm is used; features are scored with measures based on distance (separability), information (e.g. entropy), dependency (correlation) or consistency
  - Single-feature ranking scores each feature independently, sorts them, and picks the top m
- Embedded - a model is trained once and analysed to see which variables are most important
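A sketch of the filter approach via single-feature ranking, here using mutual information (an information-based measure) from scikit-learn; the helper name `rank_features` is made up for illustration:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(X, y, m):
    """Score each feature independently against the class labels,
    then return the indices of the m highest-scoring features."""
    scores = mutual_info_classif(X, y)
    return np.argsort(scores)[::-1][:m]
```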
What are the tradeoffs of the evaluation approaches for feature selection?
Wrapper methods lead to better results, but are computationally expensive and don't generalise well if a different classification algorithm is later used
Single-feature ranking ignores feature interaction, and there is a risk that the top features may be redundant
What is entropy?
The expected number of bits needed to encode the value of a variable: H(X) = -sum_x p(x) log2(p(x))
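As a quick sketch, the entropy of an observed discrete variable can be computed directly from its value frequencies:

```python
import numpy as np

def entropy(values):
    """Shannon entropy in bits: H(X) = -sum_x p(x) * log2(p(x))."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

entropy([0, 0, 1, 1])  # 1.0 bit: a fair coin flip
```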
What is Pearson’s correlation?
A measure of linear correlation between two variables, ranging from -1 (strong negative correlation) through 0 (no linear correlation) to 1 (strong positive correlation)
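A small sketch computing Pearson's r from its definition, cov(x, y) / (std(x) * std(y)):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient between two samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # 1.0: perfectly linear
```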
What is feature construction?
New features are created from existing features
What is PCA?
It creates a linear transformation of the features so that the first component captures the most variance, the second the second-most, and so on
It assumes the bigger the variance, the better the feature
It doesn’t take into account classes, so doesn’t ensure good separability
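A minimal sketch with scikit-learn's PCA on synthetic data; note the transform never sees the labels, which is exactly why good separability isn't guaranteed:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 samples, 10 original features

pca = PCA(n_components=2)            # keep the two highest-variance directions
X_reduced = pca.fit_transform(X)     # shape (100, 2)
print(pca.explained_variance_ratio_) # fraction of variance per component
```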
What is key to evaluating features during feature manipulation?
Cross-validation, with the feature manipulation done inside each training fold, to avoid optimistic (selection) bias
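For example, doing the selection inside each cross-validation fold rather than once on the full dataset avoids that bias; a sketch with a scikit-learn pipeline on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Selection sits inside the pipeline, so it is re-fitted on each
# training fold and the test fold never influences which features win.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```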
How is GP used for feature construction?
Intervals for each class c are defined as [mu_c - 3 sigma_c, mu_c + 3 sigma_c], where mu_c and sigma_c are the mean and standard deviation of the constructed feature for that class
The GP is optimised to avoid overlap between the class intervals, i.e. minimising the conditional entropy of the class given the constructed feature
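A sketch of one plausible fitness for this setup, using raw interval-overlap length as a simple stand-in for the entropy-based objective; `interval_fitness` is an illustrative name:

```python
import numpy as np

def interval_fitness(feature_values, labels):
    """Score a GP-constructed feature: build [mu_c - 3*sigma_c,
    mu_c + 3*sigma_c] per class and penalise overlap between classes,
    so 0 (the maximum) means fully separated class intervals."""
    intervals = []
    for c in np.unique(labels):
        v = feature_values[labels == c]
        intervals.append((v.mean() - 3 * v.std(), v.mean() + 3 * v.std()))
    overlap = 0.0
    for i in range(len(intervals)):
        for j in range(i + 1, len(intervals)):
            lo = max(intervals[i][0], intervals[j][0])
            hi = min(intervals[i][1], intervals[j][1])
            overlap += max(0.0, hi - lo)
    return -overlap
```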
What are the two general uses of transfer learning?
Domain adaptation is when the feature space remains the same but the source and target have different probability distributions
If the feature space also changes, it is cross-domain adaptation
What are the approaches to transfer learning?
Instance-based - re-weight labelled data from the source domain for reuse in the target domain (sketched after this list)
Feature-based - find a good feature representation that is similar across the domains
Model-parameter-based - find parameters that can be shared between the source and target models
Relational-knowledge-based - map knowledge from the source domain to the target domain
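To make the instance-based approach concrete, here is a sketch of one common re-weighting scheme (importance weighting via a domain classifier); the function name and the choice of logistic regression are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def instance_weights(X_source, X_target):
    """Train a classifier to tell source from target instances, then
    weight each source instance by the odds it looks like target data
    (an estimate of p_target(x) / p_source(x))."""
    X = np.vstack([X_source, X_target])
    domain = np.r_[np.zeros(len(X_source)), np.ones(len(X_target))]
    clf = LogisticRegression(max_iter=1000).fit(X, domain)
    p = np.clip(clf.predict_proba(X_source)[:, 1], 1e-6, 1 - 1e-6)
    return p / (1 - p)

# The weights are then passed to the final model, e.g.
# model.fit(X_source, y_source, sample_weight=instance_weights(Xs, Xt))
```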
When are multi-objective solutions unambiguously better than others?
A solution dominates another if it is at least as good in every objective and strictly better in at least one
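A dominance check is a one-liner; a sketch assuming all objectives are minimised:

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b: a is no worse
    in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

dominates((1, 2), (2, 2))  # True: equal on one objective, better on the other
```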
What are the ways of approaching multi-objective optimisation?
Aggregation-based - weight each objective to give a single objective which can be optimised
Ideal multi-objective optimisation - output high-level information, i.e. a range of feasible trade-off solutions
How are solutions to ideal multi-objective problems represented?
With a Pareto front, the set of solutions that are not dominated by any other solution
How is EC used to solve multi-objective problems?
Evolutionary multi-objective optimisation (EMO) obtains multiple Pareto-optimal solutions in a single run
It can handle discontinuities and concavities along the Pareto front
It has three considerations
What are the considerations of EMO?
Fitness assignment - a scalar fitness is needed
  - Aggregating functions such as a weighted sum can be used, but they miss the concave parts of the Pareto front
  - Dominance-based methods assign ranks from dominance relations; within a rank, individuals are sorted by crowding distance
Diversity preservation - ensures solutions cover the whole Pareto front
Elitism - ensures non-dominated solutions aren't lost between generations
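A sketch of extracting the non-dominated set from a population of objective vectors (minimisation), the core step of dominance-based fitness assignment:

```python
def pareto_front(solutions):
    """Return the non-dominated subset of a list of objective vectors."""
    def dominates(a, b):
        # a is no worse everywhere and strictly better somewhere
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o != s)]

pareto_front([(1, 3), (2, 2), (3, 1), (2, 3)])  # drops (2, 3): dominated by (2, 2)
```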