9-featureselection Flashcards
What are the two main ways to do feature selection?
Wrapper methods and feature filtering methods
What are wrapper methods?
Wrapper methods are feature selection methods that choose the subset of attributes giving the best performance on the development data, by training and evaluating a model on candidate subsets
What are the advantages of wrapper methods?
Build feature set with optimal performance on development data
What are the disadvantages of wrapper methods?
They take a long time: exhaustively evaluating every subset of m attributes requires training 2^m models
What are more practical wrapper methods?
Greedy wrapper method
Ablation wrapper method
What is the greedy wrapper approach?
Train and evaluate the model on each single attribute and choose the best one. Then train on the best attribute(s) combined with each remaining attribute and choose the best combination. Stop when accuracy no longer increases
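The steps above can be sketched as follows. This is a minimal illustration, not a library implementation: `evaluate` and `toy_evaluate` are hypothetical stand-ins for training a model on the chosen attributes and scoring it on development data.

```python
# Sketch of the greedy (forward) wrapper approach.
def greedy_select(attributes, evaluate):
    selected, best_score = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        # Try adding each remaining attribute to the current set.
        score, attr = max((evaluate(selected + [a]), a) for a in remaining)
        if score <= best_score:  # accuracy no longer increases: stop
            break
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected

# Toy dev-set "accuracy" (made up): "x" and "y" help, "noise" hurts.
def toy_evaluate(subset):
    return (0.5 + 0.2 * ("x" in subset) + 0.1 * ("y" in subset)
            - 0.05 * ("noise" in subset))

# greedy_select(["x", "y", "noise"], toy_evaluate) → ["x", "y"]
```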
What are the disadvantages of greedy wrapper approach?
Still requires roughly m^2/2 model evaluations for m attributes, and usually converges to a suboptimal feature set
What is the ablation approach?
Start with the entire feature set. Remove each attribute in turn and evaluate on the remaining set, keeping the removal that hurts performance least. Stop when performance significantly degrades
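A minimal sketch of the ablation approach, under the same assumptions as before: `evaluate` and `toy_evaluate` are hypothetical stand-ins for training and scoring a model, and `tolerance` is an illustrative threshold for "significant" degradation.

```python
# Sketch of the ablation (backward elimination) wrapper approach.
def ablate(attributes, evaluate, tolerance=0.01):
    selected = list(attributes)
    baseline = evaluate(selected)
    while len(selected) > 1:
        # Try removing each attribute; keep the removal that hurts least.
        score, attr = max(
            (evaluate([a for a in selected if a != r]), r) for r in selected
        )
        if score < baseline - tolerance:  # performance degrades significantly
            break
        selected.remove(attr)
        baseline = max(baseline, score)
    return selected

# Toy dev-set "accuracy" (made up): "x" and "y" help, "noise" hurts.
def toy_evaluate(subset):
    return (0.5 + 0.2 * ("x" in subset) + 0.1 * ("y" in subset)
            - 0.05 * ("noise" in subset))

# ablate(["x", "y", "noise"], toy_evaluate) → ["x", "y"]
```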
What are the advantages of ablation method?
Quickly remove irrelevant attributes
What is pointwise mutual information?
PMI(A,C) = log2( P(A,C) / (P(A)P(C)) ). We want attribute-class pairs with high PMI
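The definition translates directly into code. A small sketch (the probabilities passed in are assumed to be estimated elsewhere, e.g. from counts):

```python
import math

def pmi(p_ac, p_a, p_c):
    """Pointwise mutual information: log2( P(A,C) / (P(A) P(C)) )."""
    return math.log2(p_ac / (p_a * p_c))

# If attribute and class are independent, P(A,C) = P(A)P(C), so PMI = 0:
# pmi(0.25, 0.5, 0.5) → 0.0
# A positively associated pair has PMI > 0:
# pmi(0.4, 0.5, 0.5) → log2(1.6)
```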
What are the disadvantages of ablation method?
Still takes O(m^2) model evaluations. Assumes features are independent.
What are feature filtering methods?
Methods that evaluate the goodness of each feature individually, preferring features that better predict the class
What makes a single feature good?
Well correlated with the class, reverse (negatively) correlated with the class, or well correlated with the absence of the class
What is mutual information?
The weighted average of all PMI.
MI(A,C) = P(a,c)PMI(a,c) + P(ā,c)PMI(ā,c) + P(a,c̄)PMI(a,c̄) + P(ā,c̄)PMI(ā,c̄)
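The same weighted sum, computed from a joint probability table. A sketch for a binary attribute and binary class, where `joint` maps (a, c) value pairs to joint probabilities (the representation is an assumption for illustration):

```python
import math

def mutual_information(joint):
    """MI of binary attribute A and class C: sum of P(a,c) * PMI(a,c)."""
    p_a = {a: joint[(a, True)] + joint[(a, False)] for a in (True, False)}
    p_c = {c: joint[(True, c)] + joint[(False, c)] for c in (True, False)}
    mi = 0.0
    for (a, c), p in joint.items():
        if p > 0:  # terms with P(a,c) = 0 contribute nothing
            mi += p * math.log2(p / (p_a[a] * p_c[c]))
    return mi
```

For an independent attribute MI is 0; for an attribute that perfectly predicts a balanced binary class MI is 1 bit.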
What are alternatives to mutual information?
Chi-square
What is the principle of Chi-square?
Compare the observed count O(w) of a feature to its expected count E(w) under the assumption that the feature and the class are independent
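A minimal sketch of the statistic over contingency-table cells; the document counts in the example are made up for illustration.

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over table cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Illustrative (made-up) counts: a word appears in 30 of 50 positive docs
# and 10 of 50 negative docs. Under independence the expected counts for the
# cells [present+pos, present+neg, absent+pos, absent+neg] are [20, 20, 30, 30].
# chi_square([30, 10, 20, 40], [20, 20, 30, 30]) → 50/3 ≈ 16.67
```

A large statistic means the observed counts deviate strongly from independence, i.e. the feature is informative about the class.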
How do we conduct feature selection on nominal attributes?
Either treat attributes as multiple binary attributes or modify mutual information definition
How do we conduct feature selection on continuous attributes?
Estimate probability using Gaussian distribution
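A sketch of the Gaussian estimate: fit a mean and variance to the observed attribute values and read probabilities off the fitted density (the function name is illustrative).

```python
import math

def gaussian_density(x, values):
    """Estimate the density of a continuous attribute at x by fitting a
    Gaussian (mean and variance) to the observed values."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```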
How do we conduct feature selection on ordinal attributes?
Treat as binary, treat as continuous or treat as nominal
What are the disadvantages of mutual information?
It is biased towards rare, uninformative features
What is an unsupervised feature selection method used for text documents?
Term frequency-inverse document frequency (TF-IDF). This helps us find words that are relevant to a document within a given document collection.
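One common TF-IDF variant (raw term count times log inverse document frequency; many weightings exist) can be sketched as:

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF of `term` in `doc`, where `doc` and each element of `docs`
    is a list of tokens."""
    tf = doc.count(term)                          # term frequency in this doc
    df = sum(1 for d in docs if term in d)        # document frequency
    idf = math.log(len(docs) / df) if df else 0.0 # inverse document frequency
    return tf * idf
```

A word that appears in every document gets IDF 0 and is filtered out, however frequent it is; a word concentrated in few documents scores highly.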
What are examples of models that do feature selection inherently?
Decision trees
Regression models with regularisation
Neural networks