What Is Feature Selection Flashcards
HOW DO STATISTICAL-BASED FEATURE SELECTION METHODS WORK? P128
They use statistics to evaluate the relationship between each input variable and the target variable, then select the input variables that have the strongest relationship with the target.
HOW MANY MAIN TYPES OF FEATURE SELECTION TECHNIQUES ARE THERE? WHAT ARE THEIR NAMES? P128
2: supervised and unsupervised
WHAT ARE THE TYPES OF SUPERVISED METHODS OF FEATURE SELECTION? P128
Wrapper, Filter, Intrinsic
HOW DO FILTER-BASED FEATURE SELECTION METHODS WORK? P128
They use statistical measures to score the correlation or dependence between each input variable and the target variable, then keep the highest-scoring inputs.
WHAT IS THE DIFFERENCE BETWEEN SUPERVISED AND UNSUPERVISED FEATURE SELECTION METHODS? P129
Whether features are selected based on the TARGET VARIABLE or not. (Unsupervised selection does not use the target variable; supervised selection DOES use the TARGET VARIABLE.)
DEFINE UNSUPERVISED FEATURE SELECTION TECHNIQUES? GIVE EXAMPLES P129
Feature selection techniques that ignore the target variable, such as removing redundant variables using correlation, or removing features that have few distinct values or low variance.
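The low-variance case can be sketched as follows, assuming scikit-learn is available. `VarianceThreshold` drops columns whose variance does not exceed a threshold and never sees a target. The toy matrix `X` is made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): the middle column is constant, so it carries no information.
X = np.array([
    [1.0, 0.0, 3.1],
    [2.0, 0.0, 2.9],
    [3.0, 0.0, 3.0],
    [4.0, 0.0, 3.2],
])

# threshold=0.0 removes only zero-variance features; no target is passed to fit.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

kept = selector.get_support(indices=True)  # indices of the columns that survived
print(kept, X_reduced.shape)
```

A higher `threshold` would also drop near-constant columns, but the right value depends on the scale of each feature.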
EXPLAIN THE TYPES OF SUPERVISED FEATURE SELECTION? P129-P130
Intrinsic: Algorithms that perform automatic feature selection during training (as part of learning the model). Examples include penalized regression models such as Lasso, and decision trees, including ensembles of decision trees such as random forest.
Filter: Select subsets of features based on their relationship with the target.
Wrapper: Search for subsets of features that perform well according to a predictive model. These methods create many models with different subsets of input features and select the features that result in the best-performing model according to a performance metric.
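As a rough illustration of the intrinsic type, the sketch below fits a Lasso model on synthetic data and reads the selected features off its nonzero coefficients. It assumes scikit-learn; the dataset parameters and `alpha=1.0` are arbitrary choices for the example, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression problem (assumption): 10 inputs, only 3 of them informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# The L1 penalty can shrink some coefficients to exactly zero during training,
# so selection happens as a side effect of fitting the model.
model = Lasso(alpha=1.0)
model.fit(X, y)

selected = np.flatnonzero(model.coef_)  # indices of features with nonzero weight
print(selected)
```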
…. IS A TYPE OF FEATURE SELECTION WHICH IS UNCONCERNED WITH THE VARIABLE TYPES, BUT CAN BE COMPUTATIONALLY EXPENSIVE? P129
Wrapper
WHAT IS THE DEFINITION OF UNIVARIATE STATISTICAL MEASURES? P131
The statistical measures used in filter-based feature selection are generally calculated one input variable at a time with the target variable. Hence, they are called univariate statistical measures.
WHAT ARE THE MOST COMMON UNIVARIATE STATISTICAL MEASURES FOR NUMERIC INPUT-NUMERIC OUTPUT FILTER-BASED FEATURE SELECTION? P132
Pearson’s correlation coefficient (linear)
Spearman’s rank coefficient (nonlinear, monotonic): WWW: Consider Spearman’s rank-order correlation when you have pairs of continuous variables whose relationship does not follow a straight line, or when you have pairs of ordinal data.
Mutual Information
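To see how the linear and rank-based measures differ, the sketch below scores a made-up monotonic but nonlinear relationship with both. It assumes SciPy is available; `x` and `y` are illustrative data only.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Illustrative data (assumption): y grows with x, but not along a straight line.
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

r, _ = pearsonr(x, y)      # measures linear association
rho, _ = spearmanr(x, y)   # measures monotonic association through ranks

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because the ranks of `x` and `y` agree perfectly, Spearman's coefficient reaches its maximum while Pearson's stays below it.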
WHAT ARE THE MOST COMMON UNIVARIATE STATISTICAL MEASURES FOR NUMERIC INPUT-CATEGORICAL OUTPUT FILTER-BASED FEATURE SELECTION? P132
ANOVA F-statistic, sometimes called the ANOVA correlation coefficient (linear)
Kendall’s rank coefficient (nonlinear): Assumes that the categorical variable is ordinal
Mutual Information
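A minimal sketch of scoring numeric inputs against a categorical target, assuming scikit-learn: `f_classif` computes the ANOVA F-statistic and `mutual_info_classif` estimates mutual information. The two-column dataset is synthetic, with one informative column and one pure-noise column.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic data (assumption): 100 samples per class.
y = np.repeat([0, 1], 100)
informative = 3.0 * y + rng.normal(size=200)  # class mean shifts with the label
noise = rng.normal(size=200)                  # unrelated to the label
X = np.column_stack([informative, noise])

f_scores, p_values = f_classif(X, y)                   # one F-statistic per column
mi_scores = mutual_info_classif(X, y, random_state=0)  # one MI estimate per column

print(f_scores, mi_scores)
```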
WHAT ARE THE MOST COMMON UNIVARIATE STATISTICAL MEASURES FOR CATEGORICAL INPUT-CATEGORICAL OUTPUT FILTER-BASED FEATURE SELECTION? P132
Chi-Squared test (contingency tables)
Mutual Information
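The contingency-table idea can be sketched with SciPy's `chi2_contingency`, which tests whether two categorical variables are independent. The `feature` and `target` arrays below are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented categorical data (assumption): colour is associated with the label.
feature = np.array(["red"] * 30 + ["blue"] * 10 + ["red"] * 10 + ["blue"] * 30)
target = np.array(["yes"] * 40 + ["no"] * 40)

# Build the contingency table: rows are feature levels, columns are target levels.
f_levels, f_idx = np.unique(feature, return_inverse=True)
t_levels, t_idx = np.unique(target, return_inverse=True)
table = np.zeros((f_levels.size, t_levels.size), dtype=int)
np.add.at(table, (f_idx, t_idx), 1)

chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(table, chi2_stat, p_value, dof)
```

A small p-value suggests the feature and the target are not independent, which makes the feature a candidate to keep.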
WHAT ARE THE METHODS FOR USING A WRAPPER AS A FEATURE SELECTION TECHNIQUE? P133
Tree-Searching Methods (depth-first, breadth-first, etc.).
Stochastic Global Search (simulated annealing, genetic algorithm).
Step-Wise Models.
RFE (Recursive Feature Elimination).
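Of the wrapper methods listed, RFE is the one available directly in scikit-learn. The sketch below is one possible setup, not a prescribed one: the decision tree estimator, the synthetic dataset, and the choice of 3 features are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem (assumption): 8 inputs, 3 informative.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# RFE repeatedly fits the estimator and discards the least important feature
# until only n_features_to_select remain.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=3)
X_selected = rfe.fit_transform(X, y)

print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # rank 1 marks a selected feature
```

Each elimination step refits the model, which is why wrapper methods can become computationally expensive as the number of features grows.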
HOW CAN YOU CHOOSE THE NUMBER OF VARIABLES (FEATURES) IN THE FILTER-BASED FEATURE SELECTION METHODS OF SKLEARN? (2 WAYS) P134
There are 2 main techniques for filtering input variables. The 1st is to rank the variables by their score and select the top k input variables with the largest scores.
The 2nd is to convert the scores into a percentage of the largest score and select all the features above a minimum percentile.
Top k variables: SelectKBest; top percentile: SelectPercentile
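Both selectors can be tried side by side, as in this sketch. It assumes scikit-learn with `f_regression` as the scoring function; the synthetic dataset, `k=3`, and `percentile=20` are arbitrary values chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Synthetic regression problem (assumption): 10 numeric inputs, 3 informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)

# Way 1: keep the k features with the largest scores.
top_k = SelectKBest(score_func=f_regression, k=3)
X_k = top_k.fit_transform(X, y)

# Way 2: keep the features whose scores fall in the top percentile.
top_pct = SelectPercentile(score_func=f_regression, percentile=20)
X_pct = top_pct.fit_transform(X, y)

print(X_k.shape, X_pct.shape)
```

After fitting, the per-feature scores are available as `top_k.scores_`, which helps when choosing `k` or `percentile`.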
WHAT ARE THE ASSUMPTIONS OF PEARSON’S CORRELATION? P134
Each variable has a Gaussian distribution, and the relationship between the variables is linear.