Sect 7- Feature selection, dimension reduction, statistical methods, PCA, & operations Flashcards
Training time
increases rapidly (for some algorithms, exponentially) with the number of features.
Models have increasing risk of overfitting with increasing number of ___________
features
Filter methods
consider the relationship between features and the target variable to compute the importance of each feature.
F Test
A statistical test used to compare models and check whether the difference between them is significant.
The F-Test performs a hypothesis test on two models, X and Y, where X is a model built with just a constant and Y is a model built with a constant and a feature.
The least-squares errors of both models are compared to check whether the difference in errors between model X and Y is significant or was introduced by chance.
The F-Test is useful in feature selection because it tells us the significance of each feature in improving the model.
Scikit-learn provides SelectKBest for selecting the K best features using the F-Test.
For regression tasks:
sklearn.feature_selection.f_regression
For classification tasks:
sklearn.feature_selection.f_classif
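A minimal sketch of this in code, assuming scikit-learn and its bundled iris dataset (the dataset and k=2 are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature with the F-Test and keep the two highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)  # F-statistic per feature
print(X_new.shape)       # (150, 2)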
There are some drawbacks to using the F-Test to select your features. The F-Test checks for, and only captures, linear relationships between features and labels: a highly correlated feature is given a higher score and less correlated features are given lower scores.
- Correlation is highly deceptive as it doesn’t capture strong non-linear relationships.
- Using summary statistics like correlation may be a bad idea, as illustrated by Anscombe’s quartet.
Mutual information
Mutual information between two variables measures the dependence of one variable on another. If X and Y are two variables:
- If X and Y are independent, then no information about Y can be obtained by knowing X, or vice versa; hence their mutual information is 0.
- If X is a deterministic function of Y, then we can determine X from Y and Y from X; their (normalized) mutual information is 1.
- If Y = f(X, Z, M, N), then 0 < mutual information < 1.
We can select our features from feature space by ranking their mutual information with the target variable.
The advantage of mutual information over the F-Test is that it handles non-linear relationships between the feature and the target variable well.
Sklearn offers feature selection with Mutual Information for regression and classification tasks.
sklearn.feature_selection.mutual_info_regression
sklearn.feature_selection.mutual_info_classif
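A minimal sketch of ranking features by mutual information, assuming scikit-learn; the synthetic data and the non-linear relationship are illustrative assumptions:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 3))
# The target depends non-linearly on the first feature only.
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=500)

mi = mutual_info_regression(X, y, random_state=0)
print(mi)  # the first feature should receive the highest score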
Variance threshold
This method removes features whose variance falls below a certain cutoff.
The idea is that when a feature doesn't vary much within itself, it generally has very little predictive power.
sklearn.feature_selection.VarianceThreshold
Variance Threshold doesn’t consider the relationship of features with the target variable.
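A minimal sketch of removing near-constant features, assuming scikit-learn; the toy array and the 0.1 cutoff are illustrative assumptions:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.0, 2.0],
              [0.0, 2.0, 0.5],
              [0.0, 3.0, 1.5]])  # the first column never varies

selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print(selector.variances_)  # per-feature variance
print(X_reduced.shape)      # the constant column is dropped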
Wrapper methods
Wrapper methods generate models with subsets of features and gauge their model performance.
Forward search
This method searches for the best features with respect to model performance and adds them to your feature subset one after the other.
For data with n features,
-> On the first round, 'n' models are created, one per individual feature, and the most predictive feature is selected.
-> On the second round, 'n-1' models are created, each combining one of the remaining features with the previously selected feature.
-> This is repeated until the best subset of 'm' features is selected (a sketch follows below).
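A sketch of forward search using scikit-learn's SequentialFeatureSelector; the logistic regression estimator, the iris data, and n_features_to_select=2 are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Greedily add one feature at a time, keeping whichever addition
# improves cross-validated accuracy the most.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features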
Recursive Feature Elimination
As the name suggests, this method eliminates the worst-performing features on a particular model one after the other until the best subset of features is known.
For data with n features,
-> On the first round, models are created with 'n-1' features each (all features except one); the least-performing feature is removed.
-> On the second round, models are created with 'n-2' features each, removing another feature (a sketch follows below).
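A sketch using scikit-learn's RFE, which drops the weakest feature on each round based on the fitted model's coefficients; the estimator and n_features_to_select=2 are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Repeatedly fit the model and eliminate the weakest feature
# until only n_features_to_select remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.ranking_)  # 1 marks a kept feature; larger ranks were eliminated earlier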
Wrapper methods promise you the best set of features through an extensive greedy search.
But the main drawback of wrapper methods is the sheer number of models that need to be trained. They are computationally very expensive and infeasible with a large number of features.
Embedded Methods
Feature selection can also be achieved through the insights provided by some machine learning models.
LASSO linear regression can be used for feature selection. Lasso regression is performed by adding an extra penalty term to the cost function of linear regression. Apart from preventing overfitting, this also shrinks the coefficients of less important features to exactly zero.
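A sketch of LASSO-based selection, assuming scikit-learn; the diabetes dataset and alpha=1.0 are illustrative assumptions:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# The L1 penalty drives the coefficients of weak features to exactly zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)  # zero coefficients mark features Lasso discarded

# Keep only the features with non-zero coefficients.
selector = SelectFromModel(lasso, prefit=True)
print(selector.transform(X).shape)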
Tree based models
calculate feature importance because they need to keep the best-performing features as close to the root of the tree as possible. Constructing a decision tree involves calculating the best predictive feature at each split.
Feature importance in tree-based models is calculated based on the Gini index, entropy or chi-square value.
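A sketch of reading impurity-based importances from a tree ensemble, assuming scikit-learn; the random forest and iris data are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Gini-based importance per feature; higher values mean the feature
# drives more of the splits near the top of the trees.
print(forest.feature_importances_)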
Feature selection, like most things in data science, is highly context- and data-dependent, and there is no one-stop solution. The best way forward is to understand the mechanism of each method and use it when required.
When you’re getting started with a machine learning (ML) project, one critical principle to keep in mind is that data is everything. It is often said that if ML is the rocket engine, then the fuel is the (high-quality) data fed to ML algorithms. However, deriving truth and insight from a pile of data can be a complicated and error-prone job. To have a solid start for your ML project, it always helps to analyze the data up front, a practice that describes the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis. During that process, it’s important that you get a deep understanding of:
The properties of the data, such as schema and statistical properties;
The quality of the data, like missing values and inconsistent data types;
The predictive power of the data, such as correlation of features against target.
Descriptive analysis
Univariate analysis
Descriptive analysis, or univariate analysis, provides an understanding of the characteristics of each attribute of the dataset. It also offers important evidence for feature preprocessing and selection at a later stage. The suggested analyses differ for common, numerical, categorical, and textual attributes.
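A minimal sketch of univariate descriptive analysis with pandas; the tiny DataFrame is a made-up illustration:

import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 31, None, 40],
    "plan": ["free", "pro", "pro", "free", "enterprise"],
})

print(df.describe())              # count, mean, std, quartiles for numeric attributes
print(df["plan"].value_counts())  # frequency table for a categorical attribute
print(df.isna().mean())           # fraction of missing values per attribute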
Correlation analysis
bivariate analysis
Correlation analysis (or bivariate analysis) examines the relationship between two attributes, say X and Y, and determines whether they are correlated. This analysis can be done from two perspectives to cover the various possible combinations:
Qualitative analysis. This computes the descriptive statistics of the dependent numerical/categorical attribute against each unique value of the independent categorical attribute. This perspective helps intuitively understand the relationship between X and Y. Visualizations are often used together with qualitative analysis as a more intuitive way of presenting the result.
Quantitative analysis
This is a quantitative test of the relationship between X and Y, based on the hypothesis testing framework. This perspective provides a formal and mathematical methodology to quantitatively determine the existence and/or strength of the relationship.
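A minimal sketch of the quantitative perspective, assuming SciPy; the synthetic numeric attributes are an illustrative assumption:

import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)

# Pearson correlation between two numeric attributes, with a p-value
# for the null hypothesis of no linear relationship.
r, p_value = stats.pearsonr(x, y)
print(r, p_value)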
Contextual analysis
Descriptive analysis and correlation analysis are both generic enough to be performed on any structured dataset, neither of which requires context information. To further understand or profile the given dataset and to gain more domain-specific insights, you can use one of two common contextual information-based analyses:
Time-based analysis: In many real-world datasets, the timestamp (or a similar time-related attribute) is one of the key pieces of contextual information. Observing and/or understanding the characteristics of the data along the time dimension, at various granularities, is essential to understanding the data generation process and ensuring data quality.
Agent-based analysis: Besides time, the other common attribute is the unique identification (ID, such as a user ID) of each record. Analyzing the dataset by aggregating along the agent dimension, e.g., a histogram of the number of records per agent, can further improve your understanding of the dataset.
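A minimal sketch of agent-based analysis with pandas; the event log and its columns are made-up illustrations:

import pandas as pd

events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "action":  ["view", "buy", "view", "view", "view", "buy"],
})

# Number of records per agent (user), i.e. the basis for a per-agent histogram.
records_per_user = events.groupby("user_id").size()
print(records_per_user)
print(records_per_user.describe())  # how activity is distributed across users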
The ultimate goal of EDA (whether rigorous or through visualization) is to provide insights on the dataset you’re studying. This can inspire your subsequent feature selection, engineering, and model-building process.
Descriptive analysis provides the basic statistics of each attribute of the dataset. Those statistics can help you identify the following issues:
High percentage of missing values
Low variance of numeric attributes
Low entropy of categorical attributes
Imbalance of categorical target (class imbalance)
Skewed distribution of numeric attributes
High cardinality of categorical attributes
The correlation analysis examines the relationship between two attributes. There are two typical action points triggered by the correlation analysis in the context of feature selection or feature engineering:
Low correlation between feature and target
High correlation between features
Once you’ve identified issues, the next task is to make a sound decision on how to properly mitigate them. One such example is “High percentage of missing values.” The identified problem is that the attribute is missing in a significant proportion of the data points. The threshold or definition of “significant” can be set based on domain knowledge. There are two options for handling this, depending on the business scenario (a sketch of both follows the list):
Assign a unique value to the missing value records, if the missing value, in certain contexts, is actually meaningful. For example, a missing value could indicate that a monitored, underlying process was not functioning properly.
Discard the feature if the values are missing due to misconfiguration, issues with data collection or untraceable random reasons, and the historic data can’t be reconstituted.
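A sketch of both options with pandas; the column names, sentinel value, and 50% threshold are illustrative assumptions:

import pandas as pd

df = pd.DataFrame({
    "sensor_reading": [0.4, None, 0.7, None, 0.9],
    "broken_column":  [None, None, None, 1.0, None],
})

# Option 1: the missing value is meaningful, so encode it with a sentinel.
df["sensor_reading"] = df["sensor_reading"].fillna(-1.0)

# Option 2: the attribute is mostly missing for untraceable reasons; drop it.
if df["broken_column"].isna().mean() > 0.5:  # "significant" threshold set by domain knowledge
    df = df.drop(columns=["broken_column"])

print(df)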
Dimensionality Reduction
the process of reducing the number of features in a dataset while retaining as much information as possible. It is often used in the field of data science to improve the performance of machine learning models, reduce the risk of overfitting, and make data easier to visualize.
High-dimensional data can be difficult to visualize, making it harder to understand patterns and relationships in the data.
High-dimensional data can be computationally expensive to process, making machine learning models harder and slower to train.
High-dimensional data can increase the risk of overfitting, which can lead to poor performance on unseen data.
Dimensionality reduction is a powerful technique used in data science to reduce the number of features in a dataset while retaining as much information as possible. It can be used to improve the performance of machine learning models, reduce the risk of overfitting, and make data easier to visualize. Popular techniques for dimensionality reduction include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Autoencoders, but PCA is the most widely used method.
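A minimal sketch of PCA with scikit-learn; reducing to two components and using the iris data are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)  # variance retained by each component
print(X_2d.shape)                     # (150, 2)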
Feature selection
approaches try to find a subset of the input variables (also called features or attributes). The three strategies are: the filter strategy (e.g. information gain), the wrapper strategy (e.g. search guided by accuracy), and the embedded strategy (selected features are added or removed while building the model based on prediction errors).
Data analysis such as regression or classification can be done in the reduced space more accurately than in the original space.