Section 7 - Feature selection, dimension reduction, statistical methods, PCA, & operations Flashcards
Training time
increases with the number of features; for many algorithms the growth is worse than linear.
Models have increasing risk of overfitting with increasing number of ___________
features
Filter methods
consider the relationship between each feature and the target variable to compute the importance of features.
F Test
a statistical test used to compare two models and check whether the difference between them is significant.
The F-Test performs hypothesis testing on two models, X and Y, where X is a model built with just a constant and Y is a model built with a constant and a feature.
The least-squares errors of the two models are compared to check whether the difference in errors between X and Y is significant or merely introduced by chance.
F-Test is useful in feature selection as we get to know the significance of each feature in improving the model.
Scikit-learn provides SelectKBest for selecting the K best features using the F-Test. For regression tasks:
sklearn.feature_selection.f_regression
For classification tasks:
sklearn.feature_selection.f_classif
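As a minimal sketch (using a made-up classification dataset), these score functions plug into SelectKBest like this:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
selector = SelectKBest(score_func=f_classif, k=5)   # keep the 5 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)    # (200, 5)
print(selector.scores_)    # F-statistic of each original feature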
There are some drawbacks to using the F-Test to select your features. The F-Test only captures linear relationships between features and labels: a highly correlated feature is given a higher score and less correlated features are given lower scores.
- Correlation is highly deceptive as it doesn’t capture strong non-linear relationships.
- Using summary statistics like correlation may be a bad idea, as illustrated by Anscombe’s quartet.
Mutual information
Mutual information between two variables measures the dependence of one variable on another. If X and Y are two variables:
If X and Y are independent, then no information about Y can be obtained by knowing X or vice versa. Hence their mutual information is 0.
If X is a deterministic function of Y, then we can determine X from Y and Y from X, and their (normalized) mutual information is 1.
When Y depends only partly on X, e.g. Y = f(X, Z, M, N), the (normalized) mutual information lies between 0 and 1.
We can select our features from feature space by ranking their mutual information with the target variable.
The advantage of mutual information over the F-Test is that it also captures non-linear relationships between the feature and the target variable.
Sklearn offers feature selection with Mutual Information for regression and classification tasks.
sklearn.feature_selection.mutual_info_regression
sklearn.feature_selection.mutual_info_classif
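As a rough sketch (again on made-up classification data), you can also compute the mutual information scores directly and rank features by them:
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
mi_scores = mutual_info_classif(X, y, random_state=0)
ranking = mi_scores.argsort()[::-1]   # feature indices, highest mutual information first
print(ranking[:5])                    # top-5 candidate features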
Variance threshold
This method removes features with variation below a certain cutoff.
The idea is when a feature doesn’t vary much within itself, it generally has very little predictive power.
sklearn.feature_selection.VarianceThreshold
Variance Threshold doesn’t consider the relationship of features with the target variable.
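A small sketch of VarianceThreshold on a toy matrix; the cutoff of 0.1 is an arbitrary choice, and the nearly constant first column is removed:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 1.0],
              [0.0, 1.0, 4.0],
              [0.1, 3.0, 2.0],
              [0.0, 2.0, 3.0]])
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)
print(selector.variances_)   # per-feature variances
print(X_reduced.shape)       # (4, 2) -- the near-constant first column is dropped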
Wrapper methods
Wrapper methods generate models with subsets of features and gauge their performance.
Forward search
This method searches for the best features with respect to model performance and adds them to your feature subset one after the other, as sketched after the steps below.
For data with n features,
->In the first round, 'n' models are created, each with a single feature, and the best predictive feature is selected.
->In the second round, 'n-1' models are created, each pairing a remaining feature with the previously selected feature.
->This is repeated until a best subset of 'm' features is selected.
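A sketch of forward search using scikit-learn's SequentialFeatureSelector (available from scikit-learn 0.24 onward); the base estimator and subset size here are arbitrary choices:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=4, direction='forward')
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the selected feature subset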
Recursive Feature Elimination
As the name suggests, this method eliminates the worst-performing features of a particular model one after the other until the best subset of features is known (a sketch follows the steps below).
For data with n features,
->In the first round, models are created with combinations of 'n-1' features (all features except one); the least-performing feature is removed.
->In the second round, models with 'n-2' features are created by removing another feature.
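A minimal sketch of recursive feature elimination with scikit-learn's RFE; the logistic regression base model and the target subset size are arbitrary choices:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)   # True for the features kept in the final subset
print(rfe.ranking_)   # rank 1 = selected; higher ranks were eliminated earlier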
Wrapper methods promise a best set of features via an extensive greedy search.
The main drawback of wrapper methods is the sheer number of models that need to be trained; this is computationally very expensive and becomes infeasible with a large number of features.
Embedded Methods
Feature selection can also be achieved through the insights provided by some Machine Learning models.
LASSO linear regression can be used for feature selection. Lasso regression adds an extra (L1) penalty term to the cost function of linear regression; apart from preventing overfitting, this also shrinks the coefficients of less important features to zero.
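A sketch of embedded selection with Lasso on a made-up regression dataset (the alpha value is an arbitrary choice); features whose coefficients are shrunk to zero are discarded via SelectFromModel:
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)   # keeps features with non-zero coefficients
X_selected = selector.transform(X)
print(lasso.coef_)        # several coefficients are driven exactly to zero
print(X_selected.shape)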
Tree based models
calculate feature importance because they need to keep the best-performing features as close to the root of the tree as possible. Constructing a decision tree involves repeatedly choosing the most predictive feature to split on.
Feature importance in tree-based models is calculated based on the Gini index, entropy, or chi-square value.
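As a brief sketch (toy data, arbitrary hyperparameters), the impurity-based importances exposed by a random forest can be read off directly:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for idx, importance in sorted(enumerate(forest.feature_importances_),
                              key=lambda pair: pair[1], reverse=True):
    print(f'feature {idx}: importance {importance:.3f}')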
Feature selection, like most things in data science, is highly context- and data-dependent, and there is no one-stop solution. The best way forward is to understand the mechanism of each method and use it when required.
When you’re getting started with a machine learning (ML) project, one critical principle to keep in mind is that data is everything. It is often said that if ML is the rocket engine, then the fuel is the (high-quality) data fed to ML algorithms. However, deriving truth and insight from a pile of data can be a complicated and error-prone job. To have a solid start for your ML project, it always helps to analyze the data up front, a practice that describes the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis. During that process, it’s important that you get a deep understanding of:
The properties of the data, such as schema and statistical properties;
The quality of the data, like missing values and inconsistent data types;
The predictive power of the data, such as correlation of features against target.
Descriptive analysis
Univariate analysis
Descriptive analysis, or univariate analysis, provides an understanding of the characteristics of each attribute of the dataset. It also offers important evidence for feature preprocessing and selection in a later stage. The suggested analyses differ for common, numerical, categorical, and textual attributes.
Correlation analysis
bivariate analysis
Correlation analysis (or bivariate analysis) examines the relationship between two attributes, say X and Y, and examines whether X and Y are correlated. This analysis can be done from two perspectives to get various possible combinations:
Qualitative analysis. This computes the descriptive statistics of the dependent numerical/categorical attribute for each unique value of the independent categorical attribute. This perspective helps intuitively understand the relationship between X and Y. Visualizations are often used together with qualitative analysis as a more intuitive way of presenting the result.
Quantitative analysis. This is a quantitative test of the relationship between X and Y, based on the hypothesis-testing framework. This perspective provides a formal and mathematical methodology to quantitatively determine the existence and/or strength of the relationship.
Contextual analysis
Descriptive analysis and correlation analysis are both generic enough to be performed on any structured dataset, neither of which requires context information. To further understand or profile the given dataset and to gain more domain-specific insights, you can use one of two common contextual information-based analyses:
Time-based analysis: In many real-world datasets, the timestamp (or a similar time-related attribute) is one of the key pieces of contextual information. Observing and/or understanding the characteristics of the data along the time dimension, with various granularities, is essential to understanding the data generation process and ensuring data quality.
Agent-based analysis: Besides time, the other common contextual attribute is the unique identification (ID, such as user ID) of each record. Analyzing the dataset by aggregating along the agent dimension, e.g., a histogram of the number of records per agent, can further improve your understanding of the dataset.
The ultimate goal of EDA (whether rigorous or through visualization) is to provide insights on the dataset you’re studying. This can inspire your subsequent feature selection, engineering, and model-building process.
Descriptive analysis provides the basic statistics of each attribute of the dataset. Those statistics can help you identify the following issues:
High percentage of missing values
Low variance of numeric attributes
Low entropy of categorical attributes
Imbalance of categorical target (class imbalance)
Skewed distribution of numeric attributes
High cardinality of categorical attributes
The correlation analysis examines the relationship between two attributes. There are two typical action points triggered by the correlation analysis in the context of feature selection or feature engineering:
Low correlation between feature and target
High correlation between features
Once you’ve identified issues, the next task is to make a sound decision on how to properly mitigate these issues. One such example is for “High percentage of missing values.” The identified problem is that the attribute is missing in a significant proportion of the data points. The threshold or definition of “significant” can be set based on domain knowledge. There are two options to handle this, depending on the business scenario:
Assign a unique value to the missing value records, if the missing value, in certain contexts, is actually meaningful. For example, a missing value could indicate that a monitored, underlying process was not functioning properly.
Discard the feature if the values are missing due to misconfiguration, issues with data collection or untraceable random reasons, and the historic data can’t be reconstituted.
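A hypothetical pandas sketch of these two options for a column with many missing values (the column names and the 50% threshold are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'sensor_reading': [1.2, np.nan, 3.4, np.nan, np.nan],
                   'other_feature': [10, 20, 30, 40, 50]})
missing_fraction = df['sensor_reading'].isna().mean()

if missing_fraction > 0.5:   # what counts as "significant" is domain-specific
    # Option 1: missingness is meaningful -> flag it and fill with a sentinel value
    df['sensor_reading_missing'] = df['sensor_reading'].isna().astype(int)
    df['sensor_reading'] = df['sensor_reading'].fillna(-1)
    # Option 2: missingness is noise -> discard the feature entirely
    # df = df.drop(columns=['sensor_reading'])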
Dimensionality Reduction
the process of reducing the number of features in a dataset while retaining as much information as possible. It is often used in the field of data science to improve the performance of machine learning models, reduce the risk of overfitting, and make data easier to visualize.
High-dimensional data can be difficult to visualize, making it harder to understand patterns and relationships in the data.
High-dimensional data can be computationally expensive to process, making machine learning models slower and harder to train.
High-dimensional data can increase the risk of overfitting, which can lead to poor performance on unseen data.
Dimensionality reduction is a powerful technique used in data science to reduce the number of features in a dataset while retaining as much information as possible. It can be used to improve the performance of machine learning models, reduce the risk of overfitting, and make data easier to visualize. Popular techniques for dimensionality reduction include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and autoencoders; PCA is the most widely used method.
Feature selection
approaches try to find a subset of the input variables (also called features or attributes). The three strategies are: the filter strategy (e.g. information gain), the wrapper strategy (e.g. search guided by accuracy), and the embedded strategy (selected features are added or removed while building the model based on prediction errors).
Data analysis such as regression or classification can be done in the reduced space more accurately than in the original space.
Feature projection
transforms the data from the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.[4][5] For multidimensional data, tensor representation can be used in dimensionality reduction through multilinear subspace learning
Principal component analysis (PCA)
The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance (and sometimes the correlation) matrix of the data is constructed and the eigenvectors of this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can now be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system, because they often contribute the vast majority of the system’s energy, especially in low-dimensional systems. Still, this must be proven on a case-by-case basis as not all systems exhibit this behavior. The original space (with dimension of the number of points) has been reduced (with data loss, but hopefully retaining the most important variance) to the space spanned by a few eigenvectors.
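A minimal numpy sketch of the procedure just described (random data, two retained components), not tied to any particular library implementation:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)   # 5 x 5 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]    # largest eigenvalues first
components = eigenvectors[:, order[:2]]  # two leading eigenvectors (principal components)

X_reduced = X_centered @ components      # project onto the 2-D subspace
explained = eigenvalues[order[:2]].sum() / eigenvalues.sum()
print(X_reduced.shape, explained)        # shape of the projection, fraction of variance kept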
Non-negative matrix factorization (NMF)
NMF decomposes a non-negative matrix to the product of two non-negative ones, which has been a promising tool in fields where only non-negative signals exist,[7][8] such as astronomy.[9][10] NMF is well known since the multiplicative update rule by Lee & Seung,[7] which has been continuously developed: the inclusion of uncertainties,[9] the consideration of missing data and parallel computation,[11] sequential construction[11] which leads to the stability and linearity of NMF,[10] as well as other updates including handling missing data in digital image processing.[12]
With a stable component basis during construction, and a linear modeling process, sequential NMF[11] is able to preserve the flux in direct imaging of circumstellar structures in astronomy,[10] as one of the methods of detecting exoplanets, especially for the direct imaging of circumstellar discs. In comparison with PCA, NMF does not remove the mean of the matrices, which leads to unphysical non-negative fluxes; therefore NMF is able to preserve more information than PCA as demonstrated by Ren et al
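A short, hedged sketch of NMF with scikit-learn on a random non-negative matrix (the rank and initialization are arbitrary choices):
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.default_rng(0).normal(size=(50, 10)))   # non-negative data
model = NMF(n_components=3, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(X)            # 50 x 3 factor
H = model.components_                 # 3 x 10 factor
print(np.linalg.norm(X - W @ H))      # reconstruction error of the low-rank product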
Kernel PCA
Principal component analysis can be employed in a nonlinear way by means of the kernel trick. The resulting technique, known as kernel PCA, is capable of constructing nonlinear mappings that maximize the variance in the data.
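A minimal scikit-learn sketch; the RBF kernel and gamma value are arbitrary choices, and the concentric-circles dataset is just a standard nonlinear toy example:
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)   # nonlinear (circular) structure becomes nearly separable
print(X_kpca.shape)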
Graph-based kernel PCA
Other prominent nonlinear techniques include manifold learning techniques such as Isomap, locally linear embedding (LLE),[13] Hessian LLE, Laplacian eigenmaps, and methods based on tangent space analysis.[14] These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for Kernel PCA.
More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space), while maximizing the distances between points that are not nearest neighbors.
An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include: classical multidimensional scaling, which is identical to PCA; Isomap, which uses geodesic distances in the data space; diffusion maps, which use diffusion distances in the data space; t-distributed stochastic neighbor embedding (t-SNE), which minimizes the divergence between distributions over pairs of points; and curvilinear component analysis.
A different approach to nonlinear dimensionality reduction is through the use of autoencoders, a special kind of feedforward neural network with a bottleneck hidden layer.[15] The training of deep encoders is typically performed using a greedy layer-wise pre-training (e.g., using a stack of restricted Boltzmann machines) that is followed by a fine-tuning stage based on backpropagation.
Linear discriminant analysis (LDA)
a generalization of Fisher’s linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
Generalized discriminant analysis (GDA)
GDA deals with nonlinear discriminant analysis using kernel function operator. The underlying theory is close to the support-vector machines (SVM) insofar as the GDA method provides a mapping of the input vectors into high-dimensional feature space.[16][17] Similar to LDA, the objective of GDA is to find a projection for the features into a lower dimensional space by maximizing the ratio of between-class scatter to within-class scatter.
Autoencoder
can be used to learn nonlinear dimension reduction functions and codings together with an inverse function from the coding to the original representation.
t-SNE
T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique useful for visualization of high-dimensional datasets. It is not recommended for use in analysis such as clustering or outlier detection since it does not necessarily preserve densities or distances well
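A small sketch of t-SNE for visualization only, using the digits dataset as an assumed example (perplexity is an arbitrary choice):
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images
X_embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_embedded.shape)               # (1797, 2) -- use for plotting, not for clustering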
UMAP
Uniform manifold approximation and projection (UMAP) is a nonlinear dimensionality reduction technique. Visually, it is similar to t-SNE, but it assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant.
Dimension reduction
For high-dimensional datasets (i.e. with number of dimensions more than 10), dimension reduction is usually performed prior to applying a K-nearest neighbors algorithm (k-NN) in order to avoid the effects of the curse of dimensionality.[19]
Feature extraction and dimension reduction can be combined in one step using principal component analysis (PCA), linear discriminant analysis (LDA), canonical correlation analysis (CCA), or non-negative matrix factorization (NMF) techniques as a pre-processing step followed by clustering by K-NN on feature vectors in reduced-dimension space. In machine learning this process is also called low-dimensional embedding.[20]
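A hedged sketch of that pre-processing pattern: PCA followed by k-NN in the reduced space (the digits dataset and the choice of 20 components are assumptions for illustration):
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
pipeline = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())   # accuracy of k-NN in the 20-dimensional PCA space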
For very-high-dimensional datasets (e.g. when performing similarity search on live video streams, DNA data or high-dimensional time series) running a fast approximate K-NN search using locality-sensitive hashing, random projection,[21] “sketches”,[22] or other high-dimensional similarity search techniques from the VLDB conference toolbox might be the only feasible option.
Why is Dimensionality Reduction Needed?
Often in machine learning, the more features present in the dataset, the better a classifier can learn. However, more features also mean a higher computational cost. Not only can high dimensionality lead to long training times, but more features also often lead to overfitting as the algorithm tries to create a model that explains all the features in the data.
Because dimensionality reduction reduces the overall number of features, it can reduce the computational demands associated with training a model and also helps combat overfitting by keeping the features fed to the model fairly simple.
Dimensionality reduction can be used in both supervised and unsupervised learning contexts. In the case of unsupervised learning, dimensionality reduction is often used to preprocess the data by carrying out feature selection or feature extraction.
The primary algorithms used to carry out dimensionality reduction for unsupervised learning are Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).
In the case of supervised learning, dimensionality reduction can be used to simplify the features fed into the machine learning classifier. The most common methods used to carry out dimensionality reduction for supervised learning problems are Linear Discriminant Analysis (LDA) and PCA, and they can be utilized to predict new cases.
Take note that the use cases described above are general use cases and not the only conditions these techniques are used in. After all, dimensionality reduction techniques are statistical methods and their use is not restricted by machine learning models.
PCA Implementation Example
Let’s take a look at how PCA can be implemented in Scikit-Learn. We’ll be using the Mushroom classification dataset for this.
First, we need to import all the modules we need, which includes PCA, train_test_split, and labeling and scaling tools:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
After we load in the data, we’ll check for any null values. We’ll also encode the data with the LabelEncoder. The class feature is the first column in the dataset, so we split up the features and labels accordingly:
m_data = pd.read_csv('mushrooms.csv')

# Machine learning systems work with integers, we need to encode these
# string characters into ints
encoder = LabelEncoder()
Now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])
X_features = m_data.iloc[:,1:23]
y_label = m_data.iloc[:, 0]
We’ll now scale the features with the standard scaler. This is optional as we aren’t actually running the classifier, but it may impact how our data is analyzed by PCA:
# Scale the features
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)
We'll now use PCA to get the list of features and plot which features have the most explanatory power, or the most variance. These are the principal components. It looks like around 17 or 18 of the features explain the majority, almost 95%, of our data:
# Visualize how much variance each principal component explains
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_ratio_

plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()
Let’s convert the features into the 17 top features. We’ll then plot a scatter plot of the data point classification based on these 17 features:
pca2 = PCA(n_components=17)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data['class'])
plt.show()
Let’s also do this for the top 2 features and see how the classification changes:
pca3 = PCA(n_components=2)
pca3.fit(X_features)
x_3d = pca3.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,1], c=m_data['class'])
plt.show()
Singular Value Decomposition
The purpose of Singular Value Decomposition is to simplify a matrix and make doing calculations with the matrix easier. The matrix is reduced to its constituent parts, similar to the goal of PCA. Understanding the ins and outs of SVD isn’t completely necessary to implement it in your machine learning models, but having an intuition for how it works will give you a better idea of when to use it.
SVD can be carried out on either complex or real-valued matrices, but to make this explanation easier to understand, we’ll go over the method of decomposing a real-valued matrix.
When doing SVD we have a matrix filled in with data and we want to reduce the number of columns the matrix has. This reduces the dimensionality of the matrix while still preserving as much of the variability in the data as possible.
We can express matrix A as the product of three matrices U, D, and the transpose of V:
A = U * D * V^t
Assuming we have some matrix A, we can represent it using three other matrices called U, D, and V. If A is an x-by-y matrix, then U is an x-by-x orthogonal matrix, V is a y-by-y orthogonal matrix, and D is an x-by-y diagonal matrix.
Decomposing a matrix involves converting its singular values into the diagonal entries of D. The orthogonal matrices keep their useful properties under this decomposition, and we can take advantage of this to get an approximation of matrix A: when we multiply U, D, and the transpose of V back together, we get a matrix that is equivalent to the original matrix A.
When we break/decompose matrix A down into U, D, and V, we then have three different matrices that contain the information of Matrix A.
It turns out that the left-most columns of the matrices hold the majority of our data, and we can select just these few columns to have a good approximation of Matrix A. This new matrix is much simpler and easier to work with, as it has far fewer dimensions.
SVD Implementation Example
One of the most common ways that SVD is used is to compress images. After all, the pixel values that make up the red, green, and blue channels in the image can just be reduced and the result will be an image that is less complex but still contains the same image content. Let’s try using SVD to compress an image and render it.
We’ll use several functions to handle the compression of the image. We’ll really only need Numpy and the Image function from the PIL library in order to accomplish this, since Numpy has a method to carry out the SVD calculation:
import numpy
from PIL import Image
First, we’ll just write a function to load in the image and turn it into a Numpy array. We then want to select the red, green, and blue color channels from the image:
def load_image(image):
    image = Image.open(image)
    im_array = numpy.array(image)

    red = im_array[:, :, 0]
    green = im_array[:, :, 1]
    blue = im_array[:, :, 2]

    return red, green, blue
Now that we have the colors, we need to compress the color channels. We can start by calling Numpy’s SVD function on the color channel we want. We’ll then create an array of zeroes that we’ll fill in after the matrix multiplication is completed. We then specify the singular value limit we want to use when doing the calculations:
def channel_compress(color_channel, singular_value_limit):
    u, s, v = numpy.linalg.svd(color_channel)
    compressed = numpy.zeros((color_channel.shape[0], color_channel.shape[1]))
    n = singular_value_limit

    left_matrix = numpy.matmul(u[:, 0:n], numpy.diag(s)[0:n, 0:n])
    inner_compressed = numpy.matmul(left_matrix, v[0:n, :])
    compressed = inner_compressed.astype('uint8')
    return compressed
red, green, blue = load_image("dog3.jpg")
singular_val_lim = 350
After this, we do matrix multiplication on the diagonal and the value limits in the U matrix, as described above. This gets us the left matrix and we then multiply it with the V matrix. This should get us the compressed values which we transform to the ‘uint8’ type:
def compress_image(red, green, blue, singular_val_lim):
    compressed_red = channel_compress(red, singular_val_lim)
    compressed_green = channel_compress(green, singular_val_lim)
    compressed_blue = channel_compress(blue, singular_val_lim)

    im_red = Image.fromarray(compressed_red)
    im_blue = Image.fromarray(compressed_blue)
    im_green = Image.fromarray(compressed_green)

    new_image = Image.merge("RGB", (im_red, im_green, im_blue))
    new_image.show()
    new_image.save("dog3-edited.jpg")
compress_image(red, green, blue, singular_val_lim)
We'll be using an image of a dog (dog3.jpg) to test our SVD compression on.
We also need to set the singular value limit we'll use; let's start with 350 for now:
red, green, blue = load_image("dog3.jpg")
singular_val_lim = 350
Finally, we can get the compressed values for the three color channels and transform them from Numpy arrays into image components using PIL. We then just have to join the three channels together and show the image. This image should be a little smaller and simpler than the original image:
Indeed, if you inspect the size of the images, you’ll notice that the compressed one is smaller, though we’ve also had a bit of lossy compression. You can see some noise in the image as well.
You can play around with adjusting the singular value limit. The lower the chosen limit, the greater the compression will be, but at a certain point image artifacting will show up and the image will degrade in quality.
Linear Discriminant Analysis
operates by projecting data from a multidimensional graph onto a linear graph. The easiest way to conceive of this is with a graph filled up with data points of two different classes. Assuming that there is no line that will neatly separate the data into two classes, the two dimensional graph can be reduced down into a 1D graph. This 1D graph can then be used to hopefully achieve the best possible separation of the data points.
When LDA is carried out there are two primary goals: minimizing the variance within each of the two classes and maximizing the distance between the means of the two classes.
In order to achieve this, a new axis is constructed in the 2D graph. This new axis should separate the two classes of data points based on the previously mentioned criteria. Once the new axis has been created, the data points within the 2D graph are redrawn along it.
LDA carries out three steps to move the original graph to the new axis. First, the separability between the classes has to be calculated; this is based on the distance between the class means, or the between-class variance. Next, the within-class variance must be calculated, which is the distance between the mean and the samples within each class. Finally, the lower-dimensional space that maximizes the between-class variance has to be constructed.
LDA works best when the means of the classes are far from each other. If the means of the distribution are shared it won’t be possible for LDA to separate the classes with a new linear axis.
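A minimal scikit-learn sketch of LDA as a supervised projection, using the Iris dataset as an assumed example (with 3 classes, at most 2 discriminant axes exist):
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)   # project onto the two discriminant axes
print(X_lda.shape)                # (150, 2)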