Section 7 - Feature selection, dimension reduction, statistical methods, PCA, & operations Flashcards

1
Q

Training time

A

increases with the number of features, and in the worst case can grow exponentially.

2
Q

Models have increasing risk of overfitting with increasing number of ___________

A

features

3
Q

Filter methods

A

consider the relationship between features and the target variable to compute the importance of each feature, independently of any particular model.

4
Q

F Test

A

A statistical test used to compare models and check whether the difference between them is significant.

The F-Test performs hypothesis testing on two models, X and Y, where X is a model built from just a constant and Y is a model built from a constant and a feature.

The least-squares errors of both models are compared to check whether the difference in error between X and Y is significant or merely introduced by chance.

The F-Test is useful in feature selection because it tells us the significance of each feature in improving the model.

Scikit-learn provides selection of the K best features using the F-Test.

For regression tasks:
sklearn.feature_selection.f_regression

For classification tasks:
sklearn.feature_selection.f_classif
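
A minimal sketch of this kind of selection with scikit-learn's SelectKBest; the synthetic data and the value of k are placeholders, not part of the original card:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Toy regression data: 100 samples, 10 features (stand-in for your own X, y)
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)

# Keep the 3 features with the highest F-scores
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)   # (100, 3)
print(selector.scores_)   # F-score of each original feature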

5
Q

There are some drawbacks of using the F-Test to select your features. The F-Test checks for and only captures linear relationships between features and labels: a highly correlated feature is given a higher score and less correlated features are given lower scores.

A
  1. Correlation is highly deceptive as it doesn’t capture strong non-linear relationships.
  2. Using summary statistics like correlation may be a bad idea, as illustrated by Anscombe’s quartet.
6
Q

Mutual information

A

Mutual Information between two variables measures the dependence of one variable on another. If X and Y are two variables:

If X and Y are independent, then no information about Y can be obtained by knowing X, or vice versa. Hence their mutual information is 0.

If X is a deterministic function of Y, then we can determine X from Y and Y from X, and their (normalized) mutual information is 1.

When Y = f(X, Z, M, N), the (normalized) mutual information lies between 0 and 1.

7
Q

We can select our features from feature space by ranking their mutual information with the target variable.

The advantage of using Mutual Information over the F-Test is that it also captures non-linear relationships between a feature and the target variable.

Sklearn offers feature selection with Mutual Information for regression and classification tasks.

A

sklearn.feature_selection.mutual_info_regression
sklearn.feature_selection.mutual_info_classif
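
A minimal sketch of mutual-information-based selection; the synthetic data and the value of k are placeholders:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy classification data (stand-in for your own X, y)
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

# Rank features by estimated mutual information with the target and keep the top 3
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)   # estimated mutual information per feature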

8
Q

Variance threshold

A

This method removes features with variation below a certain cutoff.

The idea is when a feature doesn’t vary much within itself, it generally has very little predictive power.

sklearn.feature_selection.VarianceThreshold

Variance Threshold doesn’t consider the relationship of features with the target variable.
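
A minimal sketch of this filter; the small array below is placeholder data with a constant first column:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2.0, 0.1],
              [0, 1.5, 0.2],
              [0, 3.0, 0.1],
              [0, 2.5, 0.3]])

# With the default threshold of 0.0, only zero-variance (constant) features are dropped
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)   # (4, 2) -- the constant first column is removed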

9
Q

Wrapper methods

A

Wrapper methods generate models with subsets of features and gauge their performance.

10
Q

Forward search

A

This method lets you search for the best features w.r.t. model performance and add them to your feature subset one after the other.

For data with n features,

-> On the first round, 'n' models are created, each with a single feature, and the most predictive feature is selected.

-> On the second round, 'n-1' models are created, each combining one remaining feature with the previously selected feature.

-> This is repeated until the best subset of 'm' features is selected (see the sketch below).
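
A minimal forward-search sketch using scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions); the estimator, dataset, and number of features are arbitrary example choices:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Greedily add one feature at a time until 2 features have been selected
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2,
                                direction="forward")
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the selected features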

11
Q

Recursive Feature Elimination

A

As the name suggests, this method eliminates the worst-performing features one after the other until the best subset of features is known.

For data with n features,

-> On the first round, models are created leaving one feature out at a time, and the worst-performing feature is removed.

-> On the second round, the process is repeated on the remaining features, removing another feature.

Wrapper methods promise the best set of features via an extensive greedy search (see the sketch below).

But the main drawback of wrapper methods is the sheer number of models that need to be trained. They are computationally very expensive and become infeasible with a large number of features.
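
A minimal sketch of recursive feature elimination with scikit-learn's RFE; the estimator, dataset, and target number of features are arbitrary example choices:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Repeatedly drop the least important feature until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)    # True for the features that were kept
print(rfe.ranking_)    # 1 marks a selected feature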

12
Q

Embedded Methods

A

Feature selection can also be achieved through the insights provided by some machine learning models.

LASSO Linear Regression can be used for feature selection. Lasso Regression is performed by adding an extra penalty term to the cost function of Linear Regression. Apart from preventing overfitting, this also shrinks the coefficients of less important features to zero.
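
A minimal embedded-selection sketch using Lasso coefficients with SelectFromModel; the dataset and alpha value are arbitrary example choices:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Lasso's L1 penalty drives the coefficients of weak features to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)
print(selector.get_support())   # True where the coefficient is non-zero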

13
Q

Tree based models

A

calculate feature importance, since they need to keep the best-performing features as close to the root of the tree as possible. Constructing a decision tree involves repeatedly choosing the most predictive feature to split on.

Feature importance in tree-based models is calculated based on the Gini index, entropy, or chi-square value.

Feature selection, like most things in data science, is highly context- and data-dependent, and there is no one-stop solution. The best way forward is to understand the mechanism of each method and use it when required.
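
A minimal sketch of reading impurity-based importances from a tree ensemble; the dataset and model settings are arbitrary example choices:

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Gini-based importance per feature; the values sum to 1
for name, importance in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")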

14
Q

When you’re getting started with a machine learning (ML) project, one critical principle to keep in mind is that data is everything. It is often said that if ML is the rocket engine, then the fuel is the (high-quality) data fed to ML algorithms. However, deriving truth and insight from a pile of data can be a complicated and error-prone job. To have a solid start for your ML project, it always helps to analyze the data up front, a practice that describes the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis. During that process, it’s important that you get a deep understanding of:

A

The properties of the data, such as schema and statistical properties;

The quality of the data, like missing values and inconsistent data types;

The predictive power of the data, such as correlation of features against target.

15
Q

Descriptive analysis
Univariate analysis

A

Descriptive analysis, or univariate analysis, provides an understanding of the characteristics of each attribute of the dataset. It also offers important evidence for feature preprocessing and selection in a later stage. The following table lists the suggested analysis for attributes that are common, numerical, categorical and textual.

16
Q

Correlation analysis
bivariate analysis

A

Correlation analysis (or bivariate analysis) examines the relationship between two attributes, say X and Y, and examines whether X and Y are correlated. This analysis can be done from two perspectives to get various possible combinations:

Qualitative analysis. This performs computation of the descriptive statistics of dependent numerical/categorical attributes against each unique value of the independent categorical attribute. This perspective helps intuitively understand the relationship between X and Y. Visualizations are often used together with qualitative analysis as a more intuitive way of presenting the result.

17
Q

Quantitative analysis

A

This is a quantitative test of the relationship between X and Y, based on hypothesis testing framework. This perspective provides a formal and mathematical methodology to quantitatively determine the existence and/or strength of the relationship.

18
Q

Contextual analysis

A

Descriptive analysis and correlation analysis are both generic enough to be performed on any structured dataset, neither of which requires context information. To further understand or profile the given dataset and to gain more domain-specific insights, you can use one of two common contextual information-based analyses:

Time-based analysis: In many real-world datasets, the timestamp (or a similar time-related attribute) is one of the key pieces of contextual information. Observing and/or understanding the characteristics of the data along the time dimension, with various granularities, is essential to understanding the data generation process and ensuring data quality

Agent-based analysis: As an alternative to the time, the other common attribute is the unique identification (ID, such as user ID) of each record. Analyzing the dataset by aggregating along the agent dimension, i.e., histogram of number of records per agent, can further help improve your understanding of the dataset.

19
Q

The ultimate goal of EDA (whether rigorous or through visualization) is to provide insights on the dataset you’re studying. This can inspire your subsequent feature selection, engineering, and model-building process.

Descriptive analysis provides the basic statistics of each attribute of the dataset. Those statistics can help you identify the following issues:

A

High percentage of missing values

Low variance of numeric attributes

Low entropy of categorical attributes

Imbalance of categorical target (class imbalance)

Skewed distribution of numeric attributes

High cardinality of categorical attributes

20
Q

The correlation analysis examines the relationship between two attributes. There are two typical action points triggered by the correlation analysis in the context of feature selection or feature engineering:

A

Low correlation between feature and target

High correlation between features

21
Q

Once you’ve identified issues, the next task is to make a sound decision on how to properly mitigate these issues. One such example is for “High percentage of missing values.” The identified problem is that the attribute is missing in a significant proportion of the data points. The threshold or definition of “significant” can be set based on domain knowledge. There are two options to handle this, depending on the business scenario:

A

Assign a unique value to the missing value records, if the missing value, in certain contexts, is actually meaningful. For example, a missing value could indicate that a monitored, underlying process was not functioning properly.

Discard the feature if the values are missing due to misconfiguration, issues with data collection or untraceable random reasons, and the historic data can’t be reconstituted.

22
Q

Dimensionality Reduction

A

the process of reducing the number of features in a dataset while retaining as much information as possible. It is often used in the field of data science to improve the performance of machine learning models, reduce the risk of overfitting, and make data easier to visualize.

23
Q

High-dimensional data can be difficult to visualize, making it harder to understand patterns and relationships in the data.

High-dimensional data can be computationally expensive to process, making it slower and harder to train machine learning models.

High-dimensional data can increase the risk of overfitting, which can lead to poor performance on unseen data.

A

Dimensionality reduction is a powerful technique used in data science to reduce the number of features in a dataset while retaining as much information as possible. It can be used to improve the performance of machine learning models, reduce the risk of overfitting, and make data easier to visualize. Popular techniques for dimensionality reduction include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Autoencoders, with PCA being the most widely used method.

24
Q

Feature selection

A

approaches try to find a subset of the input variables (also called features or attributes). The three strategies are: the filter strategy (e.g. information gain), the wrapper strategy (e.g. search guided by accuracy), and the embedded strategy (selected features are added or removed while building the model based on prediction errors).

Data analysis such as regression or classification can be done in the reduced space more accurately than in the original space.

25
Q

Feature projection

A

transforms the data from the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.[4][5] For multidimensional data, tensor representation can be used in dimensionality reduction through multilinear subspace learning

26
Q

Principal component analysis (PCA)

A

The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance (and sometimes the correlation) matrix of the data is constructed and the eigenvectors on this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can now be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system, because they often contribute the vast majority of the system’s energy, especially in low-dimensional systems. Still, this must be proven on a case-by-case basis as not all systems exhibit this behavior. The original space (with dimension of the number of points) has been reduced (with data loss, but hopefully retaining the most important variance) to the space spanned by a few eigenvectors.

27
Q

Non-negative matrix factorization (NMF)

A

NMF decomposes a non-negative matrix to the product of two non-negative ones, which has been a promising tool in fields where only non-negative signals exist,[7][8] such as astronomy.[9][10] NMF is well known since the multiplicative update rule by Lee & Seung,[7] which has been continuously developed: the inclusion of uncertainties,[9] the consideration of missing data and parallel computation,[11] sequential construction[11] which leads to the stability and linearity of NMF,[10] as well as other updates including handling missing data in digital image processing.[12]

With a stable component basis during construction, and a linear modeling process, sequential NMF[11] is able to preserve the flux in direct imaging of circumstellar structures in astronomy,[10] as one of the methods of detecting exoplanets, especially for the direct imaging of circumstellar discs. In comparison with PCA, NMF does not remove the mean of the matrices, which leads to unphysical non-negative fluxes; therefore NMF is able to preserve more information than PCA as demonstrated by Ren et al

28
Q

Kernel PCA

A

Principal component analysis can be employed in a nonlinear way by means of the kernel trick. The resulting technique, called kernel PCA, is capable of constructing nonlinear mappings that maximize the variance in the data.
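
A minimal kernel PCA sketch in scikit-learn; the dataset, RBF kernel, and gamma value are arbitrary example choices:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Concentric circles are not linearly separable, so plain PCA cannot unfold them
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)   # (300, 2)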

29
Q

Graph-based kernel PCA

A

Other prominent nonlinear techniques include manifold learning techniques such as Isomap, locally linear embedding (LLE),[13] Hessian LLE, Laplacian eigenmaps, and methods based on tangent space analysis.[14] These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for Kernel PCA.

More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space), while maximizing the distances between points that are not nearest neighbors.

An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include: classical multidimensional scaling, which is identical to PCA; Isomap, which uses geodesic distances in the data space; diffusion maps, which use diffusion distances in the data space; t-distributed stochastic neighbor embedding (t-SNE), which minimizes the divergence between distributions over pairs of points; and curvilinear component analysis.

A different approach to nonlinear dimensionality reduction is through the use of autoencoders, a special kind of feedforward neural networks with a bottle-neck hidden layer.[15] The training of deep encoders is typically performed using a greedy layer-wise pre-training (e.g., using a stack of restricted Boltzmann machines) that is followed by a finetuning stage based on backpropagation.

30
Q

Linear discriminant analysis (LDA)

A

a generalization of Fisher’s linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.

31
Q

Generalized discriminant analysis (GDA)

A

GDA deals with nonlinear discriminant analysis using kernel function operator. The underlying theory is close to the support-vector machines (SVM) insofar as the GDA method provides a mapping of the input vectors into high-dimensional feature space.[16][17] Similar to LDA, the objective of GDA is to find a projection for the features into a lower dimensional space by maximizing the ratio of between-class scatter to within-class scatter.

32
Q

Autoencoder

A

can be used to learn nonlinear dimension reduction functions and codings together with an inverse function from the coding to the original representation.

33
Q

t-SNE

A

T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique useful for visualization of high-dimensional datasets. It is not recommended for use in analysis such as clustering or outlier detection since it does not necessarily preserve densities or distances well
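
A minimal visualization-oriented sketch; the digits dataset and perplexity value are arbitrary example choices:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Embed the 64-dimensional digit images into 2D for visualization only
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)   # (1797, 2)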

34
Q

UMAP

A

Uniform manifold approximation and projection (UMAP) is a nonlinear dimensionality reduction technique. Visually, it is similar to t-SNE, but it assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant.
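
A minimal sketch assuming the third-party umap-learn package is installed; the parameter values are arbitrary example choices:

import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# n_neighbors trades off local vs. global structure; min_dist controls how tightly points cluster
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X)
print(X_2d.shape)   # (1797, 2)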

35
Q

Dimension reduction

A

For high-dimensional datasets (i.e. with number of dimensions more than 10), dimension reduction is usually performed prior to applying a K-nearest neighbors algorithm (k-NN) in order to avoid the effects of the curse of dimensionality.[19]

Feature extraction and dimension reduction can be combined in one step using principal component analysis (PCA), linear discriminant analysis (LDA), canonical correlation analysis (CCA), or non-negative matrix factorization (NMF) techniques as a pre-processing step followed by clustering by K-NN on feature vectors in reduced-dimension space. In machine learning this process is also called low-dimensional embedding.[20]

For very-high-dimensional datasets (e.g. when performing similarity search on live video streams, DNA data or high-dimensional time series) running a fast approximate K-NN search using locality-sensitive hashing, random projection,[21] “sketches”,[22] or other high-dimensional similarity search techniques from the VLDB conference toolbox might be the only feasible option.

36
Q

Why is Dimensionality Reduction Needed?

A

Often in machine learning, the more features that are present in the dataset the better a classifier can learn. However, more features also means a higher computational cost. Not only can high dimensionality lead to long training times, more features often lead to an algorithm overfitting as it tries to create a model that explains all the features in the data.

Because dimensionality reduction reduces the overall number of features, it can reduce the computational demands associated with training a model but also helps combat overfitting by keeping the features that will be fed to the model fairly simple.

Dimensionality reduction can be used in both supervised and unsupervised learning contexts. In the case of unsupervised learning, dimensionality reduction is often used to preprocess the data by carrying out feature selection or feature extraction.

The primary algorithms used to carry out dimensionality reduction for unsupervised learning are Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).

In the case of supervised learning, dimensionality reduction can be used to simplify the features fed into the machine learning classifier. The most common methods used to carry out dimensionality reduction for supervised learning problems are Linear Discriminant Analysis (LDA) and PCA, and they can be utilized to predict new cases.

Take note that the use cases described above are general use cases and not the only conditions these techniques are used in. After all, dimensionality reduction techniques are statistical methods and their use is not restricted by machine learning models.

37
Q

PCA Implementation Example
Let’s take a look at how PCA can be implemented in Scikit-Learn. We’ll be using the Mushroom classification dataset for this.

First, we need to import all the modules we need, which includes PCA, train_test_split, and labeling and scaling tools:

A

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

After we load in the data, we’ll check for any null values. We’ll also encode the data with the LabelEncoder. The class feature is the first column in the dataset, so we split up the features and labels accordingly:

m_data = pd.read_csv('mushrooms.csv')

# Machine learning systems work with integers, so we need to encode these
# string characters into ints

encoder = LabelEncoder()

# Now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])

X_features = m_data.iloc[:, 1:23]
y_label = m_data.iloc[:, 0]
We'll now scale the features with the standard scaler. This is optional as we aren't actually running the classifier, but it may impact how our data is analyzed by PCA:

# Scale the features
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)

We'll now use PCA to get the list of features and plot which features have the most explanatory power, or have the most variance. These are the principal components. It looks like around 17 or 18 of the features explain the majority, almost 95%, of our data:

# Visualize
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_

plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()

Let’s convert the features into the 17 top features. We’ll then plot a scatter plot of the data point classification based on these 17 features:

pca2 = PCA(n_components=17)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)

plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data['class'])
plt.show()

Let’s also do this for the top 2 features and see how the classification changes:

pca3 = PCA(n_components=2)
pca3.fit(X_features)
x_3d = pca3.transform(X_features)

plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,1], c=m_data['class'])
plt.show()

38
Q

Singular Value Decomposition

A

The purpose of Singular Value Decomposition is to simplify a matrix and make doing calculations with the matrix easier. The matrix is reduced to its constituent parts, similar to the goal of PCA. Understanding the ins and outs of SVD isn’t completely necessary to implement it in your machine learning models, but having an intuition for how it works will give you a better idea of when to use it.

SVD can be carried out on either complex or real-valued matrices, but to make this explanation easier to understand, we’ll go over the method of decomposing a real-valued matrix.

When doing SVD we have a matrix filled in with data and we want to reduce the number of columns the matrix has. This reduces the dimensionality of the matrix while still preserving as much of the variability in the data as possible.

We can express matrix A as the product of U, D, and the transpose of matrix V:

A = U * D * V^t

Assuming we have some matrix A, we can represent that matrix as three other matrices called U, V, and D. Matrix A has the original x × y elements, while matrix U is an orthogonal matrix containing x × x elements and matrix V is a different orthogonal matrix containing y × y elements. Finally, D is a diagonal matrix containing x × y elements.

Decomposing a matrix involves converting the singular values of the original matrix into the diagonal values of the new matrix D. Orthogonal matrices do not have their properties changed when they are multiplied by other numbers, and we can take advantage of this property to get an approximation of matrix A: when we multiply U and D together with the transpose of V, we get a matrix that is equivalent to the original matrix A.

When we break/decompose matrix A down into U, D, and V, we then have three different matrices that contain the information of Matrix A.

It turns out that the left-most columns of the matrices hold the majority of our data, and we can select just these few columns to have a good approximation of Matrix A. This new matrix is much simpler and easier to work with, as it has far fewer dimensions.
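
A minimal NumPy sketch of this idea, keeping only the largest singular values to build a low-rank approximation; the random matrix is placeholder data:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))   # placeholder matrix

# full_matrices=False gives the "economy" SVD: U (5x4), s (4,), Vt (4x4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: exact reconstruction

# Keep only the k largest singular values/vectors for a rank-k approximation
k = 2
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A_approx.shape)   # (5, 4), but effectively rank 2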

39
Q

SVD Implementation Example

A

One of the most common ways that SVD is used is to compress images. After all, the pixel values that make up the red, green, and blue channels in the image can just be reduced and the result will be an image that is less complex but still contains the same image content. Let’s try using SVD to compress an image and render it.

We’ll use several functions to handle the compression of the image. We’ll really only need Numpy and the Image function from the PIL library in order to accomplish this, since Numpy has a method to carry out the SVD calculation:

import numpy
from PIL import Image
First, we’ll just write a function to load in the image and turn it into a Numpy array. We then want to select the red, green, and blue color channels from the image:

def load_image(image):
    image = Image.open(image)
    im_array = numpy.array(image)

    red = im_array[:, :, 0]
    green = im_array[:, :, 1]
    blue = im_array[:, :, 2]

    return red, green, blue

Now that we have the colors, we need to compress the color channels. We can start by calling Numpy’s SVD function on the color channel we want. We’ll then create an array of zeroes that we’ll fill in after the matrix multiplication is completed. We then specify the singular value limit we want to use when doing the calculations:

def channel_compress(color_channel, singular_value_limit):
    u, s, v = numpy.linalg.svd(color_channel)
    compressed = numpy.zeros((color_channel.shape[0], color_channel.shape[1]))
    n = singular_value_limit

    left_matrix = numpy.matmul(u[:, 0:n], numpy.diag(s)[0:n, 0:n])
    inner_compressed = numpy.matmul(left_matrix, v[0:n, :])
    compressed = inner_compressed.astype('uint8')
    return compressed

red, green, blue = load_image("dog3.jpg")
singular_val_lim = 350

After this, we do matrix multiplication on the diagonal and the value limits in the U matrix, as described above. This gets us the left matrix and we then multiply it with the V matrix. This should get us the compressed values which we transform to the ‘uint8’ type:

def compress_image(red, green, blue, singular_val_lim):
    compressed_red = channel_compress(red, singular_val_lim)
    compressed_green = channel_compress(green, singular_val_lim)
    compressed_blue = channel_compress(blue, singular_val_lim)

    im_red = Image.fromarray(compressed_red)
    im_blue = Image.fromarray(compressed_blue)
    im_green = Image.fromarray(compressed_green)

    new_image = Image.merge("RGB", (im_red, im_green, im_blue))
    new_image.show()
    new_image.save("dog3-edited.jpg")

compress_image(red, green, blue, singular_val_lim)

We'll be using this image of a dog to test our SVD compression on.

We also need to set the singular value limit we'll use; let's start with 350 for now:

red, green, blue = load_image("dog.jpg")
singular_val_lim = 350
Finally, we can get the compressed values for the three color channels and transform them from Numpy arrays into image components using PIL. We then just have to join the three channels together and show the image. This image should be a little smaller and simpler than the original image:

Indeed, if you inspect the size of the images, you’ll notice that the compressed one is smaller, though we’ve also had a bit of lossy compression. You can see some noise in the image as well.

You can play around with adjusting the singular value limit. The lower the chosen limit the greater the compression will be, but at a certain point image artifact-ing will show up and the image will degrade in quality:

def compress_image(red, green, blue, singular_val_lim):
    compressed_red = channel_compress(red, singular_val_lim)
    compressed_green = channel_compress(green, singular_val_lim)
    compressed_blue = channel_compress(blue, singular_val_lim)

    im_red = Image.fromarray(compressed_red)
    im_blue = Image.fromarray(compressed_blue)
    im_green = Image.fromarray(compressed_green)

    new_image = Image.merge("RGB", (im_red, im_green, im_blue))
    new_image.show()

compress_image(red, green, blue, singular_val_lim)

40
Q

Linear Discriminant Analysis

A

operates by projecting data from a multidimensional graph onto a linear graph. The easiest way to conceive of this is with a graph filled up with data points of two different classes. Assuming that there is no line that will neatly separate the data into two classes, the two dimensional graph can be reduced down into a 1D graph. This 1D graph can then be used to hopefully achieve the best possible separation of the data points.

When LDA is carried out there are two primary goals: minimizing the variance of the two classes and maximizing the distance between the means of the two data classes.

In order to achieve this, a new axis will be plotted in the 2D graph. This new axis should separate the two classes of data points based on the previously mentioned criteria. Once the new axis has been created, the data points within the 2D graph are redrawn along the new axis.

LDA carries out three different steps to move the original graph to the new axis. First, the separability between the classes has to be calculated, and this is based on the distance between the class means or the between-class variance. In the next step, the within class variance must be calculated, which is the distance between the mean and sample for the different classes. Finally, the lower dimensional space that maximizes the between class variance has to be constructed.

LDA works best when the means of the classes are far from each other. If the means of the distribution are shared it won’t be possible for LDA to separate the classes with a new linear axis.

41
Q

LDA Implementation Example
Finally, let’s see how LDA can be used to carry out dimensionality reduction. Note that LDA can be used as a classification algorithm in addition to carrying out dimensionality reduction.

We’ll be using the Titanic dataset for the following example.

Let’s start off by making all our necessary imports:

A

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
We’ll now load in our training data, which we’ll divide into training and validation sets.

Though, we need to do a little data preprocessing first. Let’s drop the Name, Cabin, and Ticket columns as they don’t carry a lot of useful info. We also need to fill in any missing data, which we’ll replace with median values in the case of the Age feature and an S in the case of the Embarked feature:

training_data = pd.read_csv("train.csv")

# Let's drop the cabin and ticket columns
training_data.drop(labels=['Cabin', 'Ticket'], axis=1, inplace=True)

training_data["Age"].fillna(training_data["Age"].median(), inplace=True)
training_data["Embarked"].fillna("S", inplace=True)

We also need to encode the non-numerical features. We'll encode both the Sex and Embarked columns. Let's drop the Name column as well, since it seems unlikely to be useful in classification:

encoder_1 = LabelEncoder()

# Fit the encoder on the data
encoder_1.fit(training_data["Sex"])

# Transform and replace the training data
training_sex_encoded = encoder_1.transform(training_data["Sex"])
training_data["Sex"] = training_sex_encoded

encoder_2 = LabelEncoder()
encoder_2.fit(training_data["Embarked"])

training_embarked_encoded = encoder_2.transform(training_data["Embarked"])
training_data["Embarked"] = training_embarked_encoded

# Assume the name is going to be useless and drop it
training_data.drop("Name", axis=1, inplace=True)

We need to scale the values, but the Scaler tool takes arrays, so the values we want to reshape need to be turned into arrays first. After that, we can scale the data:

# Remember that the scaler takes arrays
ages_train = np.array(training_data["Age"]).reshape(-1, 1)
fares_train = np.array(training_data["Fare"]).reshape(-1, 1)

scaler = StandardScaler()

training_data["Age"] = scaler.fit_transform(ages_train)
training_data["Fare"] = scaler.fit_transform(fares_train)

# Now to select our training and testing data
features = training_data.drop(labels=['PassengerId', 'Survived'], axis=1)
labels = training_data['Survived']
We can now select the training features and labels and use train_test_split to make our training and validation data. It’s easy to do classification with LDA, you handle it just like you would any other classifier in Scikit-Learn.

Just fit the function on the training data and have it predict on the validation/testing data. We can then print metrics for the predictions against the actual values:

X_train, X_val, y_train, y_val = train_test_split(features, labels, test_size=0.2, random_state=27)

model = LDA()
model.fit(X_train, y_train)
preds = model.predict(X_val)
acc = accuracy_score(y_val, preds)
f1 = f1_score(y_val, preds)

print("Accuracy: {}".format(acc))
print("F1 Score: {}".format(f1))
Here’s the print out:

Accuracy: 0.8100558659217877
F1 Score: 0.734375

When it comes to transforming the data and reducing dimensionality, let’s run a Logistic Regression classifier on the data first so we can see what our performance is prior to dimensionality reduction:

logreg_clf = LogisticRegression()
logreg_clf.fit(X_train, y_train)
preds = logreg_clf.predict(X_val)
acc = accuracy_score(y_val, preds)
f1 = f1_score(y_val, preds)

print("Accuracy: {}".format(acc))
print("F1 Score: {}".format(f1))
Here’s the results:

Accuracy: 0.8100558659217877
F1 Score: 0.734375
Now we will transform the data features by specifying a number of desired components for LDA and fitting the model on the features and labels. We then just transform the features and save it into a new variable. Let’s print out the original and reduced number of features:

LDA_transform = LDA(n_components=1)
LDA_transform.fit(features, labels)
features_new = LDA_transform.transform(features)

# Print the number of features
print('Original feature #:', features.shape[1])
print('Reduced feature #:', features_new.shape[1])

# Print the ratio of explained variance
print(LDA_transform.explained_variance_ratio_)
Here’s the print out for the above code:

Original feature #: 7
Reduced feature #: 1
[1.]
We now just have to do train/test split again with the new features and run the classifier again to see how performance changed:

X_train, X_val, y_train, y_val = train_test_split(features_new, labels, test_size=0.2, random_state=27)

logreg_clf = LogisticRegression()
logreg_clf.fit(X_train, y_train)
preds = logreg_clf.predict(X_val)
acc = accuracy_score(y_val, preds)
f1 = f1_score(y_val, preds)

print("Accuracy: {}".format(acc))
print("F1 Score: {}".format(f1))
Accuracy: 0.8212290502793296
F1 Score: 0.7500000000000001

42
Q

Regression

A

It is used for estimating the relationship between dependent and independent variables. It is useful for determining the strength of the relationship among these variables and for modeling the future relationship between them. It has multiple variants, such as Linear Regression, Multiple Linear Regression, and Non-Linear Regression, of which Linear and Multiple Linear are the most common. It has numerous applications across disciplines, including finance.

Suppose you are asked to predict how much snowfall will happen this year, knowing that global warming is reducing the average snowfall in your city. You are provided with tabular data of the amount of snowfall (in inches) that happened each year. Looking at the table you can estimate that the average snowfall will be 20-40 inches, which is a fair estimate, but this can be made even better using regression.

Since regression fits a line to the points, look at the regression line: it is clear that our initial estimate of 20-40 inches for 2015 is nowhere close to the likely value. Following the regression line, we can estimate that the snowfall for the year 2015 will be somewhere around 5-10 inches. Along with this estimate, regression also provides the line equation, which in this case is:

y = -2.2923x + 4624.4

This means that we can plug in the year as the x value in the equation and get the estimate for that year.

Let's say you want to predict the snowfall in the year 2016; then by putting 2016 in place of x we get:

y = -2.2923*2016 + 4624.4 ≈ 3.12 inches.
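
A minimal NumPy sketch of fitting such a line by least squares; the year/snowfall values below are made-up placeholders, not the table from this card:

import numpy as np

# Hypothetical (year, snowfall in inches) observations
years = np.array([2005, 2007, 2009, 2011, 2013])
snowfall = np.array([35.0, 28.0, 22.0, 16.0, 11.0])

# Fit a straight line y = m*x + c by least squares
m, c = np.polyfit(years, snowfall, deg=1)
print(m, c)

# Plug a future year into the fitted equation to get an estimate
print(m * 2016 + c)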

43
Q

Standard deviation

A

Standard deviation measures how the data are concentrated around their mean: more concentration results in a smaller standard deviation, and vice versa. In other words, it is a summary measure of the difference of each observation from the mean. If we simply add the differences, the positive ones would exactly equal the negative ones, and adding both would result in zero. In order to calculate the standard deviation, we need to compute the variance first. Variance is a measure of how far the set of numbers is spread out.
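
A minimal NumPy sketch of this calculation; the observations are placeholder values:

import numpy as np

# Hypothetical observations
data = np.array([4.0, 7.0, 9.0, 10.0, 5.0])

deviations = data - data.mean()
print(deviations.sum())               # ~0: positive and negative deviations cancel out

variance = np.mean(deviations ** 2)   # mean squared deviation (population variance)
std_dev = np.sqrt(variance)
print(variance, std_dev)
print(np.var(data), np.std(data))     # the same values via NumPy's built-ins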

44
Q

Mean

A

Mean is the sum of a list of numbers divided by the total number of items in the list. It is also known as the average. Assume that we have 3 different ratings for a movie: the first is 7.0, the second rating is 9.0, and the third 6.0. The mean of these ratings is calculated by summing them up and then dividing by the number of ratings: (7.0 + 9.0 + 6.0) / 3 ≈ 7.33.

45
Q

Sample Size Determination

A

Sample size determination, as the name suggests, is choosing how many samples of the dataset are used for the analysis of the data. This technique is useful when we have a large dataset and we don't want to go through every record; instead, we select some data from the dataset in such a way that the selection is not biased. The whole idea is to get the right amount of data for the sample, because if that is not correct then the whole data analysis will be affected. Determining the sample size is therefore an important issue: too large a sample size results in a waste of time, money, and resources, while too small a sample size results in inaccurate results. In many cases, it is easy to determine the minimum sample size needed to estimate a process parameter.

46
Q

Hypothesis Testing

A

In this testing, we determine whether a premise is true for the dataset or not. Analysts test the samples with the goal of accepting or rejecting the null hypothesis. The examination starts with a random sample from the population. The null hypothesis is the claim that the analyst assumes to be true, and the alternative hypothesis is assumed to be false.

For example, if a person wants to prove that a coin has a 50% chance of landing on heads, then the null hypothesis will be yes, and the alternate hypothesis will be no. Mathematically, we can represent the null hypothesis as Ho: P = 0.5. The alternate hypothesis is denoted by Ha.

100 coin flips are taken from a random population of coins and the null hypothesis is tested. If the 100 coin flips come out as 40 heads and 60 tails, the analyst can conclude that the coin does not have a 50% chance of landing on heads, reject the null hypothesis, and accept the alternate hypothesis. After that, a new hypothesis can be tested, this time that the coin has a 40% chance of landing on heads.

47
Q

PCA Step 1 - Data normalization

A

let’s consider, for instance, the following information for a given client.

Monthly expenses: $300
Age: 27
Rating: 4.5
This information has different scales and performing PCA using such data will lead to a biased result. This is where data normalization comes in. It ensures that each attribute has the same level of contribution, preventing one variable from dominating others. For each variable, normalization is done by subtracting its mean and dividing by its standard deviation.

48
Q

Step 2 - Covariance matrix

A

As the name suggests, this step is about computing the covariance matrix from the normalized data. This is a symmetric matrix, and each element (i, j) corresponds to the covariance between variables i and j.

49
Q

Step 3 - Eigenvectors and eigenvalues

A

Geometrically, an eigenvector represents a direction such as “vertical” or “90 degrees”. An eigenvalue, on the other hand, is a number representing the amount of variance present in the data for a given direction. Each eigenvector has its corresponding eigenvalue.

50
Q

Step 4 - Selection of principal components

A

There are as many pairs of eigenvectors and eigenvalues as there are variables in the data. In the data with only monthly expenses, age, and rating, there will be three pairs. Not all the pairs are relevant. The eigenvector with the highest eigenvalue corresponds to the first principal component. The second principal component is the eigenvector with the second highest eigenvalue, and so on.

51
Q

Step 5 - Data transformation in new dimensional space

A

This step involves re-orienting the original data onto a new subspace defined by the principal components. This reorientation is done by multiplying the original data by the previously computed eigenvectors.

It is important to remember that this transformation does not modify the original data itself but instead provides a new perspective to better represent the data.
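
A minimal NumPy sketch of the five steps above; the small client table is made-up placeholder data:

import numpy as np

# Placeholder data: 5 clients x 3 attributes (monthly expenses, age, rating)
X = np.array([[300.0, 27, 4.5],
              [120.0, 34, 3.9],
              [560.0, 41, 4.8],
              [210.0, 23, 4.1],
              [430.0, 52, 4.6]])

# Step 1: normalize each variable (subtract its mean, divide by its standard deviation)
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the normalized data
cov = np.cov(X_norm, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh is suited to symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: sort the pairs by decreasing eigenvalue and keep the top 2 components
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]

# Step 5: project the data onto the new 2-dimensional subspace
X_pca = X_norm @ components
print(X_pca.shape)   # (5, 2)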

52
Q

Principal components analysis, often abbreviated PCA, is an unsupervised machine learning technique that seeks to find principal components – linear combinations of the original predictors – that explain a large portion of the variation in a dataset.

The goal of PCA is to explain most of the variability in a dataset with fewer variables than the original dataset.

A

For a given dataset with p variables, we could examine the scatterplots of each pairwise combination of variables, but the sheer number of scatterplots can become large very quickly.

For p predictors, there are p(p-1)/2 scatterplots.

So, for a dataset with p = 15 predictors, there would be 105 different scatterplots!

Fortunately, PCA offers a way to find a low-dimensional representation of a dataset that captures as much of the variation in the data as possible.

If we’re able to capture most of the variation in just two dimensions, we could project all of the observations in the original dataset onto a simple scatterplot.

53
Q

The way we find the principal components is as follows:

Given a dataset with p predictors: X1, X2, …, Xp, calculate Z1, …, ZM to be the M linear combinations of the original p predictors where:

A

Zm = Σj ΦjmXj for some constants Φ1m, Φ2m, …, Φpm, with m = 1, …, M.
Z1 is the linear combination of the predictors that captures the most variance possible.
Z2 is the next linear combination of the predictors that captures the most variance while being orthogonal (i.e. uncorrelated) to Z1.
Z3 is then the next linear combination of the predictors that captures the most variance while being orthogonal to Z2.
And so on.

54
Q

In practice, we use the following steps to calculate the linear combinations of the original predictors:

A
  1. Scale each of the variables to have a mean of 0 and a standard deviation of 1.
  2. Calculate the covariance matrix for the scaled variables.
  3. Calculate the eigenvalues and eigenvectors of the covariance matrix.
55
Q

Using linear algebra, it can be shown that the eigenvector that corresponds to the largest eigenvalue is the first principal component. In other words, this particular combination of the predictors explains the most variance in the data.

A

The eigenvector corresponding to the second largest eigenvalue is the second principal component, and so on.

56
Q

Calculate the Principal Components
After loading the data, we can use the R built-in function prcomp() to calculate the principal components of the dataset.

Be sure to specify scale = TRUE so that each of the variables in the dataset are scaled to have a mean of 0 and a standard deviation of 1 before calculating the principal components.

Also note that eigenvectors in R point in the negative direction by default, so we’ll multiply by -1 to reverse the signs.

A

# calculate principal components

results <- prcomp(USArrests, scale = TRUE)

results$rotation <- -1*results$rotation

results$rotation

               PC1        PC2        PC3         PC4
Murder   0.5358995 -0.4181809  0.3412327 -0.64922780
Assault  0.5831836 -0.1879856  0.2681484  0.74340748
UrbanPop 0.2781909  0.8728062  0.3780158 -0.13387773
Rape     0.5434321  0.1673186 -0.8177779 -0.08902432

We can see that the first principal component (PC1) has high values for Murder, Assault, and Rape which indicates that this principal component describes the most variation in these variables.

We can also see that the second principal component (PC2) has a high value for UrbanPop, which indicates that this principal component places most of its emphasis on urban population.

Note that the principal components scores for each state are stored in results$x. We will also multiply these scores by -1 to reverse the signs:

results$x <- -1*results$x

head(results$x)

                  PC1        PC2         PC3          PC4
Alabama     0.9756604 -1.1220012  0.43980366 -0.154696581
Alaska      1.9305379 -1.0624269 -2.01950027  0.434175454
Arizona     1.7454429  0.7384595 -0.05423025  0.826264240
Arkansas   -0.1399989 -1.1085423 -0.11342217  0.180973554
California  2.4986128  1.5274267 -0.59254100  0.338559240
Colorado    1.4993407  0.9776297 -1.08400162 -0.001450164
57
Q

Visualize the Results with a Biplot
Next, we can create a biplot – a plot that projects each of the observations in the dataset onto a scatterplot that uses the first and second principal components as the axes:

Note that scale = 0 ensures that the arrows in the plot are scaled to represent the loadings.

biplot(results, scale = 0)

A

From the plot we can see each of the 50 states represented in a simple two-dimensional space.

The states that are close to each other on the plot have similar data patterns in regard to the variables in the original dataset.

We can also see that certain states are more highly associated with certain crimes than others. For example, Georgia is the state closest to the variable Murder in the plot.

If we take a look at the states with the highest murder rates in the original dataset, we can see that Georgia is actually at the top of the list:

# display states with highest murder rates in original dataset
head(USArrests[order(-USArrests$Murder),])

               Murder Assault UrbanPop Rape
Georgia          17.4     211       60 25.8
Mississippi      16.1     259       44 17.1
Florida          15.4     335       80 31.9
Louisiana        15.4     249       66 22.2
South Carolina   14.4     279       48 22.5
Alabama          13.2     236       58 21.2
58
Q

Find Variance Explained by Each Principal Component
We can use the following code to calculate the total variance in the original dataset explained by each principal component:

# calculate total variance explained by each principal component
results$sdev^2 / sum(results$sdev^2)

[1] 0.62006039 0.24744129 0.08914080 0.04335752
From the results we can observe the following:

A

The first principal component explains 62% of the total variance in the dataset.
The second principal component explains 24.7% of the total variance in the dataset.
The third principal component explains 8.9% of the total variance in the dataset.
The fourth principal component explains 4.3% of the total variance in the dataset.
Thus, the first two principal components explain a majority of the total variance in the data.

This is a good sign because the previous biplot projected each of the observations from the original data onto a scatterplot that only took into account the first two principal components.

Thus, it’s valid to look at patterns in the biplot to identify states that are similar to each other.

We can also create a scree plot – a plot that displays the total variance explained by each principal component – to visualize the results of PCA:

var_explained = results$sdev^2 / sum(results$sdev^2)

qplot(c(1:4), var_explained) +
  geom_line() +
  xlab("Principal Component") +
  ylab("Variance Explained") +
  ggtitle("Scree Plot") +
  ylim(0, 1)

59
Q

Principal Components Analysis in Practice
In practice, PCA is used most often for two reasons:

A
  1. Exploratory Data Analysis – We use PCA when we’re first exploring a dataset and we want to understand which observations in the data are most similar to each other.
  2. Principal Components Regression – We can also use PCA to calculate principal components that can then be used in principal components regression. This type of regression is often used when multicollinearity exists between predictors in a dataset.
60
Q

NumPy

A

an open-source Python library for performing array computing (matrix operations). It is a wrapper around the library implemented in C and used for performing several trigonometric, algebraic, and statistical operations. NumPy objects can be easily converted to other types of objects like the Pandas data frame and the tensorflow tensor. Python list can be used for array computing, but it is much slower than NumPy. NumPy achieves its fast implementation using vectorization. One of the important features of NumPy arrays is that a developer can perform the same mathematical operation on every element with a single command.

61
Q

Python: broadcasting with a matrix and a constant

A

# Defining both the arrays
import numpy as np

a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])

# The constant 1 is broadcast and subtracted from every element
sub_ans = a - b - 1
print(sub_ans)

sub_ans = np.subtract(a, b) - 1
print(sub_ans)

Output

[ 2 66 2 69]
[ 2 66 2 69]

62
Q

Python: mod() and power() function

A

# Performing mod (remainder) on two matrices
mod_ans = np.mod(a, b)
print(mod_ans)

rem_ans=np.remainder(a,b)
print(rem_ans)

pow_ans = np.power(a, b)
print(pow_ans)
Output

[ 1 2 3 10]
[ 1 2 3 10]
[ 25 1934917632 137858491849
1152921504606846976]

63
Q

Python: aggregation and statistical functions

A

# Getting mean of all numbers in 'a'
mean_a = np.mean(a)
print(mean_a)

# Getting average of all numbers in 'b'
mean_b = np.average(b)
print(mean_b)

sum_a = np.sum(a)
print(sum_a)

var_b = np.var(b)
print(var_b)

Output

47.5
11.75
190
119.1875

64
Q

In Python, a matrix can be implemented as a 2D list or a 2D array. Forming a matrix from the latter gives additional functionality for performing various matrix operations. These operations and the array type are defined in the module "numpy".

Operations on a matrix:

A


  1. add() :- This function is used to perform element wise matrix addition.
  2. subtract() :- This function is used to perform element wise matrix subtraction.
  3. divide() :- This function is used to perform element wise matrix division.

# Python code to demonstrate matrix operations:
# add(), subtract() and divide()

# importing numpy for matrix operations
import numpy

x = numpy.array([[1, 2], [4, 5]])
y = numpy.array([[7, 8], [9, 10]])

print("The element wise addition of matrix is : ")
print(numpy.add(x, y))

print("The element wise subtraction of matrix is : ")
print(numpy.subtract(x, y))

print("The element wise division of matrix is : ")
print(numpy.divide(x, y))
Output :

The element wise addition of matrix is :
[[ 8 10]
[13 15]]
The element wise subtraction of matrix is :
[[-6 -6]
[-5 -5]]
The element wise division of matrix is :
[[ 0.14285714 0.25 ]
[ 0.44444444 0.5 ]]

65
Q

Python functions:
multiply() :- This function is used to perform element wise matrix multiplication.
dot() :- This function is used to compute the matrix multiplication, rather than element wise multiplication.

A

# Python code to demonstrate matrix operations:
# multiply() and dot()

# importing numpy for matrix operations
import numpy

x = numpy.array([[1, 2], [4, 5]])
y = numpy.array([[7, 8], [9, 10]])

print("The element wise multiplication of matrix is : ")
print(numpy.multiply(x, y))

print("The product of matrices is : ")
print(numpy.dot(x, y))
Output :

The element wise multiplication of matrix is :
[[ 7 16]
[36 50]]
The product of matrices is :
[[25 28]
[73 82]]

66
Q

Python functions:
sqrt() :- This function is used to compute the square root of each element of matrix.

sum(x,axis) :- This function is used to add all the elements in matrix. Optional “axis” argument computes the column sum if axis is 0 and row sum if axis is 1.

“T” :- This attribute is used to transpose the specified matrix.

A

# Python code to demonstrate matrix operations:
# sqrt(), sum() and "T"

# importing numpy for matrix operations
import numpy

x = numpy.array([[1, 2], [4, 5]])
y = numpy.array([[7, 8], [9, 10]])

print("The element wise square root is : ")
print(numpy.sqrt(x))

print("The summation of all matrix element is : ")
print(numpy.sum(y))

print("The column wise summation of all matrix is : ")
print(numpy.sum(y, axis=0))

print("The row wise summation of all matrix is : ")
print(numpy.sum(y, axis=1))

print("The transpose of given matrix is : ")
print(x.T)
Output :

The element wise square root is :
[[ 1. 1.41421356]
[ 2. 2.23606798]]
The summation of all matrix element is :
34
The column wise summation of all matrix is :
[16 18]
The row wise summation of all matrix is :
[15 19]
The transpose of given matrix is :
[[1 4]
[2 5]]

67
Q

Python nested loops:
Approach:

Define matrices A and B.
Get the number of rows and columns of the matrices using the len() function.
Initialize matrices C, D, and E with zeros using nested loops or list comprehension.
Use nested loops or list comprehension to perform the element-wise addition, subtraction, and division of matrices.
Print the resulting matrices C, D, and E.

A

Element-wise addition, subtraction, and division with nested loops

A = [[1, 2], [4, 5]]
B = [[7, 8], [9, 10]]
rows = len(A)
cols = len(A[0])

C = [[0 for i in range(cols)] for j in range(rows)]
for i in range(rows):
    for j in range(cols):
        C[i][j] = A[i][j] + B[i][j]
print("Addition of matrices: \n", C)

D = [[0 for i in range(cols)] for j in range(rows)]
for i in range(rows):
    for j in range(cols):
        D[i][j] = A[i][j] - B[i][j]
print("Subtraction of matrices: \n", D)

E = [[0 for i in range(cols)] for j in range(rows)]
for i in range(rows):
    for j in range(cols):
        E[i][j] = A[i][j] / B[i][j]
print("Division of matrices: \n", E)
Output
Addition of matrices:
[[8, 10], [13, 15]]
Subtraction of matrices:
[[-6, -6], [-5, -5]]
Division of matrices:
[[0.14285714285714285, 0.25], [0.4444444444444444, 0.5]]

68
Q

Time complexity: O(n^2), since the nested loops visit each element of the n x n matrices once.

A

Space complexity: O(n^2), since each result matrix stores n x n elements.

69
Q

R matrix

A

Approach
Create the first matrix
Syntax:

matrix_name <- matrix(data, nrow = value, ncol = value)

Parameters:

data: a list/vector of elements used to fill the matrix.
nrow: the number of rows.
ncol: the number of columns.
Create the second matrix
Apply the operation between the two matrices
Display the result

70
Q

R Addition matrix

A

Create a vector of elements, build a 4 x 4 matrix from it, repeat for a second vector, and then add the two matrices.
vector1=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)

matrix1 <- matrix(vector1, nrow = 4, ncol = 4)

print(matrix1)

vector2=c(1,2,3,2,4,5,6,3,4,1,2,7,8,9,4,5)

matrix2 <- matrix(vector2, nrow = 4, ncol = 4)

print(matrix2)

print(matrix1+matrix2)

71
Q

R Subtraction matrix

A

Create a vector of elements, build a 4 x 4 matrix from it, repeat for a second vector, and then subtract the second matrix from the first.
vector1=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)

matrix1 <- matrix(vector1, nrow = 4, ncol = 4)

print(matrix1)

vector2=c(1,2,3,2,4,5,6,3,4,1,2,7,8,9,4,5)

matrix2 <- matrix(vector2, nrow = 4, ncol = 4)

print(matrix2)
print("subtraction result")

print(matrix1-matrix2)

72
Q

R Multiplication matrix

A

Create a vector of elements, build a 4 x 4 matrix from it, repeat for a second vector, and then multiply the two matrices element-wise (in R, * is element-wise multiplication; %*% gives the matrix product).
vector1=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)

matrix1 <- matrix(vector1, nrow = 4, ncol = 4)

print(matrix1)

vector2=c(1,2,3,2,4,5,6,3,4,1,2,7,8,9,4,5)

matrix2 <- matrix(vector2, nrow = 4, ncol = 4)

print(matrix2)
print("multiplication result")

print(matrix1*matrix2)

73
Q

R division matrix

A

Create a vector of elements, build a 4 x 4 matrix from it, repeat for a second vector, and then divide the first matrix by the second element-wise.
vector1=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)

matrix1 <- matrix(vector1, nrow = 4, ncol = 4)

print(matrix1)

vector2=c(1,2,3,2,4,5,6,3,4,1,2,7,8,9,4,5)

matrix2 <- matrix(vector2, nrow = 4, ncol = 4)

print(matrix2)
print("division result")

print(matrix1/matrix2)

74
Q

R Modulo Operation
Modulo returns the element-wise remainder of the elements in a matrix; the operator used is %%. The main difference between the division and modulo operators is that division returns the quotient while modulo returns the remainder.

A

Create a vector of elements, build a 4 x 4 matrix from it, repeat for a second vector, and then apply %% to the two matrices.
vector1=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)

matrix1 <- matrix(vector1, nrow = 4, ncol = 4)

print(matrix1)

vector2=c(1,2,3,2,4,5,6,3,4,1,2,7,8,9,4,5)

matrix2 <- matrix(vector2, nrow = 4, ncol = 4)

print(matrix2)
print("modulo result")

print(matrix1%%matrix2)

75
Q

R Transpose matrix

A

To find the transpose of a matrix in R you just need to use the t function as follows:

t(A)

     [,1] [,2]
[1,]   10    5
[2,]    8   12
76
Q

R Element-wise multiplication

A

The element-wise multiplication of two matrices of the same dimensions can also be computed with the * operator. The output will be a matrix of the same dimensions as the original matrices.

A * B

     [,1] [,2]
[1,]   50   24
[2,]   75   72
77
Q

R matrix cross product

A

If you need to calculate the matrix product of a matrix with the transpose of another, you can type t(A) %*% B or A %*% t(B), where A and B are the names of the matrices. However, in R it is more efficient and faster to use the crossprod and tcrossprod functions, respectively.

crossprod(A, B)
Equivalent to t(A) %*% B
     [,1] [,2]
[1,]  125   60
[2,]  220   96

78
Q

R Exterior product

A

Similarly to matrix multiplication, in R you can compute the outer (exterior) product of two matrices with the %o% operator. This operator is shorthand for the base outer function.

A %o% B

Equivalent to:
outer(A, B, FUN = "*")

, , 1, 1

     [,1] [,2]
[1,]   50   40
[2,]   25   60

, , 2, 1

     [,1] [,2]
[1,]  150  120
[2,]   75  180

, , 1, 2

     [,1] [,2]
[1,]   30   24
[2,]   15   36

, , 2, 2

     [,1] [,2]
[1,]   60   48
[2,]   30   72
79
Q

R Kronecker product

A

The Kronecker product of two matrices A and B, denoted by A ⊗ B, is the last type of matrix product we are going to review. In R, the calculation can be achieved with the %x% operator.

A %x% B
Kronecker product of A and B
     [,1] [,2] [,3] [,4]
[1,]   50   30   40   24
[2,]  150   60  120   48
[3,]   25   15   60   36
[4,]   75   30  180   72

80
Q

Power of a matrix in R

A

There is no built-in function in base R to calculate the power of a matrix, so we will provide two different alternatives.

On the one hand, you can make use of the %^% operator of the expm package as follows:

install.packages("expm")
library(expm)

A %^% 2
Power of A
     [,1] [,2]
[1,]  140  176
[2,]  110  184

On the other hand, the matrixcalc package provides the matrix.power function:

install.packages("matrixcalc")
library(matrixcalc)

matrix.power(A, 2)
Power of A
     [,1] [,2]
[1,]  140  176
[2,]  110  184

You can check that the power is correct with the following code:

A %*% A

Note that if you want to calculate the element-wise power you just need to use the ^ operator. In this case the matrix doesn’t need to be square.

A ^ 2
Element-wise power of A
     [,1] [,2]
[1,]  100   64
[2,]   25  144

81
Q

Determinant of a matrix in R

A

The determinant of a matrix A, generally denoted by ∣A∣, is a scalar value that encodes some properties of the matrix. In R you can make use of the det function to calculate it.

det(A) # 80
det(B) # -15

82
Q

Inverse of a matrix in R

A

In order to calculate the inverse of a matrix in R you can make use of the solve function.

M <- solve(A)
M
Inverse of A
        [,1]   [,2]
[1,]  0.1500 -0.100
[2,] -0.0625  0.125
As a matrix multiplied by its inverse is the identity matrix, we can verify that the previous output is correct as follows:

A %*% M
Check
     [,1] [,2]
[1,]    1    0
[2,]    0    1

Moreover, as the main use of the solve function is to solve a system of equations, if you want to calculate the solution to A %*% X = B you can type:

solve(A, B)
Output
        [,1]    [,2]
[1,] -0.7500 -0.1500
[2,]  1.5625  0.5625

83
Q

Rank of a matrix in R

A

The rank of a matrix is the maximum number of columns (rows) that are linearly independent. In R there is no base function to calculate the rank of a matrix, but we can make use of the qr function, which, in addition to calculating the QR decomposition, returns the rank of the input matrix. An alternative is to use the rankMatrix function from the Matrix package.

qr(A)$rank # 2
qr(B)$rank # 2

Equivalent to:
library(Matrix)
rankMatrix(A)[1] # 2

84
Q

Matrix diagonal in R

A

The diag function allows you to extract or replace the diagonal of a matrix:

Extract the diagonal
diag(A) # 10 12
diag(B) # 5 6

Replace the diagonal
# diag(A) <- c(0, 2)
Applying the rev function to the columns of the matrix, you can also extract the elements of the secondary (anti-)diagonal of a matrix in R:

Extract the secondary diagonals
diag(apply(A, 2, rev)) # 5 8
diag(apply(B, 2, rev)) # 15 3

85
Q

Diagonal matrix

A

With the diag function you can also make a diagonal matrix, passing a vector as input to the function.

diag(c(7, 9, 2))
Output
     [,1] [,2] [,3]
[1,]    7    0    0
[2,]    0    9    0
[3,]    0    0    2

86
Q

Identity matrix in R

A

In addition to the previous functionalities, the diag function also allows creating identity matrices, specifying the dimension of the desired matrix.

diag(4)
Output
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1

87
Q

Eigenvalues and eigenvectors in R

A

Both the eigenvalues and eigenvectors of a matrix can be calculated in R with the eigen function.

On the one hand, the eigenvalues are stored in the values element of the returned list and are shown in decreasing order:

eigen(A)$values # 17.403124  4.596876
eigen(B)$values # 12.226812 -1.226812

On the other hand, the eigenvectors are stored in the vectors element:

eigen(A)$vectors
Eigenvectors of A
           [,1]       [,2]
[1,] -0.7339565 -0.8286986
[2,] -0.6791964  0.5596952

eigen(B)$vectors
Eigenvectors of B
           [,1]       [,2]
[1,] -0.3833985 -0.4340394
[2,] -0.9235830  0.9008939

88
Q

Singular, QR and Cholesky decomposition in R

A

In this final section we are going to discuss how to perform some decompositions related to matrices.

First, the Singular Value Decomposition (SVD) can be calculated with the svd function.

svd(A)
Singular value decomposition of A
$d
[1] 17.678275  4.525328

$u
           [,1]       [,2]
[1,] -0.7010275 -0.7131342
[2,] -0.7131342  0.7010275

$v
           [,1]       [,2]
[1,] -0.5982454 -0.8013130
[2,] -0.8013130  0.5982454
The function will return a list, where the element d is a vector containing the singular values sorted in decreasing order and u and v are matrices containing the left and right singular vectors of the original matrix, respectively.

Second, the qr function allows you to calculate the QR decomposition. The first element of the output ($qr) is a matrix of the same dimensions as the original, whose upper triangle contains the R of the decomposition and whose lower triangle stores a compact representation of Q.

qr(A)$qr

QR decomposition of A
            [,1]       [,2]
[1,] -11.1803399 -12.521981
[2,]   0.4472136   7.155418
Last, you can compute the Cholesky factorization of a real symmetric positive-definite square matrix with the chol function.

chol(A)
Cholesky decomposition of A
         [,1]     [,2]
[1,] 3.162278 2.529822
[2,] 0.000000 2.366432

89
Q

R function:
The chol function doesn’t check for symmetry.

A

However, you can make use of the isSymmetric function to check whether the matrix is symmetric before calling chol.

90
Q

Model features are the inputs that machine learning (ML) models use during training and inference to make predictions. ML model accuracy relies on a precise set and composition of features. For example, in an ML application that recommends a music playlist, features could include song ratings, which songs were listened to previously, and song listening time. It can take significant engineering effort to create features.

A

Feature engineering involves the extraction and transformation of variables from raw data, such as price lists, product descriptions, and sales volumes, so that you can use those features for training and prediction. The steps required to engineer features include data extraction and cleansing, followed by feature creation and storage.
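A minimal, hypothetical sketch of the feature-creation step using pandas (the column names and transformations are invented for illustration, not taken from the card):

import pandas as pd

# Hypothetical raw sales data
raw = pd.DataFrame({
    "price": [10.0, 25.0, 40.0],
    "units_sold": [100, 40, 10],
    "order_date": pd.to_datetime(["2023-01-05", "2023-02-14", "2023-03-01"]),
})

# Feature creation: derived numeric and date-part features
features = pd.DataFrame({
    "revenue": raw["price"] * raw["units_sold"],
    "order_month": raw["order_date"].dt.month,
    "is_high_price": (raw["price"] > 20).astype(int),
})
print(features)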

91
Q

Feature engineering is challenging because it involves a combination of data analysis, business domain knowledge, and some intuition.

A

When creating features, it’s tempting to go immediately to available data, but often you should start by considering which data is required by speaking with experts, brainstorming, and doing third-party research. Without going through this exercise, you could miss important predictor variables.

92
Q

Collecting data is the process of assembling all the data you need for ML. Data collection can be tedious because data resides in many data sources, including on laptops, in data warehouses, in the cloud, inside applications, and on devices. Finding ways to connect to different data sources can be challenging.

A

Data volumes are also increasing exponentially, so there is a lot of data to search through. Additionally, data has vastly different formats and types depending on the source. For example, video data and tabular data are not easy to use together.

93
Q

Data labeling is the process of identifying raw data (images, text files, videos, and so on) and adding one or more meaningful and informative labels to provide context so an ML model can learn from it. For example, labels might indicate if a photo contains a bird or car, which words were mentioned in an audio recording, or if an X-ray discovered an irregularity.

A

Data labeling is required for various use cases, including computer vision, natural language processing, and speech recognition.

94
Q

After data is cleaned and labeled, ML teams often explore the data to make sure it is correct and ready for ML. Visualizations like histograms, scatter plots, box and whisker plots, line plots, and bar charts are all useful tools to confirm data is correct. Additionally, visualizations also help data science teams complete exploratory data analysis.

A

This process uses visualizations to discover patterns, spot anomalies, test a hypothesis, or check assumptions. Exploratory data analysis does not require formal modeling; instead, data science teams can use visualizations to decipher the data.
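A short, hypothetical sketch of such exploratory checks with pandas and matplotlib (the dataset and column names are invented):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [23, 35, 31, 52, 46, 29],
                   "income": [32000, 58000, 51000, 90000, 72000, 40000]})

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(df["age"])                    # histogram to inspect the distribution
axes[0].set_title("age")
axes[1].scatter(df["age"], df["income"])   # scatter plot to spot patterns or anomalies
axes[1].set_title("age vs income")
plt.tight_layout()
plt.show()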

95
Q

Feature engineering is one of the most important and time-consuming steps of the machine learning process. Data scientists and analysts often find themselves spending a lot of time experimenting with different combinations of features to improve their models and to generate BI reports that drive business insights. The larger, more complex datasets with which data scientists find themselves wrangling exacerbate ongoing challenges, such as how to:

A

Define features in a simple and consistent way

Find and reuse existing features

Build upon existing features

Maintain and track versions of features and models

Manage the lifecycle of feature definitions

Maintain efficiency across feature calculations and storage

Calculate and persist wide tables (>1000 columns) efficiently

Recreate features that created a model that resulted in a decision that must be later defended (i.e. audit / interpretability)

96
Q

Feature explosion occurs when the number of identified features grows inappropriately. Common causes include:

A

Feature templates - implementing feature templates instead of coding new features

Feature combinations - combinations that cannot be represented by a linear system

Feature explosion can be limited via techniques such as: regularization, kernel methods, and feature selection
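As one concrete illustration of limiting feature explosion with regularization plus feature selection, here is a hedged scikit-learn sketch on synthetic data (not part of the original card):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: many candidate features, only a few of them informative
X, y = make_regression(n_samples=200, n_features=100, n_informative=5, random_state=0)

# L1 regularization drives uninformative coefficients to zero;
# SelectFromModel then keeps only the surviving features
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)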

97
Q

Automation of feature engineering is a research topic that dates back to the 1990s.[11] Machine learning software that incorporates automated feature engineering has been commercially available since 2016.[12] Related academic literature can be roughly separated into two types:

A

Multi-relational decision tree learning (MRDTL) uses a supervised algorithm that is similar to a decision tree.

Deep Feature Synthesis uses simpler methods

98
Q

Multi-relational decision tree learning (MRDTL)

A

MRDTL generates features in the form of SQL queries by successively adding clauses to the queries. For instance, the algorithm might start out with

SELECT COUNT(*) FROM ATOM t1 LEFT JOIN MOLECULE t2 ON t1.mol_id = t2.mol_id GROUP BY t1.mol_id
The query can then successively be refined by adding conditions, such as "WHERE t1.charge <= -0.392".

However, most MRDTL studies base implementations on relational databases, which results in many redundant operations. These redundancies can be reduced by using techniques such as tuple id propagation.[13][14] Efficiency can be increased by using incremental updates, which eliminates redundancies.

99
Q

Open-source implementations

A

There are a number of open-source libraries and tools that automate feature engineering on relational data and time series:

featuretools is a Python library for transforming time series and relational data into feature matrices for machine learning.[16][17][18]

OneBM (One-Button Machine) combines feature transformations on relational data with feature selection techniques.

[OneBM] helps data scientists reduce data exploration time allowing them to try and error many ideas in short time. On the other hand, it enables non-experts, who are not familiar with data science, to quickly extract value from their data with a little effort, time, and cost.[20]

getML community is an open source tool for automated feature engineering on time series and relational data.[21][22] It is implemented in C/C++ with a Python interface.[23] It has been shown to be at least 60 times faster than tsflex, tsfresh, tsfel, featuretools or kats.

tsfresh is a Python library for feature extraction on time series data.[25] It evaluates the quality of the features using hypothesis testing.

tsflex is an open source Python library for extracting features from time series data.[27] Despite being 100% written in Python, it has been shown to be faster and more memory efficient than tsfresh, seglearn or tsfel.

seglearn is an extension for multivariate, sequential time series data to the scikit-learn Python library.

tsfel is a Python package for feature extraction on time series data.

kats is a Python toolkit for analyzing time series data.

100
Q

Feature stores

A

The Feature Store is where features are stored and organized for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model). It is a central location where you can create or update groups of features built from multiple data sources, and where you can create or update datasets from those feature groups for training models or for applications that do not want to compute the features themselves but simply retrieve them when needed to make predictions.[34]

A feature store includes the ability to store code used to generate features, apply the code to raw data, and serve those features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used.[35]

Feature stores can be standalone software tools or built into machine learning platforms.
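To make the idea concrete, here is a deliberately simplified, hypothetical Python sketch of a feature-store-like interface with versioning; real feature stores (standalone or built into a platform) expose far richer APIs than this toy class:

from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

@dataclass
class ToyFeatureStore:
    # Hypothetical illustration: stores feature-generation code and serves results by version
    registry: Dict[Tuple[str, int], Callable] = field(default_factory=dict)

    def register(self, name: str, version: int, fn: Callable) -> None:
        # Store the code used to generate the feature, keyed by (name, version)
        self.registry[(name, version)] = fn

    def serve(self, name: str, version: int, raw_record: dict) -> float:
        # Apply the registered code to raw data on request
        return self.registry[(name, version)](raw_record)

store = ToyFeatureStore()
store.register("listening_minutes_per_play", 1,
               lambda r: r["minutes"] / max(r["plays"], 1))
print(store.serve("listening_minutes_per_play", 1, {"minutes": 120, "plays": 40}))   # 3.0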