Section 7 - Feature selection, dimension reduction, statistical methods, PCA, & operations Flashcards
Training time
increases with the number of features; for many algorithms the growth is worse than linear.
Models have increasing risk of overfitting with increasing number of ___________
features
Filter methods
consider the relationship between each feature and the target variable to compute the importance of features.
F Test
a statistical test used to compare two models and check whether the difference between them is significant.
The F-Test performs hypothesis testing on two models, X and Y, where X is a model built with just a constant and Y is a model built with a constant and a feature.
The least-squares errors of the two models are compared to check whether the difference in errors between X and Y is significant or merely introduced by chance.
F-Test is useful in feature selection as we get to know the significance of each feature in improving the model.
Scikit-learn provides SelectKBest for selecting the K best features using the F-Test. For regression tasks:
sklearn.feature_selection.f_regression
For classification tasks:
sklearn.feature_selection.f_classif
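As a minimal sketch (using a made-up classification dataset), these score functions plug into SelectKBest like this:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
selector = SelectKBest(score_func=f_classif, k=5)   # keep the 5 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)    # (200, 5)
print(selector.scores_)    # F-statistic of each original feature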
There are some drawbacks to using the F-Test to select your features. The F-Test only captures linear relationships between features and labels: a highly correlated feature is given a higher score and less correlated features are given lower scores.
- Correlation is highly deceptive as it doesn’t capture strong non-linear relationships.
- Using summary statistics like correlation may be a bad idea, as illustrated by Anscombe’s quartet.
Mutual information
Mutual information between two variables measures the dependence of one variable on another. If X and Y are two variables:
If X and Y are independent, then no information about Y can be obtained by knowing X or vice versa. Hence their mutual information is 0.
If X is a deterministic function of Y, then we can determine X from Y and Y from X, and their (normalized) mutual information is 1.
When Y depends only partly on X, e.g. Y = f(X, Z, M, N), the (normalized) mutual information lies between 0 and 1.
We can select our features from feature space by ranking their mutual information with the target variable.
The advantage of mutual information over the F-Test is that it also captures non-linear relationships between the feature and the target variable.
Sklearn offers feature selection with Mutual Information for regression and classification tasks.
sklearn.feature_selection.mutual_info_regression
sklearn.feature_selection.mutual_info_classif
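As a rough sketch (again on made-up classification data), you can also compute the mutual information scores directly and rank features by them:
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
mi_scores = mutual_info_classif(X, y, random_state=0)
ranking = mi_scores.argsort()[::-1]   # feature indices, highest mutual information first
print(ranking[:5])                    # top-5 candidate features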
Variance threshold
This method removes features with variation below a certain cutoff.
The idea is when a feature doesn’t vary much within itself, it generally has very little predictive power.
sklearn.feature_selection.VarianceThreshold
Variance Threshold doesn’t consider the relationship of features with the target variable.
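A small sketch of VarianceThreshold on a toy matrix; the cutoff of 0.1 is an arbitrary choice, and the nearly constant first column is removed:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 1.0],
              [0.0, 1.0, 4.0],
              [0.1, 3.0, 2.0],
              [0.0, 2.0, 3.0]])
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)
print(selector.variances_)   # per-feature variances
print(X_reduced.shape)       # (4, 2) -- the near-constant first column is dropped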
Wrapper methods
Wrapper methods generate models with subsets of features and gauge their performance.
Forward search
This method searches for the best features with respect to model performance and adds them to your feature subset one after the other, as sketched after the steps below.
For data with n features,
->In the first round, 'n' models are created, each with a single feature, and the best predictive feature is selected.
->In the second round, 'n-1' models are created, each pairing a remaining feature with the previously selected feature.
->This is repeated until a best subset of 'm' features is selected.
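A sketch of forward search using scikit-learn's SequentialFeatureSelector (available from scikit-learn 0.24 onward); the base estimator and subset size here are arbitrary choices:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=4, direction='forward')
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the selected feature subset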
Recursive Feature Elimination
As the name suggests, this method eliminates the worst-performing features of a particular model one after the other until the best subset of features is known (a sketch follows the steps below).
For data with n features,
->In the first round, models are created with combinations of 'n-1' features (all features except one); the least-performing feature is removed.
->In the second round, models with 'n-2' features are created by removing another feature.
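A minimal sketch of recursive feature elimination with scikit-learn's RFE; the logistic regression base model and the target subset size are arbitrary choices:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)   # True for the features kept in the final subset
print(rfe.ranking_)   # rank 1 = selected; higher ranks were eliminated earlier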
Wrapper methods promise a best set of features via an extensive greedy search.
The main drawback of wrapper methods is the sheer number of models that need to be trained; this is computationally very expensive and becomes infeasible with a large number of features.
Embedded Methods
Feature selection can also be achieved through the insights provided by some Machine Learning models.
LASSO linear regression can be used for feature selection. Lasso regression adds an extra (L1) penalty term to the cost function of linear regression; apart from preventing overfitting, this also shrinks the coefficients of less important features to zero.
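A sketch of embedded selection with Lasso on a made-up regression dataset (the alpha value is an arbitrary choice); features whose coefficients are shrunk to zero are discarded via SelectFromModel:
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)   # keeps features with non-zero coefficients
X_selected = selector.transform(X)
print(lasso.coef_)        # several coefficients are driven exactly to zero
print(X_selected.shape)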
Tree based models
calculate feature importance because they need to keep the best-performing features as close to the root of the tree as possible. Constructing a decision tree involves repeatedly choosing the most predictive feature to split on.
Feature importance in tree-based models is calculated based on the Gini index, entropy, or chi-square value.
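As a brief sketch (toy data, arbitrary hyperparameters), the impurity-based importances exposed by a random forest can be read off directly:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for idx, importance in sorted(enumerate(forest.feature_importances_),
                              key=lambda pair: pair[1], reverse=True):
    print(f'feature {idx}: importance {importance:.3f}')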
Feature selection, like most things in data science, is highly context- and data-dependent, and there is no one-stop solution. The best way forward is to understand the mechanism of each method and use it when required.
When you’re getting started with a machine learning (ML) project, one critical principle to keep in mind is that data is everything. It is often said that if ML is the rocket engine, then the fuel is the (high-quality) data fed to ML algorithms. However, deriving truth and insight from a pile of data can be a complicated and error-prone job. To have a solid start for your ML project, it always helps to analyze the data up front, a practice that describes the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis. During that process, it’s important that you get a deep understanding of:
The properties of the data, such as schema and statistical properties;
The quality of the data, like missing values and inconsistent data types;
The predictive power of the data, such as correlation of features against target.
Descriptive analysis
Univariate analysis
Descriptive analysis, or univariate analysis, provides an understanding of the characteristics of each attribute of the dataset. It also offers important evidence for feature preprocessing and selection in a later stage. The suggested analyses differ for common, numerical, categorical, and textual attributes.
Correlation analysis
bivariate analysis
Correlation analysis (or bivariate analysis) examines the relationship between two attributes, say X and Y, and examines whether X and Y are correlated. This analysis can be done from two perspectives to get various possible combinations:
Qualitative analysis. This computes the descriptive statistics of the dependent numerical/categorical attribute for each unique value of the independent categorical attribute. This perspective helps intuitively understand the relationship between X and Y. Visualizations are often used together with qualitative analysis as a more intuitive way of presenting the result.
Quantitative analysis. This is a quantitative test of the relationship between X and Y, based on the hypothesis-testing framework. This perspective provides a formal and mathematical methodology to quantitatively determine the existence and/or strength of the relationship.
Contextual analysis
Descriptive analysis and correlation analysis are both generic enough to be performed on any structured dataset, neither of which requires context information. To further understand or profile the given dataset and to gain more domain-specific insights, you can use one of two common contextual information-based analyses:
Time-based analysis: In many real-world datasets, the timestamp (or a similar time-related attribute) is one of the key pieces of contextual information. Observing and/or understanding the characteristics of the data along the time dimension, with various granularities, is essential to understanding the data generation process and ensuring data quality.
Agent-based analysis: Besides time, the other common contextual attribute is the unique identification (ID, such as user ID) of each record. Analyzing the dataset by aggregating along the agent dimension, e.g., a histogram of the number of records per agent, can further improve your understanding of the dataset.
The ultimate goal of EDA (whether rigorous or through visualization) is to provide insights on the dataset you’re studying. This can inspire your subsequent feature selection, engineering, and model-building process.
Descriptive analysis provides the basic statistics of each attribute of the dataset. Those statistics can help you identify the following issues:
High percentage of missing values
Low variance of numeric attributes
Low entropy of categorical attributes
Imbalance of categorical target (class imbalance)
Skewed distribution of numeric attributes
High cardinality of categorical attributes
The correlation analysis examines the relationship between two attributes. There are two typical action points triggered by the correlation analysis in the context of feature selection or feature engineering:
Low correlation between feature and target
High correlation between features
Once you’ve identified issues, the next task is to make a sound decision on how to properly mitigate these issues. One such example is for “High percentage of missing values.” The identified problem is that the attribute is missing in a significant proportion of the data points. The threshold or definition of “significant” can be set based on domain knowledge. There are two options to handle this, depending on the business scenario:
Assign a unique value to the missing value records, if the missing value, in certain contexts, is actually meaningful. For example, a missing value could indicate that a monitored, underlying process was not functioning properly.
Discard the feature if the values are missing due to misconfiguration, issues with data collection or untraceable random reasons, and the historic data can’t be reconstituted.
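A hypothetical pandas sketch of these two options for a column with many missing values (the column names and the 50% threshold are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'sensor_reading': [1.2, np.nan, 3.4, np.nan, np.nan],
                   'other_feature': [10, 20, 30, 40, 50]})
missing_fraction = df['sensor_reading'].isna().mean()

if missing_fraction > 0.5:   # what counts as "significant" is domain-specific
    # Option 1: missingness is meaningful -> flag it and fill with a sentinel value
    df['sensor_reading_missing'] = df['sensor_reading'].isna().astype(int)
    df['sensor_reading'] = df['sensor_reading'].fillna(-1)
    # Option 2: missingness is noise -> discard the feature entirely
    # df = df.drop(columns=['sensor_reading'])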
Dimensionality Reduction
the process of reducing the number of features in a dataset while retaining as much information as possible. It is often used in the field of data science to improve the performance of machine learning models, reduce the risk of overfitting, and make data easier to visualize.
High-dimensional data can be difficult to visualize, making it harder to understand patterns and relationships in the data.
High-dimensional data can be computationally expensive to process, making machine learning models slower and harder to train.
High-dimensional data can increase the risk of overfitting, which can lead to poor performance on unseen data.
Dimensionality reduction is a powerful technique used in data science to reduce the number of features in a dataset while retaining as much information as possible. It can be used to improve the performance of machine learning models, reduce the risk of overfitting, and make data easier to visualize. Popular techniques for dimensionality reduction include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and autoencoders; PCA is the most widely used method.
Feature selection
approaches try to find a subset of the input variables (also called features or attributes). The three strategies are: the filter strategy (e.g. information gain), the wrapper strategy (e.g. search guided by accuracy), and the embedded strategy (selected features are added or removed while building the model based on prediction errors).
Data analysis such as regression or classification can be done in the reduced space more accurately than in the original space.
Feature projection
transforms the data from the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.[4][5] For multidimensional data, tensor representation can be used in dimensionality reduction through multilinear subspace learning
Principal component analysis (PCA)
The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance (and sometimes the correlation) matrix of the data is constructed and the eigenvectors of this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can now be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system, because they often contribute the vast majority of the system’s energy, especially in low-dimensional systems. Still, this must be proven on a case-by-case basis as not all systems exhibit this behavior. The original space (with dimension of the number of points) has been reduced (with data loss, but hopefully retaining the most important variance) to the space spanned by a few eigenvectors.
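A minimal numpy sketch of the procedure just described (random data, two retained components), not tied to any particular library implementation:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)   # 5 x 5 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]    # largest eigenvalues first
components = eigenvectors[:, order[:2]]  # two leading eigenvectors (principal components)

X_reduced = X_centered @ components      # project onto the 2-D subspace
explained = eigenvalues[order[:2]].sum() / eigenvalues.sum()
print(X_reduced.shape, explained)        # shape of the projection, fraction of variance kept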
Non-negative matrix factorization (NMF)
NMF decomposes a non-negative matrix to the product of two non-negative ones, which has been a promising tool in fields where only non-negative signals exist,[7][8] such as astronomy.[9][10] NMF is well known since the multiplicative update rule by Lee & Seung,[7] which has been continuously developed: the inclusion of uncertainties,[9] the consideration of missing data and parallel computation,[11] sequential construction[11] which leads to the stability and linearity of NMF,[10] as well as other updates including handling missing data in digital image processing.[12]
With a stable component basis during construction, and a linear modeling process, sequential NMF[11] is able to preserve the flux in direct imaging of circumstellar structures in astronomy,[10] as one of the methods of detecting exoplanets, especially for the direct imaging of circumstellar discs. In comparison with PCA, NMF does not remove the mean of the matrices, which leads to unphysical non-negative fluxes; therefore NMF is able to preserve more information than PCA as demonstrated by Ren et al
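A short, hedged sketch of NMF with scikit-learn on a random non-negative matrix (the rank and initialization are arbitrary choices):
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.default_rng(0).normal(size=(50, 10)))   # non-negative data
model = NMF(n_components=3, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(X)            # 50 x 3 factor
H = model.components_                 # 3 x 10 factor
print(np.linalg.norm(X - W @ H))      # reconstruction error of the low-rank product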
Kernel PCA
Principal component analysis can be employed in a nonlinear way by means of the kernel trick. The resulting technique, known as kernel PCA, is capable of constructing nonlinear mappings that maximize the variance in the data.
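A minimal scikit-learn sketch; the RBF kernel and gamma value are arbitrary choices, and the concentric-circles dataset is just a standard nonlinear toy example:
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)   # nonlinear (circular) structure becomes nearly separable
print(X_kpca.shape)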
Graph-based kernel PCA
Other prominent nonlinear techniques include manifold learning techniques such as Isomap, locally linear embedding (LLE),[13] Hessian LLE, Laplacian eigenmaps, and methods based on tangent space analysis.[14] These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for Kernel PCA.
More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space), while maximizing the distances between points that are not nearest neighbors.
An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include: classical multidimensional scaling, which is identical to PCA; Isomap, which uses geodesic distances in the data space; diffusion maps, which use diffusion distances in the data space; t-distributed stochastic neighbor embedding (t-SNE), which minimizes the divergence between distributions over pairs of points; and curvilinear component analysis.
A different approach to nonlinear dimensionality reduction is through the use of autoencoders, a special kind of feedforward neural network with a bottleneck hidden layer.[15] The training of deep encoders is typically performed using a greedy layer-wise pre-training (e.g., using a stack of restricted Boltzmann machines) that is followed by a fine-tuning stage based on backpropagation.
Linear discriminant analysis (LDA)
a generalization of Fisher’s linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
Generalized discriminant analysis (GDA)
GDA deals with nonlinear discriminant analysis using kernel function operator. The underlying theory is close to the support-vector machines (SVM) insofar as the GDA method provides a mapping of the input vectors into high-dimensional feature space.[16][17] Similar to LDA, the objective of GDA is to find a projection for the features into a lower dimensional space by maximizing the ratio of between-class scatter to within-class scatter.
Autoencoder
can be used to learn nonlinear dimension reduction functions and codings together with an inverse function from the coding to the original representation.
t-SNE
T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique useful for visualization of high-dimensional datasets. It is not recommended for use in analysis such as clustering or outlier detection since it does not necessarily preserve densities or distances well
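A small sketch of t-SNE for visualization only, using the digits dataset as an assumed example (perplexity is an arbitrary choice):
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images
X_embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_embedded.shape)               # (1797, 2) -- use for plotting, not for clustering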
UMAP
Uniform manifold approximation and projection (UMAP) is a nonlinear dimensionality reduction technique. Visually, it is similar to t-SNE, but it assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant.
Dimension reduction
For high-dimensional datasets (i.e. with number of dimensions more than 10), dimension reduction is usually performed prior to applying a K-nearest neighbors algorithm (k-NN) in order to avoid the effects of the curse of dimensionality.[19]
Feature extraction and dimension reduction can be combined in one step using principal component analysis (PCA), linear discriminant analysis (LDA), canonical correlation analysis (CCA), or non-negative matrix factorization (NMF) techniques as a pre-processing step followed by clustering by K-NN on feature vectors in reduced-dimension space. In machine learning this process is also called low-dimensional embedding.[20]
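A hedged sketch of that pre-processing pattern: PCA followed by k-NN in the reduced space (the digits dataset and the choice of 20 components are assumptions for illustration):
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
pipeline = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())   # accuracy of k-NN in the 20-dimensional PCA space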
For very-high-dimensional datasets (e.g. when performing similarity search on live video streams, DNA data or high-dimensional time series) running a fast approximate K-NN search using locality-sensitive hashing, random projection,[21] “sketches”,[22] or other high-dimensional similarity search techniques from the VLDB conference toolbox might be the only feasible option.
Why is Dimensionality Reduction Needed?
Often in machine learning, the more features present in the dataset, the better a classifier can learn. However, more features also mean a higher computational cost. Not only can high dimensionality lead to long training times, but more features also often lead to overfitting as the algorithm tries to create a model that explains all the features in the data.
Because dimensionality reduction reduces the overall number of features, it can reduce the computational demands associated with training a model and also helps combat overfitting by keeping the features fed to the model fairly simple.
Dimensionality reduction can be used in both supervised and unsupervised learning contexts. In the case of unsupervised learning, dimensionality reduction is often used to preprocess the data by carrying out feature selection or feature extraction.
The primary algorithms used to carry out dimensionality reduction for unsupervised learning are Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).
In the case of supervised learning, dimensionality reduction can be used to simplify the features fed into the machine learning classifier. The most common methods used to carry out dimensionality reduction for supervised learning problems are Linear Discriminant Analysis (LDA) and PCA, and they can be utilized to predict new cases.
Take note that the use cases described above are general use cases and not the only conditions these techniques are used in. After all, dimensionality reduction techniques are statistical methods and their use is not restricted by machine learning models.
PCA Implementation Example
Let’s take a look at how PCA can be implemented in Scikit-Learn. We’ll be using the Mushroom classification dataset for this.
First, we need to import all the modules we need, which includes PCA, train_test_split, and labeling and scaling tools:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
After we load in the data, we’ll check for any null values. We’ll also encode the data with the LabelEncoder. The class feature is the first column in the dataset, so we split up the features and labels accordingly:
m_data = pd.read_csv('mushrooms.csv')

# Machine learning systems work with integers, we need to encode these
# string characters into ints
encoder = LabelEncoder()
Now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])
X_features = m_data.iloc[:,1:23]
y_label = m_data.iloc[:, 0]
We’ll now scale the features with the standard scaler. This is optional as we aren’t actually running the classifier, but it may impact how our data is analyzed by PCA:
# Scale the features
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)
We'll now use PCA to get the list of features and plot which features have the most explanatory power, or the most variance. These are the principal components. It looks like around 17 or 18 of the features explain the majority, almost 95%, of our data:
# Visualize how much variance each principal component explains
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_ratio_

plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()
Let’s convert the features into the 17 top features. We’ll then plot a scatter plot of the data point classification based on these 17 features:
pca2 = PCA(n_components=17)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data['class'])
plt.show()
Let’s also do this for the top 2 features and see how the classification changes:
pca3 = PCA(n_components=2)
pca3.fit(X_features)
x_3d = pca3.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,1], c=m_data['class'])
plt.show()
Singular Value Decomposition
The purpose of Singular Value Decomposition is to simplify a matrix and make doing calculations with the matrix easier. The matrix is reduced to its constituent parts, similar to the goal of PCA. Understanding the ins and outs of SVD isn’t completely necessary to implement it in your machine learning models, but having an intuition for how it works will give you a better idea of when to use it.
SVD can be carried out on either complex or real-valued matrices, but to make this explanation easier to understand, we’ll go over the method of decomposing a real-valued matrix.
When doing SVD we have a matrix filled in with data and we want to reduce the number of columns the matrix has. This reduces the dimensionality of the matrix while still preserving as much of the variability in the data as possible.
We can express matrix A as the product of three matrices U, D, and the transpose of V:
A = U * D * V^t
Assuming we have some matrix A, we can represent it using three other matrices called U, D, and V. If A is an x-by-y matrix, then U is an x-by-x orthogonal matrix, V is a y-by-y orthogonal matrix, and D is an x-by-y diagonal matrix.
Decomposing a matrix involves converting its singular values into the diagonal entries of D. The orthogonal matrices keep their useful properties under this decomposition, and we can take advantage of this to get an approximation of matrix A: when we multiply U, D, and the transpose of V back together, we get a matrix that is equivalent to the original matrix A.
When we break/decompose matrix A down into U, D, and V, we then have three different matrices that contain the information of Matrix A.
It turns out that the left-most columns of the matrices hold the majority of our data, and we can select just these few columns to have a good approximation of Matrix A. This new matrix is much simpler and easier to work with, as it has far fewer dimensions.
SVD Implementation Example
One of the most common ways that SVD is used is to compress images. After all, the pixel values that make up the red, green, and blue channels in the image can just be reduced and the result will be an image that is less complex but still contains the same image content. Let’s try using SVD to compress an image and render it.
We’ll use several functions to handle the compression of the image. We’ll really only need Numpy and the Image function from the PIL library in order to accomplish this, since Numpy has a method to carry out the SVD calculation:
import numpy
from PIL import Image
First, we’ll just write a function to load in the image and turn it into a Numpy array. We then want to select the red, green, and blue color channels from the image:
def load_image(image):
    image = Image.open(image)
    im_array = numpy.array(image)

    red = im_array[:, :, 0]
    green = im_array[:, :, 1]
    blue = im_array[:, :, 2]

    return red, green, blue
Now that we have the colors, we need to compress the color channels. We can start by calling Numpy’s SVD function on the color channel we want. We’ll then create an array of zeroes that we’ll fill in after the matrix multiplication is completed. We then specify the singular value limit we want to use when doing the calculations:
def channel_compress(color_channel, singular_value_limit):
    u, s, v = numpy.linalg.svd(color_channel)
    compressed = numpy.zeros((color_channel.shape[0], color_channel.shape[1]))
    n = singular_value_limit

    left_matrix = numpy.matmul(u[:, 0:n], numpy.diag(s)[0:n, 0:n])
    inner_compressed = numpy.matmul(left_matrix, v[0:n, :])
    compressed = inner_compressed.astype('uint8')
    return compressed
red, green, blue = load_image("dog3.jpg")
singular_val_lim = 350
After this, we do matrix multiplication on the diagonal and the value limits in the U matrix, as described above. This gets us the left matrix and we then multiply it with the V matrix. This should get us the compressed values which we transform to the ‘uint8’ type:
def compress_image(red, green, blue, singular_val_lim):
    compressed_red = channel_compress(red, singular_val_lim)
    compressed_green = channel_compress(green, singular_val_lim)
    compressed_blue = channel_compress(blue, singular_val_lim)

    im_red = Image.fromarray(compressed_red)
    im_blue = Image.fromarray(compressed_blue)
    im_green = Image.fromarray(compressed_green)

    new_image = Image.merge("RGB", (im_red, im_green, im_blue))
    new_image.show()
    new_image.save("dog3-edited.jpg")
compress_image(red, green, blue, singular_val_lim)
We'll be using an image of a dog (dog3.jpg) to test our SVD compression on.
We also need to set the singular value limit we'll use; let's start with 350 for now:
red, green, blue = load_image("dog3.jpg")
singular_val_lim = 350
Finally, we can get the compressed values for the three color channels and transform them from Numpy arrays into image components using PIL. We then just have to join the three channels together and show the image. This image should be a little smaller and simpler than the original image:
Indeed, if you inspect the size of the images, you’ll notice that the compressed one is smaller, though we’ve also had a bit of lossy compression. You can see some noise in the image as well.
You can play around with adjusting the singular value limit. The lower the chosen limit, the greater the compression will be, but at a certain point image artifacting will show up and the image will degrade in quality.
Linear Discriminant Analysis
operates by projecting data from a multidimensional graph onto a linear graph. The easiest way to conceive of this is with a graph filled up with data points of two different classes. Assuming that there is no line that will neatly separate the data into two classes, the two dimensional graph can be reduced down into a 1D graph. This 1D graph can then be used to hopefully achieve the best possible separation of the data points.
When LDA is carried out there are two primary goals: minimizing the variance within each of the two classes and maximizing the distance between the means of the two classes.
In order to achieve this, a new axis is constructed in the 2D graph. This new axis should separate the two classes of data points based on the previously mentioned criteria. Once the new axis has been created, the data points within the 2D graph are redrawn along it.
LDA carries out three steps to move the original graph to the new axis. First, the separability between the classes has to be calculated; this is based on the distance between the class means, or the between-class variance. Next, the within-class variance must be calculated, which is the distance between the mean and the samples within each class. Finally, the lower-dimensional space that maximizes the between-class variance has to be constructed.
LDA works best when the means of the classes are far from each other. If the means of the distribution are shared it won’t be possible for LDA to separate the classes with a new linear axis.
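A minimal scikit-learn sketch of LDA as a supervised projection, using the Iris dataset as an assumed example (with 3 classes, at most 2 discriminant axes exist):
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)   # project onto the two discriminant axes
print(X_lda.shape)                # (150, 2)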