Data Science Interview Questions Flashcards
What does PCA stand for?
What is the goal of PCA?
How does it achieve the goal?
Limitations?
PCA: Principal Component Analysis
Goal: Reduce the dimension of the dataset, because modern datasets are large and often contain overlapping information.
How:
1) Standardize the dataset
2) Run the PCA analysis. PCA uses information such as the dimensions of the dataset, the means, and the eigenvectors and eigenvalues. After sorting the eigenvalues, PCA regroups the variables so that the first component captures the maximum amount of variation, the second component captures the second-largest amount of variation, and so on.
Limitations: Unable to deal with categorical data
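A minimal sketch of the two steps with scikit-learn, assuming purely numeric data (the random array and the choice of 2 components are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative data: 100 samples, 5 numeric features
X = np.random.rand(100, 5)

# 1) Standardize the dataset
X_std = StandardScaler().fit_transform(X)

# 2) Run PCA and inspect how much variance each component explains
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)  # first component carries the most variance
```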
What is Factor Analysis?
Just like PCA, Factor Analysis is also a model that allows reducing information in a larger number of variables into a smaller number of variables. In Factor Analysis we call those “latent variables”.
Differences between PCA and Factor Analysis
Mathematical differences: PCA does not estimate specific effects; it simply finds the mathematical definition of the “best” components (the components that maximize variance). Factor Analysis also estimates the components, but we now call them common factors, and in addition it estimates the specific factors.
Application differences:
1) In PCA, there is one fixed outcome that orders the components from the highest explanatory value to the lowest explanatory value. In Factor Analysis, we can apply rotations to our solution, which allows us to find a solution that gives a more coherent business interpretation to each of the factors identified.
2) Factor Analysis is much more flexible for interpretation, which makes it a great tool for exploration and interpretation. PCA, on the other hand, is used in cases where we want to retain the largest amount of variation in the smallest number of variables possible, for example to simplify further analysis such as machine learning.
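A minimal sketch of the Factor Analysis side, assuming a scikit-learn version whose FactorAnalysis accepts the rotation argument (the iris data and varimax rotation are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis

X = StandardScaler().fit_transform(load_iris().data)

# Common factors plus a rotation chosen for interpretability
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
print(fa.components_)       # factor loadings after rotation
print(fa.noise_variance_)   # estimated specific (unique) variance per variable
```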
Data Leakage?
Techniques To Minimize Data Leakage When Building Models?
Data leakage is when information from outside the training dataset is used to create the model.
- When the data you are using to train a machine learning algorithm happens to contain the information you are trying to predict
- Perform data preparation within your cross-validation folds.
- Hold back a validation dataset for a final sanity check of your developed models.
- Use pipelines that transform data within every cross-validation fold (see the sketch below).
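A minimal sketch of the pipeline idea with scikit-learn (the synthetic data, scaler, and classifier are illustrative): the scaler is re-fit inside each cross-validation fold, so statistics from the held-out fold never leak into training.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is fit only on the training portion of every fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```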
What are the 4 main elements of reinforcement learning?
An agent
A policy
A reward signal, and
A value function
On-Policy VS Off-Policy
On-policy methods attempt to evaluate or improve the policy that is used to make decisions. In contrast, off-policy methods evaluate or improve a policy different from that used to generate the data.
On-Policy Reinforcement Learning Example
SARSA (state-action-reward-state-action) is an on-policy reinforcement learning algorithm that estimates the value of the policy being followed. The agent learns a policy and uses that same policy to act: the policy used for updating and the policy used for acting are the same, unlike in Q-learning. This is an example of on-policy learning.
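A minimal tabular SARSA sketch (the state/action counts, hyperparameters, and the single illustrative transition are made up) showing that the epsilon-greedy policy used to pick the next action is the same policy being updated:

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(state):
    # Behaviour policy and target policy are the same (on-policy)
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa_update(s, a, r, s_next):
    a_next = epsilon_greedy(s_next)           # next action chosen by the SAME policy
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])  # move Q(s, a) toward the TD target
    return a_next                             # this action is actually taken next

# Illustrative single update: in state 0, act, pretend we got reward 1 and moved to state 1
a = epsilon_greedy(0)
next_a = sarsa_update(0, a, 1.0, 1)
print(Q[0, a])
```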
How to check for multicollinearity?
Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. Mathematically, the VIF for a given predictor equals 1 / (1 - R^2_i), where R^2_i is the R-squared obtained by regressing that predictor on all the other predictors. This ratio is calculated for each independent variable. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.
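A minimal sketch with statsmodels' variance_inflation_factor (the DataFrame and the deliberately collinear column are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors, with x3 built to be nearly collinear with x1
df = pd.DataFrame({
    "x1": np.random.rand(100),
    "x2": np.random.rand(100),
})
df["x3"] = df["x1"] * 0.9 + np.random.rand(100) * 0.1

X = sm.add_constant(df)  # compute VIF with an intercept column present
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x3 should show high VIF values
```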
Explain Regression
Regression fits a line or curve through the data points on a target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimized.
3 common types of regression: linear, polynomial, logistic
Explain Linear Regression
Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis)
If there is a single input variable (x), such linear regression is called simple linear regression. And if there is more than one input variable, such linear regression is called multiple linear regression.
To calculate the best-fit line, linear regression uses the traditional slope-intercept form: y = mx + c
To figure out the best values for m and c, a cost function is needed. The cost function measures how well a linear regression model is performing, and the regression coefficients or weights are optimized to minimize it.
In Linear Regression, Mean Squared Error (MSE) cost function is used, which is the average of squared error that occurred between the predicted values and actual values.
Gradient descent is a method of updating a0 and a1 to minimize the cost function (MSE). A regression model uses gradient descent to update the coefficients of the line (the intercept a0 and the slope a1): it starts from random coefficient values and then iteratively updates them to reach the minimum of the cost function.
1) Start with random coefficients
2) Calculate the predicted values
3) Calculate the partial derivatives of the cost function w.r.t. a0 and a1, substituting in the predicted values
4) Multiply each derivative by the learning rate and subtract it from the corresponding coefficient
5) Stop after a fixed number of iterations (e.g., 100) or once the error is low (see the sketch below)
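A minimal sketch of these steps for simple linear regression with the MSE cost (the data, learning rate, and iteration count are illustrative):

```python
import numpy as np

# Synthetic data with true slope 3 and intercept 2
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + np.random.randn(50)

a0, a1 = 0.0, 0.0          # 1) starting coefficients (zero-initialized here)
lr, n_iters = 0.02, 2000

for _ in range(n_iters):
    y_pred = a1 * x + a0                    # 2) predicted values
    d_a0 = -2 * np.mean(y - y_pred)         # 3) partial derivative of MSE w.r.t. a0
    d_a1 = -2 * np.mean((y - y_pred) * x)   #    and w.r.t. a1
    a0 -= lr * d_a0                         # 4) step opposite the gradient,
    a1 -= lr * d_a1                         #    scaled by the learning rate

print(a0, a1)  # should approach the true intercept 2 and slope 3
```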
Assumptions of Linear Regression
A1. The linear regression model is “linear in parameters.”
A2. There is a random sampling of observations.
A3. The conditional mean of the error term should be zero.
A4. There is no multi-collinearity (or perfect collinearity).
A5. Spherical errors: There is homoscedasticity and no autocorrelation
A6: Optional Assumption: Error terms should be normally distributed.
What is the R-squared in linear regression?
R-squared measures how much of the variation in the dependent variable is explained by the independent variables. In percentage terms, 0.338 would mean our model explains 33.8% of the variation in our ‘Lottery’ variable.
R^2 = 1 - SSR/SST, where SSR is the residual sum of squares and SST is the total sum of squares.
Adjusted R squared
Linear regression has the property that your model’s R-squared value will never go down as you add variables; it can only stay the same or increase. Therefore, your model could look more accurate with multiple variables even if they are contributing poorly. The adjusted R-squared penalizes the R-squared formula based on the number of variables, so a lower adjusted score may be telling you that some variables are not contributing properly to your model’s R-squared.
Adjusted R^2 = 1 - [(1 - R^2)(n - 1) / (n - k - 1)]
n is the number of points in your data sample.
k is the number of independent regressors.
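A minimal sketch of the formula (the R^2, n, and k values passed in are illustrative):

```python
# Adjusted R-squared from an ordinary R-squared value
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """r2: ordinary R-squared, n: number of samples, k: number of regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(r2=0.338, n=85, k=7))  # adding weak regressors drives this value down
```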
P>|t| in regression
It uses the t-statistic to produce the p-value, a measurement of how likely it is that the coefficient measured by our model arose by chance. The p-value of 0.378 for Wealth says there is a 37.8% chance that the Wealth variable has no effect on the dependent variable, Lottery, and that our results were produced by chance.
Precision
Precision -> P -> TP / Predicted Positives -> TP / (TP + FP)
Recall
Recall -> R -> TP / Real Positives -> TP / (TP + FN)
= Sensitivity
Specificity
The counterpart of recall: the true negative rate (recall computed on the negative class)
SPIN -> TN / Real Negatives -> TN / (TN + FP)
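A minimal sketch computing all three from a confusion matrix with scikit-learn (the labels and predictions are made up):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical binary labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:  ", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:     ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("specificity:", tn / (tn + fp))                    # TN / (TN + FP)
```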
Explain regularisation in regression
Two types:
L1 -> Lasso Regression
L2 -> Ridge Regression
The key difference between these two is the penalty term.
Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function. If lambda is zero, we get back OLS; however, if lambda is very large it will add too much weight and lead to under-fitting. That is why how lambda is chosen is important. This technique works very well to avoid over-fitting. L2 shrinks the parameters, so it is mostly used to prevent multicollinearity; it reduces model complexity through coefficient shrinkage.
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function. Again, if lambda is zero we get back OLS, whereas a very large value will push coefficients to zero and hence under-fit.
The key difference between these techniques is that Lasso shrinks the less important features’ coefficients to zero, thus removing some features altogether. This property is known as feature selection and is absent in ridge regression. Lasso is generally used when we have a large number of features, because it does feature selection automatically.
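A minimal sketch of the two penalties with scikit-learn, where alpha plays the role of lambda (the synthetic data and alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 20 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives some coefficients exactly to zero

print("non-zero ridge coefs:", np.sum(ridge.coef_ != 0))  # usually all 20
print("non-zero lasso coefs:", np.sum(lasso.coef_ != 0))  # usually far fewer
```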
Bias and Variance in regression models
As we add more and more parameters to our model, its complexity increases, which results in increasing variance and decreasing bias, i.e., overfitting. So we need to find the optimum point for our model, where the decrease in bias is balanced by the increase in variance.
To overcome underfitting or high bias, we can basically add new parameters to our model so that the model complexity increases, and thus reducing high bias. Now, how can we overcome Overfitting for a regression model?
Basically there are two methods to overcome overfitting,
- Reduce the model complexity
- Regularization
Elastic Net Regression
Elastic net is basically a combination of both L1 and L2 regularization. Elastic regression generally works well when we have a big dataset.
Let’s say we have a bunch of correlated independent variables in a dataset; elastic net will simply form a group consisting of these correlated variables. Now, if any one of the variables in this group is a strong predictor (meaning it has a strong relationship with the dependent variable), then we include the entire group in the model building, because omitting the other variables (as lasso would) might mean losing some information in terms of interpretability, leading to poor model performance.
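A minimal scikit-learn sketch (the synthetic data, alpha, and l1_ratio are illustrative); l1_ratio blends the L1 and L2 penalties, and alpha sets the overall strength:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic data with some near-duplicate (highly correlated) columns appended
X, y = make_regression(n_samples=500, n_features=30, n_informative=10,
                       noise=5.0, random_state=0)
X = np.hstack([X, X[:, :5] + np.random.randn(500, 5) * 0.01])

model = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10000).fit(X, y)
print(np.sum(model.coef_ != 0), "of", X.shape[1], "coefficients kept")
```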
Logistic Regression
In logistic regression, we generally compute a probability, which lies in the interval 0 to 1 (inclusive of both). This probability can then be used to classify the data.
3 types: binomial, multinomial, ordinal
Equation:
log odds -> log(p / (1 - p)) = a + bx
p = e^(a + bx) / (1 + e^(a + bx))
Loss function:
Log Loss is the negative average of the log of corrected predicted probabilities for each instance.
Log-loss is indicative of how close the prediction probability is to the corresponding actual/true value
Minimised by gradient descent.
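A minimal sketch showing the probability output and the log loss computed both by hand and with scikit-learn (the synthetic data is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Hypothetical binary classification data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X)[:, 1]           # probabilities in [0, 1]

# Log loss: negative average log of the probability assigned to the true class
manual = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(manual, log_loss(y, p))            # the two values should match
```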
How can you avoid overfitting your model?
Overfitting refers to a model that fits the training data too closely, capturing noise in a small amount of data and missing the bigger picture. There are three main methods to avoid overfitting:
Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data
Use cross-validation techniques, such as k folds cross-validation
Use regularization techniques, such as LASSO, that penalize certain model parameters if they’re likely to cause overfitting
Decision Tree Steps
Take the entire data set as input
Calculate entropy of the target variable, as well as the predictor attributes
Calculate your information gain of all attributes (we gain information on sorting different objects from each other)
Choose the attribute with the highest information gain as the root node
Repeat the same procedure on every branch until the decision node of each branch is finalized
What is entropy
Entropy is the measurement of disorder or impurities in the information processed in machine learning.
Ranges between 0 and 1 for two classes (can be greater than 1 if there are more than 2 classes)
Higher -> more disorder
Information gain = parent entropy - weighted sum of the child nodes’ entropies
Formula: entropy = -sum over each class of p_i * log2(p_i)
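A minimal sketch of entropy and information gain (the labels and the split are made-up examples):

```python
import numpy as np

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child splits."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical split: a perfectly pure split yields the maximum information gain
parent = np.array([1, 1, 1, 0, 0, 0])
print(information_gain(parent, [parent[:3], parent[3:]]))  # 1.0 for this perfect split
```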
ROC AUC
The ROC curve is the graph of the True Positive Rate on the y-axis against the False Positive Rate on the x-axis, with one point per threshold level.
It tells how well the model is capable of separating the classes.
The area under the ROC curve ranges between 0 and 1. A completely random model, represented by the diagonal straight line, has an AUC of 0.5.
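A minimal scikit-learn sketch (the synthetic data and the logistic-regression scorer are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical binary classifier scored by predicted probabilities
X, y = make_classification(n_samples=500, random_state=0)
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y, scores))      # 0.5 ~ random, 1.0 ~ perfect separation
```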
SVM
An SVM finds the best line in two dimensions, or the best hyperplane in more than two dimensions, to separate the space into classes. The hyperplane (line) is found through the maximum margin, i.e., the maximum distance between the data points of both classes.
The vector points closest to the hyperplane are known as the support vector points, because only these points contribute to the result of the algorithm; the other points do not.
In order to find the maximal margin, we need to maximize the margin between the data points and the hyperplane.
The hyperplane equation is w^T x + b = 0; the margin is calculated by projecting the data points of each class onto the unit weight vector w / ||w||.
In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is the hinge loss (“hinge” describes the fact that the error is 0 if a data point is classified correctly and is not too close to the decision boundary). The first term, the hinge loss, penalizes misclassifications: it measures the error due to misclassification, or due to data points lying closer to the classification boundary than the margin. The second term is the regularization term, a technique to avoid overfitting by penalizing large coefficients in the solution vector. λ (lambda) is the regularization coefficient, and its major role is to determine the trade-off between increasing the margin size and ensuring that each x_i lies on the correct side of the margin.
SGD works by initializing a set of coefficients with random values, calculating the gradient of the loss function through partial derivatives, and updating those coefficients by taking a “step” of a defined size. The algorithm iteratively updates the coefficients such that they are moving opposite the direction of steepest ascent (away from the maximum of the loss function) and toward the minimum, approximating a solution for the optimization problem.
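A minimal sketch of a linear SVM trained by SGD on the regularized hinge loss described above (the two-blob data, learning rate, and lambda are illustrative):

```python
import numpy as np

# Objective: mean(max(0, 1 - y*(w.x + b))) + lam * ||w||^2, with labels y in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
lr, lam = 0.01, 0.01

for epoch in range(200):
    for xi, yi in zip(X, y):
        margin = yi * (xi @ w + b)
        if margin < 1:                        # inside the margin or misclassified
            w -= lr * (2 * lam * w - yi * xi)
            b -= lr * (-yi)
        else:                                 # correctly classified with room to spare
            w -= lr * (2 * lam * w)           # only the regularization term contributes

print(np.mean(np.sign(X @ w + b) == y))      # training accuracy, should be near 1.0
```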
CNN
A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that specializes in processing data that has a grid-like topology, such as an image. A digital image is a binary representation of visual data. It contains a series of pixels arranged in a grid-like fashion that contains pixel values to denote how bright and what color each pixel should be.
A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected layer.
The convolution layer performs a dot product between two matrices, where one matrix is the set of learnable parameters otherwise known as a kernel, and the other matrix is the restricted portion of the receptive field.
During the forward pass, the kernel slides across the height and width of the image, producing a representation of each receptive region.
The pooling layer replaces the output of the network at certain locations by deriving a summary statistic of the nearby outputs. This helps reduce the spatial size of the representation, which decreases the required amount of computation and the number of weights. Max pooling is the most common choice.
The fully connected layer helps map the representation between the input and the output.
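A minimal PyTorch sketch of the three layer types (the channel counts, kernel size, and 28x28 grayscale input are illustrative):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # convolutional layer
        self.pool = nn.MaxPool2d(kernel_size=2)                # pooling layer
        self.fc = nn.Linear(8 * 14 * 14, n_classes)            # fully connected layer

    def forward(self, x):
        x = torch.relu(self.conv(x))   # kernel slides over the image producing feature maps
        x = self.pool(x)               # summary statistic (max) over local neighbourhoods
        return self.fc(x.flatten(1))   # map the representation to class scores

print(TinyCNN()(torch.randn(4, 1, 28, 28)).shape)  # torch.Size([4, 10])
```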
Why LSTM
LSTM stands for Long Short-Term Memory
LSTM is a type of recurrent neural network but is better than traditional recurrent neural networks in terms of memory.
Traditional recurrent neural networks suffer from short-term memory; LSTMs improve performance by memorizing the relevant information that is important and finding the patterns.
What are the feature selection methods used to select the right variables?
Linear discriminant analysis
ANOVA
Chi-Square
Wrapper Methods
Accuracy
Accuracy = (True Positive + True Negative) / Total Observations
What are eigenvalue and eigenvector?
Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; eigenvalues are the factors by which the transformation scales the data along those directions.
Eigenvectors are key to understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix.
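A minimal NumPy sketch of an eigen-decomposition of a covariance matrix (the random data is illustrative):

```python
import numpy as np

# Covariance matrix of some illustrative data
X = np.random.rand(100, 3)
cov = np.cov(X, rowvar=False)                     # 3x3 symmetric matrix

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: for symmetric matrices
print(eigenvalues)          # variance captured along each eigenvector direction
print(eigenvectors[:, -1])  # direction of maximum variance (PCA's first component)
```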