Data Science Interview Questions Flashcards

1
Q

What does PCA stand for?
What is the goal of PCA?
How does it achieve the goal?
Limitations?

A

PCA: Principal Component Analysis
Goal: Reduce the dimensionality of the dataset, because modern datasets are large and often contain overlapping (correlated) information.
How:
1) Standardize the dataset
2) Run PCA. PCA uses information such as the dimension of the dataset, the means, and the eigenvectors and eigenvalues of the covariance (or correlation) matrix. After sorting the eigenvalues, PCA recombines the variables so that the first component captures the maximum amount of variation, the second component captures the second-largest amount of variation, and so on.
Limitations: Unable to deal with categorical data
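
A minimal sketch of these two steps with scikit-learn (the synthetic data and the choice of two components are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative data: 100 samples, 5 numeric features with overlapping information
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)  # correlated with the first feature

# 1) Standardize, 2) run PCA
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)               # keep the first two components
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)    # share of variance captured by each component
```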

2
Q

What is Factor Analysis?

A

Just like PCA, Factor Analysis is a model that allows reducing the information in a larger number of variables into a smaller number of variables. In Factor Analysis those reduced variables are called “latent variables” (or factors).

3
Q

Differences between PCA and Factor Analysis

A

Mathematical differences: PCA does not estimate specific effects; it simply finds the mathematical definition of the “best” components (the components that maximize variance). Factor Analysis also estimates components, but they are now called common factors, and in addition it estimates the specific factors.
Application differences:
1) In PCA, there is one fixed outcome that orders the components from the highest explanatory value to the lowest. In Factor Analysis, we can apply rotations to the solution, which allows us to find a solution with a more coherent business explanation for each of the identified factors.
2) Factor Analysis is much more flexible for interpretation, which makes it a great tool for exploration and interpretation. PCA, on the other hand, is used when we want to retain the largest amount of variation in the smallest possible number of variables, typically to simplify further analysis such as machine learning.

4
Q

Data Leakage?

Techniques To Minimize Data Leakage When Building Models?

A

Data leakage is when information from outside the training dataset is used to create the model.
- In other words, the data you are using to train a machine learning algorithm happens to contain the information you are trying to predict.

  • Perform data preparation within your cross-validation folds.
  • Hold back a validation dataset for a final sanity check of your developed models.
  • Use pipelines that transform the data within every cross-validation fold (see the sketch below).
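
A minimal sketch of the last two points with scikit-learn (the dataset and estimator are illustrative): because the scaler sits inside the pipeline, it is re-fitted on the training portion of every fold, so the held-out fold cannot leak into preprocessing.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# The pipeline re-fits the scaler on the training part of each fold,
# so the held-out fold never influences the preprocessing step.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```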
5
Q

What are the 4 main elements of reinforcement learning?

A

An agent
A policy
A reward signal, and
A value function

6
Q

On-Policy VS Off-Policy

A

On-policy methods attempt to evaluate or improve the policy that is used to make decisions. In contrast, off-policy methods evaluate or improve a policy different from that used to generate the data.

7
Q

On policy Reinforcement Learning Example

A

SARSA (state-action-reward-state-action) is an on-policy reinforcement learning algorithm that estimates the value of the policy being followed. The agent learns a policy and uses that same policy to act: the policy used for updating and the policy used for acting are the same, unlike in Q-learning. This makes it an example of on-policy learning.
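
A hedged sketch of the tabular SARSA update (the table `Q`, the step size `alpha`, the discount `gamma` and the epsilon-greedy helper are illustrative assumptions, not part of the original card):

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    # Behaviour policy: mostly greedy, occasionally random
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target uses the action actually chosen by the same policy (a_next);
    # Q-learning would instead use the max over actions in s_next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```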

8
Q

How to check for multicollinearity?

A

Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. For each independent variable, VIF_i = 1 / (1 − R_i²), where R_i² is the R-squared obtained by regressing that variable on all the other independent variables. The ratio is calculated for each independent variable. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.
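
A minimal sketch using statsmodels (the illustrative DataFrame `X` of predictors is an assumption):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Illustrative predictor matrix; x2 is nearly collinear with x1
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200)})
X["x2"] = X["x1"] + 0.05 * rng.normal(size=200)
X["x3"] = rng.normal(size=200)

Xc = add_constant(X)  # include an intercept column before computing VIFs
vifs = {col: variance_inflation_factor(Xc.values, i) for i, col in enumerate(Xc.columns)}
print(vifs)  # VIFs well above roughly 5-10 flag problematic collinearity
```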

9
Q

Explain Regression

A

Regression fits a line or curve through the data points on a target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimized.

3 types of regression: linear, polynomial, logistic

10
Q

Explain Linear Regression

A

Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis)

If there is a single input variable (x), such linear regression is called simple linear regression. And if there is more than one input variable, such linear regression is called multiple linear regression.

To calculate the best-fit line, linear regression uses the traditional slope-intercept form: y = mx + c.

To figure out the best values for m and c, a cost function is needed. The cost function measures how well a linear regression model is performing and is what we optimize to find the regression coefficients (weights).

In linear regression, the Mean Squared Error (MSE) cost function is used, which is the average of the squared errors between the predicted values and the actual values.

Gradient descent is a method of updating a0 and a1 (the intercept and the slope) to minimize the cost function (MSE). A regression model using gradient descent starts from randomly chosen coefficient values and then iteratively updates them to reach the minimum of the cost function:

1) Start with random coefficients
2) Calculate the predicted values
3) Calculate the partial derivatives of the cost with respect to a0 and a1, substituting in the predicted values
4) Multiply each derivative by the learning rate and subtract it from the corresponding coefficient
5) Stop after a fixed number of iterations (e.g. 100) or when the error is low (see the sketch below)
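
A minimal sketch of these steps in NumPy (the synthetic data, learning rate and iteration count are illustrative):

```python
import numpy as np

# Illustrative data roughly following y = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2 * x + 1 + rng.normal(scale=0.5, size=200)

a0, a1 = 0.0, 0.0            # 1) start with initial coefficients
lr, n_iter = 0.01, 1000

for _ in range(n_iter):
    y_pred = a0 + a1 * x                     # 2) predicted values
    d_a0 = -2 * np.mean(y - y_pred)          # 3) partial derivative of MSE w.r.t. a0
    d_a1 = -2 * np.mean((y - y_pred) * x)    #    and w.r.t. a1
    a0 -= lr * d_a0                          # 4) step opposite the gradient
    a1 -= lr * d_a1

# 5) stopped after n_iter iterations
print(a0, a1)  # should approach the true intercept 1 and slope 2
```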

11
Q

Assumptions of Linear Regression

A

A1. The linear regression model is “linear in parameters.”

A2. There is a random sampling of observations.

A3. The conditional mean should be zero.

A4. There is no multi-collinearity (or perfect collinearity).

A5. Spherical errors: There is homoscedasticity and no autocorrelation

A6: Optional Assumption: Error terms should be normally distributed.

12
Q

What is the R-squared in linear regression

A

R-squared measures how much of the variation in the dependent variable is explained by the independent variables in the model. In percentage terms, an R-squared of 0.338 would mean our model explains 33.8% of the variation in the ‘Lottery’ variable.

R² = 1 − SS_residual / SS_total

13
Q

Adjusted R squared

A

Linear regression has the property that your model’s R-squared value never goes down when you add variables; it can only stay equal or increase. Therefore, your model could look more accurate with multiple variables even if some of them contribute poorly. The adjusted R-squared penalizes the R-squared formula based on the number of variables, so a lower adjusted score may be telling you that some variables are not contributing properly to your model’s R-squared.

Adjusted R² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]

n is the number of points in your data sample.
k is the number of independent regressors.

14
Q

P>|t| in regression

A

It uses the t statistic to produce the p-value, a measurement of how likely it is that your coefficient estimate arose by chance. A p-value of 0.378 for Wealth means that, if Wealth truly had no effect on the dependent variable Lottery, there would be a 37.8% chance of seeing a result at least this extreme by chance.

15
Q

Precision

A

Precision -> P -> TP / Predicted positives -> TP / (TP + FP)

16
Q

Recall

A

Recall -> R -> TP / Real positives -> TP / (TP + FN)

= Sensitivity

17
Q

Specificity

A

Specificity is the true negative rate, i.e. the recall of the negative class (the counterpart of recall/sensitivity).

SPIN -> TN / Real Negatives -> TN / (TN + FP)

18
Q

Explain regularisation in regression

A

Two types:
L1 -> Lasso Regression
L2 -> Ridge Regression

The key difference between the two is the penalty term.

Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function. If lambda is zero we get back OLS; if lambda is very large, the penalty gets too much weight and the model under-fits. How lambda is chosen is therefore important. This technique works very well to avoid over-fitting. L2 shrinks the parameters, so it is often used to mitigate multicollinearity; it reduces model complexity through coefficient shrinkage.

Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of the magnitude” of the coefficients as a penalty term to the loss function. Again, if lambda is zero we get back OLS, whereas a very large value will drive coefficients to zero and under-fit.

The key difference between these techniques is that Lasso shrinks the less important features’ coefficients to exactly zero, removing some features altogether. This property is known as feature selection and is absent in ridge. Lasso is generally used when we have a large number of features, because it performs feature selection automatically.
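
A minimal sketch comparing the two penalties with scikit-learn (the synthetic data and alpha values are illustrative; `alpha` plays the role of lambda here):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Illustrative data: 10 features, only the first 3 actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: can set unimportant coefficients exactly to zero

print(np.round(ridge.coef_, 3))      # all ten coefficients small but non-zero
print(np.round(lasso.coef_, 3))      # irrelevant features driven to exactly 0
```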

19
Q

Bias and Variance in regression models

A

As we add more and more parameters to our model, its complexity increases, which results in increasing variance and decreasing bias, i.e., overfitting. So we need to find the optimum point where the decrease in bias equals the increase in variance.

To overcome underfitting or high bias, we can basically add new parameters to our model so that the model complexity increases, and thus reducing high bias. Now, how can we overcome Overfitting for a regression model?

Basically there are two methods to overcome overfitting,

  1. Reduce the model complexity
  2. Regularization
20
Q

Elastic Net Regression

A

Elastic net is basically a combination of both L1 and L2 regularization. Elastic regression generally works well when we have a big dataset.

Let’s say we have a bunch of correlated independent variables in a dataset; elastic net will simply form a group consisting of these correlated variables. Now, if any one variable in this group is a strong predictor (meaning it has a strong relationship with the dependent variable), then we include the entire group in the model building, because omitting the other variables (as lasso might) could mean losing some information in terms of interpretability, leading to poorer model performance.

21
Q

Logistic Regression

A

In logistic regression, we compute a probability, which lies in the interval 0 to 1 (inclusive of both). That probability can then be used to classify the data.

3 types: binomial, multinomial, ordinal

Equation:
log odds -> log(p / (1 − p)) = a + bx
p = e^(a+bx) / (1 + e^(a+bx))

Loss function:
Log loss is the negative average of the log of the corrected predicted probabilities for each instance.
Log loss indicates how close the predicted probability is to the corresponding actual/true value.

Minimised by gradient descent.
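
A minimal sketch of the sigmoid and the log loss in NumPy (the example labels and log-odds values are illustrative):

```python
import numpy as np

def sigmoid(z):
    # Maps log-odds (a + b*x) to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_prob, eps=1e-12):
    # Negative average log of the "corrected" probability:
    # p for positive instances, (1 - p) for negative instances
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
y_prob = sigmoid(np.array([2.0, -1.5, 0.3, 3.0]))  # probabilities from some a + b*x
print(log_loss(y_true, y_prob))
```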

22
Q

How can you avoid overfitting your model?

A

Overfitting refers to a model that fits a small amount of training data too closely and ignores the bigger picture. There are three main methods to avoid overfitting:

Keep the model simple: take fewer variables into account, thereby removing some of the noise in the training data
Use cross-validation techniques, such as k-fold cross-validation
Use regularization techniques, such as LASSO, that penalize model parameters likely to cause overfitting

23
Q

Decision Tree Steps

A

Take the entire data set as input
Calculate entropy of the target variable, as well as the predictor attributes
Calculate the information gain of all attributes (how much information we gain by splitting on each attribute)
Choose the attribute with the highest information gain as the root node
Repeat the same procedure on every branch until the decision node of each branch is finalized

24
Q

What is entropy

A

Entropy is a measurement of the disorder or impurity in the information processed in machine learning.

Ranges between 0 and 1 (it can be greater than 1 if there are more than 2 classes)

Higher -> more disorder

Information gain = parent entropy − weighted sum of the child nodes’ entropies

Formula: entropy = − sum over classes of p · log2(p)
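
A minimal sketch of entropy and information gain for a binary split (the label arrays are illustrative):

```python
import numpy as np

def entropy(labels):
    # - sum over classes of p * log2(p)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    # Parent entropy minus the weighted average of the children's entropies
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

parent = np.array([1, 1, 1, 0, 0, 0, 1, 0])
left   = np.array([1, 1, 1, 1])   # one branch of a candidate split
right  = np.array([0, 0, 0, 0])   # the other branch
print(entropy(parent), information_gain(parent, left, right))  # 1.0 and 1.0 for a perfect split
```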

25
Q

ROC AUC

A

The ROC curve is the graph of the True Positive Rate on the y-axis against the False Positive Rate on the x-axis, traced out over all threshold levels.

It tells how capable the model is of separating the classes.

The area under the ROC curve (AUC) ranges between 0 and 1. A completely random model, which is represented by the diagonal straight line, has an AUC of 0.5.

26
Q

SVM

A

An SVM finds the best line in two dimensions, or the best hyperplane in more than two dimensions, in order to separate the space into classes. The hyperplane (line) is found through the maximum margin, i.e., the maximum distance between the data points of the two classes.

The points closest to the hyperplane are known as the support vectors, because only these points contribute to the result of the algorithm; the other points do not.

In order to find the maximal margin, we need to maximize the margin between the data points and the hyperplane.

The hyperplane equation is w^T x + b = 0; the margin is obtained by projecting the closest data points of each class onto the unit weight vector.

In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is the hinge loss. “Hinge” describes the fact that the error is 0 if a data point is classified correctly and is not too close to the decision boundary. The first term, the hinge loss, penalizes misclassifications: it measures the error due to misclassification (or data points lying closer to the classification boundary than the margin). The second term is the regularization term, a technique to avoid overfitting by penalizing large coefficients in the solution vector. The λ (lambda) is the regularization coefficient, and its major role is to determine the trade-off between increasing the margin size and ensuring that each xi lies on the correct side of the margin.

SGD works by initializing a set of coefficients with random values, calculating the gradient of the loss function through partial derivatives, and updating those coefficients by taking a “step” of a defined size. The algorithm iteratively updates the coefficients such that they are moving opposite the direction of steepest ascent (away from the maximum of the loss function) and toward the minimum, approximating a solution for the optimization problem.
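
A minimal sketch of the regularized hinge objective described above (labels are assumed to be in {-1, +1}; `lam` stands in for the λ regularization coefficient):

```python
import numpy as np

def hinge_objective(w, b, X, y, lam=0.01):
    # Hinge term: 0 when a point is correctly classified and outside the margin,
    # grows linearly as it moves toward / past the decision boundary.
    margins = y * (X @ w + b)
    hinge = np.mean(np.maximum(0.0, 1.0 - margins))
    # L2 regularization term penalizes large coefficients (controls margin size).
    return hinge + lam * np.dot(w, w)

# Illustrative usage with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w, b = np.array([0.5, 0.5]), 0.0
print(hinge_objective(w, b, X, y))
```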

27
Q

CNN

A

A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that specializes in processing data that has a grid-like topology, such as an image. A digital image is a binary representation of visual data. It contains a series of pixels arranged in a grid-like fashion that contains pixel values to denote how bright and what color each pixel should be.

A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected layer.

The convolution layer performs a dot product between two matrices, where one matrix is the set of learnable parameters otherwise known as a kernel, and the other matrix is the restricted portion of the receptive field.

During the forward pass, the kernel slides across the height and width of the image, producing a representation of that receptive region.

The pooling layer replaces the output of the network at certain locations by deriving a summary statistic of the nearby outputs. This helps in reducing the spatial size of the representation, which decreases the required amount of computation and the number of weights. Max pooling is the default choice.

The fully connected layer helps to map the representation between the input and the output.

28
Q

Why LSTM

A

LSTM stands for Long Short-Term Memory.

LSTM is a type of recurrent neural network, but it is better than a traditional recurrent neural network in terms of memory.

Traditional recurrent neural networks suffer from short-term memory; LSTMs improve performance by memorizing the relevant information that is important and finding the patterns.

29
Q

What are the feature selection methods used to select the right variables?

A

Linear discriminant analysis
ANOVA
Chi-Square
Wrapper Methods

30
Q

Accuracy

A

Accuracy = (True Positive + True Negative) / Total Observations

31
Q

What are eigenvalue and eigenvector?

A

Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; the corresponding eigenvalues are the factors by which the transformation stretches or compresses along those directions.

Eigenvectors are key to understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix.

32
Q

Bias and ML algorithms

A

Some of the popular machine learning algorithms which are low on the bias scale are -

Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Decision Trees.

Algorithms that are high on the bias scale -

Logistic Regression and Linear Regression.

33
Q

What do you do when you first get data?

A
34
Q

Definition of outliers

A

If a value is a certain number of standard deviations away from the mean, that data point is identified as an outlier. The default is 3.

Z-scores can quantify the unusualness of an observation when your data follow the normal distribution. Z-scores are the number of standard deviations above and below the mean that each value falls. For example, a Z-score of 2 indicates that an observation is two standard deviations above the average while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean.

Under a normal distribution, roughly 5% of values lie more than 2 standard deviations from the mean and roughly 0.3% lie more than 3 standard deviations away.

35
Q

How to deal with outliers/missing values?

A

Investigate the data points. If the data collection process is accurate and the data point is not illogical, keep it.
If the data point is illogical and data is not scarce, remove the row.
If data is scarce, consider replacing the value with the sample mean or median.

For preprocessing, consider standardizing the data rather than normalising with a min-max scaler, as min-max normalisation is highly sensitive to outliers.

36
Q

How do you know if the model is overfitted?

A

Overfitting is a scenario where your model performs well on training data but performs poorly on data not seen during training.

Overfitting is easy to diagnose with the accuracy visualizations you have available. If “Accuracy” (measured against the training set) is very good and “Validation Accuracy” (measured against a validation set) is not as good, then your model is overfitting

Techniques to reduce overfitting:

  • Reduce the number of trainable parameters; this will reduce the complexity of the model
  • Regularisation techniques

Regularisation for ML: L1, L2
Regularisation for DL: use fewer layers (shallower networks), fewer neurons per layer, sparser connections between the layers (as in convolutional nets), or regularization techniques like dropout.

37
Q

Decision Tree Steps

A

It begins with the original set S as the root node.
On each iteration, the algorithm iterates through every unused attribute of the set S and calculates the Entropy (H) and Information Gain (IG) of that attribute.
It then selects the attribute which has the smallest entropy or largest information gain.
The set S is then split by the selected attribute to produce subsets of the data.
The algorithm continues to recurse on each subset, considering only attributes never selected before.

38
Q

Decision Tree 4 Assumptions

A

1) In the beginning, the whole training set is considered as the root.
2) Feature values are preferred to be categorical. If the values are continuous, they are discretized prior to building the model.
3) Records are distributed recursively on the basis of attribute values.
4) The order in which attributes are placed as the root or as internal nodes of the tree is determined using a statistical approach.

39
Q

Attribute Selection Measures for decision tree

A

Entropy
-> Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.
Information gain
-> Entropy of the parent minus the weighted entropy of the child nodes
Gini index
-> Calculated by subtracting the sum of the squared probabilities of each class from one. It favors larger partitions and is easy to implement, whereas information gain favors smaller partitions with distinct values. The Gini index works with a categorical target variable (“Success” or “Failure”) and performs only binary splits.
Gain ratio
-> Information gain is biased towards choosing attributes with a large number of values as root nodes, i.e. it prefers attributes with many distinct values.

C4.5, an improvement of ID3, uses Gain ratio which is a modification of Information gain that reduces its bias and is usually the best option. Gain ratio overcomes the problem with information gain by taking into account the number of branches that would result before making the split. It corrects information gain by taking the intrinsic information of a split into account.

40
Q

How to avoid/counter Overfitting in Decision Trees?

A

Pruning Decision Trees
- In pruning, you trim off the branches of the tree, i.e., remove the decision nodes starting from the leaf node such that the overall accuracy is not disturbed.
Random Forest -> bagging

41
Q

Explain Random Forest

A

A random forest is built up of a number of decision trees. The data is split into bootstrapped subsets and a tree is trained on each subset. The final result is based on majority voting for classification or averaging for regression.

Steps to build a random forest model (a code sketch follows below):

1) Randomly select ‘k’ features from a total of ‘m’ features, where k << m
2) Among the ‘k’ features, calculate node D using the best split point
3) Split the node into daughter nodes using the best split
4) Repeat steps two and three until the leaf nodes are finalized
5) Build the forest by repeating steps one to four ‘n’ times to create ‘n’ trees
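
A minimal sketch with scikit-learn (the dataset and hyperparameters are illustrative; `max_features` corresponds to the ‘k’ features sampled per split and `n_estimators` to the ‘n’ trees):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of trees ('n'); max_features = features sampled per split ('k')
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # predictions are a majority vote across the trees
```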

42
Q

Features of random forest

A

Important features of Random Forest:
1. Diversity: not all attributes/variables/features are considered while making an individual tree; each tree is different.
2. Immune to the curse of dimensionality: since each tree does not consider all the features, the feature space is reduced.
3. Parallelization: each tree is created independently out of different data and attributes, so we can make full use of the CPU to build random forests.
4. Train-test split: in a random forest we don’t have to segregate the data into train and test, as about 30% of the data is never seen by a given tree (the out-of-bag samples).
5. Stability: stability arises because the result is based on majority voting/averaging.
43
Q

Bagging or Boosting decrease bias and variance?

A

Bagging -> variance

Boosting -> bias

44
Q

What is bagging

A

Bootstrapping and Aggregation are combined to form one ensemble model. Given a sample of data, multiple bootstrapped subsamples are drawn, a model is trained on each subsample, and the models are then aggregated using voting or averaging.

45
Q

What is boosting

A

Boosting is an ensemble learning method that combines a set of weak learners into a strong learner to minimize training errors.

A weak classifier is one that performs better than random guessing, but still performs poorly at designating classes to objects.

46
Q

Explain Adaboost

A

AdaBoost stands for Adaptive Boosting.

Step 1: A weak classifier (e.g. a decision stump) is made on top of the training data based on the weighted samples. Here, the weights of each sample indicate how important it is to be correctly classified. Initially, for the first stump, we give all the samples equal weights.

Step 2: We create a decision stump for each variable and see how well each stump classifies samples to their target classes.

Step 3: More weight is assigned to the incorrectly classified samples so that they’re classified correctly in the next decision stump. A weight is also assigned to each classifier based on its accuracy, which means high accuracy = high weight. This weight, alpha, is how much influence the stump will have in the final classification: alpha = 1/2 × ln((1 − total error) / total error)

Notice that when a Decision Stump does well, or has no misclassifications (a perfect stump!) this results in an error rate of 0 and a relatively large, positive alpha value.

If the stump classifies half correctly and half incorrectly (an error rate of 0.5, no better than random guessing), then the alpha value will be 0. Finally, when the stump consistently gives misclassified results (just do the opposite of what the stump says!), the alpha will be a large negative value.

To update the sample weight, the new sample weight will be equal to the old sample weight multiplied by Euler’s number, raised to plus or minus alpha (which we just calculated in the previous step).

The sign of the exponent (plus or minus alpha) indicates:

The exponent is negative (minus alpha) when the predicted and the actual output agree (the sample was classified correctly). In this case we decrease the sample weight, since the stump already performs well on it.
The exponent is positive (plus alpha) when the predicted output does not agree with the actual class (i.e. the sample is misclassified). In this case we increase the sample weight so that the same misclassification does not repeat in the next stump. This is how the stumps are dependent on their predecessors.

Step 4: Reiterate from Step 2 until all the data points have been correctly classified, or the maximum iteration level has been reached.
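
A minimal sketch of the alpha computation and the sample-weight update described above (the error value and weights are illustrative):

```python
import numpy as np

def stump_alpha(total_error, eps=1e-10):
    # 0.5 * ln((1 - error) / error): large and positive for a good stump,
    # 0 for a coin-flip stump, large and negative for a consistently wrong one
    total_error = np.clip(total_error, eps, 1 - eps)
    return 0.5 * np.log((1 - total_error) / total_error)

def update_weights(weights, alpha, correct):
    # Correctly classified samples get exponent -alpha (weight shrinks),
    # misclassified samples get exponent +alpha (weight grows)
    signs = np.where(correct, -1.0, 1.0)
    new_w = weights * np.exp(signs * alpha)
    return new_w / new_w.sum()  # renormalize so the weights sum to 1

weights = np.full(5, 0.2)
correct = np.array([True, True, False, True, True])
alpha = stump_alpha(total_error=0.2)   # the one misclassified sample carries weight 0.2
print(alpha, update_weights(weights, alpha, correct))
```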

47
Q

SVM Kernels

A

The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable one. It is mostly useful in non-linear separation problems. Simply put, it does some extremely complex data transformations, then finds the way to separate the data based on the labels or outputs you’ve defined.

- linear
- polynomial
- Gaussian
- radial basis function (RBF)
48
Q

Explain LSTM Algorithm

A
  1. FORGET Gate
    1. This gate is responsible for deciding which information is kept for calculating the cell state and which is not relevant and can be discarded.
    2. Two inputs:
      1. h(t-1), the information from the previous hidden state (previous cell)
      2. x(t), the information from the current cell
  2. INPUT Gate
    1. The input gate updates the cell state and decides which information is important and which is not.
    2. As the forget gate helps to discard information, the input gate helps to find the important information and store the relevant data in memory.
    3. Inputs:
      1. h(t-1) passed through a sigmoid
      2. x(t) passed through a tanh function; the tanh function regulates the network and reduces bias
  3. Cell State
    1. All the information gained is then used to calculate the new cell state.
    2. The cell state is first multiplied with the output of the forget gate. This has a possibility of dropping values in the cell state if it gets multiplied by values near 0.
    3. Then a pointwise addition with the output from the input gate updates the cell state to new values that the neural network finds relevant.
  4. OUTPUT Gate
    1. The last gate, the output gate, decides what the next hidden state should be. h(t-1) and x(t) are passed to a sigmoid function.
    2. Then the newly modified cell state is passed through the tanh function and multiplied with the sigmoid output to decide what information the hidden state should carry.
49
Q

F1 score formula

A
F1 = 2 / (1/recall + 1/precision)
F1 = 2 × (precision × recall) / (precision + recall)

Harmonic mean of precision and recall

50
Q

True positive rate

A

Recall = sensitivity = TP/real positives

51
Q

FPR / false alarm rate

A

FP / Real negatives -> FP / (FP + TN)

52
Q

Explain gradient boosting algorithm

A

The gradient boosting algorithm can be used for predicting not only a continuous target variable (as a regressor) but also a categorical target variable (as a classifier). When it is used as a regressor, the cost function is Mean Squared Error (MSE); when it is used as a classifier, the cost function is log loss.

Regression steps:

1) Calculate the average of the target label (this is the initial prediction: the value γ that minimizes the sum of squared residuals, argmin over γ of Σ L(actual, γ))

2) calculate the pseudo residuals (actual-predicted)

3) Next, we build a tree with the goal of predicting the residuals. Every leaf contains a prediction of the residual value; residuals in the same leaf are averaged.

4) Predict the target label using all of the trees within the ensemble. Each sample passes through the decision nodes of the newly formed tree until it reaches a given leaf, and the residual in that leaf is used to update the prediction (e.g. of the house price). Each residual is multiplied by the learning rate: prediction at a leaf = average + learning rate × average residual of that leaf.

The idea behind the learning rate is to make a small step in the right direction. This allows an overall lower variance.

5) Compute new residuals. These residuals will be used for the leaves of the next decision tree, as described in step 3.

6) Repeat steps 3-5 until the maximum number of trees is reached.

7) Once trained, use all of the trees in the ensemble to make a final prediction for the target variable. The final prediction equals the mean computed in the first step, plus all of the residuals predicted by the trees, each multiplied by the learning rate:

y_pred = ȳ_train + lr × res_pred_1 + lr × res_pred_2 + …

Classification

1) Calculate the initial log odds = log(count of class 1 / count of class 2)

2) Convert to a probability: p = e^(log odds) / (1 + e^(log odds))

3) Calculate the residuals (class − probability)

4) Build a tree on the residuals and transform the leaf values: for each leaf, output = sum of residuals / sum of (previous probability × (1 − previous probability))

5) New log odds = previous log odds + lr × leaf output

6) Calculate the residuals and continue until the maximum number of trees is reached or the residuals are small (a sketch of the regression variant follows below)
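
A minimal sketch of the regression variant with scikit-learn (the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each shallow tree is fitted to the pseudo-residuals of the ensemble so far,
# and its contribution is scaled by the learning rate.
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3)
gbr.fit(X_train, y_train)
print(gbr.score(X_test, y_test))  # R^2 on held-out data
```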

53
Q

TF-IDF

A
  • Stands for term frequency, inverse document frequency
  • A statistical measure that evaluates how relevant a word is to a document in a collection of documents.
  • In TF-IDF weighting, words that are unique to a particular document get higher weights compared to words that are used commonly across documents.
  • Each document is represented by the TF-IDF of each word, creating a vector (see the sketch below).
  • TF-IDF formula = tf × log(total number of documents / number of documents containing the word)
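
A minimal sketch with scikit-learn (the toy documents are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs can be friends",
]

# Each document becomes a vector of TF-IDF weights; words unique to a document
# (e.g. "mat") score higher than words shared across documents (e.g. "the").
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```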
54
Q

Naive bayes algorithm explained

A
Step 1: Compute the ‘prior’ probability for each class of fruit: count of class / total.
Step 2: Compute the likelihood of the evidence that goes in the numerator. It is the product of the conditional probabilities of the 3 features.

Probability of likelihood for Banana: P(x1=Long | Y=Banana) = 400 / 500 = 0.80; P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70; P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90.

So, the overall likelihood of the evidence for Banana = 0.8 × 0.7 × 0.9 = 0.504.

Step 3: Substitute the pieces into the Naive Bayes formula to get the probability that the fruit is a banana: multiply 0.504 by the prior probability of Banana.

Repeat with all the fruits. The one with the highest probability is the predicted class.

If we assume that X follows a particular distribution, then you can plug in the probability density function of that distribution to compute the probability of the likelihoods.

55
Q

K nearest neighbour

A

Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance from the new data point to the training points.
Step-3: Take the K nearest neighbors according to the calculated Euclidean distances.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category with the largest number of neighbors.
Step-6: Our model is ready.

The computation cost is high because of calculating the distance between the data points for all the training samples.

56
Q

K means clustering

A

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids. (They can be points other than those from the input dataset.)

Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, i.e. reassign each datapoint to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

57
Q

What is Entity Embedding

A
58
Q

Laplace smoothing

A

The value of P(Orange | Long, Sweet and Yellow) was zero in the above example, because, P(Long | Orange) was zero.

That is, there were no ‘Long’ oranges in the training data.

It makes sense, but when you have a model with many features, the entire probability can become zero because one of the features’ conditional probabilities is zero. To avoid this, we increase the count of the zero-count value by a small number (usually 1) in the numerator, so that the overall probability doesn’t become zero. This approach is called ‘Laplace correction’ (Laplace smoothing).

59
Q

Naive bayes assumptions

A

Naive Bayes requires a strong assumption of independent predictors, so when the model has a bad performance, the reason leading to that may be the dependence between predictors.

60
Q

Euclidean distance

A

sqrt((x2 − x1)^2 + (y2 − y1)^2)

61
Q

software development lifecycle

A

planning, analysis, design, implementation, testing and integration, maintenance

62
Q

K means clustering elbow method

A

y-axis: distortion score (It is calculated as the average of the squared distances from the cluster centers of the respective clusters. Typically, the Euclidean distance metric is used.)
x-axis: number of centroids

To determine the optimal number of clusters, we select the value of k at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing in a roughly linear fashion. For example, if the elbow appears at k = 3, we conclude that the optimal number of clusters for the data is 3.
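
A minimal sketch of the elbow plot with scikit-learn and matplotlib (the synthetic blobs and the range of k values are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to the nearest centroid

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (distortion)")
plt.show()  # pick k at the elbow of the curve (here, around 3)
```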

63
Q

NLP Feature extractions methods

A

Count vectorizer (simple baseline; raw counts ignore how informative each word is)
TF-IDF
Word2Vec (CBOW/SkipGram)
Word Embeddings

64
Q

Word2Vec: CBOW vs Skip Gram

A

In the CBOW model, the distributed representations of context (or surrounding words) are combined to predict the word in the middle.

In the Skip-gram model, the distributed representation of the input word is used to predict the context.

65
Q

Cosine Similarity

A

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
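
A minimal sketch in NumPy (the example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||); 1 = same direction, 0 = orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc1 = np.array([1.0, 2.0, 0.0])   # e.g. TF-IDF vectors of two documents
doc2 = np.array([2.0, 4.0, 0.0])
print(cosine_similarity(doc1, doc2))  # 1.0: the vectors point in the same direction
```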

66
Q

Popular algorithms that can be used for binary classification include:

A
Logistic Regression
k-Nearest Neighbors
Decision Trees
Support Vector Machine
Naive Bayes
67
Q

Popular algorithms that can be used for multi-class classification include:

A
k-Nearest Neighbors.
Decision Trees.
Naive Bayes.
Random Forest.
Gradient Boosting.
68
Q

Algorithms for Regression Based Problems

A

Linear Regression: a model of the relationship between multiple independent input variables (feature variables) and a dependent output variable. The model is linear in that the output is a linear combination of the input variables.

Pros: 1. Learning is fast 2. Easy to implement and understand 3. Tuning hyper-parameters is easy and fast
Cons: 1. Cannot model complex functions 2. Non-linear relationships require lots of transformations

Regression Trees: regression trees include a whole variety of tree-based algorithms such as Decision Trees, Random Forest, XGBoost, Gradient Boosting Machines, LightGBM, etc. Trees can model highly complex functions and can give high performance (do read up on regularization).

Pros: 1. Learn complex, highly non-linear relationships 2. Decision boundaries and the model can be easy to understand
Cons: 1. Prone to overfitting 2. Training can be slow and time-consuming 3. Many hyper-parameters to tune

Neural Networks: a neural net has an input layer, an output layer and a number of hidden layers with neurons (chosen by the user). It helps you find f(x) = y using the combination of neurons and hidden layers. Neural networks also use gradient descent to find the best parameters.

Pros: 1. Learn complex functions 2. Adding and augmenting more data leads to improvements 3. Less dependent on the form of the input data (little feature engineering required)
Cons: 1. Black box; models are difficult to understand 2. Training time is high, with high computational cost

Now we know a few ML algorithms for regression that we can choose from based on the problem.

If you were looking for names, here is the list (not exhaustive):

Lasso Regression
Ridge Regression
Elastic Net
Decision Trees
Random Forest
GBM
Light GBM
XGboost
Adaboost
Neural Networks
69
Q

p value

A

The p-value expresses the probability that an observed result would occur by random chance if the null hypothesis were true. A p-value under 5% is conventionally taken as evidence against the null hypothesis. The higher the p-value, the weaker the evidence against the null hypothesis.

70
Q

Explain Normal Distribution

A

A normal distribution is a probability distribution where the values are symmetric on either side of the mean of the data. This implies that values closer to the mean are more common than values that are further away from it.

71
Q

power

A

Statistical power is used in binary hypothesis testing.

It is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true, i.e. the likelihood that a test detects an effect when the effect is present.

The higher the statistical power, the better the test; it is used in experiment design to calculate the minimum sample size.

72
Q

Type 1 error

A

False positive

Mistakenly rejecting a true null hypothesis.

Concluding that findings are significant when in fact they occurred by chance.

The larger the significance level (alpha), the more likely a Type 1 error and the less reliable the result.

73
Q

Type 2 error

A

False negative

Failing to reject a false null hypothesis -> concluding there is no significant effect when there is one.

Relevant in A/B testing.

74
Q

Confidence Interval

A

A confidence interval is a range that is likely to cover the true value.

The confidence level is the probability that the CI covers the true value.

75
Q

Nerual Networks

A

In the human brain, different neurons are present. These neurons combine and perform various tasks. The Neural Network in deep learning tries to imitate human brain neurons. The neural network learns the patterns from the data and uses the knowledge that it gains from various patterns to predict the output for new data, without any human assistance.

A perceptron is the simplest neural network that contains a single neuron that performs 2 functions. The first function is to perform the weighted sum of all the inputs and the second is an activation function.

There are some other neural networks that are more complicated. Such networks consist of the following three layers:

Input Layer: The neural network has the input layer to receive the input.
Hidden Layer: There can be multiple hidden layers between the input layer and the output layer. The initial hidden layers detect the low-level patterns, whereas the later layers combine the outputs from previous layers to find more patterns.
Output Layer: This layer outputs the prediction.

76
Q

What are Exploding Gradients and Vanishing Gradients?

A

Exploding Gradients: Let us say that you are training an RNN. Say, you saw exponentially growing error gradients that accumulate, and as a result of this, very large updates are made to the neural network model weights. These exponentially growing error gradients that update the neural network weights to a great extent are called Exploding Gradients.

Vanishing Gradients: Let us say again, that you are training an RNN. Say, the slope became too small. This problem of the slope becoming too small is called Vanishing Gradient. It causes a major increase in the training time and causes poor performance and extremely low accuracy.

77
Q

What are the differences between correlation and covariance?

A

Although these two terms are used for establishing a relationship and dependency between any two random variables, the following are the differences between them:

Correlation is a statistical term describing the degree to which two variables move in coordination with one another. If the two variables move in the same direction, then those variables are said to have a positive correlation. If they move in opposite directions, then they have a negative correlation. Correlation ranges from -1 to 1.
Covariance: it represents the extent to which the variables change together. This explains the systematic relationship between a pair of variables, where changes in one are associated with changes in the other.

Correlation, like covariance, is a measure of how two variables change in relation to each other, but it goes one step further than covariance in that correlation tells how strong the relationship is.

78
Q

How do you approach solving any data analytics based project?

A

Generally, we follow the below steps:

The first step is to thoroughly understand the business requirement/problem
Next, explore the given data and analyze it carefully. If you find any data missing, get the requirements clarified from the business.
Next comes the data cleanup and preparation step, whose output is then used for modelling. Here, the missing values are found and the variables are transformed.
Run your model against the data, build meaningful visualization and analyze the results to get meaningful insights.
Release the model implementation, and track the results and performance over a specified period to analyze the usefulness.
Perform cross-validation of the model.