Data Science Interview Questions Flashcards
What does PCA stand for?
What is the goal of PCA?
How does it achieve the goal?
Limitations?
PCA: Principal Component Analysis
Goal: Reduce the dimension of the dataset, because modern datasets are large and often contain overlapping information.
How:
1) Standardize the dataset
2) Run the PCA analysis. PCA uses information such as the dimensions of the dataset, the means, and the eigenvectors and eigenvalues. After sorting the eigenvalues, PCA regroups the variables so that the first component captures the maximum amount of variation, the second component captures the second-largest amount of variation, and so on.
Limitations: Unable to deal with categorical data
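A minimal sketch of the two steps with scikit-learn, assuming purely numeric data (the random array and the choice of 2 components are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative data: 100 samples, 5 numeric features
X = np.random.rand(100, 5)

# 1) Standardize the dataset
X_std = StandardScaler().fit_transform(X)

# 2) Run PCA and inspect how much variance each component explains
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)  # first component carries the most variance
```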
What is Factor Analysis?
Just like PCA, Factor Analysis is also a model that allows reducing information in a larger number of variables into a smaller number of variables. In Factor Analysis we call those “latent variables”.
Differences between PCA and Factor Analysis
Mathematical differences: PCA does not estimate specific effects; it simply finds the mathematical definition of the “best” components (the components that maximize variance). Factor Analysis also estimates the components, but we now call them common factors, and in addition it estimates the specific factors.
Application differences:
1) In PCA, there is one fixed outcome that orders the components from the highest explanatory value to the lowest explanatory value. In Factor Analysis, we can apply rotations to our solution, which allows us to find a solution that gives a more coherent business interpretation to each of the factors identified.
2) Factor Analysis is much more flexible for interpretation, which makes it a great tool for exploration and interpretation. PCA, on the other hand, is used in cases where we want to retain the largest amount of variation in the smallest number of variables possible, for example to simplify further analysis such as machine learning.
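A minimal sketch of the Factor Analysis side, assuming a scikit-learn version whose FactorAnalysis accepts the rotation argument (the iris data and varimax rotation are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis

X = StandardScaler().fit_transform(load_iris().data)

# Common factors plus a rotation chosen for interpretability
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
print(fa.components_)       # factor loadings after rotation
print(fa.noise_variance_)   # estimated specific (unique) variance per variable
```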
Data Leakage?
Techniques To Minimize Data Leakage When Building Models?
Data leakage is when information from outside the training dataset is used to create the model.
- When the data you are using to train a machine learning algorithm happens to contain the information you are trying to predict
- Perform data preparation within your cross-validation folds.
- Hold back a validation dataset for a final sanity check of your developed models.
- Use pipelines that transform data within every cross-validation fold (see the sketch below).
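A minimal sketch of the pipeline idea with scikit-learn (the synthetic data, scaler, and classifier are illustrative): the scaler is re-fit inside each cross-validation fold, so statistics from the held-out fold never leak into training.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is fit only on the training portion of every fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```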
What are the 4 main elements of reinforcement learning?
An agent
A policy
A reward signal, and
A value function
On-Policy VS Off-Policy
On-policy methods attempt to evaluate or improve the policy that is used to make decisions. In contrast, off-policy methods evaluate or improve a policy different from that used to generate the data.
On-Policy Reinforcement Learning Example
SARSA (state-action-reward-state-action) is an on-policy reinforcement learning algorithm that estimates the value of the policy being followed. The agent learns a policy and uses that same policy to act: the policy used for updating and the policy used for acting are the same, unlike in Q-learning. This is an example of on-policy learning.
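A minimal tabular SARSA sketch (the state/action counts, hyperparameters, and the single illustrative transition are made up) showing that the epsilon-greedy policy used to pick the next action is the same policy being updated:

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(state):
    # Behaviour policy and target policy are the same (on-policy)
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa_update(s, a, r, s_next):
    a_next = epsilon_greedy(s_next)           # next action chosen by the SAME policy
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])  # move Q(s, a) toward the TD target
    return a_next                             # this action is actually taken next

# Illustrative single update: in state 0, act, pretend we got reward 1 and moved to state 1
a = epsilon_greedy(0)
next_a = sarsa_update(0, a, 1.0, 1)
print(Q[0, a])
```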
How to check for multicollinearity?
Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. Mathematically, the VIF for a given predictor equals 1 / (1 - R^2_i), where R^2_i is the R-squared obtained by regressing that predictor on all the other predictors. This ratio is calculated for each independent variable. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.
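A minimal sketch with statsmodels' variance_inflation_factor (the DataFrame and the deliberately collinear column are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors, with x3 built to be nearly collinear with x1
df = pd.DataFrame({
    "x1": np.random.rand(100),
    "x2": np.random.rand(100),
})
df["x3"] = df["x1"] * 0.9 + np.random.rand(100) * 0.1

X = sm.add_constant(df)  # compute VIF with an intercept column present
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x3 should show high VIF values
```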
Explain Regression
Regression fits a line or curve through the data points on a target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimized.
3 common types of regression: linear, polynomial, logistic
Explain Linear Regression
Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis)
If there is a single input variable (x), such linear regression is called simple linear regression. And if there is more than one input variable, such linear regression is called multiple linear regression.
To calculate the best-fit line, linear regression uses the traditional slope-intercept form: y = mx + c
To figure out the best values for m and c, a cost function is needed. The cost function measures how well a linear regression model is performing, and the regression coefficients or weights are optimized to minimize it.
In Linear Regression, Mean Squared Error (MSE) cost function is used, which is the average of squared error that occurred between the predicted values and actual values.
Gradient descent is a method of updating a0 and a1 to minimize the cost function (MSE). A regression model uses gradient descent to update the coefficients of the line (the intercept a0 and the slope a1): it starts from random coefficient values and then iteratively updates them to reach the minimum of the cost function.
1) Start with random coefficients
2) Calculate the predicted values
3) Calculate the partial derivatives of the cost function w.r.t. a0 and a1, substituting in the predicted values
4) Multiply each derivative by the learning rate and subtract it from the corresponding coefficient
5) Stop after a fixed number of iterations (e.g., 100) or once the error is low (see the sketch below)
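A minimal sketch of these steps for simple linear regression with the MSE cost (the data, learning rate, and iteration count are illustrative):

```python
import numpy as np

# Synthetic data with true slope 3 and intercept 2
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + np.random.randn(50)

a0, a1 = 0.0, 0.0          # 1) starting coefficients (zero-initialized here)
lr, n_iters = 0.02, 2000

for _ in range(n_iters):
    y_pred = a1 * x + a0                    # 2) predicted values
    d_a0 = -2 * np.mean(y - y_pred)         # 3) partial derivative of MSE w.r.t. a0
    d_a1 = -2 * np.mean((y - y_pred) * x)   #    and w.r.t. a1
    a0 -= lr * d_a0                         # 4) step opposite the gradient,
    a1 -= lr * d_a1                         #    scaled by the learning rate

print(a0, a1)  # should approach the true intercept 2 and slope 3
```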
Assumptions of Linear Regression
A1. The linear regression model is “linear in parameters.”
A2. There is a random sampling of observations.
A3. The conditional mean of the error term should be zero.
A4. There is no multi-collinearity (or perfect collinearity).
A5. Spherical errors: There is homoscedasticity and no autocorrelation
A6: Optional Assumption: Error terms should be normally distributed.
What is the R-squared in linear regression?
R-squared measures how much of the variation in the dependent variable is explained by the independent variables. In percentage terms, 0.338 would mean our model explains 33.8% of the variation in our ‘Lottery’ variable.
R^2 = 1 - SSR/SST, where SSR is the residual sum of squares and SST is the total sum of squares.
Adjusted R squared
Linear regression has the property that your model’s R-squared value will never go down as you add variables; it can only stay the same or increase. Therefore, your model could look more accurate with multiple variables even if they are contributing poorly. The adjusted R-squared penalizes the R-squared formula based on the number of variables, so a lower adjusted score may be telling you that some variables are not contributing properly to your model’s R-squared.
Adjusted R^2 = 1 - [(1 - R^2)(n - 1) / (n - k - 1)]
n is the number of points in your data sample.
k is the number of independent regressors.
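A minimal sketch of the formula (the R^2, n, and k values passed in are illustrative):

```python
# Adjusted R-squared from an ordinary R-squared value
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """r2: ordinary R-squared, n: number of samples, k: number of regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(r2=0.338, n=85, k=7))  # adding weak regressors drives this value down
```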
P>|t| in regression
It uses the t-statistic to produce the p-value, a measurement of how likely it is that the coefficient measured by our model arose by chance. The p-value of 0.378 for Wealth says there is a 37.8% chance that the Wealth variable has no effect on the dependent variable, Lottery, and that our results were produced by chance.
Precision
Precision -> P -> TP / Predicted Positives -> TP / (TP + FP)
Recall
Recall -> R -> TP / Real Positives -> TP / (TP + FN)
= Sensitivity
Specificity
The counterpart of recall: the true negative rate (recall computed on the negative class)
SPIN -> TN / Real Negatives -> TN / (TN + FP)
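A minimal sketch computing all three from a confusion matrix with scikit-learn (the labels and predictions are made up):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical binary labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:  ", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:     ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("specificity:", tn / (tn + fp))                    # TN / (TN + FP)
```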
Explain regularisation in regression
Two types:
L1 -> Lasso Regression
L2 -> Ridge Regression
The key difference between these two is the penalty term.
Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function. If lambda is zero, we get back OLS; however, if lambda is very large it will add too much weight and lead to under-fitting. That is why how lambda is chosen is important. This technique works very well to avoid over-fitting. L2 shrinks the parameters, so it is mostly used to prevent multicollinearity; it reduces model complexity through coefficient shrinkage.
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function. Again, if lambda is zero we get back OLS, whereas a very large value will push coefficients to zero and hence under-fit.
The key difference between these techniques is that Lasso shrinks the less important features’ coefficients to zero, thus removing some features altogether. This property is known as feature selection and is absent in ridge regression. Lasso is generally used when we have a large number of features, because it does feature selection automatically.
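A minimal sketch of the two penalties with scikit-learn, where alpha plays the role of lambda (the synthetic data and alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 20 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives some coefficients exactly to zero

print("non-zero ridge coefs:", np.sum(ridge.coef_ != 0))  # usually all 20
print("non-zero lasso coefs:", np.sum(lasso.coef_ != 0))  # usually far fewer
```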
Bias and Variance in regression models
As we add more and more parameters to our model, its complexity increases, which results in increasing variance and decreasing bias, i.e., overfitting. So we need to find the optimum point for our model, where the decrease in bias is balanced by the increase in variance.
To overcome underfitting or high bias, we can basically add new parameters to our model so that the model complexity increases, and thus reducing high bias. Now, how can we overcome Overfitting for a regression model?
Basically there are two methods to overcome overfitting,
- Reduce the model complexity
- Regularization
Elastic Net Regression
Elastic net is basically a combination of both L1 and L2 regularization. Elastic regression generally works well when we have a big dataset.
Let’s say we have a bunch of correlated independent variables in a dataset; elastic net will simply form a group consisting of these correlated variables. Now, if any one of the variables in this group is a strong predictor (meaning it has a strong relationship with the dependent variable), then we include the entire group in the model building, because omitting the other variables (as lasso would) might mean losing some information in terms of interpretability, leading to poor model performance.
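A minimal scikit-learn sketch (the synthetic data, alpha, and l1_ratio are illustrative); l1_ratio blends the L1 and L2 penalties, and alpha sets the overall strength:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic data with some near-duplicate (highly correlated) columns appended
X, y = make_regression(n_samples=500, n_features=30, n_informative=10,
                       noise=5.0, random_state=0)
X = np.hstack([X, X[:, :5] + np.random.randn(500, 5) * 0.01])

model = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10000).fit(X, y)
print(np.sum(model.coef_ != 0), "of", X.shape[1], "coefficients kept")
```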
Logistic Regression
In logistic regression, we generally compute a probability, which lies in the interval 0 to 1 (inclusive of both). This probability can then be used to classify the data.
3 types: binomial, multinomial, ordinal
Equation:
log odds -> log(p / (1 - p)) = a + bx
p = e^(a + bx) / (1 + e^(a + bx))
Loss function:
Log Loss is the negative average of the log of corrected predicted probabilities for each instance.
Log-loss is indicative of how close the prediction probability is to the corresponding actual/true value
Minimised by gradient descent.
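A minimal sketch showing the probability output and the log loss computed both by hand and with scikit-learn (the synthetic data is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Hypothetical binary classification data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X)[:, 1]           # probabilities in [0, 1]

# Log loss: negative average log of the probability assigned to the true class
manual = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(manual, log_loss(y, p))            # the two values should match
```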
How can you avoid overfitting your model?
Overfitting refers to a model that fits the training data too closely, capturing noise in a small amount of data and missing the bigger picture. There are three main methods to avoid overfitting:
Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data
Use cross-validation techniques, such as k folds cross-validation
Use regularization techniques, such as LASSO, that penalize certain model parameters if they’re likely to cause overfitting
Decision Tree Steps
Take the entire data set as input
Calculate entropy of the target variable, as well as the predictor attributes
Calculate your information gain of all attributes (we gain information on sorting different objects from each other)
Choose the attribute with the highest information gain as the root node
Repeat the same procedure on every branch until the decision node of each branch is finalized
What is entropy
Entropy is the measurement of disorder or impurities in the information processed in machine learning.
Ranges between 0 and 1 for two classes (can be greater than 1 if there are more than 2 classes)
Higher -> more disorder
Information gain = parent entropy - weighted sum of the child nodes’ entropies
Formula: entropy = -sum over each class of p_i * log2(p_i)
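A minimal sketch of entropy and information gain (the labels and the split are made-up examples):

```python
import numpy as np

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child splits."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical split: a perfectly pure split yields the maximum information gain
parent = np.array([1, 1, 1, 0, 0, 0])
print(information_gain(parent, [parent[:3], parent[3:]]))  # 1.0 for this perfect split
```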
ROC AUC
The ROC curve is the graph of the True Positive Rate on the y-axis against the False Positive Rate on the x-axis, with one point per threshold level.
It tells how well the model is capable of separating the classes.
The area under the ROC curve ranges between 0 and 1. A completely random model, represented by the diagonal straight line, has an AUC of 0.5.
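A minimal scikit-learn sketch (the synthetic data and the logistic-regression scorer are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical binary classifier scored by predicted probabilities
X, y = make_classification(n_samples=500, random_state=0)
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y, scores))      # 0.5 ~ random, 1.0 ~ perfect separation
```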
SVM
An SVM finds the best line in two dimensions, or the best hyperplane in more than two dimensions, to separate the space into classes. The hyperplane (line) is found through the maximum margin, i.e., the maximum distance between the data points of both classes.
The vector points closest to the hyperplane are known as the support vector points, because only these points contribute to the result of the algorithm; the other points do not.
In order to find the maximal margin, we need to maximize the margin between the data points and the hyperplane.
The hyperplane equation is w^T x + b = 0; the margin is calculated by projecting the data points of each class onto the unit weight vector w / ||w||.
In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is the hinge loss (“hinge” describes the fact that the error is 0 if a data point is classified correctly and is not too close to the decision boundary). The first term, the hinge loss, penalizes misclassifications: it measures the error due to misclassification, or due to data points lying closer to the classification boundary than the margin. The second term is the regularization term, a technique to avoid overfitting by penalizing large coefficients in the solution vector. λ (lambda) is the regularization coefficient, and its major role is to determine the trade-off between increasing the margin size and ensuring that each x_i lies on the correct side of the margin.
SGD works by initializing a set of coefficients with random values, calculating the gradient of the loss function through partial derivatives, and updating those coefficients by taking a “step” of a defined size. The algorithm iteratively updates the coefficients such that they are moving opposite the direction of steepest ascent (away from the maximum of the loss function) and toward the minimum, approximating a solution for the optimization problem.
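A minimal sketch of a linear SVM trained by SGD on the regularized hinge loss described above (the two-blob data, learning rate, and lambda are illustrative):

```python
import numpy as np

# Objective: mean(max(0, 1 - y*(w.x + b))) + lam * ||w||^2, with labels y in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
lr, lam = 0.01, 0.01

for epoch in range(200):
    for xi, yi in zip(X, y):
        margin = yi * (xi @ w + b)
        if margin < 1:                        # inside the margin or misclassified
            w -= lr * (2 * lam * w - yi * xi)
            b -= lr * (-yi)
        else:                                 # correctly classified with room to spare
            w -= lr * (2 * lam * w)           # only the regularization term contributes

print(np.mean(np.sign(X @ w + b) == y))      # training accuracy, should be near 1.0
```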
CNN
A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that specializes in processing data that has a grid-like topology, such as an image. A digital image is a binary representation of visual data. It contains a series of pixels arranged in a grid-like fashion that contains pixel values to denote how bright and what color each pixel should be.
A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected layer.
The convolution layer performs a dot product between two matrices, where one matrix is the set of learnable parameters otherwise known as a kernel, and the other matrix is the restricted portion of the receptive field.
During the forward pass, the kernel slides across the height and width of the image, producing a representation of each receptive region.
The pooling layer replaces the output of the network at certain locations by deriving a summary statistic of the nearby outputs. This helps reduce the spatial size of the representation, which decreases the required amount of computation and the number of weights. Max pooling is the most common choice.
The fully connected layer helps map the representation between the input and the output.
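A minimal PyTorch sketch of the three layer types (the channel counts, kernel size, and 28x28 grayscale input are illustrative):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # convolutional layer
        self.pool = nn.MaxPool2d(kernel_size=2)                # pooling layer
        self.fc = nn.Linear(8 * 14 * 14, n_classes)            # fully connected layer

    def forward(self, x):
        x = torch.relu(self.conv(x))   # kernel slides over the image producing feature maps
        x = self.pool(x)               # summary statistic (max) over local neighbourhoods
        return self.fc(x.flatten(1))   # map the representation to class scores

print(TinyCNN()(torch.randn(4, 1, 28, 28)).shape)  # torch.Size([4, 10])
```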
Why LSTM
LSTM stands for Long Short-Term Memory
LSTM is a type of recurrent neural network but is better than traditional recurrent neural networks in terms of memory.
Traditional recurrent neural networks suffer from short-term memory; LSTMs improve performance by memorizing the relevant information that is important and finding the patterns.
What are the feature selection methods used to select the right variables?
Linear discriminant analysis
ANOVA
Chi-Square
Wrapper Methods
Accuracy
Accuracy = (True Positive + True Negative) / Total Observations
What are eigenvalue and eigenvector?
Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; eigenvalues are the factors by which the transformation scales the data along those directions.
Eigenvectors are key to understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix.
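A minimal NumPy sketch of an eigen-decomposition of a covariance matrix (the random data is illustrative):

```python
import numpy as np

# Covariance matrix of some illustrative data
X = np.random.rand(100, 3)
cov = np.cov(X, rowvar=False)                     # 3x3 symmetric matrix

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: for symmetric matrices
print(eigenvalues)          # variance captured along each eigenvector direction
print(eigenvectors[:, -1])  # direction of maximum variance (PCA's first component)
```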