Data Science Interview Flashcards

Question

How do we evaluate classification models? 👶

Answer 1

Accuracy, Precision, Recall, F1 Score, Logistic loss (also known as Cross-entropy loss), Jaccard similarity coefficient score

Answer 2

Accuracy is a metric for evaluating classification models. It is calculated by dividing the number of correct predictions by the number of total predictions.

Answer 3

Accuracy is not a good performance metric when the dataset is imbalanced. For example, in binary classification with 95% of samples in class A and 5% in class B, a constant prediction of class A would have an accuracy of 95%. With an imbalanced dataset, we need to choose precision, recall, or F1 score depending on the problem we are trying to solve.
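
The 95% / 5% example can be checked with a few lines of Python. This is a minimal sketch: the label lists and the helper name `accuracy` are made up for illustration.

```python
# Hypothetical labels: 95 samples of class "A" and 5 of class "B".
y_true = ["A"] * 95 + ["B"] * 5
# A "model" that always predicts the majority class.
y_pred = ["A"] * 100

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

acc = accuracy(y_true, y_pred)

# Recall for class "B": the share of true "B" samples that were found.
b_found = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))
recall_b = b_found / y_true.count("B")

print(acc, recall_b)  # 0.95 0.0
```

Accuracy looks high even though the constant model never finds a single sample of class B.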

Answer 4

A confusion table (or confusion matrix) shows how many true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) the model has made.

Answer 5

Precision and recall are classification evaluation metrics: P = TP / (TP + FP) and R = TP / (TP + FN), where TP is true positives, FP is false positives and FN is false negatives. In both cases a score of 1 is the best: we get no false positives or false negatives and only true positives. F1 combines precision and recall in one score (their harmonic mean): F1 = 2 * P * R / (P + R). The maximum F1 score is 1 and the minimum is 0, with 1 being the best.
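
As a rough illustration, the counts and the three formulas can be computed by hand. The helper names and the toy labels below are assumptions made for this sketch, and returning 0.0 when a denominator is zero is just one possible convention.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy data: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)  # 3 1 1 5
```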

Answer 6

A tradeoff means that increasing one quantity leads to a decrease in the other. The precision-recall tradeoff arises when we try to increase one of the two (precision or recall) while keeping the model the same. In an ideal scenario with perfectly separable data, both precision and recall can reach the maximum value of 1.0. In most practical situations, however, the data is noisy and not perfectly separable: some points of the positive class lie close to the negative class and vice versa. In such cases, shifting the decision boundary can increase either precision or recall, but not both.
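
The effect of shifting the decision boundary can be sketched by scoring the same predictions at several thresholds. The scores and labels are invented, and treating precision as 1.0 when nothing is predicted positive is an assumed convention.

```python
# Hypothetical model scores and true labels (1 = positive class).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

strict = precision_recall_at(0.8, scores, labels)    # (1.0, 0.5)
middle = precision_recall_at(0.5, scores, labels)    # (0.75, 0.75)
lenient = precision_recall_at(0.25, scores, labels)  # precision 4/6, recall 1.0
```

Lowering the threshold raises recall from 0.5 to 1.0 while precision falls from 1.0 to about 0.67; the model itself never changes.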

Answer 7

ROC stands for Receiver Operating Characteristic. The ROC curve is a plot of the true positive rate against the false positive rate at different classification thresholds. It is used when the model predicts the probability of a binary outcome.

Answer 8

AUC stands for Area Under the ROC Curve. ROC is a probability curve and AUC represents the degree or measure of separability. It is used when we need to assess how well the model can distinguish between classes. The value is between 0 and 1; the higher, the better.

Answer 9

The AUC score is the value of the Area Under the ROC Curve. An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0, which means it has the worst measure of separability: it ranks the classes in reverse, scoring negatives above positives. When the AUC score is 0.5, the model has no class separation capacity whatsoever.
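
One way to build intuition for these three values is the ranking view of AUC: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below uses that pairwise definition, counting ties as half; the function name and the score lists are made up for illustration.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
auc_perfect = roc_auc([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], labels)   # 1.0
auc_reversed = roc_auc([0.1, 0.2, 0.3, 0.7, 0.8, 0.9], labels)  # 0.0
auc_constant = roc_auc([0.5] * 6, labels)                       # 0.5
```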

Answer 10

A precision-recall curve (or PR Curve) is a plot of the precision (y-axis) and the recall (x-axis) for different probability thresholds. Precision-recall curves (PR curves) are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance.

Answer 11

A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate.

Answer 12

The difference is that AU ROC looks at the true positive rate (TPR) and false positive rate (FPR), while AU PR looks at the positive predictive value (PPV) and the true positive rate (TPR). If true negatives are not meaningful to the problem or you care more about the positive class, AU PR is typically going to be more useful; otherwise, if you care equally about the positive and negative classes or your dataset is quite balanced, going with AU ROC is a good idea.

Answer 13

Categorical variables must be encoded before they can be used as features to train a machine learning model. There are various encoding techniques, including: one-hot encoding, label encoding, ordinal encoding, and target encoding.

Answer 14

If we simply encode categorical variables with a label encoder, they become ordinal, which can lead to undesirable consequences. In this case, linear models will treat the category with id 4 as twice as good as the category with id 2. One-hot encoding allows us to represent a categorical variable in a numerical vector space, which ensures that the vectors of all categories are at equal distances from each other. This approach is not suited to all situations: with categorical variables of high cardinality (e.g. customer id), it creates one dimension per category, and we run into the curse of dimensionality.
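
The difference between the two encodings can be seen on a tiny example. This is a hand-rolled sketch rather than any particular library's encoder; the colour values and helper names are invented.

```python
categories = ["red", "green", "blue", "green"]
vocab = sorted(set(categories))  # ['blue', 'green', 'red']

# Label encoding: one integer id per category (implies an order).
label_ids = {c: i for i, c in enumerate(vocab)}
label_encoded = [label_ids[c] for c in categories]  # [2, 1, 0, 1]

# One-hot encoding: one indicator dimension per category.
def one_hot(category):
    return [1 if category == v else 0 for v in vocab]

one_hot_encoded = [one_hot(c) for c in categories]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# With label ids, "red" is further from "blue" than "green" is ...
label_gaps = (abs(label_ids["red"] - label_ids["blue"]),
              abs(label_ids["green"] - label_ids["blue"]))  # (2, 1)
# ... while every pair of distinct one-hot vectors is equally far apart.
one_hot_gaps = {squared_distance(one_hot(a), one_hot(b))
                for a in vocab for b in vocab if a != b}  # {2}
```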

Answer 15

The curse of dimensionality is an issue that arises when working with high-dimensional data. It is often said that "the curse of dimensionality" is one of the main problems with machine learning. It refers to the fact that, as the number of dimensions (features) in a data set increases, the number of data points required to accurately learn the relationships between those features increases exponentially. Consider a simple example in which we have a data set with two features, x1 and x2. To learn the relationship between these two features, we need enough data points to accurately estimate the parameters of that relationship. If we add a third feature, x3, the number of data points required to accurately learn the relationships between all three features increases exponentially, because there are now more parameters to estimate. Simply put, the curse of dimensionality means that, for a fixed amount of data, the error tends to increase as the number of features increases.

Answer 16

We would not be able to perform the regression. Because z is linearly dependent on x and y, the matrix X^T X would be singular (not invertible) when performing the regression.
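
The singularity can be verified on a small integer example. The sketch assumes, purely for illustration, that z = x + y; the data rows and helper names are made up. Integer arithmetic keeps the determinant exact.

```python
def gram(X):
    """Compute X^T X for a list-of-rows matrix."""
    n_cols = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n_cols)]
            for i in range(n_cols)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

xy = [(1, 2), (2, 1), (3, 5), (4, 3)]

# Assumed dependence for this example: z = x + y.
X_dependent = [[x, y, x + y] for x, y in xy]
# For contrast, a third column that is not a combination of x and y.
X_independent = [[1, 2, 0], [2, 1, 1], [3, 5, 0], [4, 3, 2]]

det_dependent = det3(gram(X_dependent))      # 0, so X^T X has no inverse
det_independent = det3(gram(X_independent))  # positive, so it is invertible
```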

Answer 17

Regularization is used to reduce overfitting in machine learning models. It helps the models to generalize well and make them robust to outliers and noise in the data.

Answer 18

There are mainly two types of regularization. L1 regularization (Lasso regularization) adds the sum of the absolute values of the coefficients to the cost function. L2 regularization (Ridge regularization) adds the sum of squares of the coefficients to the cost function.

Answer 19

AIC/BIC, Ridge regression, Lasso, Elastic Net, Basis pursuit denoising, Rudin–Osher–Fatemi model (TV), Potts model, RLAD, Dantzig Selector, SLOPE

Answer 20

L2 regularization adds a penalty term to our cost function equal to the sum of squares of the model's coefficients multiplied by a lambda hyperparameter. This technique shrinks the coefficients toward zero and is widely used when we have a lot of features that might correlate with each other.

Answer 21

Regularization parameters can be chosen using a grid search. For example, https://scikit-learn.org/stable/modules/linear_model.html gives a formula for implementing regularization; the alpha in that formula can be found by running a random search or a grid search over a set of values and selecting the alpha that gives the lowest cross-validation or validation error.

Answer 22

L2 regularization penalizes larger weights more severely (due to the squared penalty term), which encourages weight values to decay toward zero.

Answer 23

L1 regularization adds a penalty term to our cost function equal to the sum of the absolute values (moduli) of the model's coefficients multiplied by a lambda hyperparameter.

Answer 24

Penalty terms: L1 regularization uses the sum of the absolute values of the weights, while L2 regularization uses the sum of the weights squared. Feature selection: L1 performs feature selection by reducing the coefficients of some predictors to 0, while L2 does not. Computational efficiency: L2 has an analytical solution, while L1 does not. Multicollinearity: L2 addresses multicollinearity by constraining the coefficient norm.
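
The feature-selection difference is easiest to see in a simplified one-coefficient setting. The sketch below assumes each coefficient can be penalized independently of the others (as with orthonormal features), which is a simplification and not the general case; the function names and numbers are illustrative.

```python
def l1_shrink(w, lam):
    """Minimizer of 0.5 * (v - w)**2 + lam * abs(v): soft-thresholding."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small coefficients are set exactly to zero

def l2_shrink(w, lam):
    """Minimizer of 0.5 * (v - w)**2 + 0.5 * lam * v**2."""
    return w / (1.0 + lam)  # scaled toward zero, but never exactly zero

weights = [3.0, 0.4, -0.2, -2.0]  # hypothetical unpenalized coefficients
lam = 0.5

l1_weights = [l1_shrink(w, lam) for w in weights]  # [2.5, 0.0, 0.0, -1.5]
l2_weights = [l2_shrink(w, lam) for w in weights]
```

Under L1 the two small coefficients vanish, while under L2 all four stay non-zero and only shrink.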

Answer 25

Yes, elastic net regularization combines L1 and L2 regularization.

Answer 26

Bias is, simply put, the difference between the predicted value and the actual/true value. It can be interpreted as the distance between the average prediction and the true value, i.e. true value minus mean(predictions). But don't confuse accuracy with bias.

Answer 27

Without normalizing weights or variables, if you increase the corresponding predictor by one unit, the coefficient represents on average how much the output changes. By the way, this interpretation still works for logistic regression - if you increase the corresponding predictor by one unit, the weight represents the change in the log of the odds. If the variables are normalized, we can interpret weights in linear models like the importance of this variable in the predicted result.

Answer 28

Yes - if your predictor variables are normalized. Without normalization, the weight represents the change in the output per unit change in the predictor. If you have a predictor with a huge range and scale that is used to predict an output with a very small range - for example, using each nation's GDP to predict maternal mortality rates - your coefficient should be very small. That does not necessarily mean that this predictor variable is not important compared to the others.

Answer 29

Feature normalization is necessary for L1 and L2 regularization. The idea of both methods is to penalize all the features relatively equally, which can't be done effectively if every feature is scaled differently. Linear regression without regularization can be used without feature normalization. Regularization can also make the analytical solution more stable: it adds the regularization matrix to X^T X before inverting it.

Answer 30

Feature selection is a method used to select the relevant features for the model to train on. We need feature selection to remove irrelevant features, which lead the model to under-perform.

Answer 31

Yes, it is. It can improve model performance by selecting the most important features and removing irrelevant ones before making a prediction, and it can also help with overfitting, underfitting and the bias-variance tradeoff.

Answer 32

Here are some feature selection methods: Principal Component Analysis, Neighborhood Component Analysis, ReliefF Algorithm.

Answer 33

Yes, because the nature of L1 regularization will lead to sparse coefficients of features. Feature selection can be done by keeping only features with non-zero coefficients.

Answer 34

No, because L2 regularization does not make the weights zero; it only makes them very small. L2 regularization can be used to address multicollinearity since it stabilizes the model.

Answer 35

A decision tree is a type of supervised learning algorithm that is mostly used for classification problems. It works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets, based on the most significant attributes/independent variables, to make the groups as distinct as possible. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a value for the target variable. Various splitting criteria are used, such as Gini, information gain, chi-square and entropy.

Answer 36

(1) Start at the root node. (2) For each variable X, find the set S that minimizes the sum of the node impurities in the two child nodes, and choose the split {X*, S*} that gives the minimum over all X and S. (3) If a stopping criterion is reached, exit. Otherwise, apply step 2 to each child node in turn.

Answer 37

Maximum tree depth, minimum samples per leaf node, impurity criterion.

Answer 38

Some decision tree algorithms can handle categorical variables out of the box, others cannot. However, we can transform categorical variables, e.g. with a binary or a one-hot encoder.

Answer 39

Easy to implement, fast training, fast inference, good explainability.

Answer 40

Often, we want to find a split such that it minimizes the sum of the node impurities. The impurity criterion is a parameter of decision trees. Popular methods to measure the impurity are the Gini impurity and the entropy describing the information gain.
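
A minimal sketch of both impurity measures, together with a search for the best threshold on a single numeric feature, might look as follows. The function names, the midpoint candidate thresholds and the toy data are assumptions of this example, not a specific library's behaviour.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; information gain is the drop in entropy."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels, impurity=gini):
    """Try midpoints between sorted distinct values; return the threshold
    with the lowest size-weighted impurity of the two child nodes."""
    n = len(values)
    distinct = sorted(set(values))
    best_threshold, best_score = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]

root_gini = gini(labels)        # 0.5 for a 50/50 node
root_entropy = entropy(labels)  # 1.0 bit for a 50/50 node
threshold, child_impurity = best_split(values, labels)  # (6.5, 0.0)
```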

Answer 41

Random Forest is a machine learning method for regression and classification which is composed of many decision trees. Random Forest belongs to a larger class of ML algorithms called ensemble methods (in other words, it involves the combination of several models to solve a single prediction problem).

Answer 42

Random forest is an extension of the bagging algorithm, which takes random data samples from the training dataset (with replacement), trains several models and averages their predictions. In addition, each time a split in a tree is considered, random forest takes a random sample of m features from the full set of n features (without replacement) and uses this subset of features as candidates for the split (for example, m = sqrt(n)). Training decision trees on random data samples from the training dataset reduces variance. Sampling features for each split in a decision tree decorrelates the trees.
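
The two sources of randomness can be sketched with the standard library alone. The sizes, the fixed seed and the variable names are illustrative, and the sketch assumes the common variant in which the feature subset for a split contains no repeated features.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows, n_features = 10, 9

# Bagging: each tree sees n_rows row indices drawn WITH replacement,
# so some rows repeat and others are left out.
bootstrap_rows = [rng.randrange(n_rows) for _ in range(n_rows)]

# At each split: m candidate features, here m = sqrt(n), drawn as a
# subset of distinct features (assumed variant).
m = int(math.sqrt(n_features))
split_features = rng.sample(range(n_features), m)

print(sorted(bootstrap_rows), sorted(split_features))
```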

Answer 43

max_depth: the longest path between the root node and a leaf; min_samples_split: the minimum number of observations needed to split a given node; max_leaf_nodes: conditions the splitting of the tree and hence limits the growth of the trees; min_samples_leaf: the minimum number of samples in a leaf node; n_estimators: the number of trees; max_samples: the fraction of the original dataset given to any individual tree in the model; max_features: limits the maximum number of features provided to trees in the random forest model.

Answer 44

The greater the depth, the more information the tree extracts. However, there is a limit to this: even though the algorithm is defensive against overfitting, it may learn complex features of the noise present in the data and, as a result, overfit. Hence, there is no hard rule of thumb for deciding the depth, but the literature suggests a few tips for tuning the depth of the tree to prevent overfitting: limit the maximum depth of a tree; limit the number of test nodes; limit the minimum number of objects at a node required to split; do not split a node when at least one of the resulting subsample sizes is below a given threshold; stop developing a node if it does not sufficiently improve the fit.

Answer 45

The number of trees in a random forest is controlled by n_estimators, and a random forest reduces overfitting by increasing the number of trees. There is no fixed rule of thumb for deciding the number of trees; it is rather fine-tuned with the data, typically starting with the square of the number of features (n) present in the data and then tuning until we get the optimal results.

Answer 46

In random forest, since a sample of the features is used to build each tree, the information contained in correlated features is about twice as likely to be picked as the information contained in other features. In general, adding correlated features means adding features that linearly contain the same information, which reduces the robustness of your model. Each time you train your model, it might pick one feature or the other to "do the same job", i.e. explain some variance, reduce entropy, etc.

Answer 47

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

Answer 48

Random Forests builds each tree independently while Gradient Boosting builds one tree at a time. Random Forests combine results at the end of the process (by averaging or "majority rules") while Gradient Boosting combines results along the way.
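
To make "one tree at a time" concrete, here is a minimal sketch of gradient boosting for squared error on a single feature, using one-split stumps as the weak learners. Everything in it (function names, data, number of rounds, learning rate) is an illustrative assumption, not how any specific library implements boosting.

```python
def fit_stump(xs, residuals):
    """Weak learner: one threshold, a constant prediction on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Each new stump is fitted to the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
model = boost(xs, ys)
final_mse = mse(model, xs, ys)  # much lower than the constant baseline
```

A random forest would instead fit every tree to the original targets independently and average them at the end.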

Answer 49

Yes, different frameworks provide different options to make training faster, using GPUs to speed up the process by making it highly parallelizable. For example, for XGBoost the tree_method = 'gpu_hist' option makes training faster by using GPUs (newer XGBoost releases express the same thing as tree_method = 'hist' together with device = 'cuda').

Answer 50

There are many parameters, but below are a few key defaults: learning_rate=0.1 (shrinkage), n_estimators=100 (number of trees), max_depth=3, min_samples_split=2, min_samples_leaf=1, subsample=1.0.

Answer 51

Depending upon the dataset, parameter tuning can be done manually or using hyperparameter optimization frameworks such as optuna and hyperopt. In manual parameter tuning, we need to pay attention to max_depth, min_samples_leaf and min_samples_split so that our model does not overfit the data but instead learns its generalizable characteristics (basically keeping variance and bias low for our model).

Answer 52

Most implementations of gradient boosting are configured by default with a relatively small number of trees, such as hundreds or thousands. Using scikit-learn, we can perform a grid search over the n_estimators model parameter.

Answer 53

There are several strategies for hyper-parameter tuning, but I would argue that the three most popular nowadays are the following.

Grid Search is an exhaustive approach: for each hyper-parameter, the user manually gives a list of values for the algorithm to try. Grid search then evaluates the algorithm using each and every combination of hyper-parameters and returns the combination that gives the optimal result (e.g. lowest MAE). Because grid search evaluates all combinations, it can be quite computationally expensive, and it can lead to sub-optimal results since the user needs to specify specific values for these hyper-parameters, which is error-prone and requires domain knowledge.

Random Search is similar to grid search, but rather than specifying which values to try for each hyper-parameter, the user gives an upper and lower bound for each hyper-parameter. Random values within these bounds are then chosen with uniform probability and, similarly, the best combination is returned to the user. Although this seems less intuitive, no domain knowledge is necessary and theoretically much more of the parameter space can be explored.

In a completely different framework, Bayesian Optimization is thought of as a more statistical way of optimization and is commonly used with neural networks, specifically since one evaluation of a neural network can be computationally costly. In numerous research papers, this method heavily outperforms Grid Search and Random Search, and it is currently used on the Google Cloud Platform as well as AWS. Because an in-depth explanation requires a heavy background in Bayesian statistics and Gaussian processes (and maybe even some game theory), a "simple" explanation is that a much simpler/faster surrogate model stands in for the computationally expensive, original algorithm, and an acquisition function (such as probability of improvement or GP-UCB) intelligently chooses which hyper-parameter values to try next. The result of each evaluation of the expensive/original algorithm is taken into account as prior knowledge to come up with the set of hyper-parameters for the next iteration. This process continues either for a specified number of iterations or for a specified amount of time, and the combination of hyper-parameters that performs best on the expensive/original algorithm is chosen.
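
Grid search and random search can be compared on a toy objective. In this sketch `validation_error` is a hypothetical stand-in for an expensive train-and-validate run, and the parameter names, bounds and seed are made up; Bayesian optimization is not shown because it needs a surrogate model.

```python
import itertools
import random

def validation_error(learning_rate, depth):
    """Hypothetical stand-in for training a model and measuring its error."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 4) ** 2

# Grid search: the user lists the exact values; every combination is tried.
grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
grid_trials = [dict(zip(grid, combo))
               for combo in itertools.product(*grid.values())]
best_grid = min(grid_trials, key=lambda p: validation_error(**p))

# Random search: the user gives only bounds; values are drawn uniformly.
rng = random.Random(0)
random_trials = [{"learning_rate": rng.uniform(0.01, 1.0),
                  "depth": rng.randint(2, 8)}
                 for _ in range(len(grid_trials))]  # same budget as the grid
best_random = min(random_trials, key=lambda p: validation_error(**p))

print(best_grid)  # {'learning_rate': 0.1, 'depth': 4}
```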

Answer 54

Neural nets are good at solving non-linear problems. Some good examples are problems that are relatively easy for humans (because of experience, intuition, understanding, etc), but difficult for traditional regression models: speech recognition, handwriting recognition, image identification, etc.

Answer 55

In a usual fully-connected feed-forward network, each neuron receives input from every element of the previous layer and thus the receptive field of a neuron is the entire previous layer. They are usually used to represent feature vectors for input data in classification problems but can be expensive to train because of the number of computations involved.

Answer 56

The main idea of using neural networks is to learn complex nonlinear functions. If we do not use an activation function between the layers of a neural network, we are just stacking multiple linear layers on top of one another, which leads to learning a linear function. The nonlinearity comes only from the activation function; this is the reason we need activation functions.
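
The collapse of stacked linear layers into a single linear map can be checked numerically. The weight matrices and input below are arbitrary small integers chosen for this sketch, and biases are left out to keep it short.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def relu(v):
    return [max(0, a) for a in v]

W1 = [[1, -2], [3, 1]]   # first "layer"
W2 = [[2, 1], [-1, 1]]   # second "layer"
x = [1, 2]

two_linear_layers = matvec(W2, matvec(W1, x))     # [-1, 8]
one_merged_layer = matvec(matmul(W2, W1), x)      # [-1, 8], identical
with_activation = matvec(W2, relu(matvec(W1, x))) # [5, 5], no longer linear
```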

Answer 57

The derivative of the sigmoid function for large positive or negative inputs is almost zero. This causes the vanishing gradient problem: during backpropagation, our network will not learn (or will learn very slowly). One possible way to solve this problem is to use the ReLU activation function.
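
A quick numerical check of the derivatives, as a sketch with hand-written helpers (the function names are chosen for this example):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

grad_at_zero = sigmoid_grad(0.0)     # 0.25, the largest possible value
grad_saturated = sigmoid_grad(10.0)  # roughly 4.5e-05
relu_at_ten = relu_grad(10.0)        # 1.0

# Backpropagation multiplies such factors layer by layer, so even in the
# best case ten sigmoid layers scale the gradient by 0.25 ** 10.
best_case_ten_layers = grad_at_zero ** 10
```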

Answer 58

ReLU is an abbreviation for Rectified Linear Unit. It is an activation function which has the value 0 for all negative values and the value f(x) = x for all positive values. The ReLU has a simple activation function which makes it fast to compute and while the sigmoid and tanh activation functions saturate at higher values, the ReLU has a potentially infinite activation, which addresses the problem of vanishing gradients.

Answer 59

Proper initialization of the weight matrix in a neural network is very important. Put simply, there are two ways to initialize. (1) Initializing weights with zeroes. Setting weights to zero makes your network no better than a linear model. Note that setting biases to 0 does not create any trouble, as non-zero weights take care of breaking the symmetry, and even if the bias is 0, the values in every neuron are still different. (2) Initializing weights randomly. Assigning random values to weights is better than assigning just 0. a) If weights are initialized with very high values, the term np.dot(W,X)+b becomes significantly higher, and if an activation function like sigmoid() is applied, the function maps its value near 1, where the gradient is close to zero and learning takes a lot of time. b) If weights are initialized with very low values, it gets mapped near 0, where the situation is the same as above. This problem is often referred to as the vanishing gradient.

Answer 60

If all the weights of a neural network are set to zero, the output of each connection is the same (W*x = 0). This means the gradients backpropagated to each connection in a layer are the same, so all the connections/weights learn the same thing and the model never converges.

Answer 61

L1 Regularization: defined as the sum of absolute values of the individual parameters. The L1 penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded. L2 Regularization: defined as the sum of squares of the individual parameters, often controlled by a regularization hyperparameter alpha. It results in weight decay. Data Augmentation: this requires some fake data to be created as part of the training set. Dropout: this is one of the most effective regularization techniques for neural nets. A few random nodes in each layer are deactivated in the forward pass, which allows the algorithm to train on a different set of nodes in each iteration.

Answer 62

Dropout is a technique that, at each training step, turns off each neuron with a certain probability p. This way, at each iteration we train only a fraction 1-p of the neurons, which forces the network not to rely on only a subset of neurons for feature representation. This leads to regularizing effects that are controlled by the hyperparameter p.
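
A minimal sketch of dropout on a list of activations is shown below. It uses the "inverted dropout" variant, which additionally rescales the surviving activations by 1/(1-p) so that nothing needs to change at inference time; that rescaling, the function name and the seed are assumptions of this example and are not stated in the answer above.

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p (inverted-dropout variant:
    survivors are scaled by 1 / (1 - p)). Identity when not training."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed for repeatability
activations = [1.0] * 8

train_out = dropout(activations, p=0.5, rng=rng)  # each entry is 0.0 or 2.0
eval_out = dropout(activations, p=0.5, rng=rng, training=False)  # unchanged
```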

Answer 63

The backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent. The weights that minimize the error function are then considered to be a solution to the learning problem. We need backpropagation for the following steps. Calculate the error: how far is your model output from the actual output. Minimum error: check whether the error is minimized or not. Update the parameters: if the error is large, update the parameters (weights and biases), then check the error again. Repeat the process until the error becomes minimal. Model is ready to make a prediction: once the error is minimal, you can feed some inputs to your model and it will produce the output.

Answer 64

Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent (best among gradient descents), Nesterov Accelerated Gradient, Momentum, Adagrad, AdaDelta, Adam (best one: less time, more efficient)

Answer 65

SGD approximates the expectation with few randomly selected samples (instead of the full data). In comparison to batch gradient descent, we can efficiently approximate the expectation in large data sets using SGD. For neural networks this reduces the training time a lot even considering that it will converge later as the random sampling adds noise to the gradient descent.

Answer 66

The learning rate is an important hyperparameter that controls how quickly the model is adapted to the problem during the training. It can be seen as the "step width" during the parameter updates, i.e. how far the weights are moved into the direction of the minimum of our optimization problem.

Answer 67

A large learning rate can accelerate the training. However, it is possible that we "shoot" too far and miss the minimum of the function that we want to optimize, which will not result in the best solution. On the other hand, training with a small learning rate takes more time but it is possible to find a more precise minimum. The downside can be that the solution is stuck in a local minimum, and the weights won't update even if it is not the best possible global solution.
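
The effect of the step size can be seen on the simplest possible problem, minimizing f(w) = w**2 with plain gradient descent. The function, starting point and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

w_good = gradient_descent(0.1)     # close to the minimum at 0
w_small = gradient_descent(0.001)  # barely moved after 20 steps
w_large = gradient_descent(1.1)    # overshoots further on every step

print(w_good, w_small, w_large)
```

This convex toy problem has a single minimum, so it shows slow convergence and overshooting but not the local-minimum issue mentioned above.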

Answer 68

There is no straightforward way of finding an optimum learning rate for a model. It involves a lot of trial and error. Usually, a small value such as 0.01 is a good starting point; the learning rate can then be tweaked so that training neither overshoots nor converges too slowly.

Answer 69

Adam (Adaptive Moment Estimation) is an optimization technique for training neural networks. On average, it is the best optimizer. It works with momentums of first and second order. The intuition behind Adam is that we don't want to roll so fast that we jump over the minimum; we want to decrease the velocity a little bit for a careful search. Adam tends to converge faster, while SGD often converges to more optimal solutions. Adam also rectifies the high variance that is a disadvantage of SGD.
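
For reference, a bare-bones version of the Adam update for a single scalar parameter is sketched below. The hyperparameter values are the commonly quoted defaults except for the learning rate, and the quadratic objective is an invented test function; real frameworks apply the same update per parameter tensor.

```python
import math

def adam_minimize(grad, w, steps=200, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update for one scalar parameter w, given its gradient function."""
    m = 0.0  # first moment: running mean of gradients
    v = 0.0  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = (w - 3)**2 with gradient 2 * (w - 3).
w_final = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
```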

Answer 70

Adam tends to converge faster, while SGD often converges to more optimal solutions.

Answer 71

You can change the learning rate to have a higher learning rate in the first layers and a lower learning rate in the final layers.

Answer 72

Simply stop training when the validation error is at its minimum.
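
In practice the minimum is only known in hindsight, so a common approach is to stop once the validation error has not improved for a few epochs. The sketch below assumes such a "patience" rule; the function name, the patience value and the error values are made up.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.
    Training stops once `patience` epochs pass without a new best error."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_errors) - 1

# Hypothetical validation errors per epoch: improving, then rising again.
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, stop_epoch = early_stopping_epoch(val_errors)
print(best_epoch, stop_epoch)  # 3 5
```

The weights saved at the best epoch are the ones to keep, not the weights from the epoch at which training stopped.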

Answer 73

Model checkpointing means saving the weights learned by a model in the middle of a long-running training process, so that you can resume training from a certain checkpoint.

Answer 74

Neural nets used in the area of computer vision are generally Convolutional Neural Networks (CNNs). You can learn about convolutions below. Convolutions are quite powerful when it comes to working with images and videos due to their ability to extract and learn complex features. Thus CNNs are a go-to method for any problem in computer vision.

Answer 75

The idea of the convolutional layer is the assumption that the information needed for making a decision often is spatially close and thus, it only takes the weighted sum over nearby inputs. It also assumes that the networks’ kernels can be reused for all nodes, hence the number of weights can be drastically reduced. To counteract only one feature being learnt per layer, multiple kernels are applied to the input which creates parallel channels in the output. Consecutive layers can also be stacked to allow the network to find more high-level features.

Answer 76

A fully-connected layer needs one weight per inter-layer connection, which means the number of weights that need to be computed quickly balloons as the number of layers and nodes per layer is increased.

Answer 77

Pooling is a technique to downsample the feature map. It allows layers which receive relatively undistorted versions of the input to learn low level features such as lines, while layers deeper in the model can learn more abstract features such as texture.

Answer 78

Max pooling is a technique where the maximum value of a receptive field is passed on to the next feature map. The most commonly used receptive field is 2 x 2 with a stride of 2, which means the feature map is downsampled from N x N to N/2 x N/2. Receptive fields larger than 3 x 3 are rarely employed as too much information is lost. Other pooling techniques include: average pooling, where the output is the average value of the receptive field; min pooling, where the output is the minimum value of the receptive field; and global pooling, where the receptive field is set to be equal to the input size, which means the output is a scalar and can be used to reduce the dimensionality of the feature map.
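
A 2 x 2 pooling pass with stride 2 can be written directly on nested lists. This is an illustrative sketch (the function name, the `reducer` parameter and the feature map values are invented) that assumes the input has an even number of rows and columns.

```python
def pool2x2(feature_map, reducer=max):
    """Apply `reducer` to each non-overlapping 2 x 2 window (stride 2)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[reducer([feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1]])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

max_pooled = pool2x2(feature_map)
# [[4, 2], [2, 8]]
avg_pooled = pool2x2(feature_map, reducer=lambda w: sum(w) / len(w))
# [[2.5, 1.0], [1.25, 6.5]]
min_pooled = pool2x2(feature_map, reducer=min)
# [[1, 0], [0, 5]]
```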

Answer 79

CNNs are not resistant to rotation by design. However, we can make our models resistant by augmenting our datasets with different rotations of the raw data. The predictions of a CNN will change if an image is rotated and we did not augment our dataset accordingly. A demonstration of this occurrence can be seen in this video, where a CNN changes its predicted class between a duck and a rabbit based on the rotation of the image.

Answer 80

Augmentations are an artificial way of expanding existing datasets by applying transformations, color shifts or many other operations to the data. They help diversify the data and even increase its amount when there is a scarcity of data for a model to train on.

Answer 81

There are many kinds of augmentations, which can be used according to the type of data you are working on. Some of them are geometric and numerical transformations, PCA, cropping, padding, shifting, noise injection, etc.

Answer 82

Augmentations really depend on the type of output classes and the features you want your model to learn. For example, if you have mostly properly illuminated images in your dataset and want your model to handle poorly illuminated images too, you can apply channel shifting to your data and include the resulting images in your dataset for better results.

Answer 83

Image classification: Inception v3, Xception, DenseNet, AlexNet, VGG16, ResNet, SqueezeNet, EfficientNet, MobileNet. The last three are designed to use a smaller number of parameters, which is helpful for edge AI.

Answer 84

Given a source domain D_S and learning task T_S, a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T, or T_S ≠ T_T. In other words, transfer learning enables us to reuse knowledge coming from other domains or learning tasks. In the context of CNNs, we can use networks that were pre-trained on popular datasets such as ImageNet. We can then use the weights of the layers that learn to represent features and combine them with a new set of layers that learns to map the feature representations to the given classes. Two popular strategies are either to freeze the layers that learn the feature representations completely, or to give them a smaller learning rate.

Answer 85

Object detection is finding bounding boxes around objects in an image. Architectures: YOLO, Faster R-CNN, CenterNet

Answer 86

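Object segmentation is predicting pixel-level masks. Semantic segmentation (e.g. U-Net) labels each pixel with a class and does not differentiate individual objects of the same class, whereas instance segmentation (e.g. Mask R-CNN) predicts a separate mask per object. Architectures: Mask R-CNN, U-Net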

Answer 87

Machine learning classification algorithms predict a class based on a numerical feature representation. This means that in order to use machine learning for text classification, we need to extract numerical features from our text data first before we can apply machine learning algorithms. Common approaches to extract numerical features from text data are bag of words, N-grams or word embeddings.

Answer 88

Bag of Words is a representation of text that describes the occurrence of words within a document. The order or structure of the words is not considered. For text classification, we look at the histogram of the words within the text and consider each word count as a feature.
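
A minimal sketch of the word-count representation described above. It assumes lowercasing and whitespace splitting in place of a real tokenizer, and `bag_of_words` is an illustrative helper name, not a library function.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count vectors) for a list of strings."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One feature per vocabulary word: how often it occurs in the document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocab, vectors = bag_of_words(["this is interesting", "is this interesting"])
```

Both sentences map to the same count vector, because the representation keeps word counts and discards word order.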

Answer 89

Advantages: Simple to understand and implement. Disadvantages: The vocabulary requires careful design, specifically to manage its size, which impacts the sparsity of the document representations. Sparse representations are harder to model, both for computational reasons (space and time complexity) and for information reasons. Discarding word order ignores the context, and in turn the meaning, of words in the document. Context and meaning can offer a lot to the model: if modeled, they could tell the difference between the same words arranged differently (“this is interesting” vs “is this interesting”) and recognize synonyms (“old bike” vs “used bike”).

Answer 90

An n-gram is a contiguous sequence of n words (or tokens) in a text, and n-gram tokenization splits a text into all such sequences. It can be used to find the most frequently co-occurring words (how often word X is followed by word Y) in a given text.
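
As a rough illustration, n-grams can be extracted with a sliding window over the token list. The helper name `ngrams` and the toy sentence are assumptions made for this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Count how often each word pair occurs (how often X is followed by Y).
bigrams = ngrams("the cat sat on the cat".split(), 2)
bigram_counts = Counter(bigrams)
```

Here `bigram_counts.most_common(1)` would surface the most frequent pair in the toy sentence.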

Answer 91

Between 3 and 4

Answer 92

Term Frequency (TF) is a scoring of the frequency of the word in the current document. Inverse Document Frequency (IDF) is a scoring of how rare the word is across documents. It is used in scenarios where highly recurring words may not contain as much informational content as domain-specific words. For example, words like “the” that are frequent across all documents need to be weighted less. The TF-IDF score highlights words that are distinct (contain useful information) in a given document.
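
A small sketch of one common TF-IDF variant, assuming tf = count / document length and idf = log(N / document frequency). Libraries typically use smoothed or normalized versions, so the exact numbers will differ from theirs. The function name `tf_idf` and the toy corpus are illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document (unsmoothed variant)."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


scores = tf_idf(["the cat sat", "the dog barked", "the bird sang"])
```

In this toy corpus “the” appears in every document, so its score is zero, while words unique to one document receive a positive score.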

Answer 93

Bag of Words model, Word2Vec embeddings, fastText embeddings, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT)

Answer 94

Usually logistic regression is better because bag of words creates a matrix with large number of columns. For a huge number of columns logistic regression is usually faster than gradient boosting trees.

Answer 95

Word embeddings are vector representations of words. Each word is mapped to one vector that tries to capture some characteristics of the word, allowing similar words to have similar vector representations. Word embeddings help capture inter-word semantics and represent them in real-valued vectors. Word2Vec is a method to construct such an embedding. It takes a text corpus as input and outputs a set of vectors which represent the words in that corpus. It can be trained using two methods: Continuous Bag of Words (CBOW) and Skip-Gram.

Answer 96

TF-IDF, GloVe, BERT

Answer 97

Approaches ranked from simple to more complex: 1. Take an average over all word vectors. 2. Take a weighted average over all word vectors. Weighting can be done by inverse document frequency (the idf part of tf-idf). 3. Use an ML model like an LSTM or Transformer.

Answer 98

Logistic regression as it is pretty straightforward and trees are not suited for very high dimensions

Answer 99

You represent your text as embeddings and then use Dense layers (words are considered independently).

Answer 100

Once your text is represented as embeddings, you can use conv2d layers to take into account the close context around each word.

Answer 101

Unsupervised learning aims to detect patterns in data where no labels are given.

Answer 102

Clustering algorithms group objects such that similar feature points are put into the same groups (clusters) and dissimilar feature points are put into different clusters.

Answer 103

1. Partition points into k subsets. 2. Compute the seed points as the new centroids of the clusters of the current partitioning. 3. Assign each point to the cluster with the nearest seed point. 4. Go back to step 2 or stop when the assignment does not change.
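
The loop described above can be sketched in plain Python. This is a toy version under stated assumptions: the initial partition is a deterministic round-robin assignment (random seeding is more common in practice), distances are squared Euclidean, and there are at least k points. The name `k_means` is illustrative.

```python
def k_means(points, k, max_iter=100):
    """Cluster equal-length numeric tuples; return (labels, centroids)."""
    # Initial partition. Assumption: round-robin assignment, chosen only
    # to keep the example deterministic.
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        # Assign each point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


labels, centroids = k_means([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

On this toy input the two nearby pairs end up in separate clusters after two passes.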

Answer 104

Domain knowledge, i.e. an expert knows the value of k. Elbow method: compute the clusters for different values of k, calculate the total within-cluster sum of squares for each k, plot the sum against the number of clusters and use the bend (elbow) of the curve as the number of clusters. Average silhouette method: compute the clusters for different values of k, calculate the average silhouette of the observations for each k, plot the silhouette against the number of clusters and select the maximum as the number of clusters.

Answer 105

k-medoids: Takes the most central point instead of the mean value as the center of the cluster. This makes it more robust to noise. Agglomerative Hierarchical Clustering (AHC): hierarchical clusters combining the nearest clusters starting with each point as its own cluster. DIvisive ANAlysis Clustering (DIANA): hierarchical clustering starting with one cluster containing all points and splitting the clusters until each point describes its own cluster. Density-Based Spatial Clustering of Applications with Noise (DBSCAN): Cluster defined as maximum set of density-connected points.

Answer 106

Two input parameters: epsilon (neighborhood radius) and minPts (minimum number of points in an epsilon-neighborhood). A cluster is defined as a maximal set of density-connected points. Points p_j and p_i are density-connected w.r.t. epsilon and minPts if there is a point o such that both p_i and p_j are density-reachable from o w.r.t. epsilon and minPts. p_j is density-reachable from p_i w.r.t. epsilon and minPts if there is a chain of points p_i -> p_i+1 -> ... -> p_i+x = p_j such that each point in the chain is directly density-reachable from its predecessor. p_j is directly density-reachable from p_i if dist(p_i, p_j) <= epsilon and p_i is a core point, i.e. its epsilon-neighborhood contains at least minPts points.

Answer 107

DBSCAN is more robust to noise. DBSCAN is better when the number of clusters is difficult to guess. K-means has a lower complexity, i.e. it will be much faster, especially with a larger number of points.

Answer 108

Data in only one dimension is relatively tightly packed. Adding a dimension stretches the points across that dimension, pushing them further apart. Additional dimensions spread the data even further making high dimensional data extremely sparse. We care about it, because it is difficult to use machine learning in sparse spaces.

Answer 109

Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), T-distributed Stochastic Neighbor Embedding (t-SNE), Autoencoders, Fourier and Wavelet Transforms

Answer 110

Singular Value Decomposition (SVD) is a general matrix decomposition method that factors a matrix X into three matrices L (left singular vectors), Σ (diagonal matrix of singular values) and R^T (right singular vectors). For machine learning, Principal Component Analysis (PCA) is typically used. It can be computed as an SVD of the mean-centered data matrix: the right singular vectors correspond to the eigenvectors of the covariance matrix, and the squared singular values are proportional to its eigenvalues. We use these features as they are statistically descriptive. Having calculated the eigenvectors and eigenvalues, we can use the Kaiser-Guttman criterion, a scree plot or the proportion of explained variance to determine the principal components (i.e. the final dimensionality) that are useful for dimensionality reduction.

Answer 111

MAP, precision, recall, p@10, AUC ROC

Answer 112

Precision at k and recall at k are evaluation metrics for ranking algorithms. Precision at k shows the share of relevant items in the first k results of the ranking algorithm. Recall at k indicates the share of relevant items returned in the top k results out of all relevant items for a given query. Example: For a search query "Car" there are 3 relevant products in your shop. Your search algorithm returns 2 of those relevant products in the first 5 search results. Precision at 5 = number of relevant products in the top 5 results / k = 2/5 = 40%. Recall at 5 = number of relevant products in the top 5 results / number of all relevant products = 2/3 ≈ 66.7%.
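
The arithmetic in the example can be reproduced with two short functions. The product identifiers below are made up for illustration; only the counts (3 relevant products, 2 of them in the top 5) come from the example above.

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Share of the top-k results that are relevant."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / k


def recall_at_k(ranked_items, relevant_items, k):
    """Share of all relevant items that appear in the top-k results."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)


# Hypothetical search results for the query "Car".
relevant = {"car_a", "car_b", "car_c"}
results = ["car_a", "bike", "car_b", "boat", "bus", "car_c"]

p5 = precision_at_k(results, relevant, 5)  # 2 hits in the top 5
r5 = recall_at_k(results, relevant, 5)     # 2 of the 3 relevant products
```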

Answer 113

APK (average precision at k) is a measure of the average relevance scores of the set of top-K documents presented in response to a query. In the APK metric, the order of the result set matters: the APK score is higher if the result documents are relevant and the relevant documents are presented higher in the results. It is, thus, a good metric for recommender systems.
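
A sketch of one common way to compute average precision at k. Definitions vary between libraries; this version assumes the sum of precision values at each relevant rank is divided by min(k, number of relevant items). The function name and toy rankings are illustrative.

```python
def average_precision_at_k(ranked_items, relevant_items, k):
    """Average of precision@i over the ranks i <= k that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant_items:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalize by min(k, number of relevant items).
    denominator = min(k, len(relevant_items))
    return precision_sum / denominator if denominator else 0.0


relevant = {"a", "b"}
ap_relevant_first = average_precision_at_k(["a", "b", "x"], relevant, 3)
ap_relevant_last = average_precision_at_k(["x", "a", "b"], relevant, 3)
```

Both rankings contain the same relevant items, but the one that places them first scores higher, which is the order sensitivity described above.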

Answer 114

Recommender systems are software tools and techniques that provide suggestions for items that are most likely of interest to a particular user.

Answer 115

A good recommender system should give relevant and personalized information. It should not recommend items the user knows well or finds easily. It should make diverse suggestions. A user should be able to explore new items.

Answer 116

Collaborative filtering is the most prominent approach to generate recommendations. It uses the wisdom of the crowd, i.e. it gives recommendations based on the experience of others. A recommendation is calculated as the average of other experiences. Say we want to give a score that indicates how much user u will like an item i. Then we can calculate it with the experience of N other users U as r_ui = 1/N * sum(v in U) r_vi. In order to rate similar experiences with a higher weight, we can introduce a similarity between users that we use as a multiplier for each rating. Also, as users have an individual profile, one user may have an average rating much larger than another user, so we use normalization techniques (e.g. centering or Z-score normalization) to remove the users' biases. Collaborative filtering only needs a rating matrix as input and improves over time. However, it does not work well on sparse data, does not work for cold starts (see below) and usually tends to overfit.
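
A minimal sketch of the similarity-weighted, mean-centered variant described above. It assumes the user similarities are already given (computing them, e.g. with cosine similarity, is out of scope here); `predict_rating` and the toy ratings are hypothetical.

```python
def predict_rating(user, item, ratings, similarity):
    """Predict a rating from other users' mean-centered ratings.

    ratings: {user: {item: rating}}
    similarity: {(user, other_user): weight}; assumed precomputed.
    """
    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    user_mean = mean(ratings[user].values())
    numerator = 0.0
    denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = similarity.get((user, other), 0.0)
        # Centering removes each user's individual rating bias.
        numerator += weight * (other_ratings[item] - mean(other_ratings.values()))
        denominator += abs(weight)
    if denominator == 0:
        return user_mean  # no usable neighbors: fall back to the user's mean
    return user_mean + numerator / denominator


ratings = {
    "u": {"a": 4, "b": 2},
    "v": {"a": 5, "b": 3, "c": 5},
    "w": {"a": 1, "b": 1, "c": 4},
}
similarity = {("u", "v"): 1.0, ("u", "w"): 0.0}
prediction = predict_rating("u", "c", ratings, similarity)
```

The fallback to the user's own mean also shows why a brand-new item, with no ratings from anyone, cannot be scored meaningfully this way.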

Answer 117

In comparison to explicit feedback, implicit feedback datasets lack negative examples. For example, explicit feedback can be a positive or a negative rating, but implicit feedback may be the number of purchases or clicks. One popular approach to solve this problem is named weighted alternating least squares (wALS) [Hu, Y., Koren, Y., & Volinsky, C. (2008, December). Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on (pp. 263-272). IEEE.]. Instead of modeling the rating matrix directly, the numbers (e.g. amount of clicks) describe the strength in observations of user actions. The model tries to find latent factors that can be used to predict the expected preference of a user for an item.

Answer 118

Collaborative filtering incorporates crowd knowledge to give recommendations for certain items. Say we want to predict how much a user will like an item: we then calculate the score using the ratings of other users for this item. We can now distinguish between two different kinds of cold start problem. First, if there is a new item that has not been rated yet, we cannot give any recommendation. Second, when there is a new user, we cannot calculate a similarity to any other user.

Answer 119

Content-based filtering incorporates features about items to calculate a similarity between them. In this way, we can recommend items that have a high similarity to items that a user liked already. We are then no longer dependent on the ratings of other users for a given item, which solves the cold start problem for new items. Demographic filtering incorporates user profiles to calculate a similarity between them and solves the cold start problem for new users.

Answer 120

A time series is a set of observations ordered in time usually collected at regular intervals.

Answer 121

The principle behind causal forecasting is that the value to be predicted depends on the input features (causal factors). In time series forecasting, the value to be predicted is expected to follow a certain pattern over time.

Answer 122

Simple Exponential Smoothing: forecasts with a weighted average of past observations whose weights decay exponentially with age. Trend-Corrected Exponential Smoothing (Holt's Method): exponential smoothing that also models the trend. Trend- and Seasonality-Corrected Exponential Smoothing (Holt-Winters' Method): exponential smoothing that also models trend and seasonality. Time Series Decomposition: decomposes a time series into the four components trend, seasonal variation, cyclical variation and irregular component. Autoregressive models: similar to multiple linear regression, except that the dependent variable y_t depends on its own previous values rather than other independent variables. Deep learning approaches (RNN, LSTM, etc.).
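
Simple exponential smoothing is usually written as the recursion level_t = alpha * y_t + (1 - alpha) * level_(t-1). A small sketch, assuming the level is initialized with the first observation (other initializations exist) and using an illustrative function name:

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation (0 < alpha <= 1)."""
    level = series[0]  # Assumption: initialize with the first observation.
    levels = [level]
    for value in series[1:]:
        # New level = weighted blend of the newest value and the old level.
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


levels = simple_exponential_smoothing([10, 20, 30], alpha=0.5)
```

The last level serves as the one-step-ahead forecast. Because this method models no trend, it lags behind a steadily rising series, which is the gap Holt's method addresses.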

Answer 123

We can explicitly model the trend (and/or seasonality) with approaches such as Holt's Method or Holt-Winters' Method. We want to explicitly model the trend to reach the stationarity property for the data. Many time series approaches require stationarity. Without stationarity, the interpretation of the results of these analyses is problematic [Manuca, Radu & Savit, Robert. (1996). Stationarity and nonstationarity in time series analysis. Physica D: Nonlinear Phenomena. 99. 134-161. 10.1016/S0167-2789(96)00139-X.].

Answer 124

We want to look at the correlation between different observations of y. This measure of correlation is called autocorrelation. Autoregressive models are multiple regression models where the time-lag series of the original time series are treated like multiple independent variables.
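
As an illustration, the sample autocorrelation at a given lag can be computed directly. This sketch uses the common estimator that divides the lagged covariance sum by the total sum of squared deviations; the function name and toy series are assumptions.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag`."""
    n = len(series)
    mean = sum(series) / n
    # Total variation (undefined for a constant series: division by zero).
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance


alternating = [1, -1, 1, -1, 1, -1]
lag1 = autocorrelation(alternating, 1)  # neighbors move in opposite directions
lag2 = autocorrelation(alternating, 2)  # values two steps apart agree
```

Lags with high absolute autocorrelation are natural candidates for the lagged inputs of an autoregressive model.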

Answer 125

Given the assumption that the set of features has a meaningful causal relationship with y, a causal forecasting approach such as linear regression or multiple nonlinear regression might be useful. If there is a lot of data and the explainability of the results is not a high priority, we can also consider deep learning approaches.

Answer 126

Random Forest models are not able to extrapolate time series data or capture increasing/decreasing trends. If the validation data has values greater than those seen in training, the model will predict averages of the training data points instead.

Answer 127

If we repeatedly draw samples of a sufficiently large size from a population, the means of those samples (the sampling distribution of the mean) will be approximately normally distributed (assuming true random sampling), with the mean tending to the mean of the population and variance equal to the variance of the population divided by the sample size. What’s especially important is that this holds regardless of the distribution of the original population, provided its variance is finite.
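
A quick simulation can make this concrete. The sketch below assumes an exponentially distributed (clearly non-normal) population and fixed random seeds for reproducibility; all sizes and names are arbitrary choices for illustration.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (with replacement) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


# A skewed population: exponential with rate 1.
rng = random.Random(1)
population = [rng.expovariate(1.0) for _ in range(10_000)]

means = sample_means(population, sample_size=50, n_samples=2_000)
mean_of_means = statistics.fmean(means)
variance_of_means = statistics.variance(means)
expected_variance = statistics.pvariance(population) / 50
```

The mean of the sample means should land close to the population mean, and their variance close to the population variance divided by the sample size.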

Answer 128

Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined

Answer 129

A type I error occurs when the null hypothesis is true but is rejected. A type II error occurs when the null hypothesis is false but erroneously fails to be rejected.

Answer 130

Low bias machine learning algorithms — Decision Trees, k-NN and SVM High bias machine learning algorithms — Linear Regression, Logistic Regression

Answer 131

Correlation measures how strongly two variables are related. Covariance is a measure that indicates the extent to which two random variables change together.
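
The two measures can be compared directly in code. This sketch uses the sample covariance (dividing by n - 1) and Pearson correlation; the function names and toy data are illustrative.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the range [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [1, 2, 3, 4]
ys = [10, 20, 30, 40]
ys_scaled = [y * 10 for y in ys]
```

Rescaling `ys` changes the covariance by the same factor but leaves the correlation unchanged, which is why correlation is easier to compare across variables with different units.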

Answer 132

Point estimation gives us a particular value as an estimate of a population parameter. A confidence interval gives us a range of values which is likely to contain the population parameter.

Answer 133

It is a hypothesis test for a randomized experiment with two variants, A and B.

Answer 134

When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. p-value is the minimum significance level at which you can reject the null hypothesis. The lower the p-value, the more likely you reject the null hypothesis.

Answer 135

The key idea is to resample from the original data, either directly or via a fitted model, to create replicate datasets. * Bootstrap. Samples are drawn from the dataset with replacement (allowing the same instance to appear more than once in the sample), and the instances not drawn into the sample may be used as the test set. * k-fold Cross-Validation. A dataset is partitioned into k groups, where each group is given the opportunity of being used as a held-out test set, leaving the remaining groups as the training set. The k-fold cross-validation method specifically lends itself to the evaluation of predictive models that are repeatedly trained on one subset of the data and evaluated on a second held-out subset of the data.
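
Both schemes reduce to choosing index sets. A minimal sketch, assuming interleaved (unshuffled) folds and a fixed seed for the bootstrap; the function names are illustrative and not taken from any library.

```python
import random


def bootstrap_split(n, seed=0):
    """Draw n training indices with replacement; undrawn indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test


def k_fold_splits(n, k):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    # Assumption: interleaved folds without shuffling, to stay deterministic.
    folds = [list(range(n))[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(10, 5))
boot_train, boot_test = bootstrap_split(10)
```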

Answer 136

To combat overfitting: 1. Add noise 2. Feature selection 3. Increase training set 4. L2 (ridge) or L1 (lasso) regularization; L1 can drive weights to exactly zero, L2 only shrinks them 5. Use cross-validation techniques, such as k-fold cross-validation 6. Boosting and bagging 7. Dropout technique 8. Perform early stopping 9. Remove inner layers To combat underfitting: 1. Add features 2. Increase time of training

Answer 137

A confounder is a variable that influences both the dependent variable and the independent variable.

Answer 138

a. Selection bias b. Under coverage bias c. Survivorship bias

Answer 139

It is the logical error of focusing on the people or things that survived some process while overlooking those that did not, because of their lack of prominence.

Answer 140

Selection bias occurs when the sample obtained is not representative of the population intended to be analyzed. For instance, you select only Asians to perform a study on the world population height. Under coverage bias occurs when some members of the population are inadequately represented in the sample.

Answer 141

Softmax is used because it takes in a vector of real numbers and returns a probability distribution. ReLU is used because it helps avoid the vanishing gradient problem.

Answer 142

We usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. The eigenvalue can be thought of as the strength of the transformation in the direction of the eigenvector, or the factor by which the compression or stretching occurs.

Answer 143

Example 1 FN: What if Jury or judge decides to make a criminal go free? Example 2 FN: Fraud detection.

Answer 144

In the Banking industry giving loans is the primary source of making money but at the same time if your repayment rate is not good you will not make any profit, rather you will risk huge losses. Banks don’t want to lose good customers and at the same point in time, they don’t want to acquire bad customers. In this scenario, both the false positives and false negatives become very important to measure

Answer 145

A Training Set: * to fit the parameters, i.e. the weights. A Validation Set: * held out from the training data * for hyperparameter selection * to avoid overfitting. A Test Set: * for testing or evaluating the performance of a trained machine learning model, i.e. evaluating the predictive power and generalization.

Answer 146

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.

Answer 147

Principal component analysis (PCA) is a statistical method used in Machine Learning. It consists of projecting data from a higher dimensional space into a lower dimensional space while maximizing the variance captured along each retained dimension.

Answer 148

The most popular tree-based ensemble methods are: AdaBoost, Random Forest, and eXtreme Gradient Boosting (XGBoost).

Answer 149

Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances.

Answer 150

Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users. It works by searching a large group of people and finding a smaller set of users with tastes similar to a particular user In content based, we look only at the item level, recommending on similar items sold.

Answer 151

The following are the various steps involved in an analytics project: 1. Understand the Business problem 2. Explore the data and become familiar with it 3. Prepare the data for modeling by detecting outliers, treating missing values, transforming variables, etc. 4. After data preparation, start running the model, analyze the result and tweak the approach. This is an iterative step until the best possible outcome is achieved. 5. Validate the model using a new data set. 6. Start implementing the model and track the result to analyze the performance of the model over the period of time.

Answer 152

First identify the variables with missing values, then the extent of the missing values. If any patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation), or they can simply be ignored. Another option is assigning a default value, which can be the mean, minimum or maximum value, so understanding the data is important. For a categorical variable, a default value is assigned. For normally distributed data, use the mean value. If 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.

Answer 153

Within Sum of Squares (WSS) is generally used to explain the homogeneity within a cluster. If you plot WSS (the sum of the squared distances between each member of a cluster and its centroid) for a range of numbers of clusters, you get a plot generally known as the Elbow Curve. * The point after which you no longer see a significant decrease in WSS (Number of Clusters = 3 in the original example plot) is known as the bending point and is taken as K in K-Means. This is the widely used approach, but a few data scientists also use hierarchical clustering first to create dendrograms and identify the distinct groups from there.

Answer 154

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

Answer 155

Bagging trains similar learners on small sample populations and then takes a mean of all the predictions. In generalized bagging, you can use different learners on different populations. As you would expect, this helps reduce the variance error. Boosting is an iterative technique which adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, it tries to increase the weight of this observation, and vice versa. Boosting in general decreases the bias error and builds strong predictive models. However, it may overfit the training data.

Answer 156

Pros * Bagging helps when we face variance or overfitting in the model. It provides an environment to deal with variance by using N learners of the same size on the same algorithm. * During the sampling of training data, there are many observations which overlap, so the combination of these learners helps in overcoming the high variance. * Bagging uses the bootstrap sampling method (bootstrapping is any test or metric that uses random sampling with replacement and falls under the broader class of resampling methods). Cons * Bagging is not helpful in case of bias or underfitting in the data. * Bagging ignores the values with the highest and the lowest result, which may have a wide difference, and provides an average result.

Answer 157

Pros * Boosting takes care of the weightage of the higher accuracy sample and lower accuracy sample and then gives the combined results. * Net error is evaluated in each learning step. It works well with interactions. * Boosting helps when we are dealing with bias or underfitting in the data set. * Multiple boosting techniques are available, for example: AdaBoost, LPBoost, XGBoost, GradientBoost, BrownBoost. Cons * Boosting often ignores overfitting or variance issues in the data set. * It increases the complexity of the classification. * Time and computation can be a bit expensive.

Answer 158

Instead of using standard k-fold cross-validation, you should be aware that a time series is not randomly distributed data: it is inherently ordered chronologically. For time series data, you should use techniques like forward chaining, where the model is trained on past data and then evaluated on the data that follows.
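
One way to sketch forward chaining is an expanding training window followed by the next block as the test set. The equal block sizes and the function name below are assumptions made for illustration.

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (train, test) index lists where the test block always follows training."""
    # Assumption: n_folds + 1 equal blocks; leftover samples at the end are unused.
    fold_size = n_samples // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train = list(range(0, i * fold_size))  # everything seen so far
        test = list(range(i * fold_size, (i + 1) * fold_size))  # the next block
        yield train, test


splits = list(forward_chaining_splits(8, 3))
```

Every test index is later than every training index in its fold, so the model is never evaluated on data that precedes what it was trained on.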

Answer 159

A Box-Cox transformation is a statistical technique to transform non-normal dependent variables into a normal shape.

Answer 160

You will want to update an algorithm when: * You want the model to evolve as data streams through infrastructure * The underlying data source is changing * There is a case of non-stationarity (mean, variance change over the time) * The algorithm underperforms/results lack accuracy

Answer 161

There are two methods here: we can either initialize the weights to zero or assign them randomly.

Answer 162

Also referred to as “loss” or “error,” cost function is a measure to evaluate how good your model’s performance is. It’s used to compute the error of the output layer during backpropagation. We push that error backwards through the neural network and use that during the different training functions.

Answer 163

When your learning rate is too low, training of the model will progress very slowly as we are making minimal updates to the weights. It will take many updates before reaching the minimum point. If the learning rate is set too high, this causes undesirable divergent behavior to the loss function due to drastic updates in weights. It may fail to converge

Answer 164

Epoch – Represents one iteration over the entire dataset (everything put into the training model). * Batch – Refers to when we cannot pass the entire dataset into the neural network at once, so we divide the dataset into several batches. * Iteration – If we have 10,000 images as data and a batch size of 200, then an epoch should run 50 iterations (10,000 divided by 200).
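
The relationship between dataset size, batch size and iterations is a one-line calculation. This sketch assumes the last, possibly smaller, batch is kept rather than dropped; the function name is illustrative.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    # Assumption: a final partial batch still counts as one iteration.
    return math.ceil(n_samples / batch_size)


iterations = iterations_per_epoch(10_000, 200)
```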

Answer 165

There are four types of layers in a CNN: 1. Convolutional Layer – the layer that performs a convolutional operation, sliding several small windows (filters) over the data. 2. Activation Layer (ReLU Layer) – it brings non-linearity to the network and converts all the negative values to zero. The output is a rectified feature map. It follows each convolutional layer. 3. Pooling Layer – pooling is a down-sampling operation that reduces the dimensionality of the feature map. The stride is how far the window slides at each step; with max pooling, you take the maximum of each n x n window. 4. Fully Connected Layer – this layer recognizes and classifies the objects in the image.
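
The ReLU and pooling steps are simple enough to sketch on nested lists. This toy version assumes a 2D feature map, max pooling, and no padding; the names `relu` and `max_pool` are illustrative.

```python
def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Slide a size x size window by `stride` and keep each window's maximum."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, stride)
        ]
        for r in range(0, rows - size + 1, stride)
    ]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool(feature_map)  # 4x4 input becomes 2x2
```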

Answer 166

RNNs are a type of artificial neural network designed to recognize patterns in sequences of data, such as time series data from stock markets and government agencies.

Answer 167

Encoder Decoder or Sequence to Sequence RNNs are used a lot in translation services. The basic idea is that there are two RNNs, one an encoder that keeps updating its hidden state and produces a final single “Context” output. This is then fed to the decoder, which translates this context to a sequence of outputs.

Answer 168

Long Short-Term Memory (LSTM) is a special kind of recurrent neural network capable of learning long-term dependencies, remembering information for long periods as its default behavior. There are three steps in an LSTM network: * Step 1: The network decides what to forget and what to remember. * Step 2: It selectively updates cell state values. * Step 3: The network decides what part of the current state makes it to the output.

Answer 169

The goal of gradient descent is to minimize a given function which, in our case, is the loss function of the neural network. To achieve this goal, it performs two steps iteratively. 1. Compute the slope (gradient), i.e. the first-order derivative of the function at the current point. 2. Move from the current point in the direction opposite to the gradient (i.e. downhill) by an amount proportional to the gradient, scaled by the learning rate.
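
The two steps can be sketched for a one-dimensional function. The example assumes the toy loss f(x) = (x - 3)^2, whose gradient is 2 * (x - 3); the function name and default values are illustrative.

```python
def gradient_descent(gradient, start, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(n_steps):
        # Step 1: compute the slope at the current point.
        slope = gradient(x)
        # Step 2: move in the opposite direction, scaled by the learning rate.
        x -= learning_rate * slope
    return x


# Toy loss f(x) = (x - 3)^2 has its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)

# With a learning rate that is too high, each step overshoots further.
diverged = gradient_descent(lambda x: 2 * (x - 3), start=0.0,
                            learning_rate=1.1, n_steps=50)
```

With the default learning rate the iterate converges to the minimum, while the oversized rate moves it further away on every step.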

Answer 170

The explosion occurs through exponential growth by repeatedly multiplying gradients through the network layers that have values larger than 1.0. Solution: Use Gradient Clipping. Exploding gradients can still occur in very deep Multilayer Perceptron networks with a large batch size and LSTMs with very long input sequence lengths. If exploding gradients are still occurring, you can check for and limit the size of gradients during the training of your network. This is called gradient clipping. Specifically, the values of the error gradient are checked against a threshold value and clipped or set to that threshold value if the error gradient exceeds the threshold.
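
A rough sketch of two common clipping variants on a flat list of gradient values: clipping each value to a threshold, as described above, and rescaling the whole vector when its norm is too large (a widely used alternative, shown here as an addition). Function names are illustrative, not a framework API.

```python
import math


def clip_by_value(gradients, threshold):
    """Clamp each gradient value to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in gradients]


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]


clipped_values = clip_by_value([0.5, -3.0, 10.0], threshold=1.0)
clipped_norm = clip_by_norm([3.0, 4.0], max_norm=1.0)
```

Clipping by norm preserves the direction of the gradient, whereas clipping by value can change it.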

Answer 171

Dropout is a technique of randomly dropping out hidden and visible nodes of a network to prevent overfitting (typically dropping 20 per cent of the nodes). It roughly doubles the number of iterations needed for the network to converge. It is used to avoid overfitting, as it increases the capacity for generalization. Batch normalization is a technique to improve the performance and stability of neural networks by normalizing the inputs of every layer so that they have a mean output activation of zero and a standard deviation of one.
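
The normalization step of batch normalization can be sketched for a single feature across one batch. This omits the learnable scale and shift parameters and the running statistics used at inference time, so it is only the core idea; `batch_normalize` and `epsilon` are illustrative names.

```python
import math


def batch_normalize(values, epsilon=1e-5):
    """Shift and scale a batch of activations to roughly zero mean, unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # epsilon guards against division by zero for constant batches.
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]


normalized = batch_normalize([2.0, 4.0, 6.0, 8.0])
batch_mean = sum(normalized) / len(normalized)
batch_std = math.sqrt(sum((v - batch_mean) ** 2 for v in normalized) / len(normalized))
```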

(196 cards)