Machine Learning Flashcards
What is Semi-supervised Machine Learning?
Semi-supervised learning is the blend of supervised and unsupervised learning. The algorithm is trained on a mix of labeled and unlabeled data. Generally, it is utilized when we have a very small labeled dataset and a large unlabeled dataset.
In simple terms, an unsupervised algorithm is used to create clusters, and the existing labeled data is then used to label the rest of the unlabeled data. Semi-supervised algorithms rely on the continuity assumption, the cluster assumption, and the manifold assumption.
It is generally used to save the cost of acquiring labeled data. For example, protein sequence classification, automatic speech recognition, and self-driving cars.
What is the manifold assumption in semi-supervised learning?
The manifold assumption in semi-supervised learning states that
(a) the input space is composed of multiple lower-dimensional manifolds on which all data points lie and
(b) data points lying on the same manifold have the same label
What is the continuity assumption in semi-supervised learning?
The continuity assumption states that objects near each other tend to share the same group or label.
This assumption is also used in supervised learning, where decision boundaries separate the classes in the dataset.
In semi-supervised learning, the smoothness assumption additionally places decision boundaries in low-density regions.
What is the cluster assumption in semi-supervised learning?
The cluster assumption states that data are divided into different discrete clusters, and that points in the same cluster share the output label.
How do you choose which algorithm to use for a dataset?
This will depend on the business use case, the amount of labelled data available, and the application requirements.
What is supervised machine learning?
Supervised ML is when the algorithm is trained using a labeled dataset, which consists of pairs of input and output data.
The two main classes are regression and classification.
What are regression based algorithms?
In regression, the target variable is a continuous value. The goal of regression is to predict the value of the target variable based on the input variables.
Linear regression, polynomial regression, and decision trees are some of the examples of regression algorithms.
What are classification based algorithms?
In classification, the target variable is a categorical value. The goal of classification is to predict the class or category of the target variable based on the input variables.
Some examples of classification algorithms include logistic regression, decision trees, support vector machines, and neural networks.
What is linear regression?
Linear regression is a type of regression algorithm that is used to predict a continuous output value. It is one of the simplest and most widely used algorithms in supervised learning. In linear regression, the algorithm tries to find a linear relationship between the input features and the output value. The output value is predicted based on the weighted sum of the input features.
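A minimal sketch of this idea, assuming scikit-learn and synthetic data (the feature values and coefficients below are illustrative, not from the card):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is roughly a weighted sum of two features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression()
model.fit(X, y)

print(model.coef_)                  # learned weights, close to [3, -2]
print(model.intercept_)             # learned bias term
print(model.predict([[1.0, 1.0]]))  # prediction = weighted sum of inputs + intercept
```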
What is logistic regression?
Logistic regression is a type of classification algorithm that is used to predict a binary output variable. It is commonly used in machine learning applications where the output variable is either true or false, such as in fraud detection or spam filtering. In logistic regression, the algorithm tries to find a linear relationship between the input features and the output variable. The output variable is then transformed using a logistic function to produce a probability value between 0 and 1.
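A minimal sketch, again assuming scikit-learn and a synthetic binary problem standing in for something like spam detection:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (e.g. "spam" vs "not spam")
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns the logistic-transformed probability between 0 and 1
print(clf.predict_proba(X[:3]))  # probability of each class for the first 3 rows
print(clf.predict(X[:3]))        # hard labels using the default 0.5 threshold
```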
What are decision trees?
A decision tree is a type of algorithm that is used for both classification and regression tasks.
A decision tree consists of three components: decision nodes, leaf nodes, and a root node.
A decision tree algorithm divides a training dataset into branches, which further segregate into other branches. This sequence continues until a leaf node is attained. The leaf node cannot be segregated further.
The nodes in the decision tree represent attributes that are used for predicting the outcome.
It is used to model decisions and their possible consequences.
Each internal node in the tree represents a decision, while each leaf node represents a possible outcome. Decision trees can be used to model complex relationships between input features and output variables.
What are random forests?
Random forests are an ensemble learning technique that is used for both classification and regression tasks. They are made up of multiple decision trees that work together to make predictions. Each tree in the forest is trained on a different subset of the input features and data (data bagging). The final prediction is made by aggregating the predictions of all the trees in the forest.
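A minimal sketch assuming scikit-learn; max_features="sqrt" is one common choice for the random feature subset, not something specified in the card:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset; each tree sees a bootstrap sample of the rows and a random
# subset of the features (max_features), and predictions are aggregated by voting
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

print(forest.predict(X[:5]))  # majority vote across the 100 trees
print(forest.score(X, y))     # training accuracy
```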
Explain the K Nearest Neighbor Algorithm
The K Nearest Neighbor (KNN) is a supervised learning classifier. It uses proximity to classify labels or predict the grouping of individual data points. We can use it for regression and classification. KNN algorithm is non-parametric, meaning it doesn’t make an underlying assumption of data distribution.
In the KNN classifier:
We find the K neighbours nearest to a new, unlabelled point. In the sketch below, we choose k=5.
To find the five nearest neighbours, we calculate the Euclidean distance between the new point and all the labelled points, then choose the 5 closest points.
Suppose there are three red and two green points among those 5 neighbours. Since red has the majority, we assign the red label to the new point.
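A minimal numpy sketch of the procedure above, with made-up points and labels (the new point plays the role of the unlabelled point):

```python
import numpy as np
from collections import Counter

# Labelled points (2-D features) and their class labels
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
labels = np.array(["red", "red", "red", "green", "green", "green"])

new_point = np.array([1.5, 1.5])
k = 5

# Euclidean distance from the new point to every labelled point
distances = np.linalg.norm(X - new_point, axis=1)

# Indices of the k nearest neighbours, then majority vote over their labels
nearest = np.argsort(distances)[:k]
predicted = Counter(labels[nearest]).most_common(1)[0][0]
print(predicted)  # "red": 3 of the 5 nearest neighbours are red
```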
Is it true that we need to scale our feature values when they vary greatly?
Yes. Many algorithms use the Euclidean distance between data points, and if feature values are on very different scales, features with large ranges will dominate the distance and the results will be quite different. In most cases, outliers also cause machine learning models to perform worse on the test dataset.
We also use feature scaling to reduce convergence time. It will take longer for gradient descent to reach local minima when features are not normalized.
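A minimal sketch assuming scikit-learn's StandardScaler; the two features (age-like and income-like values) are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (e.g. age in years vs income in dollars)
X = np.array([[25, 40_000.0], [32, 120_000.0], [47, 60_000.0], [51, 300_000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```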
What are the different types of error present in machine learning?
There are mainly two types:
- Reducible errors: can be reduced to improve model accuracy. Such errors can be further classified into bias and variance.
- Irreducible errors: these errors will always be present in the model.
What is the bias/variance trade off?
For an accurate model, algorithms need both low variance and low bias. In practice this is difficult because the two are in tension:
- Decreasing variance tends to increase bias
- Decreasing bias tends to increase variance
Ideally, we need a model that accurately captures the regularities in the training data and also generalises well to unseen data; improving one of these usually comes at the expense of the other.
High bias + low variance = Underfitting
Low bias + high variance = Overfitting
The Bias-Variance trade-off is about finding the sweet spot to make a balance between bias and variance errors.
What is bias?
Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem (underfitting). It leads to high error on both training and test data.
What is variance?
Variance is the variability of the model prediction for a given data point, i.e. how spread out the predictions are. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before (overfitting). As a result, such models perform very well on training data but have high error rates on test data.
How can you deal with overfitting due to low bias?
Low bias occurs when the model predicts values close to the actual values; it is effectively mimicking the training dataset. Such a model does not generalize, which means that if it is tested on unseen data it will give poor results.
Bagging (a parallel ensemble technique): randomly create subsets of the training data, train the same algorithm on each subset in parallel, and take the consensus result. The combination of models reduces the variance and makes the result more reliable than a single model.
How can you deal with overfitting due to high variance?
Model with high variance pays a lot of attention to training data and does not generalize on the data which it hasn’t seen before (overfitted).
Regularization techniques: penalise large model coefficients to lower model complexity. Includes LASSO, Ridge Regression, and ElasticNet.
Dimensionality reduction via feature selection
Boosting: an iterative ensemble technique that adjusts weights based on the last classification (assigns higher weight to inaccurate predictions). Note that boosting primarily reduces bias and can itself overfit if run for too many iterations.
What is the interpretation of a ROC area under the curve?
The receiver operating characteristic (ROC) curve shows the trade-off between sensitivity and specificity.
- Sensitivity: the probability that the model predicts a positive outcome when the actual value is also positive (also known as recall or the true positive rate)
- Specificity: the probability that the model predicts a negative outcome when the actual value is also negative.
The curve is plotted using the false positive rate (FP/(TN + FP)) on the x-axis and the true positive rate (TP/(TP + FN)) on the y-axis.
The area under the curve (AUC) shows the model performance. If the area under the ROC curve is 0.5, then our model is completely random. The model with AUC close to 1 is the better model.
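A minimal sketch assuming scikit-learn, using a synthetic dataset and logistic regression purely as an example classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # points of the ROC curve
print(roc_auc_score(y_test, probs))  # ~0.5 = random, close to 1 = good model
```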
What are the methods of reducing dimensionality?
For dimensionality reduction, we can use feature selection or feature extraction methods.
Dimensionality reduction will decrease the computational cost of training, decrease storage requirements, and may improve the generalisation performance of the model
What is feature selection?
Feature selection is a process of selecting optimal features and dropping irrelevant features. We use Filter, Wrapper, and Embedded methods to analyze feature importance and remove less important features to improve model performance.
What is feature extraction?
Feature extraction transforms the data from a space with many dimensions into one with fewer dimensions. The aim is to preserve as much of the relevant information as possible while using fewer resources to process the data. Common extraction techniques are Principal Component Analysis (PCA), Kernel PCA, and Linear Discriminant Analysis (LDA).
What is the Filter method of feature selection?
A subset of features is selected based on their relationship to the target variable. The selection does not depend on any machine learning algorithm; instead, filter methods measure the “relevance” of the features to the output via statistical tests such as the following (a sketch using the χ² test follows the list):
Pearson’s Correlation: Linear correlation between two continuous variables
LDA (Linear Discriminant Analysis): a supervised linear algorithm that projects the data into a smaller subspace k (k < N-1) while maximising the separation between the classes. More specifically, the model finds linear combinations of the features that achieve maximum separability between the classes and minimum variance within each class.
ANOVA (Analysis of Variance): tests whether different input categories have significantly different values for the output variable.
χ² (Chi-squared): tests whether the occurrences of a specific feature and a specific class are independent using their frequency distribution. The null hypothesis is that the two variables are independent; large values of χ² indicate that the null hypothesis should be rejected. When selecting features, we wish to keep those that are highly dependent on the output.
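A minimal sketch of the χ² filter assuming scikit-learn's SelectKBest; the Iris dataset and k=2 are illustrative choices (χ² requires non-negative features):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Chi-squared filter: scores each (non-negative) feature against the class labels
X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 most relevant features
X_new = selector.fit_transform(X, y)

print(selector.scores_)  # chi-squared statistic per feature
print(X_new.shape)       # (150, 2)
```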
What is the Wrapper method of feature selection?
Wrapper methods select features by measuring the performance of a model actually trained on a candidate subset of features.
Greedy search plus an evaluation criterion: at each iteration the method chooses the locally optimal subset of features, and the evaluation criterion (an evaluation metric such as a p-value, R-squared, etc.) plays the role of the judge.
The combination of features that gives the optimal results, according to the evaluation criterion, is selected. A sketch of backward elimination via recursive feature elimination follows the list below.
Four approaches:
- Forward selection
- Backward elimination
- Boruta
- Genetic algorithm
Computationally intensive.
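A minimal sketch of one wrapper approach (backward elimination via recursive feature elimination), assuming scikit-learn; the estimator and the number of features to keep are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Recursive feature elimination: repeatedly train the model and drop the
# weakest feature until only the requested number remains (backward elimination)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # 1 = selected; higher values were eliminated earlier
```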
What is the Embedded method of feature selection?
Embedded methods combine the advantageous aspects of both Filter and Wrapper methods.
Similar to Wrapper methods, however:
- perform feature selection and algorithm training together, rather than iteratively against an external evaluation metric (feature selection is an integral part of the model)
- are less subject to overfitting than Wrapper methods
Methods include:
- LASSO (Least Absolute Shrinkage and Selection Operator)
- Feature Importance
- Tree-based Methods (e.g. random forest)
- Permutation Importance
How does LASSO work?
Least Absolute Shrinkage and Selection Operator (LASSO) is a shrinkage method that performs both feature selection AND regularization at the same time. It penalises features likely to cause overfitting.
It is Linear Regression with L1 regularization.
Regularization is a process that shrinks the coefficients (weights) towards zero. This means that you are penalizing more complex models to avoid overfitting.
But how does this translate to feature selection? Other regularization techniques such as Ridge Regression or ElasticNet only shrink coefficients, whereas LASSO can set coefficients exactly to 0. If a coefficient is zero, the corresponding feature is not taken into consideration and is effectively discarded, which is what makes LASSO useful for feature selection.
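A minimal sketch assuming scikit-learn's Lasso on synthetic data where only the first two features matter; the alpha value is an illustrative choice:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data where only the first two of five features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1)  # alpha controls the strength of the L1 penalty
lasso.fit(X, y)

# Coefficients of the irrelevant features are shrunk to (exactly) zero,
# which is what makes LASSO usable for feature selection
print(lasso.coef_)
```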
What is regularisation and what methods can do this?
The goal of regularisation is to help the model generalise.
Regularization is an umbrella term for techniques that constrain, regularize, or shrink the coefficient estimates towards zero. In other words, it discourages learning an overly complex or flexible model (learning too much noise), so as to avoid the risk of overfitting.
Used when:
- Multicollinearity
- To filter out noise from data
- To prevent overfitting
- L2 (Ridge Regression): the RSS is modified by adding a shrinkage penalty proportional to the sum of the squared coefficients
- L1 (LASSO): the RSS is modified by adding a penalty proportional to the sum of the absolute values of the coefficients, which can shrink coefficients exactly to zero (so it can be used for feature selection)
- ElasticNet: combines the LASSO and Ridge penalties
What is RSS?
The fitting procedure of linear regression involves a loss function known as the residual sum of squares, or RSS: RSS = Σᵢ (yᵢ − ŷᵢ)². The coefficients are chosen such that they minimize this loss function.
How do you find thresholds for a classifier?
Usually, the threshold of a classifier is 0.5, but in some cases we need to fine-tune it to improve performance.
For example, in a spam classifier a 0.5 threshold means that if the predicted probability is 0.5 or above, the email is classified as spam; if it is lower, it is not spam.
To find the optimal threshold, we can use Precision-Recall curves, ROC curves, grid search, or manually change the value until we get a better cross-validation score.
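A minimal sketch of threshold tuning assuming scikit-learn; using the F1 score as the selection criterion is an illustrative choice, not something prescribed by the card:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

probs = LogisticRegression().fit(X_train, y_train).predict_proba(X_val)[:, 1]

# Scan candidate thresholds and keep the one with the best F1 score
precisions, recalls, thresholds = precision_recall_curve(y_val, probs)
f1_per_threshold = [f1_score(y_val, probs >= t) for t in thresholds]
best = thresholds[int(np.argmax(f1_per_threshold))]
print(best)  # threshold to use instead of the default 0.5
```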
What is Ensemble learning?
Ensemble learning combines insights of multiple machine learning models to improve the accuracy and performance metrics.
Simple methods:
- Mean/average: average predictions from multiple high-performing models.
- Weighted average: assign different weights to machine learning models based on their performance and then combine them.
Advanced ensemble methods:
- Bagging: used to minimize variance errors. It randomly creates subsets of training data and trains a model on each subset. The combination of models reduces the variance and makes it more reliable compared to a single model.
- Boosting: used to reduce bias errors. It is an iterative ensemble technique that adjusts the weights based on the last classification (sequential models). Boosting algorithms give more weight to observations that the previous model predicted inaccurately.
In semi-supervised machine learning, what is pseudo labelling?
Pseudo labelling is a method for generating labelled data:
- Train a model with labelled data.
- Use the trained model to predict labels for the unlabeled data, which creates pseudo-labeled data.
- Retrain the model with the pseudo-labeled and labeled data together.
This process happens iteratively as the model improves and is able to perform with a greater degree of accuracy.
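A minimal sketch assuming scikit-learn and synthetic data; the 0.95 confidence cut-off is an illustrative choice and, strictly speaking, makes this closer to the self-training variant described in the next card:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
X_lab, y_lab = X[:100], y[:100]  # small labelled set
X_unlab = X[100:]                # large "unlabelled" set

# 1. Train on the labelled data only
model = LogisticRegression().fit(X_lab, y_lab)

# 2. Predict labels for the unlabelled data (pseudo-labels),
#    keeping only confident predictions
probs = model.predict_proba(X_unlab)
confident = probs.max(axis=1) > 0.95
pseudo_labels = probs.argmax(axis=1)[confident]

# 3. Retrain on labelled + pseudo-labelled data together
X_combined = np.vstack([X_lab, X_unlab[confident]])
y_combined = np.concatenate([y_lab, pseudo_labels])
model = LogisticRegression().fit(X_combined, y_combined)
```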
In semi-supervised machine learning, what is Self Training?
Self-training is a variation of pseudo labeling. The difference is that in self-training we accept only the predictions that have high confidence, and we iterate through this process several times. In pseudo-labeling, by contrast, there is no confidence threshold that a prediction must meet to be used in the model.
Which metrics are used for evaluating a regression model?
There are three error metrics that are commonly used for evaluating and reporting the performance of a regression model; they are:
Mean Squared Error (MSE) - average of the squared differences between predicted and expected target values in a dataset
Root Mean Squared Error (RMSE) - units are the same as the original units of the target value that is being predicted (unlike MSE)
Mean Absolute Error (MAE) - units match target value units, and changes in MAE are linear and therefore intuitive (MSE and RMSE punish larger errors more than smaller errors, inflating or magnifying the mean error score)
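A minimal sketch of the three metrics assuming scikit-learn and numpy, with made-up true and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 12.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # linear penalty, same units

print(mse, rmse, mae)
```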
What is cross-validation and why is it important?
Assesses the performance and generalization ability of a predictive model.
It involves splitting the available dataset into multiple subsets or “folds” to evaluate the model on different combinations of training and validation data.
By using cross-validation, you can get a more robust and reliable estimation of a model’s performance compared to evaluating it on a single train-test split.
It helps to detect overfitting or underfitting issues, assess model stability, and make informed decisions about hyperparameter tuning or model selection.
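A minimal sketch assuming scikit-learn; 5 folds and logistic regression are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold,
# and repeat so every fold is used for validation once
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)         # one score per fold
print(scores.mean())  # more robust estimate than a single train-test split
```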
What are some usual applications of Rank algorithms?
- Search engines
- Recommender systems
- Travel agencies (finding best rooms etc)
What do ranking algorithms try to predict?
Ranking models typically work by predicting a relevance score s = f(x) for each input x = (q, d) where q is a query and d is a document.
What are the two approaches to ranking algorithms?
- Vector Space Models: compute vector embeddings (e.g. from a language model) for each query and document, then compute the relevance score f(x) = f(q, d) as the cosine similarity between the embeddings of q and d.
- Learning to Rank: a machine learning model that learns to predict a score s given an input x = (q, d) during a training phase in which some sort of ranking loss is minimized.
What are the six key evaluation metrics for ranking models?
For binary relevance:
1. Mean Average Precision (MAP)
2. Hit Ratio
3. Mean Reciprocal Rank (MRR)
4. Precision@K
5. Recall@K
For graded relevance:
6. Discounted Cumulative Gain (DCG) and its normalized variant, NDCG
How does Mean Average Precision work (ranking algorithm)?
Used for tasks with binary relevance, i.e. when the true score y of a document d can be only 0 (non relevant) or 1 (relevant).
For a given query q and corresponding documents D = {d₁, …, dₙ}, we check how many of the top k retrieved documents are relevant (y=1) or not (y=0), in order to compute precision Pₖ and recall Rₖ.
For k = 1…n we get different Pₖ and Rₖ values that define the precision-recall curve: the area under this curve is the Average Precision (AP).
Finally, by computing the average of AP values for a set of m queries, we obtain the Mean Average Precision (MAP).
How does Discounted Cumulative Gain work (ranking algorithm)?
Used for tasks with graded relevance, i.e. typical scale is 0 (bad), 1 (fair), 2 (good), 3 (excellent), 4 (perfect).
For a given query q and corresponding documents D = {d₁, …, dₙ}, we consider the k-th top retrieved document.
The gain Gₖ = 2^yₖ − 1 measures how useful this document is.
The discount Dₖ = 1/log(k+1) penalizes documents that are retrieved with a lower rank.
The discounted gain is GₖDₖ for k = 1…n (one term per retrieved document).
The sum of the discounted gains is the Discounted Cumulative Gain (DCG).
Normalized DCG = DCG / Ideal DCG, where the Ideal DCG is the score obtained if we ranked the documents by their true values yₖ.
Finally, we usually compute the average of DCG or NDCG values for a set of m queries to obtain a mean value.
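A minimal numpy sketch, using the common log₂ convention for the discount and an illustrative list of graded relevance values:

```python
import numpy as np

def dcg(relevance):
    """Discounted cumulative gain: gain 2^y - 1 discounted by 1/log2(k + 1)."""
    relevance = np.asarray(relevance, dtype=float)
    k = np.arange(1, len(relevance) + 1)
    return np.sum((2 ** relevance - 1) / np.log2(k + 1))

# Graded relevance (0-4) of documents in the order the model ranked them
ranked_relevance = [3, 2, 3, 0, 1]

ideal = sorted(ranked_relevance, reverse=True)  # best possible ordering
ndcg = dcg(ranked_relevance) / dcg(ideal)       # normalised DCG in [0, 1]
print(dcg(ranked_relevance), ndcg)
```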
What types of model are used for ranking?
The base machine learning model used to compute s = f(x) is usually a decision tree or a neural network.
The choice of loss function is the distinctive element for Learning to Rank models. In general, we have 3 approaches, depending on how the loss is computed:
- Pointwise Methods
- Pairwise Methods
- Listwise Methods
What are Pointwise Method loss functions in ranking algorithms?
The total loss is computed as the sum of loss terms defined on each document dᵢ (hence pointwise) as the distance between the predicted score sᵢ and the ground truth yᵢ, for i=1…n.
By doing this, we transform our task into a regression problem, where we train a model to predict y.
This is the simplest loss to implement and is used with regression metrics such as MSE, MAE, etc.
The problem is that we need true relevance scores to train the model. In most cases, we might only know the ordering of items and not their absolute relevance scores.
What are Pairwise Method loss functions in ranking algorithms?
The total loss is computed as the sum of loss terms defined on each pair of documents dᵢ, dⱼ (hence pairwise) , for i, j=1…n.
The objective on which the model is trained is to predict whether yᵢ > yⱼ or not, i.e. which of two documents is more relevant. By doing this, we transform our task into a binary classification problem.
This works only with relative preferences: given two documents, we want to predict whether the first is more relevant than the second, not their absolute relevance.
Binary classification losses such as binary cross-entropy can be used; some methods instead define the gradients directly (no explicit loss, only gradients) and apply gradient descent with them.
What are Listwise Method loss functions in ranking algorithms?
The loss is directly computed on the whole list of documents (hence listwise) with corresponding predicted ranks.
In this way, ranking metrics such as DCG can be more directly incorporated into the loss.
In contrast to pointwise/pairwise approaches, this solves the ranking problem more directly by optimising the evaluation metric itself; state-of-the-art results have been achieved with this approach.
e.g. LambdaLoss, SoftRank, ListNet (while RankNet is pairwise and LambdaRank sits between the pairwise and listwise approaches)
What is Cross-Entropy loss?
Cross-Entropy loss is one of the most important cost functions. It is used to optimize classification models.
Cross-Entropy takes the output probabilities (P) from the softmax and measures the distance from the true values.
It is also called logarithmic loss, log loss, or logistic loss. Each predicted class probability is compared to the actual desired output (0 or 1), and a score/loss is calculated that penalizes the probability based on how far it is from the expected value. The penalty is logarithmic in nature, yielding a large loss when the predicted probability for the true class is far from 1 and a small loss when it is close to 1.
Cross-entropy loss is used when adjusting model weights during training. The aim is to minimize the loss, i.e, the smaller the loss the better the model. A perfect model has a cross-entropy loss of 0.
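A minimal numpy sketch of categorical cross-entropy with made-up one-hot labels and softmax outputs:

```python
import numpy as np

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """Average categorical cross-entropy: -sum of true * log(predicted)."""
    y_pred_probs = np.clip(y_pred_probs, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_pred_probs), axis=1))

# One-hot truth labels and softmax output probabilities for 3 samples, 3 classes
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_pred = np.array([[0.9, 0.05, 0.05],  # confident and correct -> small loss
                   [0.2, 0.7, 0.1],    # correct but less confident
                   [0.3, 0.4, 0.3]])   # barely correct -> larger penalty

print(cross_entropy(y_true, y_pred))  # a perfect model would score 0
```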
What is Entropy in ML?
A key measure in information theory
The entropy of a random variable X is the level of uncertainty inherent in the variable's possible outcomes.
The greater the value of the entropy H(X), the greater the uncertainty of the probability distribution; the smaller the value, the less the uncertainty.
For a binary variable (using log base 2), entropy lies between 0 and 1.
What is the difference between Categorical Cross-Entropy and Sparse Categorical Cross-Entropy?
Both have the same cross entropy loss function. The only difference is how truth labels are defined.
- Categorical cross-entropy is used when true labels are one-hot encoded, for example, we have the following true values for 3-class classification problem [1,0,0], [0,1,0] and [0,0,1].
- Sparse categorical cross-entropy uses truth labels that are integer encoded, for example [1], [2] and [3] for a 3-class problem.
In Discounted Cumulative Gain, what is the Gain?
Gₖ = 2^yₖ − 1
Measures how useful this document is.
In Discounted Cumulative Gain, what is the Discount?
Dₖ = 1 / log(k+1)
Penalises documents that are retrieved with a lower rank
What is Gradient Descent?
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function.
The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent.
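A minimal sketch on a one-dimensional toy function; the function, starting point, and learning rate are illustrative:

```python
# Minimise f(x) = (x - 3)^2, whose gradient is f'(x) = 2 * (x - 3)
def grad(x):
    return 2 * (x - 3)

x = 10.0             # starting point
learning_rate = 0.1

for _ in range(100):
    x = x - learning_rate * grad(x)  # step in the direction opposite to the gradient

print(x)  # converges towards the minimum at x = 3
```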
What is least-squares regression?
Least-squares regression is where the ML goal is to “teach” a model F to predict values of the form
ŷ = F(x) by minimizing the mean squared error (equivalently, the residual sum of squares).
What is Gradient Boosting?
A variant of ensemble methods where you create multiple weak models sequentially, each one fitted to the errors of the current ensemble, and combine them to get better performance as a whole.
It is usually introduced for regression with squared loss (MSE), but it generalises to any differentiable loss function, including classification losses.
Qualities:
- Powerful enough to find any nonlinear relationship between your model target and features
- Can deal with missing values, outliers, and high cardinality categorical values without any special treatment.
e.g. XGBoost or LightGBM
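A minimal sketch assuming scikit-learn's GradientBoostingRegressor (a simpler stand-in for the XGBoost/LightGBM libraries mentioned above) on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# Each new shallow tree is fitted to the residuals (negative gradients of the
# squared loss) of the current ensemble, then added with a small learning rate
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3)
gbr.fit(X, y)
print(gbr.score(X, y))  # R^2 on the training data
```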
What are residuals?
Residuals are the prediction errors: the differences between the observed values and the values predicted by the model.
Define information retrieval
Information retrieval is the process of retrieving relevant information from a collection of unstructured or semi-structured data, typically text documents, in response to user queries or information needs.
It involves techniques and algorithms to efficiently search, analyze, and present information to users.
What is ‘relevance’ with regard to information retrieval?
In the context of information retrieval, “relevance” refers to the degree to which a document or information item satisfies the information needs or requirements of a user.
It is a measure of how closely a document aligns with the user’s query or information-seeking intent.
What are the key components of an information retrieval system?
- Document Collection
- Indexing: structured representation of the document collection to facilitate efficient searching
- Query Processing: parsing and understanding the user’s query and formulating an appropriate search strategy.
- Retrieval Model: the mathematical or statistical framework used to assess the relevance of documents to a given query; it determines how documents are scored or ranked based on their similarity or match to the query
- Ranking Algorithm: sorts retrieved documents based on their relevance scores
- User Interface: provides the means for users to interact with the information retrieval system
- Evaluation Metrics: Common metrics include precision, recall, F1 score, mean average precision (MAP), and normalized discounted cumulative gain (NDCG).
- Relevance Feedback: feedback on the retrieved results
How does Hit Ratio work in Information Retrieval?
It is simply the fraction of queries for which the correct answer is included in the recommendation list of length L.
How does Mean Reciprocal Rank work in Information Retrieval?
Measures how far down the ranking the first relevant document is.
If MRR is close to 1, it means relevant results are close to the top of search results - what we want! Lower MRRs indicate poorer search quality, with the right answer farther down in the search results.
How do precision@K and recall@K work in Information Retrieval?
Within the top K retrieved documents, what are the precision and recall?
i.e.
precision@K - the proportion of the top K documents that are true positives (indeed relevant)
recall@K - the proportion of all relevant documents that appear in the top K
What are LambdaRank and LambdaMart?
These are ranking algorithms, which are in-between pairwise and listwise methods.
LambdaMART: based on gradient boosted decision tree models. LambdaRank: based on neural networks.
What are the different types of Relevance feedback?
- Explicit - assessors indicate relevance of results explicitly using a binary or graded relevance system
- Implicit - based on user behaviours (documents selected for view)
- Pseudo -
- Take the top 10-50 results returned by initial query
- Select top 20-30 terms from these documents using e.g. tf-idf weights.
- Do query expansion: add these terms to the query, run the expanded query, and return the most relevant of the documents it retrieves
What is a Web Crawler?
A web crawler is a computer program that crawls through the web in a predefined and methodical manner to collect data.
The web crawler tool pulls together details about each webpage: titles, images, keywords, other linked pages, etc. It automatically maps the web to search documents, websites, RSS feeds, and email addresses. It then stores and indexes this data.
Web crawlers use several algorithms to rate the value of the content or the quality of the links in its index. These rules determine its crawling behavior: which sites to crawl, how often to re-crawl a page, how many pages on a site to be indexed, and so on.
When it visits a new website, it downloads its robots.txt file—the “robots exclusion standard” protocol designed to restrict unlimited access by web crawler tools. The file contains information about sitemaps (the URLs to crawl) and search rules (which pages are to be crawled and which parts to ignore).
What is Query Expansion in Information Retrieval?
Query expansion is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding.
Query expansion involves techniques such as:
- Finding synonyms of words, and searching for the synonyms as well
- Finding semantically related words (e.g. antonyms, meronyms, hyponyms, hypernyms)
- Finding all the various morphological forms of words by stemming each word in the search query
- Fixing spelling errors and automatically searching for the corrected form or suggesting it in the results
- Re-weighting the terms in the original query
What is Data Leakage?
Data leakage occurs when information from outside the training set (for example, from the validation or test set) inadvertently leaks into the training process. It can happen if the validation set is used to inform decisions during model training, such as feature selection or hyperparameter tuning.
In such cases, the model's performance on the validation set no longer provides a reliable evaluation of its generalization ability; the estimate will typically be over-optimistic.
What are the assumptions of linear regression?
- Linearity: relationship between predictor variables and response variable is assumed to be linear
- Independence: observations are independent of each other (no correlation or dependence)
- Homoscedasticity: variability of the residuals (differences between observed and predicted values) is constant across all levels of the predictor variables
- Normality: residuals follow a normal distribution. This allows for estimation of confidence intervals and hypothesis tests in linear regression. It is not necessary for the predictor variables to follow a normal distribution.
- No Multicollinearity: Multicollinearity refers to a high correlation between predictor variables
- No Endogeneity: predictor variables are exogenous, meaning they are not affected by the response variable. Endogeneity can lead to biased and inconsistent coefficient estimates.
What are the four main types of linear model?
- Linear regression
- Logistic regression
- Ridge regression
- LASSO regression
What are the five main types of tree-based model?
- Decision trees
- Random forest
- Gradient boosting regression
- XG Boost
- Light GBM Regressor
What are the three main types of clustering model?
- K-means
- Hierarchical clustering
- Gaussian mixture model
How does K-means clustering work?
- The number of centroids k is defined (use the elbow method to find the optimum)
- Centroids are initially assigned at random
- Each data point is assigned to its closest centroid (Euclidean distance)
- Each centroid is recomputed as the mean of its assigned points, and the assign/update steps are iterated to reduce the Euclidean distance between the centroids and their cluster data
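A minimal sketch assuming scikit-learn's KMeans on synthetic blob data; k=3 matches the number of generated blobs and is an illustrative choice:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_clusters is the number of centroids; fit() iterates the assign/update steps
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_)  # final centroid positions
print(kmeans.labels_[:10])      # cluster assignment of the first 10 points
```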
When is one hot encoding used and what are its limitations?
One hot encoding is used when you need to convert categorical data into numeric.
When there are many categories to encode, the feature matrix becomes very wide and sparse, and keeping every dummy column introduces multicollinearity (the dummy variable trap), so one column is often dropped.
Use pandas .get_dummies()
Most algorithms require an encoding such as one-hot because they operate on numerical data.
Algorithms that do not require an encoding are those that can deal directly with joint discrete distributions, such as Markov chains, Naive Bayes, Bayesian networks, and tree-based methods.
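A minimal sketch of pandas .get_dummies() on a made-up DataFrame; drop_first=True is one way to avoid the dummy-variable trap mentioned above:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"], "size": [1, 2, 3, 1]})

# One column per category; drop_first=True avoids perfect multicollinearity
# between the dummy columns (the dummy-variable trap)
encoded = pd.get_dummies(df, columns=["colour"], drop_first=True)
print(encoded)
```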
What is Principal Component Analysis?
PCA is the science of analyzing all the dimensions of a dataset and reducing them as much as possible while preserving as much of the information (variance) as possible.
When to use?
- Dimensionality reduction
- Categorize the dependent and independent variables in your data
- Eliminate noise components in your dimension analysis
Steps:
- Standardise data
- Compute covariance matrix to detect correlations
- Compute eigenvectors and eigenvalues from the covariance matrix to identify the Principal Components
- Create a feature vector to define the Principal Components
- Recast the data along the Principal Component axes
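A minimal sketch of these steps assuming scikit-learn (which performs the covariance/eigen decomposition internally); the Iris data and 2 components are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardise first so large-range features do not dominate the components
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)            # keep the top 2 principal components
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)  # share of variance captured per component
print(X_reduced.shape)                # (150, 2)
```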
What is meant by Data Standardisation?
- The range of the variables in a dataset is standardized so that each variable contributes equally to the analysis.
- Without standardization, variables with large ranges dominate the variables with small ranges.
- Standardizing helps you avoid biased results at the end of the analysis.
- To transform the variables to the same scale, use the following formula:
Z = (value - mean) / st dev
What are the key steps involved with a decision tree?
- Take the entire data set as input (accepts numerical and categorical data)
- Calculate entropy of the target variable, as well as the predictor attributes
- Calculate information gain of all attributes (we gain information on sorting different objects from each other)
- Choose attribute with the highest information gain as the root node
- Repeat procedure on every branch until the decision node of each branch is finalized
- The prediction at each leaf is the majority class (classification) or the average value (regression) of the training samples that reach that leaf
In machine learning, what is information gain and when is it used?
Information gain is a measure of how much information a feature provides about a class: it measures how uncertainty in the target variable is reduced, given a set of independent variables.
Information gain helps to determine the order of attributes in the nodes of a decision tree.
Defined by:
Gain = Entropy(parent node) - weighted average Entropy(child nodes)
Information gain is the entropy we “lost” after splitting the data at a node (i.e. how good the split was).
We want entropy to decrease as we progress down the decision tree.
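A minimal numpy sketch with made-up parent and child node labels, using log base 2 for the entropy:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (log base 2) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Parent node labels and the two child nodes produced by a candidate split
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left, right = np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])

# Weighted average entropy of the children
children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - children
print(info_gain)  # > 0 means the split reduced uncertainty
```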
What are the key differences between decision trees and random forest?
Main difference: in a Random Forest, root nodes and node splits are established randomly; each tree is built on a random bootstrap sample of the data and considers a random subset of features at each split.
- RF = an ensemble of decision trees
- RF employs the bagging method to generate the required prediction
- RF more time and computation intensive
- RF less sensitive to overfitting
Differentiate between univariate, bivariate, and multivariate analysis.
Univariate data contains only one variable. The purpose of the univariate analysis is to describe the data and find patterns that exist within it.
Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to determine the relationship between the two variables.
Multivariate data involves three or more variables. It is similar to a bivariate but contains more than one dependent variable.
How do you calculate Mean Square Error and RMSE?
Measures error by using the average squared difference between observed and predicted values.
MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
RMSE = √MSE (the square root of the MSE)
What is the elbow method?
Used to define optimal number of clusters for k-means
Determine the k-value by iteratively clustering k=1 to k=n (Here n is the hyperparameter that we choose as per our requirement).
For every value of k, we calculate the within-cluster sum of squares (WCSS) value.
WCSS: the sum of squared distances between each point in a cluster and its centroid.
To determine the optimal number of clusters k, plot a graph of k versus the WCSS value and look for the “elbow” where the decrease in WCSS flattens out.
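A minimal sketch assuming scikit-learn, where inertia_ is the WCSS; the synthetic blob data and the range k = 1…10 are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# WCSS (inertia_) for k = 1..10; plot k vs WCSS and look for the "elbow"
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

print(np.round(wcss, 1))  # the drop flattens noticeably around k = 4
```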
Which ML algorithm can be used to impute both categorical and numerical missing data?
K Nearest Neighbours
What are True Positive Rate and False Positive Rate?
TPR defines the probability that an actual positive will turn out to be positive.
TPR = TP / (TP + FN)
FPR defines the probability that an actual negative result will be shown as a positive one i.e the probability that a model will generate a false alarm.
FPR = FP / (FP + TN)
These are used to plot the ROC curve (y=TPR, x=FPR)
What is decision tree pruning?
Pruning simplifies the decision tree by reducing the rules; this helps to avoid complexity, reduce overfitting and improves accuracy.
Simplifies a decision tree by removing the weakest rules.
Pruning is often distinguished into:
- Pre-pruning (early stopping) stops the tree before it has completed classifying the training set
- Post-pruning allows the tree to classify the training set perfectly and then prunes the tree
Post-pruning starts with an unpruned tree, takes a sequence of subtrees (pruned trees), and picks the best one through cross-validation.
What is the difference between an error and a residual error?
Error = the deviation of an observed value from the true (unobservable) value, e.g. from the population mean
Residual = the deviation of an observed value from the estimated value, e.g. from the sample mean or the model's fitted value
What is overfitting?
- Occurs when the learning power of the model is too high OR the data is too small
- Model learns the noise rather than the information
- Model performs badly on unseen data
You will have an overfitting problem if, for example, you fit a regression model where the number of data points is less than the number of features.
Can remedy by:
- reducing the learning power (complexity) of the model
- Increasing size of training data
- Regularisation
What are L1 and L2 regularisation?
You can have Ln regularisations, but L1 (Lasso) and L2 (ridge) are most common
ElasticNet is a hybrid of these two
All features must be on comparable scales for regularisation to penalise them fairly
Both can be applied to all parametric models (regression, SVM, neural networks)
L1 can shrink some coefficients exactly to zero, so it can be used for feature selection; L2 only shrinks coefficients towards zero without eliminating them.
For correlated features, L1 tends to select the best one, whereas L2 spreads the coefficient weight (and error) amongst them.
In regularisation, what is alpha
Alpha controls the strength of the penalty and hence the amount of coefficient shrinkage: the larger alpha is, the greater the shrinkage.
What is ridge regression?
Ridge regression is the L2 form of regularisation: linear regression with an L2 penalty added to the loss.
L2 shrinks parameters and reduces influence of unimportant features. It is more stable than L1.
It is differentiable, so gradient descent can be used to optimise.
Spreads error amongst all terms; does not shrink parameters to zero, so cannot be used for feature selection
What Problems Do Multicollinearity Cause?
Multicollinearity causes the following two basic types of problems:
The coefficient estimates can swing wildly based on which other independent variables are in the model. The coefficients become very sensitive to small changes in the model.
Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model. You might not be able to trust the p-values to identify independent variables that are statistically significant.