Machine Learning Flashcards
What is Semi-supervised Machine Learning?
Semi-supervised learning is the blend of supervised and unsupervised learning. The algorithm is trained on a mix of labeled and unlabeled data. Generally, it is utilized when we have a very small labeled dataset and a large unlabeled dataset.
In simple terms, an unsupervised algorithm is used to create clusters, and the existing labeled data is then used to label the rest of the unlabeled data. Semi-supervised algorithms rely on the continuity assumption, the cluster assumption, and the manifold assumption.
It is generally used to save the cost of acquiring labeled data. For example, protein sequence classification, automatic speech recognition, and self-driving cars.
What is the manifold assumption in semi-supervised learning?
The manifold assumption in semi-supervised learning states that
(a) the input space is composed of multiple lower-dimensional manifolds on which all data points lie and
(b) data points lying on the same manifold have the same label
What is the continuity assumption in semi-supervised learning?
The continuity assumption states that objects near each other tend to share the same group or label.
This assumption is also used in supervised learning, where decision boundaries separate the classes in the dataset.
In semi-supervised learning, the smoothness assumption additionally places decision boundaries in low-density regions.
What is the cluster assumption in semi-supervised learning?
The cluster assumption states that data are divided into different discrete clusters, and that points in the same cluster share the output label.
How do you choose which algorithm to use for a dataset?
This will depend on the business use case, the amount of labelled data available, and the application requirements.
What is supervised machine learning?
Supervised ML is when the algorithm is trained using a labeled dataset, which consists of pairs of input and output data.
The two main classes are regression and classification.
What are regression based algorithms?
In regression, the target variable is a continuous value. The goal of regression is to predict the value of the target variable based on the input variables.
Linear regression, polynomial regression, and decision trees are some of the examples of regression algorithms.
What are classification based algorithms?
In classification, the target variable is a categorical value. The goal of classification is to predict the class or category of the target variable based on the input variables.
Some examples of classification algorithms include logistic regression, decision trees, support vector machines, and neural networks.
What is linear regression?
Linear regression is a type of regression algorithm that is used to predict a continuous output value. It is one of the simplest and most widely used algorithms in supervised learning. In linear regression, the algorithm tries to find a linear relationship between the input features and the output value. The output value is predicted based on the weighted sum of the input features.
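A minimal sketch of this idea, assuming scikit-learn and synthetic data (the feature values and coefficients below are illustrative, not from the card):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is roughly a weighted sum of two features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression()
model.fit(X, y)

print(model.coef_)                  # learned weights, close to [3, -2]
print(model.intercept_)             # learned bias term
print(model.predict([[1.0, 1.0]]))  # prediction = weighted sum of inputs + intercept
```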
What is logistic regression?
Logistic regression is a type of classification algorithm that is used to predict a binary output variable. It is commonly used in machine learning applications where the output variable is either true or false, such as in fraud detection or spam filtering. In logistic regression, the algorithm tries to find a linear relationship between the input features and the output variable. The output variable is then transformed using a logistic function to produce a probability value between 0 and 1.
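A minimal sketch, again assuming scikit-learn and a synthetic binary problem standing in for something like spam detection:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (e.g. "spam" vs "not spam")
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns the logistic-transformed probability between 0 and 1
print(clf.predict_proba(X[:3]))  # probability of each class for the first 3 rows
print(clf.predict(X[:3]))        # hard labels using the default 0.5 threshold
```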
What are decision trees?
A decision tree is a type of algorithm that is used for both classification and regression tasks.
A decision tree consists of three components: decision nodes, leaf nodes, and a root node.
A decision tree algorithm divides a training dataset into branches, which further segregate into other branches. This sequence continues until a leaf node is attained. The leaf node cannot be segregated further.
The nodes in the decision tree represent attributes that are used for predicting the outcome.
It is used to model decisions and their possible consequences.
Each internal node in the tree represents a decision, while each leaf node represents a possible outcome. Decision trees can be used to model complex relationships between input features and output variables.
What are random forests?
Random forests are an ensemble learning technique that is used for both classification and regression tasks. They are made up of multiple decision trees that work together to make predictions. Each tree in the forest is trained on a different subset of the input features and data (data bagging). The final prediction is made by aggregating the predictions of all the trees in the forest.
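A minimal sketch assuming scikit-learn; max_features="sqrt" is one common choice for the random feature subset, not something specified in the card:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset; each tree sees a bootstrap sample of the rows and a random
# subset of the features (max_features), and predictions are aggregated by voting
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

print(forest.predict(X[:5]))  # majority vote across the 100 trees
print(forest.score(X, y))     # training accuracy
```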
Explain the K Nearest Neighbor Algorithm
The K Nearest Neighbor (KNN) is a supervised learning classifier. It uses proximity to classify labels or predict the grouping of individual data points. We can use it for regression and classification. KNN algorithm is non-parametric, meaning it doesn’t make an underlying assumption of data distribution.
In the KNN classifier:
We find the K neighbours nearest to a new, unlabelled point. In the sketch below, we choose k=5.
To find the five nearest neighbours, we calculate the Euclidean distance between the new point and all the labelled points, then choose the 5 closest points.
Suppose there are three red and two green points among those 5 neighbours. Since red has the majority, we assign the red label to the new point.
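A minimal numpy sketch of the procedure above, with made-up points and labels (the new point plays the role of the unlabelled point):

```python
import numpy as np
from collections import Counter

# Labelled points (2-D features) and their class labels
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
labels = np.array(["red", "red", "red", "green", "green", "green"])

new_point = np.array([1.5, 1.5])
k = 5

# Euclidean distance from the new point to every labelled point
distances = np.linalg.norm(X - new_point, axis=1)

# Indices of the k nearest neighbours, then majority vote over their labels
nearest = np.argsort(distances)[:k]
predicted = Counter(labels[nearest]).most_common(1)[0][0]
print(predicted)  # "red": 3 of the 5 nearest neighbours are red
```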
Is it true that we need to scale our feature values when they vary greatly?
Yes. Many algorithms use the Euclidean distance between data points, and if feature values are on very different scales, features with large ranges will dominate the distance and the results will be quite different. In most cases, outliers also cause machine learning models to perform worse on the test dataset.
We also use feature scaling to reduce convergence time. It will take longer for gradient descent to reach local minima when features are not normalized.
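A minimal sketch assuming scikit-learn's StandardScaler; the two features (age-like and income-like values) are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (e.g. age in years vs income in dollars)
X = np.array([[25, 40_000.0], [32, 120_000.0], [47, 60_000.0], [51, 300_000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```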
What are the different types of error present in machine learning?
There are mainly two types:
- Reducible errors: can be reduced to improve model accuracy. Such errors can be further classified into bias and variance.
- Irreducible errors: these errors will always be present in the model.
What is the bias/variance trade off?
For an accurate model, algorithms need both low variance and low bias. In practice this is difficult because the two are in tension:
- Decreasing variance tends to increase bias
- Decreasing bias tends to increase variance
Ideally, we need a model that accurately captures the regularities in the training data and also generalises well to unseen data; improving one of these usually comes at the expense of the other.
High bias + low variance = Underfitting
Low bias + high variance = Overfitting
The Bias-Variance trade-off is about finding the sweet spot to make a balance between bias and variance errors.
What is bias?
Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem (underfitting). It leads to high error on both training and test data.
What is variance?
Variance is the variability of the model prediction for a given data point, i.e. how spread out the predictions are. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before (overfitting). As a result, such models perform very well on training data but have high error rates on test data.
How can you deal with overfitting due to low bias?
Low bias occurs when the model predicts values close to the actual values; it is effectively mimicking the training dataset. Such a model does not generalize, which means that if it is tested on unseen data it will give poor results.
Bagging (a parallel ensemble technique): randomly create subsets of the training data, train the same algorithm on each subset in parallel, and take the consensus result. The combination of models reduces the variance and makes the result more reliable than a single model.
How can you deal with overfitting due to high variance?
Model with high variance pays a lot of attention to training data and does not generalize on the data which it hasn’t seen before (overfitted).
Regularization techniques: penalise large model coefficients to lower model complexity. Includes LASSO, Ridge Regression, and ElasticNet.
Dimensionality reduction via feature selection
Boosting: an iterative ensemble technique that adjusts weights based on the last classification (assigns higher weight to inaccurate predictions). Note that boosting primarily reduces bias and can itself overfit if run for too many iterations.
What is the interpretation of a ROC area under the curve?
The receiver operating characteristic (ROC) curve shows the trade-off between sensitivity and specificity.
- Sensitivity: the probability that the model predicts a positive outcome when the actual value is also positive (also known as recall or the true positive rate)
- Specificity: the probability that the model predicts a negative outcome when the actual value is also negative.
The curve is plotted using the false positive rate (FP/(TN + FP)) on the x-axis and the true positive rate (TP/(TP + FN)) on the y-axis.
The area under the curve (AUC) shows the model performance. If the area under the ROC curve is 0.5, then our model is completely random. The model with AUC close to 1 is the better model.
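A minimal sketch assuming scikit-learn, using a synthetic dataset and logistic regression purely as an example classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # points of the ROC curve
print(roc_auc_score(y_test, probs))  # ~0.5 = random, close to 1 = good model
```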
What are the methods of reducing dimensionality?
For dimensionality reduction, we can use feature selection or feature extraction methods.
Dimensionality reduction will decrease the computational cost of training, decrease storage requirements, and may improve the generalisation performance of the model
What is feature selection?
Feature selection is a process of selecting optimal features and dropping irrelevant features. We use Filter, Wrapper, and Embedded methods to analyze feature importance and remove less important features to improve model performance.
What is feature extraction?
Feature extraction transforms the data from a space with many dimensions into one with fewer dimensions. The aim is to preserve as much of the relevant information as possible while using fewer resources to process the data. Common extraction techniques are Principal Component Analysis (PCA), Kernel PCA, and Linear Discriminant Analysis (LDA).
What is the Filter method of feature selection?
A subset of features is selected based on their relationship to the target variable. The selection does not depend on any machine learning algorithm; instead, filter methods measure the “relevance” of the features to the output via statistical tests such as the following (a sketch using the χ² test follows the list):
Pearson’s Correlation: Linear correlation between two continuous variables
LDA (Linear Discriminant Analysis): a supervised linear algorithm that projects the data into a smaller subspace k (k < N-1) while maximising the separation between the classes. More specifically, the model finds linear combinations of the features that achieve maximum separability between the classes and minimum variance within each class.
ANOVA (Analysis of Variance): tests whether different input categories have significantly different values for the output variable.
χ² (Chi-squared): tests whether the occurrences of a specific feature and a specific class are independent using their frequency distribution. The null hypothesis is that the two variables are independent; large values of χ² indicate that the null hypothesis should be rejected. When selecting features, we wish to keep those that are highly dependent on the output.
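A minimal sketch of the χ² filter assuming scikit-learn's SelectKBest; the Iris dataset and k=2 are illustrative choices (χ² requires non-negative features):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Chi-squared filter: scores each (non-negative) feature against the class labels
X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 most relevant features
X_new = selector.fit_transform(X, y)

print(selector.scores_)  # chi-squared statistic per feature
print(X_new.shape)       # (150, 2)
```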
What is the Wrapper method of feature selection?
Wrapper methods select features by measuring the performance of a model actually trained on a candidate subset of features.
Greedy search plus an evaluation criterion: at each iteration the method chooses the locally optimal subset of features, and the evaluation criterion (an evaluation metric such as a p-value, R-squared, etc.) plays the role of the judge.
The combination of features that gives the optimal results, according to the evaluation criterion, is selected. A sketch of backward elimination via recursive feature elimination follows the list below.
Four approaches:
- Forward selection
- Backward elimination
- Boruta
- Genetic algorithm
Computationally intensive.
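A minimal sketch of one wrapper approach (backward elimination via recursive feature elimination), assuming scikit-learn; the estimator and the number of features to keep are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Recursive feature elimination: repeatedly train the model and drop the
# weakest feature until only the requested number remains (backward elimination)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # 1 = selected; higher values were eliminated earlier
```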
What is the Embedded method of feature selection?
Embedded methods combine the advantageous aspects of both Filter and Wrapper methods.
Similar to Wrapper methods, however:
- perform feature selection and algorithm training together, rather than iteratively against an external evaluation metric (feature selection is an integral part of the model)
- are less subject to overfitting than Wrapper methods
Methods include:
- LASSO (Least Absolute Shrinkage and Selection Operator)
- Feature Importance
- Tree-based Methods (e.g. random forest)
- Permutation Importance
How does LASSO work?
Least Absolute Shrinkage and Selection Operator (LASSO) is a shrinkage method that performs both feature selection AND regularization at the same time. It penalises features likely to cause overfitting.
It is Linear Regression with L1 regularization.
Regularization is a process that shrinks the coefficients (weights) towards zero. This means that you are penalizing more complex models to avoid overfitting.
But how does this translate to feature selection? Other regularization techniques such as Ridge Regression or ElasticNet only shrink coefficients, whereas LASSO can set coefficients exactly to 0. If a coefficient is zero, the corresponding feature is not taken into consideration and is effectively discarded, which is what makes LASSO useful for feature selection.
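A minimal sketch assuming scikit-learn's Lasso on synthetic data where only the first two features matter; the alpha value is an illustrative choice:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data where only the first two of five features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1)  # alpha controls the strength of the L1 penalty
lasso.fit(X, y)

# Coefficients of the irrelevant features are shrunk to (exactly) zero,
# which is what makes LASSO usable for feature selection
print(lasso.coef_)
```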
What is regularisation and what methods can do this?
The goal of regularisation is to help the model generalise.
Regularization is an umbrella term for techniques that constrain, regularize, or shrink the coefficient estimates towards zero. In other words, it discourages learning an overly complex or flexible model (learning too much noise), so as to avoid the risk of overfitting.
Used when:
- Multicollinearity
- To filter out noise from data
- To prevent overfitting
- L2 (Ridge Regression): the RSS is modified by adding a shrinkage penalty proportional to the sum of the squared coefficients
- L1 (LASSO): the RSS is modified by adding a penalty proportional to the sum of the absolute values of the coefficients, which can shrink coefficients exactly to zero (so it can be used for feature selection)
- ElasticNet: combines the LASSO and Ridge penalties
What is RSS?
The fitting procedure of linear regression involves a loss function known as the residual sum of squares, or RSS: RSS = Σᵢ (yᵢ − ŷᵢ)². The coefficients are chosen such that they minimize this loss function.
How do you find thresholds for a classifier?
Usually, the threshold of a classifier is 0.5, but in some cases we need to fine-tune it to improve performance.
For example, in a spam classifier a 0.5 threshold means that if the predicted probability is 0.5 or above, the email is classified as spam; if it is lower, it is not spam.
To find the optimal threshold, we can use Precision-Recall curves, ROC curves, grid search, or manually change the value until we get a better cross-validation score.
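A minimal sketch of threshold tuning assuming scikit-learn; using the F1 score as the selection criterion is an illustrative choice, not something prescribed by the card:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

probs = LogisticRegression().fit(X_train, y_train).predict_proba(X_val)[:, 1]

# Scan candidate thresholds and keep the one with the best F1 score
precisions, recalls, thresholds = precision_recall_curve(y_val, probs)
f1_per_threshold = [f1_score(y_val, probs >= t) for t in thresholds]
best = thresholds[int(np.argmax(f1_per_threshold))]
print(best)  # threshold to use instead of the default 0.5
```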
What is Ensemble learning?
Ensemble learning combines insights of multiple machine learning models to improve the accuracy and performance metrics.
Simple methods:
- Mean/average: average predictions from multiple high-performing models.
- Weighted average: assign different weights to machine learning models based on their performance and then combine them.
Advanced ensemble methods:
- Bagging: used to minimize variance errors. It randomly creates subsets of training data and trains a model on each subset. The combination of models reduces the variance and makes it more reliable compared to a single model.
- Boosting: used to reduce bias errors. It is an iterative ensemble technique that adjusts the weights based on the last classification (sequential models). Boosting algorithms give more weight to observations that the previous model predicted inaccurately.
In semi-supervised machine learning, what is pseudo labelling?
Pseudo labelling is a method for generating labelled data:
- Train a model with labelled data.
- Use the trained model to predict labels for the unlabeled data, which creates pseudo-labeled data.
- Retrain the model with the pseudo-labeled and labeled data together.
This process happens iteratively as the model improves and is able to perform with a greater degree of accuracy.
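A minimal sketch assuming scikit-learn and synthetic data; the 0.95 confidence cut-off is an illustrative choice and, strictly speaking, makes this closer to the self-training variant described in the next card:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
X_lab, y_lab = X[:100], y[:100]  # small labelled set
X_unlab = X[100:]                # large "unlabelled" set

# 1. Train on the labelled data only
model = LogisticRegression().fit(X_lab, y_lab)

# 2. Predict labels for the unlabelled data (pseudo-labels),
#    keeping only confident predictions
probs = model.predict_proba(X_unlab)
confident = probs.max(axis=1) > 0.95
pseudo_labels = probs.argmax(axis=1)[confident]

# 3. Retrain on labelled + pseudo-labelled data together
X_combined = np.vstack([X_lab, X_unlab[confident]])
y_combined = np.concatenate([y_lab, pseudo_labels])
model = LogisticRegression().fit(X_combined, y_combined)
```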
In semi-supervised machine learning, what is Self Training?
Self-training is a variation of pseudo labeling. The difference is that in self-training we accept only the predictions that have high confidence, and we iterate through this process several times. In pseudo-labeling, by contrast, there is no confidence threshold that a prediction must meet to be used in the model.
Which metrics are used for evaluating a regression model?
There are three error metrics that are commonly used for evaluating and reporting the performance of a regression model; they are:
Mean Squared Error (MSE) - average of the squared differences between predicted and expected target values in a dataset
Root Mean Squared Error (RMSE) - units are the same as the original units of the target value that is being predicted (unlike MSE)
Mean Absolute Error (MAE) - units match target value units, and changes in MAE are linear and therefore intuitive (MSE and RMSE punish larger errors more than smaller errors, inflating or magnifying the mean error score)
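A minimal sketch of the three metrics assuming scikit-learn and numpy, with made-up true and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 12.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # linear penalty, same units

print(mse, rmse, mae)
```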
What is cross-validation and why is it important?
Assesses the performance and generalization ability of a predictive model.
It involves splitting the available dataset into multiple subsets or “folds” to evaluate the model on different combinations of training and validation data.
By using cross-validation, you can get a more robust and reliable estimation of a model’s performance compared to evaluating it on a single train-test split.
It helps to detect overfitting or underfitting issues, assess model stability, and make informed decisions about hyperparameter tuning or model selection.
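A minimal sketch assuming scikit-learn; 5 folds and logistic regression are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold,
# and repeat so every fold is used for validation once
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)         # one score per fold
print(scores.mean())  # more robust estimate than a single train-test split
```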
What are some usual applications of Rank algorithms?
- Search engines
- Recommender systems
- Travel agencies (finding best rooms etc)
What do ranking algorithms try to predict?
Ranking models typically work by predicting a relevance score s = f(x) for each input x = (q, d) where q is a query and d is a document.
What are the two approaches to ranking algorithms?
- Vector Space Models: compute vector embeddings (e.g. from a language model) for each query and document, then compute the relevance score f(x) = f(q, d) as the cosine similarity between the embeddings of q and d.
- Learning to Rank: a machine learning model that learns to predict a score s given an input x = (q, d) during a training phase in which some sort of ranking loss is minimized.
What are the six key evaluation metrics for ranking models?
For binary relevance:
1. Mean Average Precision (MAP)
2. Hit Ratio
3. Mean Reciprocal Rank (MRR)
4. Precision@K
5. Recall@K
For graded relevance:
6. Discounted Cumulative Gain (DCG) and its normalized variant, NDCG
How does Mean Average Precision work (ranking algorithm)?
Used for tasks with binary relevance, i.e. when the true score y of a document d can be only 0 (non relevant) or 1 (relevant).
For a given query q and corresponding documents D = {d₁, …, dₙ}, we check how many of the top k retrieved documents are relevant (y=1) or not (y=0), in order to compute precision Pₖ and recall Rₖ.
For k = 1…n we get different Pₖ and Rₖ values that define the precision-recall curve: the area under this curve is the Average Precision (AP).
Finally, by computing the average of AP values for a set of m queries, we obtain the Mean Average Precision (MAP).
How does Discounted Cumulative Gain work (ranking algorithm)?
Used for tasks with graded relevance, i.e. typical scale is 0 (bad), 1 (fair), 2 (good), 3 (excellent), 4 (perfect).
For a given query q and corresponding documents D = {d₁, …, dₙ}, we consider the k-th top retrieved document.
The gain Gₖ = 2^yₖ − 1 measures how useful this document is.
The discount Dₖ = 1/log(k+1) penalizes documents that are retrieved with a lower rank.
The discounted gain is GₖDₖ for k = 1…n (one term per retrieved document).
The sum of the discounted gains is the Discounted Cumulative Gain (DCG).
Normalized DCG = DCG / Ideal DCG, where the Ideal DCG is the score obtained if we ranked the documents by their true values yₖ.
Finally, we usually compute the average of DCG or NDCG values for a set of m queries to obtain a mean value.
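A minimal numpy sketch, using the common log₂ convention for the discount and an illustrative list of graded relevance values:

```python
import numpy as np

def dcg(relevance):
    """Discounted cumulative gain: gain 2^y - 1 discounted by 1/log2(k + 1)."""
    relevance = np.asarray(relevance, dtype=float)
    k = np.arange(1, len(relevance) + 1)
    return np.sum((2 ** relevance - 1) / np.log2(k + 1))

# Graded relevance (0-4) of documents in the order the model ranked them
ranked_relevance = [3, 2, 3, 0, 1]

ideal = sorted(ranked_relevance, reverse=True)  # best possible ordering
ndcg = dcg(ranked_relevance) / dcg(ideal)       # normalised DCG in [0, 1]
print(dcg(ranked_relevance), ndcg)
```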
What types of model are used for ranking?
The base machine learning model used to compute s = f(x) is usually a decision tree or a neural network.
The choice of loss function is the distinctive element for Learning to Rank models. In general, we have 3 approaches, depending on how the loss is computed:
- Pointwise Methods
- Pairwise Methods
- Listwise Methods
What are Pointwise Method loss functions in ranking algorithms?
The total loss is computed as the sum of loss terms defined on each document dᵢ (hence pointwise) as the distance between the predicted score sᵢ and the ground truth yᵢ, for i=1…n.
By doing this, we transform our task into a regression problem, where we train a model to predict y.
This is the simplest loss to implement and is used with regression metrics such as MSE, MAE, etc.
The problem is that we need true relevance scores to train the model. In most cases, we might only know the ordering of items and not their absolute relevance scores.
What are Pairwise Method loss functions in ranking algorithms?
The total loss is computed as the sum of loss terms defined on each pair of documents dᵢ, dⱼ (hence pairwise) , for i, j=1…n.
The objective on which the model is trained is to predict whether yᵢ > yⱼ or not, i.e. which of two documents is more relevant. By doing this, we transform our task into a binary classification problem.
This works only with relative preferences: given two documents, we want to predict whether the first is more relevant than the second, not their absolute relevance.
Binary classification losses such as binary cross-entropy can be used; some methods instead define the gradients directly (no explicit loss, only gradients) and apply gradient descent with them.
What are Listwise Method loss functions in ranking algorithms?
The loss is directly computed on the whole list of documents (hence listwise) with corresponding predicted ranks.
In this way, ranking metrics such as DCG can be more directly incorporated into the loss.
In contrast to pointwise/pairwise approaches, this solves the ranking problem more directly by optimising the evaluation metric itself; state-of-the-art results have been achieved with this approach.
e.g. LambdaLoss, SoftRank, ListNet (while RankNet is pairwise and LambdaRank sits between the pairwise and listwise approaches)
What is Cross-Entropy loss?
Cross-Entropy loss is one of the most important cost functions. It is used to optimize classification models.
Cross-Entropy takes the output probabilities (P) from the softmax and measures the distance from the true values.
It is also called logarithmic loss, log loss, or logistic loss. Each predicted class probability is compared to the actual desired output (0 or 1), and a score/loss is calculated that penalizes the probability based on how far it is from the expected value. The penalty is logarithmic in nature, yielding a large loss when the predicted probability for the true class is far from 1 and a small loss when it is close to 1.
Cross-entropy loss is used when adjusting model weights during training. The aim is to minimize the loss, i.e, the smaller the loss the better the model. A perfect model has a cross-entropy loss of 0.
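A minimal numpy sketch of categorical cross-entropy with made-up one-hot labels and softmax outputs:

```python
import numpy as np

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """Average categorical cross-entropy: -sum of true * log(predicted)."""
    y_pred_probs = np.clip(y_pred_probs, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_pred_probs), axis=1))

# One-hot truth labels and softmax output probabilities for 3 samples, 3 classes
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_pred = np.array([[0.9, 0.05, 0.05],  # confident and correct -> small loss
                   [0.2, 0.7, 0.1],    # correct but less confident
                   [0.3, 0.4, 0.3]])   # barely correct -> larger penalty

print(cross_entropy(y_true, y_pred))  # a perfect model would score 0
```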
What is Entropy in ML?
A key measure in information theory
The entropy of a random variable X is the level of uncertainty inherent in the variable's possible outcomes.
The greater the value of the entropy H(X), the greater the uncertainty of the probability distribution; the smaller the value, the less the uncertainty.
For a binary variable (using log base 2), entropy lies between 0 and 1.
What is the difference between Categorical Cross-Entropy and Sparse Categorical Cross-Entropy?
Both have the same cross entropy loss function. The only difference is how truth labels are defined.
- Categorical cross-entropy is used when true labels are one-hot encoded, for example, we have the following true values for 3-class classification problem [1,0,0], [0,1,0] and [0,0,1].
- Sparse categorical cross-entropy uses truth labels that are integer encoded, for example [1], [2] and [3] for a 3-class problem.
In Discounted Cumulative Gain, what is the Gain?
Gₖ = 2^yₖ − 1
Measures how useful this document is.
In Discounted Cumulative Gain, what is the Discount?
Dₖ = 1 / log(k+1)
Penalises documents that are retrieved with a lower rank
What is Gradient Descent?
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function.
The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent.
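A minimal sketch on a one-dimensional toy function; the function, starting point, and learning rate are illustrative:

```python
# Minimise f(x) = (x - 3)^2, whose gradient is f'(x) = 2 * (x - 3)
def grad(x):
    return 2 * (x - 3)

x = 10.0             # starting point
learning_rate = 0.1

for _ in range(100):
    x = x - learning_rate * grad(x)  # step in the direction opposite to the gradient

print(x)  # converges towards the minimum at x = 3
```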
What is least-squares regression?
Least-squares regression is where the ML goal is to “teach” a model F to predict values of the form
ŷ = F(x) by minimizing the mean squared error (equivalently, the residual sum of squares).
What is Gradient Boosting?
A variant of ensemble methods where you create multiple weak models sequentially, each one fitted to the errors of the current ensemble, and combine them to get better performance as a whole.
It is usually introduced for regression with squared loss (MSE), but it generalises to any differentiable loss function, including classification losses.
Qualities:
- Powerful enough to find any nonlinear relationship between your model target and features
- Can deal with missing values, outliers, and high cardinality categorical values without any special treatment.
e.g. XGBoost or LightGBM
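A minimal sketch assuming scikit-learn's GradientBoostingRegressor (a simpler stand-in for the XGBoost/LightGBM libraries mentioned above) on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# Each new shallow tree is fitted to the residuals (negative gradients of the
# squared loss) of the current ensemble, then added with a small learning rate
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3)
gbr.fit(X, y)
print(gbr.score(X, y))  # R^2 on the training data
```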
What are residuals?
Residuals are the prediction errors: the differences between the observed values and the values predicted by the model.
Define information retrieval
Information retrieval is the process of retrieving relevant information from a collection of unstructured or semi-structured data, typically text documents, in response to user queries or information needs.
It involves techniques and algorithms to efficiently search, analyze, and present information to users.
What is ‘relevance’ with regard to information retrieval?
In the context of information retrieval, “relevance” refers to the degree to which a document or information item satisfies the information needs or requirements of a user.
It is a measure of how closely a document aligns with the user’s query or information-seeking intent.
What are the key components of an information retrieval system?
- Document Collection
- Indexing: structured representation of the document collection to facilitate efficient searching
- Query Processing: parsing and understanding the user’s query and formulating an appropriate search strategy.
- Retrieval Model: the mathematical or statistical framework used to assess the relevance of documents to a given query; it determines how documents are scored or ranked based on their similarity or match to the query
- Ranking Algorithm: sorts retrieved documents based on their relevance scores
- User Interface: provides the means for users to interact with the information retrieval system
- Evaluation Metrics: Common metrics include precision, recall, F1 score, mean average precision (MAP), and normalized discounted cumulative gain (NDCG).
- Relevance Feedback: feedback on the retrieved results
How does Hit Ratio work in Information Retrieval?
It is simply the fraction of queries for which the correct answer is included in the recommendation list of length L.
How does Mean Reciprocal Rank work in Information Retrieval?
Measures how far down the ranking the first relevant document is.
If MRR is close to 1, it means relevant results are close to the top of search results - what we want! Lower MRRs indicate poorer search quality, with the right answer farther down in the search results.
How do precision@K and recall@K work in Information Retrieval?
Within the top K retrieved documents, what are the precision and recall?
i.e.
precision@K - the proportion of the top K documents that are true positives (indeed relevant)
recall@K - the proportion of all relevant documents that appear in the top K
What are LambdaRank and LambdaMart?
These are ranking algorithms, which are in-between pairwise and listwise methods.
LambdaMART: based on gradient boosted decision tree models. LambdaRank: based on neural networks.
What are the different types of Relevance feedback?
- Explicit - assessors indicate relevance of results explicitly using a binary or graded relevance system
- Implicit - based on user behaviours (documents selected for view)
- Pseudo -
- Take the top 10-50 results returned by initial query
- Select top 20-30 terms from these documents using e.g. tf-idf weights.
- Do query expansion: add these terms to the query, run the expanded query, and return the most relevant of the documents it retrieves
What is a Web Crawler?
A web crawler is a computer program that crawls through the web in a predefined and methodical manner to collect data.
The web crawler tool pulls together details about each webpage: titles, images, keywords, other linked pages, etc. It automatically maps the web to search documents, websites, RSS feeds, and email addresses. It then stores and indexes this data.
Web crawlers use several algorithms to rate the value of the content or the quality of the links in its index. These rules determine its crawling behavior: which sites to crawl, how often to re-crawl a page, how many pages on a site to be indexed, and so on.
When it visits a new website, it downloads its robots.txt file—the “robots exclusion standard” protocol designed to restrict unlimited access by web crawler tools. The file contains information about sitemaps (the URLs to crawl) and search rules (which pages are to be crawled and which parts to ignore).
What is Query Expansion in Information Retrieval?
Query expansion is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding.
Query expansion involves techniques such as:
- Finding synonyms of words, and searching for the synonyms as well
- Finding semantically related words (e.g. antonyms, meronyms, hyponyms, hypernyms)
- Finding all the various morphological forms of words by stemming each word in the search query
- Fixing spelling errors and automatically searching for the corrected form or suggesting it in the results
- Re-weighting the terms in the original query
What is Data Leakage?
Data leakage occurs when information from outside the training set (for example, from the validation or test set) inadvertently leaks into the training process. It can happen if the validation set is used to inform decisions during model training, such as feature selection or hyperparameter tuning.
In such cases, the model's performance on the validation set no longer provides a reliable evaluation of its generalization ability; the estimate will typically be over-optimistic.
What are the assumptions of linear regression?
- Linearity: relationship between predictor variables and response variable is assumed to be linear
- Independence: observations are independent of each other (no correlation or dependence)
- Homoscedasticity: variability of the residuals (differences between observed and predicted values) is constant across all levels of the predictor variables
- Normality: residuals follow a normal distribution. This allows for estimation of confidence intervals and hypothesis tests in linear regression. It is not necessary for the predictor variables to follow a normal distribution.
- No Multicollinearity: Multicollinearity refers to a high correlation between predictor variables
- No Endogeneity: predictor variables are exogenous, meaning they are not affected by the response variable. Endogeneity can lead to biased and inconsistent coefficient estimates.
What are the four main types of linear model?
- Linear regression
- Logistic regression
- Ridge regression
- LASSO regression
What are the five main types of tree-based model?
- Decision trees
- Random forest
- Gradient boosting regression
- XG Boost
- Light GBM Regressor
What are the three main types of clustering model?
- K-means
- Hierarchical clustering
- Gaussian mixture model
How does K-means clustering work?
- The number of centroids k is defined (use the elbow method to find the optimum)
- Centroids are initially assigned at random
- Each data point is assigned to its closest centroid (Euclidean distance)
- Each centroid is recomputed as the mean of its assigned points, and the assign/update steps are iterated to reduce the Euclidean distance between the centroids and their cluster data
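A minimal sketch assuming scikit-learn's KMeans on synthetic blob data; k=3 matches the number of generated blobs and is an illustrative choice:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_clusters is the number of centroids; fit() iterates the assign/update steps
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_)  # final centroid positions
print(kmeans.labels_[:10])      # cluster assignment of the first 10 points
```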
When is one hot encoding used and what are its limitations?
One hot encoding is used when you need to convert categorical data into numeric.
When there are many categories to encode, the feature matrix becomes very wide and sparse, and keeping every dummy column introduces multicollinearity (the dummy variable trap), so one column is often dropped.
Use pandas .get_dummies()
Most algorithms require an encoding such as one-hot because they operate on numerical data.
Algorithms that do not require an encoding are those that can deal directly with joint discrete distributions, such as Markov chains, Naive Bayes, Bayesian networks, and tree-based methods.
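A minimal sketch of pandas .get_dummies() on a made-up DataFrame; drop_first=True is one way to avoid the dummy-variable trap mentioned above:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"], "size": [1, 2, 3, 1]})

# One column per category; drop_first=True avoids perfect multicollinearity
# between the dummy columns (the dummy-variable trap)
encoded = pd.get_dummies(df, columns=["colour"], drop_first=True)
print(encoded)
```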
What is Principal Component Analysis?
PCA is the science of analyzing all the dimensions of a dataset and reducing them as much as possible while preserving as much of the information (variance) as possible.
When to use?
- Dimensionality reduction
- Categorize the dependent and independent variables in your data
- Eliminate noise components in your dimension analysis
Steps:
- Standardise data
- Compute covariance matrix to detect correlations
- Compute eigenvectors and eigenvalues from the covariance matrix to identify the Principal Components
- Create a feature vector to define the Principal Components
- Recast the data along the Principal Component axes
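A minimal sketch of these steps assuming scikit-learn (which performs the covariance/eigen decomposition internally); the Iris data and 2 components are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardise first so large-range features do not dominate the components
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)            # keep the top 2 principal components
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)  # share of variance captured per component
print(X_reduced.shape)                # (150, 2)
```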
What is meant by Data Standardisation?
- The range of the variables in a dataset is standardized so that each variable contributes equally to the analysis.
- Without standardization, variables with large ranges dominate the variables with small ranges.
- Standardizing helps you avoid biased results at the end of the analysis.
- To transform the variables to the same scale, use the following formula:
Z = (value - mean) / st dev
What are the key steps involved with a decision tree?
- Take the entire data set as input (accepts numerical and categorical data)
- Calculate entropy of the target variable, as well as the predictor attributes
- Calculate information gain of all attributes (we gain information on sorting different objects from each other)
- Choose attribute with the highest information gain as the root node
- Repeat procedure on every branch until the decision node of each branch is finalized
- The prediction at each leaf is the majority class (classification) or the average value (regression) of the training samples that reach that leaf
In machine learning, what is information gain and when is it used?
Information gain is a measure of how much information a feature provides about a class: it measures how uncertainty in the target variable is reduced, given a set of independent variables.
Information gain helps to determine the order of attributes in the nodes of a decision tree.
Defined by:
Gain = Entropy(parent node) - weighted average Entropy(child nodes)
Information gain is the entropy we “lost” after splitting the data at a node (i.e. how good the split was).
We want entropy to decrease as we progress down the decision tree.
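A minimal numpy sketch with made-up parent and child node labels, using log base 2 for the entropy:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (log base 2) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Parent node labels and the two child nodes produced by a candidate split
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left, right = np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])

# Weighted average entropy of the children
children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - children
print(info_gain)  # > 0 means the split reduced uncertainty
```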
What are the key differences between decision trees and random forest?
Main difference: in a Random Forest, root nodes and node splits are established randomly; each tree is built on a random bootstrap sample of the data and considers a random subset of features at each split.
- RF = an ensemble of decision trees
- RF employs the bagging method to generate the required prediction
- RF more time and computation intensive
- RF less sensitive to overfitting
Differentiate between univariate, bivariate, and multivariate analysis.
Univariate data contains only one variable. The purpose of the univariate analysis is to describe the data and find patterns that exist within it.
Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to determine the relationship between the two variables.
Multivariate data involves three or more variables. It is similar to a bivariate but contains more than one dependent variable.
How do you calculate Mean Square Error and RMSE?
Measures error by using the average squared difference between observed and predicted values.
MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
RMSE = √MSE (the square root of the MSE)
What is the elbow method?
Used to define optimal number of clusters for k-means
Determine the k-value by iteratively clustering k=1 to k=n (Here n is the hyperparameter that we choose as per our requirement).
For every value of k, we calculate the within-cluster sum of squares (WCSS) value.
WCSS: the sum of squared distances between each point in a cluster and its centroid.
To determine the optimal number of clusters k, plot a graph of k versus the WCSS value and look for the “elbow” where the decrease in WCSS flattens out.
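A minimal sketch assuming scikit-learn, where inertia_ is the WCSS; the synthetic blob data and the range k = 1…10 are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# WCSS (inertia_) for k = 1..10; plot k vs WCSS and look for the "elbow"
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

print(np.round(wcss, 1))  # the drop flattens noticeably around k = 4
```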
Which ML algorithm can be used to impute both categorical and numerical missing data?
K Nearest Neighbours
What are True Positive Rate and False Positive Rate?
TPR defines the probability that an actual positive will turn out to be positive.
TPR = TP / (TP + FN)
FPR defines the probability that an actual negative result will be shown as a positive one i.e the probability that a model will generate a false alarm.
FPR = FP / (FP + TN)
These are used to plot the ROC curve (y=TPR, x=FPR)
What is decision tree pruning?
Pruning simplifies the decision tree by reducing the rules; this helps to avoid complexity, reduce overfitting and improves accuracy.
Simplifies a decision tree by removing the weakest rules.
Pruning is often distinguished into:
- Pre-pruning (early stopping) stops the tree before it has completed classifying the training set
- Post-pruning allows the tree to classify the training set perfectly and then prunes the tree
Post-pruning starts with an unpruned tree, takes a sequence of subtrees (pruned trees), and picks the best one through cross-validation.
What is the difference between an error and a residual error?
Error = the deviation of an observed value from the true (unobservable) value, e.g. from the population mean
Residual = the deviation of an observed value from the estimated value, e.g. from the sample mean or the model's fitted value
What is overfitting?
- Occurs when the learning power of the model is too high OR the data is too small
- Model learns the noise rather than the information
- Model performs badly on unseen data
You will have an overfitting problem if, for example, you fit a regression model where the number of data points is less than the number of features.
Can remedy by:
- reducing the learning power (complexity) of the model
- Increasing size of training data
- Regularisation
What are L1 and L2 regularisation?
You can have Ln regularisations, but L1 (Lasso) and L2 (ridge) are most common
ElasticNet is a hybrid of these two
All features must be on comparable scales for regularisation to penalise them fairly
Both can be applied to all parametric models (regression, SVM, neural networks)
L1 can shrink some coefficients exactly to zero, so it can be used for feature selection; L2 only shrinks coefficients towards zero without eliminating them.
For correlated features, L1 tends to select the best one, whereas L2 spreads the coefficient weight (and error) amongst them.
In regularisation, what is alpha
Alpha controls the strength of the penalty and hence the amount of coefficient shrinkage: the larger alpha is, the greater the shrinkage.
What is ridge regression?
Ridge regression is the L2 form of regularisation: linear regression with an L2 penalty added to the loss.
L2 shrinks parameters and reduces influence of unimportant features. It is more stable than L1.
It is differentiable, so gradient descent can be used to optimise.
Spreads error amongst all terms; does not shrink parameters to zero, so cannot be used for feature selection
What Problems Do Multicollinearity Cause?
Multicollinearity causes the following two basic types of problems:
The coefficient estimates can swing wildly based on which other independent variables are in the model. The coefficients become very sensitive to small changes in the model.
Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model. You might not be able to trust the p-values to identify independent variables that are statistically significant.