Algorithms and Theory Flashcards

1
Q

What’s the trade-off between bias and variance?

A

Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm. High bias leads the model to underfit the data, making it hard to achieve high predictive accuracy and to generalise knowledge from the training set to the test set.

Variance is error due to too much complexity in the learning algorithm. High variance makes the algorithm highly sensitive to small variations in the training data, which can lead the model to overfit the data: it carries too much noise from the training data to be useful on the test data.

The learning error of any algorithm decomposes into bias, variance, and irreducible error due to noise in the underlying dataset. If you make the model more complex and add more variables, you lose bias but gain some variance; to reach the optimally reduced amount of error, you have to trade off bias against variance. You want neither high bias nor high variance in your model.
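
In symbols (the standard decomposition for squared-error loss; added here for reference, not spelled out on the card):

E\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2_{\text{noise}}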

2
Q

What are the different types of machine learning?

A

Supervised learning requires labelled training data and learns to predict an outcome variable

Unsupervised learning uses unlabelled training data to uncover hidden structure, e.g. finding groups of photos containing similar cars

Reinforcement learning involves the model learning from the rewards it receives for its previous actions.

3
Q

How is KNN different from k-means clustering?

A

K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. In order for K-Nearest Neighbors to work, you need labelled data against which to classify an unlabelled point. K-means clustering requires only a set of unlabelled points and a chosen number of clusters k: the algorithm gradually learns how to cluster the points into groups by repeatedly assigning each point to its nearest cluster centre and recomputing each centre as the mean of the points assigned to it.
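
A minimal sketch of the contrast, assuming scikit-learn and numpy are available; the data, labels and parameter values are illustrative placeholders, not part of the original card:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)          # 100 points with 2 features
y = (X[:, 0] > 0.5).astype(int)     # labels, needed only for KNN

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # supervised: needs labels
print(knn.predict([[0.2, 0.7]]))                      # class of a new point

kmeans = KMeans(n_clusters=2, n_init=10).fit(X)       # unsupervised: no labels
print(kmeans.labels_[:10])                            # cluster assignments it learned
```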

4
Q

Explain how a ROC curve works

A

The ROC curve is a graphical representation of the contrast between the true positive rate and the false positive rate at various classification thresholds. It's often used as a proxy for the trade-off between the sensitivity of the model (true positives) and the fall-out, or the probability that it will trigger a false alarm (false positives).

A curve that hugs the top-left corner indicates a good classifier; a curve along the diagonal y = x indicates a classifier no better than random guessing.
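
A minimal sketch of plotting a ROC curve, assuming scikit-learn and matplotlib are available; the labels and scores are illustrative placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                        # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.45]    # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)       # rates at each threshold
print("AUC:", roc_auc_score(y_true, y_scores))           # area under the curve

plt.plot(fpr, tpr)                   # the ROC curve itself
plt.plot([0, 1], [0, 1], "--")       # the y = x "random guess" baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```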

5
Q

Define precision and recall.

A

Recall is the true positive rate: the number of actual positives your model correctly identifies, compared to the total number of positives there are throughout the data.

Precision is the positive predictive value: the number of the model's positive claims that are actually correct, compared to the total number of positives it claims.

Think of a case where you've predicted that there were 10 apples and 5 oranges in a crate that actually contains 10 apples. You'd have perfect recall (there are actually 10 apples, and you predicted there would be 10) but 66.7% precision, because out of the 15 items you predicted, only 10 (the apples) are correct.
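
A minimal sketch working through the apple/orange example above in plain Python (the counts are the card's own numbers):

```python
true_positives = 10     # apples correctly predicted
false_negatives = 0     # apples missed
false_positives = 5     # oranges predicted that were not there

recall = true_positives / (true_positives + false_negatives)      # 10 / 10 = 1.0
precision = true_positives / (true_positives + false_positives)   # 10 / 15 ≈ 0.667
print(recall, precision)
```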

6
Q

What is Bayes’ Theorem? How is it useful in a machine learning context?

A

Bayes’ Theorem gives you the posterior probability of an event given prior knowledge.

Mathematically, it’s expressed as the true positive rate of a condition sample divided by the sum of the false positive rate of the population and the true positive rate of a condition.
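
In symbols (the standard form of the theorem, added here for reference):

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}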

Bayes’ Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier.

7
Q

Why is “Naive” Bayes naive?

A

Naive Bayes is naive because it assumes conditional independence of the features given the class, calculating the conditional probability as the pure product of the individual probabilities of the components, a condition probably never met in real life.
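
In symbols (the standard statement of the assumption, added for reference): for features x_1, \dots, x_n and class y,

P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)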

8
Q

Explain the difference between L1 and L2 regularization.

A

L2 regularization tends to spread error among all the terms, while L1 is more binary/sparse, driving many weights to exactly zero and keeping only the most important ones. L1 corresponds to setting a Laplacean prior on the terms, while L2 corresponds to a Gaussian prior.
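
In symbols (the standard penalised objectives, added for reference; \lambda is the regularisation strength and w_i the model weights):

L1 (lasso): \text{loss} + \lambda \sum_i |w_i| \qquad L2 (ridge): \text{loss} + \lambda \sum_i w_i^2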

9
Q

What is regularisation?

A

A technique, commonly applied to regression, that shrinks the coefficient estimates towards zero. It discourages learning an overly complex model and so reduces the risk of overfitting.
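
A minimal sketch of the shrinkage effect, assuming scikit-learn and numpy are available; the data and alpha values are illustrative placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=50)   # only feature 0 matters

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    # Ridge shrinks all coefficients; Lasso drives irrelevant ones to exactly zero
    print(type(model).__name__, np.round(model.coef_, 2))
```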

10
Q

What is regression?

A

Estimating the relationship between a known x variable and an observed y variable; a single continuous output value is produced from the training data. Regression FITS THE DATA.
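
As a concrete instance (simple linear regression; an illustrative addition, not on the card):

y = \beta_0 + \beta_1 x + \varepsilon, \quad \text{with } \beta_0, \beta_1 \text{ chosen to minimise } \sum_i (y_i - \beta_0 - \beta_1 x_i)^2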

11
Q

What’s your favorite algorithm, and can you explain it to me in less than a minute?

A

Neural networks

Neural networks are designed to replicate the way human brains learn. They consist of interconnected layers of nodes: first an input layer, followed by any number of hidden layers, and finally an output layer. The input layer takes in the values of the features in the training set, and the output layer produces the final predicted output.

Each node computes a single output from a weighted combination of all of its inputs, passed through the node's activation (or threshold) function. The inputs are values from the features of the data or from previous layers, and the output is a single value that is passed to the next layer, or is the final prediction if the node is in the output layer.

Neural networks learn by continually updating the weights to minimise error. The idea of backpropagation is that the errors at the output "flow back" from the output layer to update the weights throughout the network. If the output of the network matches the label on the data, the weights are not updated. Two different methods of updating the weights are the perceptron rule and the delta rule (gradient descent).

Different activation functions can be used at each layer. Common activation functions are:

Rectified Linear Unit (ReLU) – thresholded at 0
Perceptron (step function) – discrete -1 or 1 output; will find a separator for anything that is linearly separable
Sigmoid – smooth, continuous output between 0 and 1; differentiable, so gradient descent can be used
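
A minimal sketch of one forward pass through a single hidden layer using numpy only; the input, weights and layer sizes are illustrative placeholders, not part of the original card:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])        # input features
W1 = np.random.randn(4, 3) * 0.1      # hidden layer: 4 nodes, 3 inputs each
b1 = np.zeros(4)
W2 = np.random.randn(1, 4) * 0.1      # output layer: 1 node, 4 inputs
b2 = np.zeros(1)

h = relu(W1 @ x + b1)                 # each hidden node: weighted sum + activation
y_hat = sigmoid(W2 @ h + b2)          # output node: weighted sum + activation
print(y_hat)                          # predicted value in (0, 1)
```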

Advantages
Hidden layers can invent new features and therefore create a better representation of the problem
Good at handling large data sets

Disadvantages
Hard to interpret output
The more complex or bigger the network, the more likely it is to overfit

12
Q

What’s the difference between Type I and Type II error?

A

Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn’t, while Type II error means that you claim nothing is happening when in fact something is.

13
Q

What’s a Fourier transform?

A

A Fourier transform is a method to decompose generic functions into a superposition of symmetric functions. The Fourier transform finds the set of cycle speeds, amplitudes, and phases to match any time signal. A Fourier transform converts a signal from time to frequency domain.
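
A minimal sketch of a discrete Fourier transform with numpy (assumed available); the signal and sampling parameters are illustrative placeholders:

```python
import numpy as np

fs = 100                                    # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)                 # one second of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

spectrum = np.fft.rfft(signal)              # time domain -> frequency domain
freqs = np.fft.rfftfreq(len(signal), 1 / fs)

# the two dominant peaks appear at 5 Hz and 20 Hz
print(freqs[np.argsort(np.abs(spectrum))[-2:]])
```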

14
Q

What’s the difference between probability and likelihood?

A

Probability corresponds to finding the chance of a particular outcome given a fixed distribution (fixed parameters), while likelihood refers to how well different parameter values, or candidate distributions, explain the data you have actually observed.

Given fixed parameters, what is the probability of different outcomes?
vs
Given fixed outcomes, what is the likelihood of different parameter values? (Likelihoods are proportional to probabilities but are not probabilities themselves, because they don't add up to 1 over the parameter values.)
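
In symbols (the standard relationship, added for reference): for observed data x and parameters \theta,

\mathcal{L}(\theta \mid x) = P(x \mid \theta)

where probability treats \theta as fixed and varies x, and likelihood treats x as fixed and varies \theta.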

15
Q

What is deep learning, and how does it contrast with other machine learning algorithms?

A

Deep learning is a subset of machine learning that is concerned with neural networks: how to use backpropagation and certain principles from neuroscience to more accurately model large sets of unlabelled or semi-structured data.

Deep learning algorithms learn representations of the data themselves through the use of neural nets; this representation learning can be supervised, unsupervised, or semi-supervised.

16
Q

What’s the difference between a generative and discriminative model?

A

Discriminative models learn the (hard or soft) boundary between classes whilst generative models model the distribution of individual classes.
Discriminative models will generally outperform generative models on classification tasks.
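
In probabilistic terms (a standard way to phrase it, added for reference): a discriminative model learns P(y \mid x) directly, while a generative model learns the joint distribution P(x, y) = P(x \mid y)\,P(y) and obtains P(y \mid x) from it via Bayes' rule.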

17
Q

What cross-validation technique would you use on a time series dataset?

A

Instead of using standard k-folds cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data: it is inherently ordered chronologically. If a pattern emerges in later time periods, for example, randomly chosen folds would let your model pick up on it even though that effect doesn't hold in earlier years!

You’ll want to do something like forward chaining where you’ll be able to model on past data then look at forward-facing data.

Fold 1 : training [1], test [2]
Fold 2 : training [1 2], test [3]
Fold 3 : training [1 2 3], test [4]
Fold 4 : training [1 2 3 4], test [5]
Fold 5 : training [1 2 3 4 5], test [6]
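
A minimal sketch of the forward-chaining folds above, assuming scikit-learn's TimeSeriesSplit is available; the toy data is an illustrative placeholder (indices are 0-based here):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)      # 6 time-ordered samples
tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # each fold trains only on the past and tests on the next period
    print(f"Fold {fold}: training {list(train_idx)}, test {list(test_idx)}")
```
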
18
Q

How is a decision tree pruned?

A

Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model. Pruning can happen bottom-up and top-down, with approaches such as reduced error pruning and cost complexity pruning.

Reduced error pruning is perhaps the simplest version: replace each node with its most common class; if doing so doesn't decrease predictive accuracy on a validation set, keep it pruned. While simple, this heuristic actually comes pretty close to an approach that would optimise for maximum accuracy.
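
A minimal sketch of cost complexity pruning with scikit-learn (assumed available); the dataset and alpha value are illustrative placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

# the pruned tree has far fewer leaves and often generalises better
print(unpruned.get_n_leaves(), unpruned.score(X_test, y_test))
print(pruned.get_n_leaves(), pruned.score(X_test, y_test))
```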

19
Q

Which is more important to you: model accuracy or model performance?

A

Model accuracy is only a subset of model performance, and at that, a sometimes misleading one. For example, if you wanted to detect fraud in a massive dataset with a sample of millions, the most accurate model would most likely predict no fraud at all if only a tiny minority of cases were fraud. However, this would be useless for a predictive model: a model designed to find fraud that asserted there was no fraud at all! Model accuracy isn't the be-all and end-all of model performance; model performance as a whole is more important.

20
Q

What’s the F1 score? How would you use it?

A

The F1 score is a measure of a model's performance. It is the harmonic mean of the precision and recall of a model, with results tending towards 1 being the best and those tending towards 0 being the worst. You would use it in classification tests where true negatives don't matter much.
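
In symbols (the standard definition, added for reference):

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}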

21
Q

How would you handle an imbalanced dataset?

A

An imbalanced dataset is when you have, for example, a classification task where 90% of the data is in one class. That leads to problems: an accuracy of 90% can be misleading if the model has no predictive power on the other category of data!

Collect more data to even the imbalances in the dataset.
Resample the dataset to correct for imbalances, for example by oversampling the minority class (see the sketch below).
Try a different algorithm altogether on your dataset.
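
A minimal sketch of oversampling the minority class with scikit-learn's resample utility (assumed available); the data and counts are illustrative placeholders:

```python
import numpy as np
from sklearn.utils import resample

X = np.random.rand(100, 3)
y = np.array([0] * 90 + [1] * 10)           # 90% class 0, 10% class 1

X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=90, random_state=0)   # upsample to 90

X_balanced = np.vstack([X[y == 0], X_min_up])
y_balanced = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_balanced))              # now 90 of each class
```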

22
Q

When should you use classification over regression?

A

Classification produces discrete values and maps the dataset into strict categories, while regression gives you continuous results that allow you to better distinguish differences between individual points.
You would use classification over regression if you wanted your results to reflect the belongingness of data points in your dataset to certain explicit categories (e.g. if you wanted to know whether a name was male or female, rather than just how correlated it was with male and female names).

23
Q

Name an example where ensemble techniques might be useful.

A

Ensemble techniques use a combination of learning algorithms to achieve better predictive performance. They typically reduce overfitting and make the model more robust (unlikely to be influenced by small changes in the training data).

Bagging:
Parallel ensemble: each model is built independently on samples drawn with replacement
Aim to decrease variance
Suitable for high variance, low bias models (complex models)
E.g. random forest, which develops fully grown trees

Boosting:
Sequential ensemble: new models are added that do well where previous models fell short
Aim to decrease bias
Suitable for low variance, high bias models
E.g. gradient boosting
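
A minimal sketch contrasting a bagging and a boosting ensemble, assuming scikit-learn is available; the dataset and hyperparameters are illustrative placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagging = RandomForestClassifier(n_estimators=100, random_state=0)        # parallel, variance-reducing
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)   # sequential, bias-reducing

print("Random forest:", cross_val_score(bagging, X, y, cv=5).mean())
print("Gradient boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```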

24
Q

How do you ensure you’re not overfitting with a model?

A

Fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalisations.

There are three main methods to avoid overfitting:
Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.
Use cross-validation techniques such as k-folds cross-validation.
Use regularisation techniques that penalise certain model parameters if they’re likely to cause overfitting.

25
Q

What evaluation approaches would you work to gauge the effectiveness of a machine learning model?

A

You would first split the dataset into training and test sets, or perhaps use cross-validation techniques to further segment the dataset into composite sets of training and test sets within the data. You should then apply a selection of performance metrics, such as the F1 score, the accuracy, and the confusion matrix.
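
A minimal sketch of a train/test evaluation, assuming scikit-learn is available; the dataset and model are illustrative placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```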

26
Q

SUMMARY

A

Supervised:
REGRESSION
Linear regression- used to solve regression problems (predicts the continuous dependent variable using a given set of independent variables), fitted using least squares.
CLASSIFICATION
Logistic regression- used to solve classification problems (predicts the categorical dependent variable using a given set of independent variables), fitted using maximum likelihood.

Unsupervised:
CLUSTERING
Dividing the population into a number of groups such that data points in the same group are more similar to each other than to data points in other groups.
DIMENSIONALITY REDUCTION
Used to find a less complex representation of the data (the data set should have a reduced amount of redundant information while the important parts may be emphasised).
ASSOCIATION
Identify patterns of associations between different variables or items.

27
Q

What’s the “kernel trick” and how is it useful?

A

The kernel trick involves kernel functions that enable operating in higher-dimensional spaces without explicitly calculating the coordinates of points within those spaces: instead, kernel functions compute the inner products between the images of all pairs of data in a feature space. This gives them the very useful property of working with higher-dimensional representations while being computationally cheaper than the explicit calculation of said coordinates. Many algorithms can be expressed in terms of inner products, so using the kernel trick enables us to effectively run algorithms in a high-dimensional space with lower-dimensional data.
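
A minimal sketch of the idea with numpy (assumed available): an RBF kernel computes an inner product of the implicit feature maps of two points without ever constructing that feature space. The points and gamma value are illustrative placeholders:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # inner product of the implicit feature maps of x and z
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.5])
print(rbf_kernel(x, z))   # similarity in the implicit high-dimensional space
```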

28
Q

How do you know which algorithm to pick for a classification problem?

A

While there is no fixed rule to choose an algorithm for a classification problem, you can follow these guidelines:

If accuracy is a concern, test different algorithms and cross-validate them
If the training dataset is small, use models that have low variance and high bias
If the training dataset is large, use models that have high variance and little bias