Big Ideas Flashcards

1
Q

Statistical modelling

A

Modelling is the process of incorporating information into a tool that can forecast and make predictions.

2
Q

Statistical modelling Equation

A

Y = f(X) + e, where e is the irreducible error term.

3
Q

Prediction

A

Once we have a good estimate of f(X), we can use it to make predictions on new data. We can treat f as a black box, since we only care about the accuracy of the predictions, not necessarily how f works.

4
Q

Inference

A

We want to understand the relationship between X and Y. We can no longer treat f as a black box, since we want to understand how Y changes with respect to X.

5
Q

Reducible

A

Error that can potentially be reduced by using the most appropriate statistical learning technique to estimate f. The goal is to minimise the reducible error.

6
Q

Irreducible

A

Error that cannot be reduced no matter how well we estimate f. The irreducible error e is unknown and unmeasurable in practice, and it places an upper bound on the accuracy of our predictions.

7
Q

Parametric

A

Models that first assume the form of f(X) and then fit that form to the data (e.g. we assume the relationship is linear). This simplifies the problem from estimating an arbitrary function f(X) to estimating a fixed set of parameters; however, if the initial assumption is wrong, the model will fit poorly.
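
A minimal sketch of the parametric approach, assuming scikit-learn and NumPy; the data, coefficients and variable names below are made up for illustration:

    # Parametric: assume f(X) is linear, then estimate its parameters from data.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))              # one feature
    y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, 100)    # linear signal + noise

    model = LinearRegression().fit(X, y)               # estimate slope and intercept
    print(model.coef_, model.intercept_)               # roughly 3 and 2
    print(model.predict([[5.0]]))                      # prediction on new data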

8
Q

Non parametric

A

Models that don’t make any assumption about the shape of f(X), which allows them to fit a wider range of shapes but may lead to overfitting, e.g. k-NN.
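
A matching non-parametric sketch, again assuming scikit-learn; k-NN makes no assumption about the shape of f(X) and predicts by averaging the k closest training points (toy data, illustrative only):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)      # clearly non-linear signal

    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
    print(knn.predict([[2.5]]))                        # roughly sin(2.5)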

9
Q

Supervised

A

Models that fit input variables X to a known output variable Y.

10
Q

Unsupervised

A

Models that take in input variables X but do not have an associated output Y to supervise the training. The goal is to understand the relationships between the variables or observations.

11
Q

Blackbox

A

Models that make decisions, but we do not know what happens under the hood (e.g. deep learning, neural networks).

12
Q

Interpretable

A

Models that provide insight into why they make their decisions (e.g. linear regression, decision trees).

13
Q

Generative

A

Learns the joint probability distribution p(x, y). For example, if we wanted to distinguish between fraud and not fraud, we would build a model of what fraudulent transactions look like and another of what non-fraudulent transactions look like. Then we compare a new transaction to the two models and see which it resembles more.

14
Q

Discriminative

A

Learns the conditional probability distribution p(y|x). For example, we find a line that separates the two classes and do not care about how the data within each class was generated.
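
A small sketch contrasting the two approaches on the same made-up two-class data, assuming scikit-learn (Gaussian naive Bayes as the generative model, logistic regression as the discriminative one):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB            # generative: models p(x, y)
    from sklearn.linear_model import LogisticRegression   # discriminative: models p(y|x)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)),             # class 0 cluster
                   rng.normal(3, 1, (50, 2))])            # class 1 cluster
    y = np.array([0] * 50 + [1] * 50)

    generative = GaussianNB().fit(X, y)
    discriminative = LogisticRegression().fit(X, y)
    print(generative.predict([[1.5, 1.5]]),
          discriminative.predict([[1.5, 1.5]]))           # classify the same new point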

15
Q

Occam’s razor

A

Philosophical principle that the simplest explanation is the best explanation. In modelling, if we are given two models that predict equally well, we should choose the simpler one; choosing the more complex one can often result in overfitting (or just memorising the training data). Simpler is usually defined as having fewer parameters or assumptions.

16
Q

Curse of dimensionality

A

As the number of features d grows, points become very far apart in Euclidean distance, and a large fraction of the feature space is needed to find the k nearest neighbours. Eventually points become nearly equidistant, which means all points are roughly equally similar, so algorithms that rely on distance measures become pretty much useless. This is not a problem for some high-dimensional data sets, since the data lies on a low-dimensional subspace (such as images of faces or handwritten digits). In other words, the data only sits in a small corner of the feature space (think of how trees only grow near the surface of the Earth, not throughout the entire atmosphere).
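
A small numeric sketch of the distance-concentration effect, assuming NumPy and uniform random data (illustrative only): as the dimension d grows, the nearest and farthest neighbours of a point end up at nearly the same distance.

    import numpy as np

    rng = np.random.default_rng(0)
    for d in [2, 10, 100, 1000]:
        X = rng.uniform(size=(200, d))
        # Euclidean distances from the first point to all the others
        dists = np.linalg.norm(X[1:] - X[0], axis=1)
        print(d, round(dists.min() / dists.max(), 3))  # ratio creeps toward 1 as d grows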

17
Q

Interpretable Models

A

Linear regression, logistic regression, generalised linear models (GLMs), generalised additive models (GAMs), decision trees, decision rules, RuleFit, naive Bayes, k-nearest neighbours.

18
Q

No free lunch theorem

A

The no free lunch theorem states that every successful machine learning algorithm must make assumptions. The implication is that no single algorithm will work for every problem and no single algorithm will be the best for all problems. In other words, there is no “master algorithm” that is the best algorithm for every single problem. The practical solution is to test multiple models in order to find the one that works best for a particular problem.

19
Q

Bias-variance trade off

A

The trade off between variance and bias is the conflict in attempting to minimise both, since models with lower bias usually have higher variance and vice versa.
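
In symbols, using the same Y = f(X) + e notation as above, the expected squared prediction error at a point x has the standard decomposition

    E[(Y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + Var(e)

where Var(e) is the irreducible error; pushing one of the first two terms down typically pushes the other up.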

20
Q

Bias

A

Bias is the error resulting from incorrect assumptions in the learning algorithm, i.e. the model is too simple. For example, using a linear model when the data is non-linear. High bias means missing relevant relations between inputs and outputs, aka underfitting.

21
Q

Variance

A

Variance is the error from sensitivity to fluctuations in the training data, i.e. how much the estimate of f would differ if different training data were used. This causes the model to capture random noise rather than the signal because the model is too complex, aka overfitting.

22
Q

Parallel processing or distributed computing

A

Parallel processing is the process of breaking down a complex problem into simpler tasks that can be run simultaneously or independently on different machines or cores, with the results combined at the end. Done right, parallelisation can dramatically reduce processing time. The most basic example is addition: suppose you want to compute 1 + 1 + 1 + 1 and each addition takes one second. Computed sequentially this takes 3 seconds; done in parallel (the two pairs added at the same time, then their results added together) it takes 2 seconds.
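
A minimal parallel-processing sketch using Python's multiprocessing module; the chunking scheme and worker count are illustrative choices, not a tuned implementation:

    from multiprocessing import Pool

    def partial_sum(chunk):
        return sum(chunk)

    if __name__ == "__main__":
        numbers = list(range(1_000_000))
        chunks = [numbers[i::4] for i in range(4)]     # split into 4 roughly equal chunks
        with Pool(processes=4) as pool:
            partials = pool.map(partial_sum, chunks)   # sum the chunks in parallel
        print(sum(partials))                           # combine the partial results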

23
Q

Parallel processing

A

Parallel processing: the simultaneous tasks run on multiple processors or cores of the same machine, typically sharing memory.

24
Q

Distributed computing

A

Distributed computing: the tasks are spread across multiple machines connected over a network, each with its own memory, and the results are sent back and combined.

25
Q

Vectorization

A

Vectorization is the process of performing operations on a whole list or vector instead of on scalar values one at a time. Suppose you have the list of numbers 1, 2, 3, 4 and you want to add 1 to each value. You could use a for loop, iterate over the list and add one to every single value. The other way is to view it as a vector operation, which can be done in one step as opposed to n steps. Some packages have been optimised for vectorization (e.g. NumPy), and this is the recommended way to perform calculations since the code is more readable and generally faster than looping.
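
A minimal vectorization sketch, assuming NumPy; the loop and the array operation produce the same result, but the vectorized version expresses the whole calculation as a single step:

    import numpy as np

    values = [1, 2, 3, 4]

    # loop version: n separate additions
    looped = []
    for v in values:
        looped.append(v + 1)

    # vectorized version: one array operation
    vectorized = np.array(values) + 1

    print(looped, vectorized.tolist())   # [2, 3, 4, 5] [2, 3, 4, 5]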

26
Q

Overfitting

A

Overfitting is a modelling error where the trained model essentially memorises the data it sees, capturing noise in the data rather than the true underlying function. Some ways to prevent overfitting are cross-validation and regularisation.

27
Q

Regularisation

A

Regularisation prevents a model from becoming too complex by adding a penalty term, controlled by a tuning parameter, that shrinks the coefficient estimates. Two popular types of regularisation are lasso and ridge (a short sketch of both follows the Ridge card).

28
Q

Lasso

A

Adds the sum of the absolute values of the coefficients (an L1 penalty) as the penalty term to the loss function; this can shrink some coefficients exactly to zero.

29
Q

Ridge

A

Adds the sum of the squared coefficients (an L2 penalty) as the penalty term to the loss function.
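
A short sketch of both penalties on the same made-up data, assuming scikit-learn; alpha is the tuning parameter that controls how strongly the coefficients are shrunk (the values here are arbitrary, for illustration):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 100)   # only 2 useful features

    lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty
    print(lasso.coef_)   # tends to set the useless coefficients to exactly 0
    print(ridge.coef_)   # shrinks them toward 0 but rarely to exactly 0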

30
Q

big data hubris

A

Big data hubris is the assumption that big data is a substitute for, rather than a supplement to, traditional data collection and analysis.

31
Q

Local minima

A

Suppose we have a function f(x) and we want to find the point that minimises it. Local minima are the best solutions within their corresponding neighbourhood; in other words, they are pretty good solutions but not always the best. The global minimum is the best solution overall.

32
Q

convex function

A

For convex functions, any local minimum is also the global minimum, so an optimisation algorithm that finds a minimum has found the best solution overall.

33
Q

non-convex functions

A

May have many local minima, and we are not guaranteed to find the global minimum. The best we can do is run an optimisation algorithm from different starting points and use the best local minimum we find as the solution.

34
Q

Gradient descent

A

Gradient descent is an optimisation algorithm that finds a minimum of a function by repeatedly taking steps in the direction of the negative gradient (the direction in which the function decreases fastest).
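
A minimal gradient descent sketch on f(x) = (x − 3)², whose minimum is at x = 3; the learning rate and iteration count are illustrative choices:

    def gradient(x):
        return 2 * (x - 3)                       # derivative of (x - 3)^2

    x = 0.0                                      # starting point
    learning_rate = 0.1
    for _ in range(100):
        x = x - learning_rate * gradient(x)      # step against the gradient

    print(x)                                     # converges to roughly 3.0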

35
Q

Stochastic gradient descent

A

Stochastic gradient descent is a form of gradient descent that performs the gradient step on one (or a few) data points as opposed to all of the data points, which allows the iterates to jump around and avoid getting stuck in local minima. It is computationally cheaper per step and can lead to much faster convergence, but the updates are noisy.

36
Q

Minibatch stochastic gradient descent

A

Minibatch stochastic gradient descent randomly chooses batches of data points (typically somewhere between 10 and 10,000) and performs a gradient step on each batch. This reduces the noise of single-point SGD and also helps speed up training.
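
A minimal minibatch SGD sketch for simple linear regression (y ≈ w·x + b), assuming NumPy and made-up data; the batch size and learning rate are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 1000)
    y = 4.0 * x + 1.0 + rng.normal(0, 0.1, 1000)

    w, b, lr, batch_size = 0.0, 0.0, 0.1, 32
    for _ in range(2000):
        idx = rng.integers(0, len(x), batch_size)      # pick a random minibatch
        err = (w * x[idx] + b) - y[idx]
        w -= lr * 2 * np.mean(err * x[idx])            # gradient step on the batch only
        b -= lr * 2 * np.mean(err)

    print(w, b)   # roughly 4.0 and 1.0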

37
Q

maximum likelihood estimation MLE

A

A frequentist method to estimate the parameters of a probability distribution so that it best fits the observations. The two main steps are:

  1. Make an assumption about the distribution of the observations.
  2. Find the parameters of the distribution that make the observations you have as likely as possible. This is done by maximising the likelihood function.
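
A small numeric sketch of both steps, assuming NumPy/SciPy and made-up Gaussian data; for a Gaussian the MLE has a closed form (the sample mean and the divide-by-n standard deviation), so the numerical answer can be checked against it:

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=500)    # step 1: assume a Gaussian

    def negative_log_likelihood(params):               # step 2: maximise the likelihood
        mu, sigma = params                             # (minimise its negative)
        return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

    result = minimize(negative_log_likelihood, x0=[0.0, 1.0],
                      bounds=[(None, None), (1e-6, None)])
    print(result.x)                                    # numerical MLE
    print(data.mean(), data.std())                     # closed-form MLE, ~5.0 and ~2.0
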
38
Q

Maximum a Posteriori MAP

A

A Bayesian method to estimate the parameters of a probability distribution by maximising the posterior distribution. This assumes a prior distribution over the parameters and updates it with the data using Bayes’ rule; the estimate θ is the mode of the resulting posterior.
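
Stated as formulas, with x the observations and θ the parameters:

    θ_MLE = argmax_θ p(x | θ)
    θ_MAP = argmax_θ p(θ | x) = argmax_θ p(x | θ) p(θ)

The extra factor p(θ) is the prior.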

39
Q

MAP vs MLE

A

MAP is just MLE with a uniform prior. MLE works well if the assumed distribution you are estimating is correct and you have a large number of observations n, but it can be very wrong when your assumed distribution is wrong or n is small (it can overfit). MAP works well when your prior is accurate, but it can be very wrong when n is small and your prior is wrong.

40
Q

cloud computing

A

Cloud computing is the delivery of on-demand computing services over the Internet with pay-as-you-go pricing. Examples: AWS, Microsoft Azure, Google Cloud Platform, IBM Cloud, Salesforce and Alibaba Cloud.

41
Q

Half life of data

A

The usefulness of data decays over time, and being aware of what kind of data you are working with is important in determining how often you should update your data and models.