Big Ideas Flashcards

1
Q

Statistical modelling

A

Modelling is the process of incorporating information into a tool that can forecast and make predictions.

2
Q

Statistical modelling Equation

A

Y = f(X) + e, where e is the irreducible error term.

3
Q

Prediction

A

Once we have a good estimate of f(X), we can use it to make predictions on new data. We can treat f as a black box, since we only care about the accuracy of the predictions, not necessarily how f works.

4
Q

Inference

A

We want to understand the relationship between X and Y. We can no longer treat f as a black box, since we want to understand how Y changes with respect to X.

5
Q

Reducible

A

Error that can potentially be reduced by using the most appropriate statistical learning technique to estimate f. The goal is to minimise the reducible error.

6
Q

Irreducible

A

Error that cannot be reduced no matter how well we estimate f. The irreducible error e is unknown and unmeasurable in practice, and it places an upper bound on the accuracy of our predictions.

7
Q

Parametric

A

Models that first assume the form of f(X) and then fit that form to the data (e.g. we assume the relationship is linear). This simplifies the problem from estimating an arbitrary function f(X) to estimating a fixed set of parameters; however, if the initial assumption is wrong, the model will fit poorly.
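
A minimal sketch of the parametric approach, assuming scikit-learn and NumPy; the data, coefficients and variable names below are made up for illustration:

    # Parametric: assume f(X) is linear, then estimate its parameters from data.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))              # one feature
    y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, 100)    # linear signal + noise

    model = LinearRegression().fit(X, y)               # estimate slope and intercept
    print(model.coef_, model.intercept_)               # roughly 3 and 2
    print(model.predict([[5.0]]))                      # prediction on new data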

8
Q

Non parametric

A

Models that don’t make any assumption about the shape of f(X), which allows them to fit a wider range of shapes but may lead to overfitting, e.g. k-NN.
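
A matching non-parametric sketch, again assuming scikit-learn; k-NN makes no assumption about the shape of f(X) and predicts by averaging the k closest training points (toy data, illustrative only):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)      # clearly non-linear signal

    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
    print(knn.predict([[2.5]]))                        # roughly sin(2.5)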

9
Q

Supervised

A

Models that fit input variables X to a known output variable Y.

10
Q

Unsupervised

A

Models that take in input variables X but do not have an associated output Y to supervise the training. The goal is to understand the relationships between the variables or observations.

11
Q

Blackbox

A

Models that make decisions, but we do not know what happens under the hood (e.g. deep learning, neural networks).

12
Q

Interpretable

A

Models that provide insight into why they make their decisions (e.g. linear regression, decision trees).

13
Q

Generative

A

Learns the joint probability distribution p(x, y). For example, if we wanted to distinguish between fraud and not fraud, we would build a model of what fraudulent transactions look like and another of what non-fraudulent transactions look like. Then we compare a new transaction to the two models and see which it resembles more.

14
Q

Discriminative

A

Learns the conditional probability distribution p(y|x). For example, we find a line that separates the two classes and do not care about how the data within each class was generated.
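
A small sketch contrasting the two approaches on the same made-up two-class data, assuming scikit-learn (Gaussian naive Bayes as the generative model, logistic regression as the discriminative one):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB            # generative: models p(x, y)
    from sklearn.linear_model import LogisticRegression   # discriminative: models p(y|x)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)),             # class 0 cluster
                   rng.normal(3, 1, (50, 2))])            # class 1 cluster
    y = np.array([0] * 50 + [1] * 50)

    generative = GaussianNB().fit(X, y)
    discriminative = LogisticRegression().fit(X, y)
    print(generative.predict([[1.5, 1.5]]),
          discriminative.predict([[1.5, 1.5]]))           # classify the same new point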

15
Q

Occam’s razor

A

Philosophical principle that the simplest explanation is the best explanation. In modelling, if we are given two models that predict equally well, we should choose the simpler one; choosing the more complex one can often result in overfitting (or just memorising the training data). Simpler is usually defined as having fewer parameters or assumptions.

16
Q

Curse of dimensionality

A

As the number of features d grows, points become very far apart in Euclidean distance, and a large fraction of the feature space is needed to find the k nearest neighbours. Eventually points become nearly equidistant, which means all points are roughly equally similar, so algorithms that rely on distance measures become pretty much useless. This is not a problem for some high-dimensional data sets, since the data lies on a low-dimensional subspace (such as images of faces or handwritten digits). In other words, the data only sits in a small corner of the feature space (think of how trees only grow near the surface of the Earth, not throughout the entire atmosphere).
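
A small numeric sketch of the distance-concentration effect, assuming NumPy and uniform random data (illustrative only): as the dimension d grows, the nearest and farthest neighbours of a point end up at nearly the same distance.

    import numpy as np

    rng = np.random.default_rng(0)
    for d in [2, 10, 100, 1000]:
        X = rng.uniform(size=(200, d))
        # Euclidean distances from the first point to all the others
        dists = np.linalg.norm(X[1:] - X[0], axis=1)
        print(d, round(dists.min() / dists.max(), 3))  # ratio creeps toward 1 as d grows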

17
Q

Interpretable Models

A

Linear regression, logistic regression, generalised linear models (GLMs), generalised additive models (GAMs), decision trees, decision rules, RuleFit, naive Bayes, k-nearest neighbours.

18
Q

No free lunch theorem

A

The no free lunch theorem states that every successful machine learning algorithm must make assumptions. The implication is that no single algorithm will work for every problem and no single algorithm will be the best for all problems. In other words, there is no “master algorithm” that is the best algorithm for every single problem. The practical solution is to test multiple models in order to find the one that works best for a particular problem.

19
Q

Bias-variance trade off

A

The trade off between variance and bias is the conflict in attempting to minimise both, since models with lower bias usually have higher variance and vice versa.
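
In symbols, using the same Y = f(X) + e notation as above, the expected squared prediction error at a point x has the standard decomposition

    E[(Y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + Var(e)

where Var(e) is the irreducible error; pushing one of the first two terms down typically pushes the other up.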

20
Q

Bias

A

Bias is the error resulting from incorrect assumptions in the learning algorithm, i.e. the model is too simple. For example, using a linear model when the data is non-linear. High bias means missing relevant relations between inputs and outputs, aka underfitting.

21
Q

Variance

A

Variance is the error from sensitivity to fluctuations in the training data, i.e. how much the estimate of f would differ if different training data were used. This causes the model to capture random noise rather than the signal because the model is too complex, aka overfitting.

22
Q

Parallel processing or distributed computing

A

Parallel processing is the process of breaking down a complex problem into simpler tasks that can be run simultaneously or independently on different machines or cores, with the results combined at the end. Done right, parallelisation can dramatically reduce processing time. The most basic example is addition: suppose you want to compute 1 + 1 + 1 + 1 and each addition takes one second. Computed sequentially this takes 3 seconds; done in parallel (the two pairs added at the same time, then their results added together) it takes 2 seconds.
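
A minimal parallel-processing sketch using Python's multiprocessing module; the chunking scheme and worker count are illustrative choices, not a tuned implementation:

    from multiprocessing import Pool

    def partial_sum(chunk):
        return sum(chunk)

    if __name__ == "__main__":
        numbers = list(range(1_000_000))
        chunks = [numbers[i::4] for i in range(4)]     # split into 4 roughly equal chunks
        with Pool(processes=4) as pool:
            partials = pool.map(partial_sum, chunks)   # sum the chunks in parallel
        print(sum(partials))                           # combine the partial results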

23
Q

Parallel processing

A

Parallel processing: the simultaneous tasks run on multiple processors or cores of the same machine, typically sharing memory.

24
Q

Distributed computing

A

Distributed computing: the tasks are spread across multiple machines connected over a network, each with its own memory, and the results are sent back and combined.

25
Q

Vectorization

A

Vectorization is the process of performing operations on a whole list or vector instead of on scalar values one at a time. Suppose you have the list of numbers 1, 2, 3, 4 and you want to add 1 to each value. You could use a for loop, iterate over the list and add one to every single value. The other way is to view it as a vector operation, which can be done in one step as opposed to n steps. Some packages have been optimised for vectorization (e.g. NumPy), and this is the recommended way to perform calculations since the code is more readable and generally faster than looping.
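
A minimal vectorization sketch, assuming NumPy; the loop and the array operation produce the same result, but the vectorized version expresses the whole calculation as a single step:

    import numpy as np

    values = [1, 2, 3, 4]

    # loop version: n separate additions
    looped = []
    for v in values:
        looped.append(v + 1)

    # vectorized version: one array operation
    vectorized = np.array(values) + 1

    print(looped, vectorized.tolist())   # [2, 3, 4, 5] [2, 3, 4, 5]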

26
Q

Overfitting

A

Overfitting is a modelling error where the trained model essentially memorises the data it sees, capturing noise in the data rather than the true underlying function. Some ways to prevent overfitting are cross-validation and regularisation.

27
Q

Regularisation

A

Regularisation prevents a model from becoming too complex by adding a penalty term, controlled by a tuning parameter, that shrinks the coefficient estimates. Two popular types of regularisation are lasso and ridge (a short sketch of both follows the Ridge card).

28
Q

Lasso

A

Adds the sum of the absolute values of the coefficients (an L1 penalty) as the penalty term to the loss function; this can shrink some coefficients exactly to zero.

29
Q

Ridge

A

Adds the sum of the squared coefficients (an L2 penalty) as the penalty term to the loss function.
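
A short sketch of both penalties on the same made-up data, assuming scikit-learn; alpha is the tuning parameter that controls how strongly the coefficients are shrunk (the values here are arbitrary, for illustration):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 100)   # only 2 useful features

    lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty
    print(lasso.coef_)   # tends to set the useless coefficients to exactly 0
    print(ridge.coef_)   # shrinks them toward 0 but rarely to exactly 0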

30
Q

big data hubris

A

Big data hubris is the assumption that big data is a substitute for, rather than a supplement to, traditional data collection and analysis.

31
Q

Local minima

A

Suppose we have a function f(x) and we want to find the point that minimises it. Local minima are the best solutions within their corresponding neighbourhood; in other words, they are pretty good solutions but not always the best. The global minimum is the best solution overall.

32
Q

convex function

A

For convex functions, any local minimum is also the global minimum, so an optimisation algorithm that finds a minimum has found the best solution overall.

33
Q

non-convex functions

A

May have many local minima, and we are not guaranteed to find the global minimum. The best we can do is run an optimisation algorithm from different starting points and use the best local minimum we find as the solution.

34
Q

Gradient descent

A

Gradient descent is an optimisation algorithm that finds a minimum of a function by repeatedly taking steps in the direction of the negative gradient (the direction in which the function decreases fastest).
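
A minimal gradient descent sketch on f(x) = (x − 3)², whose minimum is at x = 3; the learning rate and iteration count are illustrative choices:

    def gradient(x):
        return 2 * (x - 3)                       # derivative of (x - 3)^2

    x = 0.0                                      # starting point
    learning_rate = 0.1
    for _ in range(100):
        x = x - learning_rate * gradient(x)      # step against the gradient

    print(x)                                     # converges to roughly 3.0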

35
Q

Stochastic gradient descent

A

Stochastic gradient descent is a form of gradient descent that performs the gradient step on one (or a few) data points as opposed to all of the data points, which allows the iterates to jump around and avoid getting stuck in local minima. It is computationally cheaper per step and can lead to much faster convergence, but the updates are noisy.

36
Q

Minibatch stochastic gradient descent

A

Minibatch stochastic gradient descent randomly chooses batches of data points (typically somewhere between 10 and 10,000) and performs a gradient step on each batch. This reduces the noise of single-point SGD and also helps speed up training.
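
A minimal minibatch SGD sketch for simple linear regression (y ≈ w·x + b), assuming NumPy and made-up data; the batch size and learning rate are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 1000)
    y = 4.0 * x + 1.0 + rng.normal(0, 0.1, 1000)

    w, b, lr, batch_size = 0.0, 0.0, 0.1, 32
    for _ in range(2000):
        idx = rng.integers(0, len(x), batch_size)      # pick a random minibatch
        err = (w * x[idx] + b) - y[idx]
        w -= lr * 2 * np.mean(err * x[idx])            # gradient step on the batch only
        b -= lr * 2 * np.mean(err)

    print(w, b)   # roughly 4.0 and 1.0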

37
Q

maximum likelihood estimation MLE

A

A frequentist method to estimate the parameters of a probability distribution so that it best fits the observations. The two main steps are:

  1. Make an assumption about the distribution of the observations.
  2. Find the parameters of the distribution that make the observations you have as likely as possible. This is done by maximising the likelihood function.
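
A small numeric sketch of both steps, assuming NumPy/SciPy and made-up Gaussian data; for a Gaussian the MLE has a closed form (the sample mean and the divide-by-n standard deviation), so the numerical answer can be checked against it:

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=500)    # step 1: assume a Gaussian

    def negative_log_likelihood(params):               # step 2: maximise the likelihood
        mu, sigma = params                             # (minimise its negative)
        return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

    result = minimize(negative_log_likelihood, x0=[0.0, 1.0],
                      bounds=[(None, None), (1e-6, None)])
    print(result.x)                                    # numerical MLE
    print(data.mean(), data.std())                     # closed-form MLE, ~5.0 and ~2.0
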
38
Q

Maximum a Posteriori MAP

A

A Bayesian method to estimate the parameters of a probability distribution by maximising the posterior distribution. This assumes a prior distribution over the parameters and updates it with the data using Bayes’ rule; the estimate θ is the mode of the resulting posterior.
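
Stated as formulas, with x the observations and θ the parameters:

    θ_MLE = argmax_θ p(x | θ)
    θ_MAP = argmax_θ p(θ | x) = argmax_θ p(x | θ) p(θ)

The extra factor p(θ) is the prior.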

39
Q

MAP vs MLE

A

MAP is just MLE with a uniform prior. MLE works well if the assumed distribution you are estimating is correct and you have a large number of observations n, but it can be very wrong when your assumed distribution is wrong or n is small (it can overfit). MAP works well when your prior is accurate, but it can be very wrong when n is small and your prior is wrong.

40
Q

cloud computing

A

Cloud computing is the delivery of on-demand computing services over the Internet with pay-as-you-go pricing. Examples: AWS, Microsoft Azure, Google Cloud Platform, IBM Cloud, Salesforce and Alibaba Cloud.

41
Q

Half life of data

A

The usefulness of data decays over time, and being aware of what kind of data you are working with is important in determining how often you should update your data and models.