Machine learning Flashcards

Question 1

Q

What are parametric methods?

Answer

A

Parametric statistical procedures rely on assumptions about the shape of the distribution in the underlying population (e.g., that it is normal) and about the form or parameters (e.g., means and standard deviations) of the assumed distribution. Advantage: restrictive models are much more interpretable.
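
As an illustration of the parametric idea, here is a minimal sketch that assumes the relationship has the form y = b0 + b1 * x and estimates only those two parameters by least squares. The function name `fit_line` and the toy data are made up for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for the assumed form y = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Toy data lying exactly on y = 1 + 2x (invented for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

b0, b1 = fit_line(xs, ys)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

The whole fitted model is summarised by two numbers, which is what makes such a restrictive model easy to interpret.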

Question 2

Q

What are non-parametric methods?

Answer

A

Nonparametric statistical procedures rely on no or few assumptions about the shape or parameters of the population distribution from which the sample was drawn. Advantage (compared to parametric methods): they may accurately fit a wider range of possible shapes for f, the unknown function that relates the predictors to the response. Disadvantage: a very large number of observations is required in order to obtain an accurate estimate of f.
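
One common non-parametric method is k-nearest-neighbours regression, sketched below under the assumption of a single predictor. It makes no assumption about the shape of f and simply averages the responses of the closest training points. The function name `knn_predict` and the data are hypothetical.

```python
def knn_predict(x_new, xs, ys, k=3):
    """Predict y at x_new as the average response of the k nearest training points."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

# Toy data following the non-linear relationship y = x^2 (invented for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]

pred = knn_predict(2.0, xs, ys, k=3)
print(f"prediction at x = 2.0: {pred:.3f}")
```

With only six observations the prediction at x = 2 is noticeably off the true value of 4, which hints at why such methods need many observations to estimate f well.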

Question 3

Q

What is supervised learning?

Answer

A

For each observation of the predictor measurement(s) x_i, i = 1, …, n, there is an associated response measurement y_i.
Supervised (examples):
• Linear regression
• Logistic regression
• Support vector machines
• Neural Networks
• Collaborative filtering (methods that try to fill in missing values, e.g., Netflix ratings)
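
To make the supervised setting concrete, the sketch below fits a one-predictor logistic regression by plain gradient descent on labelled pairs (x_i, y_i). The learning rate, step count, function names and data are all assumptions chosen for illustration, not a recommended configuration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=1000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Each predictor value x_i comes with a known response y_i (the "supervision")
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * -2.0 + b)   # estimated probability of class 1 at x = -2
p_high = sigmoid(w * 2.0 + b)   # estimated probability of class 1 at x = 2
print(f"p(y=1 | x=-2) = {p_low:.3f}, p(y=1 | x=2) = {p_high:.3f}")
```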

Question 4

Q

What is unsupervised learning?

Answer

A

We observe a vector of measurements x_i, i = 1, …, n, but no associated response y_i.
Unsupervised (examples):
• Clustering
• Principal component analysis (PCA)
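
As an example of clustering, here is a minimal one-dimensional k-means sketch. There is no response y_i. The algorithm only groups the x_i around centres that it updates itself. The starting centres, the function name `kmeans_1d` and the data are assumptions for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest centre and re-averaging the centres."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[idx].append(p)
        # Keep the old centre if a cluster ends up empty
        centers = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
    return centers, clusters

# Unlabelled measurements that visibly form two groups (invented data)
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]

centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print("centres:", centers)
print("clusters:", clusters)
```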

Question 5

Q

What is the Bias-variance tradeoff?

Answer

A

We want low variance and low bias at the same time. However, as a general rule, when variance decreases, bias increases, and vice versa.
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X1, X2, ..., Xp. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f.
Variance refers to the amount by which the estimate of f would change if it were estimated using a different training data set.
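
The tradeoff can be seen in a small simulation. The sketch below assumes a made-up true function f(x) = 4x^2 and Gaussian noise, then estimates f at x0 = 1 many times with two deliberately extreme methods. The rigid one is the overall mean of the responses and the flexible one is the response of the single nearest training point. All names and settings are hypothetical.

```python
import random
from statistics import mean, pvariance

def true_f(x):
    return 4.0 * x * x  # hypothetical true function, unknown in practice

def simulate(reps=2000, noise_sd=1.0, seed=0):
    """Return (bias, variance) of each estimator of f(x0) over repeated training sets."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(11)]  # fixed design: 0.0, 0.1, ..., 1.0
    x0 = 1.0
    rigid, flexible = [], []
    for _ in range(reps):
        # Draw a fresh training set with the same x values but new noise
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        rigid.append(mean(ys))     # ignores x entirely: very simple model
        flexible.append(ys[-1])    # response of the training point nearest to x0
    target = true_f(x0)
    return {
        "rigid": (mean(rigid) - target, pvariance(rigid)),
        "flexible": (mean(flexible) - target, pvariance(flexible)),
    }

result = simulate()
bias_rigid, var_rigid = result["rigid"]
bias_flex, var_flex = result["flexible"]
print(f"rigid:    bias = {bias_rigid:.2f}, variance = {var_rigid:.2f}")
print(f"flexible: bias = {bias_flex:.2f}, variance = {var_flex:.2f}")
```

Under these assumptions the rigid estimator should show a large bias but a small variance, and the flexible one the reverse.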

Question 6

Q

What is Quality of fit?

Answer

A

There is no free lunch in statistics: no one method dominates all others over all possible data sets.
An important task is to decide, for any given data set, which method produces the best results.
Selecting the best approach can be one of the most challenging parts of performing machine learning in practice.
In order to evaluate the performance of a machine learning method, we need to quantify the extent to which the predicted response value is close to the true response.
In regression, a common measure is the mean squared error (MSE), which we can compute using our training data (the training MSE).
However, we are not really interested in whether our method works on the training data.
Rather, we are interested in how well it works on previously unseen test data (the test MSE).
Suppose that we are interested in developing an algorithm to predict stock prices based on previous stock returns.
We can train the method using stock returns from the past 6 months.
However, we are not interested in predicting last week's stock return or price.
We are interested in predicting next week's prices or returns.
How can we go about trying to select a method that minimizes the test MSE?
In some settings, we may have a separate test data set (a set of observations that we did not use to train the model).
We can then simply evaluate our model on the test observations (compute the MSE on the test data).
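
The difference between training and test MSE can be sketched as follows. The example assumes a 1-nearest-neighbour predictor, which memorises the training data, and small invented training and test sets. The helper names `mse` and `nearest_neighbour_predict` are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of (true - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def nearest_neighbour_predict(x_new, xs, ys):
    """Return the response of the single closest training observation."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x_new))
    return ys[i]

# Observations used to fit the model (invented data)
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [1.1, 1.9, 3.2, 3.9]
# Held-out observations that were not used for training
test_x = [1.4, 2.6, 3.7]
test_y = [1.5, 2.5, 3.6]

train_pred = [nearest_neighbour_predict(x, train_x, train_y) for x in train_x]
test_pred = [nearest_neighbour_predict(x, train_x, train_y) for x in test_x]

train_mse = mse(train_y, train_pred)
test_mse = mse(test_y, test_pred)
print(f"training MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The training MSE here is zero because the model reproduces the data it has already seen, while the test MSE is not, so the training MSE alone would give a misleading picture of the method's quality.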

Question 7

Q

How can machine learning be included in your research?

Answer

A

It could be used to identify which consumers use the deposit return system; based on that, advertising could be targeted towards these consumers.
It could be used for prediction: would the people using it now also use it in two years (long-term change)?
It could predict who is likely to use it, based on existing users with similar characteristics.
However, this might be quite hard, given that ML is usually based on big data and the act of depositing bottles is a very analogue process, at least in Denmark: there are no technical touchpoints that connect the consumer to the behavior in online data.
A solution to this could be making the deposit return system app-based in the US.
For that reason, it is even more important to do consumer research projects such as ours on the topic, and afterwards to branch into more descriptive forms of data, to be able to determine which consumers to target and how.

(7 cards)