Machine learning Flashcards
Explain Parametric?
Parametric statistical procedures rely on assumptions about the shape of the distribution (i.e., assume a normal distribution) in the underlying population and about the form or parameters (i.e., means and standard deviations) of the assumed distribution. Advantage: Restrictive models are much more interpretive.
Explain Non-parametric?
Nonparametric statistical procedures rely on no or few assumptions about the shape or parameters of the population distribution from which the sample was drawn. Advantage (compared to parametric methods): They may accurately fit a wider range of possible shapes for f. Disadvantage: A very large number of observations is required in order to obtain an accurate estimate of f
Explain Supervised?
For each observation of the predictor measurement(s)π₯i, π=1,β¦,π there is an associated response measurement π¦i
Supervised (examples):
β’ Linear regression
β’ Logistic regression
β’ Support vector machines
β’ Neural Networks
β’ Collaborative filtering (Methods that try to fill in the missing values e.g. Netflix ratings)
What is Unsupervised?
We observe a vector of measurements π₯i, π=1,β¦,π, but no associated response π¦i
Unsupervised (examples):
β’ Clustering
β’ PCA
What is the Bias-variance tradeoff?
- We want a low variance and a low bias at the same time. However, when variance decreases, bias increases and vice versa.
- Bias refers to the error that is introduced by approxi- mating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X1, X2, . . . , Xp. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing lin- ear regression will undoubtedly result in some bias in the estimate of f
- Variance is the amount that the estimate of the target function will change given different training data
What is Quality of fit?
- There is no free lunch in statistics β> No one method dominates all others over all possible data sets
- Important task to decide for any given set of data which method produces the best results
- Selecting the best approach can be one of the most challenging parts of performing machine learning in practice
- In order to evaluate the performance of a machine learning method, we need to quantify the extent to which the predicted response value is close to the true response
- We compute the MSE by using our training data
- However, we are not interested whether our method works on the training data
- Rather we are interested how it works on our test data
- Suppose that we are interested in developing an algorithm to predict stock prices based on previous stock returns.
- We can train the method using stock returns from the past 6 months
- However, we are not interested in predicting last weeks stock return / price
- We are interested in predicting next weeks prices / returns
- How can we go about trying to select a method that minimizes the test MSE?
- In some settings, we may have a separate test data set (set of observations which we did not use to train the model)
- We can then simply evaluate our model on the test observations (compute the MSE of the test data)
How can machine learning be included in your research?
- It could be used to figure out which consumers who use the deposit return system, and then based on that, it could be used to target advertising towards these consumers
- Used for prediction - would the people using it now also use it in two years? - long-term change
- Predict who wants to use it, based on previous users with similar characteristics, who are using it
- However, it might be quite hard, given that ML is usually based on big data, and the act of depositing bottles is a very analogue process, at least in Denmark, because you donβt have any technical touchpoints that connect the consumer to the behavior on an online data basis.
- A solution to this could be making the Deposit return system app based in the US.
- For that reason, it is even more important to do consumer research projects as ours on the topic, and even branch into even more descriptive data forms after β to be able to determine which consumers to target, and how