Week 8 DSE Flashcards

1
Q

What is machine learning?

A

A field that develops algorithms designed to be applied to datasets, with the main focus being prediction, classification, clustering, or grouping tasks.

2
Q

What is yi in (yi,xi)?

A

dependent variable (or response variable)

3
Q

What is xi in (yi, xi)?

A

P-dimensional vector of independent variables or covariates (in ML speak: features).

4
Q

For a high dimensional dataset, how does p relate to N?

A

P ≫ N

The number of potential variables that we can use in the model is far larger than the number of observations.

5
Q

What do P and N stand for?

A

P: number of potential variables
N: number of observations

6
Q

What is a supervised learning algorithm?

A

Uses a training dataset (i.e., the estimation sample) (yi, xi), i = 1, 2, …, N to determine the conditional prediction (or forecast) rule Ŷ(X).

7
Q

When yi is continuous, it is called a ________problem; when it is categorical, it is called a _________ problem.

A

regression
classification

8
Q

What is an unsupervised learning algorithm?

A

uses observations xi, i = 1,2,…,N of a random P-dimensional vector X with joint density p(X) to infer some properties of p(X).

It tries to infer some structure within the dataset.

9
Q

What is reinforcement learning?

A

Algorithm gets told when the answer is wrong, but has no feedback on how to correct it

It has to explore different possibilities until it works out how to get the answer right.

10
Q

What is the forecast distribution?

A

The cumulative distribution function (S-shaped):
F(y) = P(Y ≤ y)
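A quick sketch (hypothetical sample values, not from the notes) of how such an S-shaped distribution function can be estimated from data by an empirical CDF:

```python
# Hypothetical sample of outcomes (illustrative values only).
sample = [1.2, 0.7, 2.5, 1.9, 0.3, 1.1, 2.0, 1.5]

def ecdf(sample, y):
    """Empirical estimate of F(y) = P(Y <= y): the fraction of the
    sample less than or equal to y."""
    return sum(1 for v in sample if v <= y) / len(sample)

# F is nondecreasing, 0 below all observations and 1 above them (S-shaped).
print(ecdf(sample, 0.0))  # 0.0
print(ecdf(sample, 1.5))  # 0.625
print(ecdf(sample, 3.0))  # 1.0
```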

11
Q

What is Ŷ?

A

point forecast
best guess for the unknown value

12
Q

What is the formula for forecast error?

A

The difference between the actual value and the forecast:
e = Y − Ŷ

13
Q

Why will there always be forecast error if Y is a continuous random variable?

A

Because for a continuous random variable, the probability of hitting any single value exactly is 0:

P(Y = y) = 0
(the integral of the density over a single point is 0)

14
Q

What is used to calculate the cost of forecast error?

A

loss function L(e)

It can also be written as L(Y, Ŷ).

15
Q

What is a forecast?

A

An action that must be constructed given the loss function and the forecast (predictive) distribution.

16
Q

What are the conditions for an appropriate loss function

A

i) L(0) = 0 (minimum loss is 0: when the error is 0, we get exactly the correct answer)

ii) L(e) ≥ 0 for all e

iii) Nonincreasing in e for e < 0, nondecreasing in e for e > 0: L(e1) ≤ L(e2) if e2 < e1 < 0; L(e1) ≤ L(e2) if e2 > e1 > 0. (Never be rewarded for making a larger error.)

17
Q

What are common choices for the loss function, and what do they have in common? Are they symmetric, and what does that mean?

A

Quadratic (L(e) = e^2) and absolute (L(e) = |e|).

Both are symmetric: penalize positive and negative errors of the same magnitude in the same way.

18
Q

What is the difference between the two common loss functions?

A

Quadratic loss penalizes large errors much more severely than small errors.
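These two properties can be sketched in a few lines (function names are mine, not the notes'):

```python
# Sketch of the two common loss functions from the cards.
def quadratic_loss(e):
    return e ** 2

def absolute_loss(e):
    return abs(e)

# Both are symmetric: positive and negative errors of the same
# magnitude are penalized in the same way.
print(quadratic_loss(2.0) == quadratic_loss(-2.0))  # True
print(absolute_loss(2.0) == absolute_loss(-2.0))    # True

# Quadratic loss penalizes large errors much more severely:
# doubling the error doubles absolute loss but quadruples quadratic loss.
print(absolute_loss(4.0) / absolute_loss(2.0))    # 2.0
print(quadratic_loss(4.0) / quadratic_loss(2.0))  # 4.0
```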

19
Q

What is the purpose of having asymmetric loss functions?

A

Useful when it is safer to overestimate than to underestimate (or vice versa).

20
Q

What are positive and negative errors?

A

If error is positive, you undershot
If error is negative, then you overshot

21
Q

What is risk?

A

expected loss

22
Q

What is the optimal forecast under quadratic loss?

A

mean of F(y)
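This can be checked numerically. A small sketch (hypothetical skewed sample standing in for draws from F(y)) showing that no other candidate point forecast beats the mean on average quadratic loss:

```python
# Hypothetical skewed sample standing in for draws from F(y).
sample = [0.0, 1.0, 1.0, 2.0, 10.0]
mean = sum(sample) / len(sample)  # 2.8

def avg_quadratic_loss(forecast, data):
    """Average quadratic loss L(e) = e^2 over the data."""
    return sum((y - forecast) ** 2 for y in data) / len(data)

# No other candidate point forecast does better than the mean:
for candidate in [0.0, 1.0, 2.0, 5.0]:
    assert avg_quadratic_loss(mean, sample) <= avg_quadratic_loss(candidate, sample)
print("mean", mean, "minimizes average quadratic loss")
```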

23
Q

We can improve our forecast further by ________________________

A

conditioning on predictors

24
Q

What happens when you allow the model to tailor itself to noise specific to the data?

A

overfitting

25
Q

What is the goal of ML from predictive modelling?

A

To “learn” f(X) from data in a way that yields good out-of-sample forecasting performance, i.e., a forecasting rule that minimizes expected loss.

In other words: get a good estimate of E(Y|X).

26
Q

What is the optimal conditional point forecast?

A

The optimal conditional point forecast Ŷ(X) is the function of the conditional predictive distribution F(y|x) which minimizes the risk (expected loss).

27
Q

What is the optimal conditional point forecast for quadratic loss?

A

conditional mean

28
Q

What can traditional nonparametrics be used to help fit?

A

Flexible nonlinear functions (kernels, series), for estimating the structure of the data.

29
Q

For modelling, we need to trade off _____ with __________

A

signal extraction (bias)
overfitting (variance).

30
Q

What is the problem, and its solution, with using the Bayes classifier?

A

Problem: must know the true F(y|x), so it is unattainable.

Solution: estimate F(y|x) and classify based on the highest estimated conditional probability.

31
Q

What is the difference between parametric and nonparametric?

A

Parametric: “fixed” model, number of parameters fixed, faster computation, but stronger assumption on F(y|x)

Nonparametric: “flexible” model, number of parameters may grow with available data, computation becomes harder (or intractable) with larger datasets.

32
Q

Is knn parametric or nonparametric?

A

Nonparametric (a lazy / instance-based learner).

33
Q

What are the steps for KNN with K = 3?

A

Calculate the Euclidean distance from the test point to each training point, and place the 3 closest in the set of k nearest neighbours.

Estimate the conditional class probabilities from the neighbours' classes.

Classify the test observation to the class with the highest estimated probability.

(FIND THE K MOST SIMILAR POINTS AND LET THEM VOTE.)
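As a toy illustration of these steps (hypothetical data and labels, not from the notes), a minimal KNN classifier with K = 3 might look like:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical training data: (feature vector, class label).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]

def knn_classify(x, train, k=3):
    # 1. Euclidean distance from the test point to every training point;
    #    keep the K nearest as the neighbour set.
    neighbours = sorted(train, key=lambda point: dist(x, point[0]))[:k]
    # 2. Estimate conditional class probabilities from the neighbours...
    votes = Counter(label for _, label in neighbours)
    # 3. ...and classify to the class with the highest estimated probability.
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9), train))  # A
print(knn_classify((4.1, 4.0), train))  # B
```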

34
Q

In order to make a prediction we need to ____________

A

look at the entire training data

35
Q

The prediction step for KNN is _______________ (speed).

Why is that?

A

slow (need to sort through the data every time to determine the nearest points)

36
Q

What must we watch out for with some nonparametric methods?

A

SCALING (esp KNN)

37
Q

Why is scaling important for KNN? How to scale?

A

KNN typically uses Euclidean distance, so we would get a very different answer depending on the scaling of X (e.g., income in SGD vs. in thousands of SGD).

Standardize, or scale to [0, 1].
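A minimal sketch of the two scalings, on hypothetical income data (illustrative values only):

```python
# Hypothetical incomes in SGD, illustrating the unit-dependence problem.
incomes = [30000.0, 45000.0, 60000.0, 120000.0]

# Standardize: subtract the mean, divide by the standard deviation.
mean = sum(incomes) / len(incomes)
std = (sum((x - mean) ** 2 for x in incomes) / len(incomes)) ** 0.5
standardized = [(x - mean) / std for x in incomes]

# Scale to [0, 1]: subtract the minimum, divide by the range.
lo, hi = min(incomes), max(incomes)
scaled = [(x - lo) / (hi - lo) for x in incomes]

# Either way, distances no longer depend on the original units.
print(scaled)  # first value 0.0, last value 1.0
```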

38
Q

What must we watch out for with Euclidean distance calculation? (i.e., when does it not make sense to use it?)

A

CATEGORICAL PREDICTORS

Solution: assign a loss of zero if the predicted class coincides with the test observation's class, and 1 if not.

39
Q

What is 0-1 loss?

A

Assign a loss of zero if the classes coincide, and 1 if not.

(1 can be replaced by a higher number for some mismatches based on domain knowledge.)
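A tiny sketch of the 0-1 loss used as a "distance" between categorical values (function name and example categories are mine):

```python
def zero_one_distance(a, b, mismatch_cost=1.0):
    """0 if the categories coincide, a positive cost otherwise.
    The default cost of 1 can be raised for some mismatches
    based on domain knowledge."""
    return 0.0 if a == b else mismatch_cost

print(zero_one_distance("red", "red"))   # 0.0
print(zero_one_distance("red", "blue"))  # 1.0
print(zero_one_distance("red", "blue", mismatch_cost=2.0))  # 2.0
```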

40
Q

Do we need to standardize variables for KNN in R?

A

NO

41
Q

What is good about KNN for regression tasks?

A

It readily extends to predicting continuous variables: predict with the average of the K nearest neighbours' y-values.

42
Q

How should you choose the k value if you have many predictors?

A

Use the out-of-sample MSE.

43
Q

How do we discriminate between models? What are the steps?

A

using data not used in estimation to compute out of sample risk estimate

  1. Training sample: data used to estimate the prediction rule(s).
  2. Validation sample: data used to test estimated rule(s) on NEW data (a.k.a. hold-out sample).
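The two-sample procedure can be sketched end to end (hypothetical 1-D data; KNN regression as the prediction rule, K chosen by out-of-sample MSE):

```python
# Hypothetical (x, y) pairs split into the two samples.
train = [(0.0, 0.1), (1.0, 1.2), (2.0, 1.9), (3.0, 3.2),
         (4.0, 3.8), (5.0, 5.1), (6.0, 6.2), (7.0, 6.8)]
validation = [(1.5, 1.6), (3.5, 3.4), (5.5, 5.6)]

def knn_predict(x, train, k):
    """KNN regression: average the y-values of the k nearest x's."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in neighbours) / k

def out_of_sample_mse(k):
    """Out-of-sample risk estimate: MSE on the hold-out sample."""
    errors = [(y - knn_predict(x, train, k)) ** 2 for x, y in validation]
    return sum(errors) / len(errors)

# Discriminate between the candidate rules (values of K) by their
# out-of-sample MSE, computed on data not used in estimation.
best_k = min([1, 2, 4, 8], key=out_of_sample_mse)
print("best K:", best_k)
```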
44
Q

We should allow the algorithm to minimize the loss over the training set. True/False

Why?

A

FALSE

Any function that passes through all the data points would set the training loss to 0.

This would likely result in terrible out-of-sample performance: such a model would be overfit to the training data and generalize poorly, since it chases noise specific to the sample.

45
Q

What is the solution to overfitting?

A

introduce some penalty for complexity – this is referred to as regularization. In this example, regularization comes in through the choice of K.

46
Q

Generally, _________decreases with complexity, but ___________increases.

A

bias
variance

47
Q

What are the roles of conditioning?

A

By conditioning on available information we can make the forecasts more accurate.

Conditioning reduces the risk of the forecast.

Ignoring estimation error considerations, conditioning on more information is always better in the sense of reducing risk.

48
Q

What is the optimal point forecast under absolute loss? What loss function do we normally use? Why do we not use nonstandard losses?

A

Under absolute loss, the optimal point forecast is the median.

Other loss functions lead to different solutions. Working out optimal forecasts under nonstandard losses could be tricky.

Sometimes we do not have an explicit loss function, so we take the simplest approach and use the quadratic loss.

In some real-world applications (e.g., policymaking) a very explicit loss function may arise. In that case, it would be best to use the estimators and forecasts tailored to the loss.
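The median's optimality under absolute loss can also be checked numerically (hypothetical skewed sample, as in the quadratic-loss case):

```python
# Hypothetical skewed sample standing in for draws from F(y).
sample = sorted([0.0, 1.0, 1.0, 2.0, 10.0])
median = sample[len(sample) // 2]  # 1.0
mean = sum(sample) / len(sample)   # 2.8

def avg_abs_loss(forecast, data):
    """Average absolute loss L(e) = |e| over the data."""
    return sum(abs(y - forecast) for y in data) / len(data)

# The median incurs lower average absolute loss than the mean,
# which is instead the optimum under quadratic loss.
print(avg_abs_loss(median, sample))
print(avg_abs_loss(mean, sample))  # larger
```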

49
Q

What is the goal of most predictive modelling?

A

Get a good estimate of E(Y|X)

50
Q

What is the approach to modelling E(Y|X), especially in econometrics?

A

Traditional parametrics (most econometric models assume E(Y|X) is approximately linear, at least in parameters) or traditional nonparametrics (fit flexible nonlinear functions),

while trading off signal extraction (bias) with overfitting (variance).

51
Q

What are the approaches to classification?

A

Parametric and nonparametric.

52
Q

What is the problem of the Bayes classifier, and the solution?

A

Problem: must know the true F(y|x), so it is unattainable.
Solution: estimate it and classify based on the highest estimated CONDITIONAL PROBABILITY.

53
Q

Why is knn called lazy learner?

A

It doesn’t produce any model, and hence no “understanding” of how X relates to Y: it just lets the K most similar training data points “vote” on the class of the test observation.

54
Q

What is the issue of categorical predictors for knn? How to solve it?

A

Euclidean distance doesn’t make much sense when applied to categorical predictors. If we assign numbers to the different categories, the distance would differ depending on which categories are assigned to which numbers, although all mismatches should incur the same error.

Solution: 0-1 loss

55
Q

Can 1 be replaced by higher number in 0-1 loss?

A

Yes depending on domain knowledge

56
Q

What is regularization? How to do it in knn?

A

A penalty for complexity, which restricts the choice of available f(X).

In KNN, it comes in through the choice of K.

57
Q

How does the fitted line for a higher K value look compared to a lower K value?

A

Very stable from sample to sample: low variance.
Misses many data points: high bias.