Topic 3: Machine Learning: Regression, Support Vector Machine & Time Series Models Flashcards
Define information
A quantity which reduces uncertainty about something
Define prediction in the context of data science
A formula for estimating an unknown value of interest: the target
Compare and contrast predictive modeling with descriptive modeling.
Predictive modelling tries to estimate the value of the target, while descriptive modelling tries to gain insight into the underlying phenomenon or process.
Define attributes or features
Attributes or features are selected variables used as input to estimate the value of the target variable. In database terminology these are the columns (instances or feature values are the rows).
Describe model induction
The procedure that creates the model from the data is called the induction algorithm or learner.
Induction = generalizing from specific cases to general rules
Contrast induction with deduction
Deduction starts with general rules and specific facts and creates other specific facts from them.
Define the training data and labeled data.
Training data is the input data for the induction algorithm. It is called labeled data because the value of the target variable is known for each instance.
Describe supervised segmentation
To determine which are the most informative attributes (columns) when predicting the value of the target you can use supervised segmentation.
List the complications arising from selecting informative attributes.
- Attributes rarely split a group perfectly
- Not all attributes are binary
- Some attributes take on numeric values
When is a segmented group considered pure?
If every member of the group has the same value for the target, then the group is pure.
What do you call the outcome of a formula that evaluates how well each attribute splits a set of examples into segments?
purity measure or splitting criterion (most common one is information gain which is based on entropy)
Define entropy
Entropy measures the general disorder of a single set and corresponds to how mixed (impure) the segment is with respect to properties of interest.
high mix = high impurity = high entropy
Calculate the value of entropy
Parent set = 10, 7 non-write off, 3 write off
P(non-write off) = 7/10 = 70%
P(write off) = 3/10 = 30%
entropy = -[0.7 × log2(0.7) + 0.3 × log2(0.3)] ≈ 0.88
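A minimal Python sketch of this calculation; the `entropy` helper name is mine, the class counts are the worked example's:

```python
import math

def entropy(class_counts):
    """Shannon entropy of a set, given the count of members per class."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

# Worked example from the card: 7 non-write-offs, 3 write-offs.
print(entropy([7, 3]))  # ~0.881
```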
Define information gain
a measure of how much an attribute improves (decreases) entropy over the whole segmentation it creates.
IG -> change in entropy due to any amount of new information added in
Formula information gain
Parent entropy - (weighted average of children’s entropy)
Calculate information gain for a set of children from a parent set
IG(parent, children) = entropy(parent) - [p(c1) × entropy(c1) + p(c2) × entropy(c2) + …]
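The same formula as a Python sketch; the split of the parent into children [[6, 1], [1, 2]] is a hypothetical example, not from the text:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the weighted average of the children's entropies."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Splitting the 10-instance parent [7, 3] into two hypothetical children.
print(information_gain([7, 3], [[6, 1], [1, 2]]))  # ~0.19
```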
How does entropy relate to information gain?
entropy is a measure of disorder in the dataset, information gain is a measure of the decrease in disorder achieved by segmenting the original data set
Discuss the issues with the numerical variables for supervised segmentation
Does it make sense to create a segment for each number? Numeric values are often discretized by choosing a split point (e.g. larger than or equal to 50%)
Define variance and discuss its application to numeric variables for supervised segmentation.
Variance plays the role of the purity measure for numeric target variables: you can measure information gain as the reduction in variance between the parent and the children.
Define an entropy graph/chart
X-axis: proportion of the dataset; Y-axis: entropy.
The shaded area is the entropy of the segmentation produced by some chosen attribute.
The goal is to decrease the shaded area.
Describe how an entropy chart can be used to select an informative variable.
Select the attribute which decreases the shaded area the most and does so for most of the values
Define a classification tree and decision nodes.
A classification tree (supervised segmentation) starts with a root node with branches to nodes (decision nodes) and ultimately to a terminal node or leaf.
Define a probability estimation tree, and tree induction.
probability estimation tree -> the leaves contain probabilities
tree induction -> at each step, select an attribute to partition the current group into subgroups that are as pure as possible with regard to the target variable (e.g. Oval Body vs. Square Body)
Define a decision surface or decision boundaries.
Lines separating the regions in an instance space (scatterplot)
Describe the relationship between the decision surface and the number of variables.
n variables give an n-dimensional instance space, so the decision surface is an (n-1)-dimensional hyperplane
Define frequency-based estimation of class membership probability
At a leaf with n positives and m negatives, the frequency-based probability of the positive class is n/(n+m)
Describe how Laplace correction is used to modify the probability of a leaf node with few members.
If you have a single observation at a leaf, the frequency-based probability is 100%; the Laplace correction smooths this:
p = (n+1)/(n+m+2)
e.g. one positive and zero negatives gives p = (1+1)/(1+0+2) ≈ 67% instead of 100%. The higher the number of instances at the leaf, the smaller the effect of the Laplace correction.
Define a linear classifier.
A classifier based on a weighted sum of the values of the various attributes
Define a linear discriminant.
The decision boundary used to classify instances x into classes (e.g. + or -)
Describe decision boundaries in 2-dimensions, 3-dimensions, and higher dimensions.
Decision boundaries:
2 dimensions: a line (instances are classified as above or below it)
3 dimensions: a plane
higher dimensions: a hyperplane
Interpret the magnitude of a feature’s weight in a general linear model.
A larger (absolute) weight means the feature matters more for the classification
Describe how linear discriminant functions can be used for scoring and ranking instances.
the output of the function gives a ranking itself (the further away from the decision boundary the more certain the instance belongs to the class)
Describe the objective function of the Support Vector Machine (SVM).
SVM (linear discriminants) fits the fattest bar between the classes (Maximizing margin) and the linear discriminant will be the center line.
Describe the important ideas behind the SVM.
- A margin-maximizing boundary gives leeway for classifying new points that fall near the boundary
- Points on the wrong side of the boundary incur a penalty proportional to how far over they are (or, alternatively, force the margin to change)
Define the hinge-loss function, zero-one loss function, and squared error.
Hinge-loss function: zero for points on the correct side beyond the margin; for points on the wrong side of the margin, the loss increases linearly with the distance from the margin
Zero-one loss function: 0 for correct decision, 1 for incorrect decision
Squared error function: squares the distance (i.e. large mistakes are grossly penalized)
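A minimal sketch of the three losses, assuming the common convention z = y × f(x) (positive when a point is classified correctly) and the margin at z = 1:

```python
def hinge_loss(z):
    # Zero beyond the margin on the correct side; grows linearly past it.
    return max(0.0, 1.0 - z)

def zero_one_loss(z):
    # 0 for a correct decision, 1 for an incorrect one.
    return 0.0 if z > 0 else 1.0

def squared_error(y, y_hat):
    # Squares the residual, so large mistakes are grossly penalized.
    return (y - y_hat) ** 2
```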
Describe the reason for not using a squared loss function in classification problems.
It also penalizes points far on the correct side of the decision boundary
Describe the major drawback of the least-squares regression.
Sensitive to extreme values in the data: because errors are squared, outliers can severely skew the fit
Calculate odds and log odds.
odds: P(event happening)/P(event not happening)
Log odds: log(odds)
-> odds range from 0 to infinity (while probabilities range from 0 to 1); taking the log maps them onto the whole range from minus infinity to plus infinity, symmetric around 0
List the important features of the logistic regression.
- For probability estimation it uses the same linear model as for linear discriminants
- Output is interpreted as the log-odds of class membership
- Log odds can be translated directly into the probability of class membership
Calculate class probability using the logistic function.
p(x) = 1 / (1 + e^(-f(x)))
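A quick Python sketch of the logistic function:

```python
import math

def logistic(fx):
    """Map the linear model output f(x) (the log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-fx))

print(logistic(0.0))  # 0.5: exactly on the decision boundary
print(logistic(2.0))  # ~0.88: well inside the positive class
```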
Describe the shape of the logistic function.
An S-shaped curve (the sigmoid function)
Compare and contrast classification trees with linear classifiers.
- A classification tree uses decision boundaries that are perpendicular to the instance-space axes; a linear classifier's boundary can take any direction or orientation
- A classification tree is a "piecewise" classifier that cuts the instance space up into smaller regions; a linear classifier places a single decision surface through the entire space
Define the residual and the residual sum of squares (RSS).
- the residual is the difference between y and y hat
- RSS is the sum of all squared residuals
Calculate the value of RSS.
error1^2 + error2^2 + error3^2 + … + errorN^2
Calculate the least-squares coefficient estimates.
Least-squares chooses the intercept (b0) and slope (b1) that minimize the RSS:
b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)^2, b0 = ȳ - b1·x̄
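The same closed-form estimates as a Python sketch (the data below is made up for illustration):

```python
import numpy as np

def least_squares(x, y):
    """Simple linear regression coefficients that minimize the RSS."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Hypothetical data: y is roughly 2 + 3x plus noise.
x = [0, 1, 2, 3, 4]
y = [2.1, 5.2, 7.9, 11.1, 13.8]
print(least_squares(x, y))  # close to (2, 3)
```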
Interpret the least-squares coefficients.
b0 - intercept - expected value of Y when X = 0
b1 - slope - avg. increase in Y for a unit rise in X
error term = a catch-all for everything this simple model misses
Define the population regression line and least-squares line.
population regression line: best linear approximation to the true relationship of X and Y. (unobserved)
least-squares line: the best estimate of that relationship based on the observed data
Define the concept of bias and unbiased estimators.
an unbiased estimator does not systematically over- or underestimate the true parameter: averaged over a large number of sets of observations, the statistics approach the population value
statistics (b0, b1, the average) calculated on a single small set of observations may under- or overestimate the population statistic
Define standard error
Standard error (SE^2 = variance/n) is the average amount that the estimate differs from the actual value (the larger n, the smaller the standard error).
Define residual standard error
the estimate of the standard deviation of the error term, computed from the observed data: RSE = sqrt(RSS/(n-2))
Calculate the 95% confidence interval.
b1 ± 2 × SE(b1)
approximately 95% chance that the interval will contain the true value of b1
Describe null and alternative hypotheses
null Hypothesis: No relationship between X and Y
alternative Hypothesis: There is some relationship between X and Y
Calculate the t-statistic.
t = (b1 - 0)/SE(b1); this measures the number of standard deviations b1 is away from 0.
Explain the rules for rejecting the null hypothesis using p-values.
a low p-value (typically below 0.05) -> reject the null hypothesis: there is a relationship between X and Y.
Assess the accuracy of linear regression
RSE: an absolute measure of lack of fit, in the units of Y (smaller = better fit)
R^2: the proportion of variability explained, between 0 and 1 (higher = better fit)
Calculate the R2 statistic given TSS and RSS.
R^2 = (TSS - RSS)/TSS = 1 - RSS/TSS
RSS measures the amount of variability left unexplained after performing the regression; TSS measures the total variability in Y.
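A small Python sketch of the computation (the function name is mine):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS: the share of variability explained by the fit."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    rss = np.sum((y - y_hat) ** 2)     # unexplained variability
    tss = np.sum((y - y.mean()) ** 2)  # total variability in Y
    return 1.0 - rss / tss
```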
Interpret the given values of R^2
Close to 1 -> large part of the variability is explained by regression.
Describe the advantages of the R2 statistic over the RSE
Interpretational advantage: R^2 always lies between 0 and 1, while the RSE is measured in the units of Y
Describe the relationship between R2 and correlation
Both measure the linear relationship between X and Y; in simple linear regression, R^2 equals the squared correlation r^2
Define the total sum of squares
The total variance in the response Y: TSS = Σ(yi - ȳ)^2
Describe how the relationship between responses and predictors is tested in multiple linear regression.
H0: β1 = β2 = … = βp = 0; Ha: at least one βj is non-zero
This is tested with an F-statistic
Calculate the F-statistic given TSS, RSS, n, and p.
F = [(TSS - RSS)/p] / [RSS/(n - p - 1)]
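As a Python sketch, with made-up inputs for illustration:

```python
def f_statistic(tss, rss, n, p):
    """F = [(TSS - RSS) / p] / [RSS / (n - p - 1)] for H0: all slopes are zero."""
    return ((tss - rss) / p) / (rss / (n - p - 1))

# Hypothetical fit: 100 observations, 3 predictors.
print(f_statistic(tss=500.0, rss=200.0, n=100, p=3))  # 48.0
```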
Explain how the F-statistic can be used for hypothesis testing
A high F-statistic suggests at least one predictor is related to the Y variable; how large F needs to be depends on n and p.
Explain why the value of the t-statistic can be a misleading indicator of variable importance in a multiple regression
With many predictors, some t-statistics will be large purely by chance, so there is a high chance of incorrectly concluding that an individual variable is related to Y
Describe how to determine the importance of variables in a given multiple regression.
- Mallow’s Cp
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- Adjusted R2
- Plotting residuals
Define forward selection, backward selection, and mixed selection in the context of variable selection in MLR.
Forward selection: start with no variables and repeatedly add the variable whose inclusion results in the lowest RSS.
Backward selection: start with all variables and repeatedly remove the variable with the largest p-value.
Mixed selection: Combination of Forward and Backward selection.
Calculate RSE given the values of RSS, n, and p.
RSE = sqrt[RSS / (n - p - 1)]
Define dummy variables.
Turning a qualitative variable into a numerical one with two possible values, 0 and 1
Describe how to use qualitative variables with more than two levels in multiple regression.
You can create additional dummy variables: a qualitative variable with k levels needs k - 1 dummies, with the remaining level serving as the baseline
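A sketch with pandas (the column name and levels are invented for the example):

```python
import pandas as pd

# Hypothetical 3-level qualitative variable; drop_first keeps k - 1 dummies.
df = pd.DataFrame({"region": ["east", "west", "south", "east"]})
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)  # region_south, region_west; "east" is the baseline level
```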
Describe additive and linear assumptions for the linear regression model
Additive: the association between a predictor and Y does not depend on the values of the other predictors.
linear: the change in Y for a one-unit increase in Xj is constant, whatever the value of Xj
Define the interaction effect
interaction effect = synergy between predictors
Explain when an interaction term should be added to a multiple regression model.
- when the predictors have large main effects
- when the interaction has been shown in earlier studies
Describe the hierarchical principle for multiple regression.
If the interaction between X1 and X2 seems important, we should always include the main effects as well, even if their p-values are large (not significant)
Define polynomial regression.
Regression with polynomial terms (e.g. X, X^2, X^3) added to the linear model so that it can accommodate non-linear relationships
Describe the potential problems for a linear regression model, such as non-linearity, correlation of error terms, a non-constant variance of error terms,
outliers, high-leverage points, and collinearity.
non-linearity: real relationship might not be linear
correlation of error terms: leads to underestimated true standard errors (which can produce too-narrow confidence and prediction intervals)
non-constant variance (heteroskedasticity): can be reduced by taking the log or square root of Y (shrinking large outcomes to better fit the line)
outliers: can have a large effect on R2 and RSE
High-leverage points: observations with an unusual value for the predictor x
collinearity: two or more predictor variables are closely correlated; this increases the standard errors of their coefficients and thus shrinks their t-statistics
Recognize and apply the relationship between arithmetic and geometric returns
When arithmetic returns are small, there is little difference between geometric and arithmetic returns.
As volatility increases and the time period shrinks, the difference grows larger.
Describe the shape of the plotted line when geometric returns are plotted against arithmetic returns.
Arithmetic follows a straight line while geometric returns follow a curve
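A tiny Python check of this shape, assuming geometric return = ln(1 + arithmetic return):

```python
import math

# For small r the log return hugs the straight line r; the curve
# falls away from the line as returns get larger.
for r in [0.01, 0.05, 0.20, 0.50]:
    print(f"arithmetic {r:.2f} -> geometric {math.log(1 + r):.4f}")
```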
Define time resolution and time horizon.
resolution: how densely the data covers time (the finer the resolution, the fatter the tails)
time horizon: the shorter the period, the fatter the tails
Describe a random walk model and an autoregressive model.
random walk: Yt = drift term + Yt-1 + error
AR(1) model: Yt = drift + aYt-1 + error
a = the mean-reversion parameter
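A simulation sketch of both processes (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, drift, a = 500, 0.0, 0.8  # a < 1: mean-reverting; a = 1 recovers the random walk

y_rw = np.zeros(n)  # random walk: Y_t = drift + Y_{t-1} + e_t
y_ar = np.zeros(n)  # AR(1):       Y_t = drift + a * Y_{t-1} + e_t
for t in range(1, n):
    e1, e2 = rng.standard_normal(2)
    y_rw[t] = drift + y_rw[t - 1] + e1
    y_ar[t] = drift + a * y_ar[t - 1] + e2
```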
Define stationarity
stationarity: no trend, and the mean, variance, and autocovariance do not change over time
Recognize and apply the formula for the autocorrelation function.
Yt - Yt-1 = μ + (a - 1)·Yt-1 + et
Describe a GARCH(1,1) model
A mixture of normal distributions with different variances: returns are conditionally normal at each point in time, but the variance changes over time (producing fat tails)
List the conditions that must be satisfied by the parameters of a GARCH(1,1) model
0 ≤ a ≤ 1, 0 ≤ b ≤ 1 (and a + b < 1 for the process to be stationary)
Describe the goodness-of-fit for a GARCH model
By checking the significance of the parameter estimates and how well the model captures the volatility of the process.
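A minimal simulation sketch, assuming the standard GARCH(1,1) variance recursion sigma2_t = omega + a·e_{t-1}^2 + b·sigma2_{t-1} (function and parameter names are mine):

```python
import numpy as np

def simulate_garch11(n, omega, a, b, seed=0):
    """Returns are normal at each step but with a time-varying variance."""
    rng = np.random.default_rng(seed)
    e = np.zeros(n)
    sigma2 = np.full(n, omega / (1 - a - b))  # start at the unconditional variance
    for t in range(1, n):
        sigma2[t] = omega + a * e[t - 1] ** 2 + b * sigma2[t - 1]
        e[t] = rng.standard_normal() * np.sqrt(sigma2[t])
    return e, sigma2

returns, variances = simulate_garch11(1000, omega=0.1, a=0.1, b=0.85)
```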