Topic 3: Machine Learning: Regression, Support Vector Machine & Time Series Models Flashcards

1
Q

Define information

A

A quantity which reduces uncertainty about something

2
Q

Define prediction in the context of data science

A

A formula for estimating an unknown value of interest: the target

3
Q

Compare and contrast predictive modeling with descriptive modeling.

A

Predictive modelling tries to estimate an unknown value (the target), while descriptive modelling tries to gain insight into the underlying phenomenon or process.

4
Q

Define attributes or features

A

Attributes or features are selected variables used as input to estimate the value of the target variable. In database terminology these are the columns; the instances (each a set of feature values) are the rows.

5
Q

Describe model induction

A

The procedure that creates the model from the data is called the induction algorithm or learner.

Induction = generalizing from specific cases to general rules

6
Q

Contrast induction with deduction

A

Deduction starts with general rules and specific facts and creates other specific facts from them.

7
Q

Define the training data and labeled data.

A

Training data is the input data for the induction algorithm. It is called labeled data because the value of the target variable is known for each instance.

8
Q

Describe supervised segmentation

A

To determine which attributes (columns) are the most informative when predicting the value of the target, you can use supervised segmentation.

9
Q

List the complications arising from selecting informative attributes.

A
  • Attributes rarely split a group perfectly
  • Not all attributes are binary
  • Some attributes take on numeric values
10
Q

When is a segmented group considered pure?

A

If every member of the group has the same value for the target, then the group is pure.

11
Q

What do you call the outcome of a formula that evaluates how well each attribute splits a set of examples into segments?

A

A purity measure or splitting criterion (the most common is information gain, which is based on entropy).

12
Q

Define entropy

A

Entropy measures the general disorder of a single set and corresponds to how mixed (impure) the segment is with respect to properties of interest.

high mix = high impurity = high entropy

13
Q

Calculate the value of entropy

A

Parent set = 10: 7 non-write-off, 3 write-off

P(non-write-off) = 7/10 = 70%
P(write-off) = 3/10 = 30%

entropy = -[0.7 x log2(0.7) + 0.3 x log2(0.3)] ≈ 0.88
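
A minimal Python sketch of this calculation (the class counts are the ones from the worked example above):

```python
import math

def entropy(counts):
    """Shannon entropy of a segment given its class counts, e.g. [7, 3]."""
    total = sum(counts)
    # Sum -p * log2(p) over the classes; empty classes contribute 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([7, 3]))  # ~0.881, matching the worked example
```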

14
Q

Define information gain

A

A measure of how much an attribute improves (decreases) entropy over the whole segmentation it creates.

IG -> the change in entropy due to any amount of new information being added

15
Q

Formula information gain

A

Parent entropy - (weighted average of children’s entropy)

16
Q

Calculate information gain for a set of children from a parent set

A

IG(parent, children) = entropy(parent) - [p(c1) x entropy(c1) + p(c2) x entropy(c2) + …]
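
A small sketch of this formula in Python; the split of the 7/3 parent into two children is hypothetical:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """IG(parent, children) = entropy(parent) - sum_i p(c_i) * entropy(c_i)."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split: one pure child of size 6, one mixed child of size 4.
print(information_gain([7, 3], [[6, 0], [1, 3]]))  # ~0.557
```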

17
Q

How does entropy relate to information gain?

A

entropy is a measure of disorder in the dataset, information gain is a measure of the decrease in disorder achieved by segmenting the original data set

18
Q

Discuss the issues with the numerical variables for supervised segmentation

A

Does it make sense to create a segment for each number? Numeric values are often discretized by choosing a split point (e.g. larger than or equal to 50%)

19
Q

Define variance and discuss its application to numeric variables for supervised segmentation.

A

Variance measures the spread of a set of numerical values. For a numeric target, the analogue of information gain is the reduction in variance between the parent and its children.

20
Q

Define an entropy graph/chart

A

X-axis: proportion of the dataset; Y-axis: entropy.
The shaded area is the entropy after segmenting by some chosen attribute.

The goal is to decrease the shaded area.

21
Q

Describe how an entropy chart can be used to select an informative variable.

A

Select the attribute which decreases the shaded area the most and does so for most of the values

22
Q

Define a classification tree and decision nodes.

A

A classification tree (supervised segmentation) starts with a root node with branches to nodes (decision nodes) and ultimately to a terminal node or leaf.

23
Q

Define a probability estimation tree, and tree induction.

A

probability estimation tree -> leaves contain class-membership probabilities
tree induction -> at each step select an attribute to partition the current group into subgroups that are as pure as possible with regard to the target variable (e.g. Oval Body/Square Body)
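
A hedged illustration with scikit-learn (assuming it is available; the toy data is invented): criterion="entropy" grows the tree by information gain, and predict_proba reads class frequencies off the leaves, i.e. a probability estimation tree:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: two numeric features, binary target (invented for illustration).
X = [[25, 50000], [35, 62000], [45, 48000], [52, 91000], [23, 30000], [40, 75000]]
y = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2)
tree.fit(X, y)

# Leaves report class frequencies, so predictions come with probabilities.
print(tree.predict_proba([[30, 55000]]))
```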

24
Q

Define a decision surface or decision boundaries.

A

Lines separating the regions in an instance space (scatterplot)

25
Q

Describe the relationship between the decision surface and the number of variables.

A

n variables give an (n-1)-dimensional hyperplane as the decision surface

26
Q

Define frequency-based estimation of class membership probability

A

At a leaf with n positives and m negatives, the frequency-based probability of the positive class is n/(n+m)

27
Q

Describe how Laplace correction is used to modify the probability of a leaf node with few members.

A

If you have only one observation at a leaf, the raw frequency estimate is 100%; the Laplace correction tempers that:

p = (n + 1) / (n + m + 2)

The more instances a leaf has, the smaller the effect of the Laplace correction.
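
A tiny sketch of the correction, showing how its effect shrinks as the leaf grows:

```python
def laplace_probability(n, m):
    """Laplace-corrected probability of the positive class at a leaf
    with n positives and m negatives: (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

print(laplace_probability(1, 0))   # ~0.667 instead of a raw 1.0
print(laplace_probability(20, 0))  # ~0.955 -- correction fades with more data
```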

28
Q

Define a linear classifier.

A

Weighted sum of the values for the various attributes

29
Q

Define a linear discriminant.

A

Decision boundary where you classify instances of x (e.g. + or -)

30
Q

Describe decision boundaries in 2-dimensions, 3-dimensions, and higher dimensions.

A

Decision boundaries:
2 dimensions = a line (classify by above or below it)
3 dimensions = a plane
higher dimensions = a hyperplane

31
Q

Interpret the magnitude of a feature’s weight in a general linear model.

A

heavier weight = more importance

32
Q

Describe how linear discriminant functions can be used for scoring and ranking instances.

A

the output of the function gives a ranking itself (the further away from the decision boundary the more certain the instance belongs to the class)

33
Q

Describe the objective function of the Support Vector Machine (SVM).

A

The SVM (a linear discriminant) fits the fattest bar between the classes (maximizing the margin); the linear discriminant is the center line of that bar.

34
Q

Describe the important ideas behind the SVM.

A
  • The margin-maximizing boundary gives leeway for classifying new points that fall near the boundary
  • Points on the wrong side of the boundary incur a penalty or force the margin to change
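
A minimal sketch with scikit-learn (assuming it is available; the toy points are invented): SVC(kernel="linear") fits the maximum-margin boundary, and C controls how heavily points on the wrong side are penalized.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (invented for illustration).
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# Large C penalizes margin violations heavily; smaller C widens the "bar"
# at the cost of allowing some points inside or beyond the margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)   # the points that pin down the margin
print(clf.predict([[4, 4]]))
```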
35
Q

Define the hinge-loss function, zero-one loss function, and squared error.

A

Hinge-loss function: zero for points on the correct side of the margin; for points beyond the margin, loss increases linearly with the distance from the margin

Zero-one loss function: 0 for a correct decision, 1 for an incorrect decision

Squared error function: squares the distance (i.e. large mistakes are penalized disproportionately)
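
A short sketch comparing the three losses (conventions assumed: labels y in {-1, +1}, raw classifier output f, margin at 1):

```python
import numpy as np

def hinge_loss(y, f):
    return np.maximum(0, 1 - y * f)          # grows linearly past the margin

def zero_one_loss(y, f):
    return (np.sign(f) != y).astype(float)   # 0 if correct, 1 if not

def squared_loss(y, f):
    return (y - f) ** 2                      # penalizes large errors heavily --
                                             # even far on the *correct* side

f = np.array([-2.0, -0.5, 0.5, 3.0])  # classifier outputs
y = np.array([1, 1, 1, 1])            # all truly positive
print(hinge_loss(y, f))     # [3.  1.5 0.5 0. ]
print(zero_one_loss(y, f))  # [1. 1. 0. 0.]
print(squared_loss(y, f))   # [9.   2.25 0.25 4.  ] -- note f = 3 is punished
```

Note how the squared loss penalizes the confidently correct point (f = 3), which is exactly the drawback raised in the next card.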

36
Q

Describe the reason for not using a squared loss function in classification problems.

A

It also penalizes points far on the correct side of the decision boundary

37
Q

Describe the major drawback of the least-squares regression.

A

It is sensitive to the data: outliers can badly skew the fit.

38
Q

Calculate odds and log odds.

A

odds: P(event happening)/P(event not happening)

Log odds: log(odds)
-> odds range from 0 to infinity (while probabilities sit between 0 and 1); taking the log maps them onto the whole real line, symmetric around 0
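
A quick sketch of the two conversions:

```python
import math

def odds(p):
    return p / (1 - p)         # probabilities (0, 1) map to odds (0, inf)

def log_odds(p):
    return math.log(odds(p))   # odds (0, inf) map to log-odds (-inf, inf)

for p in (0.1, 0.5, 0.9):
    print(p, round(odds(p), 3), round(log_odds(p), 3))
# 0.1 -> odds 0.111, log-odds -2.197
# 0.5 -> odds 1.0,   log-odds  0.0
# 0.9 -> odds 9.0,   log-odds  2.197
```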

39
Q

List the important features of the logistic regression.

A
  • For probability estimation it uses the same linear model as for linear discriminants
  • Output is interpreted as the log-odds of class membership
  • Log odds can be translated directly into the probability of class membership
40
Q

Calculate class probability using the logistic function.

A

p = 1 / (1 + e^(-f(x)))
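
A minimal sketch of the logistic function (numpy assumed; f(x) values are arbitrary):

```python
import numpy as np

def logistic(fx):
    """Class probability from the linear model's output f(x)."""
    return 1 / (1 + np.exp(-fx))

fx = np.linspace(-6, 6, 5)
print(logistic(fx))  # ~0 far below the boundary, 0.5 at f(x) = 0, ~1 far above
```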

41
Q

Describe the shape of the logistic function.

A

S-shaped curve (sigmoid function)

42
Q

Compare and contrast classification trees with linear classifiers.

A
  • A classification tree uses boundaries that are perpendicular to the instance-space axes; a linear classifier's boundary can take any direction or orientation
  • A classification tree is a “piecewise” classifier (it cuts the instance space into smaller regions); a linear classifier places a single decision surface through the entire space
43
Q

Define the residual and the residual sum of squares (RSS).

A
  • the residual is the difference between y and ŷ (the predicted value)
  • RSS is the sum of all squared residuals

44
Q

Calculate the value of RSS.

A

error1^2 + error2^2 + error3^2 + … + errorN^2

45
Q

Calculate the least-squares coefficient estimates.

A

Least squares chooses the intercept (b0) and slope (b1) that minimize RSS:

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)^2
b0 = ȳ - b1 x̄
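
A minimal sketch of these estimates and of RSS (numpy assumed; the data points are invented):

```python
import numpy as np

# Toy data: fit y = b0 + b1 * x by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form estimates: b1 = cov(x, y) / var(x), b0 = ybar - b1 * xbar.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
rss = np.sum(residuals ** 2)   # RSS = e1^2 + e2^2 + ... + eN^2
print(b0, b1, rss)
```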

46
Q

interpret the least-squares coefficients.

A

b0 - intercept - expected value of Y when X = 0
b1 - slope - average increase in Y for a one-unit increase in X
error term - catch-all for everything this simple model misses

47
Q

Define the population regression line and least-squares line.

A

population regression line: the best linear approximation to the true relationship between X and Y (unobserved)

least-squares line: the best estimate of that line based on the observed data

48
Q

Define the concept of bias and unbiased estimators.

A

An unbiased estimator does not systematically over- or underestimate the true parameter: estimates averaged over a large number of observations approach the population statistic.

Statistics (b0, b1, the average) calculated on a single small set of observations may under- or overestimate the population value.

49
Q

Define standard error

A

The standard error is the average amount that the estimate differs from the actual value; its square is variance/n, so the higher n is, the smaller the standard error.

50
Q

Define residual standard error

A

The estimate of the standard deviation of the error term, computed from the observed data: RSE = sqrt(RSS / (n - 2))

51
Q

Calculate the 95% confidence interval.

A

b1 ± 2 x SE(b1)

95% chance that the interval will contain the true value of b1
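
A sketch tying the last few cards together (numpy assumed; it reuses the invented data from the earlier least-squares sketch): compute RSE, SE(b1), and the ~95% interval.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

rss = np.sum((y - b0 - b1 * x) ** 2)
rse = np.sqrt(rss / (len(x) - 2))                    # residual standard error
se_b1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of b1

print(b1 - 2 * se_b1, b1 + 2 * se_b1)                # b1 ± 2 x SE(b1)
print(b1 / se_b1)                                    # t-statistic for H0: b1 = 0
```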

52
Q

Describe null and alternative hypotheses

A

Null hypothesis: there is no relationship between X and Y

Alternative hypothesis: there is some relationship between X and Y

53
Q

Calculate the t-statistic.

A

t = (b1 - 0)/SE(b1); this measures the number of standard deviations b1 is away from 0.

54
Q

Explain the rules for rejecting the null hypothesis using p-values.

A

A low p-value -> reject the null hypothesis (conclude there is a relationship between X and Y).

55
Q

Assess the accuracy of linear regression

A

RSE: an absolute measure of lack of fit, in the units of Y (smaller = better fit)

R^2: the proportion of variance explained, always between 0 and 1 (higher = better fit)

56
Q

Calculate the R2 statistic given TSS and RSS.

A

R^2 = (TSS - RSS)/TSS = 1 - RSS/TSS (explained variation over total variation)

RSS measures the amount of variability that is left unexplained after performing the regression.
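
A minimal sketch of the formula (numpy assumed; y and the fitted values are invented):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS = (TSS - RSS) / TSS."""
    rss = np.sum((y - y_hat) ** 2)       # variability left unexplained
    tss = np.sum((y - y.mean()) ** 2)    # total variability in y
    return 1 - rss / tss

y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
print(r_squared(y, y_hat))  # close to 1 -> most variability explained
```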

57
Q

Interpret the given values of R^2

A

Close to 1 -> large part of the variability is explained by regression.

58
Q

Describe the advantages of the R2 statistic over the RSE

A

Interpretational advantage (between 0 and 1)

59
Q

Describe the relationship between R2 and correlation

A

Both measure the strength of the linear relationship between X and Y; in simple linear regression, R^2 equals the squared correlation r^2.

60
Q

Define the total sum of squares

A

The total variance in the response Y: TSS = Σ(yi - ȳ)^2

61
Q

Describe how the relationship between responses and predictors is tested in multiple linear regression.

A
H0: β1 = β2 = … = βp = 0
H1: at least one βj is non-zero

This is tested with an F-statistic.

62
Q

Calculate the F-statistic given TSS, RSS, n, and p.

A

F = [(TSS - RSS)/p] / [RSS/(n - p - 1)]
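
A direct sketch of this formula (the TSS, RSS, n, and p values below are hypothetical):

```python
def f_statistic(tss, rss, n, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1))."""
    return ((tss - rss) / p) / (rss / (n - p - 1))

# Hypothetical values: 100 observations, 3 predictors.
print(f_statistic(tss=500.0, rss=120.0, n=100, p=3))  # ~101.3
```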

63
Q

Explain how the F-statistic can be used for hypothesis testing

A

A high F-statistic suggests that at least one predictor is related to Y. How large F needs to be depends on n and p.

64
Q

Explain why the value of the t-statistic can be a misleading indicator of variable importance in a multiple regression

A

With many predictors, some t-statistics will appear significant purely by chance (about 5% of them at the 5% level), so there is a high chance of incorrectly concluding that a relationship exists.

65
Q

Describe how to determine the importance of variables in a given multiple regression.

A
  1. Mallow’s Cp
  2. Akaike information criterion (AIC)
  3. Bayesian information criterion (BIC)
  4. Adjusted R2
  5. Plotting residuals
66
Q

Define forward selection, backward selection, and mixed selection in the context of variable selection in MLR.

A

Forward selection: starting from no variables, repeatedly add the variable whose inclusion yields the lowest RSS.

Backward selection: start with all variables and repeatedly remove the variable with the largest p-value.

Mixed selection: a combination of forward and backward selection.

67
Q

Calculate RSE given the values of RSS, n, and p.

A

RSE = sqrt(RSS / (n - p - 1))

68
Q

Define dummy variables.

A

Turning a qualitative variable into a numerical one with two possible values, 0 and 1

69
Q

Describe how to use qualitative variables with more than two levels in multiple regression.

A

Create additional dummy variables: a qualitative variable with k levels needs k - 1 dummy variables (the omitted level serves as the baseline).
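
A brief sketch with pandas (assuming it is available; the column and levels are invented): drop_first=True leaves k - 1 dummies, with the dropped level as the baseline.

```python
import pandas as pd

# A qualitative variable with three levels (invented example).
df = pd.DataFrame({"region": ["east", "west", "south", "east", "south"]})

# 3 levels -> 2 dummy variables; "east" becomes the baseline level.
dummies = pd.get_dummies(df["region"], drop_first=True)
print(dummies)
```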

70
Q

Describe additive and linear assumptions for the linear regression model

A

Additive: the association between a predictor and Y does not depend on the values of the other predictors.

Linear: the change in Y for a one-unit increase in Xj is constant, whatever the value of Xj.

71
Q

Define the interaction effect

A

interaction effect = synergy between predictors

72
Q

Explain when an interaction term should be added to a multiple regression model.

A
  1. when the predictors have large main effects
  2. when the interaction has been demonstrated in earlier studies

73
Q

Describe the hierarchical principle for multiple regression.

A

If the interaction between X1 and X2 seems important, we should always include the main effects as well, even if their p-values are not significant.

74
Q

Define polynomial regression.

A

Extends the linear model with polynomial terms (X^2, X^3, …) of the predictors to accommodate non-linear relationships.
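
A minimal sketch (numpy assumed; the quadratic data is simulated): polynomial regression is still linear in the coefficients, so ordinary least squares fits it.

```python
import numpy as np

# Simulate a quadratic relationship with noise (invented data).
x = np.linspace(-3, 3, 30)
y = 1.0 + 2.0 * x + 0.5 * x**2 + np.random.normal(0, 0.3, x.size)

# Fit y = b0 + b1*x + b2*x^2 by least squares.
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)  # ~[0.5, 2.0, 1.0] (highest degree first)
```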

75
Q

Describe the potential problems for a linear regression model, such as non-linearity, correlation of error terms, a non-constant variance of error terms,
outliers, high-leverage points, and collinearity.

A

non-linearity: the real relationship might not be linear

correlation of error terms: underestimates the true standard errors (can lead to too-narrow prediction intervals)

non-constant variance of error terms (heteroskedasticity): can be reduced by transforming Y with a log or square root (to shrink large outcomes to better fit the line)

outliers: can have a large effect on R^2 and RSE

high-leverage points: observations with an unusual value of a predictor

collinearity: two or more predictors are closely correlated; this increases the standard errors and thus reduces the t-statistics

76
Q

Recognize and apply the relationship between arithmetic and geometric returns

A

When the arithmetic returns are small there will be little difference between geometric and arithmetic returns.

When volatility increases and time decreases the difference grows larger.

77
Q

Describe the shape of the plotted line when geometric returns are plotted against arithmetic returns.

A

Arithmetic follows a straight line while geometric returns follow a curve

78
Q

Define time resolution and time horizon.

A

resolution: how densely the data covers time (the finer the resolution, the fatter the tails)

time horizon: shorter periods = fatter tails

79
Q

Describe a random walk model and an autoregressive model.

A

random walk: Yt = drift + Yt-1 + error

AR(1) model: Yt = drift + a x Yt-1 + error

a controls mean reversion (|a| < 1 pulls the series back toward its long-run mean)
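
A minimal simulation sketch of the two models (numpy assumed; drift and a are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n, drift, a = 500, 0.0, 0.8   # |a| < 1 -> mean reversion

rw = np.zeros(n)   # random walk: Y_t = drift + Y_{t-1} + e_t
ar = np.zeros(n)   # AR(1):       Y_t = drift + a * Y_{t-1} + e_t
for t in range(1, n):
    rw[t] = drift + rw[t - 1] + rng.normal()
    ar[t] = drift + a * ar[t - 1] + rng.normal()

# The AR(1) series hovers near its mean; the random walk wanders off.
print(rw.std(), ar.std())
```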

80
Q

Define stationarity

A

stationarity: the series has no trend, and its mean, variance, and autocovariances do not change over time

81
Q

Recognize and apply the formula for the autocorrelation function.

A

Yt - Yt-1 = μ + (a - 1) x Yt-1 + et

82
Q

Describe a GARCH(1,1) model

A

A mixture of normal distributions with different variances: the conditional variance evolves as σ²_t = ω + a x e²_{t-1} + b x σ²_{t-1}.

83
Q

List the conditions that must be satisfied by the parameters of a GARCH(1,1) model

A

0 ≤ a ≤ 1, 0 ≤ b ≤ 1 (and a + b < 1, so that the long-run variance ω/(1 - a - b) is finite)

84
Q

Describe the goodness-of-fit for a GARCH model

A

Check the significance of the parameter estimates and how well the model captures the volatility of the process.
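
A minimal GARCH(1,1) simulation sketch (numpy assumed; the ω, a, b values are invented but satisfy the conditions above):

```python
import numpy as np

rng = np.random.default_rng(0)
omega, a, b = 0.1, 0.1, 0.85   # a, b in [0, 1]; a + b < 1
n = 1000

e = np.zeros(n)
sigma2 = np.full(n, omega / (1 - a - b))      # start at the long-run variance
e[0] = np.sqrt(sigma2[0]) * rng.normal()
for t in range(1, n):
    # sigma2_t = omega + a * e_{t-1}^2 + b * sigma2_{t-1}
    sigma2[t] = omega + a * e[t - 1] ** 2 + b * sigma2[t - 1]
    e[t] = np.sqrt(sigma2[t]) * rng.normal()  # normal shock, time-varying variance

# Empirical volatility vs the long-run level sqrt(omega / (1 - a - b)).
print(e.std(), np.sqrt(omega / (1 - a - b)))
```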