Topic 3: Machine Learning: Regression, Support Vector Machine & Time Series Models Flashcards
Define information
A quantity which reduces uncertainty about something
Define prediction in the context of data science
A formula for estimating an unknown value of interest: the target
Compare and contrast predictive modeling with descriptive modeling.
Predictive modelling tries to estimate the value of the target, while descriptive modelling tries to gain insight into the underlying phenomenon or process.
Define attributes or features
Attributes or features are selected variables used as input to estimate the value of the target variable. In database terminology these are the columns (instances or feature values are the rows).
Describe model induction
The procedure that creates the model from the data is called the induction algorithm or learner.
Induction = generalizing from specific cases to general rules
Contrast induction with deduction
Deduction starts with general rules and specific facts and creates other specific facts from them.
Define the training data and labeled data.
Training data is the input data for the induction algorithm. It is called labeled data because the value of the target variable is known for each instance.
Describe supervised segmentation
To determine which are the most informative attributes (columns) when predicting the value of the target you can use supervised segmentation.
List the complications arising from selecting informative attributes.
- Attributes rarely split a group perfectly
- Not all attributes are binary
- Some attributes take on numeric values
When is a segmented group considered pure?
If every member of the group has the same value for the target, then the group is pure.
What do you call the outcome of a formula that evaluates how well each attribute splits a set of examples into segments?
purity measure or splitting criterion (most common one is information gain which is based on entropy)
Define entropy
Entropy measures the general disorder of a single set and corresponds to how mixed (impure) the segment is with respect to properties of interest.
high mix = high impurity = high entropy
Calculate the value of entropy
Parent set = 10, 7 non-write off, 3 write off
P(non-write off) = 7/10 = 70%
P(write off) = 3/10 = 30%
entropy = -[0.7 × log2(0.7) + 0.3 × log2(0.3)] ≈ 0.88
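A minimal Python sketch of this calculation; the `entropy` helper name is mine, the class counts are the worked example's:

```python
import math

def entropy(class_counts):
    """Shannon entropy of a set, given the count of members per class."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

# Worked example from the card: 7 non-write-offs, 3 write-offs.
print(entropy([7, 3]))  # ~0.881
```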
Define information gain
a measure of how much an attribute improves (decreases) entropy over the whole segmentation it creates.
IG -> change in entropy due to any amount of new information added in
Formula information gain
Parent entropy - (weighted average of children’s entropy)
Calculate information gain for a set of children from a parent set
IG(parent, children) = entropy(parent) - [p(c1) × entropy(c1) + p(c2) × entropy(c2) + …]
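The same formula as a Python sketch; the split of the parent into children [[6, 1], [1, 2]] is a hypothetical example, not from the text:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the weighted average of the children's entropies."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Splitting the 10-instance parent [7, 3] into two hypothetical children.
print(information_gain([7, 3], [[6, 1], [1, 2]]))  # ~0.19
```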
How does entropy relate to information gain?
entropy is a measure of disorder in the dataset, information gain is a measure of the decrease in disorder achieved by segmenting the original data set
Discuss the issues with the numerical variables for supervised segmentation
Does it make sense to create a segment for each number? Numeric values are often discretized by choosing a split point (e.g. larger than or equal to 50%)
Define variance and discuss its application to numeric variables for supervised segmentation.
Variance plays the role of the purity measure for numeric target variables: you can measure information gain as the reduction in variance between the parent and the children.
Define an entropy graph/chart
X-axis: proportion of the dataset; Y-axis: entropy.
The shaded area is the entropy of the segmentation produced by some chosen attribute.
The goal is to decrease the shaded area.
Describe how an entropy chart can be used to select an informative variable.
Select the attribute which decreases the shaded area the most and does so for most of the values
Define a classification tree and decision nodes.
A classification tree (supervised segmentation) starts with a root node with branches to nodes (decision nodes) and ultimately to a terminal node or leaf.
Define a probability estimation tree, and tree induction.
probability estimation tree -> the leaves contain probabilities
tree induction -> at each step, select an attribute to partition the current group into subgroups that are as pure as possible with regard to the target variable (e.g. Oval Body vs. Square Body)
Define a decision surface or decision boundaries.
Lines separating the regions in an instance space (scatterplot)
Describe the relationship between the decision surface and the number of variables.
n variables give an n-dimensional instance space, so the decision surface is an (n-1)-dimensional hyperplane
Define frequency-based estimation of class membership probability
At a leaf with n positives and m negatives, the frequency-based probability of the positive class is n/(n+m)
Describe how Laplace correction is used to modify the probability of a leaf node with few members.
If you have a single observation at a leaf, the frequency-based probability is 100%; the Laplace correction smooths this:
p = (n+1)/(n+m+2)
e.g. one positive and zero negatives gives p = (1+1)/(1+0+2) ≈ 67% instead of 100%. The higher the number of instances at the leaf, the smaller the effect of the Laplace correction.
Define a linear classifier.
A classifier based on a weighted sum of the values of the various attributes
Define a linear discriminant.
The decision boundary used to classify instances x into classes (e.g. + or -)
Describe decision boundaries in 2-dimensions, 3-dimensions, and higher dimensions.
Decision boundaries:
2 dimensions: a line (instances are classified as above or below it)
3 dimensions: a plane
higher dimensions: a hyperplane
Interpret the magnitude of a feature’s weight in a general linear model.
A larger (absolute) weight means the feature matters more for the classification
Describe how linear discriminant functions can be used for scoring and ranking instances.
the output of the function gives a ranking itself (the further away from the decision boundary the more certain the instance belongs to the class)
Describe the objective function of the Support Vector Machine (SVM).
SVM (linear discriminants) fits the fattest bar between the classes (Maximizing margin) and the linear discriminant will be the center line.
Describe the important ideas behind the SVM.
- A margin-maximizing boundary gives leeway for classifying new points that fall near the boundary
- Points on the wrong side of the boundary incur a penalty proportional to how far over they are (or, alternatively, force the margin to change)
Define the hinge-loss function, zero-one loss function, and squared error.
Hinge-loss function: zero for points on the correct side beyond the margin; for points on the wrong side of the margin, the loss increases linearly with the distance from the margin
Zero-one loss function: 0 for correct decision, 1 for incorrect decision
Squared error function: squares the distance (i.e. large mistakes are grossly penalized)
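A minimal sketch of the three losses, assuming the common convention z = y × f(x) (positive when a point is classified correctly) and the margin at z = 1:

```python
def hinge_loss(z):
    # Zero beyond the margin on the correct side; grows linearly past it.
    return max(0.0, 1.0 - z)

def zero_one_loss(z):
    # 0 for a correct decision, 1 for an incorrect one.
    return 0.0 if z > 0 else 1.0

def squared_error(y, y_hat):
    # Squares the residual, so large mistakes are grossly penalized.
    return (y - y_hat) ** 2
```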
Describe the reason for not using a squared loss function in classification problems.
It also penalizes points far on the correct side of the decision boundary
Describe the major drawback of the least-squares regression.
Sensitive to extreme values in the data: because errors are squared, outliers can severely skew the fit
Calculate odds and log odds.
odds: P(event happening)/P(event not happening)
Log odds: log(odds)
-> odds range from 0 to infinity (while probabilities range from 0 to 1); taking the log maps them onto the whole range from minus infinity to plus infinity, symmetric around 0
List the important features of the logistic regression.
- For probability estimation it uses the same linear model as for linear discriminants
- Output is interpreted as the log-odds of class membership
- Log odds can be translated directly into the probability of class membership
Calculate class probability using the logistic function.
p(x) = 1 / (1 + e^(-f(x)))
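A quick Python sketch of the logistic function:

```python
import math

def logistic(fx):
    """Map the linear model output f(x) (the log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-fx))

print(logistic(0.0))  # 0.5: exactly on the decision boundary
print(logistic(2.0))  # ~0.88: well inside the positive class
```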
Describe the shape of the logistic function.
An S-shaped curve (the sigmoid function)
Compare and contrast classification trees with linear classifiers.
- A classification tree uses decision boundaries that are perpendicular to the instance-space axes; a linear classifier's boundary can take any direction or orientation
- A classification tree is a "piecewise" classifier that cuts the instance space up into smaller regions; a linear classifier places a single decision surface through the entire space
Define the residual and the residual sum of squares (RSS).
- the residual is the difference between y and y hat
- RSS is the sum of all squared residuals
Calculate the value of RSS.
error1^2 + error2^2 + error3^2 + … + errorN^2
Calculate the least-squares coefficient estimates.
Least-squares chooses the intercept (b0) and slope (b1) that minimize the RSS:
b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)^2, b0 = ȳ - b1·x̄
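The same closed-form estimates as a Python sketch (the data below is made up for illustration):

```python
import numpy as np

def least_squares(x, y):
    """Simple linear regression coefficients that minimize the RSS."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Hypothetical data: y is roughly 2 + 3x plus noise.
x = [0, 1, 2, 3, 4]
y = [2.1, 5.2, 7.9, 11.1, 13.8]
print(least_squares(x, y))  # close to (2, 3)
```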
Interpret the least-squares coefficients.
b0 - intercept - expected value of Y when X = 0
b1 - slope - avg. increase in Y for a unit rise in X
error term = a catch-all for everything this simple model misses
Define the population regression line and least-squares line.
population regression line: best linear approximation to the true relationship of X and Y. (unobserved)
least-squares line: the best estimate of that relationship based on the observed data
Define the concept of bias and unbiased estimators.
an unbiased estimator does not systematically over- or underestimate the true parameter: averaged over a large number of sets of observations, the statistics approach the population value
statistics (b0, b1, the average) calculated on a single small set of observations may under- or overestimate the population statistic
Define standard error
Standard error (SE^2 = variance/n) is the average amount that the estimate differs from the actual value (the larger n, the smaller the standard error).
Define residual standard error
the estimate of the standard deviation of the error term, computed from the observed data: RSE = sqrt(RSS/(n-2))
Calculate the 95% confidence interval.
b1 ± 2 × SE(b1)
approximately 95% chance that the interval will contain the true value of b1
Describe null and alternative hypotheses
null Hypothesis: No relationship between X and Y
alternative Hypothesis: There is some relationship between X and Y
Calculate the t-statistic.
t = (b1 - 0)/SE(b1); this measures the number of standard deviations b1 is away from 0.
Explain the rules for rejecting the null hypothesis using p-values.
a low p-value (typically below 0.05) -> reject the null hypothesis: there is a relationship between X and Y.
Assess the accuracy of linear regression
RSE: an absolute measure of lack of fit, in the units of Y (smaller = better fit)
R^2: the proportion of variability explained, between 0 and 1 (higher = better fit)
Calculate the R2 statistic given TSS and RSS.
R^2 = (TSS - RSS)/TSS = 1 - RSS/TSS
RSS measures the amount of variability left unexplained after performing the regression; TSS measures the total variability in Y.
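A small Python sketch of the computation (the function name is mine):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS: the share of variability explained by the fit."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    rss = np.sum((y - y_hat) ** 2)     # unexplained variability
    tss = np.sum((y - y.mean()) ** 2)  # total variability in Y
    return 1.0 - rss / tss
```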
Interpret the given values of R^2
Close to 1 -> large part of the variability is explained by regression.
Describe the advantages of the R2 statistic over the RSE
Interpretational advantage: R^2 always lies between 0 and 1, while the RSE is measured in the units of Y
Describe the relationship between R2 and correlation
Both measure the linear relationship between X and Y; in simple linear regression, R^2 equals the squared correlation r^2
Define the total sum of squares
The total variance in the response Y: TSS = Σ(yi - ȳ)^2
Describe how the relationship between responses and predictors is tested in multiple linear regression.
H0: β1 = β2 = … = βp = 0; Ha: at least one βj is non-zero
This is tested with an F-statistic
Calculate the F-statistic given TSS, RSS, n, and p.
F = [(TSS - RSS)/p] / [RSS/(n - p - 1)]
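As a Python sketch, with made-up inputs for illustration:

```python
def f_statistic(tss, rss, n, p):
    """F = [(TSS - RSS) / p] / [RSS / (n - p - 1)] for H0: all slopes are zero."""
    return ((tss - rss) / p) / (rss / (n - p - 1))

# Hypothetical fit: 100 observations, 3 predictors.
print(f_statistic(tss=500.0, rss=200.0, n=100, p=3))  # 48.0
```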
Explain how the F-statistic can be used for hypothesis testing
A high F-statistic suggests at least one predictor is related to the Y variable; how large F needs to be depends on n and p.
Explain why the value of the t-statistic can be a misleading indicator of variable importance in a multiple regression
With many predictors, some t-statistics will be large purely by chance, so there is a high chance of incorrectly concluding that an individual variable is related to Y
Describe how to determine the importance of variables in a given multiple regression.
- Mallow’s Cp
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- Adjusted R2
- Plotting residuals
Define forward selection, backward selection, and mixed selection in the context of variable selection in MLR.
Forward selection: start with no variables and repeatedly add the variable whose inclusion results in the lowest RSS.
Backward selection: start with all variables and repeatedly remove the variable with the largest p-value.
Mixed selection: Combination of Forward and Backward selection.
Calculate RSE given the values of RSS, n, and p.
RSE = sqrt[RSS / (n - p - 1)]
Define dummy variables.
Turning a qualitative variable into a numerical one with two possible values, 0 and 1
Describe how to use qualitative variables with more than two levels in multiple regression.
You can create additional dummy variables: a qualitative variable with k levels needs k - 1 dummies, with the remaining level serving as the baseline
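A sketch with pandas (the column name and levels are invented for the example):

```python
import pandas as pd

# Hypothetical 3-level qualitative variable; drop_first keeps k - 1 dummies.
df = pd.DataFrame({"region": ["east", "west", "south", "east"]})
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)  # region_south, region_west; "east" is the baseline level
```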
Describe additive and linear assumptions for the linear regression model
Additive: the association between a predictor and Y does not depend on the values of the other predictors.
linear: the change in Y for a one-unit increase in Xj is constant, whatever the value of Xj
Define the interaction effect
interaction effect = synergy between predictors
Explain when an interaction term should be added to a multiple regression model.
- when the predictors have large main effects
- when the interaction has been shown in earlier studies
Describe the hierarchical principle for multiple regression.
If the interaction between X1 and X2 seems important, we should always include the main effects as well, even if their p-values are large (not significant)
Define polynomial regression.
Regression with polynomial terms (e.g. X, X^2, X^3) added to the linear model so that it can accommodate non-linear relationships
Describe the potential problems for a linear regression model, such as non-linearity, correlation of error terms, a non-constant variance of error terms,
outliers, high-leverage points, and collinearity.
non-linearity: real relationship might not be linear
correlation of error terms: leads to underestimated true standard errors (which can produce too-narrow confidence and prediction intervals)
non-constant variance (heteroskedasticity): can be reduced by taking the log or square root of Y (shrinking large outcomes to better fit the line)
outliers: can have a large effect on R2 and RSE
High-leverage points: observations with an unusual value for the predictor x
collinearity: two or more predictor variables are closely correlated; this increases the standard errors of their coefficients and thus shrinks their t-statistics
Recognize and apply the relationship between arithmetic and geometric returns
When arithmetic returns are small, there is little difference between geometric and arithmetic returns.
As volatility increases and the time period shrinks, the difference grows larger.
Describe the shape of the plotted line when geometric returns are plotted against arithmetic returns.
Arithmetic follows a straight line while geometric returns follow a curve
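A tiny Python check of this shape, assuming geometric return = ln(1 + arithmetic return):

```python
import math

# For small r the log return hugs the straight line r; the curve
# falls away from the line as returns get larger.
for r in [0.01, 0.05, 0.20, 0.50]:
    print(f"arithmetic {r:.2f} -> geometric {math.log(1 + r):.4f}")
```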
Define time resolution and time horizon.
resolution: how densely the data covers time (the finer the resolution, the fatter the tails)
time horizon: the shorter the period, the fatter the tails
Describe a random walk model and an autoregressive model.
random walk: Yt = drift term + Yt-1 + error
AR(1) model: Yt = drift + aYt-1 + error
a = the mean-reversion parameter
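A simulation sketch of both processes (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, drift, a = 500, 0.0, 0.8  # a < 1: mean-reverting; a = 1 recovers the random walk

y_rw = np.zeros(n)  # random walk: Y_t = drift + Y_{t-1} + e_t
y_ar = np.zeros(n)  # AR(1):       Y_t = drift + a * Y_{t-1} + e_t
for t in range(1, n):
    e1, e2 = rng.standard_normal(2)
    y_rw[t] = drift + y_rw[t - 1] + e1
    y_ar[t] = drift + a * y_ar[t - 1] + e2
```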
Define stationarity
stationarity: no trend, and the mean, variance, and autocovariance do not change over time
Recognize and apply the formula for the autocorrelation function.
Yt - Yt-1 = μ + (a - 1)·Yt-1 + et
Describe a GARCH(1,1) model
A mixture of normal distributions with different variances: returns are conditionally normal at each point in time, but the variance changes over time (producing fat tails)
List the conditions that must be satisfied by the parameters of a GARCH(1,1) model
0 ≤ a ≤ 1, 0 ≤ b ≤ 1 (and a + b < 1 for the process to be stationary)
Describe the goodness-of-fit for a GARCH model
By checking the significance of the parameter estimates and how well the model captures the volatility of the process.
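A minimal simulation sketch, assuming the standard GARCH(1,1) variance recursion sigma2_t = omega + a·e_{t-1}^2 + b·sigma2_{t-1} (function and parameter names are mine):

```python
import numpy as np

def simulate_garch11(n, omega, a, b, seed=0):
    """Returns are normal at each step but with a time-varying variance."""
    rng = np.random.default_rng(seed)
    e = np.zeros(n)
    sigma2 = np.full(n, omega / (1 - a - b))  # start at the unconditional variance
    for t in range(1, n):
        sigma2[t] = omega + a * e[t - 1] ** 2 + b * sigma2[t - 1]
        e[t] = rng.standard_normal() * np.sqrt(sigma2[t])
    return e, sigma2

returns, variances = simulate_garch11(1000, omega=0.1, a=0.1, b=0.85)
```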