Test 2 W5-7 Flashcards
Tuesday, October 30
Regression Models: What happens when M is too large or when M is too small?
M is the polynomial order
When M is too small (e.g. M=0), underfitting occurs
When M is too large (e.g. M=9), overfitting occurs
What happens as polynomials increase?
When the order of polynomials increases, models become more complex: training error keeps decreasing, but test error will eventually increase.
How can you improve the situation of overfitting?
Have more training data, or constrain the magnitude of the weights (a typical way to prevent overfitting, as it controls the complexity of models) -> if you start your weights at small values, “stopping (training) earlier” corresponds to constraining the weight values to be small, and can therefore control overfitting
(SVMs naturally do this to some degree)
> done using regularization (a technique used to control the size of the weights and hence control model complexity)
When there is not enough training data, what happens to the magnitude of the weights w as M increases?
The weights w typically get larger in magnitude
What is regularization?
Used to control overfitting. It adds a penalty term to the error function in order to discourage the coefficients from reaching large values.
Aims to balance minimizing the unregularized error against finding “small” weights (those close to the origin); see the formula below
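For reference, the usual squared-error function with a quadratic (L2) penalty, writing the penalized error as E_reg(w), is:

    E_reg(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) - t_n)^2 + (λ/2) ||w||^2

where λ controls the relative importance of the penalty term compared with the data-fit term.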
What is the result of the following regularization values?
ln λ = -18
ln λ = 0
ln λ = -∞
ln λ = -18 : overfitting is suppressed and we obtain a much closer representation of the underlying function
ln λ = 0 : we have used too large a value for λ and therefore obtain a poor fit (over-regularized)
ln λ = -∞ : corresponds to a model with no regularization (λ = 0)
What is the result of over regularization? and how do you prevent it?
underfitting (therefore you should use (cross-)validation to select models by picking a good regularization coefficient; see the sketch below)
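A minimal sketch of picking the regularization coefficient by cross-validation (assuming scikit-learn is available; the data and the λ grid here are purely illustrative):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Toy data: X is the feature matrix, t the targets (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(30, 1))
    t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=30)

    # Candidate regularization coefficients lambda (called alpha in sklearn).
    lambdas = np.exp(np.arange(-18, 1, 2.0))

    def cv_error(lam):
        # 5-fold cross-validated mean squared error for this lambda.
        model = Ridge(alpha=lam)
        # neg_mean_squared_error: higher is better, so negate it back.
        return -cross_val_score(model, X, t, cv=5,
                                scoring="neg_mean_squared_error").mean()

    # Pick the lambda with the smallest cross-validated error.
    best_lambda = min(lambdas, key=cv_error)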
The case of q = 1 is known as the lasso model. What happens as a result if λ is sufficiently large?
Lasso may give us small/sparse models, where some weights are exactly zero
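For reference, the general family of regularizers (of which lasso is the q = 1 case) penalizes

    (λ/2) Σ_j |w_j|^q

where q = 2 gives the quadratic (ridge-style) regularizer, while q = 1 (lasso) drives some weights exactly to zero when λ is sufficiently large.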
How to learn parameters for linear regression models, given a training data set?
- For squared error, minimize the error function with respect to the parameters w
- Set the gradient to zero with respect to w
- Then solve in closed form using the design (feature) matrix (or the raw data matrix, if you did not use basis functions to process your features), as shown below
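Concretely, setting the gradient to zero yields the standard closed-form (normal-equation) solution, where Φ is the design matrix whose n-th row contains the basis-function values for input x_n:

    w* = (Φ^T Φ)^{-1} Φ^T t

Here (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudo-inverse of Φ. In practice, something like NumPy's np.linalg.lstsq(Phi, t) computes this more stably than forming the inverse explicitly.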
How can we view linear regression vs. the probabilistic view?
Linear regression: finding a curve that minimizes a predefined error (e.g. squared error or absolute error)
Probabilistic view: think of regression as finding a curve that generates t with some noise distribution (e.g. Gaussian) with the highest probability
These two views are the same. The equivalence can be seen if we use maximum likelihood to estimate w in the probabilistic viewpoint.
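A short sketch of the equivalence: assume t_n = y(x_n, w) + ε with Gaussian noise ε ~ N(0, σ^2). The log-likelihood of the training targets is then

    ln p(t | w, σ^2) = -(1/(2σ^2)) Σ_{n=1}^{N} (t_n - y(x_n, w))^2 + const

so maximizing the likelihood with respect to w is exactly minimizing the sum-of-squares error.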
What is unsupervised learning?
Machine learning with no outputs, only inputs
Def: Clustering
and what do all clustering methods require?
grouping similar training cases together and identifying a “prototype” example to represent each group (e.g., the mean of a cluster)
All methods require a way to measure the distance between two data points, e.g. the Euclidean distance
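For reference, the Euclidean distance between two D-dimensional points x and y is

    d(x, y) = ||x - y||_2 = sqrt( Σ_{i=1}^{D} (x_i - y_i)^2 )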
High level: what is K-means (the algorithm)
- For a pre-defined number of clusters K, we first randomly initialize a center for each cluster
- Then alternate between 2 steps (see the sketch below):
1. K step: for all data points, assign each to the cluster whose center is closest to this data point
2. M step: update each cluster’s center to be the mean of all points assigned to that cluster
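A minimal NumPy sketch of the algorithm (illustrative only; the function and variable names are assumptions, not from the course):

    import numpy as np

    def kmeans(X, K, n_iters=100, seed=0):
        """Plain k-means: X is (N, D), K is the number of clusters."""
        rng = np.random.default_rng(seed)
        # Initialization: pick K random data points as the initial centers.
        centers = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iters):
            # K step: assign each point to the cluster with the closest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # M step: move each center to the mean of its assigned points
            # (keep the old center if a cluster happens to be empty).
            new_centers = np.array([
                X[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
                for k in range(K)
            ])
            if np.allclose(new_centers, centers):
                break  # converged: assignments will no longer change
            centers = new_centers
        return centers, assign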
The goal of k-means?
Find an assignment of data points to clusters, as well as a set of centers {μ_k} (each cluster has a center), such that the sum of the squared Euclidean distances from each data point to its closest center μ_k is minimized:

    J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ||x_n - μ_k||^2

where N is the number of data points and r_nk = 1 if point n is assigned to cluster k (0 otherwise). K-means minimizes this error function by alternately finding {μ_k} and {r_nk}.
In K-means how do we optimize {r_nk} (with {μ_k} fixed)?
assign each point to the cluster whose center is closest to this point
In k-means how do we optimize {μ_k} with {r_nk} fixed?
Set the derivative of J (with respect to μ_k) to zero, i.e. set μ_k to the mean of all points assigned to cluster k
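Explicitly (a standard derivation), setting ∂J/∂μ_k = 0 gives

    μ_k = ( Σ_n r_nk x_n ) / ( Σ_n r_nk )

which is simply the mean of the points assigned to cluster k.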
Def: k-medians
Instead of using the squared Euclidean distance, if we use the absolute (L1) distance in the error function, we get k-medians.
The M step is still easy: just take the coordinate-wise median of all points assigned to cluster k
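Assuming the intended L1 formulation, the k-medians error function is

    J = Σ_n Σ_k r_nk ||x_n - μ_k||_1

and since this objective separates over coordinates, the coordinate-wise median of the assigned points minimizes each cluster's term.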
Def: K-medoids
We update the new cluster center to be one of the points assigned to that cluster (the medoid, i.e. the assigned point that minimizes the total distance to the other points in the cluster).
What is the issue with k-medoids?
Have to try every possible point in the cluster as the candidate center. Expensive!
Describe the following tricks used in k-means:
- initialization
- Ties in distance
- Pick # of clusters and error functions
- speed
- robust errors
- avoid local minima
- Initialization: set all centers randomly
- Ties in distance: add points to small clusters first
- Pick # of clusters and error functions: e.g. cross-validation (cluster number selected such that an increase in number leads to only a small reduction in the error function)
- Speed: use some data structures to speed up the nearest-center search (e.g. trees)
- Robust errors: for k-means, use the squared error up to some max error, then a constant error beyond it; or use other distances
- Avoid local minima: use random restarts; split and merge clusters
What is hierarchical clustering? and what are the two types?
Break the dataset into a series of nested clusters organized in a tree
- Agglomerative algorithm
- Divisive algorithm
What is the agglomerative algorithm?
A hierarchical clustering algorithm.
- Start with each data point in its own cluster, then successively merge similar clusters until a single cluster remains
- Several methods exist for merging; most are based on computing cluster distances from the pairwise distances between all pairs of points, then merging the two clusters with the smallest cluster distance (a sketch follows below)
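A minimal sketch using SciPy's hierarchical clustering (assuming SciPy is available; 'single' linkage, which merges by the smallest pairwise point distance, is just one of the merging methods mentioned above):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy data: 20 points in 2-D, two loose groups (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

    # Agglomerative clustering: successively merge the two closest clusters.
    # 'single' linkage defines cluster distance as the smallest pairwise
    # distance between points in the two clusters.
    Z = linkage(X, method="single")

    # Cut the resulting tree to recover a flat assignment into 2 clusters.
    labels = fcluster(Z, t=2, criterion="maxclust")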
What is the divisive algorithm?
A hierarchical clustering algorithm
- start with all the data in a single cluster and successively split clusters