Test 2 W5-7 Flashcards
Tuesday, October 30
Regression Models: What happens when M is too large or when M is too small?
M is the polynomial order
When M is too small (e.g. M=0), underfitting occurs
When M is too large (e.g. M=9), overfitting occurs
What happens as polynomials increase?
When the order of polynomials increases, models become more complex: training error keeps decreasing, but test error will eventually increase.
How can you improve the situation of overfitting?
Have more training data, or constrain the magnitude of the weights (a typical way to prevent overfitting, as it controls the complexity of models) -> if you start your weights at small values, “stopping (training) earlier” corresponds to constraining the weight values to be small, and can therefore control overfitting
(SVMs naturally do this to some degree)
> done using regularization (a technique used to control the size of the weights and hence control model complexity)
When there is not enough training data, what happens to the magnitude of the weights w as M increases?
The weights w typically get larger in magnitude
What is regularization?
Used to control overfitting. It adds a penalty term to the error function in order to discourage the coefficients from reaching large values.
Aims to balance minimizing the unregularized error against finding “small” weights (those close to the origin); see the formula below
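For reference, the usual squared-error function with a quadratic (L2) penalty, writing the penalized error as E_reg(w), is:

    E_reg(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) - t_n)^2 + (λ/2) ||w||^2

where λ controls the relative importance of the penalty term compared with the data-fit term.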
What is the result of the following regularization values?
ln λ = -18
ln λ = 0
ln λ = -∞
ln λ = -18 : overfitting is suppressed and we obtain a much closer representation of the underlying function
ln λ = 0 : we have used too large a value for λ and therefore obtain a poor fit (over-regularized)
ln λ = -∞ : corresponds to a model with no regularization (λ = 0)
What is the result of over regularization? and how do you prevent it?
underfitting (therefore you should use (cross-)validation to select models by picking a good regularization coefficient; see the sketch below)
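A minimal sketch of picking the regularization coefficient by cross-validation (assuming scikit-learn is available; the data and the λ grid here are purely illustrative):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Toy data: X is the feature matrix, t the targets (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(30, 1))
    t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=30)

    # Candidate regularization coefficients lambda (called alpha in sklearn).
    lambdas = np.exp(np.arange(-18, 1, 2.0))

    def cv_error(lam):
        # 5-fold cross-validated mean squared error for this lambda.
        model = Ridge(alpha=lam)
        # neg_mean_squared_error: higher is better, so negate it back.
        return -cross_val_score(model, X, t, cv=5,
                                scoring="neg_mean_squared_error").mean()

    # Pick the lambda with the smallest cross-validated error.
    best_lambda = min(lambdas, key=cv_error)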
The case of q = 1 is known as the lasso model. What happens as a result if λ is sufficiently large?
Lasso may give us small/sparse models, where some weights are exactly zero
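For reference, the general family of regularizers (of which lasso is the q = 1 case) penalizes

    (λ/2) Σ_j |w_j|^q

where q = 2 gives the quadratic (ridge-style) regularizer, while q = 1 (lasso) drives some weights exactly to zero when λ is sufficiently large.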
How to learn parameters for linear regression models, given a training data set?
- For squared error, minimize the error function with respect to the parameters w
- Set the gradient to zero with respect to w
- Then solve in closed form using the design (feature) matrix (or the raw data matrix, if you did not use basis functions to process your features), as shown below
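Concretely, setting the gradient to zero yields the standard closed-form (normal-equation) solution, where Φ is the design matrix whose n-th row contains the basis-function values for input x_n:

    w* = (Φ^T Φ)^{-1} Φ^T t

Here (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudo-inverse of Φ. In practice, something like NumPy's np.linalg.lstsq(Phi, t) computes this more stably than forming the inverse explicitly.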
How can we view linear regression vs. the probabilistic view?
Linear regression: finding a curve that minimizes a predefined error (e.g. squared error or absolute error)
Probabilistic view: think of regression as finding a curve that generates t with some noise distribution (e.g. Gaussian) with the highest probability
These two views are the same. The equivalence can be seen if we use maximum likelihood to estimate w in the probabilistic viewpoint.
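A short sketch of the equivalence: assume t_n = y(x_n, w) + ε with Gaussian noise ε ~ N(0, σ^2). The log-likelihood of the training targets is then

    ln p(t | w, σ^2) = -(1/(2σ^2)) Σ_{n=1}^{N} (t_n - y(x_n, w))^2 + const

so maximizing the likelihood with respect to w is exactly minimizing the sum-of-squares error.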
What is unsupervised learning?
Machine learning with no outputs, only inputs
Def: Clustering
and what do all clustering methods require?
grouping similar training cases together and identifying a “prototype” example to represent each group (e.g., the mean of a cluster)
All methods require a way to measure the distance between two data points, e.g. the Euclidean distance
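For reference, the Euclidean distance between two D-dimensional points x and y is

    d(x, y) = ||x - y||_2 = sqrt( Σ_{i=1}^{D} (x_i - y_i)^2 )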
High level: what is K-means (the algorithm)
- For a pre-defined number of clusters K, we first randomly initialize a center for each cluster
- Then alternate between 2 steps (see the sketch below):
1. K step: for all data points, assign each to the cluster whose center is closest to this data point
2. M step: update each cluster’s center to be the mean of all points assigned to that cluster
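A minimal NumPy sketch of the algorithm (illustrative only; the function and variable names are assumptions, not from the course):

    import numpy as np

    def kmeans(X, K, n_iters=100, seed=0):
        """Plain k-means: X is (N, D), K is the number of clusters."""
        rng = np.random.default_rng(seed)
        # Initialization: pick K random data points as the initial centers.
        centers = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iters):
            # K step: assign each point to the cluster with the closest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # M step: move each center to the mean of its assigned points
            # (keep the old center if a cluster happens to be empty).
            new_centers = np.array([
                X[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
                for k in range(K)
            ])
            if np.allclose(new_centers, centers):
                break  # converged: assignments will no longer change
            centers = new_centers
        return centers, assign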
The goal of k-means?
Find an assignment of data points to clusters, as well as a set of centers {μ_k} (each cluster has a center), such that the sum of the squared Euclidean distances from each data point to its closest center μ_k is minimized:

    J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ||x_n - μ_k||^2

where N is the number of data points and r_nk = 1 if point n is assigned to cluster k (0 otherwise). K-means minimizes this error function by alternately finding {μ_k} and {r_nk}.
In K-means how do we optimize {r_nk} (with {μ_k} fixed)?
assign each point to the cluster whose center is closest to this point
In k-means how do we optimize {μ_k} with {r_nk} fixed?
Set the derivative of J (with respect to μ_k) to zero, i.e. set μ_k to the mean of all points assigned to cluster k
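Explicitly (a standard derivation), setting ∂J/∂μ_k = 0 gives

    μ_k = ( Σ_n r_nk x_n ) / ( Σ_n r_nk )

which is simply the mean of the points assigned to cluster k.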
Def: k-medians
Instead of using the squared Euclidean distance, if we use the absolute (L1) distance in the error function, we get k-medians.
The M step is still easy: just take the coordinate-wise median of all points assigned to cluster k
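Assuming the intended L1 formulation, the k-medians error function is

    J = Σ_n Σ_k r_nk ||x_n - μ_k||_1

and since this objective separates over coordinates, the coordinate-wise median of the assigned points minimizes each cluster's term.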
Def: K-medoids
We update the new cluster center to be one of the points assigned to that cluster (the medoid, i.e. the assigned point that minimizes the total distance to the other points in the cluster).
What is the issue with k-medoids?
Have to try every possible point in the cluster as the candidate center. Expensive!
Describe the following tricks used in k-means:
- initialization
- Ties in distance
- Pick # of clusters and error functions
- speed
- robust errors
- avoid local minima
- Initialization: set all centers randomly
- Ties in distance: add points to small clusters first
- Pick # of clusters and error functions: e.g. cross-validation (cluster number selected such that an increase in number leads to only a small reduction in the error function)
- Speed: use some data structures to speed up the nearest-center search (e.g. trees)
- Robust errors: for k-means, use the squared error up to some max error, then a constant error beyond it; or use other distances
- Avoid local minima: use random restarts; split and merge clusters
What is hierarchical clustering? and what are the two types?
Break the dataset into a series of nested clusters organized in a tree
- Agglomerative algorithm
- Divisive algorithm
What is the agglomerative algorithm?
A hierarchical clustering algorithm.
- Start with each data point in its own cluster, then successively merge similar clusters until a single cluster remains
- Several methods exist for merging; most are based on computing cluster distances from the pairwise distances between all pairs of points, then merging the two clusters with the smallest cluster distance (a sketch follows below)
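A minimal sketch using SciPy's hierarchical clustering (assuming SciPy is available; 'single' linkage, which merges by the smallest pairwise point distance, is just one of the merging methods mentioned above):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy data: 20 points in 2-D, two loose groups (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

    # Agglomerative clustering: successively merge the two closest clusters.
    # 'single' linkage defines cluster distance as the smallest pairwise
    # distance between points in the two clusters.
    Z = linkage(X, method="single")

    # Cut the resulting tree to recover a flat assignment into 2 clusters.
    labels = fcluster(Z, t=2, criterion="maxclust")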
What is the divisive algorithm?
A hierarchical clustering algorithm
- start with all the data in a single cluster and successively split clusters