Final Exam Flashcards
Study for Final
Describe the Hill Climbing Algorithm
Move to the neighbor with the highest value (see the sketch after these steps).
1. Guess an x in X.
2. n* = argmax_{n in N(x)} f(n)
3. If f(n*) > f(x): x = n*, else stop.
4. Go to 2.
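A minimal Python sketch of this loop; `f`, the `neighbors` function, and the toy objective are illustrative assumptions:

```python
import random

def hill_climb(f, neighbors, x):
    """Greedy local search: move to the best neighbor until no neighbor improves f."""
    while True:
        best = max(neighbors(x), key=f)   # n* = argmax_{n in N(x)} f(n)
        if f(best) > f(x):
            x = best                      # accept the better neighbor
        else:
            return x                      # no neighbor improves f: local optimum

# Example: maximize f(x) = -(x - 3)^2 over the integers.
f = lambda x: -(x - 3) ** 2
neighbors = lambda x: [x - 1, x + 1]
print(hill_climb(f, neighbors, random.randint(-10, 10)))  # -> 3
```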
What are strengths and weaknesses of hill climbing?
It is very simple, very quick, and given a monotonic surface will converge.
It is NOT capable of dealing with very ‘rough’ surfaces and has difficulty finding optima with very narrow basins of attraction.
What is a method of helping hill climbing deal with rough surfaces?
Random restarts.
Describe Simulated Annealing
It starts as a random walk and gradually segues into hill climbing, governed by a temperature gradient / annealing schedule.
For a finite set of iterations:
- Sample a new point x_t in N(x) [Neighborhood of x]
- Jump to the new neighbor (move x to x_t) with a probability given by an acceptance probability function P(x, x_t, T)
- Decrease temperature T
P(x, x_t, T) = {
1 if f(x_t) >= f(x)
exp( (f(x_t) - f(x)) / T ) otherwise
}
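A minimal sketch for maximization; the starting temperature, geometric cooling schedule, and toy objective are illustrative assumptions:

```python
import math
import random

def simulated_annealing(f, sample_neighbor, x, t=10.0, cooling=0.95, iters=1000):
    """Always accept uphill moves; accept downhill moves with prob exp((f(x_t) - f(x)) / T)."""
    for _ in range(iters):
        x_t = sample_neighbor(x)
        if f(x_t) >= f(x) or random.random() < math.exp((f(x_t) - f(x)) / t):
            x = x_t                       # jump to the new neighbor
        t *= cooling                      # decrease temperature T
    return x

# Example: maximize a bumpy 1-D function.
f = lambda x: -(x ** 2) + 10 * math.cos(x)
sample_neighbor = lambda x: x + random.uniform(-1, 1)
print(simulated_annealing(f, sample_neighbor, random.uniform(-10, 10)))
```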
What is the probability of ending at any point in a simulated annealing iteration?
exp( f(x) / T ) / Z_T
This is known as the Boltzmann Distribution and is seen frequently in engineering.
When T is high the distribution is smoothed out (exp(f(x)/T) approaches 1 for every x, so it tends toward uniform); as the temperature falls the probability mass narrows onto the optima.
Describe the genetic algorithm flow.
Start with your initial population P_0 of size K.
Repeat until converged:
{
Compute fitness of all x in P_t
Select the ‘most fit’ in a consistent manner.
Pair up individuals, replacing ‘least fit’ via crossover / mutation.
}
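A compact sketch of this loop for bit-string individuals; the population size, mutation rate, and truncation-style selection are illustrative assumptions (crossover variants are on the next card):

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=20, generations=100, p_mut=0.01):
    """Keep the fitter half each generation; refill the rest via crossover + mutation."""
    population = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)       # compute fitness of all x in P_t
        survivors = population[: pop_size // 2]          # select the 'most fit'
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)           # pair up individuals
            cut = random.randrange(1, n_bits)            # single-point crossover
            child = [bit ^ 1 if random.random() < p_mut else bit   # mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        population = survivors + children                # replace the 'least fit'
    return max(population, key=fitness)

# Example: "one-max" -- maximize the number of 1 bits.
print(genetic_algorithm(sum, n_bits=20))
```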
What are different kinds of crossover?
- Uniform crossover: Each bit has a uniform probability of swapping.
- Single point: The child takes the head of one parent and the tail of the other, split at a single crossover point.
- Multiple point
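A quick sketch of the first two operators on hypothetical bit-string parents:

```python
import random

def uniform_crossover(a, b, p_swap=0.5):
    """Each position independently takes its bit from one parent or the other."""
    return [ai if random.random() < p_swap else bi for ai, bi in zip(a, b)]

def single_point_crossover(a, b):
    """Child takes the head of one parent and the tail of the other."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

parent_a = [1, 1, 1, 1, 1, 1, 1, 1]
parent_b = [0, 0, 0, 0, 0, 0, 0, 0]
print(uniform_crossover(parent_a, parent_b))
print(single_point_crossover(parent_a, parent_b))
```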
What is the probability model for MIMIC?
P^theta(x) = {
1 / Z_theta if f(x) >= theta
0 otherwise
}
P^(theta_min) (x) -> uniform distribution over all points.
P^(theta_max) (x) -> uniform distribution over global optima.
Describe the algorithm for MIMIC
Intuition: a rising water level that keeps only the high points while remembering structure. Repeat: sample points from the current distribution P^theta_t(x), keep only the fittest samples (those with f(x) above a new, higher threshold theta_{t+1}), and estimate a new distribution from those survivors. Because each new distribution is fit to the previous round's best samples, structure is carried from iteration to iteration.
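A simplified sketch of that loop. Note the swap: MIMIC proper re-estimates a dependency-tree distribution that captures pairwise structure, while this sketch uses independent per-bit marginals (closer to a univariate EDA) just to show the sample / keep-the-fittest / re-fit cycle:

```python
import random

def mimic_style_search(fitness, n_bits, n_samples=100, keep_frac=0.5, iters=50):
    """Sample from the current model, keep the points above the rising threshold theta,
    and re-fit the model to the survivors (independent marginals stand in for
    MIMIC's dependency trees here)."""
    probs = [0.5] * n_bits                                  # start uniform over {0,1}^n
    best = None
    for _ in range(iters):
        samples = [[int(random.random() < p) for p in probs] for _ in range(n_samples)]
        samples.sort(key=fitness, reverse=True)
        survivors = samples[: int(n_samples * keep_frac)]   # raise theta: keep the fittest
        probs = [sum(bits) / len(survivors) for bits in zip(*survivors)]
        if best is None or fitness(survivors[0]) > fitness(best):
            best = survivors[0]
    return best

# Example: one-max again.
print(mimic_style_search(sum, n_bits=20))
```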
What is a short, idiomatic, descriptive way of separating supervised from unsupervised learning?
Supervised learning uses labeled training data to generalize labels to new instances and is function approximation. Unsupervised learning makes sense out of unlabeled data and is data description.
How is domain knowledge injected in most clustering algorithms?
The measure of pairwise distance or similarity.
Describe a basic clustering problem.
Given a set of objects X with inter-object distances D, output a partition P_D in a consistent manner such that two points in the same cluster end up in the same partition (P_D(x) = P_D(y)).
What are some examples of trivial clustering partition methods?
- A partition for every point.
- A single partition.
What is Single Linkage Clustering (SLC)?
A hierarchical clustering algorithm (hierarchical methods are built either top down or bottom up); SLC is agglomerative, meaning it works bottom up by slowly folding points into clusters.
Describe the Algorithm and Runtime Complexity of Single Linkage Clustering
To create k clusters:
- Consider each object (point) a unique cluster (n objects exist.)
- Define the intercluster distance as the distance between the closest points in any two clusters. Note that two points within the same cluster have an intercluster distance of 0.
- Merge the two closest clusters.
- Repeat n-k times to make k clusters.
You eventually construct a dendrogram / tree with k root nodes; as k approaches 1 the full tree joins at a single root.
The runtime complexity is O(n^3): each merge must scan roughly n^2 pairwise distances to find the closest pair of clusters, and the merge is repeated n - k times. For small k (many merges) this approaches the theoretical max of O(n^3); as k approaches n (few merges) it approaches the theoretical min of O(n^2).
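A naive Python sketch of those steps; the nested closest-pair scan is what drives the cost, and the toy points are an illustrative assumption:

```python
def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def single_linkage(points, k):
    """Merge the two closest clusters (distance = closest pair of points) until k remain."""
    clusters = [[p] for p in points]                  # every point starts as its own cluster
    while len(clusters) > k:                          # n - k merges
        best = None
        for i in range(len(clusters)):                # scan all pairs of clusters
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)                # merge the closest pair of clusters
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(single_linkage(points, k=2))
```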
Describe the Algorithm and Runtime Complexity of K-Means Clustering
To create k clusters:
- Initialize k centroids in the manner you choose (intelligently / randomly / …)
- Assign all n points to the nearest centroid. This requires n * k distance calculations
- Recalculate the centroids of each cluster by setting the centroid coordinates to be the intra cluster mean of the points.
- Go to step 2 and repeat until convergence
The runtime complexity for this is a little more complex. Each assignment step requires n * k distance calculations, so as k approaches n / 2 a single iteration approaches O(n^2), and the assign/recompute loop repeats until convergence. Intelligent seeding could get you close to an answer in far fewer iterations, but it's likely a safe assumption to say a theoretical max of O(n^3) and a min of O(n^2).
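A minimal numpy sketch with random seeding; smarter seeding, restarts, and empty-cluster handling are left out:

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # random seeding from the data
    for _ in range(iters):
        # n * k distance calculations: label each point with its nearest centroid
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # recompute each centroid as the mean of its cluster (empty clusters not handled)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
print(k_means(X, k=2)[1])
```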
Does K-Means Clustering Always Converge?
Yes, given enough time, though possibly only to a local optimum depending on centroid initialization. Many implementations run several randomized trials with different seeds. Some implementations force the initial centroids to be distant from one another, which generally improves results.
Does K-Means Clustering Always Converge to a Good Answer?
What is good? K-means is always going to partition your space in such a manner that the points which are ‘close’ to each other as defined by your distance metric are in the same cluster assuming that k was selected correctly.
What is the Usual Distance Metric for K-Means?
Euclidean, though others may be used. Any smooth measure will work.
How is K-Means Related to Expectation Maximization?
K-Means can be seen as a special case of EM with a small all-equal, diagonal covariance matrix.
Why does Dimension Reduction Generally Improve K-Means Clustering Results?
K-Means, because it relies on Euclidean distances, suffers from the curse of dimensionality: distances quickly become less informative as dimensionality rises. Reducing the dimensionality can help with, though possibly not solve, this issue.
Define K-Means Mathematically
P^t(x): Partition / cluster of object x at iteration t
C^t_i: Set of all points in cluster i -> {x s.t. P^t(x) = i}
1. Center^t_i = ( \sum_{y in C^t_i} y ) / |C^t_i| (this is a centroid, calculated with a mean)
2. P^t(x) = argmin_i ||x - Center^t_i||^2_2 (each point is assigned to its nearest centroid; Center^0_i are the seeds)
To run: seed the centroids (Center^0_i), then repeat steps 2 and 1 until convergence.
How are clusters scored?
Tightness / deviation is a good metric.
One metric is intra cluster squared distances:
E(P, Center) = \sum_x ||Center_{P(x)} - x||^2_2
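A small numpy sketch of that score; the toy labels and centroids are illustrative assumptions:

```python
import numpy as np

def intra_cluster_sq_error(X, labels, centroids):
    """E(P, Center) = sum_x ||Center_{P(x)} - x||^2  (lower = tighter clusters)."""
    return float(((X - centroids[labels]) ** 2).sum())

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[0.0, 1.0], [10.0, 10.0]])
print(intra_cluster_sq_error(X, labels, centroids))   # -> 2.0
```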
Which Optimization Routine is K-Means Most Like?
Hill climbing; K-Means only ever moves in one direction (reducing the intra-cluster squared error), just as hill climbing only ever moves toward a peak.
What is the Behaviour of the K-Means Error Curve as the Algorithm Progresses?
It is monotonically non-increasing. Every time the cluster centroids are shifted, the error term (sum of intra-cluster squared distances) never increases.
How Can You Avoid Getting Stuck in Local Optima in K-Means?
Either conduct random restarts or intelligently spread the centroid seeds across the dataset.
What is Soft Clustering?
A way of using probability to assign points to clusters: each point gets a probability of belonging to each cluster rather than a single hard assignment.
What is the Algorithm and Runtime for EM?
- Expectation:
Compute the expected value of the cluster assignments z from the current centroid means: for every point, calculate the normalized probability that it was drawn from each centroid, where each centroid defines a probability distribution (in this case a normal distribution).
Across i points and j centroids:
E[z_ij] = P(X = x_i | mu = mu_j) / ( sum_{l=1 to k} P(X = x_i | mu = mu_l) )
P(X = x_i | mu = mu_j) = e^{ -(x_i - mu_j)^2 / (2 sigma^2) }
- Maximization:
Derive new centroid means by calculating the average of the points, weighted by the probabilities created in the expectation step.
mu_j = sum_i(E[Z_ij] * x_i) / sum_i(E[Z_ij])
The runtime per iteration is similar to K-Means, but in practice EM is slower.
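A 1-D numpy sketch matching the equations above; it assumes k equal-variance Gaussian components with a known sigma and uniform mixing weights (full EM would also update weights and variances):

```python
import numpy as np

def em_gaussian_means(x, k, sigma=1.0, iters=100, seed=0):
    """EM for a 1-D mixture of equal-variance Gaussians; only the means are learned."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)                 # seed the means
    for _ in range(iters):
        # E-step: E[z_ij] = P(x_i | mu_j) / sum_l P(x_i | mu_l)
        p = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        z = p / p.sum(axis=1, keepdims=True)
        # M-step: mu_j = mean of the points weighted by E[z_ij]
        mu = (z * x[:, None]).sum(axis=0) / z.sum(axis=0)
    return mu

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 100),
                    np.random.default_rng(2).normal(6, 1, 100)])
print(em_gaussian_means(x, k=2))
```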
How are EM and K-Means Alike?
K-Means is EM with some assumptions made: hard cluster assignments and identical, spherical (all-equal, diagonal) covariances.
What are some properties of EM?
- Monotonically non-decreasing likelihood.
- Does not, in theory, fully converge (in practice, define a threshold for convergence).
- Will not diverge.
- Can get stuck in local optima.
- Can work with any distribution if E,M are solvable.
What are the three primary clustering properties of interest?
Richness, scale invariance, and consistency.
You can always have two of them, but not three. With a little hand waving you can get pretty close.
Why do we do feature selection?
Discover knowledge and add interpretability.
Reduce or remove the curse of dimensionality; the amount of data necessary is exponential in the number of features (2^n)
What is the upper bound of computational complexity for finding a subset of features from n features?
2^n (the number of possible subsets).
This problem is NP-hard.
What is filtering?
Filtering is where the feature selection algorithm is run once, up front, and the filtered features are passed on to the learner.
What is wrapping?
Wrapping is where the feature selection algorithm and the learner work cyclically, each informing the other.
What are the differences between wrapping and filtering?
Speed: Filtering is fast; it ignores the learning problem and doesn't care what the learner wants, which makes it recommended for a well-known problem. Wrapping is slow because the learner must be retrained for each candidate feature set, but it takes the learner into account.
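A hedged sklearn sketch of the contrast; the dataset, learner, feature counts, and the greedy forward-selection wrapper are illustrative choices, not the only way to do either:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filtering: score features once, up front, without consulting the learner.
filtered = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print("filter picks:", filtered.get_support(indices=True))

# Wrapping: greedy forward selection, re-training the learner to score each candidate set.
learner, chosen = DecisionTreeClassifier(random_state=0), []
for _ in range(5):
    scores = {j: cross_val_score(learner, X[:, chosen + [j]], y, cv=3).mean()
              for j in range(X.shape[1]) if j not in chosen}
    chosen.append(max(scores, key=scores.get))
print("wrapper picks:", chosen)
```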