CS 7641 Final Flashcards
Supervised Learning
Use labeled training data to generalize labels to new instances
(function approximation)
Unsupervised Learning
making sense out of unlabeled data
(data description)
Single Linkage Clustering
- Consider each object a cluster (n objects)
- Define intercluster distance as the distance between the two closest points in the two clusters
- Merge two closest clusters
- Repeat n-k times to make k clusters
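A minimal Python/numpy sketch of those steps (the function name and toy data are my own, not from the course):
```python
import numpy as np

def single_linkage(points, k):
    """Single linkage clustering: start with n singleton clusters, repeatedly
    merge the two clusters whose closest pair of points is nearest, and stop
    after n - k merges (k clusters remain)."""
    clusters = [[i] for i in range(len(points))]                        # each object is its own cluster
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)   # all pairwise distances

    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # intercluster distance = distance between the two closest points
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]                                      # merge the two closest clusters
        del clusters[b]
    return clusters

# Example: two obvious groups on a line
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
print(single_linkage(pts, 2))   # -> [[0, 1, 2], [3, 4]]
```
The nested scan over cluster pairs is what gives the O(n^3) behavior discussed below.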
Intercluster Distance in Single Linkage Clustering
the distance between the 2 closest points in the 2 clusters
Intercluster distance represents
domain knowledge
T/F Single Link Clustering is Deterministic
True. Unless there are ties, it always produces the same clustering for the same data.
In SLC, if the distances can be represented as edge lengths of a graph, which algorithm is it equivalent to?
Minimum Spanning Tree
What is the running time of Single Link Clustering (SLC) in terms of n points and k clusters
O(n^3): each merge takes O(n^2) to look at all pairwise distances and find the closest pair, and there are O(n) merges.
Issues with SLC
Can end up with strange, stringy clusters, e.g., a chain-like cluster that wraps around another cluster
k-means clustering algorithm
- pick k centers (at random)
- each center “claims” its closest points
- recompute the centers by averaging the clustered points
- repeat until convergence
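A minimal numpy sketch of that loop (the helper name and toy data are mine; empty clusters are not handled):
```python
import numpy as np

def kmeans(points, k, rng=np.random.default_rng(0)):
    """Minimal k-means: pick k centers at random, let each center claim its
    closest points, recompute centers as cluster means, repeat until the
    centers stop moving. Assumes no cluster ever becomes empty."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # each center "claims" its closest points
        labels = np.argmin(np.linalg.norm(points[:, None] - centers[None, :], axis=-1), axis=1)
        # recompute the centers by averaging the clustered points
        # (after this step the centers generally are NOT points in the data set)
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # converged: centers did not move
            return labels, new_centers
        centers = new_centers

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centers = kmeans(pts, 2)
print(labels, centers)   # points 0,1 end up in one cluster and 2,3 in the other
```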
When does k-means converge
When recomputing the centers does not move them (the cluster assignments stop changing)
T/F the centers have to be a point in the collection of objects
False
k-means in euclidean space process
alternate between assigning points to partitions and recomputing the centers
Which optimization algorithm is most like k-means? HC, GA, or SA
Hill Climbing. You take a step towards a configuration that is better than the one you had before
K-means runs out of configurations and converges in finite time when
there are finitely many configurations, ties are broken consistently, and it never moves to a configuration with higher error
Big O for Each iteration of k-means clustering is
polynomial (pretty fast): O(kn) to compute the distance from each of the k centers to each of the n points and run through all the points to redefine the clusters; an extra factor of d for d-dimensional data, i.e., O(knd)
Big O for Finite iterations of k-means
exponential, O(k^n): each of the n objects can be assigned to any of the k clusters, so there are k^n possible configurations to run through in the worst case. In practice it converges much faster.
If ties are broken consistently in k-means then error
decreases [with one exception]
T/F k-means cannot get stuck
False (local optima can still happen)
How to avoid k-means getting stuck
random restart OR do initial analysis and pick centers in good spots
Soft Clustering Overview
Points belong to clusters probabilistically: each point has some probability of coming from each cluster, rather than being assigned to exactly one
Soft Clustering Idea
Assume the data was generated by
1. select one of k Gaussians uniformly [fixed known variance]
2. sample x from that Gaussian
3. repeat n times
Task: find a hypothesis that maximizes the probability of the data (maximum likelihood)
What is the maximum likelihood gaussian (soft clustering)
The ML mean of the Gaussian is the mean of the data
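For a single Gaussian with known variance this is the standard result (notation mine, not course-specific):
$$\hat{\mu}_{ML} = \arg\max_{\mu} \prod_{i=1}^{n} P(x_i \mid \mu) = \frac{1}{n}\sum_{i=1}^{n} x_i$$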
Expectation Maximization (EM)
Has two phases: expectation (E-step), which computes the soft clustering assignments, and maximization (M-step), which recomputes the cluster means from those assignments
How is expectation maximization like k-means
EM reduces to k-means if the cluster assignments use arg max (probabilities of 0 or 1). EM improves a probabilistic (likelihood) metric the same way k-means improves a squared-error metric.
Properties of EM
- Monotonically non-decreasing likelihood (it’s not getting worse)
- Does not (in theory) converge, because there are infinitely many configurations (an infinite number of possible probabilities); in practice it effectively does
- Will not diverge (numbers don’t blow up and become infinitely large)
- Can get stuck (local optima)
- Works with any distribution (if the E and M steps are solvable)
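A minimal sketch of EM for k one-dimensional Gaussians with fixed, known variance and uniform priors, matching the soft clustering setup above (names and toy data are mine):
```python
import numpy as np

def em_gaussians(x, k, iters=50, sigma=1.0, rng=np.random.default_rng(0)):
    """E-step: soft cluster assignments (responsibilities) from the Gaussian
    densities. M-step: recompute each mean as a responsibility-weighted
    average of the data. Likelihood is monotonically non-decreasing."""
    means = rng.choice(x, size=k, replace=False)            # initialize means at random data points
    for _ in range(iters):
        # E-step: P(cluster j | x_i), proportional to the Gaussian density
        dens = np.exp(-0.5 * ((x[:, None] - means[None, :]) / sigma) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)        # rows sum to 1 (soft assignments)
        # M-step: means are weighted averages of the data
        means = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return means, resp

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)])
means, resp = em_gaussians(x, 2)
print(np.sort(means))   # roughly [0, 6]
```
Replacing `resp` with a hard arg max assignment turns this into k-means, as the card above notes.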
Clustering Properties
- Richness
- Scale Invariance
- Consistency
Richness
For any assignment of objects to clusters, there is some distance matrix D such that your clustering algorithm returns that clustering.
Scale Invariance
scaling distances by a positive value does not change the clustering
Consistency
Shrinking intracluster distances and expanding intercluster distances does not change the clustering
Impossibility Theorem
No clustering algorithm can achieve all three of richness, scale invariance, and consistency
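One compact way to state the three properties for a clustering scheme $f$ that maps a distance matrix $D$ to a partition (this roughly follows Kleinberg's formulation; the notation is mine):
- Richness: for every partition $P$ of the objects, $\exists D$ such that $f(D) = P$
- Scale invariance: $\forall D,\ \forall \alpha > 0:\ f(\alpha D) = f(D)$
- Consistency: if $D'$ shrinks distances within the clusters of $f(D)$ and expands distances between them, then $f(D') = f(D)$
The impossibility theorem says no single $f$ can satisfy all three.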
Curse of dimensionality
The amount of data needed grows exponentially with the number of features (O(2^n))
Big O of feature selection
NP-hard: searching over all 2^n subsets of n features is exponential
Two approaches to feature selection
filtering and wrapping
filtering overview
takes the set of features as input and passes it to some search algorithm that outputs a smaller set of features, before the learner ever sees them
wrapping overview
take the set of features, search over subsets of those features, and hand each subset to the learning algorithm, which reports how well those features performed; that score drives the search
Pros of filtering
speed: faster than wrapping because you don’t have to worry about what the learner does
cons of filtering
ignores the learning problem; looks at each feature in isolation
Pros of wrapping
takes into account model bias and learning itself
con of wrapping
extremely slow compared to filtering
ways to do feature filtering
- information gain
- variance, entropy
- “useful features”
- independent / non-redundant features
- ex: run a decision tree (DT) and keep the features it splits on, or a neural net (NN) and trim features with low weights
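A small sketch of two of these filtering criteria (variance and information gain via mutual information), assuming scikit-learn is available; the toy data and thresholds are made up:
```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

# Toy data: x0 is informative about y, x1 is noise, x2 is constant
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = np.column_stack([y + 0.1 * rng.normal(size=200),   # informative
                     rng.normal(size=200),             # irrelevant noise
                     np.zeros(200)])                    # zero variance

X_var = VarianceThreshold(threshold=1e-3).fit_transform(X)          # drops the constant feature
X_ig = SelectKBest(mutual_info_classif, k=1).fit_transform(X, y)    # keeps the high-information feature
print(X_var.shape, X_ig.shape)   # (200, 2) (200, 1)
```
Note that neither criterion ever consults the learner, which is exactly why filtering is fast but ignores the learning problem.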
ways to do feature wrapping
- hill climbing
- randomized Optimization
- forward search
- backward search
forward search
Look at all features in isolation and keep whichever feature is best. Then look at adding any of the remaining features and choose whichever is best in combination with the first feature, and so on, stopping when adding a feature no longer helps. This is similar to hill climbing.
backward search
start with all the features, figure out which one you can eliminate, and remove features one at a time until you get to a subset that still performs well
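A minimal sketch of greedy forward search used as a wrapping method, assuming scikit-learn is available (the learner choice and helper name are mine); backward search is the mirror image, starting from all features and greedily removing them:
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_search(X, y, learner=DecisionTreeClassifier(random_state=0)):
    """Greedy forward search: start with no features, repeatedly add the single
    feature that most improves the learner's cross-validated score, and stop
    when no addition helps (like hill climbing)."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = [(cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean(), f)
                  for f in remaining]
        score, f = max(scores)
        if score <= best_score:          # no improving step: stop
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected

# Toy usage: y depends only on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)); y = (X[:, 0] > 0).astype(int)
print(forward_search(X, y))   # typically [0]
```
Because the learner is retrained inside the loop, this takes model bias into account but is much slower than filtering.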
Relevance: strongly relevant
xi is strongly relevant if removing it degrades the Bayes optimal classifier
Relevance: weakly relevant
Xi is weakly relevant if
- not strongly relevant
- and there exists some subset of features such that adding xi to that subset improves the Bayes optimal classifier
Relevance measures effect on
Bayes optimal classifier. INFORMATION
Usefulness measures effect on
a particular predictor: minimizing ERROR given a specific model/learner (e.g., c is useful in neural nets)
feature transformation
the problem of pre-processing a set of features to create a new (smaller? more compact?) feature set, while retaining as much (relevant? useful?) information as possible
T/F Feature selection is a subset of feature transformation
True
feature transformation vs feature selection
transformation: create new features as (typically linear) combinations of the original features, e.g., 2x1 + x2 as a single new feature. selection: given x1, x2, x3, x4, keep only x1 and x2.
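A tiny numpy illustration of the difference (the matrices are made up):
```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0]])   # rows = examples, columns = x1..x4

# Feature selection: keep a subset of the original features (x1 and x2)
X_selected = X[:, [0, 1]]

# Feature transformation: new features are linear combinations of the originals,
# e.g. the single feature 2*x1 + x2, written as a projection matrix P
P = np.array([[2.0],
              [1.0],
              [0.0],
              [0.0]])
X_transformed = X @ P                   # shape (2, 1)

# Selection is the special case where each column of P has a single 1
print(X_selected, X_transformed, sep="\n")
```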
ad hoc information retrieval problem (google problem)
you have a whole bunch of documents (a database) and you want to retrieve the subset of documents that are relevant to a query (the features)
ad hoc because you don’t know in advance what the queries will be
polysemy
words can have multiple meanings
ex: car is a vehicle and also an operator in LISP
results in false positives
synonymy
many words mean the same thing. ex: car and automobile are the same thing
results in false negatives