Final Prep Flashcards
2nd Half Semester Information
<p>Unsupervised Learning</p>
<p>Make sense of unlabeled data.
Data description: find a more compact way to describe the data.
Contrast with supervised learning, which does function approximation on labeled data.</p>
<p>Basic Clustering Problem</p>
<p>Given:
- a set of objects X
- inter-object distances D(x, y), with D(x, y) = D(y, x)
Output:
- a partition function P_D, where P_D(x) = P_D(y) exactly when x and y belong to the same cluster
- P_D acts as a compact descriptor: all equivalent objects share one cluster label</p>
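A tiny sketch just to make the notation concrete; the points and the hand-built partition below are made-up placeholders:

```python
import numpy as np

# Made-up objects and their symmetric inter-object distances D(x, y) = D(y, x).
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# A hand-built partition function P_D: object index -> cluster label.
# Two objects are "equivalent" exactly when P_D gives them the same label.
P_D = {0: 0, 1: 0, 2: 1, 3: 1}
print(P_D[0] == P_D[1])  # True: same cluster
print(P_D[1] == P_D[2])  # False: different clusters
```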
<p>Single Linkage Clustering (SLC)</p>
<p>Start with n objects and all pairwise distances between them.
- Treat each object as its own cluster
- Repeatedly connect the two closest points that are not already in the same cluster; to be SLC, the inter-cluster distance must be the distance between the closest pair of points, one from each cluster
- Keep merging the closest clusters until only k clusters remain
Distances only matter between items that are not yet linked. A scipy sketch follows below.</p>
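A rough sketch using scipy's single-linkage implementation (the toy points are made up; `t=2` asks for k = 2 clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# scipy's "single" method repeatedly merges the two clusters whose
# closest pair of points is nearest, exactly as described above.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.0]])

Z = linkage(X, method="single")                   # hierarchical agglomerative merges
labels = fcluster(Z, t=2, criterion="maxclust")   # stop once only k = 2 clusters remain
print(labels)
```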
<p>Hierarchical Agglomerative Cluster Structure</p>
<p>Used to represent the hierarchy of merges produced by SLC: each merge adds a level to the tree.</p>
<p>Running Time of SLC</p>
<p>For n points and k clusters: O(n^3). Each merge scans the O(n^2) pairwise distances to find the closest pair, and roughly n/2 merges are needed. Similar to the minimum spanning tree problem.</p>
<p>k-means clustering algorithm</p>
<p>-pick k "centers" (at random)
- each center claims center
- recompute the centers by averaging clustered
- repeat until converged
- this center could be a point in the collection or a point in the space. This is almost like creating kNN</p>
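A minimal NumPy sketch of the loop above (the function name, defaults, and initialization from the data are my own choices):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means sketch: pick k centers at random, let each center claim
    its nearest points, recompute centers as cluster means, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # centers drawn from the data
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        assignment = dists.argmin(axis=1)
        # Update step: move each center to the mean of the points it claimed.
        new_centers = np.array([
            X[assignment == j].mean(axis=0) if np.any(assignment == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, assignment
```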
<p>Euclidean space</p>
<p>In geometry, a two- or three-dimensional space in which the axioms and postulates of Euclidean geometry apply; also, a space in any finite number of dimensions, in which points are designated by coordinates (one for each dimension) and the distance between two points is given by a distance formula.</p>
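The distance formula referred to above, for two points p and q in d dimensions:

$$ d(p, q) = \sqrt{\sum_{i=1}^{d} (p_i - q_i)^2} $$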
<p>k-means Proof</p>
<p>P(x): partition/cluster assigned to object x
C_i: set of all points assigned to cluster i
center_i = (1/|C_i|) * Σ_{y ∈ C_i} y
The iteration alternates P(x) → centers → P(x): assignments determine centers, which determine new assignments.
</p>
<p>K-Means as Optimization</p>
<p>Configurations (inputs): the centers and the partition P
Scores: error grows with each object's distance from its center, E(P, center) = Σ_x ||center_P(x) - x||^2
Neighborhood: configurations where either P is fixed and the centers move, or the centers are fixed and P changes
</p>
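A small sketch of the score computation; the function name and array shapes are my own assumptions (X is (n, d), centers is (k, d), assignment holds each point's cluster index):

```python
import numpy as np

def kmeans_error(X, centers, assignment):
    """Score a k-means configuration: sum of squared distances
    from each point to the center of its assigned cluster."""
    return float(np.sum((X - centers[assignment]) ** 2))
```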
<p>Randomized Optimization most like K-Means</p>
<p>Hill Climbing
Each step moves to a configuration at least as good as the one before: we keep moving to neighbors that score slightly better.</p>
<p>Properties of K-means clustering</p>
<p>- Each iteration is polynomial: O(kn)
- Finite (but exponential) number of iterations: O(k^n), the number of ways of assigning points to partitions
  in practice convergence takes far fewer iterations
- Error decreases (if ties are broken consistently)
- Can get stuck in local optima
</p>
<p>Stop K-means stuck mitigation</p>
<p>- Random restarts
- Review the data up front to pick initial centers that are far apart</p>
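A sketch of both mitigations using scikit-learn (the data here is a placeholder): `n_init` runs several random restarts and keeps the lowest-error run, and `init="k-means++"` spreads the initial centers apart.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))  # placeholder data

# n_init=10 -> 10 random restarts, keep the run with the lowest error;
# init="k-means++" -> choose initial centers that are far apart.
km = KMeans(n_clusters=3, n_init=10, init="k-means++", random_state=0)
labels = km.fit_predict(X)
```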
<p>Equal Distance Points between Clusters</p>
<p>A point that is equally distant from two clusters may end up assigned to either one; ties should be broken consistently.
</p>
<p>Soft Clustering</p>
<p>Assume the data were generated by:
1) Selecting one of k Gaussians (with known variance)
2) Sampling x_i from that Gaussian
3) Repeating n times
Task: find a hypothesis h = (mu_1, ..., mu_k), the k means, that maximizes the probability of the data (maximum likelihood). A sampling sketch follows below.</p>
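A sketch of that generative process in one dimension (k, n, the variance, and the hidden means here are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, sigma = 3, 500, 1.0                 # known, equal variances
true_means = np.array([-5.0, 0.0, 5.0])   # hidden means (placeholders)

# The generative story: pick one of k Gaussians, then sample x_i from it, n times.
z = rng.integers(k, size=n)               # which Gaussian produced each point (hidden)
x = rng.normal(loc=true_means[z], scale=sigma)
```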
<p>Maximum Likelihood Gaussian</p>
<p>- The ML mean of the gaussian is the mean of the data
- does not apply for k different means, just for one
- the k different means is based on assigning each point hidden variables to determine what gaussian it is a part of</p>
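For the single-Gaussian case, the familiar result:

$$ \mu_{ML} = \arg\max_{\mu} \prod_{i=1}^{n} P(x_i \mid \mu) = \frac{1}{n} \sum_{i=1}^{n} x_i $$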
<p>Expectation Maximization</p>
<p>Tick-tock between expectation (define z from mu: compute the expected cluster memberships given the current means) and maximization (define mu from z: re-estimate the means from those memberships). A sketch is given below.</p>
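A minimal sketch of EM for a 1-D mixture of k Gaussians, assuming equal mixing weights and a known, shared variance (the function name and defaults are my own):

```python
import numpy as np

def em_gmm_1d(x, k, sigma=1.0, iters=50, seed=0):
    """EM sketch for a 1-D mixture of k Gaussians with known, equal variance.
    E-step: soft memberships z from the current means mu.
    M-step: new means mu from the memberships z."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)       # initial guesses for the means
    for _ in range(iters):
        # E-step: P(point i came from Gaussian j), up to a shared constant,
        # computed in a numerically stable way before normalizing each row.
        logp = -0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
        resp = np.exp(logp - logp.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: each mean is the responsibility-weighted average of the data.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu, resp
```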
<p>Properties of EM</p>
<p>- Monotonically non-decreasing likelihood
  it always moves to something at least as good
- Has a chance to not converge (in practice it does)
- Will not diverge
  (in contrast, k-means has only finitely many configurations and so always converges)
- Can get stuck
  it can settle in a local optimum where the assignment looks good but is not the best; fix with random restarts
- Works with any distribution
  domain knowledge determines the E and M steps for the chosen distribution</p>
<p>Clustering Properties</p>
<p>- Richness
  any clustering of the inputs can be produced by some choice of distances
- Scale invariance
  doubling or halving all distances does not change how the points should be clustered
- Consistency
  shrinking point-to-point distances within clusters and expanding cluster-to-cluster distances does not change the clustering. This is an application of domain knowledge.</p>
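Stated a little more formally, with P_D denoting the clustering produced from distance matrix D (this is the standard formulation of these three properties):

$$ \text{Richness: } \forall\, \text{partitions } C\ \exists\, D \text{ such that } P_D = C $$
$$ \text{Scale invariance: } \forall\, D,\ \forall\, \alpha > 0:\ P_{\alpha D} = P_D $$
$$ \text{Consistency: if } D' \text{ shrinks intra-cluster and expands inter-cluster distances of } P_D, \text{ then } P_{D'} = P_D $$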
<p>Impossibility Theorem</p>
<p>Richness, scale invariance, and consistency cannot all hold at the same time for a single clustering algorithm</p>
<p>Feature Selection</p>
<p>Two reasons:
1) Knowledge discovery
   improves interpretability and insight
2) Curse of dimensionality
   the amount of data needed grows exponentially with the number of features</p>
<p>How hard is it to reduce a problem from N features to m features</p>
<p>This is an exponentially hard problem: there are 2^N possible subsets, or (N choose m) subsets of exactly m features. It is NP-hard, and can be related to the 3-SAT NP-hard problem.</p>
<p>Filtering- Features</p>
<p>Run the features through some search/selection criterion, then hand the chosen subset to the learning algorithm. The flow is forward only, with no feedback from the learner. The criterion can look at the labels, so something like information gain can be used; a decision tree could even act as the filter (keep the features it splits on).
Pro
  Speed
Cons
  Looks at features in isolation, even though a feature may only matter in combination with others
  Ignores the learner.</p>
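A rough scikit-learn sketch of this flow (the synthetic dataset is a placeholder, and mutual information stands in for information gain):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)

# Filter: score each feature against the labels, keep the top m,
# then train the learner on the reduced data. No feedback loop.
filt = SelectKBest(mutual_info_classif, k=5)
X_reduced = filt.fit_transform(X, y)
clf = DecisionTreeClassifier(random_state=0).fit(X_reduced, y)
```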
<p>Wrapping - Features</p>
<p>Search over subsets of features, hand each subset to the learning algorithm, and feed the resulting score back into the search.
Pro
  Takes into account the learner and its model bias
Con
  Very slow, since the learner's training time sits inside the search loop.</p>
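A rough sketch of a forward-selection wrapper, assuming scikit-learn >= 0.24 for SequentialFeatureSelector (the dataset and choice of learner are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)

# Wrapper: the search proposes a feature subset, the learner is trained and
# cross-validated on it, and the score feeds back into the search.
learner = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(learner, n_features_to_select=5, direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # mask of the selected features
```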