Machine Learning Flashcards
Inductive learning
generalize from a given set of (training) examples so that accurate predictions can be made about future examples; learn an unknown function
how to represent a “thing” in machine learning
x: an example or instance of a specific object; represented by a feature vector; each dimension is one feature
feature vector representation
extract a feature vector x that describes all attributes relevant for an object; each x is a list of (attribute, value) pairs
types of features
numerical features - discrete or continuous
categorical features - no intrinsic ordering
ordinal features - similar to categorical but with a clear ordering
point in feature vector representation
each example can be interpreted as a point in a D-dimensional feature space, where D is the number of features/attributes
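A minimal sketch of this idea, assuming a made-up example whose attribute names and values are purely illustrative: once a fixed attribute order is chosen, the (attribute, value) pairs become a point in D-dimensional space.

```python
# Hypothetical example: attribute names and values are invented for illustration.
example = {"weight_g": 150.0, "diameter_cm": 7.5, "sweetness": 6.0}

# Fix an attribute order so every example maps onto the same coordinate axes.
attributes = sorted(example)  # ['diameter_cm', 'sweetness', 'weight_g']

def to_point(ex, attrs):
    """Turn (attribute, value) pairs into a point in D-dimensional space."""
    return tuple(ex[a] for a in attrs)

x = to_point(example, attributes)
D = len(x)  # number of features = dimensionality of the feature space
print(x, D)  # (7.5, 6.0, 150.0) 3
```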
Training set
A training set is a collection of examples (instances), which is the input to the learning process; assume instances are independent and identically distributed. Training set = experience given to the learning algorithm
iid
independent and identically distributed
Unsupervised learning
training set = x1, …, xn; no “teacher” to show how examples should be handled; tasks: clustering, discovery, novelty detection, dimensionality reduction
goal of clustering
group training samples into clusters such that examples in the same cluster are similar, and examples in different clusters are different
Clustering methods
Hierarchical Agglomerative Clustering
K-means Clustering
Mean Shift Clustering
Hierarchical Clustering General Idea
initially every point is in its own cluster
find the pair of clusters that are the closest
merge the two into a single cluster
repeat
end result: binary tree
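The steps above can be sketched in plain Python. This is an illustrative, unoptimized version that assumes Euclidean distance and single linkage; the function names `single_link` and `agglomerate` are hypothetical, and leaves of the returned tree are indices into the input list.

```python
import math

def single_link(a, b):
    """Shortest distance from any member of a to any member of b."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, cluster_dist=single_link):
    """Repeatedly merge the closest pair of clusters until one remains.

    Returns the merge history as a nested 2-tuple (a binary tree).
    """
    # Initially every point is in its own cluster: (members, subtree).
    clusters = [([p], idx) for idx, p in enumerate(points)]
    while len(clusters) > 1:
        # Find the pair of clusters that are the closest.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        (members_i, tree_i), (members_j, tree_j) = clusters[i], clusters[j]
        # Merge the two into a single cluster and record the merge in the tree.
        merged = (members_i + members_j, (tree_i, tree_j))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][1]

tree = agglomerate([(0.0,), (1.0,), (10.0,)])
print(tree)  # (2, (0, 1))
```

Points 0 and 1 are merged first because they are closest; point 2 joins at the final merge, which becomes the root of the tree.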
How to measure closeness between 2 clusters (hierarchical clustering)
Single linkage, complete-linkage, average linkage
single-linkage
the shortest distance from any member of one cluster to any member of another cluster
complete linkage
the largest distance from any member of one cluster to any member of another cluster
average linkage
the average distance between all pairs of members, one from each cluster
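The three linkage definitions can be compared side by side. A minimal sketch, assuming Euclidean distance between examples; the function names are illustrative, not taken from any library.

```python
import math
from itertools import product

def pair_distances(a, b):
    """Distances between all pairs of members, one from each cluster."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    return min(pair_distances(a, b))   # closest pair

def complete_linkage(a, b):
    return max(pair_distances(a, b))   # farthest pair

def average_linkage(a, b):
    d = pair_distances(a, b)
    return sum(d) / len(d)             # mean over all cross-cluster pairs

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(a, b))    # 2.0
print(complete_linkage(a, b))  # 5.0
print(average_linkage(a, b))   # 3.5
```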
How to measure the distance between a pair of examples?
Euclidean, Manhattan/city block, Hamming
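A short sketch of the three distances, assuming both examples are equal-length feature vectors. Hamming distance is shown here on categorical values, counting the positions where the two examples differ.

```python
import math

def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City-block distance: sum of absolute differences per feature."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of features on which the two examples disagree."""
    return sum(a != b for a, b in zip(x, y))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
print(hamming(("red", "small", "round"), ("red", "large", "round")))  # 1
```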
Dendrogram
binary tree resulting from hierarchical clustering; the tree can be cut at any level to produce different numbers of clusters
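Cutting the tree can be sketched as follows. The representation is an assumption made for illustration: an internal node is a tuple `(height, left, right)` where `height` is the distance at which the two subtrees were merged, a leaf is a string label, and heights do not decrease toward the root.

```python
def leaves(node):
    """Collect the leaf labels under a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def cut(node, level):
    """Cut the tree at the given height; each subtree below the cut is a cluster."""
    if isinstance(node, str) or node[0] <= level:
        return [leaves(node)]
    _, left, right = node
    return cut(left, level) + cut(right, level)

# Hypothetical dendrogram with made-up merge heights.
tree = (9.0, (1.0, "a", "b"), (2.0, "c", (0.5, "d", "e")))

print(cut(tree, 5.0))  # [['a', 'b'], ['c', 'd', 'e']]
print(cut(tree, 1.5))  # [['a', 'b'], ['c'], ['d', 'e']]
```

Lowering the cut level yields more, smaller clusters; raising it above the root height yields a single cluster.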