Week 6 Flashcards
Chapter 6
How do humans and machines learn?
(similarities and differences)
Similarities
- Learning - For humans, it is a natural cognitive process, while for machines, it is a function of their programming.
- Experience - Humans learn from life experiences, while machines learn from the data they are exposed to.
- Feedback - Humans improve based on feedback from teachers, peers, or results, and machines improve based on feedback loops programmed into their algorithms (like reinforcement learning).
Differences
- General vs. specific learning - Humans are capable of general learning and can apply knowledge in varied contexts, whereas machines typically learn specific tasks and are not as flexible.
- Small vs. big data - Humans can learn effectively from small amounts of data, while machines require large datasets to learn effectively.
- Learning speed - Machines can process vast amounts of information quickly once they are properly trained, while human learning is a slower, more gradual process.
Designing a Learning System
choosing training experience -> choosing target function -> choosing representation of target function -> choosing function approximation -> final design
ML models:
1) Supervised learning
2) Unsupervised learning
3) Reinforcement learning
1) Supervised learning
- learns from labeled data
- e.g. decision tree, SVM
2) Unsupervised learning
- learns from unlabeled data
- e.g. K-Means clustering
3) Reinforcement learning
- learns to make decisions through rewards, via continuous trial and error
- e.g. Markov decision process
Types of supervised learning: classification, regression
TYPES OF SUPERVISED LEARNING
Classification
*It involves a type of problem where the output variable is a category (categorical), such as "positive or negative" or "living things or non-living things".
Problems that involve predicting a
category output:
● Image classification
○ Medical image classification
● Text classification
○ Sentiment analysis
○ Spam detection
Regression
*It involves a type of problem where the output variable is a real value (quantitative), such as weight, budget, etc.
Problems that involve predicting a
numerical output:
● Predicting stock price
● Predicting a house’s sale
price based on its square
footage
● Revenue forecasting
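A minimal sketch of the two problem types, assuming scikit-learn is available (the toy datasets and models here are illustrative, not from the slides): the workflow is the same, but classification predicts a category while regression predicts a number.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: output is a class label (e.g. 0 = "negative", 1 = "positive")
Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
clf = DecisionTreeClassifier().fit(Xc, yc)
print("predicted class:", clf.predict(Xc[:1]))

# Regression: output is a real value (e.g. a price)
Xr, yr = make_regression(n_samples=200, n_features=4, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print("predicted value:", reg.predict(Xr[:1]))
```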
Today: Support Vector Machine (SVM)
A classifier derived from statistical learning theory by Vapnik et al. in 1992.
SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features on a handwriting recognition task.
Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
It is also used for regression (not covered today).
Outline
Linear Discriminant function
Large Margin Linear Classifier
Nonlinear SVM: The Kernel Trick
Demo of SVM
Linear SVMs: Overview
- The classifier is a separating hyperplane.
- The most "important" training points are the support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. those with non-zero Lagrange multipliers α_i.
- Both in the dual formulation of the problem and in the solution, training points appear only inside inner products x_i^T x_j.
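A brief illustrative sketch (assuming scikit-learn; the blob data and parameter values are made up) showing that a fitted linear SVM exposes exactly these support vectors and the hyperplane they define:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data (assumed), fit with a linear SVM
X, y = make_blobs(n_samples=60, centers=2, random_state=0)
svm = SVC(kernel="linear", C=1.0).fit(X, y)

# The "important" training points: the support vectors
print("support vectors per class:", svm.n_support_)
print(svm.support_vectors_)

# The separating hyperplane w.x + b = 0
print("w =", svm.coef_[0], "b =", svm.intercept_[0])
```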
Non-linear SVMs
- Datasets that are linearly separable with some noise work out great.
- But what are we going to do if the dataset is just too hard?
- How about… mapping the data to a higher-dimensional space?
Non-linear SVMs: Feature Spaces
- General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable.
The “Kernel Trick”
- The linear classifier relies on the inner product between vectors: K(x_i, x_j) = x_i^T x_j.
- If every datapoint is mapped into a high-dimensional space via some transformation φ: x → φ(x), the inner product becomes: K(x_i, x_j) = φ(x_i)^T φ(x_j).
- A kernel function is a function that is equivalent to an inner product in some feature space. Example:
  For 2-dimensional vectors x = [x_1, x_2], let K(x_i, x_j) = (1 + x_i^T x_j)^2.
  We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j) for some φ:
  K(x_i, x_j) = (1 + x_i^T x_j)^2
              = 1 + x_i1^2 x_j1^2 + 2 x_i1 x_j1 x_i2 x_j2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 + 2 x_i2 x_j2
              = [1, x_i1^2, √2 x_i1 x_i2, x_i2^2, √2 x_i1, √2 x_i2]^T [1, x_j1^2, √2 x_j1 x_j2, x_j2^2, √2 x_j1, √2 x_j2]
              = φ(x_i)^T φ(x_j), where φ(x) = [1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2]
- Thus, a kernel function implicitly maps data to a high-dimensional
space (without the need to compute each φ(x) explicitly).
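A quick numerical check of the worked example above (NumPy assumed; the sample points are arbitrary): the kernel evaluated in the original 2-D space matches the explicit inner product in the 6-D feature space.

```python
import numpy as np

def phi(x):
    # The mapping from the example: phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([0.7, -1.2])   # arbitrary sample points
xj = np.array([2.0, 0.5])

k_direct = (1 + xi @ xj) ** 2   # kernel computed without ever forming phi(x)
k_mapped = phi(xi) @ phi(xj)    # explicit inner product in feature space
print(k_direct, k_mapped)       # the two values agree
```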
SVM Applications
- SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
- SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
- SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
- SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
- Most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the α_i's at a time, e.g. SMO [Platt '99] and [Joachims '99].
- Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner.
TYPES OF UNSUPERVISED LEARNING
Clustering
*Technique that involves grouping
similar objects or data points
together into clusters, based on
certain features or
characteristics.
Dimensionality Reduction
*Technique that reduces the number of features (dimensions) in a dataset while preserving as much of the meaningful information as possible.
CLUSTERING ALGORITHM
UNSUPERVISED LEARNING ALGORITHM
Clustering is a machine learning technique that
involves grouping similar objects or data points
together into clusters, based on certain features or
characteristics.
* Important in various industries, such as marketing,
healthcare, finance, and more.
* The main goals of clustering are to identify natural
groupings in data, minimize the distance or
dissimilarity between data points within the same
cluster, and maximize the distance or dissimilarity
between data points in different clusters.
SIMILARITY AND DISSIMILARITY MEASURES FOR CLUSTERING (Euclidean Distance, Manhattan Distance, Cosine Similarity)
- Euclidean Distance: A common measure of distance between two points in a multidimensional space. It is calculated as the square root of the sum of the squared differences between corresponding coordinates of the two points.
- Manhattan Distance: Another measure of distance between two points, based on the absolute differences between corresponding coordinates of the two points. It is often used in image recognition or pattern matching.
- Cosine Similarity: A measure of similarity between two vectors, based on the cosine of the angle between them. It is often used in natural language processing or recommendation systems.
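A small NumPy sketch (toy vectors assumed) computing the three measures:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))     # sqrt of the sum of squared differences
manhattan = np.sum(np.abs(a - b))             # sum of absolute differences
cosine    = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine of the angle

print(euclidean, manhattan, cosine)
```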
CLUSTERING ALGO: K-MEANS
- K-Means is a widely used algorithm for clustering that aims to partition a dataset into K distinct clusters, where K is a pre-specified number.
- Effective for well-separated clusters and can handle different types of data, such as numerical or categorical variables.
- Requires the pre-specification of the number of clusters, which may not be known in advance or may be difficult to determine.
Steps of K-Means Algorithm:
1. Initialization: Select K random data points from the dataset as the initial centroids for the K clusters.
2. Assignment: Assign each data point to the cluster whose centroid is closest to it, based on a distance measure such as Euclidean distance or Manhattan distance.
3. Update: Recalculate the centroid of each cluster as the mean of all the data points assigned to it in the previous step.
4. Convergence: Repeat steps 2 and 3 until the centroids converge, that is, until the change in the centroids is below a certain threshold or the maximum number of iterations is reached.
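A minimal NumPy sketch of these four steps (an illustrative implementation, not scikit-learn's KMeans; the function name and default values are assumptions):

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 - Initialization: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2 - Assignment: each point joins the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 - Update: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        # Step 4 - Convergence: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids

# Usage: labels, centroids = kmeans(points, K=3)
```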
K-MEANS ALGORITHM - INITIALIZATION
- The initialization step is the first step in the K-Means clustering algorithm, and it involves selecting the initial centroids for the K clusters.
- One common method for initialization is random selection, where K data points are randomly selected from the dataset to serve as the initial centroids.
- Example: As a starting point, you tell your model how many clusters it should make. First the model picks K (let K = 3) data points from the dataset. These data points are called cluster centroids.
- There are different ways to initialize the centroids: you can either choose them at random, or sort the dataset, split it into K portions, and pick one data point from each portion as a centroid.
K-MEANS ALGORITHM – ASSIGNMENT STEP
- Involves assigning each data point to the nearest centroid based on its distance.
- The distance metric used to measure the distance between data points and centroids can affect the resulting clusters.
- The most common distance metric used in K-Means clustering is the Euclidean distance:
  D(x_i, x_i') = ||x_i − x_i'|| = √( Σ_{j=1..p} (x_ij − x_i'j)^2 )
- Clusters defined by Euclidean distance are invariant to translations and rotations in feature space, but not invariant to scaling of features.
K-MEANS ALGORITHM – UPDATE STEP
The update step is the third step in the K-Means clustering algorithm, and it involves updating the centroids of each cluster based on the mean of the data points assigned to that cluster.
* The most common method used for updating centroids in K-Means clustering is the simple mean, which calculates the mean of all the data points assigned to each cluster.
* Because the initial centroids were chosen arbitrarily, your model then updates them with new cluster values.
* If some other algorithm, like K-Modes or K-Medians, were used, the mode or median would be taken respectively instead of the average value.
K-MEANS ALGORITHM – CONVERGENCE CRITERIA
- The convergence criterion is the final step in the K-Means clustering algorithm, and it determines when the algorithm should stop iterating and the clusters are considered stable.
- Common convergence criteria:
  - The algorithm stops after a predetermined number of iterations, regardless of whether the clusters have stabilized or not.
  - Minimum change in centroids: the algorithm stops iterating when the centroids of all clusters change by less than a certain threshold value between consecutive iterations.
  - Minimum change in the sum of squared errors (SSE) or the silhouette score.
CLUSTERING ALGO – HIERARCHICAL CLUSTERING
- Type of clustering algorithm that creates a tree-like structure of clusters by recursively dividing or merging clusters based on the similarity between data points.
- Two main types:
  - Agglomerative (bottom-up merging)
  - Divisive (top-down splitting)
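A short sketch of the agglomerative (bottom-up) variant, assuming SciPy is available (the toy data and parameters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))   # toy data
Z = linkage(X, method="ward")                       # merge clusters bottom-up into a tree
labels = fcluster(Z, t=3, criterion="maxclust")     # cut the tree into 3 flat clusters
print(labels)
```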
CLUSTERING ALGO: SPECTRAL CLUSTERING
- Represent data points as the vertices V of a graph G.
- All pairs of vertices are connected by an edge E.
- Edges have weights W: large weights mean that the adjacent vertices are very similar; small weights imply dissimilarity.
GRAPH PARTITIONING
Clustering on a graph is equivalent to partitioning the vertices of the graph.
* A loss function for a partition of V into sets A and B:
  cut(A, B) = Σ_{u∈A, v∈B} W_{u,v}
- In a good partition, vertices in different partitions will be dissimilar.
- Mincut criterion: find the partition (A, B) that minimizes cut(A, B).
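A tiny sketch (toy weight matrix assumed) of evaluating the cut loss for one candidate partition:

```python
import numpy as np

# Symmetric weight matrix for a 4-vertex graph (toy values)
W = np.array([[0.0, 0.8, 0.6, 0.1],
              [0.8, 0.0, 0.7, 0.0],
              [0.6, 0.7, 0.0, 0.2],
              [0.1, 0.0, 0.2, 0.0]])

A, B = [0, 1], [2, 3]                       # one candidate partition of the vertices
cut = sum(W[u, v] for u in A for v in B)    # sum of edge weights crossing the cut
print(cut)                                  # mincut = the partition minimizing this value
```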
SPECTRAL CLUSTERING
- In spectral clustering, the data points are treated as nodes of a graph.
- Spectral clustering involves 3 steps:
1. Compute a similarity graph
2. Project the data onto a low-dimensional space
3. Create clusters
SPECTRAL CLUSTERING – STEP 1
- We first create an undirected graph G = (V, E) with vertex set V = {v1, v2, …, vn}, one vertex per observation in the data.
- This can be represented by an adjacency matrix which has the similarity between each pair of vertices as its elements.
- To build it, we can opt for any of these options:
  - The ε-neighborhood graph
  - The KNN graph
  - The fully connected graph
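One way to build such an adjacency matrix is the KNN graph; a sketch assuming scikit-learn (the data and neighbor count are illustrative):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.default_rng(0).normal(size=(6, 2))             # 6 toy observations
A = kneighbors_graph(X, n_neighbors=2, mode="connectivity")  # sparse KNN graph
W = np.maximum(A.toarray(), A.toarray().T)                   # symmetrize into an adjacency matrix
print(W)
```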
SPECTRAL CLUSTERING – STEP 2
- Step 2: Project the data onto a low-dimensional space.
- Our goal is to transform the space so that when two points are close, they are always in the same cluster, and when they are far apart, they are in different clusters.
- Compute the Graph Laplacian, which is just another matrix representation of a graph and can be useful in finding interesting properties of a graph.
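A NumPy sketch of this step using a toy symmetric weight matrix (assumed, not the graph from the slides): build the unnormalized Laplacian L = D − W and take its eigenvectors.

```python
import numpy as np

# Toy weight matrix: two triangles joined by a single edge (vertices are 0-indexed)
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

D = np.diag(W.sum(axis=1))             # degree matrix
L = D - W                              # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
v2 = eigvecs[:, 1]                     # eigenvector of the 2nd-smallest eigenvalue
print(eigvals[1], v2)
```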
SPECTRAL CLUSTERING – STEP 3
- Step 3: Create clusters.
- For this step, we use the eigenvector corresponding to the 2nd eigenvalue to assign values to each node.
- On calculating, the 2nd eigenvalue is 0.189 and the corresponding eigenvector is v2 = [0.41, 0.44, 0.37, -0.40, -0.45, -0.37].
- To get a bipartite clustering (2 distinct clusters), we first assign each element of v2 to the nodes, such that {node1: 0.41, node2: 0.44, …, node6: -0.37}.
- Then split the nodes such that all nodes with value > 0 are in one cluster, and all other nodes are in the other cluster.
- Thus, in this case, we get nodes 1, 2 & 3 in one cluster, and nodes 4, 5 & 6 in the 2nd cluster.