week 6 Flashcards

chp 6

1
Q

How do humans and machines learn?
(similarities and differences)

A

Similarities

  • Learning - Both learn: for humans it is a natural cognitive process, while for machines it is a function of their programming.
  • Experience - Humans learn from life experiences, while machines learn from the data they are exposed to.
  • Feedback - Humans improve based on feedback from teachers, peers, or results, and machines improve based on feedback loops programmed into their algorithms (as in reinforcement learning).

Differences

  • General vs. specific learning - Humans are
    capable of general learning and can apply
    knowledge in varied contexts, whereas machines
    typically learn specific tasks and are not as
    flexible.
  • Small vs. big data - Humans can learn effectively
    from small amounts of data, while machines
    require large datasets to learn effectively.
  • Learning speed - Machines can process vast amounts of information quickly once they are properly trained, while human learning is a slower, more gradual process.
2
Q

Designing
a Learning
System

A

Choosing the training experience → choosing the target function → choosing a representation for the target function → choosing a function approximation algorithm → the final design

3
Q

ML models:
1) supervised learning
2) unsupervised learning
3) reinforcement learning

A

1) Supervised learning
- learns from labeled data
- e.g. decision trees, SVM

2) Unsupervised learning
- learns from unlabelled data
- e.g. K-Means clustering

3) Reinforcement learning
- learns to make decisions through rewards, via continuous trial and error
- e.g. Markov decision process
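
A minimal sketch of the contrast between the first two paradigms (illustrative scikit-learn code; the toy data points and labels are invented):

```python
# Supervised vs. unsupervised learning: the difference is whether labels are given.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # supervised: learns from labeled data
from sklearn.cluster import KMeans               # unsupervised: labels are never used

X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.2, 9.1]])
y = np.array([0, 0, 1, 1])                       # labels available -> supervised learning

clf = DecisionTreeClassifier().fit(X, y)         # learns the mapping from features to labels
print(clf.predict([[1.1, 2.1]]))                 # -> [0]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # only X, no y
print(km.labels_)                                # groupings discovered from the data alone
```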

4
Q

Types of supervised learning: classification, regression

A

TYPES OF SUPERVISED LEARNING

Classification
* A type of problem where the output variable is a category (categorical), such as "positive or negative" or "living things or non-living things".

Problems that involve predicting a categorical output:
● Image classification
○ Medical image classification
● Text classification
○ Sentiment analysis
○ Spam detection

Regression
* A type of problem where the output variable is a real value (quantitative), such as weight, budget, etc.

Problems that involve predicting a numerical output:
● Predicting stock prices
● Predicting a house's sale price based on its square footage
● Revenue forecasting
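
A minimal sketch of the two problem types (illustrative scikit-learn code; the tiny datasets, including the square-footage and price numbers, are invented):

```python
# Classification predicts a category; regression predicts a real value.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: categorical output ("positive" / "negative").
X_reviews = np.array([[0.9], [0.8], [0.1], [0.2]])      # e.g. a single sentiment-score feature
y_labels = np.array(["positive", "positive", "negative", "negative"])
clf = LogisticRegression().fit(X_reviews, y_labels)
print(clf.predict([[0.7]]))                             # -> ['positive']

# Regression: numerical output (a house's sale price from its square footage).
X_sqft = np.array([[800], [1200], [1500], [2000]])
y_price = np.array([150_000, 210_000, 260_000, 330_000])
reg = LinearRegression().fit(X_sqft, y_price)
print(reg.predict([[1700]]))                            # -> an estimated price (a real number)
```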

5
Q

Support Vector Machine (SVM)

A

A classifier derived from statistical learning theory by Vapnik et al. in 1992.

SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features in a handwriting recognition task.

Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.

Also used for regression (will not cover today)

6
Q

Outline
Linear Discriminant Function
Large Margin Linear Classifier
Nonlinear SVM: The Kernel Trick
Demo of SVM

A
7
Q

Linear SVMs:
Overview

A
  • The classifier is a separating hyperplane.
  • The most "important" training points are the support vectors; they define the hyperplane.
  • Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. those with non-zero Lagrangian multipliers α_i.
  • Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

    maximize Q(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)   subject to   Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C

    f(x) = Σ_i α_i y_i (x_i^T x) + b

    (a small scikit-learn sketch of these quantities follows below)
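
A minimal sketch of these quantities with scikit-learn's SVC (the toy points are invented; the attribute names are scikit-learn's own, not from the slides):

```python
# Fit a linear SVM and inspect the support vectors and the dual coefficients alpha_i * y_i.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0], [4, 4], [5, 5], [5, 3]])
y = np.array([-1, -1, -1, 1, 1, 1])

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_)       # the "important" training points that define the hyperplane
print(svm.dual_coef_)             # y_i * alpha_i for the support vectors (non-zero alphas only)
print(svm.coef_, svm.intercept_)  # w and b of the separating hyperplane
```
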
8
Q

Non-linear SVMs

A
  • Datasets that are linearly separable with some noise work out great.
  • But what are we going to do if the dataset is just too hard?
  • How about mapping the data to a higher-dimensional space?

Non-linear SVMs: Feature Spaces

  • General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable.
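
A minimal sketch of the idea (an invented 1-D example): points that no single threshold on x can separate become linearly separable after the mapping φ(x) = (x, x²):

```python
# Mapping 1-D data to a higher-dimensional space to make it linearly separable.
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])          # outer points vs. inner points

# In the mapped 2-D space (x, x^2), a horizontal line x^2 = const separates the classes.
phi = np.column_stack([x, x ** 2])
clf = SVC(kernel="linear").fit(phi, y)

x_new = np.array([-2.5, 0.5])
print(clf.predict(np.column_stack([x_new, x_new ** 2])))   # -> [ 1 -1]
```
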
9
Q

The “Kernel Trick”

A
  • The linear classifier relies on an inner product between vectors:

    K(x_i, x_j) = x_i^T x_j

  • If every datapoint is mapped into a high-dimensional space via some transformation φ: x ⟶ φ(x), the inner product becomes:

    K(x_i, x_j) = φ(x_i)^T φ(x_j)

  • A kernel function is a function that is equivalent to an inner product in some feature space. Example:
    for 2-dimensional vectors x = [x_1, x_2], let K(x_i, x_j) = (1 + x_i^T x_j)².

    Need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):

    K(x_i, x_j) = (1 + x_i^T x_j)²
                = 1 + x_{i1}² x_{j1}² + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}² x_{j2}² + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
                = [1, x_{i1}², √2 x_{i1} x_{i2}, x_{i2}², √2 x_{i1}, √2 x_{i2}]^T [1, x_{j1}², √2 x_{j1} x_{j2}, x_{j2}², √2 x_{j1}, √2 x_{j2}]
                = φ(x_i)^T φ(x_j), where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]

  • Thus, a kernel function implicitly maps data to a high-dimensional
    space (without the need to compute each φ(x) explicitly).
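
A minimal sketch that checks the identity above numerically (the two example points are arbitrary):

```python
# Verify that (1 + x_i.x_j)^2 equals the inner product of the explicit feature maps phi(x).
import numpy as np

def phi(x):
    # phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1.0, x1 ** 2, s * x1 * x2, x2 ** 2, s * x1, s * x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

kernel_value = (1.0 + xi @ xj) ** 2   # computed without ever building phi(x)
explicit_value = phi(xi) @ phi(xj)    # inner product in the 6-dimensional feature space
print(kernel_value, explicit_value)   # both equal 4.0 here
```
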
10
Q

SVM Applications

A
  • SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
  • SVMs are currently among the best performers for a number of classification
    tasks ranging from text to genomic data.
  • SVMs can be applied to complex data types beyond feature vectors (e.g. graphs,
    sequences, relational data) by designing kernel functions for such data.
  • SVM techniques have been extended to a number of tasks such as regression
    [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
  • The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi's at a time, e.g. SMO [Platt '99] and [Joachims '99].
  • Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner.
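
One way to organise that try-and-see tuning is a cross-validated grid search, sketched below (the parameter grid and the choice of the iris dataset are illustrative assumptions):

```python
# Search over candidate kernels and parameters with cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {
    "kernel": ["linear", "rbf", "poly"],   # candidate kernels
    "C": [0.1, 1, 10],                     # soft-margin penalty
    "gamma": ["scale", 0.1, 1.0],          # kernel width (ignored by the linear kernel)
}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```
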
11
Q

TYPES OF
UNSUPERVISED
LEARNING

A

Clustering
* A technique that involves grouping similar objects or data points together into clusters, based on certain features or characteristics.

Dimensionality Reduction
* A technique that reduces the number of features (variables) used to represent the data while preserving as much of its meaningful structure as possible.

12
Q

CLUSTERING
ALGORITHM

UNSUPERVISED LEARNING ALGORITHM

A

Clustering is a machine learning technique that
involves grouping similar objects or data points
together into clusters, based on certain features or
characteristics.
* Important in various industries, such as marketing,
healthcare, finance, and more.
* The main goals of clustering are to identify natural
groupings in data, minimize the distance or
dissimilarity between data points within the same
cluster, and maximize the distance or dissimilarity
between data points in different clusters.

13
Q

SIMILARITY AND DISSIMILARITY
MEASURES FOR CLUSTERING (Euclidean Distance, Manhattan Distance, Cosine Similarity)

A
  • Euclidean Distance: This is a common measure of distance
    between two points in a multidimensional space. It is
    calculated as the square root of the sum of the squared
    differences between corresponding coordinates of the two
    points.
  • Manhattan Distance: This is another measure of distance
    between two points, based on the absolute differences
    between corresponding coordinates of the two points. It is
    often used in image recognition or pattern matching.
  • Cosine Similarity: This is a measure of similarity between
    two vectors, based on the cosine of the angle between
    them. It is often used in natural language processing or
    recommendation systems.
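
A minimal sketch computing the three measures for two invented example vectors:

```python
# Euclidean distance, Manhattan distance and cosine similarity between two vectors.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # square root of the sum of squared differences
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences
cosine_sim = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine of the angle between a and b

print(euclidean, manhattan, cosine_sim)
```
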
14
Q

CLUSTERING ALGO: K-MEANS

A
  • K-Means is a widely used algorithm for clustering that aims to
    partition a dataset into K distinct clusters, where K is a pre-specified
    number.
  • Effective for well-separated clusters and can handle different types of
    data, such as numerical or categorical variables.
  • Requires the pre-specification of the number of clusters, which may
    not be known in advance or may be difficult to determine.
15
Q

Steps of K-Means Algorithm:

A
  1. Initialization: Select K random data points from the dataset as the initial centroids for the K clusters.
  2. Assignment: Assign each data point to the cluster whose centroid is closest to it, based on a distance measure such as Euclidean distance or Manhattan distance.
  3. Update: Recalculate the centroid of each cluster as the mean of all the data points assigned to it in the previous step.
  4. Convergence: Repeat steps 2 and 3 until the centroids converge, that is, until the change in the centroids is below a certain threshold or the maximum number of iterations is reached.
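
A minimal end-to-end sketch of these four steps in NumPy (an illustrative, unoptimised implementation; the toy data is invented):

```python
import numpy as np

def k_means(X, k, max_iters=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assignment: each point joins the cluster with the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its assigned points
        #    (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Convergence: stop when the centroids barely move (or max_iters is reached).
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1.5, 2], [8, 8], [8, 9], [0.5, 1.5], [9, 8.5]])
labels, centroids = k_means(X, k=2)
print(labels, centroids)
```
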
16
Q

K-MEANS ALGORITHM - INITIALIZATION

A
  • The initialization step is the first step in the K-Means clustering algorithm, and it involves
    selecting the initial centroids for the K clusters.
  • One common method for initialization is random selection, where K data points are randomly
    selected from the dataset to serve as the initial centroids.
  • Example:
  • As a starting point, you tell your model how many clusters it should make. First, the model picks K (let K = 3) data points from the dataset. These data points are called cluster centroids.
  • Now there are different ways to initialize the centroids: you can either choose them at random, or sort the dataset, split it into K portions, and pick one data point from each portion as a centroid.
17
Q

K-MEANS ALGORITHM –
ASSIGNMENT STEP

A
  • Involves assigning each data point to the nearest centroid
    based on its distance.
  • The distance metric used to measure the distance between
    data points and centroids can affect the resulting clusters
  • The most common distance metric used in K-Means clustering is the Euclidean distance:

    D(x_i, x_{i'}) = ‖x_i − x_{i'}‖ = √( Σ_{j=1}^{p} (x_{ij} − x_{i'j})² )

  • Clusters defined by Euclidean distance are invariant to translations and rotations in feature space, but not invariant to scaling of the features.
18
Q

K-MEANS ALGORITHM
– UPDATE STEP

A

The update step is the third step in the K-Means
clustering algorithm, and it involves updating the
centroids of each cluster based on the mean of the data
points assigned to that cluster.
* The most common method used for updating centroids
in K-Means clustering is the simple mean, which
calculates the mean of all the data points assigned to
each cluster.
* Because the initial centroids were chosen arbitrarily, your model then updates them with the new cluster values.
* If some other algorithm, such as K-Modes or K-Medians, were used, the mode or the median would be taken instead of the mean, respectively.

19
Q

K-MEANS ALGORITHM – CONVERGENCE
CRITERIA

A
  • The convergence criterion is the final step in the K-Means clustering algorithm, and it determines when the algorithm should stop iterating and the clusters are considered stable.
  • Common convergence criteria:
  • Maximum number of iterations: the algorithm stops after a predetermined number of iterations, regardless of whether the clusters have stabilized or not.
  • Minimum change in centroids: the algorithm stops iterating when the centroids of all clusters change by less than a certain threshold value between consecutive iterations.
  • Minimum change in the sum of squared errors (SSE) or in the silhouette score.
20
Q

CLUSTERING ALGO – HIERARCHICAL
CLUSTERING

A
  • A type of clustering algorithm that creates a tree-like structure of clusters by recursively dividing or merging clusters based on the similarity between data points.
  • Two main types:
  • Agglomerative
  • Divisive
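
A minimal agglomerative sketch using SciPy (the points, the average-linkage choice and the distance threshold are illustrative assumptions):

```python
# Agglomerative (bottom-up) hierarchical clustering: repeatedly merge the closest clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 9]])

Z = linkage(X, method="average")                 # the tree of merges (dendrogram data)
labels = fcluster(Z, t=2, criterion="distance")  # cut the tree at distance threshold 2
print(labels)                                    # flat cluster label for each point
```
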
21
Q

CLUSTERING ALGO:
SPECTRAL CLUSTERING

A
  • Represent datapoints as the
    vertices V of a graph G.
  • All pairs of vertices are connected by an edge E.
  • Edges have weights W.
  • Large weights mean that the adjacent vertices are very similar; small weights imply dissimilarity.
22
Q

GRAPH PARTITIONING

A

Clustering on a graph is equivalent to partitioning the vertices of the graph.
* A loss function for a partition of V into sets A and B:

    cut(A, B) = Σ_{u ∈ A, v ∈ B} W_{u,v}

  • In a good partition, vertices in different partitions will be dissimilar.
  • Mincut criterion: find the partition (A, B) that minimizes cut(A, B).
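
A minimal sketch that evaluates cut(A, B) for a small, invented symmetric weight matrix W:

```python
# cut(A, B): sum of the edge weights crossing the partition.
import numpy as np

W = np.array([
    [0.0, 0.9, 0.8, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.1, 0.0],
    [0.8, 0.7, 0.0, 0.2, 0.0, 0.1],
    [0.1, 0.0, 0.2, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.0, 0.9, 0.0, 0.7],
    [0.0, 0.0, 0.1, 0.8, 0.7, 0.0],
])

def cut(W, A, B):
    # Sum W[u, v] over all u in A and v in B (edges crossing the partition).
    return sum(W[u, v] for u in A for v in B)

A, B = [0, 1, 2], [3, 4, 5]
print(cut(W, A, B))   # a small value: the two groups are mostly dissimilar
```
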
23
Q

SPECTRAL CLUSTERING

A
  • In spectral clustering, the data points are treated as nodes of a graph.
  • Spectral clustering involves 3 steps:
    1. Compute a similarity graph
    2. Project the data onto a low-dimensional space
    3. Create clusters
24
Q

SPECTRAL CLUSTERING – STEP 1

A
  • We first create an undirected graph G = (V, E) with vertex set V = {v1, v2, …, vn}, one vertex per observation in the data.
  • This can be represented by an adjacency matrix which has the similarity between each pair of vertices as its elements.
  • To build this graph, we can opt for any of these options:
  • The ε-neighborhood graph
  • KNN graph
  • Fully connected graph
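
A minimal sketch of this step using the fully connected option with a Gaussian similarity (the data and σ are assumptions; the ε-neighborhood and KNN graphs are the alternatives listed above):

```python
# Build a similarity (adjacency) matrix W: W[i, j] = exp(-||xi - xj||^2 / (2 sigma^2)).
import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.9], [1.1, 1.1], [5.0, 5.0], [5.2, 5.1], [4.9, 5.2]])
sigma = 1.0

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # pairwise squared distances
W = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)   # no self-loops
print(np.round(W, 2))      # large entries for similar points, near-zero otherwise
```
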
25
Q

SPECTRAL CLUSTERING – STEP 2

A
  • Step 2: Project the data onto a low-dimensional space.
  • Our goal is to transform the space so that when two points are close, they are always in the same cluster, and when they are far apart, they are in different clusters.
  • Compute the Graph Laplacian, which is just another matrix representation of a graph and can be useful in finding interesting properties of a graph.
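
A minimal sketch of this step: the unnormalised graph Laplacian L = D − W and its eigendecomposition (re-using a small invented similarity matrix W):

```python
# Graph Laplacian L = D - W, where D is the diagonal degree matrix.
import numpy as np

W = np.array([
    [0.0, 0.9, 0.8, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.1, 0.0],
    [0.8, 0.7, 0.0, 0.2, 0.0, 0.1],
    [0.1, 0.0, 0.2, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.0, 0.9, 0.0, 0.7],
    [0.0, 0.0, 0.1, 0.8, 0.7, 0.0],
])

D = np.diag(W.sum(axis=1))            # degree matrix
L = D - W                             # unnormalised graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order (L is symmetric)
print(np.round(eigvals, 3))
print(np.round(eigvecs[:, 1], 3))     # eigenvector of the 2nd smallest eigenvalue (used in Step 3)
```
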
26
Q

SPECTRAL CLUSTERING
– STEP 3

A
  • Step 3 — Create clusters
  • For this step, we use the eigenvector corresponding to the 2nd smallest eigenvalue (the Fiedler vector) to assign values to each node.
  • On calculating, the 2nd smallest eigenvalue is 0.189 and the corresponding eigenvector is v2 = [0.41, 0.44, 0.37, -0.4, -0.45, -0.37].
  • To get bipartite clustering (2 distinct clusters), we first assign
    each element of v2 to the nodes such that {node1:0.41 ,
    node2:0.44 , … node6: -0.37}.
  • Then split the nodes such that all nodes with value > 0 are in
    one cluster, and all other nodes are in the other cluster.
  • Thus, in this case, we get nodes 1, 2 & 3 in one cluster, and 4,
    5 & 6 in the 2nd cluster.
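
A minimal sketch of the sign-based split, plugging in the v2 values quoted above:

```python
# Bipartition the nodes by the sign of their entries in v2 (the Fiedler vector).
import numpy as np

v2 = np.array([0.41, 0.44, 0.37, -0.40, -0.45, -0.37])   # one value per node 1..6

cluster_a = [i + 1 for i, val in enumerate(v2) if val > 0]
cluster_b = [i + 1 for i, val in enumerate(v2) if val <= 0]
print(cluster_a, cluster_b)   # -> [1, 2, 3] [4, 5, 6], matching the result above
```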