Algos Problem 7: Machine Learning - Clustering Algorithms Flashcards
Diauxic shift ?
In yeast:
When glucose runs out, yeast inverts its metabolism, and the ethanol it previously produced becomes its new food source.
The diauxic shift can only occur in the presence of oxygen.
How do Whole-Genome Duplications Enable Evolutionary Breakthroughs?
Ohno argued that a WGD would provide a platform for a revolutionary innovation such as the diauxic shift, since every duplicated gene would have two copies.
- One copy would be free to evolve without compromising the gene’s existing function.
- The other copy would continue to perform the existing function.
Gene Expression Matrix ?
Matrix of gene expression vectors:
entry e_(i,j) = expression level of gene i at checkpoint j, so that row i is the expression vector of gene i.
Expression values - logarithm?
The logarithm of each expression value is often taken.
After this transformation, positive values of a gene’s expression vector correspond to increased expression, and negative values correspond to decreased expression
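A minimal sketch of this transformation, assuming NumPy; the matrix values are invented for illustration:

import numpy as np

# Raw expression ratios for 3 genes at 4 checkpoints (invented values):
# 1.0 means "no change", >1 increased, <1 decreased expression.
ratios = np.array([
    [1.0, 2.0, 4.0, 8.0],      # steadily increasing expression
    [1.0, 0.5, 0.25, 0.125],   # steadily decreasing expression
    [1.0, 1.0, 1.0, 1.0],      # flat expression
])

# After log2, positive entries mean increased expression,
# negative entries mean decreased expression, 0 means unchanged.
expression_matrix = np.log2(ratios)
print(expression_matrix)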
Lloyd Algorithm
Select k arbitrary data points as initial Centers and then iteratively perform the following two steps:
- Centers to Clusters: assign each data point to the cluster corresponding to its nearest center (ties are broken arbitrarily).
- Clusters to Centers: after the assignment of data points to k clusters, compute each new center as its cluster's center of gravity.
The Lloyd algorithm terminates when the centers stop moving (convergence).
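A minimal NumPy sketch of the two steps; the function and variable names are my own, not from the card:

import numpy as np

def lloyd_kmeans(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select k arbitrary data points as initial centers.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Centers to Clusters: assign each point to its nearest center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Clusters to Centers: each new center is its cluster's center of
        # gravity (an empty cluster keeps its old center).
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Terminate when the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

points = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
centers, labels = lloyd_kmeans(points, k=2)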
What are the two steps of the iterative Lloyd algorithm for k-means clustering? Why should one choose multiple initializations? (3 points)
(1) Compute the distances from the centroids/centers to the data points and assign each point to its nearest centroid.
(2) Compute new centroids.
Multiple starting points are needed since the initialization is randomized and Lloyd's algorithm can get stuck in a local optimum of the k-means objective function.
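A sketch of guarding against bad initializations with scikit-learn (assumed installed): n_init reruns Lloyd's algorithm from several random starts and keeps the run with the lowest within-cluster sum of squares.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.], [5., 5.]])

# 10 random restarts; the best run (lowest inertia) is returned,
# reducing the risk of ending in a poor local optimum.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_, km.inertia_)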
Clustering - definition?
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
k-means clustering?
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster.
This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), as written out below.
→ Lloyd algorithm
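Written out, the objective that k-means minimizes (a standard formulation; S_j denotes cluster j and μ_j its centroid):

\underset{S_1,\dots,S_k}{\arg\min} \sum_{j=1}^{k} \sum_{x \in S_j} \lVert x - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{|S_j|} \sum_{x \in S_j} x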
Explain hard vs soft clustering
In hard clustering, each data point can belong to only one cluster.
In soft clustering, the output is instead a probability (likelihood) of each data point belonging to each of the pre-defined number of clusters.
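A sketch contrasting the two outputs with scikit-learn (assumed installed): KMeans yields hard labels, while a Gaussian mixture yields soft membership probabilities.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.], [5., 5.5]])

# Hard clustering: each point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point gets a probability for every cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)   # shape: (n_points, n_clusters)

print(hard_labels)
print(soft_memberships.round(3))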
Flipping coins -
formula for probability of generating a given sequence of flips is ?
$\Pr(\text{sequence} \mid \theta) = \theta^{i} (1-\theta)^{n-i}$
θ = parameter, the unknown bias (probability that the coin lands on heads)
i = number of heads
n = number of flips
This expression is maximized at θ = i/n (the most likely bias).
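The missing step, sketched: maximize the log-likelihood by setting its derivative to zero:

\log \Pr(\text{sequence} \mid \theta) = i \log\theta + (n-i)\log(1-\theta)

\frac{d}{d\theta}\,\log \Pr(\text{sequence} \mid \theta) = \frac{i}{\theta} - \frac{n-i}{1-\theta} = 0 \quad\Longrightarrow\quad \theta = \frac{i}{n}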
From Coin Flipping to k-means Clustering:
Where are Data, HiddenVector, and Parameters?
Data: the data points, Data = (Data_1, …, Data_n)
Parameters: the centers, Centers = (Center_1, …, Center_k)
HiddenVector: assignments of data points to the k centers
(an n-dimensional vector with coordinates varying from 1 to k)
Hierarchical Clustering - definition
Hierarchical clustering is an unsupervised learning technique that groups similar objects into clusters by iteratively merging (or splitting) clusters based on a similarity measure.
Starting with each data point as a separate cluster, it repeatedly merges the most similar clusters, producing a tree-like structure (a dendrogram) that shows the relationships between clusters and their hierarchy.
The dendrogram reveals the clusters present at different levels of the hierarchy, highlighting natural groupings in the data. This visual representation of the relationships between clusters helps identify patterns and outliers, making hierarchical clustering a useful tool for exploratory data analysis.
Hierarchical clustering - step by step
Hierarchical clustering starts from a transformation of the n x m expression matrix into an n x n similarity matrix or distance matrix.
(1) Form a node (single-element cluster) for every gene.
(2) Identify the two closest clusters and merge them.
(3) Recompute the distance between the new cluster and all others with a given distance metric, e.g. as the average distance between elements of the two clusters.
(4) Iterate steps (2)-(3) until all elements form a single cluster (the root).
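A minimal sketch with SciPy (assumed installed; the expression values are invented). The method argument selects the inter-cluster distance: 'average', 'median', or 'complete' (maximum), matching the next card.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy n x m expression matrix: 4 genes, 3 checkpoints (invented values).
expression = np.array([
    [ 1.0,  2.0,  3.0],
    [ 1.1,  2.1,  2.9],
    [-1.0, -2.0, -3.0],
    [-0.9, -2.2, -2.8],
])

# linkage() computes pairwise distances internally and then iteratively
# merges the two closest clusters until a single root remains.
Z = linkage(expression, method='average', metric='euclidean')

# Cut the resulting tree into 2 flat clusters;
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)   # e.g. [1 1 2 2]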
Hierarchical Clustering - Different Distance Functions?
The distance between two clusters can be defined as the average, median, or maximum distance between their elements.
Expectation Maximization?
An iterative algorithm for estimating Parameters in the presence of a HiddenVector. It alternates two steps until convergence:
- E-step: given the current Parameters, estimate the HiddenVector (in soft versions, as a probability/responsibility for each possible assignment).
- M-step: given the estimated HiddenVector, re-estimate the Parameters.
Soft k-means clustering applies this scheme: responsibilities of centers for data points (E-step), then re-estimation of centers as weighted averages (M-step).
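A sketch of EM in the coin-flipping setting of the earlier cards: two coins with unknown biases, and which coin produced each sequence is the HiddenVector. The function name, initial guesses, and toy counts are my own.

import numpy as np

def coin_em(heads, n, theta=(0.6, 0.5), iters=20):
    """heads[s] = number of heads in sequence s of n flips;
    theta holds the current bias estimates for the two coins."""
    heads = np.asarray(heads, dtype=float)
    theta = np.asarray(theta, dtype=float)
    for _ in range(iters):
        # E-step: responsibility of each coin for each sequence,
        # proportional to its likelihood theta^i * (1-theta)^(n-i).
        lik = theta ** heads[:, None] * (1 - theta) ** (n - heads[:, None])
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate each bias as the responsibility-weighted
        # fraction of heads (the weighted version of theta = i/n).
        theta = (resp * heads[:, None]).sum(axis=0) / (resp.sum(axis=0) * n)
    return theta

# Five sequences of 10 flips each (invented counts).
print(coin_em(heads=[5, 9, 8, 4, 7], n=10))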