5. Genomic Data Analysis Flashcards
how can cell types be identified
- physical appearance
- presence/absence of surface proteins
- isolation of cells and profiling of individual characteristics using sequencing technologies
how is single-cell RNA sequencing data analysed
quality control (outlier removal) –> normalisation –> feature selection –> dimensionality reduction –> cell-cell distance –> unsupervised clustering
what is the purpose of quality control in single-cell rna sequencing
find and remove unreliable cells, e.g. low-quality cells and possible doublets
what is the purpose of feature selection & dim reduction in single-cell rna sequencing
find the most informative genes, separate the strongest signals from background noise & decrease processing time
what is the purpose of cell-cell distance in single-cell rna sequencing
to quantify how similar cells are to each other, which clustering algos use to group them
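The first three pipeline steps above (quality control, normalisation, feature selection) can be sketched on a toy count matrix. This is only an illustration using the standard library: the function name `preprocess`, the total-count threshold, the scaling target of 10,000 and the variance-based gene ranking are all assumptions, not something specified in these notes.

```python
from math import log1p
from statistics import pvariance


def preprocess(counts, min_total=500, target_sum=10_000, n_top_genes=2):
    """Filter, normalise and select genes from a cells x genes count matrix."""
    # Quality control: drop cells with too few total counts (assumed threshold).
    kept = [cell for cell in counts if sum(cell) >= min_total]
    if not kept:
        raise ValueError("no cells passed quality control")

    # Normalisation: scale each cell to the same total, then log-transform.
    normalised = [
        [log1p(count * target_sum / sum(cell)) for count in cell] for cell in kept
    ]

    # Feature selection: keep the genes whose normalised values vary most.
    n_genes = len(kept[0])
    variances = [pvariance([cell[g] for cell in normalised]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    top_genes = sorted(ranked[:n_top_genes])
    return [[cell[g] for g in top_genes] for cell in normalised], top_genes


# Toy data: 3 cells x 4 genes; the last cell has almost no counts.
raw_counts = [
    [400, 300, 300, 0],
    [100, 800, 100, 0],
    [1, 0, 0, 0],
]
matrix, top_genes = preprocess(raw_counts)
```

Here the near-empty third cell is removed by quality control and the all-zero gene is dropped by feature selection, which is the kind of noise these steps are meant to strip out before clustering.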
what is the process of clustering
feature selection/dim reduction (optional) –> clustering algo design/selection –> cluster validation –> results interpretation
how are clusters validated
adjusted rand index - measures the similarity between two clusterings of the same data (e.g. predicted clusters v known labels), corrected for chance
visual inspection
cluster magnitude/cardinality
downstream performance
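As a rough sketch of the adjusted rand index mentioned above, the version below counts pairs of items that are grouped together in both labelings and corrects for the agreement expected by chance. The function name `adjusted_rand_index` is a hypothetical helper written for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # nothing to disagree about

    # Number of item pairs placed together in both, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one big cluster
    return (together_both - expected) / (maximum - expected)


same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Cluster names do not matter, only which items are grouped together, so the first call scores a perfect match even though the labels are swapped.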
how can we improve clustering
- try scaling data
- try another similarity measure
- check assumptions of algo match data distribution
how can we check the similarity measure
manually select known distant examples and known similar examples, and check that the distance metric gives large distances for the former and small distances for the latter
how can we check that we have the optimum number of clusters
plot loss v number of clusters and look for the elbow, where adding clusters stops reducing the loss much
trial & error
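A minimal sketch of the loss v number of clusters idea, assuming the loss is the k-means sum of squared distances to the nearest centroid. To keep the numbers reproducible it uses a deterministic farthest-first initialisation rather than random starts, which is an assumption made for this example; `kmeans_loss` and the toy data are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_loss(points, k, n_iter=20):
    """Run a basic k-means and return the sum of squared distances to centroids."""
    # Assumption: farthest-first initialisation (deterministic) instead of random.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(farthest)

    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its points (keep it if it has none).
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Three well-separated toy blobs of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 10), (5, 11), (6, 10), (6, 11),
]
losses = {k: kmeans_loss(points, k) for k in range(1, 7)}
for k, loss in losses.items():
    print(k, round(loss, 2))
```

On this data the loss falls steeply up to k = 3 and barely changes afterwards, so the elbow sits at the true number of blobs.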
what are some clustering algo categories
- centroid based (kmeans)
- connectivity based
- density based (DBSCAN)
- hierarchical
- distribution based (GMM)
what is centroid based
- fast and efficient
- partition data points around multiple centroids, assigning each point to the centroid with the smallest squared distance
what is the kmeans algo
- pick k, number of clusters
- pick centroids (random points in space)
- assign each data point to its nearest centroid
- update centroids by mean loc of their points
- stop when centroids don’t move much else repeat
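The steps above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the function names are hypothetical, and the starting centroids are drawn from the data points themselves (a common variant of "random points in space", assumed here so that no centroid starts far from all the data).

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Basic k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Pick centroids. Assumption: sample k distinct data points as the start.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign each data point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update each centroid to the mean location of its points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Stop when the centroids don't move much, else repeat.
        shift = max(squared_distance(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


blobs = [(0.0, 0.2), (0.3, 0.0), (0.1, 0.4), (8.0, 8.1), (8.3, 7.9), (7.9, 8.4)]
centroids, labels = kmeans(blobs, k=2)
```

Different seeds can give different results on harder data, which is the local-optimum problem; rerunning with several seeds and keeping the lowest-loss run is a common workaround.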
what are the clustering distances
euclidean, manhattan, minkowski, hamming
what is euclidean
straight line, numeric data only
what is manhattan
distance along each axis/feature e.g. walking around city blocks, numeric data only
what is minkowski
generalised form of manhattan (order 1) & euclidean (order 2)
higher orders give more weight to the largest per-feature differences
only for numeric data
what is hamming
count the number of features in which two examples differ
accommodates categorical
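The four distances above fit in a few lines, which also shows why manhattan and euclidean are special cases of minkowski. The function names are chosen for illustration only.

```python
def minkowski(p, q, order):
    """Minkowski distance; order=1 is manhattan, order=2 is euclidean."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1 / order)


def manhattan(p, q):
    return minkowski(p, q, order=1)


def euclidean(p, q):
    return minkowski(p, q, order=2)


def hamming(p, q):
    """Number of positions where the two examples differ (works on categories)."""
    return sum(a != b for a, b in zip(p, q))


a, b = (0, 0), (3, 4)
print(manhattan(a, b))   # walk 3 along one axis, then 4 along the other
print(euclidean(a, b))   # straight line between the points
print(minkowski(a, b, order=4))
print(hamming(("A", "C", "G", "T"), ("A", "G", "G", "A")))
```

As the order grows, the minkowski distance approaches the largest single-feature difference (4 in this example).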
what are the pros & cons of k-means
- pro: simple
- pro: quick
- con: can get stuck in a local optimum
- con: assumes roughly spherical clusters
- con: can be tripped up by outliers
- con: numerical data only
what are the two approaches to hierarchical clustering and how do they work
agglomerative - bottom up, start with many clusters & merge together in a tree based hierarchy
divisive - start from the top and break into smaller clusters, same tree based hierarchy
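A small sketch of the agglomerative (bottom-up) approach. It assumes euclidean distance and single linkage (distance between clusters = distance between their closest members), which is only one of several possible linkage choices; `agglomerative` is a hypothetical name and the brute-force search is meant for tiny inputs.

```python
from math import dist


def agglomerative(points, n_clusters, linkage=min):
    """Merge the two closest clusters until n_clusters remain.

    Returns (clusters, merges); clusters hold indices into points and merges
    records the tree as (cluster_a, cluster_b, distance) in merge order.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges


points = [(0,), (1,), (10,), (11,), (50,)]
clusters, merges = agglomerative(points, n_clusters=2)
```

Passing `linkage=max` switches to complete linkage. Cutting the merge list at different heights gives different numbers of clusters from the same tree.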
what are the pros and cons of hierarchical clustering
- pro: no assumptions on cluster number
- pro: can correspond to meaningful taxonomies
- con: once a decision is made to combine two clusters, it can’t be reversed
- con: slow on large datasets
what is DBSCAN
density based spatial clustering of applications with noise
the radius around core points (which play a role similar to centroids) is user defined, along with the minimum number of neighbours that makes a point a core point
what is the DBSCAN algorithm
- pick the radius and the minimum number of neighbours that will define core points
- a core point is any point with at least that many points within its radius
- pick a random unassigned core point to start a cluster
- continue to iterate across the data points, adding core points that fall within the radius of core points already in the cluster
- then add non-core points which are within the radius of a core point in the cluster. don’t use non-core points to further extend the cluster
- first cluster is finished. repeat for other clusters; points left in no cluster are noise
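A compact sketch of the algorithm above, assuming euclidean distance and counting a point as its own neighbour. The names `dbscan`, `eps` (the radius) and `min_pts` (neighbours needed for a core point) are chosen for this example, and points are visited in list order rather than at random.

```python
from collections import deque
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""

    def neighbours(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a non-core member
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # non-core point near a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # only core points extend the cluster
                queue.extend(reachable)
    return labels


points = [(0.0,), (0.5,), (1.0,), (10.0,), (10.5,), (11.0,), (50.0,)]
labels = dbscan(points, eps=0.6, min_pts=3)
```

Unlike k-means, the number of clusters is not passed in; it falls out of the radius and neighbour count, and isolated points are reported as noise.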
what is the silhouette coefficient
an alternative to the elbow method
pick a range of values for K (e.g. 2-10; it is undefined for K = 1)
for each point calculate the silhouette coefficient
a(i) = mean distance from point i to every other point in its own cluster
b(i) = mean distance from point i to the points in the nearest other cluster
goal is a(i) < b(i), i.e. the point is closer to its own cluster than to any other
s(i) = (b(i) - a(i)) / max(a(i), b(i))
worst silhouette coefficient = -1, best = 1
every cluster will have its own silhouette plot, and the silhouette coefficient is averaged over each cluster (and over all points to compare values of K)
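A short sketch of the calculation, assuming euclidean distance, with a(i) as the mean distance to the other points in the same cluster and b(i) as the mean distance to the nearest other cluster. `silhouette_scores` is a hypothetical helper; scoring points in single-member clusters as 0 is a convention assumed here.

```python
from math import dist


def silhouette_scores(points, labels):
    """Silhouette coefficient s(i) for every point, given cluster labels."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-member clusters
            continue
        # a(i): mean distance to the other members (distance to itself is 0).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


points = [(0,), (1,), (10,), (11,)]
good = silhouette_scores(points, [0, 0, 1, 1])
bad = silhouette_scores(points, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

The sensible grouping scores close to 1 on average while the crossed grouping goes negative, which is how the average can be used to compare values of K.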