5. Genomic Data Analysis Flashcards
how can cell types be identified
- physical appearance
- presence/absence of surface proteins
- isolation of cells and profiling of individual characteristics using sequencing technologies
how is single-cell RNA sequencing data analysed
quality control (outlier removal) –> normalisation –> feature selection –> dim. reduction –> cell-cell distance –> unsupervised clustering
what is the purpose of quality control in single-cell rna sequencing
find and remove unreliable cells (outliers) and possible doublets
what is the purpose of feature selection & dim reduction in single-cell rna sequencing
find the most informative genes, separate the strongest signals from background noise, and decrease processing time
what is the purpose of cell-cell distance in single-cell rna sequencing
to quantify how similar cells are to each other, which clustering algos use to group them
what is the process of clustering
feature selection/dim reduction (optional) –> clustering algo design/selection –> cluster validation –> results interpretation
how are clusters validated
adjusted rand index - measures the similarity between two clusterings (e.g. predicted clusters vs known labels), corrected for chance
visual inspection
cluster magnitude/cardinality
downstream performance
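a minimal sketch of the adjusted rand index, standard library only. the function name and the toy labels are illustrative assumptions; in practice a library implementation would normally be used instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treated as trivially identical

    # Count pairs of items that share a cluster in both, in A only, in B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = in_a * in_b / total_pairs  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. everything in one cluster
    return (both - expected) / (maximum - expected)


# Cluster names do not matter, only which items are grouped together.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

a score of 1 means identical groupings, values near 0 mean agreement no better than chance, and negative values mean worse than chance.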
how can we improve clustering
- try scaling data
- try another similarity measure
- check assumptions of algo match data distribution
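one common way to scale data is z-score standardisation (mean 0, std 1 per feature). a rough sketch, assuming rows are cells and columns are features; `standardise` and the toy values are made up for illustration.

```python
from statistics import mean, pstdev


def standardise(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # Constant columns (std 0) carry no signal and are mapped to 0.0.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# The second feature has a much larger range and would dominate distances.
raw = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
for row in standardise(raw):
    print(row)
```

after scaling, both features contribute equally to a distance such as euclidean.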
how can we check the similarity measure
manually select known distant examples and known similar examples, then check that the distance metric places the similar ones closer together than the distant ones
how can we check that we have the optimum number of clusters
plot loss vs number of clusters and look for the elbow, where adding clusters stops reducing the loss much
trial & error
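a small sketch of the loss-vs-k idea on 1-D toy data, taking loss to be the within-cluster sum of squared distances. the helper name, the quantile initialisation, and the data are assumptions for illustration only.

```python
def kmeans_1d_loss(values, k, max_iter=100):
    """Run a simple 1-D k-means and return the within-cluster sum of squares."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: centroids spread evenly through the sorted data.
    centroids = [data[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            clusters[nearest].append(x)
        # Empty clusters keep their previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum((x - centroids[j]) ** 2 for j, c in enumerate(clusters) for x in c)


# Three obvious groups, so the loss should flatten out after k = 3.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
for k in range(1, 6):
    print(k, round(kmeans_1d_loss(data, k), 3))
```

plotting these values against k would show the bend at the point where extra clusters stop helping.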
what are some clustering algo categories
- centroid based (kmeans)
- connectivity based
- density based (DBSCAN)
- hierarchical
- distribution based (GMM)
what is centroid based
- fast and efficient
- partition data points around multiple centroids, assigning each point by its squared distance to them
what is the kmeans algo
- pick k, number of clusters
- pick centroids (random points in space)
- assign each data point to its nearest centroid
- update centroids by mean loc of their points
- stop when centroids don’t move much, else repeat from the assignment step
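the steps above can be sketched in plain python. this is a minimal illustration, not a tuned implementation: it assumes squared euclidean distance, and it picks the starting centroids from the data points rather than from arbitrary points in space. `kmeans`, `sq_dist`, `tol` and `seed` are illustrative names.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    # Pick k starting centroids (assumption: sampled from the data itself).
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Assign each data point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update each centroid to the mean location of its points.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                updated.append(centroids[j])  # empty cluster keeps its centroid
        # Stop when the centroids no longer move much, else repeat.
        shift = max(sq_dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

results depend on the starting centroids, so running with several seeds and keeping the lowest-loss result is a common precaution.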
what are the clustering distances
euclidean, manhattan, minkowski, hamming
what is euclidean
straight-line distance between two points, numeric data only
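a short sketch of the four clustering distances listed earlier, using their standard definitions. function names and example inputs are illustrative. minkowski with p=1 gives manhattan and with p=2 gives euclidean; hamming counts mismatching positions, so it also works on categorical data.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def euclidean(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return minkowski(a, b, 2)


def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return minkowski(a, b, 1)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(hamming("karolin", "kathrin"))
```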