Week 7: Missing Data and Clustering Flashcards
REVERSED
One for which the within-cluster variation (W(Ck)) is as small as possible
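A minimal base-R sketch on toy data (names are illustrative) showing that W(Ck) is the sum of squared Euclidean distances to the cluster centroid, matching what kmeans() reports:

```r
set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)        # 100 toy observations in 2D
fit <- kmeans(x, centers = 3, nstart = 20)

# W(Ck): sum of squared distances from each point to its cluster centroid
W <- function(points) sum(scale(points, scale = FALSE)^2)
w_per_cluster <- sapply(1:3, function(k) W(x[fit$cluster == k, , drop = FALSE]))

all.equal(as.numeric(w_per_cluster), fit$withinss)   # TRUE
all.equal(sum(w_per_cluster), fit$tot.withinss)      # TRUE
```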
What is a “good” k-means clustering?
REVERSED
imp = mice(data, seed=1, m=20, print=FALSE) #imputes missing data 20 times
fit = with(imp, lm(formula)) #fits the specified model to each imputed dataset, giving m different model fits
summary(pool(fit)) #pools the 20 sets of estimated parameters
How do you perform multiple imputation and fit a model in R?
REVERSED
O(n^3)
What is the time complexity of hierarchical clustering?
REVERSED
Applies a summarise() transformation to every (non-grouping) variable in the data frame
What does summarise_all() do?
REVERSED
- Scaling to standard deviation 1 gives equal importance to each variable in the clustering
- Useful when variables are measured on different scales
When and why should variables be scaled before computing dissimilarity?
REVERSED
library(mice)
imp = mice(data, method = "mean", m=1, maxit=1)
#another way (tidyr): replace_na(data, list(variable = mean(data$variable, na.rm = TRUE)))
How do you perform mean imputation in R? (2 ways)
REVERSED
- Only the mean is unbiased, and only under NDD
- The standard error is too small
- It disturbs the relations between variables
Under what conditions is mean imputation unbiased? What happens to the standard error?
REVERSED
- Use only the unsupervised part: the data and the clustering, to quantify how successful the clustering is
- Popular measures: average silhouette width (ASW), which compares how close each point is to its own cluster versus the nearest other cluster, and the gap statistic
What are internal validation indices and what are some popular methods?
REVERSED
LOCF is always biased
Standard error is too small
Under what conditions is LOCF imputation unbiased? What happens to the standard error?
REVERSED
data %>% group_by(variable) %>% summarise_all(function(x) sum(is.na(x)))
What is the code for getting the number of NAs in each variable, grouped by another variable?
REVERSED
Mean, regression weight and correlation are unbiased only under NDD.
Under what conditions is pairwise deletion unbiased?
REVERSED
Replace missing data by the mean, or by the mode for categorical data
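A one-line base-R sketch on a toy vector (assumed data):

```r
x <- c(2, 4, NA, 6, NA, 8)            # toy vector with missing values
x[is.na(x)] <- mean(x, na.rm = TRUE)  # replace NAs by the observed mean
x                                     # 2 4 5 6 5 8
```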
What is mean imputation?
REVERSED
Mean, regression weights and correlation are unbiased under SDD
Standard error is too small
Under what conditions is stochastic regression imputation unbiased? What happens to the standard error?
REVERSED
A refinement of regression imputation that attempts to address the correlation bias by adding noise to the predictions. The method first estimates the intercept, slope and residual variance under the linear model, then calculates the predicted value for each missing value and adds a random draw from the residual distribution to the prediction
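A base-R sketch of the idea on simulated data (variable names are illustrative, not from the course):

```r
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)          # simulated linear relationship
y[sample(50, 10)] <- NA               # make 10 values missing

obs   <- !is.na(y)
fit   <- lm(y ~ x, subset = obs)      # estimate intercept and slope from observed cases
sigma <- summary(fit)$sigma           # estimated residual standard deviation
pred  <- predict(fit, newdata = data.frame(x = x[!obs]))
y[!obs] <- pred + rnorm(sum(!obs), mean = 0, sd = sigma)  # prediction + random residual draw
```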
What is stochastic regression imputation?
REVERSED
It is not suitable for clustering non-spherical groups of objects
What is the disadvantage of a k-medoids clustering?
REVERSED
options(na.action = na.omit)
How do you change the settings to always omit NAs?
REVERSED
- Retains the full dataset and allows for systematic differences between the observed and unobserved data by the inclusion of the response indicator
- Can be useful to estimate the treatment effect in randomised trials when a baseline covariate is partially observed
What are the advantages of the indicator method? (2)
REVERSED
- Complete: Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
- Single: Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.
- Average: Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
- Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B.
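The four linkages map directly onto hclust()'s method argument; a minimal sketch on toy data:

```r
set.seed(1)
x <- matrix(rnorm(30 * 2), ncol = 2)   # 30 toy observations
d <- dist(x)                           # pairwise Euclidean dissimilarities

# note: ?hclust recommends squared Euclidean distances for centroid linkage
fits <- lapply(c("complete", "single", "average", "centroid"),
               function(m) hclust(d, method = m))
```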
What are the 4 types of linkage and how they work?
REVERSED
library(patchwork) #allows you to display ggplots together using plot1 + plot2
What is a way to display ggplots together in R?
REVERSED
- Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations
- Iterate until the cluster assignments stop changing:
a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster
b) Assign each observation to the cluster whose centroid is closest (using Euclidean distance)
* When the result no longer changes, the local optimum has been reached
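The two steps above can be sketched directly in base R (toy data; the empty-cluster edge case is ignored for brevity):

```r
set.seed(1)
x  <- matrix(rnorm(100 * 2), ncol = 2)
K  <- 3
cl <- sample(K, nrow(x), replace = TRUE)   # random initial cluster assignments

repeat {
  # a) centroid of each cluster = vector of feature means
  centroids <- t(sapply(1:K, function(k) colMeans(x[cl == k, , drop = FALSE])))
  # b) assign each observation to the nearest centroid (squared Euclidean distance)
  d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, centroids[k, ])^2))
  new_cl <- max.col(-d2, ties.method = "first")   # argmin over clusters
  if (all(new_cl == cl)) break                    # assignments stable: local optimum
  cl <- new_cl
}
```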
What is the algorithm for k-means clustering?
REVERSED
library(fpc)
clusterboot(data, clustermethod = hclustCBI, method = "complete", k = 3) #for k-means use kmeansCBI
Gives the Jaccard bootstrap mean for each cluster. Generally, stability below 0.6 is considered unstable; clusters with stability above 0.85 are highly stable (likely to be real clusters)
How do you perform stability assessment in R?
REVERSED
- Bottom-up or agglomerative clustering: the dendrogram is built starting from the leaves and combining clusters up to the trunk
- Top-down or divisive clustering: start with one cluster containing all observations and repeatedly split the most heterogeneous cluster
What are two ways of hierarchical clustering?
REVERSED
mean(y, na.rm=TRUE)
How do you do a mean calculation, removing missing values first?
REVERSED
na.action(model)
How do you show the indices of NAs in a model?
REVERSED
Are the clusters associated with an external feature Y? Find data for Y to evaluate
How do you use external information to evaluate clustering results?
REVERSED
- K-means clustering applied to images
- Goal is image compression -> less storage. Cluster pixels and replace them by their cluster centroid
- File size increases with number of clusters
- Image loss decreases with number of clusters
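A sketch of the idea with kmeans() on stand-in "pixel" data (real use would cluster the RGB rows of an image matrix):

```r
set.seed(1)
pixels <- matrix(runif(1000 * 3), ncol = 3)   # stand-in for 1000 RGB pixels
k <- 16                                       # palette of 16 colours
fit <- kmeans(pixels, centers = k, nstart = 5)

compressed <- fit$centers[fit$cluster, ]      # replace each pixel by its centroid
nrow(unique(compressed))                      # only k = 16 distinct colours remain
```

Storing k colours plus one small index per pixel is what gives the compression.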
What is vector quantisation?
REVERSED
distances = dist(data, method = "euclidean")
result = hclust(distances, method = "average")
library(ggdendro)
ggdendrogram(result)
#select the number of clusters with a cutoff: h= gives the height, k= gives the number of clusters
cutree(result, h=2) #results in a vector with the cluster number for each observation. Good to use as.factor when plotting by colour
How do you perform hierarchical clustering in R and plot the dendrogram and select the number of clusters?
REVERSED
Every data point has some likelihood of being missing. The process that governs these probabilities is called the missing data mechanism or response mechanism. There are three categories of missing data: MCAR, MAR and MNAR
What is Rubin's theory of classifying missing data?
REVERSED
colSums(is.na(data))
How do you display the number of NAs in each variable of a dataset?
REVERSED
- Reduce variables into 2D “manifold” for visualisation
- Popular techniques: UMAP, t-SNE, MDS, Discriminant coordinates, PCA
How do you use visual exploration to evaluate clustering results when there are many variables?
REVERSED
library(cluster)
distances = dist(data)
result = hclust(distances)
clusters = cutree(result, 2)
silhouette_scores = silhouette(clusters, distances)
plot(silhouette_scores)
The plot shows each observation's silhouette width: values near 1 mean the point sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be in the wrong cluster
How do you perform average silhouette width analysis in R? What does the result tell us?
REVERSED
Find more data about the causes of the missingness, or perform what-if analyses to see how sensitive the results are under various scenarios.
What are strategies to handle MNAR data?
REVERSED
First builds a model from the observed data
Predictions for the incomplete cases are then calculated under the fitted model and serve as replacements for the missing data
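A base-R sketch on simulated data (illustrative names):

```r
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)     # simulated data with a linear relation
y[sample(50, 10)] <- NA

fit  <- lm(y ~ x)                # lm() drops the incomplete cases by default
miss <- is.na(y)
y[miss] <- predict(fit, newdata = data.frame(x = x[miss]))  # deterministic predictions
```

Because every imputed y falls exactly on the fitted line, correlations are inflated, which is the motivation for the stochastic variant.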
What is regression imputation?