Week 7: Missing Data and Clustering Flashcards
What is a way to display ggplots together in R?
library(patchwork) #allows you to display ggplots together using plot1 + plot2
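For example, using the built-in mtcars data (the two plots are purely illustrative):
library(ggplot2)
library(patchwork)
p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()     # scatterplot
p2 <- ggplot(mtcars, aes(mpg)) + geom_histogram()     # histogram
p1 + p2                                               # patchwork displays the two plots side by side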
How do you print the number of missing values in a model?
naprint(na.action(model))
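A minimal usage sketch with the built-in airquality data (lm() drops incomplete rows by default via na.omit):
fit <- lm(Ozone ~ Wind, data = airquality)
naprint(na.action(fit))   # prints how many observations were deleted due to missingness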
What is the code for creating a table display of the missingness patterns?
md.pattern(data)
How do you display the number of NAs in each variable of a dataset?
colSums(is.na(data))
What are embedded or model based methods for missing data?
Don’t impute; instead, handle missing values within the prediction model itself
What are the advantages of multiple imputation? (3)
- Solves the problem of too small standard errors
- Our level of confidence in a particular imputed value is expressed as the variation across the m completed datasets
- Under the right conditions, the pooled estimates are unbiased and have the correct statistical properties
How do you perform LOCF imputation in R?
tidyr::fill(data, variable)
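A small worked example (the data frame and column name are made up for illustration):
library(tidyr)
df <- data.frame(day = 1:5, value = c(10, NA, NA, 12, NA))
fill(df, value)                       # LOCF: 10 is carried forward into rows 2-3, 12 into row 5
fill(df, value, .direction = "up")    # the NOCB variant carries the next observation backward instead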
Under what conditions is listwise deletion unbiased? What happens to the standard error?
Mean, regression weight and correlation are unbiased only under NDD. Standard error is too large
What is linkage?
The dissimilarity between two clusters if one or both contains multiple observations
What are the disadvantages of single, centroid and complete linkage?
- Single linkage can result in extended, trailing clusters in which single observations are fused one at a time. It can’t separate clusters properly if there is noise between clusters
- Centroid linkage can result in undesirable inversions, where two clusters are fused at a height below either of the individual clusters in the dendrogram.
- Complete linkage tends to break large clusters.
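These linkage choices correspond to the method argument of hclust(); a sketch for comparing them on the built-in USArrests data:
d <- dist(scale(USArrests))          # Euclidean distances on standardised variables
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_centroid <- hclust(d, method = "centroid")
par(mfrow = c(1, 3))
plot(hc_single); plot(hc_complete); plot(hc_centroid)   # compare dendrogram shapes (chaining, inversions, etc.)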
How do you perform mean imputation in R? (2 ways)
library(mice)
imp <- mice(data, method = "mean", m = 1, maxit = 1)   # way 1: single imputation with the "mean" method in mice
data$x[is.na(data$x)] <- mean(data$x, na.rm = TRUE)    # way 2: replace the NAs in a variable by its observed mean
What are the advantages of the indicator method? (2)
- Retains the full dataset and allows for systematic differences between the observed and unobserved data by the inclusion of the response indicator
- Can be useful to estimate the treatment effect in randomised trials when a baseline covariate is partially observed
How do you perform k-means clustering in R, and what does the output consist of?
means_cluster <- kmeans(data, centers = k)
The output contains the cluster assignment of each observation (cluster), the cluster centroids (centers), the within-cluster sums of squares (withinss, tot.withinss) and the cluster sizes (size)
Under what conditions is stochastic regression imputation unbiased? What happens to the standard error?
Mean, regression weights and correlation are unbiased under SDD
Standard error is too small
What is regression imputation?
First builds a model from the observed data
Predictions for the incomplete cases are then calculated under the fitted model and serve as replacements for the missing data
What is mean imputation?
Replace missing values with the mean (or with the mode for categorical data)
What are the formulas for NDD, SDD and UDD, where M indicates whether variable 2 is missing (1) or not (0)?
NDD: Pr(M=1 | var1, var2) = Pr(M=1)
SDD: Pr(M=1 | var1, var2) = Pr(M=1 | var1)
UDD: Pr(M=1 | var1, var2) can’t be reduced
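A small simulation sketch of the three mechanisms (variable names and probabilities are made up for illustration):
set.seed(1)
n    <- 1000
var1 <- rnorm(n)
var2 <- 0.5 * var1 + rnorm(n)
M_ndd <- rbinom(n, 1, 0.3)            # NDD: missingness does not depend on var1 or var2
M_sdd <- rbinom(n, 1, plogis(var1))   # SDD: missingness depends only on the observed var1
M_udd <- rbinom(n, 1, plogis(var2))   # UDD: missingness depends on the possibly unobserved var2 itself
var2_obs <- ifelse(M_sdd == 1, NA, var2)   # apply e.g. the SDD mechanism to create the missing values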
What is the k-medoids clustering algorithm?
- Initialise: select k random points as the medoids
- Assign each data point to the closest medoid by using any distance method (e.g. euclidean)
- For each data point of cluster i, its distance from all other data points is computed and added. The point of ith cluster for which the computed sum of distances from other points is minimal is assigned as the medoid for that cluster
- Repeat steps 2 and 3 until the medoids stop moving
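In R, k-medoids is typically run with pam() from the cluster package (k = 3 is an arbitrary choice here):
library(cluster)
fit <- pam(scale(USArrests), k = 3)   # PAM: partitioning around medoids
fit$medoids                           # the observations chosen as medoids
fit$clustering                        # cluster assignment for each observation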
How do you perform multiple imputation and fit a model in R?
imp <- mice(data, m = 5)      # create m = 5 imputed datasets
fit <- with(imp, lm(y ~ x))   # fit the analysis model on each imputed dataset (lm(y ~ x) is a placeholder)
pool(fit)                     # pool the m sets of estimates using Rubin's rules
What are internal validation indices and what are some popular methods?
- Use only the data and the clustering itself (the unsupervised part) to quantify how successful the clustering is
- Popular measures: average silhouette width (ASW), which measures how well each point fits its own cluster compared with the nearest other cluster, and the gap statistic
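A sketch of both measures with the cluster package (the data and the range of k are illustrative):
library(cluster)
x   <- scale(USArrests)
km  <- kmeans(x, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(x))   # silhouette value for every observation
mean(sil[, "sil_width"])                 # average silhouette width (ASW)
gap <- clusGap(x, FUNcluster = kmeans, nstart = 25, K.max = 8, B = 50)   # gap statistic for k = 1..8
plot(gap)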
Under what conditions is regression imputation unbiased? What happens to the standard error?
Mean and regression weights are unbiased under SDD (for the regression weights, only if the factors that influence the missingness are part of the regression model)
Standard error is too small
What is a “good” k-means clustering?
One for which the within-cluster variation (W(Ck)) is as small as possible
When does hierarchical clustering give worse results than k-means clustering?
When the data doesn’t have a hierarchical structure. e.g. when the best division into 2 groups is by gender but the best division into 3 groups is by nationality
How do you differentiate between SDD and UDD?
You can’t: the observed data alone cannot tell you whether the missingness depends on the unobserved values
What are the disadvantages of pairwise deletion?
- The covariance matrix may not be positive-definite
- Problems are more severe for highly correlated variables
- Requires numerical data that follow an approximately normal distribution
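Pairwise deletion is what the use argument of cov() and cor() does; a minimal illustration with the built-in airquality data:
cov(airquality, use = "pairwise.complete.obs")   # each entry is computed from the cases observed for that pair of variables
cor(airquality, use = "pairwise.complete.obs")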
What is the general code for imputing data and fitting a model with mice in R?
imp <- mice(data)             # impute (by default m = 5 completed datasets)
fit <- with(imp, lm(y ~ x))   # fit the model of interest on each completed dataset (the formula is a placeholder)
summary(pool(fit))            # pool the results across the imputations
How do you perform listwise deletion in R?
Use na.omit()
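For example, with the built-in airquality data:
complete <- na.omit(airquality)     # drops every row that contains at least one NA
nrow(airquality) - nrow(complete)   # number of rows removed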
What is the algorithm for k-means clustering?
- Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations
- Iterate until the cluster assignments stop changing:
a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster
b) Assign each observation to the cluster whose centroid is closest (using Euclidean distance)
* When the result no longer changes, the local optimum has been reached
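A naive from-scratch sketch of this algorithm (not the built-in kmeans(); the function name and arguments are made up, and empty clusters are not handled):
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  assignment <- sample(1:k, nrow(x), replace = TRUE)          # step 1: random initial cluster assignments
  for (iter in 1:max_iter) {
    # step 2a: centroid of each cluster = vector of feature means
    centroids <- t(sapply(1:k, function(j) colMeans(x[assignment == j, , drop = FALSE])))
    # step 2b: reassign each observation to the closest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)
    if (all(new_assignment == assignment)) break              # no change: local optimum reached
    assignment <- new_assignment
  }
  list(cluster = assignment, centers = centroids)
}
simple_kmeans(scale(USArrests), k = 3)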
What are the advantages of k-medoids over k-means clustering?
- In k-means, squared Euclidean distance places the highest influence on the largest distances
- K-means lacks robustness against outliers that produce very large distances
- K-medoids is less sensitive to outliers
What does the vertical axis/height of a dendrogram show?
Mean intercluster dissimilarity: for any two observations, look at the point in the tree where branches containing those observations are first fused. The height of this fusion indicates how different the observations are. Higher = less similar
How do you perform hierarchical clustering in R, plot the dendrogram and select the number of clusters?
distances <- dist(data)                        # pairwise dissimilarities
hc <- hclust(distances, method = "complete")   # hierarchical clustering with the chosen linkage
plot(hc)                                       # dendrogram
clusters <- cutree(hc, k = 3)                  # cut the tree into the chosen number of clusters
Should you cut the dendrogram higher or lower for more clusters?
Lower cut = more clusters
What is within cluster variation?
Can be defined in many ways; the most common choice is squared Euclidean distance: W(Ck) = (1/|Ck|) * Σ_{i,i' in Ck} Σ_j (x_ij − x_i'j)², i.e. the sum of all pairwise squared Euclidean distances within the cluster, divided by the number of observations in the cluster
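A quick numeric check of this definition against the within-cluster sum of squares that kmeans() reports (illustrative data):
x  <- scale(USArrests)
km <- kmeans(x, centers = 3, nstart = 25)
ck <- x[km$cluster == 1, ]                      # observations in cluster 1
W1 <- sum(as.matrix(dist(ck))^2) / nrow(ck)     # W(C1): pairwise squared Euclidean distances divided by cluster size
c(W1, 2 * km$withinss[1])                       # identical: W(Ck) equals twice the within-cluster sum of squares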