Week 7: Missing Data and Clustering Flashcards
What is a way to display ggplots together in R?
library(patchwork) #allows you to display ggplots together using plot1 + plot2
How do you print the number of missing values in a model?
naprint(na.action(model))
What is the code for creating a table display of the missingness patterns?
md.pattern(data)
How do you display the number of NAs in each variable of a dataset?
colSums(is.na(data))
What are embedded or model based methods for missing data?
Don’t impute, deal with missing values in the prediction model itself
What are the advantages of multiple imputation? (3)
- Solves the problem of too small standard errors
- Our level of confidence in a particular imputed value is expressed as the variation across the m completed datasets
- Under the right conditions, the pooled estimates are unbiased and have the correct statistical properties
How do you perform LOCF imputation in R?
tidyr::fill(data, variable)
Under what conditions is listwise deletion unbiased? What happens to the standard error?
Mean, regression weight and correlation are unbiased only under NDD. The standard error is too large, because discarding cases loses information
What is linkage?
The dissimilarity between two clusters if one or both contains multiple observations
What are the disadvantages of single and centroid and complete linkage?
- Single linkage can result in extended, trailing clusters in which single observations are fused one at a time. It can't separate clusters properly if there is noise between the clusters
- Centroid linkage can result in undesirable inversions, where two clusters are fused at a height below either of the individual clusters in the dendrogram.
- Complete linkage tends to break large clusters.
How do you perform mean imputation in R? (2 ways)
library("mice")
imp <- mice(data, method = "mean", m = 1, maxit = 1) # mean imputation with mice
# Or in base R, per variable:
data$x[is.na(data$x)] <- mean(data$x, na.rm = TRUE)
What are the advantages of the indicator method? (2)
- Retains the full dataset and allows for systematic differences between the observed and unobserved data by the inclusion of the response indicator
- Can be useful to estimate the treatment effect in randomised trials when a baseline covariate is partially observed
How do you perform k-means clustering in R? and what does the output consist of?
means_cluster <- kmeans(data, centers = 3, nstart = 20)
The output consists of the cluster assignments (cluster), the cluster centroids (centers), the within-cluster sums of squares (withinss, tot.withinss), the between-cluster sum of squares (betweenss) and the cluster sizes (size)
Under what conditions is stochastic regression imputation unbiased? What happens to the standard error?
Mean, regression weights and correlation are unbiased under SDD
Standard error is too small
What is regression imputation?
First builds a model from the observed data
Predictions for the incomplete cases are then calculated under the fitted model and serve as replacements for the missing data
What is mean imputation?
Replace missing data by the mean or the mode for categorical data
What are the formulas for NDD, SDD and UDD? Where M indicates whether variable 2 is missing (1) or not (0)
NDD: Pr(M=1 | var1, var2) = Pr(M=1)
SDD: Pr(M=1 | var1, var2) = Pr(M=1 | var1)
UDD: Pr(M=1 | var1, var2) can’t be reduced
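The SDD case can be illustrated with a small simulation (variable names mirror the formulas above; the data and coefficients are illustrative assumptions):

```r
# Make var2 missing with a probability that depends only on the
# observed var1 -> SDD missingness.
set.seed(1)
n <- 1000
var1 <- rnorm(n)
var2 <- 0.5 * var1 + rnorm(n)
p_missing <- plogis(var1)        # higher var1 -> more likely to be missing
M <- rbinom(n, 1, p_missing)     # response indicator: 1 = missing
var2[M == 1] <- NA
# Under SDD the missingness is predictable from the seen data:
coef(glm(M ~ var1, family = binomial))
```

Because M depends only on the fully observed var1, this is SDD; making `p_missing` depend on var2 itself would give UDD instead.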
What is the k-medoids clustering algorithm?
- Initialise: select k random points as the medoids
- Assign each data point to the closest medoid by using any distance method (e.g. euclidean)
- For each cluster i, compute for every data point in the cluster the sum of its distances to all other points in that cluster. The point with the minimal sum is assigned as the new medoid for that cluster
- Repeat steps 2 and 3 until the medoids stop moving
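A minimal sketch of running k-medoids in R, using pam() from the cluster package (bundled with R); the simulated data and k = 2 are illustrative assumptions:

```r
library(cluster)
set.seed(1)
# Two well-separated groups of 10 two-dimensional points each
data <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
              matrix(rnorm(20, mean = 5), ncol = 2))
fit <- pam(data, k = 2)
fit$medoids      # the medoids are actual rows of the data
fit$clustering   # cluster assignment for each observation
```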
How do you perform multiple imputation and fit a model in R?
imp <- mice(data, m = 5)       # multiple imputation
fit <- with(imp, lm(y ~ x))    # fit the model to each imputed dataset
summary(pool(fit))             # pool the results
What are internal validation indices and what are some popular methods?
- Only look at unsupervised bit: data and clustering and quantify how successful clustering is
- Popular measures: average silhouette width (ASW), which measures how well each point fits its own cluster compared with the nearest other cluster, and the gap statistic
Under what conditions is regression imputation unbiased? What happens to the standard error?
Mean and regression weights are unbiased under SDD
- For regression weights, this holds only if the factors that influence the missingness are part of the regression model
Standard error is too small
What is a “good” k-means clustering?
One for which the within-cluster variation (W(Ck)) is as small as possible
When does hierarchical clustering give worse results than k-means clustering?
When the data doesn’t have a hierarchical structure. e.g. when the best division into 2 groups is by gender but the best division into 3 groups is by nationality
How do you differentiate between SDD and UDD?
You can’t
What are the disadvantages of pairwise deletion?
- The covariance matrix may not be positive-definite
- Problems are more severe for highly correlated variables
- Requires numerical data that follows an approximately normal distribution
What is the general code for imputing data and fitting a model with mice in R?
imp <- mice(data)              # impute
fit <- with(imp, lm(y ~ x))    # fit the model to each completed dataset
pool(fit)                      # pool the m estimates
How do you perform listwise deletion in R?
Use na.omit()
What is the algorithm for k-means clustering?
- Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations
- Iterate until the cluster assignments stop changing:
a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster
b) Assign each observation to the cluster whose centroid is closest (using Euclidean distance)
* When the result no longer changes, the local optimum has been reached
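The steps above can be sketched in base R (an illustrative, unoptimised implementation; it assumes at least two feature columns and that no cluster becomes empty during the iterations):

```r
manual_kmeans <- function(X, K, max_iter = 100) {
  set.seed(1)
  # Step 1: random initial cluster assignments
  assign <- sample(1:K, nrow(X), replace = TRUE)
  for (i in seq_len(max_iter)) {
    # a) compute the centroid (feature means) of each cluster
    centroids <- sapply(1:K, function(k)
      colMeans(X[assign == k, , drop = FALSE]))
    # b) reassign each observation to the nearest centroid
    #    (squared Euclidean distance)
    new_assign <- apply(X, 1, function(x)
      which.min(colSums((centroids - x)^2)))
    if (all(new_assign == assign)) break  # local optimum reached
    assign <- new_assign
  }
  assign
}
```

In practice you would use kmeans(), which implements the same idea with better initialisation via nstart.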
What are the advantages of k-medoids over k-means clustering?
- In k-means, squared Euclidean distance places the highest influence on the largest distances
- K-means lacks robustness against outliers that produce very large distances
- K-medoids is less sensitive to outliers
What does the vertical axis/height of a dendrogram show?
Intercluster dissimilarity: for any two observations, look at the point in the tree where the branches containing those observations are first fused. The height of this fusion indicates how different the observations are. Higher = less similar
How do you perform hierarchical clustering in R and plot the dendrogram and select the number of clusters?
distances <- dist(data)
hc <- hclust(distances, method = "complete")
plot(hc)                       # dendrogram
clusters <- cutree(hc, k = 3)  # select the number of clusters
Should you cut the dendrogram higher or lower for more clusters?
Lower cut = more clusters
What is within cluster variation?
How much the observations within a cluster differ from each other, W(Ck). It can be defined in many ways; the most common choice is based on (squared) Euclidean distance between the observations in the cluster
What are two ways of hierarchical clustering?
- Bottom-up or agglomerative clustering: the dendrogram is built starting from the leaves and combining clusters up to the trunk
- Top-down or divisive clustering: start with one cluster containing all observations and repeatedly split the most heterogeneous cluster
What is the indicator method of imputation?
The indicator method replaces each missing value by a zero and extends the regression model by the response indicator. This is applied to each incomplete variable. Then analyse the extended model
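A sketch of the indicator method on simulated data (all names and numbers are illustrative):

```r
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
d$x[sample(100, 20)] <- NA           # make some x values missing
d$x_obs <- as.numeric(!is.na(d$x))   # response indicator (1 = observed)
d$x0 <- ifelse(is.na(d$x), 0, d$x)   # missing values replaced by zero
fit <- lm(y ~ x0 + x_obs, data = d)  # extended model with the indicator
summary(fit)
```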
What does summarise_all() do?
used to do a summarise transformation across all variables
How do you perform regression imputation in R? (2 ways)
imp <- mice(data, method = "norm.predict", m = 1)
# Or manually:
fit <- lm(y ~ x, data = data)
data$y[is.na(data$y)] <- predict(fit, newdata = data[is.na(data$y), ])
What is the time complexity of hierarchical clustering?
O(n^3)
What is Rubin's theory of classifying missing data?
Every data point has some likelihood of being missing. The process that governs these probabilities is called the missing data mechanism or response mechanism. There are three categories of missing data: MCAR, MAR and MNAR
How do you do a mean calculation, removing missing values first?
mean(y, na.rm=TRUE)
How do you show the indices of NAs in a model?
na.action(model)
What is vector quantisation?
- K-means clustering applied to images
- Goal is image compression -> less storage. Cluster pixels and replace them by their cluster centroid
- File size increases with number of clusters
- Image loss decreases with number of clusters
Once you have performed multiple imputation and stored it as “imp”, what codes are there to access different parts of the imputation?
imp$data #shows original data
imp$imp #shows imputed data
complete(imp, 3) #extracts the 3rd completed dataset of the m imputations
What are LOCF and BOCF imputation methods?
Last observation carried forward (LOCF) and baseline observation carried forward (BOCF) are ad-hoc imputation methods for longitudinal data
LOCF carries the last observed value forward as a replacement for the missing data; BOCF carries the baseline value forward
What type of data does multiple imputation assume?
SDD assumption (or NDD)
Under what conditions is pairwise deletion unbiased?
Mean, regression weight and correlation are unbiased only under NDD.
How many possible re-orderings of the dendrogram are there without changing its meaning?
2^(n-1)
What is the time complexity of k-medoids clustering?
O(k*(n-k)^2)
What does the variance of the parameter estimates from multiple imputation consist of?
- Within-dataset variance: the conventional sampling variance caused from taking a sample rather than entire population, the uncorrected standard error
- Between-dataset variance: extra variance caused by the missing data
- Simulation error: extra variance caused by the fact that the estimator is based on a finite amount of datasets m. Less of a problem with machines as m can be large
How do you use external information to evaluate clustering results?
Are the clusters associated with an external feature Y? Find data for Y to evaluate
What is pairwise deletion/available case analysis?
Calculates the mean and covariances of all available data. The matrix summary of statistics is then used for analysis and modelling
Under what conditions is LOCF imputation unbiased? What happens to the standard error?
LOCF is always biased
Standard error is too small
What is stochastic regression imputation?
A refinement of regression imputation that attempts to address correlation bias by adding noise to the predictions. This method first estimates the intercept, slope and residual variance under the linear model, then calculates the predicted value for each missing value and adds a random draw from the residual to the prediction
How do you change the settings to always omit NAs?
options(na.action = na.omit)
How do you use visual exploration to evaluate clustering results when there are many variables?
- Reduce variables into 2D “manifold” for visualisation
- Popular techniques: UMAP, t-SNE, MDS, Discriminant coordinates, PCA
What is the hierarchical clustering algorithm?
- Begin with n observations and a measure (such as Euclidean distance) of all the n(n-1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
- For i=n, n-1, … 2:
a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are the least dissimilar. Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed
b) Compute the new pair-wise inter-cluster dissimilarities among the i-1 remaining clusters
What does mean(y) return when y has missing values?
NA
List the 4 ways of evaluating clustering results
- Use of external information
- Visual exploration
- Stability assessment
- Internal validation indices
How do you fit a linear model, removing missing values first?
lm(y~x, data, na.action = na.omit)
When and why should variables be scaled before computing dissimilarity?
- Scaling to standard deviation 1 gives equal importance to each variable in the clustering
- Useful when variables are measured on different scales
What is imputation?
Replacing missing values with guessed values
Under what conditions is mean imputation unbiased? What happens to the standard error?
Only mean is unbiased under NDD
Standard error is too small
Disturbs relations between variables
How do you differentiate between NDD and SDD?
Look at the {0,1} missingness indicator M versus the other features. If you can classify M from the other features better than chance, you do not have NDD
What is the code for getting the number of na’s for each variable grouped by a variable
data %>% group_by(variable) %>% summarise_all(function(x) sum(is.na(x)))
How do you perform average silhouette width analysis in R? what does the result tell us?
distances <- dist(data)
library(cluster)
sil <- silhouette(clusters, distances) # clusters: e.g. from cutree() or kmeans()$cluster
plot(sil)
The silhouette width ranges from -1 to 1: values near 1 mean an observation lies well within its own cluster and far from the nearest other cluster, so a higher average silhouette width (ASW) indicates a better clustering
What is stability assessment of clustering results?
How much does the clustering change when: 1. Changing some hyperparameters, 2. Changing some observations (bootstrapping), 3. Changing some features
Check if observations are classified into same cluster across choices
What are strategies to handle MNAR data?
Find more data about the causes for the missingness, or to perform what-if analyses to see how sensitive the results are under various scenarios.
What is deductive imputation?
Deduce the missing data from the data you have e.g. from height and weight can calculate BMI
Describe the three categories of missing data
Missing completely at random (MCAR) / Not data dependent (NDD): The probability of being missing is the same for all cases. The causes of the missing data are unrelated to the data.
Missing at random (MAR) / Seen data dependence (SDD): the probability of being missing is the same only within groups defined by the observed data
Missing not at random (MNAR) / Unseen data dependence (UDD): the probability of being missing varies for reasons that are unknown to us. It is missing because of the value you would have obtained.
What is listwise deletion/complete case analysis?
Eliminate all cases with one or more missing values
How do you perform stability assessment in R?
library(fpc)
clusterboot(data, clustermethod = hclustCBI, method = "complete", k = 3) #for kmeans use kmeansCBI
Gives the Jaccard bootstrap mean for each cluster. Generally, stability below 0.6 is considered unstable. Clusters with stability above 0.85 are highly stable (likely to be real clusters)
What is the procedure for multiple imputation?
- Create several (m) complete versions of the data (imputed datasets) by replacing the missing values by plausible data values using stochastic imputation
- Estimate the parameters of interest from each imputed dataset. Typically done by applying the method that we would have used if the data was complete
- Last step is to pool the m parameter estimates into one estimate by averaging them, and to estimate its variance
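The pooling step can be sketched in base R with Rubin's rules, using made-up estimates from m = 5 imputed datasets (all numbers illustrative):

```r
est <- c(1.9, 2.1, 2.0, 2.2, 1.8)       # parameter estimate from each dataset
se  <- c(0.30, 0.31, 0.29, 0.32, 0.30)  # its standard error in each dataset
m <- length(est)
Qbar <- mean(est)                # pooled estimate
W <- mean(se^2)                  # within-imputation variance
B <- var(est)                    # between-imputation variance
Tvar <- W + (1 + 1/m) * B        # total variance of the pooled estimate
c(estimate = Qbar, se = sqrt(Tvar))
```

This is what mice::pool() does for you after fitting a model with with().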
What is k-medoids clustering?
The idea of k-medoids is to make the final centroids as actual data points, making them interpretable
What is the disadvantage of a k-medoids clustering?
It is not suitable for clustering non-spherical groups of objects
How do you perform stochastic regression imputation in R?
fit <- lm(y ~ x, data = data)
pred <- predict(fit, newdata = data[is.na(data$y), ])
data$y[is.na(data$y)] <- pred + rnorm(length(pred), 0, sigma(fit))
# Or with mice: imp <- mice(data, method = "norm.nob", m = 1)
What is the goal of clustering?
Clustering looks to find homogeneous subgroups among observations
What are the 4 types of linkage and how they work?
- Complete: Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
- Single: Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.
- Average: Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
- Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B.
What are the disadvantages of regression imputation? (4)
- Can increase correlation between variables
- Correlations are biased upwards
- P-values are too optimistic
- Variability systematically underestimated
What is the code for returning the mean of each variable in a dataset? (2 ways)
map_dbl(data, mean) #use map_int if data is int etc.
summarise_all(data, mean)
What are three practical issues with clustering?
- Small decisions such as k and dissimilarity measure have big impacts on the clusters
- Validating the clusters obtained: clustering will always result in clusters, but do they represent true subgroups in the data or are they simply clustering the noise
- Since k-means and hierarchical clustering force every observation into a cluster, the clusters found may be heavily distorted due to the presence of outliers that do not belong to any cluster
What is k-means clustering?
Specify the desired number of clusters K, then the K-means algorithm will assign each observation to exactly one cluster
What are the disadvantages of listwise deletion? (4)
- Large loss of information
- Hopeless with many features
- Inconsistencies in reporting as analysis on the same data often uses different sub-samples
- Can lead to nonsensical sub-samples e.g. deleting data in time series analysis
What is correlation-based distance?
Considers two observations to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance
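A base-R sketch on simulated data: compute 1 minus the correlation between observations (rows) and treat it as a distance:

```r
set.seed(1)
X <- matrix(rnorm(20), nrow = 4)   # 4 observations, 5 features
D <- as.dist(1 - cor(t(X)))        # 1 - correlation between rows
D                                  # values near 0 = highly correlated rows
```

D can then be passed directly to hclust() in place of a Euclidean dist() object.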
What are the advantages of single and complete linkage?
- Single linkage can differentiate between non-elliptical clusters
- Complete linkage gives well-separated clusters if there is noise between the clusters