Missing values and PCA Flashcards
What are MAR, MCAR, and MNAR, and what are examples of each?
MAR = missing at random: the probability of a value being missing depends on other observed features, not on the missing value itself. Example: in a medical dataset, lab results are reported less often for younger patients, so missingness depends on the observed age feature.
MCAR = missing completely at random: the probability of a value being missing is the same for all observations and unrelated to any of the data. Example: data lost by chance, such as a lab instrument failing.
MNAR = missing not at random: the missingness depends on the unobserved value itself, i.e. there is a systematic reason for the missing values. Example: in a medical dataset, healthy patients tend not to report their results.
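A minimal sketch of how the three mechanisms could be simulated on a toy medical dataset (the column names, sample size, and probabilities are made up for illustration, not from the card):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "lab_value": rng.normal(loc=5.0, scale=1.0, size=n),
})

# MCAR: every lab_value has the same 10% chance of being missing.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "lab_value"] = np.nan

# MAR: missingness depends on the *observed* age (younger patients report less).
mar = df.copy()
p_missing = np.where(mar["age"] < 40, 0.30, 0.05)
mar.loc[rng.random(n) < p_missing, "lab_value"] = np.nan

# MNAR: missingness depends on the *unobserved* value itself
# (e.g. healthy patients with low values tend not to report).
mnar = df.copy()
p_missing = np.where(mnar["lab_value"] < 5.0, 0.30, 0.05)
mnar.loc[rng.random(n) < p_missing, "lab_value"] = np.nan
```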
What are strategies for handling missing data, and what are the pros and cons of each?
Imputation by mean/median/mode: simple and fast, and works reasonably well when the missing values make up only a small proportion of the dataset and the data are MCAR. Downside: it reduces the variability of the data and can distort relationships between features.
KNN imputation: uses the k nearest observations (based on the other features) to fill in a missing value. The choice of k is important, and it can be computationally expensive on large datasets.
Regression imputation: fits a regression model on the other features to predict the missing values. This can overfit and underestimates the variability of the data, since imputed values lie exactly on the fitted regression surface.
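A minimal sketch of the three strategies using scikit-learn imputers (the toy matrix X is an assumption; IterativeImputer stands in for regression imputation):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy feature matrix with missing entries.
X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 8.0]])

# Mean imputation: replace each NaN with the column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: fill each NaN from the k nearest rows (here k = 2).
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Regression-style imputation: each feature is regressed on the others
# (IterativeImputer does this in a round-robin fashion).
X_reg = IterativeImputer(random_state=0).fit_transform(X)
```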
Explain how PCA works and why it is interesting to use.
PCA = principal component analysis. The main idea is to form linear combinations of the p original features to create M new variables (principal components), with M < p, chosen to capture as much of the variance in the data as possible. So we work with fewer variables, while each component still uses information from all original features.
In other words, we want to keep as few variables as possible while retaining as much information (variance) as possible.
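A minimal sketch of PCA with scikit-learn (the synthetic data and the choice of M = 2 components are arbitrary assumptions; standardizing first because PCA is scale-sensitive):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # 200 observations, p = 5 features

X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to scale

pca = PCA(n_components=2)                   # keep M = 2 components (M < p)
Z = pca.fit_transform(X_std)                # scores: linear combinations of all 5 features

print(pca.explained_variance_ratio_)        # share of variance captured by each component
```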
How do PCA and K-means clustering contribute to outlier detection, and what are the key considerations in evaluating their performance for identifying outliers in a dataset? Discuss the strengths and limitations of each method in the context of outlier detection, and provide insights into scenarios where one approach might outperform the other.
PCA finds the linear combinations of the features that explain the most variance. It is sensitive to scale, so the variables should be normalized to ensure that all of them contribute equally.
Because PCA captures the dominant patterns in the data, observations that do not follow these patterns (e.g. with a large reconstruction error or extreme component scores) can be flagged as outliers.
It works best when the structure of the data is roughly linear.
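A minimal sketch of PCA-based outlier detection via reconstruction error (the synthetic data, planted outliers, and the 98% quantile cutoff are arbitrary assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:5] += 6.0                                # plant a few obvious outliers

X_std = StandardScaler().fit_transform(X)   # normalize: PCA is scale-sensitive

pca = PCA(n_components=2)
Z = pca.fit_transform(X_std)
X_hat = pca.inverse_transform(Z)            # project back to the original space

# Reconstruction error: points poorly described by the main components.
recon_error = np.sum((X_std - X_hat) ** 2, axis=1)
threshold = np.quantile(recon_error, 0.98)  # arbitrary cutoff for illustration
outliers = np.where(recon_error > threshold)[0]
print(outliers)
```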
K-means assigns each data point to the nearest cluster mean. Outliers are the points that lie furthest from their cluster centers, i.e. they are determined by how much a point deviates from its center. This method is also sensitive to scale.
It is suitable for datasets with distinct clusters.
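A minimal sketch of K-means-based outlier detection via distance to the assigned centroid (again, the synthetic clusters and the 98% quantile cutoff are arbitrary assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated clusters plus a few points far from both.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(150, 2)),
    rng.normal(loc=8.0, scale=1.0, size=(150, 2)),
    rng.normal(loc=20.0, scale=1.0, size=(5, 2)),
])

X_std = StandardScaler().fit_transform(X)       # K-means is also scale-sensitive

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)
centers = kmeans.cluster_centers_[kmeans.labels_]

# Distance of each point to its own cluster center; the largest are outlier candidates.
dist = np.linalg.norm(X_std - centers, axis=1)
threshold = np.quantile(dist, 0.98)             # arbitrary cutoff for illustration
outliers = np.where(dist > threshold)[0]
print(outliers)
```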