Data Quality & Proximity Measurement Flashcards
Noise
Random and unpredictable variation in the data that is not related to the underlying pattern or signal that the model is trying to learn.
Outliers
Data objects whose characteristics are considerably different from those of most other data objects in the data set.
Missing values
The absence of a value for a particular attribute in a dataset.
2 Reasons for Missing Values
Information is not collected.
Attributes may not be applicable to all cases.
4 Ways to Handle Missing Values
Eliminate the attribute altogether (drop the column).
Eliminate instances/objects (drop the rows).
Replace missing values.
Statistical methods, e.g. the median for age
Sophisticated methods, e.g. regression imputation, KNN imputation
Predict the missing values.
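The first three options above can be sketched in plain Python. This is only an illustration: the records, the attribute names (age, income, spouse_name), and the helper function names are all hypothetical, and None is assumed to mark a missing value.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "income": 40000, "spouse_name": None},
    {"age": None, "income": 52000, "spouse_name": "Ana"},
    {"age": 40, "income": None, "spouse_name": None},
    {"age": 31, "income": 61000, "spouse_name": None},
]

def drop_attribute(rows, name):
    """Eliminate an attribute (column) from every record."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]

def drop_incomplete(rows):
    """Eliminate every record (row) that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_median(rows, name):
    """Replace missing values of one attribute with the median of its observed values."""
    observed = [r[name] for r in rows if r[name] is not None]
    fill = median(observed)
    return [{**r, name: fill if r[name] is None else r[name]} for r in rows]

# spouse_name does not apply to most records, so drop the column.
without_spouse = drop_attribute(rows, "spouse_name")
# Keep only the records that are fully observed.
complete_only = drop_incomplete(without_spouse)
# Alternatively, keep all records and fill in age with the median.
age_filled = impute_median(without_spouse, "age")
```

Dropping rows is simple but discards data; median imputation keeps every record at the cost of understating the spread of the attribute.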
Data Cleaning
Process of detecting and correcting data quality problems, including duplicate data issues.
Proximity Measures
Mathematical metrics used to determine the similarity or distance between two data points in feature space.
2 Clustering algorithms that use distance measures
K-means clustering
Hierarchical clustering
Classification algorithm that uses distance measures
K-nearest neighbors (KNN)
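As a rough sketch of how a distance measure drives K-nearest neighbors, the snippet below classifies a query point by majority vote among its k closest training points. The function name knn_predict, the toy points, and the labels "A" and "B" are hypothetical; Euclidean distance (math.dist) is assumed, though any other distance could be swapped in.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points.

    train is a list of (point, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-class training set.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]

near_a = knn_predict(train, (1.2, 1.4))
near_b = knn_predict(train, (6.4, 6.1))
```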
4 Ways to Calculate Dissimilarity (Distance)
Manhattan Distance / Taxicab / City block / L1 norm
Euclidean Distance
Minkowski Distance
Mahalanobis Distance
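The four distances above can be sketched as small Python functions. The function names and sample points are illustrative only. For Mahalanobis distance, the inverse covariance matrix is assumed to be supplied by the caller as a list of lists rather than estimated from data.

```python
import math

def manhattan(x, y):
    """L1 norm: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """L2 norm: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def minkowski(x, y, p):
    """Generalization: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def mahalanobis(x, y, inv_cov):
    """sqrt(diff^T * inv_cov * diff), where inv_cov is the inverse covariance matrix."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    q = sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(q)

x, y = (1.0, 2.0), (4.0, 6.0)

d_manhattan = manhattan(x, y)   # |1-4| + |2-6|
d_euclidean = euclidean(x, y)   # sqrt(3^2 + 4^2)

# With the identity matrix, Mahalanobis reduces to Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
d_mahal_identity = mahalanobis(x, y, identity)

# Assumed variances of 9 and 16 with no covariance between the attributes.
inv_cov = [[1 / 9, 0.0], [0.0, 1 / 16]]
d_mahal_scaled = mahalanobis(x, y, inv_cov)
```

Mahalanobis distance rescales each direction by the spread of the data, so a difference of 3 along an attribute with variance 9 counts the same as a difference of 4 along an attribute with variance 16.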
4 Ways to Calculate Similarity
Simple Matching Coefficient (SMC)
Jaccard Coefficient
Cosine Similarity
Correlation
Drawback of correlation - it measures only linear relationships, so it cannot detect non-linear ones.
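A minimal sketch of the four similarity measures follows, assuming binary 0/1 vectors for SMC and Jaccard and numeric vectors for cosine similarity and (Pearson) correlation. The function names and sample vectors are illustrative, and returning 0.0 from jaccard when neither vector contains a 1 is an assumed convention, since the coefficient is undefined in that case.

```python
import math

def smc(a, b):
    """Simple Matching Coefficient: share of positions where the two vectors agree."""
    return sum(1 for u, v in zip(a, b) if u == v) / len(a)

def jaccard(a, b):
    """Jaccard coefficient: 1-1 matches divided by positions where either vector is 1."""
    both = sum(1 for u, v in zip(a, b) if u == 1 and v == 1)
    either = sum(1 for u, v in zip(a, b) if u == 1 or v == 1)
    return both / either if either else 0.0  # assumed convention for all-zero input

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    sd_a = math.sqrt(sum((u - mean_a) ** 2 for u in a))
    sd_b = math.sqrt(sum((v - mean_b) ** 2 for v in b))
    return cov / (sd_a * sd_b)

# Sparse binary vectors: the many shared zeros inflate SMC but not Jaccard.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc_pq = smc(p, q)
jaccard_pq = jaccard(p, q)

# Same direction, different length.
cos_parallel = cosine([3, 4], [6, 8])

xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # perfectly linear in xs
squared = [x * x for x in xs]      # fully determined by xs, but not linearly
corr_linear = pearson(xs, linear)
corr_squared = pearson(xs, squared)
```

The last pair illustrates the drawback: squared is completely determined by xs, yet its correlation with xs is zero because the relationship is not linear.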