Data Quality & Proximity Measurement Flashcards

1
Q

Noise

A

Random and unpredictable variation in the data that is not related to the underlying pattern or signal that the model is trying to learn.

2
Q

Outliers

A

Data objects with characteristics that are considerably different than most of the other data objects in the data set.

3
Q

Missing values

A

The absence of a particular value in a dataset, such as an attribute left unrecorded for a data object.

4
Q

2 Reasons for Missing Values

A

Information is not collected.
Attributes may not be applicable to all cases.

5
Q

4 Ways to Handle Missing Values

A

Eliminate the attribute altogether - column

Eliminate instances/objects - row

Replace missing values.
Statistical methods - Mean, median, or mode (e.g. median for age)
Sophisticated methods - Regression imputation, KNN imputation

Prediction of missing values.
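
The first three options above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the toy records, the attribute names `age` and `income`, the use of `None` to mark a missing value, and the helper names are all assumptions made for the example.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 31, "income": None},
    {"age": 40, "income": 61000},
]

def drop_attribute(rows, name):
    """Eliminate one attribute (column) from every record."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]

def drop_incomplete(rows):
    """Eliminate records (rows) that contain any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_median(rows, name):
    """Replace missing values of one attribute with the median of the observed ones."""
    observed = [r[name] for r in rows if r[name] is not None]
    fill = median(observed)
    return [{**r, name: fill if r[name] is None else r[name]} for r in rows]
```

Dropping rows or columns is simple but discards data; imputation keeps every record at the cost of introducing estimated values.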

6
Q

Data Cleaning

A

Process of detecting and fixing data quality problems, such as duplicate data.
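
As a rough sketch of one cleaning step, exact duplicate records can be removed while keeping the first occurrence. The function name `drop_duplicates` and the sample records are hypothetical, and the sketch assumes every value in a record is hashable; near-duplicates (e.g. differently spelled names) would need fuzzier matching.

```python
def drop_duplicates(records):
    """Return records with exact duplicates removed, preserving first-seen order."""
    seen = set()
    unique = []
    for rec in records:
        # A sorted tuple of (attribute, value) pairs is a hashable fingerprint.
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical records: the third repeats the first.
people = [
    {"name": "Ann", "city": "Oslo"},
    {"name": "Bob", "city": "Rome"},
    {"city": "Oslo", "name": "Ann"},
]
```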

7
Q

Proximity Measures

A

Mathematical metrics used to determine the similarity or distance between two data points in feature space.

8
Q

2 Clustering algorithms that use distance measures

A

K-means clustering
Hierarchical clustering
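
To show where the distance measure enters, here is a minimal k-means sketch (Lloyd's algorithm). It assumes Euclidean distance, caller-supplied initial centroids, and a fixed number of iterations; the function name and sample points are illustrative only.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # The distance measure decides which cluster a point joins.
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
```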

9
Q

2 Algorithms that use distance measures (classification)

A

K-nearest neighbors (KNN)
Hierarchical clustering (strictly an unsupervised clustering method, not a classifier)
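
A minimal KNN sketch, assuming Euclidean distance and a simple majority vote. The function name `knn_predict`, the labels, and the training points are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs. Euclidean distance is
    an assumed choice; another distance function could be substituted.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set with two well-separated classes.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
```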

10
Q

4 Ways to Calculate Dissimilarity (Distance)

A

Manhattan Distance / Taxicab / City block / L1 norm
Euclidean Distance / L2 norm
Minkowski Distance (generalizes Manhattan and Euclidean)
Mahalanobis Distance
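
The four distances can be sketched as follows. Minkowski distance with p = 1 gives Manhattan and with p = 2 gives Euclidean. For Mahalanobis, this sketch assumes the caller already has the inverse covariance matrix as a nested list (`inv_cov` is a hypothetical parameter name); estimating and inverting the covariance is left out.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance: (sum |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def manhattan(x, y):
    """L1 norm of the difference: Minkowski with p = 1."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """L2 norm of the difference: Minkowski with p = 2."""
    return minkowski(x, y, 2)

def mahalanobis(x, y, inv_cov):
    """sqrt((x - y)^T * inv_cov * (x - y)), given the inverse covariance matrix."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    total = sum(diff[i] * inv_cov[i][j] * diff[j]
                for i in range(n) for j in range(n))
    return math.sqrt(total)
```

With the identity matrix as `inv_cov`, Mahalanobis distance reduces to Euclidean distance; otherwise it rescales each direction by the spread of the data along it.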

11
Q

4 Ways to Measure Similarity

A

Simple Matching Coefficient (SMC)
Jaccard Coefficient
Cosine Similarity
Correlation
Drawback of correlation - Measures only linear relationships, so it cannot detect non-linear ones
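
A minimal sketch of the four measures, with assumptions stated up front: SMC and Jaccard are computed on binary (0/1) vectors, correlation is taken to mean Pearson correlation, and returning 0.0 from `jaccard` when both vectors are all zeros is a convention chosen here, not a standard. The last example illustrates the drawback: y = x² is fully determined by x, yet its correlation with x over a symmetric range is 0.

```python
import math

def smc(x, y):
    """Simple Matching Coefficient: matching positions / all positions."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches / positions where at least one is 1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0  # assumed convention for all-zero input

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Sparse binary vectors: SMC is high because of shared zeros, Jaccard ignores them.
u = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
v = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Non-linear relationship that correlation misses.
xs = [-2, -1, 0, 1, 2]
ys = [a * a for a in xs]
```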
