Data_Quality (Brohrer) Flashcards
1
Q
choose your rows
A
each row must have one and only one target value (this may not hold for multilabel problems, but the rows must still answer a precise question, as in the single-label case)
2
Q
inspect the data
A
- spot problems in data
- identify the domain of values for each feature, e.g. with pandas DataFrame.describe()
- review the documentation for all features
- visualize data (histograms, scatter plots, etc) for anomalies, correlation, etc.
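A minimal sketch of the inspection step, assuming pandas and a small made-up table (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: one implausible age and one missing city.
df = pd.DataFrame({
    "age": [25, 31, 47, 230],
    "city": ["Oslo", "Oslo", "Rome", None],
})

# Summary statistics for every column, numeric and non-numeric alike.
summary = df.describe(include="all")
print(summary)

# Count of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

The max of "age" in the summary makes the implausible value easy to spot.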
3
Q
correct the data
A
- if the error is obvious, correct the data
- otherwise delete the value and leave it as missing
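One possible way to delete non-obvious errors and leave them as missing, assuming pandas and a hypothetical plausibility range of 0 to 120 for an age column:

```python
import pandas as pd

# Hypothetical ages; -4 and 230 are outside the assumed plausible range.
ages = pd.Series([25, 31, -4, 230, 47], dtype="float64")

# Keep values inside the range; everything else becomes NaN (missing).
cleaned = ages.where(ages.between(0, 120))
print(cleaned)
```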
4
Q
missing data -> mean
A
- assumes the missing values are distributed in the same way as the observed ones
- i.e. the data is Missing Completely at Random (MCAR)
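A short sketch of mean filling with pandas; the values are made up for illustration:

```python
import pandas as pd

# Hypothetical numeric feature with one missing value.
s = pd.Series([10.0, 12.0, None, 14.0])

# mean() skips missing values by default, so it is computed from the observed ones.
filled = s.fillna(s.mean())
print(filled)
```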
5
Q
missing data -> median, mode
A
- median is a robust statistic useful when dealing with outliers
- mode is used when dealing with categorical variables
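A sketch of both fills with pandas, using made-up data (an income column with one outlier and a color column):

```python
import pandas as pd

# Hypothetical income column with an outlier (1000.0) and one missing value.
income = pd.Series([30.0, 32.0, 35.0, None, 1000.0])
# The median is barely affected by the outlier, unlike the mean.
income_filled = income.fillna(income.median())

# Hypothetical categorical column with one missing value.
color = pd.Series(["red", "blue", "red", None])
# mode() returns a Series (there may be ties); take the first entry.
color_filled = color.fillna(color.mode().iloc[0])

print(income_filled)
print(color_filled)
```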
6
Q
missing data -> interpolated value
A
- interpolate the value when dealing with ordered data
- useful with time-series data
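A minimal sketch of linear interpolation on ordered data, assuming pandas and evenly spaced, made-up readings:

```python
import pandas as pd

# Hypothetical evenly spaced readings with gaps.
temps = pd.Series([10.0, None, 14.0, None, None, 20.0])

# Linear interpolation fills each gap from its neighbours on both sides.
interp = temps.interpolate(method="linear")
print(interp)
```

For unevenly spaced timestamps, an index-aware method would likely be a better fit than plain linear interpolation.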
7
Q
missing data -> constant
A
- when the data is not Missing at Random, e.g. people withhold income info because their income is very ‘high’; filling in a high constant represents the missing info better, even if not precisely
8
Q
missing data -> missing rank
A
- when the data is a ranking with no duplicates or gaps, the missing value is the one rank that does not appear
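A sketch of recovering a single missing rank, assuming ranks run from 1 to the number of rows with no duplicates or gaps (the data is made up):

```python
import pandas as pd

# Hypothetical ranking of five items with one rank missing.
ranks = pd.Series([1.0, 3.0, None, 2.0, 5.0])

# All ranks that should be present, minus the ones actually observed.
expected = set(range(1, len(ranks) + 1))
observed = set(ranks.dropna().astype(int).tolist())
missing = expected - observed

# Only fill when exactly one rank is unaccounted for.
if len(missing) == 1:
    ranks = ranks.fillna(missing.pop())
print(ranks)
```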
9
Q
missing data -> 0
A
- for numerical data, 0 can be a reasonable choice
- for categorical data, use it as a (key, value) pair such as (0, ‘missing’)
10
Q
missing data -> 0 - confidence
A
- if confidence in using 0 is low, add a separate column/feature with a boolean flag that marks the missing values
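A sketch of zero filling with an indicator column, assuming pandas; the column names are hypothetical:

```python
import pandas as pd

# Hypothetical feature with one missing value.
df = pd.DataFrame({"purchases": [3.0, None, 7.0]})

# Record which values were missing BEFORE filling, so the information is not lost.
df["purchases_missing"] = df["purchases"].isna()
df["purchases"] = df["purchases"].fillna(0)
print(df)
```

The flag column lets a model distinguish a true 0 from a filled-in 0.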
11
Q
missing data -> ‘missing’
A
- use such an indicator value or similar for categorical data
12
Q
missing data -> delete columns
A
- use it when a column has too many missing values to be useful (uninformative)
!!! NOTE: - make sure there is enough data left before proceeding further
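A sketch of dropping sparse columns with pandas; the 0.5 threshold is an arbitrary assumption, not a recommended value:

```python
import pandas as pd

# Hypothetical table: column "b" is mostly missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [None, None, None, 1.0],
    "c": [1.0, None, 3.0, 4.0],
})

max_missing_fraction = 0.5  # assumed cut-off for illustration

# Fraction of missing values per column.
missing_fraction = df.isna().mean()

# Keep only the columns at or below the cut-off.
kept = df.loc[:, missing_fraction <= max_missing_fraction]
print(kept.columns.tolist())
```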
13
Q
missing data -> delete rows
A
- use it when critical information (features) per row is missing
!!! NOTE: - make sure there is enough data left before proceeding further
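A sketch of dropping rows that lack critical information, assuming pandas and treating a hypothetical "target" column as the critical one:

```python
import pandas as pd

# Hypothetical table: the second row has no target value.
df = pd.DataFrame({
    "target": [1.0, None, 0.0],
    "feature": [None, 5.0, 2.0],
})

# Drop only rows where the critical column is missing;
# rows with other gaps are kept for a later fill step.
kept_rows = df.dropna(subset=["target"])
print(len(kept_rows), "of", len(df), "rows left")
```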
14
Q
missing data -> imputation
A
- uses information from other columns to fill in missing values in a reasonable way (e.g. through correlations between features)
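One simple illustration of this idea is to predict a missing value from a correlated column with a straight-line fit. This is a sketch with made-up data, assuming pandas and NumPy; it is only one of many imputation approaches:

```python
import numpy as np
import pandas as pd

# Hypothetical data: weight is strongly correlated with height, one weight is missing.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 60.0, None, 80.0, 90.0],
})

# Fit a line on the rows where both values are present.
known = df.dropna(subset=["weight_kg"])
slope, intercept = np.polyfit(
    known["height_cm"].to_numpy(), known["weight_kg"].to_numpy(), deg=1
)

# Predict weight for every row, then use the prediction only where weight is missing.
predicted = slope * df["height_cm"] + intercept
df["weight_kg"] = df["weight_kg"].fillna(predicted)
print(df)
```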