Data_Quality (Brohrer) Flashcards

1
Q

choose your rows

A

each row must have one and only one target value (multilabel problems are an exception, since a row can carry several labels, but the row must still answer one precise question, as in the single-label case)

2
Q

inspect the data

A
  • spot problems in the data
  • identify the domain of values for each feature, e.g. with pandas DataFrame.describe()
  • review the documentation for all features
  • visualize the data (histograms, scatter plots, etc.) to look for anomalies, correlations, etc.
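A minimal sketch of such an inspection, assuming pandas is available. The toy DataFrame and its column names are hypothetical and only serve to show what the summaries reveal.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 230],                  # 230 looks like an entry error
    "city": ["Oslo", "Oslo", "Rome", "rome", "Rome"],
})

# Numeric summary: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)

# Domain of a categorical feature: inconsistent spellings show up here.
city_counts = df["city"].value_counts()
print(city_counts)

# Number of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

With matplotlib installed, df["age"].hist() would draw the histogram mentioned above.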
3
Q

correct the data

A
  • if the error is obvious, correct the data
  • otherwise, delete the value and leave it as missing
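Both cases can be sketched in pandas as follows. The data, the column names, and the valid age range of 0 to 120 are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two kinds of problems.
df = pd.DataFrame({
    "city": ["Rome", "rome ", "Oslo"],   # same city, inconsistent spelling
    "age": [34.0, -1.0, 230.0],          # impossible ages
})

# Obvious error: the intended value is clear, so correct it.
df["city"] = df["city"].str.strip().str.title()

# Not obvious: the true age is unknown, so leave it as missing (NaN).
# The valid range 0..120 is an assumption for this example.
df["age"] = df["age"].where(df["age"].between(0, 120), np.nan)

print(df)
```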

4
Q

missing data -> mean

A
  • assumes the missing values follow the same distribution as the observed ones
  • a.k.a. Missing Completely at Random (MCAR)
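A small sketch of mean filling with pandas, using made-up temperature readings:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing value.
temps = pd.Series([20.0, 22.0, np.nan, 24.0], name="temperature")

# Series.mean() skips NaN by default: (20 + 22 + 24) / 3.
mean_value = temps.mean()

# Replace every missing entry with the mean of the observed ones.
filled = temps.fillna(mean_value)
print(filled)
```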
5
Q

missing data -> median, mode

A
  • the median is a robust statistic, useful when the data contains outliers
  • the mode is used when dealing with categorical variables
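The sketch below fills a numeric column with its median and a categorical column with its mode. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical columns; 1000.0 is an outlier that would distort the mean.
income = pd.Series([30.0, 32.0, 35.0, 1000.0, np.nan])
color = pd.Series(["red", "blue", "red", None])

# The median of the observed values is barely affected by the outlier.
income_filled = income.fillna(income.median())

# mode() returns a Series (there may be ties); take the first entry.
color_filled = color.fillna(color.mode().iloc[0])

print(income_filled.tolist())
print(color_filled.tolist())
```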
6
Q

missing data -> interpolated value

A
  • interpolate the value when dealing with ordered data
  • useful with time-series data
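A minimal sketch with pandas on an evenly spaced daily series. The dates and readings are hypothetical, and linear interpolation is just one of the available methods.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with two gaps.
idx = pd.date_range("2024-01-01", periods=5, freq="D")
readings = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0], index=idx)

# Linear interpolation fills each gap from its two neighbours.
filled = readings.interpolate(method="linear")
print(filled)
```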

7
Q

missing data -> constant

A
  • when the data is Missing Not at Random, e.g. people withhold their income because it is very ‘high’; filling with a high constant then represents the missing values better than an average would, even if it is not precise
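A sketch of constant filling for the income example. HIGH_INCOME is a hypothetical value chosen for illustration; in practice it would come from domain knowledge.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; high earners are assumed to have declined to answer.
income = pd.Series([30_000.0, 45_000.0, np.nan, 52_000.0, np.nan])

# Hypothetical constant standing in for "very high".
HIGH_INCOME = 150_000.0

filled = income.fillna(HIGH_INCOME)
print(filled.tolist())
```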
8
Q

missing data -> missing rank

A
  • when the data is a ranking with no duplicates or gaps, so the missing value can only be the rank that is not yet used
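One way to sketch this, assuming ranks run from 1 to the number of rows and exactly one is missing. The data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical ranking of five items with one rank missing.
ranks = pd.Series([3.0, 1.0, np.nan, 5.0, 2.0])

# With no duplicates or gaps, the ranks must be exactly 1..n.
expected = set(range(1, len(ranks) + 1))
observed = set(ranks.dropna().astype(int).tolist())
unused = sorted(expected - observed)

# Only unambiguous when a single value is missing.
if ranks.isna().sum() == 1 and len(unused) == 1:
    filled = ranks.fillna(unused[0])
else:
    filled = ranks

print(filled.tolist())
```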
9
Q

missing data -> 0

A
  • a reasonable choice when dealing with numerical data
  • with categorical data, use it as a (key, value) pair such as (0, ‘missing’)
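Both uses of 0 can be sketched with pandas. The columns are hypothetical, and shifting the category codes by one is just one possible way to reserve key 0 for ‘missing’.

```python
import numpy as np
import pandas as pd

# Numerical case: hypothetical purchase counts.
purchases = pd.Series([2.0, np.nan, 5.0])
purchases_filled = purchases.fillna(0)

# Categorical case: pandas codes missing values as -1,
# so adding 1 makes code 0 stand for 'missing'.
color = pd.Series(["red", None, "blue"], dtype="category")
codes = color.cat.codes + 1

# (key, value) lookup table with 0 reserved for 'missing'.
labels = {0: "missing"}
labels.update({i + 1: c for i, c in enumerate(color.cat.categories)})

print(purchases_filled.tolist())
print(codes.tolist(), labels)
```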

10
Q

missing data -> 0 - confidence

A
  • if confidence in using 0 is low, add a separate column/feature with a boolean flag that marks where the value was missing
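A minimal sketch of the flag column. The column names are hypothetical; the important detail is recording the flag before filling.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with a missing value.
df = pd.DataFrame({"purchases": [2.0, np.nan, 5.0]})

# Record where values were missing BEFORE overwriting them.
df["purchases_missing"] = df["purchases"].isna()

# Now fill with 0; the flag keeps real zeros and filled zeros distinguishable.
df["purchases"] = df["purchases"].fillna(0)
print(df)
```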
11
Q

missing data -> ‘missing’

A
  • use such an indicator value or similar for categorical data
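For illustration, with a made-up categorical column:

```python
import pandas as pd

# Hypothetical categorical feature with missing entries.
color = pd.Series(["red", None, "blue", None])

# Treat "missing" as a category of its own.
filled = color.fillna("missing")
counts = filled.value_counts()
print(counts)
```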
12
Q

missing data -> delete columns

A
  • use it when the number of missing values in a column is too high for the column to be useful (it is uninformative)
  • NOTE: make sure there is enough data left before proceeding further
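A sketch of dropping sparse columns. The 50% threshold MAX_MISSING is an assumption for this example, not a recommended value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "b" is mostly missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    "c": [1.0, np.nan, 3.0, 4.0],         # 25% missing
})

# Hypothetical threshold: drop columns with more than half the values missing.
MAX_MISSING = 0.5

# isna().mean() gives the fraction of missing values per column.
missing_fraction = df.isna().mean()
kept = df.loc[:, missing_fraction <= MAX_MISSING]

print(list(kept.columns), kept.shape)
```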
13
Q

missing data -> delete rows

A
  • use it when critical information (features) per row is missing
  • NOTE: make sure there is enough data left before proceeding further
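A sketch of dropping rows that lack critical fields, followed by a size check. The choice of critical columns and the MIN_ROWS threshold are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "target" is treated as the critical column.
df = pd.DataFrame({
    "target": [1.0, np.nan, 0.0, 1.0],
    "feature": [0.5, 0.7, np.nan, 0.9],
})

CRITICAL = ["target"]   # hypothetical list of must-have columns
MIN_ROWS = 3            # hypothetical minimum usable sample size

# Drop only rows where a critical column is missing.
kept = df.dropna(subset=CRITICAL)

# Check that enough data is left before proceeding.
enough_data = len(kept) >= MIN_ROWS
print(len(kept), enough_data)
```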
14
Q

missing data -> imputation

A
  • a way to use information from other columns to fill in missing values in a reasonable way (e.g. by exploiting correlations between features)
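One simple form of this can be sketched by fitting a straight line between two correlated columns and predicting the missing entries. The data and column names are invented, and a linear fit via numpy.polyfit is only one of many possible imputation models.

```python
import numpy as np
import pandas as pd

# Hypothetical data where weight is strongly correlated with height.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 60.0, np.nan, 80.0, 90.0],
})

# Fit weight ~ slope * height + intercept on rows where weight is known.
known = df["weight_kg"].notna()
slope, intercept = np.polyfit(
    df.loc[known, "height_cm"], df.loc[known, "weight_kg"], deg=1
)

# Predict for every row, then use predictions only where weight is missing.
predicted = slope * df["height_cm"] + intercept
df["weight_kg"] = df["weight_kg"].fillna(predicted)
print(df)
```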