Part 3: Data Preparation Flashcards

1
Q

What is the purpose of data preparation?

A

To convert acquired ‘raw’ data into valid, consistent data, using structures and representations that will make analysis straightforward

2
Q

Initial steps of data preparation

A

1) Explore the content, values, and overall shape of the data
2) Determine the purpose for which the data will be used
3) Determine the type and aims of the analysis to be applied to it

3
Q

Possible discovered problems with real data

A

1) Data is incorrectly packaged
2) Some values may not make sense
3) Some values may be missing
4) The format doesn’t seem right
5) The data doesn’t have the right structure for the tools and packages to be used with it

4
Q

Data preparation activities:

A

1) Data cleansing: removing or repairing obvious errors and inconsistencies in the dataset
2) Data integration: combining data
3) Data transformation: shaping datasets

5
Q

In data warehousing, the data preparation activities are known as:

A

ETL (Extract, Transform, Load): the process of taking data from operational systems and loading it into a data warehouse.
The terms data harmonization and data enhancement are also used.
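
The three ETL stages can be sketched in a few lines. This is only an illustration: the CSV export, the `orders` table, and the use of an in-memory SQLite database as a stand-in for a warehouse are all assumptions.

```python
import csv
import io
import sqlite3

# Extract: a hypothetical CSV export from an operational system.
export = io.StringIO("order_id,amount\n1,9.50\n2,12.00\n")
extracted = list(csv.DictReader(export))

# Transform: convert the text fields into typed values.
transformed = [(int(r["order_id"]), float(r["amount"])) for r in extracted]

# Load: insert into a warehouse table (in-memory SQLite stands in here).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?)", transformed)
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```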

6
Q

Classification of error types:

A

1) Validity
2) Accuracy
3) Completeness
4) Consistency
5) Uniformity

7
Q

Validity

A

Checking whether the data values match any specified constraints, value limits, and formats for the column in which they appear
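
A validity check can be sketched as one rule per column. The column names (`age`, `postcode`), the value limits, and the postcode pattern below are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical constraints: one rule per column.
RULES = {
    # Value limits: an integer between 0 and 120.
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    # Format: two letters, digit, space, digit, two letters.
    "postcode": lambda v: isinstance(v, str)
    and re.fullmatch(r"[A-Z]{2}\d \d[A-Z]{2}", v) is not None,
}

def invalid_fields(row):
    """Return the names of the columns whose value breaks its rule."""
    return [col for col, rule in RULES.items() if not rule(row.get(col))]

rows = [
    {"age": 34, "postcode": "MK7 6AA"},
    {"age": -5, "postcode": "MK7 6AA"},
    {"age": 51, "postcode": "unknown"},
]
problems = {i: invalid_fields(r) for i, r in enumerate(rows)}
```

Here `problems` maps each row index to the columns that failed, so a valid row maps to an empty list.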

8
Q

Accuracy

A

Checking whether the data values are correct. This requires some external ‘gold standard’ to check the values against
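
As a small sketch, an accuracy check compares each value with a trusted reference. The `GOLD` lookup and the records are made up for illustration; note that "Barcelona" is a perfectly valid city name and is still inaccurate here.

```python
# Hypothetical reference ('gold standard') data: country -> capital city.
GOLD = {"France": "Paris", "Spain": "Madrid", "Italy": "Rome"}

records = [("France", "Paris"), ("Spain", "Barcelona"), ("Italy", "Rome")]

# A record is inaccurate when it disagrees with the reference value.
inaccurate = [(country, capital) for country, capital in records
              if GOLD.get(country) != capital]
```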

9
Q

Completeness

A

Checking whether all the required values are present or whether any are missing
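
One way to sketch a completeness check is to count the missing values per column. The rows are illustrative, and treating both `None` and the empty string as missing is an assumption.

```python
rows = [
    {"name": "Ann", "email": "ann@example.org"},
    {"name": "Bob", "email": None},
    {"name": None, "email": ""},
]

def is_missing(value):
    # Assumption: None and empty strings both count as missing here.
    return value is None or value == ""

# Number of missing values in each column.
missing_counts = {
    col: sum(is_missing(r.get(col)) for r in rows) for col in ("name", "email")
}
```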

10
Q

Consistency

A

If two values should be the same but are not, then there is an inconsistency.
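
A consistency check can be sketched by grouping values that should agree and reporting the groups that do not. The customer IDs and dates of birth are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (customer id, date of birth). One customer should
# have exactly one date of birth.
records = [("C1", "1980-02-01"), ("C2", "1975-06-30"), ("C1", "1980-01-02")]

seen = defaultdict(set)
for customer_id, dob in records:
    seen[customer_id].add(dob)

# Customers with more than one distinct value are inconsistent.
inconsistent = sorted(c for c, dobs in seen.items() if len(dobs) > 1)
```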

11
Q

Uniformity

A

It is necessary to choose a base or canonical representation and translate all values to that form

12
Q

Data harmonization

A

A data cleansing activity that creates a common (aka canonical) form for non-uniform data. Mixed forms often occur when two or more data sources use different representations and are then combined
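
A minimal harmonization sketch, assuming dates arrive in a few known formats and that ISO 8601 (`YYYY-MM-DD`) is chosen as the canonical form. The format list and the helper name `to_canonical` are illustrative.

```python
from datetime import datetime

# Assumed input formats, tried in order.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

def to_canonical(text):
    """Translate a date string to ISO 8601, or None if it is unrecognised."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag the value rather than guess its meaning

mixed = ["2021-03-04", "04/03/2021", "4.3.2021", "March 4th"]
harmonized = [to_canonical(d) for d in mixed]
```

The first three inputs all become `2021-03-04`. The last one matches no known format and is flagged with `None`.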

13
Q

Approaches to handling dirty data

A
  • fix it – replace incorrect or missing values with the correct values
  • remove it – remove the value, or a group of values (or rows of data or data elements) from the dataset
  • replace it – substitute a default marker for the incorrect value, so that later processing can recognize it is dealing with inappropriate values
  • leave it – simply note that it was identified and leave it, hoping that its impact on subsequent processing is minimal.
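
The four approaches can be sketched side by side on a small list. The readings, the `-999.0` sentinel, and the `KNOWN_CORRECTIONS` lookup are all hypothetical.

```python
readings = [21.5, None, -999.0, 22.1]  # hypothetical temperatures in °C

# Hypothetical corrections: index -> value re-checked at the source.
KNOWN_CORRECTIONS = {2: 21.8}

def is_dirty(v):
    # Assumption: None is missing and anything below -100 is implausible.
    return v is None or v < -100

# fix it: use the correct value where one is known
fixed = [KNOWN_CORRECTIONS.get(i, v) if is_dirty(v) else v
         for i, v in enumerate(readings)]
# remove it: drop the dirty values
removed = [v for v in readings if not is_dirty(v)]
# replace it: substitute a default marker (None here)
replaced = [None if is_dirty(v) else v for v in readings]
# leave it: keep the data unchanged but note where the problems are
noted = [i for i, v in enumerate(readings) if is_dirty(v)]
```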
14
Q

Documenting data cleansing

A
  • It is necessary to document how the dirty data was identified and handled, and for what reason
  • It is also necessary to maintain the data in both raw and ‘cleaned’ form
  • If the data originally came from operational systems, it might be necessary to feed the findings back to the managers of these systems
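
A rough sketch of this practice: work on a copy so that the raw data survives, and log every change with its reason. The records and the log fields are illustrative assumptions.

```python
import copy

raw = [{"id": 1, "country": "UK"}, {"id": 2, "country": "U.K."}]
cleaned = copy.deepcopy(raw)  # the raw data is kept untouched
log = []                      # one entry per change: what, where, and why

for row in cleaned:
    if row["country"] == "U.K.":
        log.append({
            "id": row["id"],
            "column": "country",
            "old": row["country"],
            "new": "UK",
            "reason": "harmonized to canonical country code",
        })
        row["country"] = "UK"
```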
15
Q

What are the benefits of documenting data cleansing?

A
  1. Allows others to consider the changes made and ensure they were both valid and sensible.
  2. Helps to build a core of approaches and methods for the kinds of datasets that are frequently used.
  3. Allows managers of the operational systems where the data came from to adjust and improve their validation processes.
  4. Allows you, in time, to develop effective cleansing regimes for specialized data assets.
16
Q

Data cleansing activities include:

A

1) Data harmonization
2) Data laundering
3) Data obfuscating (aka Data anonymization)

17
Q

Data Laundering

A

Attempts to break the link between the dataset and its (valid) provenance

18
Q

Data obfuscating (data anonymization)

A

It is the process of removing the link between sensitive data and the real-world entities to which it applies, while retaining the value and usefulness of the data at the same time.
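
One common building block is to replace a direct identifier with a keyed pseudonym, so that records for the same person can still be linked. This is a sketch under assumptions: the key, the field names, and the 12-character truncation are hypothetical, and pseudonymizing a name does not on its own guarantee anonymity, since other columns may still identify a person.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-secret-key"  # hypothetical; stored separately

def pseudonym(identifier):
    """Derive a stable pseudonym from an identifier using a keyed hash."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]

patients = [
    {"name": "Ann Lee", "age": 34, "diagnosis": "asthma"},
    {"name": "Ann Lee", "age": 34, "diagnosis": "eczema"},
]

# Drop the name and keep the analytically useful columns.
obfuscated = [
    {"pid": pseudonym(p["name"]), "age": p["age"], "diagnosis": p["diagnosis"]}
    for p in patients
]
```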

19
Q

What is the difference between data laundering, data obfuscating, and data cleansing itself?

A

The key difference between these activities and data cleansing itself lies in their aims:
1) in data cleansing, we are trying to document and maintain the full provenance of the dataset
2) in data laundering, we want to lose its history
3) in data obfuscating, we are trying to produce anonymized but useful data

20
Q

The several interpretations of ‘more data’

A

-more of the same, a bigger dataset with more data elements (a longer table, one with more rows)
-more data about a data element we already have (a wider table, one with more columns)
-more datasets (more tables).
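
The three interpretations can be sketched with a table held as a list of rows. The cities, population figures, and `visits` table are made up for illustration.

```python
table = [{"id": 1, "city": "Leeds"}, {"id": 2, "city": "York"}]

# More of the same: extra rows make the table longer.
longer = table + [{"id": 3, "city": "Hull"}]

# More about elements we already have: extra columns make the table wider.
population = {1: 812000, 2: 202800}  # hypothetical figures
wider = [{**row, "population": population[row["id"]]} for row in table]

# More datasets: a separate table that can be linked by key.
tables = {"cities": table, "visits": [{"city_id": 1, "year": 2020}]}
```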

21
Q

Give the ways to flag invalid entries:

A

1) Null marker
2) Not a Number (NaN) and Not a Time (NaT)
3) None value
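
A short sketch of two of these markers in plain Python. NaT is supplied by libraries such as pandas and NumPy and is not shown here. The point to notice is that NaN never compares equal to anything, including itself, so it has to be detected with a dedicated test.

```python
import math

nan = float("nan")
values = [3.0, None, nan]

# NaN is not equal to itself, so `v == nan` cannot be used to find it.
nan_equals_itself = (nan == nan)

# Detect both markers: test for None first, because math.isnan(None) raises.
flagged = [v is None or math.isnan(v) for v in values]
```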

22
Q

Treating missing or invalid data

A

There is no consistent or automatic way to handle the full range of semantic interpretations of missing values, so we have to:
1) treat them with care
2) decide what they represent
3) decide how they can be interpreted
4) decide how they can best be cleaned so that subsequent processing and analysis do not lead to logical errors
- Much will depend on how the chosen libraries and packages handle missing data.
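
A small sketch of why the interpretation matters: the same list of scores gives three different means depending on how the missing value is treated. The scores are made up, and none of the three treatments is claimed to be the right one in general.

```python
import math

nan = float("nan")
scores = [4.0, nan, 8.0]  # hypothetical scores with one missing value

# Do nothing: NaN propagates through the arithmetic.
naive_mean = sum(scores) / len(scores)

# Interpret missing as "not measured": ignore it.
present = [s for s in scores if not math.isnan(s)]
mean_ignoring = sum(present) / len(present)

# Interpret missing as "scored nothing": count it as zero.
mean_as_zero = sum(0.0 if math.isnan(s) else s for s in scores) / len(scores)
```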