Part 3: Data Preparation Flashcards
What is the purpose of data preparation?
To convert acquired ‘raw’ data into valid, consistent data, using structures and representations that will make analysis straightforward
Initial steps of data preparation
1)Explore the content, values, and the overall shape of the data (see the sketch after this list)
2)Determine the purpose for which the data will be used
3)Determine the type and aims of the analysis to be applied to it
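A minimal sketch of the exploration step using Python and pandas; it assumes the raw data has already been loaded, and the column names and values are invented purely for illustration:

```python
import pandas as pd

# Invented example data standing in for freshly acquired 'raw' data.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [34, 29, None, 131],                   # one missing, one implausible value
    "country": ["UK", "U.K.", "France", "FR"],    # mixed representations
})

# Overall shape: how many rows and columns are we dealing with?
print(df.shape)

# Content and types: what columns exist and how are they stored?
print(df.dtypes)

# Values: summary statistics and a first look at the actual rows.
print(df.describe(include="all"))
print(df.head())
```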
Possible problems discovered in real data
1)Data is incorrectly packaged
2)Some values may not make sense
3)Some values may be missing
4)The format doesn’t seem right
5)The data doesn’t have the right structure for the tools and packages to be used with it (see the checks sketched after this list)
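A minimal sketch, again using pandas on invented data, of quick checks that surface several of these problems (missing values, values that don’t make sense, and format mismatches):

```python
import pandas as pd

# Invented example data containing several of the problems listed above.
df = pd.DataFrame({
    "age": [34, 29, None, 131],                  # missing and nonsensical values
    "signup_date": ["2021-03-01", "01/04/2021", "2021-05-12", "not recorded"],
    "country": ["UK", "U.K.", "France", "FR"],
})

# Missing values: count them per column.
print(df.isna().sum())

# Values that may not make sense: flag ages outside a plausible range
# (missing ages surface here as well).
print(df[~df["age"].between(0, 120)])

# Format problems: dates that do not parse with the expected format.
parsed = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
print(df[parsed.isna()])
```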
Data preparation activities:
1)Data cleansing: remove or repair obvious errors and inconsistencies in the dataset
2)Data integration: combining data from multiple sources into a single dataset
3)Data transformation: reshaping datasets into the structure needed for analysis (a combined sketch follows this list)
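A minimal combined sketch of the three activities, using pandas on two invented source tables:

```python
import pandas as pd

# Invented customer and order tables standing in for two separate sources.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": [" Alice ", "Bob", "Bob", "Carol"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [10.0, 25.0, 7.5],
})

# Data cleansing: remove duplicate rows and repair an obvious formatting problem.
customers = customers.drop_duplicates()
customers["name"] = customers["name"].str.strip()

# Data integration: combine the two sources on their shared key.
combined = customers.merge(orders, on="customer_id", how="left")

# Data transformation: reshape into one row per customer with a total spend.
per_customer = combined.groupby("name", as_index=False)["amount"].sum()
print(per_customer)
```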
In data warehousing, the data preparation activities are known as:
ETL (Extract, Transform, Load) is used for the process of taking data from operational systems and loading it into the warehouse
The terms data harmonization and data enhancement are also used
Classification of error types:
1)Validity
2)Accuracy
3)Completeness
4)Consistency
5)Uniformity
Validity
Checking whether the data values match any specified constraints, value limits, and formats for the column in which they appear
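A minimal validity-check sketch in pandas; the age range limit and postcode pattern below are assumed constraints, chosen only for illustration:

```python
import pandas as pd

# Invented records; the constraints are assumed for illustration:
# age must lie in 0-120 and postcode must match a simple pattern.
df = pd.DataFrame({
    "age": [34, -5, 200, 41],
    "postcode": ["AB1 2CD", "12345", "XY9 8ZW", "???"],
})

# Validity: do the values satisfy the specified limits and formats?
valid_age = df["age"].between(0, 120)
valid_postcode = df["postcode"].str.match(r"^[A-Z]{2}\d \d[A-Z]{2}$")

# Rows violating at least one constraint.
print(df[~(valid_age & valid_postcode)])
```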
Accuracy
Checking the correctness of values, which requires some external ‘gold standard’ to check them against
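A minimal sketch of an accuracy check in pandas; both the observed data and the ‘gold standard’ reference table are invented for illustration:

```python
import pandas as pd

# Invented dataset and an equally invented external 'gold standard' lookup.
observed = pd.DataFrame({
    "city": ["London", "Paris", "Berlin"],
    "population_m": [8.8, 11.0, 3.7],     # recorded values
})
gold_standard = pd.DataFrame({
    "city": ["London", "Paris", "Berlin"],
    "population_m": [8.8, 2.1, 3.7],      # trusted reference values
})

# Accuracy: a value can be valid (in range, well formatted) yet still wrong;
# only comparison against the reference source reveals the error in the Paris row.
check = observed.merge(gold_standard, on="city", suffixes=("_obs", "_ref"))
print(check[check["population_m_obs"] != check["population_m_ref"]])
```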
Completeness
Checking whether all the values are present and whether any are missing
Consistency
If two values should be the same but are not, then there is an inconsistency.
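A minimal consistency-check sketch in pandas, using an invented rule that a line total should equal unit price times quantity:

```python
import pandas as pd

# Invented order records where two fields should agree by definition.
df = pd.DataFrame({
    "unit_price": [2.0, 5.0, 3.0],
    "quantity": [3, 2, 4],
    "line_total": [6.0, 10.0, 15.0],   # last row disagrees with price * quantity
})

# Consistency: line_total should equal unit_price * quantity;
# flag rows where the two derivations of the same fact do not match.
inconsistent = df[df["line_total"] != df["unit_price"] * df["quantity"]]
print(inconsistent)
```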
Uniformity
It is necessary to choose a base or
canonical representation and translate all values to that
form
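A minimal uniformity sketch in pandas, assuming weights recorded in a mix of kilograms and pounds, with kilograms chosen as the canonical unit:

```python
import pandas as pd

# Invented measurements recorded in a mixture of units.
df = pd.DataFrame({
    "weight": [70.0, 154.0, 65.0],
    "unit": ["kg", "lb", "kg"],
})

# Uniformity: choose kilograms as the canonical representation and
# translate every value to that form.
LB_TO_KG = 0.453592
df["weight_kg"] = df["weight"].where(df["unit"] == "kg", df["weight"] * LB_TO_KG)
print(df)
```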
Data harmonization
A data cleansing activity for creating a common (aka canonical) form for non-uniform data. Mixed forms often occur when two or more data sources use different representations and are then combined
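A minimal harmonization sketch in pandas; the country-name variants and the canonical mapping are invented for illustration:

```python
import pandas as pd

# Invented records combined from two sources that spell the same country
# in different ways.
df = pd.DataFrame({
    "country": ["UK", "U.K.", "United Kingdom", "FR", "France"],
})

# Harmonization: map every known variant onto a single canonical form.
canonical = {
    "UK": "United Kingdom",
    "U.K.": "United Kingdom",
    "United Kingdom": "United Kingdom",
    "FR": "France",
    "France": "France",
}
df["country"] = df["country"].map(canonical)
print(df["country"].value_counts())
```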
Approaches to handling dirty data
- fix it – replace incorrect or missing values with the correct values
- remove it – remove the value, or a group of values (or rows of data or data elements) from the dataset
- replace it – substitute a default marker for the incorrect value, so that later processing can recognize it is dealing with inappropriate values
- leave it – simply note that it was identified and leave it, hoping that its impact on subsequent processing is minimal (the sketch after this list illustrates the first three approaches)
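A minimal sketch of the ‘fix it’, ‘remove it’, and ‘replace it’ approaches in pandas, on invented data:

```python
import pandas as pd

# Invented data with one impossible age and one missing country.
df = pd.DataFrame({
    "age": [34, -5, 29],
    "country": ["UK", "France", None],
})

# Fix it: replace the incorrect value with the correct one
# (here assumed to have been looked up in the source system).
df.loc[df["age"] < 0, "age"] = 5

# Remove it: drop rows that are missing a value we cannot do without.
removed = df.dropna(subset=["country"])

# Replace it: substitute a recognisable default marker so later processing
# knows it is dealing with an inappropriate value.
marked = df.fillna({"country": "UNKNOWN"})

print(removed, marked, sep="\n")
```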
Documenting data cleansing
- It is necessary to:
- document how the dirty data was identified and handled, and for what reason
- maintain the data in both raw and ‘cleaned’ form
- If the data originally came from operational systems, it might be necessary to feed the findings back to the managers of these systems (a simple change-log sketch follows this list)
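A minimal sketch, in pandas, of keeping the raw and cleaned forms side by side and logging what was changed; the file names and log fields are placeholders:

```python
import pandas as pd

# Invented raw data; the file names below are placeholders.
raw = pd.DataFrame({"age": [34.0, -5.0, 29.0]})

# Keep the raw form untouched and apply changes to a copy.
cleaned = raw.copy()
cleaned.loc[cleaned["age"] < 0, "age"] = float("nan")

# Record what was changed, how it was identified, and why.
log = [{
    "column": "age",
    "issue": "negative value failed the 0-120 range check",
    "action": "replaced with a missing-value marker",
    "rows_affected": int((raw["age"] < 0).sum()),
}]

raw.to_csv("data_raw.csv", index=False)
cleaned.to_csv("data_cleaned.csv", index=False)
pd.DataFrame(log).to_csv("cleansing_log.csv", index=False)
```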
What are the benefits of documenting data cleansing?
- Allows others to consider the changes made and ensure they were both valid and sensible.
- Helps to build a core of approaches and methods for the kinds of datasets that are frequently used.
- Allows managers of the operational systems where the data came from to adjust and improve their validation processes.
- Allows you, in time, to develop effective cleansing regimes for specialized data assets.