Chapter 19 Flashcards

Question 1

Q

What are the components of ETL?

Answer

A

Extraction, Transformation, and Loading

Question 2

Q

What is data cleansing?

Answer

A

Dirty data should be removed or corrected before the data goes into the data warehouse

Question 3

Q

What is GIGO?

Answer

A

Stands for “Garbage In, Garbage Out.” GIGO is a computer science acronym that implies bad input will result in bad output.

Question 4

Q

What is dirty data?

Answer

A

It is a relative term. It means the data does not conform to the values expected for it.

Question 5

Q

Who decides whether data is dirty or clean?

Answer

A

A person who has domain knowledge

Question 6

Q

What is a toddler employee?

Answer

A

An example of dirty data: an employee recorded as far too young to hold a job

Question 7

Q

What is an unborn employee?

Answer

A

An employee whose date of birth (DOB) is later than the date of joining
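
A minimal sketch of how the toddler and unborn employee checks might look in code. The function name `classify_employee` and the `MIN_WORKING_AGE` threshold are assumptions for illustration; the real cutoff would come from whoever holds the domain knowledge.

```python
from datetime import date

# Assumed threshold for illustration; the real value is a domain decision.
MIN_WORKING_AGE = 16


def classify_employee(dob: date, joined: date) -> str:
    """Label a record as 'unborn', 'toddler', or 'ok'."""
    # Unborn employee: born after the joining date.
    if dob > joined:
        return "unborn"
    # Age in whole years on the joining date.
    had_birthday = (joined.month, joined.day) >= (dob.month, dob.day)
    age = joined.year - dob.year - (0 if had_birthday else 1)
    # Toddler employee: too young to hold a job.
    if age < MIN_WORKING_AGE:
        return "toddler"
    return "ok"
```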

Question 8

Q

What is the government decision-making problem?

Answer

A

The government invests where there is no need, which is a loss of money

Question 9

Q

What is the direct mail marketing problem?

Answer

A

A failed advertising campaign and a loss of money

Question 10

Q

What is the lighter side of dirty data?

Answer

A

Toddler employee
Unborn employee

Question 11

Q

What are the 3 classes of anomalies?

Answer

A

Syntactically dirty data
Semantically dirty data
Coverage anomalies

Question 12

Q

What are the subclasses of syntactically dirty data?

Answer

A

Lexical errors
Irregularities

Question 13

Q

What are the subclasses of semantically dirty data?

Answer

A

Integrity constraint violation
Business rule contradiction
Duplication

Question 14

Q

What are coverage anomalies?

Answer

A

Missing attributes
Missing records

Question 15

Q

What are lexical errors?

Answer

A

Problems in the structure of the data and in how it is stored

Question 16

Q

What are irregularities?

Answer

A

A missing unit (e.g. a salary column contains 2000 and we do not know whether it is PKR, USD, or something else)

Question 17

Q

What is an integrity constraint violation?

Answer

A

Integrity constraint violations occur when an insert, update, or delete statement violates a primary key, foreign key, check, or unique constraint or a unique index.
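
A small sketch of two such violations using Python's built-in `sqlite3` module. The `department` and `employee` tables and the `try_insert` helper are hypothetical, made up for this example.

```python
import sqlite3


def try_insert(conn, emp_id, dept_id):
    """Attempt an insert and report whether a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO employee (emp_id, dept_id) VALUES (?, ?)",
            (emp_id, dept_id),
        )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " dept_id INTEGER NOT NULL REFERENCES department(dept_id))"
)
conn.execute("INSERT INTO department VALUES (10)")

print(try_insert(conn, 1, 10))  # valid row
print(try_insert(conn, 1, 10))  # duplicate primary key
print(try_insert(conn, 2, 99))  # foreign key points at a missing department
```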

Question 18

Q

What is a business rule contradiction?

Answer

A

It is a violation of a business rule

Question 19

Q

How can we handle coverage anomalies?

Answer

A

Remove the records that have the problem
Manual data feeding
Use a global constant (pick one global value and use it wherever a value is missing)
Replace the missing value with the most probable value
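
A rough sketch of three of these options (manual feeding has no code equivalent). The helper names are hypothetical, and "most probable value" is taken here to mean the most frequent known value, which is an assumption.

```python
from collections import Counter


def drop_incomplete(records, field):
    """Remove records where the field is missing."""
    return [r for r in records if r.get(field) is not None]


def fill_constant(records, field, constant):
    """Use one global constant wherever the field is missing."""
    return [
        {**r, field: constant if r.get(field) is None else r[field]}
        for r in records
    ]


def fill_most_probable(records, field):
    """Fill missing values with the most frequent known value (assumed meaning)."""
    known = [r[field] for r in records if r.get(field) is not None]
    if not known:
        # Nothing to infer from; return copies of the records unchanged.
        return [dict(r) for r in records]
    most_common = Counter(known).most_common(1)[0][0]
    return fill_constant(records, field, most_common)
```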

Question 20

Q

What are the 2 key-based problems?

Answer

A

Primary key problems
Non-primary key problems

Question 21

Q

What are primary key problems?

Answer

A

Same key but different data
Same entity with different keys
PK in one system but not in other
Same PK but in different formats

Question 22

Q

What are non-primary key problems?

Answer

A

Different encodings in different sources (e.g. M/F in one source and male/female in another)
Multiple ways to represent the same information
Sources might contain invalid data
Two fields with different data but same name
Required field left empty
Data incomplete
Data contains null values
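
A minimal sketch of fixing the first problem, different encodings across sources, by mapping every variant to one code. The `GENDER_CODES` table and `normalize_gender` function are hypothetical names for illustration.

```python
# Assumed mapping; extend it with whatever variants the sources actually use.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def normalize_gender(value):
    """Map source-specific encodings to 'M' or 'F'; return None if unrecognized."""
    if value is None:
        return None
    return GENDER_CODES.get(str(value).strip().lower())
```

Returning `None` for unrecognized input keeps invalid source values visible, so they can be handled like any other missing value.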

Question 23

Q

What are the 4 methods of automated data cleansing?

Answer

A

1- Association rules (make rules based on statistical properties)
2- Pattern based (find values that do not fit the pattern)
3- Statistical (with the help of the mean value etc.)
4- Clustering (group similar values together; anomalies are left alone)
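
A small sketch of the statistical method, flagging values that lie far from the mean. The function name `flag_outliers` and the default threshold of 2 standard deviations are assumptions, not a fixed rule.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


salaries = [50, 52, 49, 51, 50, 48, 52, 500]
print(flag_outliers(salaries))  # the 500 entry is the only value flagged
```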

(23 cards)