Chapter 19 Flashcards

1
Q

What are the components of ETL?

A

Extraction, transformation, and loading

2
Q

What is data cleansing?

A

Removing or correcting dirty data before it goes into the data warehouse

3
Q

What is GIGO?

A

Stands for “Garbage In, Garbage Out.” GIGO is a computer science acronym that implies bad input will result in bad output.

4
Q

What is dirty data?

A

It is a relative term. It means data that does not conform to its expected value.

5
Q

Who decides whether data is dirty or clean?

A

A person who has domain knowledge

6
Q

What is a toddler employee?

A

An example of dirty data: an employee who is recorded as far too young to hold a job

7
Q

What is an un-born employee?

A

An employee whose date of joining is earlier than the date of birth
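
A minimal sketch of how the toddler and un-born checks could be automated. It treats an un-born employee as a record whose joining date falls before the date of birth, and a toddler employee as one younger than a minimum working age on the joining date. The threshold `MIN_WORKING_AGE`, the function name, and the sample records are assumptions for illustration; in practice a person with domain knowledge would set the rule.

```python
from datetime import date

MIN_WORKING_AGE = 16  # assumed threshold, not taken from the flashcards


def classify_employee(dob: date, joined: date) -> str:
    """Return 'unborn', 'toddler', or 'ok' for one employee record."""
    if joined < dob:
        return "unborn"  # joined before being born
    # Age in whole years on the joining date.
    age = joined.year - dob.year - ((joined.month, joined.day) < (dob.month, dob.day))
    if age < MIN_WORKING_AGE:
        return "toddler"  # too young to hold a job
    return "ok"


# Hypothetical records: (name, date of birth, date of joining)
records = [
    ("A", date(1990, 5, 1), date(2012, 3, 15)),
    ("B", date(2010, 1, 1), date(2012, 6, 1)),
    ("C", date(2015, 7, 4), date(2012, 6, 1)),
]
labels = {name: classify_employee(dob, joined) for name, dob, joined in records}
print(labels)
```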

8
Q

How does dirty data affect government decision making?

A

The government invests where there is no need, which is a loss of money

9
Q

How does dirty data affect direct mail marketing?

A

The advertising campaign fails, which is a loss of money

10
Q

What are examples of the lighter side of dirty data?

A
  • Toddler employee
  • Un-born employee

11
Q

What are the 3 classes of anomalies?

A
  • Syntactically dirty data
  • Semantically dirty data
  • Coverage anomalies
12
Q

What are the subclasses of syntactically dirty data?

A
  • Lexical errors
  • Irregularities

13
Q

What are the subclasses of semantically dirty data?

A
  • Integrity constraint violation
  • Business rule contradiction
  • Duplication
14
Q

What are coverage anomalies?

A
  • Missing attributes
  • Missing records

15
Q

What are lexical errors?

A

Problems in the structure of the data as it is stored, i.e. the stored data does not match its expected format

16
Q

What are irregularities?

A

A missing unit (e.g. a salary column contains 2000 and we do not know whether it is in PKR, USD, or something else)

17
Q

What is an integrity constraint violation?

A

Integrity constraint violations occur when an insert, update, or delete statement violates a primary key, foreign key, check, or unique constraint or a unique index.
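
As an illustration, the sketch below triggers one violation of each kind against an in-memory SQLite database using Python's standard `sqlite3` module. The `dept` and `emp` tables and their columns are hypothetical, and other database systems report these errors in their own way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute(
    "CREATE TABLE emp ("
    " id INTEGER PRIMARY KEY,"
    " salary INTEGER CHECK (salary > 0),"
    " dept_id INTEGER REFERENCES dept(id))"
)
conn.execute("INSERT INTO dept VALUES (1, 'Sales')")
conn.execute("INSERT INTO emp VALUES (1, 2000, 1)")

# Each statement below breaks exactly one constraint.
bad_statements = {
    "primary key": "INSERT INTO emp VALUES (1, 3000, 1)",   # id 1 already exists
    "foreign key": "INSERT INTO emp VALUES (2, 3000, 99)",  # dept 99 does not exist
    "check": "INSERT INTO emp VALUES (3, -5, 1)",           # salary must be > 0
    "unique": "INSERT INTO dept VALUES (2, 'Sales')",       # name 'Sales' already used
}

violations = []
for kind, sql in bad_statements.items():
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        violations.append(kind)  # the database rejected the statement

print(violations)
```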

18
Q

What is a business rule contradiction?

A

It is a violation of a business rule

19
Q

How can we handle coverage anomalies?

A
  • Remove the records that have the problem
  • Enter the missing data manually
  • Use a global constant (pick one value and use it wherever a value is missing)
  • Replace the missing value with the most probable value
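
Three of the options above can be sketched in a few lines; manual entry is left out because it is not automated. The sample rows, the `UNKNOWN` placeholder, and the function names are assumptions for illustration, and "most probable value" is approximated here by the most frequent value in the column.

```python
from collections import Counter

GLOBAL_CONSTANT = "UNKNOWN"  # assumed placeholder value

# Hypothetical rows; None marks a missing attribute.
rows = [
    {"name": "A", "city": "Lahore"},
    {"name": "B", "city": None},
    {"name": "C", "city": "Lahore"},
    {"name": "D", "city": "Karachi"},
]


def drop_incomplete(rows, field):
    """Option 1: remove the records where the field is missing."""
    return [r for r in rows if r[field] is not None]


def fill_constant(rows, field, value):
    """Option 3: put one global constant wherever the field is missing."""
    return [{**r, field: value if r[field] is None else r[field]} for r in rows]


def fill_most_probable(rows, field):
    """Option 4: fill with the most frequent known value (a simple stand-in)."""
    counts = Counter(r[field] for r in rows if r[field] is not None)
    most_common_value = counts.most_common(1)[0][0]
    return fill_constant(rows, field, most_common_value)


dropped = drop_incomplete(rows, "city")
with_constant = fill_constant(rows, "city", GLOBAL_CONSTANT)
with_mode = fill_most_probable(rows, "city")
print(len(dropped), with_constant[1]["city"], with_mode[1]["city"])
```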
20
Q

What are the 2 kinds of key-based problems?

A
  • Primary key problems
  • Non-primary key problems

21
Q

What are primary key problems?

A
  • Same key but different data
  • Same entity with different keys
  • PK in one system but not in the other
  • Same PK but in different formats
22
Q

What are non-primary key problems?

A
  • Different encoding in different sources (e.g. M/F in one source and male/female in another)
  • Multiple ways to represent the same information
  • Sources might contain invalid data
  • Two fields with different data but same name
  • Required field left empty
  • Incomplete data
  • Data contains null values
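
The encoding problem can be handled by mapping every source onto one agreed code. A minimal sketch, assuming a hypothetical mapping table `GENDER_CODES` and treating empty or null values as missing and anything unrecognised as invalid:

```python
# Hypothetical mapping from each source's encoding to one target code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def normalize_gender(value):
    """Map a raw value to 'M' or 'F'; None if missing, 'INVALID' if unknown."""
    if value is None:
        return None  # null value in the source
    key = value.strip().lower()
    if key == "":
        return None  # required field left empty
    return GENDER_CODES.get(key, "INVALID")  # source contains invalid data


source_a = ["M", "F", "M"]                    # one source uses M/F
source_b = ["male", "Female", "", None, "x"]  # another uses male/female
merged = [normalize_gender(v) for v in source_a + source_b]
print(merged)
```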
23
Q

What are the 4 methods of automatic data cleansing?

A

1- Association rules (make rules based on statistical properties of the data)
2- Pattern based (find values that do not follow the pattern)
3- Statistical (with the help of the mean value etc.)
4- Clustering (group similar values together; anomalies are left on their own)
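
A minimal sketch of the statistical method, assuming a value counts as an anomaly when it lies more than a chosen number of standard deviations from the mean. The salary figures and the threshold of 2.0 are assumptions for illustration; with very small samples a larger threshold may never be reached.

```python
from statistics import mean, pstdev

Z_THRESHOLD = 2.0  # assumed cut-off, tune for the data at hand


def find_outliers(values, threshold=Z_THRESHOLD):
    """Return the values whose distance from the mean exceeds threshold * std dev."""
    centre = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - centre) / spread > threshold]


salaries = [2000, 2100, 1950, 2050, 2000, 20000]  # hypothetical column
outliers = find_outliers(salaries)
print(outliers)
```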