Chapter 19 Flashcards
What are the parts of ETL
Extraction, Transformation, and Loading
What is data cleansing
Dirty data should be corrected or removed before it goes into the data warehouse
What is GIGO
Stands for “Garbage In, Garbage Out.” GIGO is a computer science acronym that implies bad input will result in bad output.
What is dirty data
It is a relative term. It means data that does not conform to its expected values.
Who decides whether data is dirty or clean
A person who has domain knowledge
What is a toddler employee
An example of dirty data: an employee whose recorded age is too young to hold a job
What is an un-born employee
An employee whose date of birth is later than the date of joining
What is the govt decision making problem
Dirty data leads the govt to invest where there is no need, which is a loss of money
What is the direct mail marketing problem
Dirty data makes the advertisement campaign fail, which is a loss of money
What is the lighter side of dirty data
- Toddler Employee
- Un-born Employee
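Both cases can be caught with a simple date check. The sketch below is only an illustration: the function names and the minimum working age of 18 are assumptions, and the real threshold is a decision for someone with domain knowledge.

```python
from datetime import date

MIN_WORKING_AGE = 18  # assumed threshold, not taken from the flashcards


def age_on(day, dob):
    """Whole years between the date of birth and the given day."""
    years = day.year - dob.year
    if (day.month, day.day) < (dob.month, dob.day):
        years -= 1  # birthday not yet reached in that year
    return years


def classify_employee(dob, joined):
    """Label a record as 'un-born employee', 'toddler employee', or 'ok'."""
    if dob > joined:
        # Joined before being born: the dates contradict each other.
        return "un-born employee"
    if age_on(joined, dob) < MIN_WORKING_AGE:
        # Born before joining, but too young to hold a job.
        return "toddler employee"
    return "ok"
```

For example, classify_employee(date(2015, 3, 10), date(2018, 3, 10)) flags a three-year-old hire as a toddler employee.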
What are the 3 classes of anomalies
- Syntactically dirty data
- Semantically dirty data
- Coverage anomalies
What are the subclasses of syntactically dirty data
- Lexical errors
- Irregularities
What are the subclasses of semantically dirty data
- Integrity constraint violation
- Business rule contradiction
- Duplication
What are Coverage anomalies
- Missing attributes
- Missing Records
What are lexical errors
There is a problem in the structure of the data: the stored values do not match the expected storage format
What are irregularities
Missing unit (e.g. a salary column contains 2000 and we do not know whether it is PKR, USD, or something else)
What is Integrity constraint violation
Integrity constraint violations occur when an insert, update, or delete statement violates a primary key, foreign key, check, or unique constraint or a unique index.
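A minimal sketch of such violations, using Python's built-in sqlite3 module with an in-memory database. The dept and emp tables and the violates helper are made up for illustration; other database systems report these errors in their own way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute(
    "CREATE TABLE emp ("
    " id INTEGER PRIMARY KEY,"
    " salary INTEGER CHECK (salary >= 0),"
    " dept_id INTEGER REFERENCES dept(id))"
)
conn.execute("INSERT INTO dept VALUES (1, 'Sales')")
conn.execute("INSERT INTO emp VALUES (1, 2000, 1)")


def violates(sql):
    """Return True if the statement is rejected with an integrity error."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


print(violates("INSERT INTO emp VALUES (1, 3000, 1)"))    # duplicate primary key
print(violates("INSERT INTO emp VALUES (2, 3000, 99)"))   # foreign key to a missing dept
print(violates("INSERT INTO emp VALUES (3, -5, 1)"))      # check constraint on salary
print(violates("INSERT INTO dept VALUES (2, 'Sales')"))   # unique constraint on name
```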
What is business rule contradiction
It is a violation of a business rule
How can we handle coverage anomalies
- Remove the record that has the problem
- Manual data feeding
- Use a global constant (pick one global value and use it wherever a value is missing)
- Replace the missing value with the most probable value
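Three of these options can be sketched in a few lines (manual data feeding has no code equivalent). The sample records, the function names, and the "UNKNOWN" constant are illustrative assumptions, and "most probable value" is interpreted here as the most frequent known value, which is only one possible reading.

```python
from collections import Counter

# Made-up sample data; None marks a missing value.
records = [
    {"name": "Ali", "city": "Lahore"},
    {"name": "Sara", "city": None},
    {"name": "Omar", "city": "Lahore"},
    {"name": "Hina", "city": "Karachi"},
]


def drop_missing(rows, field):
    """Remove every record whose field is missing."""
    return [r for r in rows if r[field] is not None]


def fill_constant(rows, field, constant="UNKNOWN"):
    """Put one global constant wherever the field is missing."""
    return [{**r, field: constant if r[field] is None else r[field]} for r in rows]


def fill_most_probable(rows, field):
    """Put the most frequent known value wherever the field is missing."""
    known = [r[field] for r in rows if r[field] is not None]
    most_common = Counter(known).most_common(1)[0][0]
    return [{**r, field: most_common if r[field] is None else r[field]} for r in rows]
```

Each function returns a new list and leaves the input records unchanged, so the options can be compared side by side on the same data.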
What are the 2 key-based problems
- Primary key problems
- Non-primary key problems
What are primary key problems
- Same key but different data
- Same entity with different keys
- PK in one system but not in other
- Same PK but in different formats
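A rough sketch of how the first three problems might be detected when comparing two sources. The crm and billing dictionaries are made-up data keyed by primary key, and matching entities by exact name equality is a simplifying assumption; real matching usually needs fuzzier comparison.

```python
# Hypothetical sources: primary key -> customer name.
crm = {101: "Ali Khan", 102: "Sara Malik", 103: "Omar Sheikh"}
billing = {101: "Ali Khan", 102: "Hina Raza", 104: "Omar Sheikh"}


def same_key_different_data(a, b):
    """Keys present in both sources whose data disagrees."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])


def key_in_one_system_only(a, b):
    """Keys present in exactly one of the two sources."""
    return sorted(a.keys() ^ b.keys())


def same_entity_different_keys(a, b):
    """Pairs (key_in_a, key_in_b) whose data matches but whose keys differ."""
    by_value = {v: k for k, v in b.items()}
    return sorted(
        (k, by_value[v]) for k, v in a.items() if v in by_value and by_value[v] != k
    )


print(same_key_different_data(crm, billing))     # [102]
print(key_in_one_system_only(crm, billing))      # [103, 104]
print(same_entity_different_keys(crm, billing))  # [(103, 104)]
```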
What are non-primary key problems
- Different encoding in different sources (e.g. M/F in one place and male/female in another)
- Multiple ways to represent the same information
- Sources might contain invalid data
- Two fields with different data but same name
- Required field left empty
- Data incomplete
- Data contain null values
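The encoding problem is usually handled by mapping every source-specific code to one agreed code. A small sketch, where the GENDER_CODES table and the normalize_gender name are assumptions made for illustration:

```python
# Assumed mapping from source-specific encodings to one target encoding.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def normalize_gender(value):
    """Map a source encoding to 'M' or 'F'; return None if missing or unrecognized."""
    if value is None:
        return None
    return GENDER_CODES.get(value.strip().lower())
```

Returning None for unrecognized or empty values keeps invalid data visible instead of silently guessing, so those records can be reviewed separately.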
What are the 4 methods of automatic data cleansing
1- Association rules (make rules based on statistical properties of the data)
2- Pattern based (find values that do not fit the existing patterns)
3- Statistical (with the help of the mean value, etc.)
4- Clustering (group similar values together; anomalies are left alone outside the groups)
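As one illustration of the statistical method, the sketch below flags values that lie far from the mean, measured in standard deviations. The function name, the threshold of 2.0, and the sample salaries are assumptions; a suitable threshold depends on the data.

```python
from statistics import mean, pstdev


def statistical_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Made-up salary column with one suspicious entry.
salaries = [2000, 2100, 1900, 2050, 1950, 2000, 2100, 1900, 2000, 50000]
print(statistical_outliers(salaries))  # [50000]
```

A flagged value is only a candidate for cleansing; someone with domain knowledge still decides whether it is really dirty.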