Lecture 3 Flashcards
Data Preparation
How is data analytics performed in practice?
Data analytics goes through 4 phases in practice:
1. Preparation
2. Preprocessing
3. Analysis
4. Postprocessing
What are the tasks performed during the preparation phase?
The tasks performed during preparation are:
* Planning
* Data collection
* Feature generation
* Data selection
What are the tasks performed during the preprocessing phase?
The tasks performed during preprocessing are:
* cleaning
* filtering
* completion
* correction
* standardization
* transformation
What are the tasks performed during the analysis phase?
The tasks performed during analysis are:
* visualisation
* correlation
* regression
* forecasting
* classification
* clustering
What are the tasks performed during the postprocessing phase?
The tasks performed during postprocessing are:
* interpretation
* documentation
* evaluation
This is for the data preparation!
Why is merging often necessary in data analytics?
A single dataset is often insufficient for the whole analysis, so datasets have to be merged. How they are merged depends on the number of datasets and the types of variables.
What types of merging are possible?
There are 4 types of merging:
* Appending
* Horizontal stacking
* Join family
* Variable selection
When is Appending used?
Appending is used with 2 datasets that have the same variables: the data gets vertically stacked.
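A minimal sketch of appending, assuming pandas is the tool in use (the flashcards do not name one). The two small frames and their column names are hypothetical.

```python
import pandas as pd

# Two hypothetical datasets with the same variables (columns).
sales_2022 = pd.DataFrame({"customer": ["Ann", "Bob"], "amount": [10, 20]})
sales_2023 = pd.DataFrame({"customer": ["Cy"], "amount": [30]})

# Appending: stack the rows of the second dataset under the first.
# ignore_index=True renumbers the rows 0..n-1 in the result.
appended = pd.concat([sales_2022, sales_2023], ignore_index=True)
print(appended)
```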
When is Horizontal stacking used?
Horizontal stacking is used for 2 datasets with the same observations but different variables. They get placed side by side.
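A minimal sketch of horizontal stacking, assuming pandas and assuming the rows of both datasets describe the same observations in the same order (pandas aligns on the row index here, not on a key). The frames and column names are hypothetical.

```python
import pandas as pd

# Hypothetical datasets: the same three observations, in the same order.
names = pd.DataFrame({"customer": ["Ann", "Bob", "Cy"]})
amounts = pd.DataFrame({"amount": [10, 20, 30]})

# Horizontal stacking: place the columns side by side, aligned on the row index.
stacked = pd.concat([names, amounts], axis=1)
print(stacked)
```

If the row order cannot be trusted, a join on a key column is usually the safer choice.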
When is the Join family used?
It is used for two datasets with different variables that share a key. There are 6 types of joins:
* inner_join: only rows whose key is in one AND the other
* full_join: all rows whose key is in one OR the other
* left_join: all rows of the left dataset, with the variables of the right dataset added where keys match
* right_join: all rows of the right dataset, with the variables of the left dataset added where keys match
* semi_join: rows of the first dataset that have a match in the second, keeping only the variables of the first
* anti_join: rows of the first dataset that have no match in the second, keeping only the variables of the first
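The join names above follow R's dplyr. As a rough sketch, the same six behaviours can be reproduced in pandas, which is an assumption about tooling and not part of the flashcards. The frames `left` and `right` and the key column `id` are hypothetical.

```python
import pandas as pd

# Hypothetical datasets sharing the key column "id".
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
right = pd.DataFrame({"id": [2, 3, 4], "city": ["Rome", "Oslo", "Lima"]})

inner = left.merge(right, on="id", how="inner")    # keys in both: 2, 3
full = left.merge(right, on="id", how="outer")     # keys in either: 1, 2, 3, 4
left_j = left.merge(right, on="id", how="left")    # all left keys: 1, 2, 3
right_j = left.merge(right, on="id", how="right")  # all right keys: 2, 3, 4

# pandas has no dedicated semi/anti join; filtering on key membership
# is one way to emulate them.
in_right = left["id"].isin(right["id"])
semi = left[in_right]    # left rows with a match, left's columns only
anti = left[~in_right]   # left rows without a match, left's columns only
```

Rows without a partner in `full`, `left_j` and `right_j` get missing values (NaN) in the columns that come from the other dataset.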
What is common variable selection?
With more than two datasets, it means selecting the features common to all datasets and joining them.
What is the creation of multiple sub-datasets?
It is a second technique for merging more than 2 datasets: subsets of the datasets are merged.
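One possible reading of common variable selection, sketched with pandas. The three frames are hypothetical, and combining the reduced datasets by appending them row-wise is an assumption; the flashcard does not say how they are joined.

```python
import pandas as pd

# Three hypothetical datasets that only partly share their variables.
a = pd.DataFrame({"id": [1], "age": [30], "city": ["Rome"]})
b = pd.DataFrame({"id": [2], "age": [41], "income": [50]})
c = pd.DataFrame({"id": [3], "age": [25], "city": ["Oslo"]})
frames = [a, b, c]

# Keep only the variables present in every dataset,
# preserving the column order of the first dataset.
common = [col for col in a.columns if all(col in f.columns for f in frames)]

# Combine the reduced datasets (assumption: appended row-wise).
combined = pd.concat([f[common] for f in frames], ignore_index=True)
print(combined)
```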
When does duplicate data occur?
It occurs from data entry:
* lack of unique identifiers
* lack of integrity or validation checks
* data errors
Or from data merging:
* structural heterogeneity
* lexical heterogeneity
What can you do to limit duplicates?
Duplicates can be prevented by design with:
* use of standards
* integrity rules
* validation checks
What is structural heterogeneity?
Fields of different databases represent the same information in a structurally different manner.
EX:
DB1: Contact Name
DB2: Salutation, First Name, Last Name
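A small sketch of how such a structural difference might be reconciled before merging, assuming pandas. The sample record is made up, and the order in which DB1 writes the parts of "Contact Name" is an assumption.

```python
import pandas as pd

# Hypothetical DB2 record; the column names follow the example above.
db2 = pd.DataFrame({
    "Salutation": ["Dr."],
    "First Name": ["Ada"],
    "Last Name": ["Lovelace"],
})

# Rebuild DB1's single "Contact Name" field from DB2's three fields
# (assumed order: salutation, first name, last name).
db2["Contact Name"] = (
    db2["Salutation"] + " " + db2["First Name"] + " " + db2["Last Name"]
)
print(db2["Contact Name"].iloc[0])
```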
What is Lexical heterogeneity?
Fields of different databases are structurally the same, but they represent the same information in a different manner.
DB1 - Address: 32 E St. 4.
DB2 - Address: 32 East, 4th Street
What is a unique identifier?
A unique identifier is a data field that is always unique for an entity.
Ex: Social Security Number for customer data, Manufacturer Part Number.
What are validation checks?
Validation check: Do the unique identifiers conform to valid patterns (for example, AAA-GG-SSSS for SSN)?
What are integrity constraints?
Integrity constraint: Do the identifiers comply with the standard length (for example, 11 char limit for SSN)?
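The validation check and the integrity constraint above can be sketched with Python's standard library. The function names are hypothetical, the pattern assumes each letter in AAA-GG-SSSS stands for one digit, and treating the 11-character limit as an exact length is also an assumption.

```python
import re

# AAA-GG-SSSS layout: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")
SSN_LENGTH = 11  # 9 digits plus 2 hyphens


def has_valid_length(ssn: str) -> bool:
    """Integrity constraint: the identifier has the standard length."""
    return len(ssn) == SSN_LENGTH


def has_valid_pattern(ssn: str) -> bool:
    """Validation check: the whole identifier matches the expected pattern."""
    return SSN_PATTERN.fullmatch(ssn) is not None


for candidate in ["123-45-6789", "123456789", "abc-de-fghi"]:
    print(candidate, has_valid_length(candidate), has_valid_pattern(candidate))
```

The last candidate shows why both checks are useful: it has the right length but fails the pattern.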
What can you do when faced with duplicate data?
You can use deduplication, which is the process of removing duplicates from a dataset. Deduplication techniques for a single dataset are:
* Identify duplicate rows
* Find duplicates in one column
* Find duplicates in multiple columns
* Drop duplicates in one column
* Drop duplicates in multiple columns
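The techniques above map closely onto two pandas methods, `duplicated` and `drop_duplicates`. This is a sketch under the assumption that pandas is the tool in use; the dataset is hypothetical.

```python
import pandas as pd

# Hypothetical dataset containing repeated entries.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Ann"],
    "city": ["Rome", "Rome", "Oslo", "Lima"],
})

# Identify duplicate rows (all columns equal to an earlier row).
dup_rows = df.duplicated()

# Find duplicates in one column, then in multiple columns.
dup_name = df.duplicated(subset=["name"])
dup_name_city = df.duplicated(subset=["name", "city"])

# Drop duplicates in one column, then in multiple columns.
# By default the first occurrence is kept.
unique_name = df.drop_duplicates(subset=["name"])
unique_name_city = df.drop_duplicates(subset=["name", "city"])
```

Deduplicating on one column is more aggressive: here it discards the "Ann" from Lima, which the multi-column version keeps.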
What is preprocessing?
Preprocessing is often used as an umbrella term for all the operations that are performed prior to starting the analysis.
There are:
* General purpose preprocessing
* Preprocessing for diagnostic analytics