Understanding the concept of data inspection Flashcards
Data Inspection
Data Inspection is the act of viewing data for verification and debugging purposes, before, during, or after a translation.
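As a rough illustration, an inspection step can be as small as a summary of row counts, field names, and missing values. The sketch below assumes the data is a list of Python dictionaries; the `inspect_records` helper and the sample rows are hypothetical and not taken from any particular tool.

```python
from collections import Counter


def inspect_records(records):
    """Summarise a list of dict records for a quick verification check."""
    # Collect every field name that appears in at least one record.
    fields = sorted({key for row in records for key in row})
    # Count rows where a field is absent or explicitly None.
    missing = Counter()
    for row in records:
        for field in fields:
            if row.get(field) is None:
                missing[field] += 1
    return {
        "row_count": len(records),
        "fields": fields,
        "missing": dict(missing),
    }


# Hypothetical sample data, inspected before any translation is applied.
sample = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": None},
    {"id": 3},
]
summary = inspect_records(sample)
print(summary)
```

Running the same helper again after a translation gives a simple before/after comparison for debugging.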
Data cleansing:
When cleansing a dataset, data scientists seek to:
Remove null or invalid results
Standardise data within a single relevant, usable format
Unify disparate data sources in a consistent format
Maintain the integrity of the source dataset
Simple as this process may seem, real datasets are often very large and contain a wide variety of disparate values, which demands closer scrutiny.
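The four cleansing goals listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources (`crm_rows`, `shop_rows`), their field names, and their date formats are all hypothetical.

```python
from datetime import datetime

# Hypothetical records from two sources with different field names and date formats.
crm_rows = [
    {"customer": "Ada", "signup": "2023-01-15"},
    {"customer": None, "signup": "2023-02-01"},   # null name
]
shop_rows = [
    {"name": "Grace", "joined": "03/04/2023"},    # day/month/year
    {"name": "Linus", "joined": "not a date"},    # invalid date
]


def parse_date(text, fmt):
    """Return an ISO 8601 date string, or None if the value does not parse."""
    try:
        return datetime.strptime(text, fmt).date().isoformat()
    except (TypeError, ValueError):
        return None


def cleanse(crm, shop):
    """Unify both sources into one format and drop null or invalid rows."""
    unified = []
    # Unify disparate sources and standardise dates to a single format.
    for row in crm:
        unified.append({"name": row["customer"],
                        "joined": parse_date(row["signup"], "%Y-%m-%d")})
    for row in shop:
        unified.append({"name": row["name"],
                        "joined": parse_date(row["joined"], "%d/%m/%Y")})
    # Remove null or invalid results. New dicts were built above, so the
    # source lists are never modified.
    return [r for r in unified if r["name"] and r["joined"]]


cleaned = cleanse(crm_rows, shop_rows)
print(cleaned)
```

Building new records instead of editing the inputs in place is one simple way to keep the source dataset intact.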
Data Extraction:
Data extraction is the process of obtaining data from a database or SaaS platform so that it can be replicated to a destination, such as a data warehouse designed to support online analytical processing (OLAP).
Types of data extraction:
Extraction jobs may be scheduled, or analysts may extract data on demand as dictated by business needs and analysis goals. Data can be extracted in three primary ways:
Update notification
Incremental extraction
Full extraction
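The difference between full and incremental extraction can be shown with a small sketch using Python's built-in `sqlite3` module. The `orders` table, its `updated_at` column, and the timestamps are hypothetical; the sketch assumes the source records a last-modified time for every row. Update notification is not shown, because it depends on the source system announcing changes itself.

```python
import sqlite3

# Hypothetical source table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01T09:00:00"),
        (2, 25.5, "2024-01-02T12:30:00"),
        (3, 7.25, "2024-01-03T08:15:00"),
    ],
)


def full_extraction(conn):
    """Read every row, regardless of when it last changed."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders ORDER BY id"
    ).fetchall()


def incremental_extraction(conn, last_extracted_at):
    """Read only rows changed since the previous extraction run."""
    # ISO 8601 timestamps in the same format compare correctly as text.
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY id",
        (last_extracted_at,),
    ).fetchall()


all_rows = full_extraction(conn)
new_rows = incremental_extraction(conn, "2024-01-02T00:00:00")
print(len(all_rows), [row[0] for row in new_rows])
```

Incremental extraction moves less data per run, but it only works when the source tracks changes in some way, as the `updated_at` column does here.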
Data extraction process:
- Check for changes to the structure of the data, including the addition of new tables and columns. Changed data structures have to be dealt with programmatically.
- Retrieve the target tables and fields from the records specified by the integration’s replication scheme.
- Extract the appropriate data, if any.
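The three steps above can be sketched against a SQLite source. Everything named here is an assumption for illustration: the `users` table, the `extract` helper, and the idea that the previously known columns are kept as a plain list.

```python
import sqlite3


def current_schema(conn, table):
    """Return the column names of `table` as reported by SQLite."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def extract(conn, table, wanted_fields, known_columns):
    """Check for structural changes, select target fields, then extract."""
    # Step 1: detect columns added since the last run.
    columns = current_schema(conn, table)
    new_columns = [c for c in columns if c not in known_columns]
    # Step 2: keep only the target fields that still exist in the source.
    fields = [f for f in wanted_fields if f in columns]
    # Step 3: extract the appropriate data, if any.
    if not fields:
        return new_columns, []
    # Identifiers are interpolated directly; this sketch assumes they are trusted.
    rows = conn.execute(f"SELECT {', '.join(fields)} FROM {table}").fetchall()
    return new_columns, rows


# Hypothetical source whose structure changed after the last extraction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")
conn.execute("ALTER TABLE users ADD COLUMN country TEXT")

new_columns, rows = extract(
    conn, "users", ["id", "email"], known_columns=["id", "email"]
)
print(new_columns, rows)
```

Returning the new columns alongside the rows lets the caller decide programmatically whether to add them to the replication scheme.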
Data loading:
Data loading is the process of copying data or data sets from a source file, folder, or application into a database or similar application. It is usually implemented by reading digital data from a source and writing it to a data storage or processing utility.
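A minimal loading sketch, assuming the source is a CSV file and the destination is a SQLite database. The `people` table, the `load_csv` helper, and the in-memory stand-in for the file are all hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a source file on disk, so the example is self-contained.
source_file = io.StringIO("id,name\n1,Ada\n2,Grace\n")


def load_csv(source, conn):
    """Copy rows from a CSV source into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people (id INTEGER PRIMARY KEY, name TEXT)"
    )
    reader = csv.DictReader(source)
    rows = [(int(record["id"]), record["name"]) for record in reader]
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.executemany("INSERT INTO people (id, name) VALUES (?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_csv(source_file, conn)
print(loaded)
```

Wrapping the inserts in a single transaction means a failed load leaves the destination table unchanged rather than partially filled.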
Benefits of data loading:
Today, the ETL (extract, transform, load) process, including data loading, is designed for speed, efficiency, and flexibility. More importantly, it can scale to meet the growing data demands of most enterprises. ETL easily accommodates the proliferation of data sources as technologies like IoT and connected devices continue to gain popularity. It can also handle any number of data types and formats, whether structured, semi-structured, or unstructured.
Challenges of data loading:
Slower analysis
Higher likelihood of errors
Need for specialized knowledge
Need for costly equipment