Data Integration & Data Quality Flashcards
Which three data models are there?
1 - hierarchical model
2 - network model
3 - relational database model
Hierarchical data model
Data is structured as a tree of records
Network model
Allows nodes to have multiple parents, allowing for more flexibility
What does the relational database model provide?
An independent way to store data. This is the only data model in which data can be restructured without affecting the applications.
Approaches against physical location challenges
1 - data federation
2 - data warehousing
Data federation
Leaving the data in its place and moving the query to the data. The database engine becomes the federation engine or the mediator, which maintains a global target schema to provide a virtual view of the integrated data. The query optimizer then uses this global target schema to decompose a query into partial queries that can be executed by each of the sources.
What are wrappers used for in data federation?
They are used for each of the sources to map the global schema to the schema of the source and to negotiate how much of a partial query each of the sources can do and at what cost. It translates the query and returns to result back to the federation engine.
What to do if the data is described differently in different places?
Map the existing schemas to one common schema. This is a virtual schema in the case of federation and a materialised schema in the case of data warehousing.
What do declarative schema mappings describe?
The relationships between the schemas of heterogeneous data sources.
Name the measurements of data quality
1 - accuracy
2 - timeliness (update frequency)
3 - completeness
4 - consistency
5 - duplication
6 - referential integrity (first insert, then updates, etc)
7 - domain integrity (ages should be in a certain domain)
8 - follow the business rules (if there can only be one manager, not multiple managers in the database)
What are the key aspects of data distribution?
1 - availability
2 - scalability
3 - transparency (of access)
4 - reliability/fault-tolerance
Homogenous distributed databases
When websites have identical software, are aware of each other, and agree to cooperate in processing user requests.
It appears as a single system to the user.
It is a closed world assumption, everything in the network is known.
Heterogeneous distrbuted databases
Different sites may use different schemas and software. They might not be aware of each other and may provide only limited facilities for cooperation in transaction processing (queries)
Horizontal fragmentation
Each tuple of the relation is assigned to one or more fragments
For example, storing the employees of each department in a separate database.
Vertical fragmentation
The schema for the relation is split into several smaller schemas
What does vertical fragmentation require?
All schemas to have a common candidate key / super key to ensure losless joins (unique id)
Communication heterogeneity
Some databases might allow for direct query language while others offer APIs
Schema heterogeneity
The structure of the table is different
Example: student_info table and student_classes and student_contact_info
Data type heterogeneity
Data stored as different data types
Value heterogeneity
The same logical values being stored in different ways
Example: str, st, street
Semantic heterogneity
The same values meaning different things in different sources
Example: title -> job title or title of a person
When are federated databases useful?
When there are many sources and only a few are communicating.
Name the two approaches to update data warehouses
1 - complete rebuild
2 - incremental updates
Name two important things about data warehouses
1 - A data warehouse is never up to date at all times
2 - They cannot handle streaming data
Data lake
A data lake stores all types of structured and unstructured data as-is.
Can be combined with data warehouses
Mediator
A virtual view over the data. It has a virtual schema that combines all schemas from the sources. The mapping takes place at query time. The mediator sends queries to the wrappers, which connect to the course and send back the result to the mediator. The mediator combines all the results into one final result.