Data Integration & Data Quality Flashcards
Which three data models are there?
1 - hierarchical model
2 - network model
3 - relational database model
Hierarchical data model
Data is structured as a tree of records
Network model
Allows nodes to have multiple parents, allowing for more flexibility
What does the relational database model provide?
An independent way to store data. This is the only data model in which data can be restructured without affecting the applications.
Approaches against physical location challenges
1 - data federation
2 - data warehousing
Data federation
Leaving the data in its place and moving the query to the data. The database engine becomes the federation engine or the mediator, which maintains a global target schema to provide a virtual view of the integrated data. The query optimizer then uses this global target schema to decompose a query into partial queries that can be executed by each of the sources.
What are wrappers used for in data federation?
They are used for each of the sources to map the global schema to the schema of the source and to negotiate how much of a partial query each of the sources can do and at what cost. It translates the query and returns to result back to the federation engine.
What to do if the data is described differently in different places?
Map the existing schemas to one common schema. This is a virtual schema in the case of federation and a materialised schema in the case of data warehousing.
What do declarative schema mappings describe?
The relationships between the schemas of heterogeneous data sources.
Name the measurements of data quality
1 - accuracy
2 - timeliness (update frequency)
3 - completeness
4 - consistency
5 - duplication
6 - referential integrity (first insert, then updates, etc)
7 - domain integrity (ages should be in a certain domain)
8 - follow the business rules (if there can only be one manager, not multiple managers in the database)
What are the key aspects of data distribution?
1 - availability
2 - scalability
3 - transparency (of access)
4 - reliability/fault-tolerance
Homogenous distributed databases
When websites have identical software, are aware of each other, and agree to cooperate in processing user requests.
It appears as a single system to the user.
It is a closed world assumption, everything in the network is known.
Heterogeneous distrbuted databases
Different sites may use different schemas and software. They might not be aware of each other and may provide only limited facilities for cooperation in transaction processing (queries)
Horizontal fragmentation
Each tuple of the relation is assigned to one or more fragments
For example, storing the employees of each department in a separate database.
Vertical fragmentation
The schema for the relation is split into several smaller schemas