Chapter 15 Flashcards
What are the phases of information integration?
Analysis; discovery; planning; deployment; runtime
What are the approaches to information integration?
Bottom-up design (when the available data sources are well known); top-down design (when the available data sources are not known a priori); hybrid design (based on requirements)
What is schema matching?
An automatic process to obtain a mapping. A mapping is a set of correspondences between two schemas.
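A minimal sketch of the idea, assuming a purely name-based matcher: each pair of element names is scored with a string similarity, and pairs above a threshold become correspondences in the mapping. The two schemas, the 0.6 threshold, and the function names are invented for illustration; real matchers are considerably more elaborate.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] between two element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(source, target, threshold=0.6):
    """Return a mapping: a set of (source_element, target_element) correspondences."""
    return {
        (s, t)
        for s in source
        for t in target
        if name_similarity(s, t) >= threshold
    }


# Hypothetical schemas, given only as lists of element names.
customer = ["CustName", "Street", "Zip"]
client = ["CustomerName", "StreetAddress", "PostalCode"]

mapping = match_schemas(customer, client)
for source_element, target_element in sorted(mapping):
    print(source_element, "<->", target_element)
```

With these inputs the pair Zip / PostalCode is not found, because the two names share almost no characters. Synonyms like this are one reason why name comparison alone is rarely enough.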
What is the difference between individual and combining matchers?
Individual matchers exploit only one kind of information for identifying matches, whereas combining matchers use several (hybrid and composite matchers).
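The distinction can be sketched as follows, assuming each schema element is a (name, data type) pair. The two individual matchers each look at one kind of information, and the composite matcher combines their independently computed scores with a weighted average. The weights and the example elements are assumptions made for illustration, not values from the chapter.

```python
from difflib import SequenceMatcher


def name_matcher(a, b):
    """Individual matcher: looks only at element names."""
    return SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio()


def type_matcher(a, b):
    """Individual matcher: looks only at declared data types."""
    return 1.0 if a[1] == b[1] else 0.0


def composite_matcher(a, b, weights=(0.7, 0.3)):
    """Composite matcher: weighted average of the individual matchers' scores."""
    scores = (name_matcher(a, b), type_matcher(a, b))
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical elements as (name, data_type) pairs.
price_a = ("price", "decimal")
price_b = ("price", "string")

print(name_matcher(price_a, price_b))       # names agree
print(type_matcher(price_a, price_b))       # types disagree
print(composite_matcher(price_a, price_b))  # combined score lies in between
```

A hybrid matcher would instead fold both criteria into a single algorithm rather than combining the results of separate matchers afterwards.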
What is the difference between schema-only and instance-based matching?
Schema-only techniques operate solely on metadata, whereas instance-based techniques also consider properties of the data (they use statistical information on data values).
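As a rough illustration of instance-based matching, the sketch below ignores column names entirely and compares two simple statistics of the values: the share of purely numeric values and the mean value length. The choice of statistics, the equal weighting, and the sample columns are assumptions for the example; the columns are assumed to be non-empty.

```python
def profile(values):
    """Summarise a non-empty column by simple statistics of its values."""
    strings = [str(v) for v in values]
    n = len(strings)
    return {
        "numeric_ratio": sum(s.isdigit() for s in strings) / n,
        "mean_length": sum(len(s) for s in strings) / n,
    }


def instance_similarity(col_a, col_b):
    """Score in [0, 1]; 1.0 means the two profiles are identical."""
    pa, pb = profile(col_a), profile(col_b)
    numeric_diff = abs(pa["numeric_ratio"] - pb["numeric_ratio"])
    longest = max(pa["mean_length"], pb["mean_length"]) or 1.0
    length_diff = abs(pa["mean_length"] - pb["mean_length"]) / longest
    return 1.0 - (numeric_diff + length_diff) / 2


# Hypothetical column contents from two sources.
zip_codes = ["04109", "10115", "80331"]
postal_codes = ["50667", "20095"]
cities = ["Leipzig", "Berlin", "Munich"]

print(instance_similarity(zip_codes, postal_codes))  # identical profiles
print(instance_similarity(zip_codes, cities))        # clearly lower score
```

Because only the values are inspected, two columns with unrelated names can still be recognised as candidates for a match.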
Structural schema matching examples:
Cupid and Similarity Flooding
Which approach does Cupid use?
Hybrid approach: element-based and structure-based
What are the phases of Cupid?
Linguistic matching; structure matching; creation of mapping/matches
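The three phases can be sketched in a heavily simplified form. The sketch assumes that linguistic matching and structure matching have each already produced a similarity score per element pair (here hard-coded, invented numbers), and that the final phase combines them into a weighted similarity and keeps pairs above a threshold. The weight, the threshold, and all scores are assumptions for illustration, not Cupid's actual parameters.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Combine linguistic (lsim) and structural (ssim) similarity into one score."""
    return w_struct * ssim + (1 - w_struct) * lsim


def create_mapping(lsim, ssim, w_struct=0.5, threshold=0.5):
    """Keep the element pairs whose weighted similarity reaches the threshold.

    lsim and ssim are dicts keyed by (source_element, target_element).
    """
    return {
        pair
        for pair in lsim
        if weighted_similarity(lsim[pair], ssim[pair], w_struct) >= threshold
    }


# Invented scores standing in for the output of phases 1 and 2.
linguistic = {
    ("Zip", "PostalCode"): 0.2,
    ("Zip", "City"): 0.1,
    ("Street", "StreetAddress"): 0.7,
}
structural = {
    ("Zip", "PostalCode"): 0.9,
    ("Zip", "City"): 0.2,
    ("Street", "StreetAddress"): 0.8,
}

mapping = create_mapping(linguistic, structural)
print(sorted(mapping))
```

In this toy input, Zip / PostalCode is accepted despite its low linguistic score, because the structural evidence is strong enough to lift the combined score over the threshold.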
What is the goal of schema integration?
To create an integrated schema T from a set of schemas S that is complete, minimal, correct, and intelligible.
What are the four phases of schema integration?
Preintegration; comparing the schemas; conforming the schemas; schema merging and restructuring
What is the goal of integration planning?
Creation of an executable mapping
Mention four single-source data-level problems:
Typos; dummy values; wrong values; deprecated values; cryptic values; wrong references; duplicates
Mention four multi-source data-level problems:
Contradictory values; differing representations; different physical units; different precision; different levels of detail
How are data quality problems handled?
Two phases: individual records and multiple records.
In the first phase (individual records) we normalise the data, perform conversions, and remove outliers. In the second phase (multiple records) we detect duplicate entries and perform data fusion.
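The two phases can be sketched as a small pipeline. This is a simplified illustration under several assumptions: the record fields, the fixed plausibility range for outliers, duplicate detection by exact match on the normalised name, and the "first non-missing value wins" fusion rule are all invented for the example.

```python
def clean_record(rec):
    """Phase 1: normalise the name and convert the height to centimetres."""
    name = " ".join(rec["name"].split()).title()
    height = rec["height"]
    if rec.get("unit") == "in":  # convert inches to centimetres
        height = round(height * 2.54, 1)
    return {"name": name, "height_cm": height, "city": rec.get("city")}


def is_plausible(rec):
    """Phase 1: crude outlier filter using a fixed plausible range."""
    return 50 <= rec["height_cm"] <= 250


def fuse(records):
    """Phase 2: group duplicates by name and fill each missing field
    with the first non-missing value found among the duplicates."""
    fused = {}
    for rec in records:
        merged = fused.setdefault(rec["name"], dict(rec))
        for field, value in rec.items():
            if merged.get(field) is None:
                merged[field] = value
    return list(fused.values())


# Hypothetical raw records from two sources.
raw = [
    {"name": "  ada   LOVELACE", "height": 65, "unit": "in", "city": None},
    {"name": "Ada Lovelace", "height": 165.1, "unit": "cm", "city": "London"},
    {"name": "Alan Turing", "height": 1780, "unit": "cm", "city": "London"},
]

cleaned = [r for r in map(clean_record, raw) if is_plausible(r)]
result = fuse(cleaned)
print(result)
```

Here the third record is dropped as an outlier, and the two remaining records are recognised as duplicates and fused into one that takes the city from the second source.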