lec 6(done) Flashcards
Data Integration:
Combines data from multiple sources into a coherent store
Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set.
Data Integration Issues:
1-Entity identification problem
2-Redundancy and correlation analysis
3-Tuple duplication
4-Data value conflict detection and resolution
Example of schema integration and object matching
e.g., customer_id in one database and cust_number in another.
What is metadata?
A set of data that describes and gives information about other data.
For each attribute this includes its name, meaning, data type, range of values, and null rules.
How can we avoid errors in schema integration?
By using metadata
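A minimal sketch of how metadata could help with the entity identification problem. The metadata dictionaries, their fields, and the helper `match_attributes` are hypothetical, made up for illustration; real schema matching usually needs fuzzier comparison than exact equality.

```python
# Hypothetical metadata for two source databases (illustrative values only).
metadata_a = {
    "customer_id": {"meaning": "unique customer identifier", "type": "int", "nullable": False},
    "city": {"meaning": "city of residence", "type": "str", "nullable": True},
}
metadata_b = {
    "cust_number": {"meaning": "unique customer identifier", "type": "int", "nullable": False},
    "order_total": {"meaning": "total order amount", "type": "float", "nullable": False},
}


def match_attributes(meta_a, meta_b):
    """Pair attributes whose metadata descriptions are identical."""
    matches = []
    for name_a, desc_a in meta_a.items():
        for name_b, desc_b in meta_b.items():
            if desc_a == desc_b:
                matches.append((name_a, name_b))
    return matches


matches = match_attributes(metadata_a, metadata_b)
print(matches)
```

Here customer_id and cust_number are paired because their meaning, data type, and null rules agree, even though the names differ.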
Redundant data often occurs when multiple databases are integrated, due to:
Dimension naming: The same attribute may have different names in different databases
Derivable data: The same attribute can be “derived” from another attribute or set of attributes. e.g., annual revenue
Redundant attributes can be detected by
correlation analysis
Correlation does not imply causality. Explain:
If A and B are correlated, this does not necessarily imply that A causes B or that B causes A.
- The number of hospitals and the number of car thefts in a city are correlated
- Both are causally linked to a third variable: population
Correlation test for nominal data: we use
χ² (chi-square) test
χ² (chi-square) test
slide 7-11
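A small sketch of the χ² computation for two nominal attributes, using the standard formula χ² = Σ (observed − expected)² / expected, where expected = (row total × column total) / n. The contingency table below is illustrative data chosen for this sketch, not necessarily the one on the slides.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the two attributes were independent.
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
    return chi2


# Rows: fiction, non-fiction. Columns: male, female (assumed example counts).
observed = [[250, 200],
            [50, 1000]]
chi2 = chi_square(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)  # degrees of freedom
print(round(chi2, 2), dof)
```

A large χ² value relative to the critical value for the given degrees of freedom suggests the two attributes are correlated.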
Correlation test for numeric data: we use
Correlation Coefficient
Correlation Coefficient
slide 12
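A minimal sketch of the (Pearson) correlation coefficient for two numeric attributes, written in plain Python. The attribute names and values are made up for illustration: annual_revenue is derived from monthly_revenue, so the two are perfectly correlated and one of them is redundant.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical attributes: annual revenue is 12 x monthly revenue.
monthly_revenue = [10, 20, 30, 40]
annual_revenue = [12 * m for m in monthly_revenue]

r = pearson_r(monthly_revenue, annual_revenue)
print(r)
```

Values of r near +1 or -1 indicate strong positive or negative correlation; values near 0 indicate no linear correlation.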
Scatter plots can be used to
view correlations between attributes
Tuple Duplication
Two or more identical tuples for a given data entry
Inconsistencies often arise with tuple duplicates, due to what?
due to updating some but not all data occurrences.
For example, if a database contains three duplicate purchase tuples and we update the purchaser’s name in only one or two of them, the copies become inconsistent.
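A short sketch of how such inconsistent duplicates could be found: group tuples by a key and flag keys whose copies no longer agree. The purchase records, the choice of the first field as the key, and the helper `find_inconsistent_duplicates` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical purchase tuples: (purchase_id, purchaser_name, amount).
purchases = [
    ("P100", "Alice Smith", 25.0),
    ("P100", "Alice Smith", 25.0),
    ("P100", "Alice Jones", 25.0),  # name was updated in only one copy
    ("P200", "Bob Lee", 40.0),
]


def find_inconsistent_duplicates(rows):
    """Return keys whose duplicate tuples differ in at least one field."""
    by_key = defaultdict(set)
    for row in rows:
        by_key[row[0]].add(row)  # identical copies collapse in the set
    return {key: sorted(variants)
            for key, variants in by_key.items() if len(variants) > 1}


conflicts = find_inconsistent_duplicates(purchases)
print(conflicts)
```

Only P100 is reported, because its three copies disagree on the purchaser's name; exact duplicates that still agree are not flagged.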