College 2: Entity Resolution Flashcards
Entity resolution
Decide whether two data structures correspond to the same real-world entity
Reasons for variable descriptions
- Text variations (misspellings, acronyms, etc.)
- Local knowledge (different formats in sources, lack of global coordination for identification)
- Evolving nature of data (alternative entity names appearing over time, updates in entity data)
- New functionality:
* Web page extraction (Calais, Cogito)
* Imported data from various applications
* Mashups for easy and fast integration from various sources
Typical method for entity resolution
Identify data that describes the same thing, decide how to merge, update data collection
Atomic Similarity Method: Edit Distance
Minimum number of operations needed to convert the first string into the second.
Distance = insert costs + delete costs + substitute costs (Levenshtein distance: every operation costs 1)
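A minimal sketch of the unit-cost (Levenshtein) case, using the standard dynamic-programming recurrence. The function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of unit-cost inserts, deletes and substitutions."""
    # prev[j] = distance between the already processed prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletes
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```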
Atomic Similarity Method: Gap Distance
Overcomes the limitation of edit distance on shortened strings by adding two extra operations, open gap and extend gap, each with a small cost. Cost = 1 + o + 82
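One common way to realise open-gap and extend-gap operations is an affine-gap alignment (Gotoh-style dynamic programming). The sketch below assumes that model. The cost values (`open_cost=0.5`, `extend_cost=0.1`, `sub_cost=1.0`) are illustrative assumptions, not values from these notes.

```python
import math


def gap_distance(s1, s2, open_cost=0.5, extend_cost=0.1, sub_cost=1.0):
    """Affine-gap distance: the first character of a gap costs open_cost,
    every further character of the same gap costs extend_cost.
    All cost values are assumptions chosen for illustration."""
    n, m = len(s1), len(s2)
    inf = math.inf
    # M: alignment ends with s1[i-1] paired with s2[j-1]
    # X: alignment ends with s1[i-1] paired with a gap
    # Y: alignment ends with s2[j-1] paired with a gap
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = open_cost + (i - 1) * extend_cost
    for j in range(1, m + 1):
        Y[0][j] = open_cost + (j - 1) * extend_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if s1[i - 1] == s2[j - 1] else sub_cost
            M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = min(M[i - 1][j] + open_cost,
                          X[i - 1][j] + extend_cost,
                          Y[i - 1][j] + open_cost)
            Y[i][j] = min(M[i][j - 1] + open_cost,
                          Y[i][j - 1] + extend_cost,
                          X[i][j - 1] + open_cost)
    return min(M[n][m], X[n][m], Y[n][m])


# One gap of length 3: open once, extend twice.
print(round(gap_distance("ab", "axxxb"), 2))  # 0.7
```

Under these assumed costs, the three inserted characters cost 0.7 in total, whereas unit-cost edit distance would charge 3 for the same pair.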
Atomic Similarity Method: Jaro similarity
JaroSim(s1,s2) = 1/3 * (C/|s1| + C/|s2| + (C-T)/C)
C: number of characters common to s1 and s2
T: number of transpositions / 2
Example: Deis vs. Desi gives C = 4, T = 2/2 = 1
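A sketch of the formula in code. It assumes the usual Jaro matching window, which the notes do not state: two characters count as common only if they are equal and at most max(|s1|, |s2|)/2 - 1 positions apart.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity; the matching window is the conventional one (assumption)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    c = 0  # C: number of common characters
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    # Common characters in the order they appear in each string
    m1 = [ch for ch, u in zip(s1, used1) if u]
    m2 = [ch for ch, u in zip(s2, used2) if u]
    # T: half the number of common characters that are out of order
    t = sum(a != b for a, b in zip(m1, m2)) / 2
    return (c / len(s1) + c / len(s2) + (c - t) / c) / 3


# Deis vs. Desi: C = 4, T = 1, so 1/3 * (4/4 + 4/4 + 3/4)
print(round(jaro("Deis", "Desi"), 4))  # 0.9167
```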
Atomic Similarity Method: Jaro-Winkler
Extension that gives higher weight to a matching prefix:
JW(s1,s2) = JaroSim(s1,s2) + P * L * (1 - JaroSim(s1,s2))
P: scaling factor for the prefix bonus (commonly 0.1)
L: length of the common prefix, with four as maximum
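A self-contained sketch of the prefix adjustment. The scaling factor `p=0.1` is the commonly used default and an assumption here, as is the conventional Jaro matching window used by the helper `jaro`.

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity with the conventional matching window (assumption)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    c = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    m1 = [ch for ch, u in zip(s1, used1) if u]
    m2 = [ch for ch, u in zip(s2, used2) if u]
    t = sum(a != b for a, b in zip(m1, m2)) / 2
    return (c / len(s1) + c / len(s2) + (c - t) / c) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the common prefix (at most 4 characters)."""
    base = jaro(s1, s2)
    prefix_len = 0  # L in the formula
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix_len += 1
    return base + p * prefix_len * (1 - base)


# Deis vs. Desi share the prefix "De", so L = 2
print(round(jaro_winkler("Deis", "Desi"), 4))  # 0.9333
```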
Atomic Similarity Method: Soundex
Converts each word into a phonetic encoding by assigning the same code to the string parts that sound the same.
Similarity is then computed between the corresponding phonetic encodings.
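A sketch of the widely used American Soundex rules: keep the first letter, map consonants to digits, collapse runs of equal codes, pad to four characters. Treating two words as similar when their codes are equal is one simple choice of similarity, assumed here for illustration.

```python
# Letter groups that share a Soundex digit
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = CODES.get(ch, "")  # vowels and Y have no code and end a run
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def sounds_alike(a: str, b: str) -> bool:
    """Simplest similarity on the encodings: exact equality of the codes."""
    return soundex(a) == soundex(b)


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(sounds_alike("Robert", "Rupert"))      # True
```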
Structural Heterogeneity
See structures as sets and compute set similarity
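As an illustration, each structure can be flattened into a set of tokens and compared with a set measure. Jaccard similarity is used below as one possible measure; the notes do not prescribe a specific one, and the example token sets are made up.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)


# Two descriptions with different structure, flattened to token sets
desc1 = {"john", "smith", "amsterdam"}
desc2 = {"j", "smith", "amsterdam"}
print(jaccard(desc1, desc2))  # 0.5
```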
Set similarity: Group linkage
- Considers groups of relational records instead of individual relational records.
- Two groups match when their individual records are highly similar and a large fraction of the records match
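The two conditions above can be combined into a single score. The sketch below is a simplified, greedy illustration: it pairs records one-to-one, best pairs first, keeps only pairs whose similarity reaches a threshold `theta`, and normalises by the number of distinct records involved. The names `group_similarity`, `sim` and `theta`, and the greedy pairing itself, are assumptions made for this example.

```python
def group_similarity(g1, g2, sim, theta=0.5):
    """Greedy one-to-one matching of records between two groups.

    Score = sum of similarities of matched pairs
            / (|g1| + |g2| - number of matched pairs)
    """
    # All candidate pairs, most similar first
    pairs = sorted(
        ((sim(r1, r2), i, j)
         for i, r1 in enumerate(g1)
         for j, r2 in enumerate(g2)),
        reverse=True,
    )
    used1, used2 = set(), set()
    total = 0.0
    for s, i, j in pairs:
        if s < theta:
            break  # remaining pairs are even less similar
        if i in used1 or j in used2:
            continue  # each record may be matched only once
        used1.add(i)
        used2.add(j)
        total += s
    denom = len(g1) + len(g2) - len(used1)
    return total / denom if denom else 0.0


# Toy record similarity: exact equality
def exact(a, b):
    return 1.0 if a == b else 0.0


print(group_similarity(["a", "b", "c"], ["a", "b", "d"], exact))  # 0.5
```

A high score needs both highly similar pairs (large numerator) and few unmatched records (small denominator).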
Set similarity: database community
Each relation record is an entity
Set similarity: Merge-purge approach
- Idea: records describing the same entity share information
- Create a key for each record (e.g. email)
- Sort the records according to the key
- Compare only a limited set of neighbouring records in each iteration
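The steps above can be sketched as a sliding window over the sorted records. The function name, the `window` parameter, the record layout and the match rule below are hypothetical choices for illustration.

```python
def sorted_neighborhood(records, key, match, window=3):
    """Sort by key, then compare each record only with the next window-1 records."""
    ordered = sorted(records, key=key)
    matches = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if match(rec, other):
                matches.append((rec, other))
    return matches


# Made-up example records
records = [
    {"name": "J. Smith", "email": "john@example.com"},
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "John Smith", "email": "John@Example.com"},
    {"name": "Bob Ray", "email": "bob@example.com"},
]


def email_key(record):
    return record["email"].lower()


def same_email(a, b):
    return email_key(a) == email_key(b)


pairs = sorted_neighborhood(records, email_key, same_email, window=2)
for a, b in pairs:
    print(a["name"], "<->", b["name"])  # J. Smith <-> John Smith
```

With a window of size w, n records need roughly n * (w - 1) comparisons instead of all n * (n - 1) / 2 pairs. The price is that matches whose keys sort far apart are missed.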