College 2: Entity Resolution Flashcards

Question 1

Q

Entity resolution

Answer

A

Decide whether two data structures correspond to the same real-world entity

Question 2

Q

Reasons for varying entity descriptions

Answer

A

Text variations (misspellings, acronyms, etc.)
Local knowledge (different formats in sources, lack of global coordination for identification)
Evolving nature of data (entity alternative names appearing in time, updates in entity data)
New functionality:
* Web page extraction (Calais, Cogito)
* Imported data from various applications
* Mashups for easy and fast integration from various sources

Question 3

Q

Typical method for entity resolution

Answer

A

Identify data that describes the same thing, decide how to merge it, then update the data collection

Question 4

Q

Atomic Similarity Method: Edit Distance

Answer

A

Number of operations needed to convert the 1st string into the 2nd.
Insert costs + delete costs + substitute costs (Levenshtein distance: each cost = 1)
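
As a minimal sketch of the unit-cost case, the edit distance can be computed with a row-by-row dynamic program. The function name `levenshtein` and the example strings are illustrative assumptions, not part of the original card.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Deis", "Desi"))       # 2
```

Note that a swap of two adjacent characters counts as two substitutions here, which is one reason the Jaro family treats transpositions separately.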

Question 5

Q

Atomic Similarity Method: Gap Distance

Answer

A

Overcomes the limitation of edit distance on shortened strings by considering two extra operations, open gap and extend gap, each with a small cost. Cost = 1 + o + 82

Question 6

Q

Atomic Similarity Method: Jaro similarity

Answer

A

JaroSim(s1,s2) = 1/3 * (C/|s1| + C/|s2| + (C-T)/C)
C: number of common characters in s1 and s2
T: number of transpositions / 2

Deis vs. Desi: C=4, T=2/2=1, so JaroSim = 1/3 * (4/4 + 4/4 + 3/4) ≈ 0.917
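
The Jaro formula, 1/3 * (C/|s1| + C/|s2| + (C-T)/C), can be sketched in Python as follows. The card does not say when two characters count as "common"; this sketch assumes the usual convention that equal characters must lie within max(|s1|,|s2|)/2 - 1 positions of each other. The function name `jaro` is illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity, assuming the usual matching-window convention."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0
    # Assumption: characters match only within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used2 = [False] * len(s2)
    matches1 = []
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used2[j] = True
                matches1.append(ch)
                break
    c = len(matches1)  # C: number of common characters
    if c == 0:
        return 0.0
    matches2 = [s2[j] for j in range(len(s2)) if used2[j]]
    # T: common characters that appear in a different order, divided by 2.
    t = sum(a != b for a, b in zip(matches1, matches2)) / 2
    return (c / len(s1) + c / len(s2) + (c - t) / c) / 3


print(round(jaro("Deis", "Desi"), 3))  # 0.917
```

For Deis vs. Desi this gives C=4 and T=1, matching the worked example on the card.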

Question 7

Q

Atomic Similarity Method: Jaro-Winkler

Answer

A

Extension of Jaro that gives higher weight to a matching prefix:
Jw(s1,s2) = JaroSim(s1,s2) + P x L x (1 - JaroSim(s1,s2))
P: prefix scaling factor (commonly 0.1)
L: length of the common prefix, with four as maximum
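
A minimal sketch of the prefix boost, assuming P = 0.1 (a common default, not stated on the card) and the usual Jaro matching window. The names `jaro`, `jaro_winkler`, `p`, and `max_prefix` are illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity, assuming the usual matching-window convention."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used2 = [False] * len(s2)
    matches1 = []
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used2[j] = True
                matches1.append(ch)
                break
    c = len(matches1)
    if c == 0:
        return 0.0
    matches2 = [s2[j] for j in range(len(s2)) if used2[j]]
    t = sum(a != b for a, b in zip(matches1, matches2)) / 2
    return (c / len(s1) + c / len(s2) + (c - t) / c) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    sim = jaro(s1, s2)
    prefix = 0  # L: length of the common prefix, capped at max_prefix
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + p * prefix * (1 - sim)


print(round(jaro_winkler("Deis", "Desi"), 3))  # 0.933
```

With the cap of 4 and P = 0.1 the boost never exceeds 40% of the remaining distance to 1, so the score stays within [0, 1].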

Question 8

Q

Atomic Similarity Method: Soundex

Answer

A

Converts each word into a phonetic encoding by assigning the same code to the string parts that sound the same.
Similarity is then computed between the corresponding phonetic encodings.
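
As a sketch, the widely used American Soundex variant (first letter plus three digits) can be written as below. The card does not say which variant is meant, so the rule set here is an assumption; the name `soundex` and the example names are illustrative.

```python
# Consonant groups that sound alike share one digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """American Soundex: first letter followed by three digits."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels (and y) have no code
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeat across a vowel is kept
    return (encoded + "000")[:4]  # pad or truncate to four characters


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Two words are then treated as similar when their codes are equal, as with Robert and Rupert above.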

Question 9

Q

Structural Heterogeneity

Answer

A

Treat structures as sets and compute set similarity
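
The card does not name a specific set measure. As one illustration, the Jaccard coefficient (size of the intersection over size of the union) can be applied to the token sets of two records. The function name and the sample records are assumptions made for this sketch.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: shared elements divided by all distinct elements."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


# Two descriptions with different structure, flattened to sets of tokens.
rec1 = {"john", "smith", "amsterdam"}
rec2 = {"j", "smith", "amsterdam"}
print(jaccard(rec1, rec2))  # 0.5
```

Flattening to sets ignores attribute names and order, which is what makes the comparison robust to structural differences between sources.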

Question 10

Q

Set similarity: Group linkage

Answer

A

Considers groups of relational records instead of individual relational records.
Groups match when the data of individual records is highly similar and a large fraction of the records match
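
One possible way to combine the two conditions is sketched below: record pairs are matched greedily when their similarity reaches a threshold, and the group score is the summed similarity of matched pairs divided by the number of distinct records involved. The greedy matching, the Jaccard record similarity, the threshold `theta`, and all names are simplifying assumptions, not the method from the lecture.

```python
def jaccard(a, b) -> float:
    """Record similarity on token sets (an assumed choice of measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def group_similarity(g1, g2, theta: float = 0.5) -> float:
    """Greedy group score: high only if many records match with high similarity."""
    if not g1 or not g2:
        return 0.0
    # All cross-group record pairs, most similar first.
    pairs = sorted(
        ((jaccard(r1, r2), i, j)
         for i, r1 in enumerate(g1) for j, r2 in enumerate(g2)),
        reverse=True,
    )
    used1, used2, total = set(), set(), 0.0
    for sim, i, j in pairs:
        if sim < theta:
            break  # remaining pairs are too dissimilar to count as matches
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            total += sim
    matched = len(used1)
    # Unmatched records on either side lower the score.
    return total / (len(g1) + len(g2) - matched)


group_a = [{"entity", "resolution"}, {"data", "integration"}]
group_b = [{"entity", "resolution"}, {"graph", "mining"}]
print(round(group_similarity(group_a, group_b), 3))  # 0.333
```

Here only one of the two records in each group finds a partner, so the score is low even though that one pair is identical.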

Question 11

Q

Set similarity: database community

Answer

A

Each relational record is treated as an entity

Question 12

Q

Set similarity: Merge-purge approach

Answer

A

Idea: records of the same entity share information
Create a key for each record (e.g. email)
Sort records according to key
Compare only a limited set of records in each iteration
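
The steps above can be sketched as a sliding window over the sorted records. The window size, the function name `candidate_pairs`, and the sample records are assumptions for illustration; the final match decision on each candidate pair is left out.

```python
def candidate_pairs(records, key, window: int = 3):
    """Sort by key, then pair each record with the next window-1 records."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.append((rec, other))
    return pairs


# Hypothetical sample data, keyed on email.
records = [
    {"name": "J. Smith", "email": "jsmith@example.org"},
    {"name": "Ann Lee", "email": "alee@example.org"},
    {"name": "John Smith", "email": "jsmith@example.org"},
    {"name": "Bob Ray", "email": "bray@example.org"},
]

pairs = candidate_pairs(records, key=lambda r: r["email"], window=2)
for a, b in pairs:
    print(a["name"], "<->", b["name"])
```

With four records and a window of 2, only 3 of the 6 possible pairs are compared, and the two Smith records end up adjacent because they share the same key.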
