College 2: Entity Resolution Flashcards

1
Q

Entity resolution

A

Decide whether two data structures (e.g., records) refer to the same real-world entity

2
Q

Reasons why descriptions of the same entity vary

A
  1. Text variations (misspellings, acronyms, etc.)
  2. Local knowledge (different formats across sources, no global coordination of identifiers)
  3. Evolving nature of data (alternative names for an entity appearing over time, updates to entity data)
  4. New functionality:
    * Web page extraction (Calais, Cogito)
    * Data imported from various applications
    * Mashups for easy and fast integration of various sources
3
Q

Typical method for entity resolution

A

Identify the data that describe the same thing, decide how to merge them, then update the data collection

4
Q

Atomic Similarity methods: Edit Distance

A

Number of operations needed to convert the 1st string into the 2nd.
Total = insert costs + delete costs + substitute costs (Levenshtein distance: every operation costs 1)
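
A minimal sketch of the unit-cost case described above, using the standard dynamic-programming recurrence. The function name `levenshtein` is chosen here for illustration and is not from the card.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

For "kitten" vs. "sitting" this gives 3: two substitutions and one insertion.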

5
Q

Atomic Similarity method: Gap Distance

A

Overcomes the limitation of edit distance on shortened strings by adding two extra operations, open gap and extend gap, each with a small cost. Cost = 1 + o + 82
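
One common way to realise this idea is the affine gap distance, sketched below with the standard three-table dynamic program. This is an assumption about the intended variant. The function name and the cost values (`sub=1.0`, `gap_open=0.5`, `gap_extend=0.1`) are illustrative and not from the card. A gap of length n costs `gap_open + (n - 1) * gap_extend`.

```python
import math


def affine_gap_distance(s1, s2, sub=1.0, gap_open=0.5, gap_extend=0.1):
    """Alignment cost where a run of n skipped characters costs open + (n-1)*extend."""
    n, m = len(s1), len(s2)
    inf = math.inf
    # M: last characters aligned to each other; X: s1 character skipped; Y: s2 character skipped.
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if s1[i - 1] == s2[j - 1] else sub
            M[i][j] = cost + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a new gap costs gap_open; continuing the same gap costs gap_extend.
            X[i][j] = min(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend, Y[i - 1][j] + gap_open)
            Y[i][j] = min(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend, X[i][j - 1] + gap_open)
    return min(M[n][m], X[n][m], Y[n][m])


print(affine_gap_distance("John R Smith", "John Reginald Smith"))
```

With these illustrative costs, the abbreviated name is charged for one gap of seven characters (about 1.1), whereas unit-cost edit distance would charge 7 insertions.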

6
Q

Atomic Similarity Method: Jaro similarity

A

JaroSim(s1,s2) = 1/3 (C/|s1| + C/|s2| + (C-T)/C)
C: number of common characters in s1 and s2
T: number of transpositions / 2

Example: "Deis" vs. "Desi": C = 4, T = 2/2 = 1, so JaroSim = 1/3 (4/4 + 4/4 + 3/4) ≈ 0.917
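
The formula can be sketched as follows. The card does not say when two characters count as "common"; the code assumes the standard Jaro rule that equal characters must lie within a window of max(|s1|, |s2|) // 2 - 1 positions of each other. The function name `jaro` is illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1/3 * (C/|s1| + C/|s2| + (C - T)/C)."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0
    # Assumed matching window (standard Jaro definition, not stated on the card).
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    c = 0  # C: number of common characters
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    # T: half the number of common characters that appear in a different order.
    seq1 = [ch for ch, m in zip(s1, matched1) if m]
    seq2 = [ch for ch, m in zip(s2, matched2) if m]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (c / len(s1) + c / len(s2) + (c - t) / c) / 3


print(round(jaro("Deis", "Desi"), 3))
```

For "Deis" vs. "Desi" the sketch finds C = 4 and T = 1, giving roughly 0.917.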

7
Q

Atomic Similarity Method: Jaro-Winkler

A

Extension of Jaro that gives higher weight to a matching prefix:
JW(s1,s2) = JaroSim(s1,s2) + P × L × (1 - JaroSim(s1,s2))
L: length of the common prefix, at most 4
P: prefix scaling factor (commonly 0.1)
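
A small sketch of the prefix adjustment. To stay short it takes an already computed Jaro score as input. The function name, the parameter names and the default P = 0.1 (the commonly used value) are assumptions, not from the card.

```python
def jaro_winkler(jaro_sim: float, s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost a Jaro score by the length L of the common prefix (capped at max_prefix)."""
    l = 0
    for a, b in zip(s1, s2):
        if a != b or l == max_prefix:
            break
        l += 1
    return jaro_sim + p * l * (1.0 - jaro_sim)


# "Deis" vs. "Desi": Jaro score 11/12, common prefix "De" so L = 2.
print(round(jaro_winkler(11 / 12, "Deis", "Desi"), 3))
```

Here the score rises from about 0.917 to about 0.933, because the two strings share the prefix "De".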

8
Q

Atomic Similarity Method: Soundex

A

Converts each word into a phonetic encoding by assigning the same code to string parts that sound the same. Similarity is then computed between the corresponding phonetic encodings.
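
A sketch of the classic American Soundex rules (first letter plus three digits), assuming that is the variant meant here. The names `CODES` and `soundex` are illustrative.

```python
# Letters that sound alike share a digit; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        # H and W do not separate letters with the same code; vowels do.
        if ch not in "HW":
            prev = code
    return (encoded + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

"Robert" and "Rupert" both encode to R163, so a comparison on the encodings treats them as matching.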

9
Q

Structural Heterogeneity

A

Treat the structures as sets and compute a set similarity between them
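
The card does not name a particular set similarity. As one possible choice, the sketch below uses Jaccard similarity (size of the intersection divided by size of the union). The function name and the sample values are illustrative.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity of two collections viewed as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Two descriptions seen as sets of values, ignoring attribute names and order.
d1 = {"john", "smith", "nyc"}
d2 = {"john", "smith", "boston"}
print(jaccard(d1, d2))
```

The two descriptions share 2 of 4 distinct values, so the similarity is 0.5.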

10
Q

Set similarity: Group linkage

A
  • Considers groups of relational records instead of individual relational records.
  • Two groups match when the data of their individual records are highly similar and a large fraction of their records match.
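
A rough sketch of such a group-level score. It pairs records greedily rather than computing an optimal bipartite matching, so it is an approximation. The function name, the `record_sim` callback and the threshold of 0.8 are assumptions made for illustration.

```python
def group_similarity(g1, g2, record_sim, threshold=0.8):
    """Sum of similarities of matched record pairs over the number of distinct records."""
    # Consider all cross-group pairs, most similar first.
    pairs = sorted(
        ((record_sim(r1, r2), i, j) for i, r1 in enumerate(g1) for j, r2 in enumerate(g2)),
        reverse=True,
    )
    used1, used2 = set(), set()
    total = 0.0
    for sim, i, j in pairs:
        if sim < threshold:
            break  # remaining pairs are too dissimilar to count as matches
        if i in used1 or j in used2:
            continue  # each record may be matched at most once
        used1.add(i)
        used2.add(j)
        total += sim
    denom = len(g1) + len(g2) - len(used1)
    return total / denom if denom else 1.0


exact = lambda a, b: 1.0 if a == b else 0.0
print(group_similarity(["a", "b", "c"], ["a", "b", "d"], exact))
```

The score is high only when many records are matched (large fraction) and the matched pairs are themselves very similar. In the example two of the three records match exactly, giving 2 / (3 + 3 - 2) = 0.5.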
11
Q

Set similarity: database community

A

Each relational record is treated as an entity

12
Q

Set similarity: Merge-purge approach

A
  • Idea: descriptions of the same entity share information
  • Create a key for each relation (e.g. email)
  • Sort the relations according to the key
  • Compare only a limited set of relations in each iteration
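
The steps above can be sketched as a sorted-neighbourhood pass, where the "limited set" is a sliding window over the sorted relations. The function name, the window size of 3, the `is_match` callback and the sample records are illustrative assumptions.

```python
def sorted_neighborhood(records, key, is_match, window=3):
    """Sort by key, then compare each record only with the next window-1 records."""
    ordered = sorted(records, key=key)
    matches = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            if is_match(rec, other):
                matches.append((rec, other))
    return matches


people = [
    {"name": "J. Smith", "email": "js@example.org"},
    {"name": "Ann Lee", "email": "al@example.org"},
    {"name": "John Smith", "email": "js@example.org"},
]
pairs = sorted_neighborhood(
    people,
    key=lambda r: r["email"],                       # the key built for each relation
    is_match=lambda a, b: a["email"] == b["email"],
)
print([(a["name"], b["name"]) for a, b in pairs])
```

Sorting brings the two relations with the same email next to each other, so they are compared without examining every possible pair.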