Information Extraction Flashcards
Information extraction
The process of Information Extraction turns the unstructured information embedded in texts into structured data, e.g. populating a relational database to enable further processing.
Relation Extraction
Finding and classifying semantic relations among entities mentioned in a text.
RDF triple
A tuple of entity-relation-entity, called a subject-predicate-object expression.
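As a small illustration, a triple can be modelled as a three-field record. The `Triple` type, its field names, and the example fact are hypothetical, chosen only to make the subject-predicate-object shape concrete.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A subject-predicate-object expression (illustrative type, not a standard API)."""
    subject: str
    predicate: str
    obj: str


# Example fact written as one triple; the relation name is an assumption.
fact = Triple(subject="Albert Einstein", predicate="place-of-birth", obj="Ulm")
print(fact.subject, fact.predicate, fact.obj)
```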
5 Classes of algorithms for relation extraction
- handwritten patterns
- supervised machine learning
- semi-supervised (via bootstrapping or distant supervision)
- unsupervised
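The first class, handwritten patterns, can be sketched with a single lexical rule. The "X such as Y" pattern and the `extract_is_a` helper below are illustrative assumptions; real systems match over noun-phrase chunks rather than single words.

```python
import re

# Hypothetical handwritten pattern: "<hypernym> such as <hyponym>".
# Single words stand in for noun phrases to keep the sketch short.
SUCH_AS = re.compile(r"(\w+) such as (\w+)")


def extract_is_a(sentence):
    """Return (hyponym, 'is-a', hypernym) tuples matched by the pattern."""
    return [(hypo, "is-a", hyper) for hyper, hypo in SUCH_AS.findall(sentence)]


print(extract_is_a("They sell fruits such as apples"))
```

Patterns like this tend to be precise but miss every phrasing the author did not anticipate, which is what motivates the learned approaches in the rest of the list.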
Semi-supervised Relation Extraction via Bootstrapping
If we have a few high-precision seed patterns, or seed tuples, we can bootstrap a classifier.
Bootstrapping proceeds by taking the entities in the seed pair, and then finding sentences (e.g. on the web) that contain both entities.
From all such sentences, we extract and generalize the context around the entities to learn new patterns.
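One round of this loop can be sketched as follows. This is a toy version under strong assumptions: the "context" is just the literal text between the two entities, new entities are single capitalised words, and the function names and sentences are made up for illustration.

```python
import re


def learn_patterns(seeds, sentences):
    """Use the text between the two seed entities as a crude pattern."""
    patterns = set()
    for x, y in seeds:
        for s in sentences:
            m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), s)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_patterns(patterns, sentences):
    """Find (X, Y) pairs of capitalised words joined by a learned pattern."""
    found = set()
    for p in patterns:
        rx = re.compile(r"([A-Z]\w+)" + re.escape(p) + r"([A-Z]\w+)")
        for s in sentences:
            found.update(rx.findall(s))
    return found


# Toy corpus and a single seed pair (illustrative data).
sentences = [
    "Ryanair has a hub at Charleroi.",
    "Lufthansa has a hub at Frankfurt.",
]
seeds = {("Ryanair", "Charleroi")}

patterns = learn_patterns(seeds, sentences)   # {" has a hub at "}
tuples = apply_patterns(patterns, sentences)  # seed pair plus one new pair
```

Feeding `tuples` back in as the next round's seeds repeats the cycle.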
Semantic drift
In semantic drift, an erroneous pattern leads to the introduction of erroneous tuples, which in turn lead to the creation of problematic patterns, so the meaning of the extracted relations ‘drifts’.
Relation Extraction
Confidence values in bootstrapping
Bootstrapping systems assign confidence values to new tuples to avoid semantic drift.
Given a document collection D, a current set of tuples T, and a proposed pattern p, we need to track two factors:
- hits(p): the set of tuples in T that p matches while looking in D.
- finds(p): the total set of tuples that p finds in D.
Conf(p) = |hits(p)| / |finds(p)| × log(|finds(p)|)
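The confidence score can be computed directly from the two sets. A minimal sketch, assuming the natural logarithm (the card does not state the base) and a score of 0 for a pattern that finds nothing; the function name and sample tuples are hypothetical.

```python
import math


def pattern_confidence(hits, finds):
    """Conf(p) = |hits(p)| / |finds(p)| * log(|finds(p)|).

    Natural log is an assumption; returning 0.0 for an empty finds(p)
    is a convention chosen here to avoid dividing by zero.
    """
    if not finds:
        return 0.0
    return len(hits) / len(finds) * math.log(len(finds))


# Toy data: the pattern finds four tuples, three of which are already in T.
T = {("a", "b"), ("c", "d"), ("e", "f")}
finds = {("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")}
hits = finds & T

conf = pattern_confidence(hits, finds)  # 0.75 * log(4)
```

The ratio rewards patterns whose output mostly agrees with known tuples, while the log factor favours patterns that are productive.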
Distant Supervision for Relation Extraction
Distant supervision combines the advantages of bootstrapping with supervised learning.
Instead of just a handful of seeds, distant supervision uses a large database to acquire a huge number of seed examples, creates lots of noisy pattern features from all these examples, and then combines them in a supervised classifier.
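The labelling step can be sketched as below: every sentence that mentions both entities of a known fact is treated as a (noisy) training example for that fact's relation. The tiny knowledge base, the relation name, and the single "between" feature are illustrative assumptions; a real system would extract many features and train a classifier on the result.

```python
def label_sentences(kb, sentences):
    """Build noisy (features, relation) examples from a knowledge base."""
    examples = []
    for e1, relation, e2 in kb:
        for s in sentences:
            i, j = s.find(e1), s.find(e2)
            # Keep only sentences where e1 appears before e2.
            if i == -1 or j == -1 or i >= j:
                continue
            between = s[i + len(e1):j].strip()
            examples.append(({"between": between}, relation))
    return examples


kb = [("Albert Einstein", "place-of-birth", "Ulm")]
sentences = [
    "Albert Einstein was born in Ulm.",
    "Albert Einstein left Ulm as an infant.",  # mentions both, but not a birth statement
    "Ulm lies on the Danube.",
]

examples = label_sentences(kb, sentences)
```

The second sentence receives the `place-of-birth` label even though it does not express that relation, which is the kind of noise the supervised classifier is expected to average out.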
Unsupervised Relation Extraction
Open Information Extraction
A task which has the goal of extracting relations from the web when we have no labeled training data, and not even any list of relations.
Open Information Extraction
ReVerb 4 Steps
- Run a part-of-speech tagger and entity chunker over the sentence s.
- For each verb in s, find the longest sequence of words w that starts with a verb and satisfies syntactic and lexical constraints, merging adjacent matches.
- For each phrase w, find the nearest noun phrase x to the left that is not a relative pronoun, wh-word, or existential “there”. Find the nearest noun phrase y to the right.
- Assign confidence c to the relation r = (x, w, y) using a confidence classifier and return it.
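Steps 2 and 3 can be sketched on pre-tagged input. This is a heavily simplified illustration, not ReVerb itself: tagging and chunking (step 1) are assumed to be done already, the one-letter tags (V = verb, W = noun/adjective/adverb/pronoun/determiner, P = preposition/particle) and the regular expression are a coarse stand-in for the syntactic constraint, the lexical constraint is omitted, and the confidence classifier (step 4) is left out.

```python
import re

# Simplified relation-phrase constraint over one-letter tags (an assumption).
REL = re.compile(r"V+(?:W*P)?")
# Noun phrases that may not serve as the left or right argument.
SKIP = {"who", "whom", "whose", "which", "that", "what", "there"}


def extract(tokens, np_spans):
    """tokens: list of (word, tag); np_spans: list of (start, end) token indices."""
    words = [w for w, _ in tokens]
    tags = "".join(t for _, t in tokens)

    def text(span):
        return " ".join(words[span[0]:span[1]])

    usable = [s for s in np_spans if text(s).lower() not in SKIP]
    relations = []
    for m in REL.finditer(tags):
        left = [s for s in usable if s[1] <= m.start()]
        right = [s for s in usable if s[0] >= m.end()]
        if left and right:
            x = max(left, key=lambda s: s[1])   # nearest noun phrase to the left
            y = min(right, key=lambda s: s[0])  # nearest noun phrase to the right
            relations.append((text(x), text((m.start(), m.end())), text(y)))
    return relations


tokens = [("Ryanair", "W"), ("has", "V"), ("a", "W"),
          ("hub", "W"), ("at", "P"), ("Charleroi", "W")]
np_spans = [(0, 1), (2, 4), (5, 6)]
print(extract(tokens, np_spans))
```

Because the relation phrase absorbs "a hub", that noun phrase is not considered as the right-hand argument, and the nearest remaining noun phrase is picked instead.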
Temporal expressions
Expressions that refer to absolute points in time, relative times, durations and sets of those.
Absolute temporal expressions can be mapped directly to calendar dates, times of day, or both.
Relative temporal expressions map to particular times through some other reference point.
Durations denote spans of time at varying levels of granularity.
Temporal Normalization
The process of mapping a temporal expression to either a specific point in time, or to a duration.
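For relative expressions, normalization needs a reference date. The sketch below handles only three words and is meant to show the idea, not a usable rule set; the function name and the offsets table are assumptions.

```python
from datetime import date, timedelta

# Toy rule set: day offsets relative to the reference date.
OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def normalize_relative(expression, reference):
    """Map a relative temporal expression to an ISO 8601 date string."""
    key = expression.strip().lower()
    if key not in OFFSETS:
        raise ValueError(f"unsupported expression: {expression!r}")
    return (reference + timedelta(days=OFFSETS[key])).isoformat()


print(normalize_relative("tomorrow", date(2024, 1, 31)))  # 2024-02-01
```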
Fully qualified date expression
Contains a year, month and day in some conventional form.
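A fully qualified date can be normalized by trying a list of known formats. The three formats below are an assumed sample of "conventional forms", and reading slash-separated dates as day-first is likewise an assumption.

```python
from datetime import datetime

# Assumed conventional forms; month names are parsed in the default (English) locale.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def parse_full_date(text):
    """Return the ISO 8601 form of a fully qualified date, or raise ValueError."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"not a fully qualified date: {text!r}")


print(parse_full_date("April 24, 1916"))  # 1916-04-24
```

An expression such as "April 1916" is rejected because it lacks a day and therefore is not fully qualified.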
Event Extraction
The task of identifying mentions of events in texts.
7 Allen Relations
A before B
A overlaps B
A meets B
A equals B
A starts B
A finishes B
A during B
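The seven relations can be checked mechanically when each interval is given as a (start, end) pair. A minimal sketch, assuming proper intervals with start < end; the function name is hypothetical. When none of the seven holds for A relative to B, one of them holds with the roles swapped.

```python
def allen_relation(a, b):
    """Return the listed relation that holds for A relative to B, else None."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start and a_end < b_end:
        return "starts"
    if a_end == b_end and a_start > b_start:
        return "finishes"
    if a_start > b_start and a_end < b_end:
        return "during"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    return None  # an inverse relation holds, e.g. B before A


print(allen_relation((1, 4), (2, 6)))  # overlaps
```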