Information Extraction Flashcards
Information extraction
the activity of populating a structured information repository (database) from an unstructured / free-text information source
Difference in strengths between information retrieval (IR) and information extraction (IE)
IR:
- can search huge collections quickly
- insensitive to genre & domain of texts
- relatively straightforward to implement
IE:
- extracts facts from texts, not just relevant texts from a text collection
- resulting structured data source has many applications
Difference in weaknesses between IR and IE
IR:
- returns documents, not information/answers, so further reading is required
- not discriminating enough
IE:
- systems are genre/domain specific; porting to new genres/domains is time-consuming and difficult
- limited accuracy
- computationally demanding
information extraction task
Given: a document collection and a predefined set of entities, relations and events
Return: a structured representation of all mentions of the specified entities, relations and/or events
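A minimal sketch of what such a structured representation could look like, using Python dataclasses; the record and field names (EntityMention, RelationMention, ExtractionResult) are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record types for the structured output of one IE run.
@dataclass
class EntityMention:
    text: str    # surface form as it appears in the document
    etype: str   # e.g. "PERSON", "ORGANISATION"
    start: int   # character offset where the mention starts (its extent)
    end: int     # character offset just past the mention

@dataclass
class RelationMention:
    rtype: str   # e.g. "works_for"
    arg1: EntityMention
    arg2: EntityMention

@dataclass
class ExtractionResult:
    doc_id: str
    entities: List[EntityMention] = field(default_factory=list)
    relations: List[RelationMention] = field(default_factory=list)
    # event mentions would be represented analogously
```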
Named Entity Recognition
For each textual mention of an entity of one of a fixed set of types, identify its extent (the position interval it spans in the text) and its type (e.g. organisation, person)
Types of entities
named individuals, named kinds (objects), times, measures
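A toy sketch of identifying extent and type with a hand-made gazetteer; the gazetteer entries and type labels are invented for illustration and stand in for a real recogniser:

```python
import re

# Tiny illustrative gazetteer mapping known names to entity types.
GAZETTEER = {"Alice Smith": "PERSON", "Acme Corp": "ORGANISATION"}

def find_named_entities(text):
    """Return (start, end, surface form, type) for every gazetteer match."""
    mentions = []
    for name, etype in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            mentions.append((m.start(), m.end(), name, etype))
    return sorted(mentions)

print(find_named_entities("Alice Smith joined Acme Corp in 2020."))
# [(0, 11, 'Alice Smith', 'PERSON'), (19, 28, 'Acme Corp', 'ORGANISATION')]
```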
Coreference task
link together all the different textual expressions that refer to the same real-world entity, regardless of their surface form
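A crude sketch of linking different surface forms of the same entity by token overlap; real coreference resolution also handles pronouns, definite descriptions and context, so this is only an illustration:

```python
from collections import defaultdict

def cluster_mentions(mentions):
    """Group mentions whose tokens are a subset of a longer mention's tokens."""
    clusters = defaultdict(list)
    for m in sorted(mentions, key=len, reverse=True):  # longest mention = canonical form
        for canon in clusters:
            if set(m.lower().split()) <= set(canon.lower().split()):
                clusters[canon].append(m)
                break
        else:
            clusters[m].append(m)
    return dict(clusters)

print(cluster_mentions(["Barack Obama", "Obama", "Acme Corp", "Acme"]))
# {'Barack Obama': ['Barack Obama', 'Obama'], 'Acme Corp': ['Acme Corp', 'Acme']}
```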
relation extraction
identify all assertions of relations (usually binary) between entities identified in entity extraction; divided into 2 subtasks
relation detection
find pairs of entities between which a relation holds
relation classification
for those pairs of entities, determine what the relation is
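A minimal sketch of both subtasks over already-recognised entity mentions, using a single hand-written pattern for an assumed works_for relation; a real system would use many patterns or a trained classifier:

```python
import re
from itertools import combinations

def extract_relations(sentence, entities):
    """entities: list of (surface form, type). Returns (arg1, arg2, relation) triples."""
    relations = []
    for (e1, t1), (e2, t2) in combinations(entities, 2):
        # Relation detection: do the two mentions occur in a relation-bearing context?
        if re.search(re.escape(e1) + r"\s+works for\s+" + re.escape(e2), sentence):
            # Relation classification: the matching pattern and argument types name the relation.
            if t1 == "PERSON" and t2 == "ORGANISATION":
                relations.append((e1, e2, "works_for"))
    return relations

print(extract_relations(
    "Alice Smith works for Acme Corp.",
    [("Alice Smith", "PERSON"), ("Acme Corp", "ORGANISATION")]))
# [('Alice Smith', 'Acme Corp', 'works_for')]
```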
event detection & event classification
identify all reports of event instances, typically of a small set of classes; divided into 2 subtasks:
- event detection: finds mentions of events in a text
- event classification: assigns each detected event to one of a set of classes
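A toy trigger-word sketch of the two subtasks; the trigger list and event classes are illustrative assumptions:

```python
# Illustrative trigger words mapped to a small set of event classes.
EVENT_TRIGGERS = {"acquired": "ACQUISITION", "hired": "HIRING", "resigned": "RESIGNATION"}

def extract_events(text):
    events = []
    for i, token in enumerate(text.split()):
        word = token.strip(".,").lower()
        if word in EVENT_TRIGGERS:                          # event detection
            events.append((i, word, EVENT_TRIGGERS[word]))  # event classification
    return events

print(extract_events("Acme Corp acquired Widget Ltd and hired 200 engineers."))
# [(2, 'acquired', 'ACQUISITION'), (6, 'hired', 'HIRING')]
```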
knowledge engineering approaches
use manually authored rules and can be divided into:
- deep: linguistically inspired language understanding systems.
- shallow: systems engineered to the IE task, typically using pattern-action rules
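A sketch of one shallow pattern-action rule: a surface regular expression (the pattern) and code that adds a tuple to the repository when it fires (the action); the rule and relation name are made up:

```python
import re

# Pattern part: a surface-level regular expression with named groups.
CEO_PATTERN = re.compile(r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), CEO of (?P<org>[A-Z][A-Za-z ]+)")

def apply_rule(text, repository):
    # Action part: on every match, add a structured fact to the repository.
    for m in CEO_PATTERN.finditer(text):
        repository.append({"relation": "ceo_of", "person": m.group("person"), "org": m.group("org")})

facts = []
apply_rule("Jane Doe, CEO of Acme Corp, announced the merger.", facts)
print(facts)  # [{'relation': 'ceo_of', 'person': 'Jane Doe', 'org': 'Acme Corp'}]
```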
supervised learning approaches
for each entity/relation in a given text, create a training instance represented in terms of features, so that systems may learn patterns that match extraction targets and classifiers that classify tokens
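A minimal sketch of the feature-based classifier idea using scikit-learn; the feature set, tokens and labels are invented, and any feature-based learner could stand in for logistic regression:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(token):
    # One training instance per token, represented in terms of simple features.
    return {"lower": token.lower(), "is_capitalised": token[0].isupper(), "length": len(token)}

train_tokens = ["Alice", "works", "for", "Acme", "in", "Paris"]
train_labels = ["PERSON", "O", "O", "ORGANISATION", "O", "LOCATION"]

vec = DictVectorizer()
X = vec.fit_transform([token_features(t) for t in train_tokens])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

test_tokens = ["Bob", "joined", "Acme"]
X_test = vec.transform([token_features(t) for t in test_tokens])
print(list(zip(test_tokens, clf.predict(X_test))))  # toy data, so predictions are unreliable
```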
bootstrapping approaches
minimally supervised systems are given seed tuples and/or seed patterns to search the text for:
- occurrences of the seed tuples, then extract patterns that match the context of the seed tuples, from which new tuples are harvested
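A compressed sketch of one bootstrapping iteration: find occurrences of a seed tuple, keep the context between its elements as a pattern, then use that pattern to harvest new tuples; the corpus and the capital-of seed are invented:

```python
import re

corpus = ["Paris is the capital of France.", "Tokyo is the capital of Japan."]
seed_tuples = {("Paris", "France")}

def bootstrap_once(corpus, tuples):
    patterns, new_tuples = set(), set()
    # 1. Find occurrences of the seed tuples and keep the context between the two elements.
    for e1, e2 in tuples:
        for sent in corpus:
            m = re.search(re.escape(e1) + r"(.+?)" + re.escape(e2), sent)
            if m:
                patterns.add(m.group(1))  # e.g. " is the capital of "
    # 2. Apply the harvested patterns to the corpus to extract new tuples.
    for pat in patterns:
        for sent in corpus:
            m = re.search(r"(\w+)" + re.escape(pat) + r"(\w+)", sent)
            if m:
                new_tuples.add((m.group(1), m.group(2)))
    return patterns, new_tuples

print(bootstrap_once(corpus, seed_tuples))  # harvests ('Tokyo', 'Japan') as a new tuple
```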
distant supervision approaches
assumes a semi-structured data source which contains tuples of entities standing in the relation and a pointer to a source text
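A toy sketch of distant supervision: every sentence of the pointed-to source text that mentions both members of a known tuple is (optimistically, hence noisily) labelled as a positive training example for the relation; the tuples and sentences are invented:

```python
# Semi-structured data source: tuples known to stand in the born_in relation.
kb_tuples = [("Marie Curie", "Warsaw"), ("Alan Turing", "London")]

source_text = [
    "Marie Curie was born in Warsaw in 1867.",
    "Marie Curie gave a lecture in Warsaw.",   # noisy positive: no birth is stated
    "Alan Turing studied in Cambridge.",       # not labelled: tuple members absent
]

def distant_label(sentences, tuples, relation="born_in"):
    """Label any sentence containing both members of a tuple as a positive example."""
    examples = []
    for sent in sentences:
        for e1, e2 in tuples:
            if e1 in sent and e2 in sent:
                examples.append((sent, e1, e2, relation))
    return examples

for example in distant_label(source_text, kb_tuples):
    print(example)
```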
Evaluation of the IE system
Keys = correct answers, produced manually for each extraction task
Responses = the system's results
Scoring of responses against keys is done automatically
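A small sketch of the automatic scoring step, comparing responses against keys with exact-match precision, recall and F-measure (the standard IE scoring measures); the mention tuples are made up:

```python
def score(keys, responses):
    """keys, responses: sets of (start, end, type) mentions; exact-match scoring."""
    correct = len(keys & responses)
    precision = correct / len(responses) if responses else 0.0
    recall = correct / len(keys) if keys else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

keys = {(0, 11, "PERSON"), (19, 28, "ORGANISATION")}   # manually produced answers
responses = {(0, 11, "PERSON"), (30, 34, "DATE")}      # system output
print(score(keys, responses))  # (0.5, 0.5, 0.5)
```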