C6 Flashcards
information extraction (IE) with applications
extract structured information from unstructured or semi-structured text
applications: automatically identify mentions of medications and side effects in electronic health records, find company names in economic newspaper texts
2 types of information extraction tasks
- Named Entity Recognition (NER)
- Relation extraction
Named Entity Recognition
machine learning task based on sequence labelling:
- word order matters
- one entity can span multiple words
- multiple ways to refer to the same concept
=> extracted entities often need to be linked to a standard form
sequence labelling for NER
- sequence = sentence, element = word, label = entity type
- one label per token
- assigned tags capture both the boundary and the type
IOB tagging
format of training data
- each token gets a label (punctuation is tokenized separately and labelled too)
- one tag for the beginning (B) and one for the inside (I) of each entity type
- and a single tag for tokens outside (O) any entity
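a minimal illustration in Python (the sentence and the PER/ORG entity types are assumptions for the example):

tokens = ["Tim", "Wagner", "is", "a", "spokesman", "for", "American", "Airlines", "."]
tags   = ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG", "I-ORG", "O"]
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")   # one IOB tag per token, capturing boundary + type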
Hidden Markov Model (HMM)
probabilistic sequence model: given a sequence of units (words), it computes a probability distribution over possible label sequences and chooses the best one
probabilities are estimated by counting on a labelled training corpus
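a minimal Python sketch of the counting step, assuming a toy one-sentence training corpus:

from collections import Counter

# labelled training corpus: one (word, tag) sequence per sentence (toy data)
corpus = [[("United", "B-ORG"), ("is", "O"), ("a", "O"), ("unit", "O"),
           ("of", "O"), ("UAL", "B-ORG"), ("Corp", "I-ORG")]]

transitions, emissions, tag_counts = Counter(), Counter(), Counter()
for sentence in corpus:
    prev = "<s>"                        # start-of-sentence pseudo-tag
    for word, tag in sentence:
        transitions[(prev, tag)] += 1   # counts for P(tag | previous tag)
        emissions[(tag, word)] += 1     # counts for P(word | tag)
        tag_counts[tag] += 1
        prev = tag

# maximum-likelihood emission estimate, e.g. P("United" | B-ORG) = 1/2
print(emissions[("B-ORG", "United")] / tag_counts["B-ORG"])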
feature-based NER
supervised learning:
- each word is represented by a feature vector: for word x_i at position i, create a vector describing x_i and its context (e.g. capitalization, word shape, neighbouring words); see the sketch below
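a minimal sketch of such a feature function (the feature names and choices are illustrative assumptions):

def word_features(sentence, i):
    """Hypothetical feature vector (as a dict) for word x_i and its context."""
    word = sentence[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),   # capitalization is a strong entity cue
        "is_digit": word.isdigit(),
        "suffix3": word[-3:],
        "prev": sentence[i - 1].lower() if i > 0 else "<s>",
        "next": sentence[i + 1].lower() if i < len(sentence) - 1 else "</s>",
    }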
Part-Of-Speech tagging
Part-of-speech (POS) = category of words that have similar grammatical properties
- noun, verb, adjective, adverb
- pronoun, preposition, conjunction, determiner
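a quick example with the NLTK library (assuming its tokenizer and tagger models have been downloaded):

import nltk
# one-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
tokens = nltk.word_tokenize("American Airlines hired a new spokesman")
print(nltk.pos_tag(tokens))
# e.g. [('American', 'NNP'), ('Airlines', 'NNPS'), ('hired', 'VBD'), ...]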
Conditional Random Fields (CRF)
It is hard to add arbitrary features directly into a generative model like an HMM => more powerful model: CRF
- discriminative undirected probabilistic graphical model
- can take rich representations of observations (feature vectors)
- takes previous labels and context observations into account
- optimizes the label sequence as a whole; the best sequence is found with the Viterbi algorithm
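a hedged sketch with the sklearn-crfsuite library, reusing the hypothetical word_features function from the feature-based NER card (toy training data):

import sklearn_crfsuite

train_sents  = [["American", "Airlines", "hired", "Tim", "Wagner"]]
train_labels = [["B-ORG", "I-ORG", "O", "B-PER", "I-PER"]]

# one list of feature dicts per sentence (rich observation representations)
X_train = [[word_features(s, i) for i in range(len(s))] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, train_labels)
print(crf.predict(X_train))   # best label sequence per sentence (Viterbi decoding)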
commonly used neural sequence model for NER
bi-LSTM-CRF:
LSTM = Long Short-Term Memory, a recurrent neural network (RNN) architecture
a bi-LSTM runs one LSTM left-to-right and one right-to-left, so each token's representation captures context from both directions
but for NER a per-token softmax is insufficient, because neighbouring labels are strongly constrained (an I tag must follow an I or B tag of the same entity type) => CRF layer on top of the bi-LSTM output
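a minimal PyTorch sketch of the bi-LSTM part (all dimensions are illustrative assumptions; the CRF layer is only indicated in the comments and would come from a separate CRF implementation):

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        # maps concatenated forward+backward states to per-tag scores; a CRF
        # layer on top of these emission scores enforces valid tag transitions
        self.fc = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids):                 # (batch, seq_len)
        out, _ = self.lstm(self.emb(token_ids))   # (batch, seq_len, 2*hidden)
        return self.fc(out)                       # emission scores for the CRF

scores = BiLSTMTagger(vocab_size=5000, num_tags=9)(torch.randint(0, 5000, (1, 7)))
print(scores.shape)   # torch.Size([1, 7, 9])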
normalization of extracted mentions
suppose we have to extract company names and stock market info from newspaper text -> multiple extracted mentions can refer to the same concept
in order to normalize these, we need a list of concepts:
- knowledge bases (IMDb, Wikipedia)
- ontology
ontology linking approaches
- Define it as a text classification task with the ontology items as labels; challenges: huge label space, and we don’t have training data for all items
- Define it as a term similarity task: use embeddings trained for synonym detection
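a minimal sketch of the term-similarity approach, assuming we already have embedding vectors (the toy 3-d vectors and term names are made up):

import numpy as np

def link(mention_vec, term_vecs, term_names):
    """Link a mention to the ontology term with the most similar embedding."""
    m = mention_vec / np.linalg.norm(mention_vec)
    T = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)
    return term_names[int(np.argmax(T @ m))]      # highest cosine similarity

terms = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(link(np.array([0.9, 0.1, 0.0]), terms, ["aspirin", "ibuprofen"]))  # aspirin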
give a relation extraction example and three possible methods
example relations: Tim Wagner is a spokesman for American Airlines, United is a unit of UAL Corp.
methods:
1. Co-occurrence based
2. Supervised learning (most reliable)
3. Distant supervision (if labelled data is limited)
co-occurrence based relation extraction
assumption: entities that frequently co-occur are semantically connected
- use a context window (e.g. sentence) to determine co-occurrence
- we can create a network structure based on this
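a minimal Python sketch of counting sentence-level co-occurrences (the entity lists are hypothetical NER output):

from collections import Counter
from itertools import combinations

# entity mentions per sentence (context window = one sentence)
sentence_entities = [["United", "UAL Corp"],
                     ["Tim Wagner", "American Airlines"],
                     ["United", "UAL Corp", "American Airlines"]]

cooc = Counter()
for ents in sentence_entities:
    for a, b in combinations(sorted(set(ents)), 2):
        cooc[(a, b)] += 1        # edge weight in the co-occurrence network

print(cooc.most_common(1))       # [(('UAL Corp', 'United'), 2)]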
supervised relation extraction
assumption: two entities, one relation
relation extraction as classification problem
1. Find pairs of named entities (usually in the same sentence).
2. Apply a relation classifier to each pair; the classifier can use any supervised technique (see the sketch below)
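a minimal sketch of step 2 with scikit-learn, using bag-of-words features over the text between the two entities (the toy contexts and relation labels are assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# words between each entity pair, labelled with the relation (toy data)
contexts = ["is a spokesman for", "is a unit of", "works as spokesman for"]
labels   = ["employee_of", "unit_of", "employee_of"]

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(contexts), labels)
print(clf.predict(vec.transform(["is a unit of"])))   # e.g. ['unit_of']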
distant supervision relation extraction
Suppose we don’t have labelled data for relation extraction, but we do have a knowledge base => How could you use the knowledge base to identify relations in the text and discover relations that are not yet in the knowledge base?
- Start with a large, manually created knowledge base (e.g. IMDb)
- Find occurrences of pairs of related entities from the database in sentences
- Assumption: if two entities participate in a relation, any sentence that contains these entities expresses that relation
- Train a relation extraction classifier (supervised) on the found entity pairs and their context
- Apply the classifier to sentences with yet unconnected other entities in order to find new relations
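a minimal sketch of the labelling step (toy knowledge base; the distant_label helper is hypothetical):

# toy knowledge base of known relations (stand-in for e.g. IMDb)
kb = {("United", "UAL Corp"): "unit_of"}

def distant_label(sentence, entity_pairs):
    """Generate training examples by projecting KB relations onto sentences."""
    examples = []
    for e1, e2 in entity_pairs:
        rel = kb.get((e1, e2))
        # assumption: any sentence containing both entities expresses the relation
        if rel and e1 in sentence and e2 in sentence:
            examples.append((sentence, e1, e2, rel))
    return examples

print(distant_label("United is a unit of UAL Corp.", [("United", "UAL Corp")]))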
what is a named entity?
A named entity is a sequence of words that designates some real-world entity (typically a name)
challenges of NER
- ambiguity of segmentation (where are the boundaries of an entity?)
- type ambiguity (JFK can refer to the president or the airport)
- shift of meaning (“president of the US” changes from Obama to Trump)
why do we need knowledge bases or ontologies for NER?
multiple extracted mentions can refer to the same concept, so we want to normalize these and need a list of concepts
why are POS tags informative for NER?
some word categories are more likely to be (part of) an entity, e.g. proper nouns like "American Airlines"