C6 Flashcards
information extraction (IE) with applications
discover structured information from unstructured or semi-structured text
applications: automatically identify mentions of medications and side effects in electronic health records, find company names in economic newspaper texts
2 types of information extraction tasks
- Named Entity Recognition (NER)
- Relation extraction
Named Entity Recognition
machine learning task based on sequence labelling:
- word order matters
- one entity can span multiple words
- multiple ways to refer to the same concept
=> extracted entities often need to be linked to a standard form
sequence labelling for NER
- sequence = sentence, element = word, label = entity type
- one label per token
- assigned tags capture both the boundary and the type
IOB tagging
format of training data
- each token gets one label (punctuation marks are separate tokens and are labelled too)
- one tag for the beginning (B) and one for the inside (I) of each entity type
- and one tag (O) for tokens outside any entity
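A worked example (PER and ORG are an illustrative label set), tagging one of the sentences used later in these cards:

```python
# IOB-tagged tokens for: "Tim Wagner is a spokesman for American Airlines."
# Note the multi-word entities and the separately-tagged punctuation token.
tagged = [
    ("Tim", "B-PER"), ("Wagner", "I-PER"),
    ("is", "O"), ("a", "O"), ("spokesman", "O"), ("for", "O"),
    ("American", "B-ORG"), ("Airlines", "I-ORG"), (".", "O"),
]
```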
Hidden Markov Model (HMM)
probabilistic sequence model: given a sequence of units (words), it computes a probability distribution over possible label sequences and chooses the best one
probabilities are estimated by counting on a labelled training corpus
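A minimal sketch of that counting step, assuming a corpus of (word, tag) sentences in the format of the IOB example above; names are illustrative:

```python
from collections import Counter, defaultdict

# Maximum-likelihood estimates of HMM transition and emission probabilities,
# obtained by counting over a labelled corpus of (word, tag) sentences.
def estimate_hmm(corpus):
    transitions = defaultdict(Counter)  # counts for P(tag_i | tag_{i-1})
    emissions = defaultdict(Counter)    # counts for P(word_i | tag_i)
    for sentence in corpus:
        prev = "<START>"
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag

    def normalize(counters):
        # Turn raw counts into conditional probability distributions.
        return {key: {k: c / sum(counts.values()) for k, c in counts.items()}
                for key, counts in counters.items()}

    return normalize(transitions), normalize(emissions)
```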
feature-based NER
supervised learning:
- each word is represented by a feature vector: for the word x_i at position i, the vector describes x_i itself and its context
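A minimal sketch of such a feature function; the particular features and their names are illustrative:

```python
# Describe token x_i and its immediate context as a feature dict.
def word_features(sentence, i):
    word = sentence[i]
    return {
        "word.lower": word.lower(),     # normalized word form
        "word.istitle": word.istitle(), # capitalization hints at names
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],           # crude morphology
        "prev.lower": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }
```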
Part-Of-Speech tagging
Part-of-speech (POS) = category of words that have similar grammatical properties
- noun, verb, adjective, adverb
- pronoun, preposition, conjunction, determiner
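A worked example with Universal-Dependencies-style coarse tags (illustrative; tag sets differ between taggers). POS tags are often used as context features for feature-based NER:

```python
# Coarse POS tags for the running example sentence.
pos_tagged = [
    ("Tim", "PROPN"), ("Wagner", "PROPN"), ("is", "AUX"), ("a", "DET"),
    ("spokesman", "NOUN"), ("for", "ADP"),
    ("American", "PROPN"), ("Airlines", "PROPN"), (".", "PUNCT"),
]
```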
Conditional Random Fields (CRF)
generative models like HMMs make it hard to add rich features directly into the model => a more powerful model: the CRF
- discriminative undirected probabilistic graphical model
- can take rich representations of observations (feature vectors)
- takes previous labels and context observations into account
- optimizes the label sequence as a whole; the best sequence is found with the Viterbi algorithm
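A minimal training sketch with the sklearn-crfsuite package (a common choice, assumed installed), reusing the word_features function from the feature-based NER sketch above; the data and labels are illustrative:

```python
import sklearn_crfsuite  # consumes per-token feature dicts like word_features

sentences = [["Tim", "Wagner", "is", "a", "spokesman", "for",
              "American", "Airlines", "."]]
labels = [["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG", "I-ORG", "O"]]

# One feature-dict sequence per sentence.
X_train = [[word_features(s, i) for i in range(len(s))] for s in sentences]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, labels)
pred = crf.predict(X_train)  # best label sequence per sentence (Viterbi)
```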
commonly used neural sequence model for NER
bi-LSTM-CRF:
LSTM = Long Short-Term Memory, a recurrent neural network (RNN) architecture
a bi-LSTM runs two LSTMs over the sequence, left-to-right and right-to-left, so each token's representation captures context from both directions
but for NER a per-token softmax is insufficient, because neighbouring labels constrain each other (an I tag must follow an I or B tag) => add a CRF layer on top of the bi-LSTM output
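A minimal PyTorch sketch of the bi-LSTM half of the architecture; dimensions and names are illustrative. The bi-LSTM produces one emission score per token per label, and a CRF layer (e.g. the pytorch-crf package) would sit on top of these emissions instead of a softmax, learning transition scores that penalize illegal neighbours such as O -> I-PER:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                            batch_first=True)
        # Both directions are concatenated, hence 2 * hidden.
        self.emit = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):            # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))
        return self.emit(h)                  # (batch, seq_len, num_labels)
```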
normalization of extracted mentions
suppose we have to extract company names and stock market info from newspaper text -> multiple extracted mentions can refer to the same concept
in order to normalize these, we need a list of concepts:
- knowledge bases (IMDb, Wikipedia)
- ontologies
ontology linking approaches
- Define it as a text classification task with the ontology items as labels. Challenges: a huge label space, and no training data for all items
- Define it as a term-similarity task: use embeddings trained for synonym detection
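A minimal sketch of the term-similarity approach: embed the extracted mention and every ontology term, then link the mention to the nearest term by cosine similarity. Here embed() is a placeholder for any encoder trained for synonym detection:

```python
import numpy as np

def link(mention, ontology_terms, embed):
    m = embed(mention)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Return the ontology term whose embedding is closest to the mention's.
    return max(ontology_terms, key=lambda t: cosine(m, embed(t)))
```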
give a relation extraction example and three possible methods
example relations: Tim Wagner is a spokesman for American Airlines; United is a unit of UAL Corp.
methods:
1. Co-occurrence based
2. Supervised learning (most reliable)
3. Distant supervision (if labelled data is limited)
co-occurrence based relation extraction
assumption: entities that frequently co-occur are semantically connected
- use a context window (e.g. sentence) to determine co-occurrence
- we can create a network structure based on this
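A minimal sketch, assuming entity mentions have already been extracted per sentence (the context window here); pair counts then become weighted edges in the network:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences_with_entities):
    # Each input item is the set of entity mentions found in one sentence.
    counts = Counter()
    for entities in sentences_with_entities:
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

edges = cooccurrence_counts([
    {"United", "UAL Corp."},
    {"United", "UAL Corp."},
    {"American Airlines", "Tim Wagner"},
])
# -> Counter({('UAL Corp.', 'United'): 2,
#             ('American Airlines', 'Tim Wagner'): 1})
```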
supervised relation extraction
assumption: two entities, one relation
relation extraction as classification problem
1. Find pairs of named entities (usually in the same sentence).
2. Apply a relation classifier to each pair; the classifier can use any supervised technique.
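A minimal sketch of step 2 as classification (scikit-learn). The features here are just the words between the two entity mentions, and the training pairs and relation labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One training example per entity pair: the text between the two mentions.
between_words = ["is a spokesman for", "is a unit of", "is a unit of"]
relations = ["spokesman_of", "unit_of", "unit_of"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(between_words, relations)
print(clf.predict(["is a subsidiary of"]))  # likely 'unit_of'
```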