Lecture 5 - Relation Extraction & Question Answering Flashcards
What is a relation triple?
a simple relation between a predicate and two arguments, in the form subject - predicate - object | e.g.: Golden Gate Park - location - San Francisco
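A minimal sketch of one way to represent a triple in code (a toy of my own; the type and field names are not from the lecture):

```python
from collections import namedtuple

# A relation triple: subject - predicate - object.
RelationTriple = namedtuple("RelationTriple", ["subject", "predicate", "obj"])

# The lecture's example: Golden Gate Park is located in San Francisco.
triple = RelationTriple("Golden Gate Park", "location", "San Francisco")
print(triple.subject, triple.predicate, triple.obj)
```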
Where is relation extraction used?
- creating new structured knowledge bases (useful for any app)
- augmenting current knowledge bases (e.g. adding words to WordNet)
- supporting question answering
What is the Automated Content Extraction (ACE)?
ACE is a relation-extraction task that defines a set of 17 relations, the ones its creators identified as most important
e.g. PHYSICAL - LOCATED
PERSON-SOCIAL - FAMILY
What are the three methods to extract relations?
- Hand-written patterns
- Supervised Machine Learning
- Semi-supervised and unsupervised
Briefly, what does the Hearst paper say about patterns?
- there are many lexico-syntactic patterns (e.g. “X such as Y”) that suggest two entities are in an IS-A (hyponym) relation
- these patterns can learn the IS-A relation even between new terms, e.g. that a “Bambara ndang” is a kind of “bow lute” (see the sketch below)
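As a rough illustration, two of Hearst's patterns can be written as regexes over raw text (a toy sketch assuming untagged text; a real system would match over parsed NPs):

```python
import re

# Two classic Hearst patterns for the IS-A (hyponym) relation:
#   "X such as Y"    -> Y IS-A X   (hypernym first)
#   "X and other Y"  -> X IS-A Y   (hyponym first)
PATTERNS = [
    (re.compile(r"(\w[\w ]*?),? such as (\w[\w ]*)"), "hypernym-first"),
    (re.compile(r"(\w[\w ]*?),? and other (\w[\w ]*)"), "hyponym-first"),
]

def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs suggested by the patterns."""
    pairs = []
    for pattern, order in PATTERNS:
        for m in pattern.finditer(sentence):
            if order == "hypernym-first":
                pairs.append((m.group(2), m.group(1)))
            else:
                pairs.append((m.group(1), m.group(2)))
    return pairs

# Hearst's own example: learns that a "Bambara ndang" IS-A "bow lute".
print(extract_isa("the bow lute, such as the Bambara ndang"))
# [('the Bambara ndang', 'the bow lute')]
```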
What are some pros and cons of hand-written relation extraction?
PRO:
Human patterns tend to be high precision
Can be tailored to specific domains
CON:
Human patterns are often low recall
A lot of work to think about all the patterns corresponding to all the relations
What are the steps for supervised relation extraction?
- Choose a set of relations we’d like to extract
- Choose a set of relevant named entities
- Find and label data
* choose a representative corpus
* label the named entities in the corpus
* hand label the relations between them
* break into training, development, and test sets
- Train a classifier on the training set
How do you do classification in supervised relation extraction?
- Find all pairs of named entities
- Decide if two entities are related
- If yes, classify the relation
we need the extra step 2 because most pairs are unrelated, and it is faster to drop the unimportant pairs first
How I think it works: you represent the sentence with features (headwords, bag of words, bigrams, and so on), and you also have the gold label for the relation (because the sentence contains the two entities you are looking for and the relation between them); labeled data like this can come from ACE
Then you can use an SVM, Naive Bayes, and so on to train the model on that dataset (a toy sketch follows below)
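A toy version of that idea with scikit-learn, assuming the related/not-related filtering is folded into a "NONE" class (the data and features here are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hand-labeled examples: the words around an entity pair (entities
# replaced by their named-entity types), plus the gold relation label.
contexts = [
    "PER was born in LOC",
    "PER , born in LOC",
    "PER visited LOC yesterday",   # "NONE" marks an unrelated pair
    "ORG is headquartered in LOC",
]
labels = ["born-in", "born-in", "NONE", "based-in"]

# Bag-of-words + bigram features and Naive Bayes, as described above.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(contexts, labels)

print(model.predict(["PER was born in LOC in 1879"]))  # likely ['born-in']
```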
How do you evaluate a supervised relation extraction model? What are the formulas?
You compute precision, recall and F1 score
P = # correctly extracted relations / total # of extracted relations
R = # correctly extracted relations / total # of gold relations
F1 = 2PR / (P + R)
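A quick worked example with made-up counts: suppose the system extracts 10 relations, of which 6 are correct, and the gold standard contains 12 relations:

```python
correct = 6      # correctly extracted relations
extracted = 10   # total relations the system extracted
gold = 12        # relations in the gold standard

p = correct / extracted       # P  = 6/10 = 0.6
r = correct / gold            # R  = 6/12 = 0.5
f1 = 2 * p * r / (p + r)      # F1 = 0.6/1.1 ≈ 0.545
print(p, r, round(f1, 3))
```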
What are some pros and cons of supervised relation extraction?
PRO:
can get high accuracy with enough hand-labeled data, if the test set is similar enough to the training set
CON:
labeling a large training set is expensive
supervised models do not generalize well to different genres
What are the three semi-supervised and unsupervised relation extraction models that we learned?
- Bootstrapping (using seeds)
- Distant Supervision
- Unsupervised learning from the web
When should you use bootstrapping relation extraction?
When you don’t have a hand-labeled dataset but you have some seed tuples or some high precision patterns
How does bootstrapping relation extraction work?
- Gather a set of seed pairs that have the relation R
- Iterate:
* Find sentences with these pairs (maybe from web)
* Look at the context between or around the pair and generalize the context to create patterns
* Use the patterns to grep for new pairs
e.g. seed tuple <Mark Twain, Elmira>
grep (Google) for the environments of the seed tuple:
“Mark Twain is buried in Elmira” - X is buried in Y
“The grave of Mark Twain is in Elmira” - The grave of X is in Y
“Elmira is Mark Twain’s final resting place” - Y is X’s final resting place
use these patterns to grep for new tuples
iterate
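A compressed sketch of that loop (the corpus, the pattern generalization, and the relation are toy stand-ins; real systems are far more careful about pattern quality):

```python
import re

corpus = [
    "Mark Twain is buried in Elmira .",
    "The grave of Mark Twain is in Elmira .",
    "Will Rogers is buried in Claremore .",
]

seeds = {("Mark Twain", "Elmira")}  # seed pairs for a buried-in relation

def find_patterns(pairs):
    """Generalize the context around each seed pair into a regex."""
    patterns = set()
    for x, y in pairs:
        for sent in corpus:
            if x in sent and y in sent:
                patt = re.escape(sent)
                patt = patt.replace(re.escape(x), "(.+)")
                patt = patt.replace(re.escape(y), "(.+)")
                patterns.add(patt)
    return patterns

def find_pairs(patterns):
    """Grep the corpus with each pattern to harvest new pairs."""
    pairs = set()
    for patt in patterns:
        for sent in corpus:
            m = re.match(patt, sent)
            if m:
                pairs.add((m.group(1), m.group(2)))
    return pairs

# One bootstrapping iteration: seeds -> patterns -> new pairs.
seeds |= find_pairs(find_patterns(seeds))
print(seeds)  # now also contains ('Will Rogers', 'Claremore')
```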
What is the Distant Supervision algorithm?
It combines bootstrapping with supervised learning
- use a large database to get a huge # of seed tuples
- create lots of features with these examples
- combine in supervised classifiers
How does the Distant Supervision algorithm work?
- For each relation (e.g. born-in)
- For each tuple in a big database (e.g. <Hubble, Marshfield>, <Einstein, Ulm>)
- Find sentences in large corpus with both entities (e.g. “Hubble was born in Marshfield”, “Einstein, born in Ulm”)
- Extract frequent features (parse, words) (e.g. “PER was born in LOC”, “PER, born in LOC”)
- Train supervised classifiers using thousands of features
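A rough sketch of the first three steps for one relation (toy corpus and string-level features; a real system would use a Freebase-scale database and parsed features):

```python
# Tuples known (from a big database) to hold for the relation born-in.
kb = [("Hubble", "Marshfield"), ("Einstein", "Ulm")]

corpus = [
    "Hubble was born in Marshfield",
    "Einstein , born in Ulm , won the Nobel Prize",
    "Hubble visited Marshfield",
]

def featurize(sent, e1, e2):
    """Generalize a sentence by replacing entities with their NE types."""
    return sent.replace(e1, "PER").replace(e2, "LOC")

# Every sentence containing both entities of a born-in tuple becomes
# a (noisy) positive training example for the supervised classifier.
training = [(featurize(sent, e1, e2), "born-in")
            for e1, e2 in kb
            for sent in corpus
            if e1 in sent and e2 in sent]

for feat, label in training:
    print(label, "<-", feat)
# born-in <- PER was born in LOC
# born-in <- PER visited LOC   (a noisy example: this is why distant supervision is noisy)
# born-in <- PER , born in LOC , won the Nobel Prize
```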
How does unsupervised relation extraction work?
- Use parsed data to train a “trustworthy tuple” classifier
- In a single pass, extract all relations between NPs, keeping those judged trustworthy
- Assessor ranks relations based on text redundancy
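A tiny sketch of the assessor's redundancy ranking (the extraction and trustworthiness steps are assumed to have already happened):

```python
from collections import Counter

# Tuples extracted in a single pass over the corpus (already filtered
# by the "trustworthy tuple" classifier in the real system).
extracted = [
    ("Einstein", "born in", "Ulm"),
    ("Einstein", "born in", "Ulm"),
    ("Einstein", "visited", "Ulm"),
]

# Assessor: rank relations by text redundancy, i.e. how many times
# each tuple was extracted from independent sentences.
for triple, count in Counter(extracted).most_common():
    print(count, triple)
# 2 ('Einstein', 'born in', 'Ulm')   <- ranked highest
# 1 ('Einstein', 'visited', 'Ulm')
```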
How do you evaluate semi-supervised and unsupervised relation extraction models?
You can only approximate precision: draw a random sample of relations from the output and check precision manually (there is no gold set, so recall cannot be computed)
What are the three types of question answering models?
IR-based QA
Knowledge-based QA
Hybrid QA
What are the two main question types?
Factoid - Where is Apple based?
Complex (narrative) - What do scholars think about Jefferson’s position on dealing with pirates?
both these types of questions can be answered | Factoid questions are the ones handled in commercial applications | Complex questions are generally answered more in research systems
What is the intuition behind the three types of question answering models? (briefly)
IR-based: go find the answer in some string on the web
Knowledge-based: build an answer by understanding a parse of the question
Hybrid: take a combination of these 2 approaches (most modern systems)
What are the three main steps of IR-based QA?
QUESTION PROCESSING
- detect question types, answer type, focus, relations
- formulate queries to send to a search engine
PASSAGE RETRIEVAL
- retrieve ranked documents
- break into suitable passages and rerank
ANSWER PROCESSING
- extract candidate answers
- rank candidates using evidence from the text and external sources
Swipe to see how IR-based QA works in my own words
- it starts with a question and begins by extracting information from the question itself (most important and common: a query that will be sent to an IR engine, and the answer type, which tells us what kind of entity we are looking for)
- in advance, we take a lot of documents and index them, so that when we have a query we can return many relevant documents
- from those documents we extract passages (i.e. parts of those documents) → these are processed in answer processing (by looking at what type of answer we are looking for) → and then the answer is returned (a toy end-to-end sketch follows below)
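A toy end-to-end run of the three stages on the factoid question from earlier (everything here, from the answer-type rule to the two-document "index", is a deliberately crude stand-in):

```python
import re

DOCS = [
    "Apple is an American company based in Cupertino , California .",
    "Apple sells the iPhone .",
]

def detect_answer_type(question):
    # Question processing, rule 1: "where" questions want a LOCATION.
    return "LOCATION" if question.lower().startswith("where") else "OTHER"

def formulate_query(question):
    # Question processing, rule 2: keep the content words as the query.
    stop = {"where", "what", "is", "are", "the", "a"}
    return [w for w in re.findall(r"\w+", question.lower()) if w not in stop]

def answer(question):
    atype = detect_answer_type(question)
    query = formulate_query(question)
    # Passage retrieval: rank documents by query-term overlap.
    best = max(DOCS, key=lambda d: sum(w in d.lower() for w in query))
    # Answer processing: extract a candidate of the right answer type
    # (a crude "based in X" pattern stands in for a location tagger).
    if atype == "LOCATION":
        m = re.search(r"based in ([\w ,]+?) \.", best)
        return m.group(1) if m else None
    return None

print(answer("Where is Apple based?"))  # Cupertino , California
```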
What things should the model extract from the question we are asking?
“What are the two states that border Florida?”
- answer type (name, entity, number)
- query formulation (two states, border, Florida)
- focus detection (two states) - find the question word/s that can be replaced by the answer
- relation extraction (borders(Florida, ?x, north))
Briefly what is knowledge-based and hybrid-based QA?
Knowledge-based: builds a semantic representation of the query (times, dates, locations), and then maps from this semantic representation to structured data or resources (geospatial databases, ontologies like Wikipedia, restaurant reviews, and so on)
Hybrid: builds a shallow semantic representation of the query, then generates answer candidates using IR methods, and then scores each candidate using richer knowledge sources (like those above)