POS-Parsing 5 Flashcards
What are the 8 parts of speech (POS) taught in grammar school?
noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection
The collection of POS tags used is called?
tagset
What is a tagset?
A tagset contains all part-of-speech tags used for a specific corpus and what the tags mean (e.g., VBD = verb in past tense)
• Tags are usually uppercase (DT, ADJ, VBD)
• Similar tags often share a prefix (e.g., V… = related to verbs)
• Tagsets are language-specific and corpus-specific
(e.g., social media corpora have a tag for emoticons)
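A tagset can be pictured as a simple mapping from tag to meaning. The sketch below uses a small, hand-picked excerpt of Penn Treebank tags (not the full tagset); the helper name `tags_with_prefix` is made up for illustration. It shows how the shared prefix groups the verb tags.

```python
# Illustrative excerpt of the Penn Treebank tagset (not the complete set).
PENN_EXCERPT = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBZ": "verb, 3rd person singular present",
}


def tags_with_prefix(tagset, prefix):
    """Return all tags in the tagset that start with the given prefix, sorted."""
    return sorted(tag for tag in tagset if tag.startswith(prefix))


verb_tags = tags_with_prefix(PENN_EXCERPT, "VB")
print(verb_tags)  # ['VB', 'VBD', 'VBG', 'VBZ']
```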
Name two tagsets.
Penn Treebank Tagset and Universal Tagset
Name three difficulties of POS tagging.
A word can have multiple possible POS tags
Many of these ambiguous words are common words
Tagging can be difficult even for experienced human labellers
What are homonyms?
Two distinct words that have the same spelling are called homonyms.
Sentences that can be derived by a grammar are in the formal language defined by that grammar, and are called?
grammatical sentences
Sentences that cannot be derived by a given formal grammar are not in the language defined by that grammar and are referred to as?
ungrammatical
In linguistics, the use of formal languages to model natural languages
is called?
generative grammar
since the language is defined by the set of possible sentences “generated” by the grammar
What is syntactic parsing?
The task of recognizing a sentence and assigning a syntactic structure to it.
Name three types of parsing.
Constituency Parsing
Dependency Parsing
Syntactic Parsing (the umbrella term; constituency and dependency parsing are both forms of it)
What is constituency parsing?
Constituency parsing aims to extract a constituency-based parse tree
from a sentence that represents its syntactic structure according to a
phrase structure grammar
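A constituency parse tree can be written as nested tuples of the form (label, children...). The sketch below is only an illustration: the tree is hand-written rather than produced by a parser, and the function names `leaves` and `constituents` are assumptions.

```python
# Hand-written parse tree for "the dog chased a cat"; each node is (label, children...).
TREE = (
    "S",
    ("NP", ("DT", "the"), ("NN", "dog")),
    ("VP", ("VBD", "chased"), ("NP", ("DT", "a"), ("NN", "cat"))),
)


def is_preterminal(tree):
    """A preterminal node is a POS tag sitting directly above a single word."""
    return len(tree) == 2 and isinstance(tree[1], str)


def leaves(tree):
    """Return the words of the tree from left to right."""
    if is_preterminal(tree):
        return [tree[1]]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def constituents(tree):
    """Return (label, phrase) for every phrase-level node, in pre-order."""
    if is_preterminal(tree):
        return []
    result = [(tree[0], " ".join(leaves(tree)))]
    for child in tree[1:]:
        result.extend(constituents(child))
    return result


for label, phrase in constituents(TREE):
    print(label, "->", phrase)
```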
What is dependency parsing?
Dependency parsing is based on dependency grammars, which focus on how words relate to other words.
Dependency is a binary relation between a head (or: governor) and its dependents.
• The head of a sentence is usually the finite verb.
• Every other word in the sentence depends on it either directly or through a
path of dependencies
Caveat: there are multiple theories for dependency parsing that may yield different results!
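The head/dependent relation can be stored as one head index per word. The sketch below hand-annotates one sentence under one common convention (the noun heads its determiner); other dependency theories may attach words differently. The names `HEADS`, `path_to_root` and `dependents` are illustrative only.

```python
TOKENS = ["The", "dog", "chased", "a", "cat"]
# HEADS[i] is the index of the head of TOKENS[i]; -1 marks the root (the finite verb).
HEADS = [1, 2, -1, 4, 2]


def path_to_root(index, heads):
    """Follow head links from a word up to the root and return the visited indices."""
    path = [index]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def dependents(head_index, heads):
    """Return the indices of all words that depend directly on the given head."""
    return [i for i, head in enumerate(heads) if head == head_index]


# "a" depends on "cat", which depends on the root verb "chased".
print([TOKENS[i] for i in path_to_root(3, HEADS)])  # ['a', 'cat', 'chased']
print([TOKENS[i] for i in dependents(2, HEADS)])    # ['dog', 'cat']
```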
Why is syntactic parsing important?
Give 2 reasons.
• Grammar checking
• Understand the subject/main verb/object of a sentence; useful in
downstream tasks, e.g. question answering, information extraction
What is Chunking?
Chunking is the process of extracting phrases from unstructured text.
• E.g. Instead of just extracting simple tokens, which may not represent the
actual meaning of the text, it is advisable to treat a phrase such as
“South Africa” as a single unit instead of ‘South’ and ‘Africa’ as
separate words.
Give some reasons for chunking and areas it is used.
• For entity detection
  • Proper names (e.g., Monty Python)
  • Definite noun phrases (e.g., the knights who say “ni”)
  • Sometimes also indefinite nouns or noun chunks (e.g., every student or cats)
• Help multiple NLP tasks
  • Information retrieval (search engines)
  • Text classification
  • Sentence simplification/paraphrase
  • Summarisation
What are named entities?
Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, etc.
What is the goal of a named entity recognition (NER) system? And what is NER useful for?
The goal is to identify all textual mentions of named entities.
• Two steps
• Identify the boundaries of the NE (e.g., by NP-chunking)
• Identify the type of the NE (e.g., by Naïve Bayes classification)
• NER is useful for
• Information extraction
• Answering specific questions, e.g. “Who is the president of the US?”
• Instead of retrieving a whole sentence, just present the NE “Joe Biden”
Give 4 problems associated with named entity recognition.
• Simple word lookup incorrectly identifies ordinary words as NEs, e.g. in location discovery
• Lists of people's names or organizations have poor coverage
e.g. It is hard to keep up with new people or organizations
• Named entity terms are ambiguous
e.g. May and North are DATE and LOCATION, but can also be PERSON
• Further challenge: multi-word terms
e.g. Stanford University, …
Why are POS tags helpful?
• Text to speech (how do we pronounce “abstract”, “lead”, “read”)
• Find phrases (Article Adj* N → noun phrases)
• Input for downstream NLP tasks (e.g. parsing, chunking, named entity
recognition)
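The pattern Article Adj* N can be matched directly over a POS-tagged sentence. The sketch below assumes Penn Treebank style tags (DT for the article, JJ for adjectives, any tag starting with NN for the noun) and hand-tagged input; `find_noun_phrases` is a hypothetical helper, not a library function.

```python
def find_noun_phrases(tagged):
    """Return all spans matching: one DT, any number of JJ, then one noun (NN*)."""
    phrases = []
    i = 0
    while i < len(tagged):
        if tagged[i][1] == "DT":
            j = i + 1
            # Skip over any adjectives that follow the article.
            while j < len(tagged) and tagged[j][1] == "JJ":
                j += 1
            # The span only counts as a noun phrase if a noun closes it.
            if j < len(tagged) and tagged[j][1].startswith("NN"):
                phrases.append(" ".join(word for word, _ in tagged[i:j + 1]))
                i = j + 1
                continue
        i += 1
    return phrases


TAGGED = [
    ("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumped", "VBD"), ("over", "IN"), ("a", "DT"), ("dog", "NN"),
]
print(find_noun_phrases(TAGGED))  # ['the quick brown fox', 'a dog']
```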
Name three ways to design a POS tagger.
Idea 1: Assign each word its most frequent tag
• Collect a large dataset with sentences and their POS tags
• For each word, find its most likely POS tag
• For a new sentence, label each word with its most probable POS tag
Idea 2: Train a classifier, using features such as
• Most probable POS tag of the word
• Prefixes: irreplaceable, unfortunate, inactive → strong clues for JJ
• Suffixes: fortunately, largely → a strong clue for RB (with exceptions, e.g. elderly)
• Capitalization: Meridian, USA, RHUL → a strong clue for NNP
• Other features, e.g. 35-year: the digit-NN pattern is a clue for JJ
Idea 3: Utilize contextual information
• Use POS tags of surrounding words as additional features
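The three ideas above can be sketched in a few lines. The training sentences are a tiny hand-made example (a real tagger needs a large annotated corpus), and all function names and the feature set are assumptions for illustration. Idea 1 is the most-frequent-tag baseline; the `features` function shows the kind of input a classifier would receive for Ideas 2 and 3.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set of (word, tag) sentences.
TRAIN = [
    [("the", "DT"), ("lead", "NN"), ("pipe", "NN")],
    [("they", "PRP"), ("lead", "VBP"), ("the", "DT"), ("team", "NN")],
    [("the", "DT"), ("lead", "NN"), ("is", "VBZ"), ("heavy", "JJ")],
]


def train_most_frequent_tag(sentences):
    """Idea 1: remember, for each word, the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, model, default="NN"):
    """Label each word with its most frequent tag; unseen words get a default tag."""
    return [(word, model.get(word.lower(), default)) for word in words]


def features(words, i, prev_tag):
    """Ideas 2 and 3: word-level clues plus the tag of the previous word."""
    word = words[i]  # assumes non-empty tokens
    return {
        "word": word.lower(),
        "prefix2": word[:2].lower(),
        "suffix2": word[-2:].lower(),
        "is_capitalized": word[0].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "prev_tag": prev_tag,
    }


model = train_most_frequent_tag(TRAIN)
# The baseline ignores context, so "lead" is tagged NN even where it is a verb.
print(tag_baseline(["they", "lead"], model))  # [('they', 'PRP'), ('lead', 'NN')]
print(features(["She", "left", "fortunately"], 2, "VBD")["suffix2"])  # ly
```

The mistagged "lead" in the first output is exactly the case where the previous tag (PRP) would help a classifier choose the verb reading.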
What is syntactic constituency?
Syntactic constituency is the idea that groups of words can behave as
single units, or constituents.
What is a context-free grammar?
A context-free grammar (CFG) consists of a set of rules or
productions, each of which expresses the ways that symbols of the
language can be grouped and ordered together, and a lexicon of
words and symbols.
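A toy CFG with its rules and lexicon can be used to check whether a sentence is grammatical, i.e. derivable from the grammar. The sketch below uses the CYK algorithm, which is one possible recognizer and is not named in these cards; it assumes the grammar is in Chomsky normal form (only binary rules plus a lexicon). The grammar itself is invented for illustration.

```python
from itertools import product

# Binary rules, keyed by their right-hand side: (B, C) -> set of A with A -> B C.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("DT", "NN"): {"NP"},
    ("VBD", "NP"): {"VP"},
}
# Lexicon: which preterminal symbols can produce each word.
LEXICON = {
    "the": {"DT"}, "a": {"DT"},
    "dog": {"NN"}, "cat": {"NN"},
    "chased": {"VBD"},
}


def is_grammatical(words, start="S"):
    """CYK recognition: True if the start symbol can derive the whole sentence."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds every symbol that can derive words[i..j] (inclusive).
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, set()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between left and right part
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]


print(is_grammatical("the dog chased a cat".split()))  # True
print(is_grammatical("dog the chased cat a".split()))  # False
```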
How would you identify named entities?
- Simple solution: look up each word in an appropriate list of names
- Doing this blindly has problems, e.g. with location discovery:
“Reading” is the name of a place, but it can also mean the activity of reading a book.
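The lookup approach and its weakness can be shown with a tiny name list. The gazetteer entries and the function name `lookup_entities` below are made up for illustration.

```python
# Tiny, hand-made list of names (a gazetteer); real lists are far larger.
GAZETTEER = {"reading": "LOCATION", "london": "LOCATION", "may": "DATE"}


def lookup_entities(tokens, gazetteer=GAZETTEER):
    """Label every token found in the gazetteer, ignoring case and context."""
    return [
        (token, gazetteer[token.lower()])
        for token in tokens
        if token.lower() in gazetteer
    ]


# Correct hit: here "Reading" really is a place.
print(lookup_entities("She lives in Reading".split()))  # [('Reading', 'LOCATION')]
# False positive: here "reading" is an activity, but blind lookup still tags it.
print(lookup_entities("I like reading books".split()))  # [('reading', 'LOCATION')]
```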
What is used for named entities and chunking in Python?
spaCy
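A minimal sketch of how spaCy exposes both named entities and noun chunks. It assumes spaCy is installed and that the English model `en_core_web_sm` has been downloaded; the wrapper function name is hypothetical, and the exact entities and labels returned depend on the model version.

```python
def extract_entities_and_chunks(text, model_name="en_core_web_sm"):
    """Return (named entities, noun chunks) found by a spaCy pipeline."""
    import spacy  # imported here so this file loads even if spaCy is missing

    nlp = spacy.load(model_name)  # assumes the model has been downloaded
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    chunks = [chunk.text for chunk in doc.noun_chunks]
    return entities, chunks


# Example call (needs: pip install spacy; python -m spacy download en_core_web_sm):
# entities, chunks = extract_entities_and_chunks("Stanford University is in California.")
```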