POS-Parsing 5 Flashcards

Question

What is used for named entities and chunking in Python?

Answer 1

A

noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection

Answer 2

A

A tagset lists all part-of-speech tags used for a specific corpus and what each tag means (e.g., VBD = verb in past tense)
• Tags are usually uppercase (DT, ADJ, VBD)
• Similar tags often share a prefix (e.g., V… = related to verbs)
• Tagsets are language-specific and corpus-specific
(e.g., social media corpora have a tag for emoticons)
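
The shared-prefix idea can be sketched in a few lines. The dictionary below is a small hand-picked subset of Penn Treebank tags, not the full tagset, and `group_by_prefix` is a hypothetical helper written for this card.

```python
from collections import defaultdict

# A small hand-picked subset of Penn Treebank tags (not the full tagset).
PENN_SUBSET = {
    "DT": "determiner",
    "JJ": "adjective",
    "JJR": "adjective, comparative",
    "JJS": "adjective, superlative",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBZ": "verb, 3rd person singular present",
}

def group_by_prefix(tagset, prefix_len=2):
    """Group tags that share their first `prefix_len` characters."""
    groups = defaultdict(list)
    for tag in sorted(tagset):
        groups[tag[:prefix_len]].append(tag)
    return dict(groups)

groups = group_by_prefix(PENN_SUBSET)
print(groups["VB"])  # all verb-related tags in the subset
print(groups["NN"])  # all noun-related tags in the subset
```

Grouping on the first two characters is only a convenience for this subset; real tagsets do not guarantee that every prefix is meaningful.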

Answer 3

A

Penn Treebank Tagset and Universal Tagset

Answer 4

A

A word can have multiple POS tags
Most ambiguous words are common words
Choosing the right tag can be difficult even for experienced human labellers

Answer 5

A

Two distinct words that have the same spelling are called homonyms.

Answer 6

A

grammatical sentences

Answer 7

A

ungrammatical

Answer 8

A

generative grammar

since the language is defined by the set of possible sentences “generated” by the grammar
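
A minimal sketch of what "generated by the grammar" means, assuming a toy grammar invented for this card. Because the grammar has no recursion, the set of sentences is finite and can be listed in full; strings in the set count as grammatical and all others as ungrammatical with respect to this grammar.

```python
from itertools import product

# Toy grammar: each non-terminal maps to a list of right-hand sides.
# Symbols that are not keys of GRAMMAR are treated as words (terminals).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["sleeps"], ["sees"]],
}

def generate(symbol):
    """Return every word sequence the grammar can derive from `symbol`."""
    if symbol not in GRAMMAR:  # terminal: a single word
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Expand each symbol on the right-hand side, then combine the options.
        expansions = [generate(s) for s in rhs]
        for combo in product(*expansions):
            results.append([word for part in combo for word in part])
    return results

# The language is the set of all sentences derivable from the start symbol S.
language = {" ".join(words) for words in generate("S")}
print(len(language))
print("the dog sees the cat" in language)   # grammatical for this grammar
print("dog the sees" in language)           # ungrammatical for this grammar
```

Note that this toy grammar also generates "the dog sleeps the cat", which shows that "grammatical" is always relative to the grammar being used.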

Answer 9

A

the task of recognizing a sentence and assigning a syntactic structure to it.

Answer 10

A

Constituency Parsing
Dependency Parsing
Syntactic Parsing

Answer 11

A

Constituency parsing aims to extract a constituency-based parse tree
from a sentence that represents its syntactic structure according to a
phrase structure grammar
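
A minimal sketch of constituency parsing, assuming a toy phrase structure grammar in Chomsky normal form and using the CKY chart algorithm (one common choice, not necessarily the one meant by the card). The grammar, lexicon and function name are hypothetical, and the sketch keeps only one tree per label, so it ignores ambiguity.

```python
# Toy phrase structure grammar in Chomsky normal form.
BINARY_RULES = {  # (left child, right child) -> parent
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "sees": "V"}

def cky_parse(words):
    """Return a bracketed parse tree for `words`, or None if there is none."""
    n = len(words)
    # chart[i][j] maps a constituent label to a tree covering words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        tag = LEXICON.get(word)
        if tag is None:  # unknown word: no parse with this lexicon
            return None
        chart[i][i + 1][tag] = f"({tag} {word})"
    for span in range(2, n + 1):          # width of the constituent
        for i in range(n - span + 1):     # start position
            j = i + span
            for k in range(i + 1, j):     # split point
                for left, ltree in chart[i][k].items():
                    for right, rtree in chart[k][j].items():
                        parent = BINARY_RULES.get((left, right))
                        if parent:
                            chart[i][j][parent] = f"({parent} {ltree} {rtree})"
    return chart[0][n].get("S")

tree = cky_parse("the dog sees the cat".split())
print(tree)
print(cky_parse("dog the sees".split()))  # no S constituent spans the input
```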

Answer 12

A

Dependency grammars focus on how words relate to other words.
A dependency is a binary relation between a head (or: governor) and its dependent.
• The head of a sentence is usually the finite verb.
• Every other word in the sentence depends on it, either directly or through a path of dependencies.
Caveat: there are multiple theories for dependency parsing that may yield different results!
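
A minimal sketch of how a dependency analysis can be stored: each word records the index of its head. The sentence is annotated by hand for illustration, with relation names in the style of Universal Dependencies; another theory might attach some words differently.

```python
# "The dog chased the cat": each word stores the index of its head.
# The finite verb "chased" is the root of the sentence, so its head is None.
words = ["The", "dog", "chased", "the", "cat"]
heads = [1, 2, None, 4, 2]
labels = ["det", "nsubj", "root", "det", "obj"]

def path_to_root(index):
    """Follow head links from a word up to the root of the sentence."""
    path = [words[index]]
    while heads[index] is not None:
        index = heads[index]
        path.append(words[index])
    return path

def dependents(index):
    """Return the words whose head is the word at `index`."""
    return [words[i] for i, h in enumerate(heads) if h == index]

# Each arc is (relation, head word, dependent word).
arcs = [
    (labels[i], words[h] if h is not None else "ROOT", words[i])
    for i, h in enumerate(heads)
]

print(path_to_root(0))   # "The" reaches the root through "dog"
print(dependents(2))     # direct dependents of the finite verb
print(arcs[1])
```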

Answer 13

A

• Grammar checking
• Understand the subject/main verb/object of a sentence; useful in
downstream tasks, e.g. question answering, information extraction

Answer 14

A

Chunking is the process of extracting phrases from unstructured text.
• E.g. instead of extracting only single tokens, which may not capture the
actual meaning of the text, it is advisable to treat a phrase such as
“South Africa” as a single unit rather than as the separate words
‘South’ and ‘Africa’.
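
A minimal sketch of noun-phrase chunking over POS-tagged tokens, assuming the input is already tagged with Penn Treebank tags. The tag pattern (optional determiner, any adjectives, one or more nouns) and the helper `np_chunks` are assumptions made for this card, not a standard library API.

```python
import re

def np_chunks(tagged):
    """Return noun-phrase chunks matching the tag pattern DT? JJ* NN+."""
    # Encode the tag sequence as a string such as "<DT><JJ><NN>" so that a
    # regular expression over tags can locate the chunks.
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    pattern = re.compile(r"(<DT>)?(<JJ>)*(<NN[A-Z]*>)+")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Convert character offsets back to token indices by counting "<".
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks

tagged = [
    ("She", "PRP"), ("visited", "VBD"), ("South", "NNP"), ("Africa", "NNP"),
    ("with", "IN"), ("an", "DT"), ("old", "JJ"), ("friend", "NN"),
]
chunks = np_chunks(tagged)
print(chunks)  # "South Africa" comes out as one unit
```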

Answer 15

A

• For entity detection
  • Proper names (e.g., Monty Python)
  • Definite noun phrases (e.g., the knights who say “ni”)
  • Sometimes also indefinite nouns or noun chunks
    (e.g., every student or cats)
• Help multiple NLP tasks
  • Information retrieval (search engines)
  • Text classification
  • Sentence simplification/paraphrase
  • Summarisation

Answer 16

A

Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, etc.

Answer 17

A

The goal of named entity recognition (NER) is to identify all textual mentions of named entities
• Two steps
  • Identify the boundaries of the NE (e.g., by NP-chunking)
  • Identify the type of the NE (e.g., by Naïve Bayes classification)
• NER is useful for
  • Information extraction
  • Answering specific questions, e.g. “Who is the president of the US?”
    • Instead of retrieving a whole sentence, just present the NE “Joe Biden”
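
The two steps can be sketched as follows. This is a toy illustration under strong assumptions: boundaries are found by grouping runs of capitalised tokens (a crude stand-in for NP-chunking that would also pick up sentence-initial words), and the type is chosen by a tiny Naive Bayes model with add-one smoothing trained on four made-up examples.

```python
import math
from collections import Counter, defaultdict

def find_boundaries(tokens):
    """Step 1: group maximal runs of capitalised tokens into candidate NEs."""
    spans, current = [], []
    for tok in tokens:
        if tok[:1].isupper():
            current.append(tok)
        elif current:
            spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans

# Made-up training examples, for illustration only.
TRAIN = [
    ("Barack Obama", "PERSON"),
    ("Joe Smith", "PERSON"),
    ("Oxford University", "ORGANIZATION"),
    ("Stanford Hospital", "ORGANIZATION"),
]

def train_nb(examples):
    """Count classes and per-class words for a bag-of-words Naive Bayes."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(span, model):
    """Step 2: pick the type with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in span:
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb(TRAIN)
tokens = "we met Joe Biden at Stanford University".split()
entities = [(" ".join(s), classify(s, model)) for s in find_boundaries(tokens)]
print(entities)
```

With so little training data the classifier is driven entirely by the words "Joe", "Stanford" and "University"; a realistic system would use far more data and richer features.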

Answer 18

A

• Simple word lookup incorrectly identifies words as NEs, e.g. in location discovery
• Lists of people's names or organizations have poor coverage
  e.g. it is hard to keep up with new people or organizations
• Named entity terms are ambiguous
  e.g. May and North are DATE and LOCATION, but can also be PERSON
• Further challenge: multi-word terms
  e.g. Stanford University, …

Answer 19

A

• Text to speech (how do we pronounce “abstract”, “lead”, “read”)
• Find phrases (Article Adj* N → noun phrases)
• Input for downstream NLP tasks (e.g. parsing, chunking, named entity
recognition)

Answer 20

A

Idea 1:
• Collect a large dataset with sentences and their POS tags
• For each word, find its most likely POS tag
• For a new sentence, label each word with its most probable POS tag

Idea 2: Train a classifier
• Most probable POS tag of the word
• Prefixes: irreplaceable, unfortunate, inactive → strong clues for JJ
• Suffixes: fortunately, largely → a strong clue for RB (with exceptions, e.g. elderly)
• Capitalization: Meridian, USA, RHUL → a strong clue for NNP
• Other features, e.g. 35-year: digit-NN, a clue for JJ

Idea 3: Utilize contextual information
• Use POS tags of surrounding words as additional features

Answer 21

A

Syntactic constituency is the idea that groups of words can behave as
single units, or constituents.

Answer 22

A

A context-free grammar (CFG) consists of a set of rules or
productions, each of which expresses the ways that symbols of the
language can be grouped and ordered together, and a lexicon of
words and symbols.

Answer 23

A

Simple solution: look up each word in an appropriate list of names
Doing this blindly has problems, e.g. with location discovery

E.g. “Reading” is a place name, but the same word can also refer to reading a book.
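
A minimal sketch of blind list lookup and why it fails. The gazetteer is a tiny hand-made dictionary for illustration, and `lookup_entities` is a hypothetical helper.

```python
# Tiny hand-made gazetteer (illustrative entries only).
GAZETTEER = {
    "Reading": "LOCATION",
    "May": "DATE",
    "North": "LOCATION",
    "Stanford University": "ORGANIZATION",
}

def lookup_entities(tokens):
    """Label every single token found in the gazetteer, ignoring context."""
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

first = lookup_entities("Reading a book in May is relaxing".split())
second = lookup_entities("Mrs May visited Stanford University".split())
print(first)   # "Reading" is labelled as a place although it is an activity here
print(second)  # "May" is labelled DATE although it is a person's name here
```

The second call also misses "Stanford University" entirely, because a token-by-token lookup never sees the two words together.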

(25 cards)