NLP basics Flashcards

Question 1

Q

Why is Web relevant for NLP?

Answer

A

It can be both:
- an application area (search engines, news summarization, chatbots, recommendation systems, ...)
- a resource for improving the quality of NLP models (text corpora, knowledge repositories such as Wikipedia, ...)

Question 2

Q

What are the common challenges for NLP?

Answer

A

How to remove noise (e.g. duplicates)?
How to assess the quality of content?
How to deal with errors, such as spelling or grammar mistakes?
How to clean the data?
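
As a rough illustration of the noise-removal challenge, here is a minimal sketch of duplicate removal. The helper names `normalize` and `deduplicate` are hypothetical, and treating texts that differ only in case or whitespace as duplicates is an assumption made for this example, not a general rule.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so near-identical texts compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized forms."""
    seen = set()
    unique = []
    for doc in documents:
        key = normalize(doc)
        # Skip empty documents and anything already seen.
        if key and key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["NLP is fun.", "nlp  is fun.", "", "Web data is noisy."]
print(deduplicate(docs))  # ['NLP is fun.', 'Web data is noisy.']
```

Real web data usually also needs near-duplicate detection, which this exact-match sketch does not attempt.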

Question 3

Q

What are morphemes?

Answer

A

Morphemes are the smallest units of language that carry meaning.

The word ‘dwarfs’ has two morphemes: ‘dwarf’ and ‘-s’ (plural). The word ‘loved’ also has two: ‘love’ and ‘-ed’ (past tense).

Question 4

Q

What are stems?

Answer

A

Stems are minimal free morphemes.
‘cat’ is a stem, but ‘-s’ is not. Stems carry the main meaning of words.

Question 5

Q

What are affixes, and what types exist?

Answer

A

They are bound morphemes.
- suffixes: appear after the base (cat + s)
- prefixes: appear before the base (un + true)
- infixes: appear inside the base (fan + bloody + tastic)
- circumfixes: appear on both sides of the base (ge + sag + t)
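
To make the prefix and suffix cases concrete, here is a toy sketch that peels one known prefix and one known suffix off a word. The affix lists and the function name `split_affixes` are hypothetical, and the approach is purely string-based, so it will happily mis-split words such as ‘under’.

```python
# Tiny, hand-picked affix lists (assumptions for this example only).
KNOWN_PREFIXES = ("un",)
KNOWN_SUFFIXES = ("s", "ed")

def split_affixes(word):
    """Return (prefix, base, suffix); missing parts are empty strings."""
    prefix = next(
        (p for p in KNOWN_PREFIXES if word.startswith(p) and len(word) > len(p)),
        "",
    )
    rest = word[len(prefix):]
    suffix = next(
        (s for s in KNOWN_SUFFIXES if rest.endswith(s) and len(rest) > len(s)),
        "",
    )
    base = rest[: len(rest) - len(suffix)]
    return prefix, base, suffix

print(split_affixes("cats"))      # ('', 'cat', 's')
print(split_affixes("untrue"))    # ('un', 'true', '')
print(split_affixes("unlocked"))  # ('un', 'lock', 'ed')
```

Infixes and circumfixes are left out here because they cannot be found by looking at the ends of the word alone.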

Question 6

Q

What is stemming?

Answer

A

Stemming is an algorithm that strips the endings of words, e.g. sitting -> sitt.
Its objective is to group words which belong to the same morphological family by reducing them to a shared stemmed representation. The stems obtained are not necessarily real words or word forms.

Two common problems with stemming:
- under-stemming: related words get different stems (adhere -> adher, adhesion -> adhes)
- over-stemming: unrelated words get the same stem (appendicitis -> append, append -> append)
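
The examples above can be reproduced with a toy suffix-stripping stemmer. This is only a sketch: the suffix list and the minimum stem length of 3 are assumptions chosen to match the card's examples, and it is not the Porter stemmer or any other established algorithm.

```python
# Suffixes are tried in this order, longest first (hypothetical list).
STEM_SUFFIXES = ("icitis", "ing", "ion", "ed", "es", "e", "s")
MIN_STEM_LENGTH = 3

def stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    for suffix in STEM_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

for w in ["sitting", "adhere", "adhesion", "appendicitis", "append"]:
    print(w, "->", stem(w))
# sitting -> sitt
# adhere -> adher
# adhesion -> adhes
# appendicitis -> append
# append -> append
```

Grouping words by `stem(word)` would put ‘adhere’ and ‘adhesion’ in different groups (under-stemming) and ‘appendicitis’ and ‘append’ in the same group (over-stemming).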

Question 7

Q

What is lemmatization?

Answer

A

Lemmatization transforms words into their base (dictionary) forms: plural becomes singular, past tense becomes present, and so on. It usually needs lexical resources and POS tagging to correctly identify the base form (left -> leave as a verb, left -> left as an adjective).
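
A minimal sketch of why the POS tag matters: the same surface form maps to different base forms depending on its tag. The lookup table `LEMMA_TABLE` stands in for a real lexical resource and, like the tag names, is an assumption made for this example.

```python
# Hypothetical miniature lexicon: (word, POS tag) -> lemma.
LEMMA_TABLE = {
    ("left", "VERB"): "leave",
    ("left", "ADJ"): "left",
    ("dwarfs", "NOUN"): "dwarf",
    ("loved", "VERB"): "love",
}

def lemmatize(word, pos):
    """Look up the lemma; fall back to the lowercased word if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lemmatize("left", "VERB"))    # leave
print(lemmatize("left", "ADJ"))     # left
print(lemmatize("Dwarfs", "NOUN"))  # dwarf
```

Unlike the stems produced by stemming, every value returned from the table is a real word.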

Question 8

Q

What are homophones?

Answer

A

Words that are pronounced the same (the same sequence of phonemes) but are spelled differently, e.g. ‘two’ and ‘too’.

Question 9

Q

What is tokenization?

Answer

A

Tokenization is a process of segmenting an input string into an ordered sequence of units.

Assuming one token is one word, tokenization segments a sentence into an ordered sequence of words (in the simplest case, words are separated by whitespace).

Question 10

Q

What is a token?

Answer

A

Tokens are the units produced by tokenization; they can be words or sub-words.

Question 11

Q

What is a tokenizer?

Answer

A

A system that splits text into tokens:

John likes Mary. -> [John, likes, Mary, .] (note that punctuation becomes its own token)
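
A tokenizer with this behaviour can be sketched with a single regular expression. The pattern and the function name `tokenize` are assumptions for this example; splitting on whitespace alone is shown for contrast.

```python
import re

# A token is either a run of word characters or one non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sentence = "John likes Mary."
print(sentence.split())    # ['John', 'likes', 'Mary.']
print(tokenize(sentence))  # ['John', 'likes', 'Mary', '.']
```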

Question 12

Q

What could be challenges of tokenization?

Answer

A

If we split at whitespace characters, what do we do with punctuation and special characters?
Are multi-word names one token or several?
Is a full stop always the end of a sentence? What about Mr. or a number such as 62.5?
If whitespace is a separator, what about numbers written as 1 543? Is that two numbers or one?
Commas can be part of numbers: 1,45
A single quote is ambiguous (apostrophe or quotation mark).
Different languages: Chinese, for example, does not separate words with whitespace.
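
Some of these cases can be handled by adding special rules ahead of the general ones. The sketch below is one possible approach, with a hypothetical abbreviation list and pattern; it keeps ‘Mr.’ and decimal numbers together but still splits ‘1 543’ into two tokens, which shows that not every challenge is solved by a regular expression.

```python
import re

# Order matters: abbreviations and numbers are tried before plain words.
CAREFUL_PATTERN = re.compile(
    r"(?:Mr|Mrs|Dr)\."      # a few known abbreviations (assumed list)
    r"|\d+(?:[.,]\d+)*"     # numbers such as 62.5 or 1,45
    r"|\w+"                 # ordinary words
    r"|[^\w\s]"             # any other single symbol
)

def careful_tokenize(text):
    """Tokenize while keeping abbreviations and decimal numbers intact."""
    return CAREFUL_PATTERN.findall(text)

print(careful_tokenize("Mr. Smith paid 62.5 euros."))
# ['Mr.', 'Smith', 'paid', '62.5', 'euros', '.']
print(careful_tokenize("1 543"))
# ['1', '543']
```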

Question 13

Q

What is syntax?

Answer

A

Syntax refers to the way words are arranged together to form phrases and sentences.

Question 14

Q

What is POS tagging?

Answer

A

POS (part-of-speech) tagging assigns a part-of-speech tag (noun, verb, adjective, adverb, preposition, …) to each word in a corpus or a sentence.

The input is a sequence of words and a set of available tags; the output is the sequence of tags that best fits the given words.

The dwarf loves -> Determiner Noun Verb

It resolves word ambiguities: ‘book’ can be a verb or a noun. Based on the context, POS tagging assigns the tag with the highest probability.
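
The idea of picking the most probable tag in context can be sketched with a greedy tagger that multiplies P(word | tag) by P(tag | previous tag). All probabilities below are made-up numbers for illustration, the tag names are assumptions, and a real tagger would typically search over whole tag sequences rather than decide one word at a time.

```python
# Hypothetical emission probabilities P(word | tag).
EMISSION = {
    "the": {"DET": 1.0},
    "a": {"DET": 1.0},
    "i": {"PRON": 1.0},
    "dwarf": {"NOUN": 1.0},
    "loves": {"VERB": 0.9, "NOUN": 0.1},
    "book": {"NOUN": 0.6, "VERB": 0.4},
}

# Hypothetical transition probabilities P(tag | previous tag).
TRANSITION = {
    "<s>": {"DET": 0.5, "PRON": 0.3, "NOUN": 0.1, "VERB": 0.1},
    "DET": {"NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"VERB": 0.7, "NOUN": 0.3},
    "PRON": {"VERB": 0.9, "NOUN": 0.1},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}

def tag(words):
    """Greedily assign each word its highest-scoring tag given the previous tag."""
    tags = []
    prev = "<s>"  # start-of-sentence marker
    for word in words:
        # Unknown words are assumed to be nouns.
        candidates = EMISSION.get(word.lower(), {"NOUN": 1.0})
        best = max(
            candidates,
            key=lambda t: candidates[t] * TRANSITION.get(prev, {}).get(t, 0.01),
        )
        tags.append(best)
        prev = best
    return tags

print(tag(["The", "dwarf", "loves"]))      # ['DET', 'NOUN', 'VERB']
print(tag(["I", "book", "a", "book"]))     # ['PRON', 'VERB', 'DET', 'NOUN']
```

In the second sentence the same word ‘book’ receives two different tags, because the preceding tag changes which reading is more probable.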

Question 15

Q

What is a phoneme?

Answer

A

A phoneme is the smallest unit of sound in a language that can change the meaning of a word. For example, in English, the words “bat” and “pat” differ by only one phoneme, /b/ and /p/, which distinguishes their meanings.

Question 16

Q

What is a bound morpheme?

Answer

A

A morpheme that cannot stand alone as a word, such as ‘-s’, which carries the meaning of plural.

Question 17

Q

What are homographs?

Answer

A

Words which have the same spelling but different meanings (I saw the saw)

Question 18

Q

What are semantics and lexical semantics?

Answer

A

Semantics:
▪ Study of the meaning of words, phrases, sentences, or documents

Lexical Semantics
▪ Study of the meaning of lexical units, i.e. words.

Question 19

Q

Give an example of Lexical Ambiguity

Answer

A

He hit the ball with the bat.
Chuck Norris can hit a bat with a ball.
In both sentences ‘bat’ is lexically ambiguous: it can mean the animal or the object used to hit a ball.

Question 20

Q

What is the purpose of an utterance?

Answer

A

The same words can convey different intended meanings depending on emphasis and context.

“I NEVER said she stole my money” = I simply didn’t ever say it.

“I never SAID she stole my money” = I might have implied it in some way, but I never explicitly said it.

“I never said SHE stole my money” = I said someone took it; I didn’t say it was her.

Utterance: “Is it cold in here or is it just me?”
Intended meaning: “Please close the window!”

Utterance: “Oh, great! Another meeting.”
Intended meaning: The speaker likely means the opposite of what they are literally saying; meetings might be something they dislike, despite the positive tone.

Question 21

Q

Answer

A

Lection 1 (21 cards)