NLP basics Flashcards

Lecture 1

1
Q

Why is the Web relevant for NLP?

A

It can be both:
- an application area (search engines, news summarization, chatbots, recommendation systems, ...)
- a resource for improving the quality of NLP models (text corpora, knowledge repositories such as Wikipedia, ...)

2
Q

What are the common challenges for NLP?

A
  • How to remove noise (e.g. duplicates)?
  • How to assess the quality of content?
  • How to deal with errors, such as spelling or grammar mistakes?
  • How to clean the data?
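
As a rough illustration of the duplicate-removal point, here is a minimal sketch that assumes documents are plain Python strings and that two documents count as duplicates when they match after lowercasing and whitespace normalization. The helper names `normalize` and `remove_duplicates` are made up for this example; real pipelines usually need fuzzier matching.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()

def remove_duplicates(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized forms."""
    seen = set()
    unique = []
    for doc in docs:
        key = normalize(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello  world", "hello world", "Another page"]
print(remove_duplicates(docs))  # ['Hello  world', 'Another page']
```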
3
Q

What are morphemes?

A

Morphemes are the smallest units of language that carry meaning.

The word ‘dwarfs’ has two morphemes, ‘dwarf’ and ‘s’ (plural); the word ‘loved’ has ‘love’ and ‘ed’ (past tense).

4
Q

What are stems?

A

Stems are minimal free morphemes.
‘cat’ is a stem, but ‘s’ is not. Stems carry the main meaning of words.

5
Q

What are affixes, and what types are there?

A

They are bound morphemes.
- suffixes: appear after the base (cat + s)
- prefixes: appear before the base (un + true)
- infixes: appear inside the base (fan + bloody + tastic)
- circumfixes: appear on both sides of the base (ge + sag + t, German ‘gesagt’)

6
Q

What is stemming?

A

Stemming is a heuristic procedure that strips the endings of words: sitting -> sitt
Its objective is to group words that belong to the same morphological family by mapping them to a shared stemmed representation. The stems obtained are not necessarily real words or word forms.

Stemming can fail in two ways. Under-stemming: related words end up with different stems (adhere -> adher, adhesion -> adhes). Over-stemming: unrelated words end up with the same stem (appendicitis -> append, append -> append).
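
The idea of stripping endings can be sketched as follows. This is a toy suffix stripper, not an established algorithm such as Porter's; the suffix list and the name `toy_stem` are assumptions made for illustration.

```python
# Hypothetical, deliberately tiny suffix list; real stemmers use many ordered rules.
SUFFIXES = ["ing", "ed", "s"]

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["sitting", "loved", "cats", "cat"]:
    print(w, "->", toy_stem(w))
# sitting -> sitt
# loved -> lov
# cats -> cat
# cat -> cat
```

Note that ‘sitt’ and ‘lov’ are not real words, which matches the point that stems need not be valid word forms.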

7
Q

What is lemmatization?

A

It transforms words into their base (dictionary) forms: plural becomes singular, past tense becomes present, and so on.
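
A minimal sketch of the lookup idea, assuming a hand-written mini-lexicon. The `LEMMAS` table and `toy_lemmatize` are hypothetical; practical lemmatizers rely on large dictionaries and usually on part-of-speech information as well.

```python
# Hypothetical mini-lexicon mapping inflected forms to their base forms.
LEMMAS = {
    "dwarfs": "dwarf",  # plural -> singular
    "mice": "mouse",    # irregular plural -> singular
    "loved": "love",    # past tense -> base form
}

def toy_lemmatize(word: str) -> str:
    """Return the base form if the word is in the lexicon, else the lowercased word."""
    key = word.lower()
    return LEMMAS.get(key, key)

print([toy_lemmatize(w) for w in ["Dwarfs", "loved", "mice", "cat"]])
# ['dwarf', 'love', 'mouse', 'cat']
```

Unlike a stemmer, the output here is always a real word, at the cost of needing a lexicon.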

8
Q

What are homophones?

A

Words that consist of the same sequence of phonemes (they are pronounced the same) but are spelled differently, for example ‘their’ and ‘there’.

9
Q

What is tokenization?

A

Tokenization is the process of segmenting an input string into an ordered sequence of units.

Assuming one token is one word, tokenization segments a sentence into an ordered sequence of words (words are separated by whitespace).
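
Under the one-token-per-word assumption, the simplest possible tokenizer is a whitespace split. The function name `whitespace_tokenize` is chosen for this sketch only.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split on any run of whitespace; the order of the words is preserved."""
    return text.split()

print(whitespace_tokenize("The dwarf loves the cat"))
# ['The', 'dwarf', 'loves', 'the', 'cat']
```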

10
Q

What is a token?

A

Tokens are the units produced by tokenization; they can take the form of words or sub-words.

11
Q

What is a tokenizer?

A

A system that splits text into tokens:

John likes Mary. -> [John, likes, Mary, .]

Note that punctuation marks become tokens of their own.
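
One way to get this behaviour is a regular expression that treats every punctuation mark as its own token. This is a minimal sketch; the function name `simple_tokenize` and the pattern are assumptions for illustration, not the tokenizer used in the lecture.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("John likes Mary."))  # ['John', 'likes', 'Mary', '.']
```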

12
Q

What are the challenges of tokenization?

A
  • If we split at whitespace characters, what do we do with punctuation and special characters?
  • Are multi-word names one token or several tokens?
  • Does a full stop always mark the end of a sentence? What about ‘Mr.’ or a number such as 62.5?
  • If whitespace is a separator, what about numbers written as 1 543? Is that one number or two?
  • Commas can be part of numbers: 1,45
  • Single quotes are ambiguous
  • Different languages (e.g. Chinese, which does not separate words with spaces)
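
The abbreviation and number cases can be made concrete with a small comparison. The sketch below contrasts a naive pattern with a slightly refined one; the abbreviation list (`Mr`, `Mrs`, `Dr`) and both patterns are assumptions made for illustration and cover only a fraction of real cases.

```python
import re

# Naive: words and single punctuation marks.
NAIVE = re.compile(r"\w+|[^\w\s]")
# Refined (assumed rules): a tiny abbreviation list, then numbers that may
# contain '.' or ',' between digits, then the naive alternatives.
REFINED = re.compile(r"(?:Mr|Mrs|Dr)\.|\d+(?:[.,]\d+)*|\w+|[^\w\s]")

text = "Mr. Smith paid 62.5 euros."
print(NAIVE.findall(text))
# ['Mr', '.', 'Smith', 'paid', '62', '.', '5', 'euros', '.']
print(REFINED.findall(text))
# ['Mr.', 'Smith', 'paid', '62.5', 'euros', '.']
```

Numbers written with a space, such as 1 543, are still split by both patterns, which shows why such rules tend to grow quickly.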
13
Q

What is syntax?

A

It refers to the way words are arranged together to form phrases and sentences.

14
Q

What is POS tagging?

A

POS (part-of-speech) tagging assigns a part-of-speech marker to each word in a corpus or a sentence.

The input is a sequence of words and a set of available tags; the output is the sequence of tags that best fits the given words.

The dwarf loves -> Determiner Noun Verb

It resolves word ambiguities. ‘Book’ can be a verb or a noun; based on the context, POS tagging assigns the tag with the highest probability.
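
A toy sketch of how context can decide between tags: a greedy tagger that, for each word, picks the tag maximizing a word-given-tag score times a tag-given-previous-tag score. All probabilities below are invented for illustration, and the names `EMISSION`, `TRANSITION` and `greedy_tag` are hypothetical. Real taggers estimate these values from annotated corpora and typically search over the whole sequence rather than deciding greedily.

```python
# Hypothetical scores, invented for illustration only.
# EMISSION[word][tag]: how strongly the word is associated with the tag.
EMISSION = {
    "the": {"DET": 1.0},
    "i": {"PRON": 1.0},
    "dwarf": {"NOUN": 1.0},
    "loves": {"VERB": 0.9, "NOUN": 0.1},
    "book": {"NOUN": 0.6, "VERB": 0.4},
}
# TRANSITION[previous_tag][tag]: how likely the tag is after the previous tag.
TRANSITION = {
    "<s>": {"DET": 0.5, "PRON": 0.4, "NOUN": 0.05, "VERB": 0.05},
    "DET": {"NOUN": 0.9, "VERB": 0.05},
    "PRON": {"VERB": 0.8, "NOUN": 0.1},
    "NOUN": {"VERB": 0.6, "NOUN": 0.2},
    "VERB": {"DET": 0.5, "NOUN": 0.3},
}

def greedy_tag(words: list[str]) -> list[str]:
    """Tag left to right, choosing the best-scoring tag given the previous tag."""
    prev = "<s>"  # sentence-start marker
    tags = []
    for word in words:
        candidates = EMISSION[word.lower()]
        scores = {t: p * TRANSITION[prev].get(t, 0.0) for t, p in candidates.items()}
        best = max(scores, key=scores.get)
        tags.append(best)
        prev = best
    return tags

print(greedy_tag(["The", "dwarf", "loves"]))      # ['DET', 'NOUN', 'VERB']
print(greedy_tag(["I", "book", "the", "book"]))   # ['PRON', 'VERB', 'DET', 'NOUN']
```

In the second sentence ‘book’ is tagged as a verb after a pronoun and as a noun after a determiner, which is the kind of ambiguity resolution described above.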

15
Q

What is a phoneme?

A

A phoneme is the smallest unit of sound in a language that can change the meaning of a word. For example, in English, the words “bat” and “pat” differ by only one phoneme, /b/ and /p/, which distinguishes their meanings.
