NLP basics Flashcards
Lecture 1
Why is the Web relevant for NLP?
It can be both:
- an application area (search engines, news summarization, chatbots, recommendation systems, ...)
- a resource to improve the quality of NLP models (corpora of data, knowledge repositories such as Wikipedia, ...)
What are the common challenges for NLP?
- How to remove noise (e.g. duplicates)?
- How to assess the quality of content?
- How to deal with errors, such as spelling or grammar mistakes?
- How to clean the data?
What are morphemes?
Morphemes are the smallest units of text that have meaning.
The word ‘dwarfs’ has 2 morphemes: ‘dwarf’ and ‘s’ (plural). The word ‘loved’ has ‘lov’ and ‘ed’.
What are stems?
Stems are minimal free morphemes.
‘cat’ is a stem, but ‘s’ is not. Stems carry the main meaning of words.
What are affixes, and what types are there?
They are bound morphemes.
- suffixes: appear after the base (cat + s)
- prefixes: appear before the base (un + true)
- infixes: appear inside the base (fan + bloody + tastic)
- circumfixes: appear on both sides of the base (German: ge + sag + t)
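A minimal sketch of splitting a word into prefix, base and suffix, to make the affix types above concrete. The affix lists and the function name `split_affixes` are hypothetical and deliberately tiny. Infixes and circumfixes are not handled.

```python
# Hypothetical, deliberately tiny affix lists (assumptions for illustration).
PREFIXES = ("un", "re")
SUFFIXES = ("ed", "s")

def split_affixes(word):
    """Return (prefix, base, suffix); an empty string means 'no affix found'."""
    word = word.lower()
    # Accept a prefix only if at least 3 characters remain after it.
    prefix = next(
        (p for p in PREFIXES if word.startswith(p) and len(word) > len(p) + 2), ""
    )
    rest = word[len(prefix):]
    # Accept a suffix only if at least 3 characters remain before it.
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s) + 2), ""
    )
    base = rest[: len(rest) - len(suffix)]
    return prefix, base, suffix

print(split_affixes("untrue"))  # ('un', 'true', '')
print(split_affixes("cats"))    # ('', 'cat', 's')
```

Pure string matching like this is easily fooled: `split_affixes("under")` returns ('un', 'der', ''), although ‘un’ is not a prefix there.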
What is stemming?
Stemming is an algorithm that removes the endings of words: sitting -> sitt
Its objective is to group words which belong to the same morphological family by transforming them to their stemmed representation. The stems obtained are not necessarily real words or word forms.
Stemming can go wrong in two ways. Under-stemming: related words get different stems (adhere -> adher, adhesion -> adhes). Over-stemming: unrelated words get the same stem (appendicitis -> append, append -> append).
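A toy suffix-stripping stemmer, as a sketch of the idea only. The suffix list and the minimum stem length are assumptions made for this example. It is not the Porter algorithm or any other published stemmer.

```python
# Hypothetical suffix list; order matters because the first match wins.
SUFFIXES = ("ing", "ion", "ed", "es", "s", "e")

def stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["sitting", "loved", "cats", "adhere", "adhesion"]:
    print(w, "->", stem(w))
# sitting -> sitt
# loved -> lov
# cats -> cat
# adhere -> adher
# adhesion -> adhes
```

The last two lines reproduce the under-stemming case: ‘adhere’ and ‘adhesion’ are related but end up with different stems.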
What is lemmatization?
It transforms words into their base (dictionary) forms, called lemmas. Plural becomes singular, past tense becomes present, and so on.
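A minimal lemmatization sketch, assuming a small hand-written lookup table for irregular forms plus one fallback rule for regular plurals. The table `IRREGULAR` and the function `lemmatize` are hypothetical. Real lemmatizers typically rely on a full dictionary and the word's part of speech.

```python
# Hypothetical lookup table of irregular forms (an assumption for this example).
IRREGULAR = {"went": "go", "mice": "mouse", "was": "be", "loved": "love"}

def lemmatize(word):
    """Return a base form: table lookup first, then a naive plural rule."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    # Naive rule: drop a final 's' from longer words that do not end in 'ss'.
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w

print(lemmatize("dwarfs"))  # dwarf
print(lemmatize("went"))    # go
print(lemmatize("loved"))   # love
```

Unlike a stem, the result is always meant to be a real word: ‘loved’ becomes ‘love’, not ‘lov’.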
What are homophones?
Words that are pronounced the same (the same sequence of phonemes) but differ in meaning and usually in spelling, e.g. ‘to’, ‘two’ and ‘too’.
What is tokenization?
Tokenization is a process of segmenting an input string into an ordered sequence of units.
Assuming one token is one word, tokenization segments a sentence into an ordered sequence of words (in the simplest case, words are separated by whitespace).
What is a token?
Tokens are the units produced by tokenization. They can be words or sub-words.
What is a tokenizer?
A system that splits text into tokens:
John likes Mary. -> [John, likes, Mary, .] PAY ATTENTION TO PUNCTUATION: the full stop is its own token.
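A minimal regex tokenizer that reproduces the example above, written as a sketch. It uses only Python's standard `re` module, and the function name `tokenize` is an assumption.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("John likes Mary."))  # ['John', 'likes', 'Mary', '.']
print("John likes Mary.".split())    # ['John', 'likes', 'Mary.']
```

The second line shows why splitting on whitespace alone is not enough: the full stop stays glued to ‘Mary’.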
What could be challenges of tokenization?
- If we split at whitespace characters, what do we do with punctuation and special characters?
- Are multi-word names one token or several tokens?
- Does a full stop always end a sentence? What about ‘Mr.’ or a number such as 62.5?
- If whitespace is a separator, what about numbers written as 1 543? Is this one number or two?
- Commas can be part of numbers: 1,45
- Single quotes are ambiguous: is it an apostrophe (don't, John's) or a quotation mark?
- Different languages: Chinese, for example, does not separate words with whitespace.
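A sketch of how a tokenizer might handle two of the challenges above: numbers with an internal dot or comma, and abbreviations ending in a full stop. The abbreviation list is a hypothetical, deliberately tiny assumption. Numbers written with a space (1 543), multi-word names, quotes and languages without whitespace are not handled.

```python
import re

# Hypothetical abbreviation list (an assumption for this example).
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}

TOKEN_PATTERN = re.compile(
    r"""
    \d+(?:[.,]\d+)*   # a number, possibly with internal dots or commas
  | \w+\.?            # a word, optionally followed by a full stop
  | [^\w\s]           # any other single punctuation character
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Tokenize text, keeping decimal numbers and known abbreviations whole."""
    tokens = []
    for tok in TOKEN_PATTERN.findall(text):
        # Split a trailing full stop off a word unless it is a known abbreviation.
        if tok.endswith(".") and len(tok) > 1 and tok not in ABBREVIATIONS:
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Mr. Smith paid 62.5 euros."))
# ['Mr.', 'Smith', 'paid', '62.5', 'euros', '.']
print(tokenize("1 543"))  # ['1', '543']
```

The last call shows that a number written with a space still comes out as two tokens. Resolving that needs context, not just a pattern.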
What is syntax?
It refers to the way words are arranged together in a sentence.
What is POS tagging?
POS (part-of-speech) tagging is assigning a part-of-speech marker (tag) to each word in a corpus or a sentence.
The input is a sequence of words and a set of available tags. The output is the sequence of tags that best fits the given words.
The dwarf loves -> Determiner Noun Verb
It resolves ambiguities of words. ‘Book’ can be a verb or a noun. Based on the context, POS tagging assigns the tag with the highest probability.
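A toy greedy tagger, as a sketch of "pick the most probable tag given the context". All probabilities below are made up for illustration, and the names `EMISSION`, `TRANSITION` and `tag` are hypothetical. Real taggers estimate such numbers from annotated corpora and usually score the whole tag sequence rather than deciding word by word.

```python
# Made-up P(tag | word) values for a handful of words (illustration only).
EMISSION = {
    "the": {"DET": 1.0},
    "a": {"DET": 1.0},
    "i": {"PRON": 1.0},
    "dwarf": {"NOUN": 0.9, "VERB": 0.1},
    "loves": {"VERB": 0.8, "NOUN": 0.2},
    "book": {"NOUN": 0.6, "VERB": 0.4},
    "flight": {"NOUN": 1.0},
}

# Made-up P(tag | previous tag) values; unseen pairs fall back to DEFAULT.
TRANSITION = {
    ("DET", "NOUN"): 0.9,
    ("NOUN", "VERB"): 0.6,
    ("NOUN", "NOUN"): 0.2,
    ("PRON", "VERB"): 0.8,
    ("PRON", "NOUN"): 0.05,
}
DEFAULT = 0.01

def tag(words):
    """Greedily pick, for each word, the tag with the highest combined score."""
    tags = []
    prev = "<s>"  # marker for the start of the sentence
    for word in words:
        # Unknown words are treated as nouns (a simplifying assumption).
        candidates = EMISSION.get(word.lower(), {"NOUN": 1.0})
        best = max(
            candidates,
            key=lambda t: candidates[t] * TRANSITION.get((prev, t), DEFAULT),
        )
        tags.append(best)
        prev = best
    return tags

print(tag(["The", "dwarf", "loves"]))         # ['DET', 'NOUN', 'VERB']
print(tag(["The", "book"]))                   # ['DET', 'NOUN']
print(tag(["I", "book", "a", "flight"]))      # ['PRON', 'VERB', 'DET', 'NOUN']
```

The same word ‘book’ is tagged as a noun after a determiner and as a verb after a pronoun, which is the context-based disambiguation described above.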
What is a phoneme?
A phoneme is the smallest unit of sound in a language that can change the meaning of a word. For example, in English, the words “bat” and “pat” differ by only one phoneme, /b/ and /p/, which distinguishes their meanings.