Text Process Basics 1.2 Flashcards

1
Q

What is tokenization?

A

Tokenization is the process of splitting text into tokens. Tokens correspond to words and numeric sequences separated by white-space characters or punctuation.
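
As a rough illustration of the definition above, the following sketch splits text on anything that is not a word character. The function name `tokenize` is hypothetical, and the regular expression is a simplification, not how NLTK tokenizes.

```python
import re

def tokenize(text):
    # \w+ matches runs of letters, digits, and underscores, so white space
    # and punctuation act as separators and are dropped.
    return re.findall(r"\w+", text)

print(tokenize("Hello, world! It is 2024."))
# ['Hello', 'world', 'It', 'is', '2024']
```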

2
Q

What is sentence tokenization?

A

For long documents, we may be interested not in individual words but in the sentences they contain, e.g. to:
• Check whether a sentence’s sentiment is positive or negative
• Check whether a sentence contains propaganda content
• Check the grammatical correctness of a sentence

Splitting a document into sentences for such tasks is known as sentence tokenization.
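
A minimal sketch of sentence splitting, assuming sentences end with ".", "!" or "?" followed by white space. The name `split_sentences` is hypothetical, and the rule is deliberately naive: abbreviations such as "Dr." would be split incorrectly.

```python
import re

def split_sentences(text):
    # Split at white space that follows a sentence-ending mark.
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("I love it. Is it good? Yes!"))
# ['I love it.', 'Is it good?', 'Yes!']
```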

3
Q

What is stemming? Give an example.

A

Stemming is the process of reducing inflected words to their root forms, mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.

E.g. “before” becomes “befor” and “sanitize” becomes “sanit”.

4
Q

What do we call a language in which words are derived from other words as their use in speech changes?

A

An inflected language. Stemming and lemmatization help to recover the root forms of inflected words.

5
Q

How are stems created?

A

Stems are created by removing the suffixes or prefixes used with a word.
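
To make the idea of affix removal concrete, here is a toy suffix stripper. The suffix list and the name `toy_stem` are hypothetical; this is not the Porter or Lancaster algorithm, which apply many more rules.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["jumping", "jumped", "jumps", "making"]])
# ['jump', 'jump', 'jump', 'mak']
```

Note that "mak" is not a valid word, which is typical of rule-based stemming.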

6
Q

What are NLTK’s main stemmers?

A

Porter Stemmer and Lancaster Stemmer (a.k.a. Paice-Husk Stemmer), the latter being more aggressive.
7
Q

What is Lemmatization?

A

Lemmatization reduces inflected words to their root form while ensuring that the root word belongs to the language.
In lemmatization, the root word is called the lemma.
A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
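
One way to picture lemmatization is as a dictionary lookup. The sketch below assumes a tiny hand-written table (`LEMMAS`, hypothetical); real lemmatizers rely on a full lexicon and often on part-of-speech information.

```python
# Hypothetical miniature lemma dictionary for illustration only.
LEMMAS = {"mice": "mouse", "went": "go", "better": "good", "studies": "study"}

def toy_lemmatize(word):
    word = word.lower()
    # Words missing from the table are returned unchanged.
    return LEMMAS.get(word, word)

print([toy_lemmatize(w) for w in ["Mice", "went", "table"]])
# ['mouse', 'go', 'table']
```

Every output is a valid dictionary word, in contrast to stemming.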

8
Q

Pros and Cons of Stemming?

A

Pro: Quick to run, because it is based on simple rules; suitable for processing a large amount of text
Con: Output can be meaningless (the stem may not be a valid word); no single rule set fits all words

9
Q

Pros and Cons of Lemmatization?

A

Pro: The derived root word is meaningful
Con: More expensive (and hence slower) to run

10
Q

What are stopwords?

A

Stop words, such as “am”, “the”, “to” and “are”, are common function words that hold sentences together.
They help us to construct grammatical sentences, but they mostly do not affect the meaning of the sentence.
NLTK provides a stopwords list, which we can use to find and remove stop words.
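
A minimal sketch of stop word removal, assuming a small hand-picked set (`STOPWORDS`, hypothetical) in place of NLTK's much longer list.

```python
# Small illustrative set; a real list is far longer.
STOPWORDS = {"am", "the", "to", "are", "is", "a"}

def remove_stopwords(tokens):
    # Compare in lower case so "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cats", "are", "going", "to", "the", "park"]))
# ['cats', 'going', 'park']
```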

11
Q

What are the typical text cleaning steps?

A

Raw text > Tokenization > Stopwords Removal > Stemming or Lemmatization > Normalized Text
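
The pipeline above can be sketched end to end in a few lines. Everything here is a simplified stand-in: the stop word set, the suffix list, and the name `normalize` are hypothetical, and the stemming step is a toy rule, not a real stemmer.

```python
import re

STOPWORDS = {"the", "are", "to", "a", "is", "am"}  # illustrative set
SUFFIXES = ("ing", "ed", "s")                      # toy stemming rules

def normalize(text):
    # 1. Tokenization (lower-cased word characters only).
    tokens = re.findall(r"\w+", text.lower())
    # 2. Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. Toy stemming: strip the first matching suffix.
    stems = []
    for t in tokens:
        for suffix in SUFFIXES:
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(normalize("The cats are jumping to the park."))
# ['cat', 'jump', 'park']
```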

12
Q

Give an example of when you should not perform stemming or lemmatization.

A

If you need to check whether a word is used in the singular or plural form, you should not perform stemming or lemmatization, because both discard that information.
