Linguistic Modules Flashcards

1
Q

What are some issues with tokenization?

A
  1. Hyphens: Break into multiple tokens or remain as one?
  2. Optional Hyphens: Users may use different forms of the same word
  3. Whitespace: Splitting on spaces is unreliable, since some multiword names (e.g., San Francisco) arguably belong in one token. The problem is compounded by hyphens
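
The hyphen question above can be sketched in a few lines of Python. This is only an illustration: the function name `tokenize` and the `split_hyphens` flag are assumptions made for this example, not part of any particular search engine.

```python
def tokenize(text, split_hyphens=False):
    """Split text on whitespace, optionally breaking hyphenated words apart."""
    tokens = text.split()
    if split_hyphens:
        # Treat each hyphen as a token boundary and drop empty pieces.
        tokens = [part for tok in tokens for part in tok.split("-") if part]
    return tokens


# The same input yields different tokens depending on the hyphen policy.
print(tokenize("co-education plan"))
print(tokenize("co-education plan", split_hyphens=True))
```

Whichever policy is chosen, it has to be applied identically to documents and queries, or the two will not match.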
2
Q

What are some approaches for normalization?

A
  1. Equivalence classes
  2. Query expansion
  3. Lemmatization
3
Q

What are equivalence classes?

A

Grouping terms by mapping them to the same base term
Ex: Deleting periods or hyphens so U.S.A. maps to USA
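
A minimal sketch of this mapping in Python, assuming the only rule is "delete periods and hyphens" (the function name `normalize` is made up for the example):

```python
def normalize(token):
    # Delete periods and hyphens so surface variants share one base term.
    return token.replace(".", "").replace("-", "")


print(normalize("U.S.A."))               # USA
print(normalize("anti-discriminatory"))  # antidiscriminatory
```

Every token that normalizes to the same string falls into the same equivalence class.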

4
Q

What are some issues associated with normalization in other languages?

A
  1. Accents: Should they be removed and mapped to the same term? Should they be kept and mapped separately?
  2. The same term can mean different things in different languages
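
As an illustration of the "remove accents" option, here is one possible sketch using Python's standard `unicodedata` module. The helper name `strip_accents` is an assumption for this example, and whether stripping is the right policy depends on the language and the users.

```python
import unicodedata


def strip_accents(term):
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("résumé"))  # resume
print(strip_accents("naïve"))   # naive
```

Note the trade-off: after stripping, "résumé" and "resume" become the same term, which helps users who omit accents but merges words that may differ in meaning.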
5
Q

What is case folding?

A

Reducing all letters to lower case. This is usually the better choice, since users tend to type queries in lowercase regardless of correct capitalization
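
A minimal sketch of case folding in Python, assuming plain lowercasing is enough (the function name `case_fold` is made up for the example):

```python
def case_fold(tokens):
    # Lowercase every token so "Ferrari" and "FERRARI" share one term.
    return [tok.lower() for tok in tokens]


print(case_fold(["Ferrari", "FERRARI", "ferrari"]))
```

Python also offers `str.casefold()`, a more aggressive variant intended for caseless matching of Unicode text.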

6
Q

What are stop words?

A

Extremely common words that appear to be of little value in selecting documents, such as “of”, “and”, “the”, etc.
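
A sketch of stop word removal in Python. The stop list below is a tiny hand-picked assumption for illustration; real stop lists are longer and vary between systems.

```python
# Hypothetical, deliberately tiny stop list.
STOP_WORDS = {"of", "and", "the", "a", "to", "in"}


def remove_stop_words(tokens):
    # Keep only the tokens that are not on the stop list.
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


print(remove_stop_words(["the", "president", "of", "the", "united", "states"]))
```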

7
Q

Why do most search engines index stop words even though they are of little value?

A

They are needed for phrase queries to determine proper proximity between tokens
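
To see why, here is a rough sketch of a positional index that keeps stop words and uses their positions to answer a phrase query. The function names and the two sample documents are assumptions made for this example, not a description of any real engine.

```python
from collections import defaultdict


def build_positional_index(docs):
    # term -> doc_id -> list of positions where the term occurs
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, tok in enumerate(text.lower().split()):
            index[tok][doc_id].append(pos)
    return index


def phrase_match(index, phrase):
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Each following word must sit exactly one position further along.
            if all(start + i in index[w].get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


docs = {1: "flights to london", 2: "london flights from paris"}
idx = build_positional_index(docs)
print(phrase_match(idx, "flights to london"))
```

If "to" had been dropped at indexing time, the index could no longer tell whether "flights" and "london" were separated by exactly one word.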

8
Q

What is query expansion?

A

An alternative to equivalence classes: every distinct token the tokenizer emits is indexed as its own dictionary term, without normalization. At query time, each query term is expanded into a list of related index terms, so a single query word can match several words in the index
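
A small sketch of the idea in Python. The index contents and the expansion lists are hypothetical data invented for this example.

```python
# Hypothetical index: unnormalized tokens kept as separate dictionary terms,
# each mapped to the set of document IDs that contain it.
INDEX = {
    "USA": {1, 2},
    "U.S.A.": {3},
    "window": {4},
    "windows": {5},
    "Windows": {6},
}

# Hypothetical expansion lists; they do not have to be symmetric.
EXPANSIONS = {
    "USA": ["USA", "U.S.A."],
    "windows": ["windows", "Windows", "window"],
    "Windows": ["Windows"],
}


def search(term):
    results = set()
    # Look up every expansion of the query term and union the postings.
    for variant in EXPANSIONS.get(term, [term]):
        results |= INDEX.get(variant, set())
    return results


print(search("windows"))
print(search("Windows"))
```

Unlike equivalence classes, the lists can be asymmetric: here a lowercase "windows" query also finds "Windows", but a capitalized "Windows" query stays narrow.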

9
Q

How can we handle spelling mistakes in a query?

A

Using Soundex, an algorithm that forms equivalence classes of words based on phonetic heuristics, so that words that sound alike map to the same code
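
A sketch of a simplified Soundex in Python. This variant is an assumption for illustration: it treats h, w, and y like vowels, collapses runs of identical codes, and produces a four-character code. Official Soundex variants differ in some of these details.

```python
# Letter groups and their digit codes; "0" marks letters that are dropped.
CODES = {}
for letters, digit in [("aeiouhwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    letters = [ch for ch in word.lower() if ch in CODES]
    if not letters:
        return ""
    digits = [CODES[ch] for ch in letters]
    # Collapse runs of identical consecutive digits into one.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Keep the first letter itself, drop zeros, pad to four characters.
    tail = "".join(d for d in collapsed[1:] if d != "0")
    return (letters[0].upper() + tail + "000")[:4]


print(soundex("Hermann"), soundex("Herman"))  # H655 H655
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

A misspelled query term can then be matched against index terms that share its code.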

10
Q

What is lemmatization?

A

Reducing variant forms of a word to the base form.
Ex: am, are, is -> be
Applies linguistic rules to transform the word
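
A toy sketch in Python, assuming a hand-written lookup table stands in for real linguistic analysis. The table `LEMMAS` and the function `lemmatize` are hypothetical; an actual lemmatizer relies on a full vocabulary and morphological rules.

```python
# Hypothetical lookup table from inflected forms to base forms.
LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car", "car's": "car"}


def lemmatize(token):
    token = token.lower()
    # Fall back to the token itself when no base form is known.
    return LEMMAS.get(token, token)


print([lemmatize(t) for t in ["am", "are", "is"]])
```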

11
Q

What is one of the main methods of lemmatization?

A

Stemming

12
Q

What is stemming?

A

A heuristic lemmatization process that chops off the ends of words to reduce the number of word variations
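
A deliberately crude sketch of suffix chopping in Python. The suffix list and the minimum stem length of three are assumptions chosen for this example; this is not the Porter algorithm or any other published stemmer.

```python
# Hypothetical suffix list, tried in order.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


print(crude_stem("tours"), crude_stem("touring"))  # tour tour
print(crude_stem("news"))                          # new
```

The last line shows the cost of a purely heuristic approach: "news" is wrongly reduced to "new".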

13
Q

What is the Porter Algorithm?

A

The most common algorithm for stemming English words. It reduces words to a stem by stripping suffixes in several sequential phases of rewrite rules (it does not strip prefixes)

14
Q

When can stemming improve effectiveness?

A

If the words that share a stem are actually close in meaning, stemming can increase recall.
Ex: It is good for tours and touring to map to tour, but not for operational and operative to map to oper, since merging words with different meanings hurts precision
