Lecture 1 - Basic Text Processing, Regular Expressions, Text Normalization, Edit Distances Flashcards

1
Q

Define Natural Language Processing.

A

Natural language refers to the way people communicate with each other using speech, text, etc.

Natural Language Processing (NLP) is the automatic manipulation of natural language by computers; it covers the interaction between computers and human language.

2
Q

Define Regular Expressions.

A

Regular expressions = specially encoded text strings that are used as patterns for matching sets of strings

3
Q

Discuss some regex special characters.

A

Anchors: ^ $

  • Caret ^ matches at the start of the line
  • Dollar sign $ matches at the end of the line

Negation in disjunction: [^Ss] means “neither a capital S nor a lowercase s”. The caret negates only when it is the first character inside the square brackets.

etc.
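
The anchors and the negated disjunction above can be tried out with a short sketch. It assumes Python's standard `re` module; the sample text is made up for illustration.

```python
import re

# Made-up two-line sample text.
text = "Some snakes hiss\nsilently at Sam"

# ^ anchors the match to the start of a line
# (re.MULTILINE makes ^ and $ apply to every line, not just the whole string).
starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# $ anchors the match to the end of a line.
ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

# [^Ss] matches any single character that is neither "S" nor "s".
not_s = re.findall(r"[^Ss]", "Sass")

print(starts)  # ['Some', 'silently']
print(ends)    # ['hiss', 'Sam']
print(not_s)   # ['a']
```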

4
Q

What is text normalization (in general) and what are the three tasks it is composed of?

A

= a set of tasks that convert a text into a more convenient, standard form

  1. Segmenting/tokenizing words in running text
  2. Normalizing word formats
  3. Segmenting sentences in running text
5
Q

What are lemma, stem and wordform?

A
  • lemma = the canonical form shared by words that have the same stem, part of speech, and rough word sense
  • wordform = the full inflected surface form
  • stem = the part of the word that never changes even when morphologically inflected

e.g., cat and cats belong to the same lemma but are different wordforms
e.g., from “produced”, the lemma is “produce”, but the stem is “produc-”.

6
Q

What are types and tokens in a sentence?

A
  • type = an element of the vocabulary
  • token = an instance of that type in running text

e.g., “they lay back on the San Francisco grass and looked at the stars and their”

  • 15 tokens (or 14, depending on whether we treat “San Francisco” as a single token)
  • 13 types (or 12, again depending on “San Francisco”), because “the” and “and” each occur twice
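
The counts can be checked with a minimal sketch, assuming whitespace tokenization is good enough for this sentence. Joining “San Francisco” with an underscore is just an illustrative way of treating it as one token.

```python
sentence = "they lay back on the San Francisco grass and looked at the stars and their"

# Tokens: every whitespace-separated word in the running text.
tokens = sentence.split()
# Types: the distinct vocabulary elements among those tokens.
types = set(tokens)
print(len(tokens), len(types))  # 15 13

# Treat "San Francisco" as a single token (illustrative underscore join).
merged_tokens = sentence.replace("San Francisco", "San_Francisco").split()
merged_types = set(merged_tokens)
print(len(merged_tokens), len(merged_types))  # 14 12
```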
7
Q

Explain the Maximum Matching Word Segmentation Algorithm.

A

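Given a string and a dictionary (word list):

  1. Start a pointer at the beginning of the string
  2. Find the longest word in the dictionary that matches the string starting at the pointer
  3. Move the pointer past that word in the string
  4. Go to 2, until the pointer reaches the end of the string

This greedy approach is used for languages written without spaces between words, such as Chinese; it does not really work in English.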
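
The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `max_match`, the tiny dictionary, and the fallback of emitting a single character when no dictionary word matches are all assumptions.

```python
def max_match(text, dictionary):
    """Greedy longest-match segmentation of text into dictionary words."""
    words = []
    i = 0  # the pointer into the string
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            # Assumed fallback: no dictionary word starts here,
            # so emit one character and move on.
            j = i + 1
        words.append(text[i:j])
        i = j  # move the pointer past the word
    return words


# Hypothetical toy dictionary.
toy_dictionary = {"the", "theta", "table", "bled", "own", "down", "there"}
print(max_match("thetabledownthere", toy_dictionary))
# ['theta', 'bled', 'own', 'there']
```

The output shows why the greedy strategy struggles with English: the intended reading “the table down there” is lost because “theta” is the longest match at the first position.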
8
Q

Normalizing word formats is part of general text normalization. Give an example of normalizing word format.

A

e.g., to match U.S.A. and USA, we could delete the periods. For US vs. us, we could transform everything to lowercase, but that is problematic because the two have different meanings (the country vs. the pronoun).

Alternatively, we could use asymmetric expansion: Enter: window; Search: window, windows, Windows, etc.
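
A rough sketch of these options, with hypothetical helper names (`normalize`, `expand_query`) and an expansion table containing only the entry from this card:

```python
def normalize(token, lowercase=False):
    """Delete periods; optionally fold case (which can conflate US and us)."""
    token = token.replace(".", "")
    return token.lower() if lowercase else token


print(normalize("U.S.A.") == normalize("USA"))  # True
# Case folding also merges the country with the pronoun:
print(normalize("US", lowercase=True) == normalize("us", lowercase=True))  # True

# Asymmetric expansion: the entered term maps to the set of terms searched.
# Hypothetical table holding only the example above.
EXPANSIONS = {"window": {"window", "windows", "Windows"}}


def expand_query(term):
    """Return the search terms for an entered term (itself if unlisted)."""
    return EXPANSIONS.get(term, {term})


print(sorted(expand_query("window")))  # ['Windows', 'window', 'windows']
```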

9
Q

A dot is ambiguous in NLP because it can mark the end of a sentence, or it can appear in abbreviations, numbers, and so on. What can you use to determine whether a dot is the end of the sentence?

A

A decision tree.
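
As a rough illustration, a decision tree can be written by hand as nested tests on features of the words around the dot. The features, their order, and the abbreviation list below are assumptions chosen for the sketch; in practice such a tree would usually be learned from labelled data.

```python
# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr", "mr", "mrs", "inc", "etc"}


def is_sentence_end(word_before, word_after):
    """Hand-written decision tree for a "." that follows word_before."""
    if not word_after:
        return True   # nothing follows the dot
    if word_before.lower() in ABBREVIATIONS:
        return False  # e.g. "Dr." is probably an abbreviation
    if word_before.isdigit() and word_after[0].isdigit():
        return False  # e.g. the dot in "4.3"
    # Otherwise decide by whether the next word is capitalized.
    return word_after[0].isupper()


print(is_sentence_end("Dr", "Smith"))    # False
print(is_sentence_end("stars", "They"))  # True
print(is_sentence_end("4", "3"))         # False
```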

10
Q

What does Minimum Edit Distance do?

A

It tells how similar two strings are, based on the minimum number of edits (insertion, deletion, substitution) needed to transform one string into the other.

Used for spell correction, machine translation, etc.

Insertion, deletion and substitution each have a cost of one. Summing the costs over the cheapest possible sequence of these operations gives the Minimum Edit Distance.
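
A minimal dynamic-programming sketch of the minimum edit distance follows. The function name and the `sub_cost` parameter are assumptions; with the default, every operation costs one as on this card, and `sub_cost=2` gives the common variant in which a substitution counts as a deletion plus an insertion.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum total cost of edits that turn source into target."""
    m, n = len(source), len(target)
    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(1, n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,                             # deletion
                d[i][j - 1] + 1,                             # insertion
                d[i - 1][j - 1] + (0 if same else sub_cost)  # substitution or match
            )
    return d[m][n]


print(min_edit_distance("intention", "execution"))              # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```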

11
Q

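Once the minimum edit distance table for two strings has been computed, we recover the alignment between them with a method called backtrace. What is the complexity of the backtrace?

A

O(m+n), i.e. linear
m - # characters in word1
n - # characters in word2

(Filling in the table itself takes O(mn) time and space; only the backtrace is linear.)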
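
The sketch below shows where the linear bound comes from: the walk starts in the bottom-right cell of the table and moves up, left, or diagonally on every step, so it takes at most m+n steps. The function name `align`, the unit costs, and the tie-breaking order (diagonal first) are assumptions.

```python
def align(source, target):
    """Return (distance, operations) for turning source into target."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Backtrace: each step decreases i, j, or both, so at most m + n steps.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        cost = 1 if i > 0 and j > 0 and source[i - 1] != target[j - 1] else 0
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            name = "substitute" if cost else "match"
            ops.append((name, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", target[j - 1]))
            j -= 1
    ops.reverse()
    return d[m][n], ops


distance, operations = align("kitten", "sitting")
print(distance)  # 3
for op in operations:
    print(op)
```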
