Lecture 1 - Basic Text Processing, Regular Expressions, Text Normalization, Edit Distances Flashcards

Question 1

Q

Define Natural Language Processing.

Answer

A

Natural language is the way people communicate with each other, using speech, text, etc.

Natural Language Processing is the interaction between computers and human language: the automatic manipulation of natural language.

Question 2

Q

Define Regular Expressions.

Answer

A

Regular Expressions = specially encoded text strings that are used as patterns for matching sets of strings

Question 3

Q

Discuss some regex special characters.

Answer

A

Anchors: ^ $

Caret ^ matches at the start of the line
Dollar sign $ matches at the end of the line

Negation in disjunction [^Ss] - it matches any single character that is neither a capital S nor a lowercase s

etc.
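
The special characters above can be tried out with Python's `re` module. This is a minimal sketch; the sample strings are made up for illustration.

```python
import re

# Hypothetical sample lines, not taken from the lecture
lines = ["Sam sat down", "the cat sat", "we saw Sam"]

# ^ anchors the match at the start of the line
starts_with_sam = [s for s in lines if re.search(r"^Sam", s)]

# $ anchors the match at the end of the line
ends_with_sam = [s for s in lines if re.search(r"Sam$", s)]

# [^Ss] matches any single character that is neither "S" nor "s"
not_s = re.findall(r"[^Ss]", "Sums")

print(starts_with_sam)  # ['Sam sat down']
print(ends_with_sam)    # ['we saw Sam']
print(not_s)            # ['u', 'm']
```

Note that ^ only means negation when it is the first character inside the square brackets; elsewhere it is the start-of-line anchor.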

Question 4

Q

What is text normalization (in general) and what are the three tasks it is composed of?

Answer

A

= a set of tasks that convert text into a more convenient, standard form

Segmenting/tokenizing words in running text
Normalizing word formats
Segmenting sentences in running text

Question 5

Q

What are lemma, stem and wordform?

Answer

A

lemma = the set of wordforms that share the same stem, part of speech, and rough word sense
wordform = the full inflected surface form
stem = the part of the word that never changes even when morphologically inflected

e.g., cat and cats belong to the same lemma but are different wordforms
e.g., for “produced”, the lemma is “produce”, but the stem is “produc-”.

Question 6

Q

What are types and tokens in a sentence?

Answer

A

type = an element of the vocabulary
token = an instance of that type in running text

e.g., they lay back on the San Francisco grass and looked at the stars and their

15 tokens (or 14 - depending on whether we treat “San Francisco” as a single token)
13 types (or 12, same with San Francisco)
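
The counts above can be checked with a few lines of Python. This is a sketch that assumes simple whitespace tokenization; joining "San Francisco" with an underscore is just an illustrative way to treat it as one token.

```python
sentence = "they lay back on the San Francisco grass and looked at the stars and their"

# Whitespace tokenization: "San" and "Francisco" are separate tokens
tokens = sentence.split()
types = set(tokens)  # the vocabulary: each distinct token counted once
print(len(tokens), len(types))  # 15 13

# Assumption: treat the multiword name "San Francisco" as a single token
merged_tokens = sentence.replace("San Francisco", "San_Francisco").split()
merged_types = set(merged_tokens)
print(len(merged_tokens), len(merged_types))  # 14 12
```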

Question 7

Q

Explain the Maximum Matching Word Segmentation Algorithm.

Answer

A

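Given a string and a dictionary (word list):

1. Start a pointer at the beginning of the string
2. Find the longest word in the dictionary that matches the string starting at the pointer
3. Move the pointer over the word in the string
4. Go to 2

- does not really work in English; it works much better for languages written without spaces between words, such as Chinese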
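
The steps above can be sketched in Python as follows. The toy dictionary and the single-character fallback for unmatched text are assumptions made for this illustration, not part of the card.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation by longest dictionary match."""
    words = []
    pointer = 0
    while pointer < len(text):
        # Try the longest candidate first, shrinking until a dictionary word matches
        for end in range(len(text), pointer, -1):
            candidate = text[pointer:end]
            if candidate in dictionary:
                break
        else:
            # Assumption: emit a single character when no dictionary word matches
            candidate = text[pointer]
        words.append(candidate)
        pointer += len(candidate)  # move the pointer over the matched word
    return words


# Hypothetical toy dictionary
dictionary = {"the", "theta", "table", "bled", "down", "own", "there"}
segmented = max_match("thetabledownthere", dictionary)
print(segmented)  # ['theta', 'bled', 'own', 'there']
```

The output shows why the greedy strategy struggles with English: the intended reading "the table down there" is lost because "theta" is a longer match than "the".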

Question 8

Q

Normalizing word formats is part of general text normalization. Give an example of normalizing word format.

Answer

A

e.g., to match U.S.A and USA, we could delete the periods. For US vs. us, we could transform everything to lowercase, but that is problematic because the two have different meanings

or, we could use asymmetric expansion: Enter: window; Search: window, windows, Windows, etc.
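
A minimal sketch of these normalization choices in Python. The function names and the expansion table are hypothetical, written only to illustrate the idea.

```python
def delete_periods(term):
    # Delete periods so that "U.S.A" and "USA" map to the same form
    return term.replace(".", "")


print(delete_periods("U.S.A") == delete_periods("USA"))  # True

# Lowercasing also merges forms, but it conflates "US" (the country)
# with "us" (the pronoun)
print("US".lower() == "us".lower())  # True

# Asymmetric expansion: hypothetical hand-written table that maps the
# entered term to the terms actually searched for
EXPANSIONS = {"window": ["window", "windows", "Windows"]}


def expand(query):
    # Terms without an entry are searched as typed
    return EXPANSIONS.get(query, [query])


print(expand("window"))   # ['window', 'windows', 'Windows']
print(expand("Windows"))  # ['Windows']
```

The expansion is asymmetric because entering "window" also searches for "Windows", while entering "Windows" searches only for "Windows".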

Question 9

Q

A dot is ambiguous in NLP because it can mark the end of a sentence or it can appear in abbreviations and so on. What can you use to determine whether a dot is the end of a sentence?

Answer

A

A decision tree.
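
To make the idea concrete, here is a small hand-written decision tree in Python. The features (abbreviation list, capitalization of the next word) and the order of the tests are assumptions chosen for illustration; in practice such a tree would typically be learned from labeled data.

```python
# Hypothetical abbreviation list
ABBREVIATIONS = {"dr", "mr", "mrs", "etc", "inc"}


def is_sentence_end(word, next_word):
    """Decide whether the dot ending `word` closes a sentence.

    Each `if` below is one node of the hand-written tree.
    """
    # Node 1: nothing follows the dot, so the text (and the sentence) ends here
    if next_word is None:
        return True
    # Node 2: the word before the dot is a known abbreviation
    if word.rstrip(".").lower() in ABBREVIATIONS:
        return False
    # Node 3: a capitalized next word suggests that a new sentence starts
    return next_word[0].isupper()


print(is_sentence_end("Dr.", "Smith"))   # False
print(is_sentence_end("home.", "Then"))  # True
```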

Question 10

Q

What does Minimum Edit Distance do?

Answer

A

It tells how similar two strings are, based on the minimum number of edits (insertion, deletion, substitution) needed to transform one into the other

Used for spell correction, machine translation etc.

Insertion, deletion and substitution each have a cost of one. The Minimum Edit Distance is the smallest total cost of any sequence of these operations that turns one string into the other
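
A minimal dynamic-programming sketch of this computation in Python, assuming a cost of one for every operation as stated above. The function and variable names are illustrative.

```python
def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Matching characters cost nothing; a substitution costs one
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
                dist[i - 1][j - 1] + sub_cost,  # substitution or match
            )
    return dist[n][m]


print(min_edit_distance("kitten", "sitting"))       # 3
print(min_edit_distance("intention", "execution"))  # 5
```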

Question 11

Q

Once the minimum edit distance between two strings has been computed, we can recover the alignment between them with a method called backtrace. What is the complexity of the backtrace?

Answer

A

O(m+n), i.e. linear (filling the edit distance table itself takes O(mn))
m - # characters in word1
n - # characters in word2
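
The sketch below shows where the linear bound comes from, assuming unit costs for all three operations. The function name and the operation labels are illustrative. Filling the table visits every cell, but the backtrace walks from the bottom-right corner to the top-left corner, and every step decreases at least one index, so it takes at most m+n steps.

```python
def align(word1, word2):
    m, n = len(word1), len(word2)
    # Fill the edit distance table: O(m*n)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if word1[i - 1] == word2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + sub)

    # Backtrace from the bottom-right cell: at most m + n steps
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if word1[i - 1] == word2[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                ops.append("match" if sub == 0 else "substitute")
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("delete")
            i -= 1
        else:
            ops.append("insert")
            j -= 1
    ops.reverse()  # operations were collected from the end to the start
    return dist[m][n], ops


distance, ops = align("kitten", "sitting")
print(distance)  # 3
print(ops)
```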

(11 cards)