C2 Flashcards

Question 1

Q

markup

Answer

A

meta information in a text file that is clearly distinguishable from the textual content

Question 2

Q

unicode

Answer

A

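universal character standard that covers all writing systems; far more inclusive than ASCII, which defines only 128 characters (basic Latin letters, digits, punctuation and control codes)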

for maximum compatibility we encode texts in UTF-8 when reading and writing
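
A minimal sketch of reading and writing with an explicit UTF-8 encoding. The helper names (`write_text_utf8`, `read_text_utf8`, `roundtrip`) and the file name are made up for illustration; only the built-in `open` with its `encoding` parameter is standard Python.

```python
from pathlib import Path
import tempfile


def write_text_utf8(path, text):
    # Pass the encoding explicitly; the platform default may not be UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_utf8(path):
    # Read with the same encoding that was used for writing.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def roundtrip(text):
    """Write text to a temporary file and read it back (illustrative helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"  # hypothetical file name
        write_text_utf8(path, text)
        return read_text_utf8(path)
```

If the round trip returns the same string, no characters were lost or garbled on the way to disk and back.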

Question 3

Q

minimum edit distance between two strings

Answer

A

minimum number of editing operations (insertion, substitution, deletion) needed to transform one string into another

Levenshtein distance: deletion, insertion and substitution all have a cost of 1

Question 4

Q

token count

Answer

A

number of words in a document, including duplicates

Question 5

Q

vocabulary size

Answer

A

number of unique terms, feature size when we use words as features
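
The difference between token count and vocabulary size can be sketched as follows. The whitespace tokenizer with lowercasing is a simplifying assumption, not a recommended tokenizer, and the function names are illustrative.

```python
def tokenize(text):
    # Simplifying assumption: tokens are whitespace-separated and lowercased.
    return text.lower().split()


def token_count(text):
    # Every occurrence counts, duplicates included.
    return len(tokenize(text))


def vocabulary_size(text):
    # Only distinct terms count.
    return len(set(tokenize(text)))


doc = "the cat sat on the mat"
print(token_count(doc))      # 6
print(vocabulary_size(doc))  # 5 ("the" occurs twice)
```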

Question 6

Q

stop words

Answer

A

extremely common words without much content

remove stop words: keyword extraction
never remove stop words: sequence labelling tasks or classification tasks with small data
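
A minimal sketch of stop word removal before keyword extraction. The stop list here is a tiny hypothetical one for illustration; real stop lists are much longer and language-specific.

```python
# Tiny hypothetical stop list; not taken from any particular library.
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "is", "and", "to"}


def remove_stop_words(tokens):
    # Compare case-insensitively, but keep the original token form.
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "the cat is on the mat".split()
print(remove_stop_words(tokens))  # ['cat', 'mat']
```

For sequence labelling the filtered list would be unsuitable, because the positions of the remaining tokens no longer match the original sentence.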

Question 7

Q

basic word forms

Answer

A

mapping words to a basic form reduces the number of features and helps a model generalize better

lemma: dictionary form of a word (verbs: infinitive, nouns: singular form)

stem: portion of a word that is common to a set of (inflected) forms when all affixes are removed (not further analyzable into meaningful elements)
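
A toy sketch contrasting the two ideas. The suffix list and the lemma table are hypothetical and far from complete; this is not the Porter stemmer or any real lemmatizer, only an illustration of affix stripping versus dictionary lookup.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, incomplete suffix list
LEMMAS = {"went": "go", "mice": "mouse", "was": "be"}  # hypothetical lookup table


def crude_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    # Irregular forms need a dictionary; fall back to the crude stem otherwise.
    return LEMMAS.get(word, crude_stem(word))


print([crude_stem(w) for w in ["walking", "walked", "walks"]])  # ['walk', 'walk', 'walk']
print(crude_stem("went"), toy_lemma("went"))                    # went go
```

The last line shows why stemming alone cannot relate irregular forms: no affix can be removed from "went" to reach "go".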

Question 8

Q

character encoding

Answer

A

the mapping between characters and the numbers (bytes) a computer stores,
which lets stored data be shown as text that humans can understand
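
A short sketch of how the same character is stored differently under two encodings, and what happens when bytes are decoded with the wrong one. The helper name `encoded_bytes` is illustrative; `str.encode`, `bytes.decode` and `ord` are standard Python.

```python
def encoded_bytes(text, encoding):
    # Return the byte values used to store `text` under `encoding`.
    return list(text.encode(encoding))


e_acute = "\u00e9"  # the character é

print(ord(e_acute))                       # 233 (Unicode code point)
print(encoded_bytes(e_acute, "utf-8"))    # [195, 169]
print(encoded_bytes(e_acute, "latin-1"))  # [233]

# Decoding UTF-8 bytes as Latin-1 yields garbled text (mojibake).
garbled = e_acute.encode("utf-8").decode("latin-1")
print(len(garbled))  # 2 characters instead of 1
```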

Question 9

Q

Levenshtein distance at (i, j)

Answer

A

D(i, j) = minimum of:
D(i-1, j) + 1 (deletion)
D(i, j-1) + 1 (insertion)
D(i-1, j-1) + 1 if X(i) ≠ Y(j) (substitution)
D(i-1, j-1) + 0 if X(i) = Y(j) (match)
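
The recurrence above can be sketched as a dynamic-programming table. The base cases D(i, 0) = i and D(0, j) = j are not stated on the card; they are the usual initialization (turning a prefix into the empty string, or the reverse, costs one operation per character). The function name is illustrative.

```python
def levenshtein(x, y):
    n, m = len(x), len(y)
    # D[i][j] = distance between the first i characters of x
    # and the first j characters of y.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # assumed base case: i deletions
    for j in range(m + 1):
        D[0][j] = j  # assumed base case: j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Strings are 0-indexed, so X(i) is x[i - 1].
            substitution_cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,                      # deletion
                D[i][j - 1] + 1,                      # insertion
                D[i - 1][j - 1] + substitution_cost,  # substitution or match
            )
    return D[n][m]


print(levenshtein("kitten", "sitting"))  # 3
```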

Question 10

Q

token

Answer

A

an instance of a word or term occurring in a document

Question 11

Q

term

Answer

A

a token when used as feature (or in an index), generally in normalized form (e.g. lowercased)
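
A minimal sketch of turning tokens into terms for an index. The normalization chosen here (lowercasing plus stripping surrounding punctuation) is an assumption for illustration, and the names `to_term` and `build_index` are hypothetical.

```python
from collections import defaultdict


def to_term(token):
    # Assumed normalization: lowercase and strip surrounding punctuation.
    return token.lower().strip(".,;:!?\"'()")


def build_index(docs):
    # Map each term to the set of document ids in which it occurs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            term = to_term(token)
            if term:
                index[term].add(doc_id)
    return index


docs = {1: "The cat sat.", 2: "A Cat, the dog."}
index = build_index(docs)
print(sorted(index["cat"]))  # [1, 2]
```

The tokens "cat" and "Cat," are different strings, but both map to the single term "cat".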

Question 12

Q

Optical Character Recognition

Answer

A

a technique for converting the image of a printed text to a digital text

(12 cards)