C2 Flashcards
markup
meta information in a text file that is clearly distinguishable from the textual content
unicode
universal standard for all writing systems, more inclusive than ASCII
for maximum compatibility we encode texts in UTF-8 when reading and writing
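A minimal sketch of passing the encoding explicitly when writing and reading, using only the Python standard library. The helper name roundtrip_utf8 and the file name sample.txt are made up for illustration.

```python
import os
import tempfile


def roundtrip_utf8(text: str) -> str:
    """Write text to a temporary file as UTF-8 and read it back (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        # Name the encoding explicitly instead of relying on the platform default.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()


sample = "na\u00efve caf\u00e9"
print(roundtrip_utf8(sample) == sample)
```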
minimum edit distance between two strings
minimum number of editing operations (insertion, substitution, deletion) needed to transform one string into another
Levenshtein distance: deletion, insertion and substitution all have a cost of 1
token count
number of words in a document, including duplicates
vocabulary size
number of unique terms; equals the number of features when we use words as features
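A small sketch contrasting token count and vocabulary size. It assumes whitespace tokenization, which is a simplification; the function names are illustrative only.

```python
def token_count(tokens: list[str]) -> int:
    """Number of tokens, duplicates included."""
    return len(tokens)


def vocabulary_size(tokens: list[str]) -> int:
    """Number of unique tokens."""
    return len(set(tokens))


# Assumption: naive whitespace tokenization of an already lowercased sentence.
tokens = "the cat sat on the mat".split()
print(token_count(tokens))      # 6
print(vocabulary_size(tokens))  # 5 ("the" occurs twice)
```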
stop words
extremely common words without much content
- remove stop words: keyword extraction
- never remove stop words: sequence labelling tasks or classification tasks with small data
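A minimal sketch of stop word removal, as one might do before keyword extraction. The STOP_WORDS set is a tiny hypothetical list for illustration, not a standard resource.

```python
# Assumption: a tiny illustrative stop list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "on", "is", "and"}


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only tokens whose lowercased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words("The cat sat on the mat".split()))
# ['cat', 'sat', 'mat']
```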
basic word forms
reduce the number of features and generalize better
lemma: dictionary form of a word (verbs: infinitive, nouns: singular form)
stem: portion of a word that is common to a set of (inflected) forms when all affixes are removed (not further analyzable into meaningful elements)
character encoding
the mapping between characters and the bytes a computer stores, which lets stored text be displayed in a form that humans can understand
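A short sketch showing that the same character maps to different bytes under different encodings, and that decoding with the wrong encoding garbles the text. The variable names are illustrative.

```python
char = "\u00e9"  # the character e with an acute accent

utf8_bytes = char.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = char.encode("latin-1")  # one byte in Latin-1

print(utf8_bytes)    # b'\xc3\xa9'
print(latin1_bytes)  # b'\xe9'

# Decoding UTF-8 bytes as Latin-1 yields two wrong characters instead of one.
garbled = utf8_bytes.decode("latin-1")
print(len(garbled))  # 2
```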
Levenshtein distance at (i, j)
minimum of:
D(i-1, j) + 1 (deletion)
D(i, j-1) + 1 (insertion)
D(i-1, j-1) + 1 if X(i) ≠ Y(j) (substitution)
D(i-1, j-1) + 0 if X(i) = Y(j) (match)
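The recurrence above can be sketched as a dynamic-programming table. This is a minimal illustration, assuming the usual base cases D(i, 0) = i and D(0, j) = j, which the card does not state; the function name levenshtein is illustrative.

```python
def levenshtein(x: str, y: str) -> int:
    """Levenshtein distance with cost 1 for deletion, insertion and substitution."""
    m, n = len(x), len(y)
    # D[i][j] = distance between the first i characters of x and the first j of y.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed base cases: transforming to or from the empty string.
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Strings are 0-indexed, so X(i) is x[i - 1] and Y(j) is y[j - 1].
            substitution_cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,                      # deletion
                D[i][j - 1] + 1,                      # insertion
                D[i - 1][j - 1] + substitution_cost,  # substitution or match
            )
    return D[m][n]


print(levenshtein("kitten", "sitting"))  # 3
```

For "kitten" and "sitting" the three edits are: substitute k with s, substitute e with i, insert g.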
token
an instance of a word or term occurring in a document
term
a token when used as feature (or in an index), generally in normalized form (e.g. lowercased)
Optical Character Recognition
a technique for converting the image of a printed text to a digital text