Words Flashcards
What are some problems with natural language?
There is lots of ambiguity from identical word forms
There is also dependency on punctuation or intonation
What types of texts exist?
Formal News
Polemic News (argumentative)
Speech
Historic, Poetic, Musical
Social Media
What is a sentence?
A unit of written language
What is an utterance?
It is a unit of spoken language
What is a word form?
It is the inflected form as it appears in the corpus
What is a lemma?
It is an abstract form shared by word forms having the same stem, POS, and word sense
What are function words?
They indicate the grammatical relationships between terms but have little topical meaning
What are types?
They are the distinct word forms in a corpus
What are tokens?
They are all the occurrences of words in the corpus, counting repeats
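A minimal Python sketch of the type/token distinction, using simple whitespace splitting (an illustrative assumption; any tokenizer would do):

```python
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()           # every occurrence counts as a token
types = set(tokens)             # distinct word forms are the types
print(len(tokens), len(types))  # 11 tokens, 8 types
```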
What are some lexical analysis steps we can take?
Stripping punctuation, folding case, removing function words, lemmatising and stemming text, and building an index of the words
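A rough sketch of these steps in Python, assuming NLTK and its downloaded English stopword list are available; the function name and example sentence are illustrative only:

```python
import re
from nltk.corpus import stopwords      # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

def lexical_analysis(text):
    """Strip punctuation, fold case, remove function (stop) words,
    stem the remaining tokens, and build a term index."""
    text = re.sub(r"[^\w\s]", " ", text)            # strip punctuation
    tokens = text.lower().split()                   # fold case, split on whitespace
    function_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in function_words]
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]
    # one id per distinct stem, in order of first appearance
    index = {term: i for i, term in enumerate(dict.fromkeys(stems))}
    return stems, index

print(lexical_analysis("The cats were sitting on the mats, watching the birds."))
```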
How much of text is function words?
They account for up to 60 percent of text
What does repetition signal?
It signals intention
What do wordclouds provide?
They provide a visual representation of a statistical summary of the text's word frequencies
What is tokenization?
It is the process of turning a stream of characters into a sequence of words
What is a token?
A token is a lexical construct that can be assigned grammatical and semantic roles
What is a naive solution to tokenization?
Break on spaces and punctuation. This is too simple for the general case: punctuation is a useful piece of information for parsers and helps to indicate sentence boundaries, so discarding it loses information
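A minimal sketch of this naive approach using a regular-expression split; the example sentence is illustrative and shows where the approach falls down:

```python
import re

def naive_tokenize(text):
    """Break on spaces and punctuation (the naive approach above)."""
    return [t for t in re.split(r"[\s\W]+", text) if t]

# Abbreviations, prices and contractions are all split apart.
print(naive_tokenize("Dr. Smith paid $4.50 for rock 'n' roll tickets, didn't he?"))
# ['Dr', 'Smith', 'paid', '4', '50', 'for', 'rock', 'n', 'roll', 'tickets', 'didn', 't', 'he']
```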
What are some tokenization issues?
Punctuation can be internal, for example abbreviations, prices, times etc
We can have multiword expressions (New York, rock ‘n’ roll)
Clitic contractions (we’re, I’m, etc.)
Numeric expressions
What tokenization method is better than splitting on spaces and punctuation?
Pattern tokenization
What is pattern tokenization?
It is the use of regular expressions or other pattern matching styles to define tokenization rules
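A small sketch of pattern tokenization with a hand-written regular expression; the pattern and its rules are illustrative, not a standard:

```python
import re

# Keeps prices and clitic contractions together instead of splitting them.
TOKEN_PATTERN = re.compile(r"""
      \$?\d+(?:\.\d+)?      # prices and numbers, e.g. $4.50
    | \w+(?:'\w+)?          # words, optionally with a clitic ('m, 're, n't)
    | [^\w\s]               # any remaining punctuation as its own token
""", re.VERBOSE)

def pattern_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(pattern_tokenize("I'm paying $4.50, aren't you?"))
# ["I'm", 'paying', '$4.50', ',', "aren't", 'you', '?']
```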
What are some problems with pattern tokenization?
Rules will be corpus specific and probably overfitted to achieve accuracy for a very specific task
Besides splitting on punctuation/spaces and pattern tokenization, what is another way to tokenize text?
We can learn common patterns from the corpus itself, or a similar training corpus
Words can be split into subword units (morphemes, significant punctuation)
How can we split words?
Lemmatization is a way of determining that words with different superficial forms share the same root
Words are composed of subword units called morphemes
Morphemes are word parts with recognised meanings: stems and affixes
What is stemming?
Stemming is a lemmatization process that removes suffixes by applying a sequence of rewrites
What is a popular stemmer?
The Porter Stemmer
What are some issues with the Porter Stemmer?
It is very language specific and is very crude
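A quick sketch of the Porter Stemmer via NLTK (assuming NLTK is installed), which also shows how crude the suffix rewrites can be:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "relational", "universal", "university"]:
    print(word, "->", stemmer.stem(word))
# "universal" and "university" both collapse to the same stem "univers",
# illustrating how crude the rule-based rewrites are.
```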
What is BPE?
Byte-Pair Encoding
How does BPE work?
It starts with a vocabulary consisting of the set of individual characters and learns new tokens by merging symbols. The merge step is repeated k times, examining the corpus each time.
What happens each time BPE repeats?
It selects the two symbols that are most frequently adjacent. (A, B)
It adds the new merged symbol to the vocabulary (AB)
It replaces every adjacent A B in the corpus with the new AB
What happens when BPE is run with thousands of merges on a very large corpus?
It represents most words as full symbols; only rare and unknown words will have to be represented by their parts
Explain what the image shows
It shows BPE in progress. We start with the character vocabulary plus an end-of-word symbol. BPE then looks for the most frequent pair of adjacent tokens, which is 'er'. We create the 'er' token, rewrite every occurrence of that adjacent pair in the corpus as the merged token, and then repeat the process. This continues until we have enough symbols
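A minimal sketch of the BPE learning loop described above; it follows the common textbook presentation with an end-of-word marker '_' (an assumption) and toy data:

```python
from collections import Counter

def learn_bpe(words, k):
    """Learn k BPE merges from a list of words."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = Counter(tuple(word) + ('_',) for word in words)
    merges = []
    for _ in range(k):
        # Count every pair of adjacent symbols across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Take the most frequent adjacent pair and merge it everywhere.
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy corpus: five occurrences of "lower" and two of "low".
print(learn_bpe(["lower"] * 5 + ["low"] * 2, k=3))  # prints the learned merges
```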
What are some alternatives to BPE?
WordPiece is based on an n-gram language model, using multiple adjacent words as single tokens
SentencePiece extends these as a simple, language-independent text tokenizer
How do we compute if words are similar?
Levenshtein Distance
Minimum Edit Distance
How does Levenshtein distance work?
It is the length of the shortest sequence of edits needed to transform one string into another. In the image, we make 3 substitutions, a deletion and an insertion, for a total of 5 edits
How does Minimum Edit Distance work?
It works by writing down the set of edits needed to go from one word to another. Its applications include plagiarism analysis, alignments in parallel corpora, and similarities between different generations of software updates
How can the Levenshtein metric be calculated?
It can be calculated recursively
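A minimal recursive sketch (memoised for efficiency), using substitution cost 1 so that "intention" to "execution" gives the 5 edits from the earlier card:

```python
from functools import lru_cache

def levenshtein(a: str, b: str) -> int:
    """Recursive Levenshtein distance between strings a and b."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j                    # insert the remaining j characters
        if j == 0:
            return i                    # delete the remaining i characters
        sub_cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1,                 # deletion
                   d(i, j - 1) + 1,                 # insertion
                   d(i - 1, j - 1) + sub_cost)      # substitution or match
    return d(len(a), len(b))

print(levenshtein("intention", "execution"))  # 5
```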