Words Flashcards
What are some problems with natural language?
There is lots of ambiguity from identical word forms
There is also dependency on punctuation or intonation
What types of texts exist?
Formal News
Polemic News (argumentative)
Speech
Historic, Poetic, Musical
Social Media
What is a sentence?
A unit of written language
What is an utterance?
It is a unit of spoken language
What is a word form?
It is the inflected form as it appears in the corpus
What is a lemma?
It is an abstract form shared by word forms having the same stem, POS, and word sense
What are function words?
They indicate the grammatical relationships between terms but have little topical meaning
What are types?
They are the distinct word forms in a corpus
What are tokens?
They are all the occurrences of words in the corpus, counting repeats
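A minimal Python sketch of the type/token distinction, using simple whitespace splitting (an illustrative assumption; any tokenizer would do):

```python
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()           # every occurrence counts as a token
types = set(tokens)             # distinct word forms are the types
print(len(tokens), len(types))  # 11 tokens, 8 types
```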
What are some lexical analysis steps we can take?
Stripping punctuation, folding case, removing function words, lemmatising and stemming text, and building an index of the words
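A rough sketch of these steps in Python, assuming NLTK and its downloaded English stopword list are available; the function name and example sentence are illustrative only:

```python
import re
from nltk.corpus import stopwords      # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

def lexical_analysis(text):
    """Strip punctuation, fold case, remove function (stop) words,
    stem the remaining tokens, and build a term index."""
    text = re.sub(r"[^\w\s]", " ", text)            # strip punctuation
    tokens = text.lower().split()                   # fold case, split on whitespace
    function_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in function_words]
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]
    # one id per distinct stem, in order of first appearance
    index = {term: i for i, term in enumerate(dict.fromkeys(stems))}
    return stems, index

print(lexical_analysis("The cats were sitting on the mats, watching the birds."))
```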
How much of text is function words?
They account for up to 60 percent of text
What does repetition signal?
It signals intention
What do wordclouds provide?
They provide a visual representation of a statistical summary of the text's word frequencies
What is tokenization?
It is the process of turning a stream of characters into a sequence of words
What is a token?
A token is a lexical construct that can be assigned grammatical and semantic roles
What is a naive solution to tokenization?
Break on spaces and punctuation. This is too simple for the general case: punctuation is a useful piece of information for parsers and helps to indicate sentence boundaries, so discarding it loses information
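A minimal sketch of this naive approach using a regular-expression split; the example sentence is illustrative and shows where the approach falls down:

```python
import re

def naive_tokenize(text):
    """Break on spaces and punctuation (the naive approach above)."""
    return [t for t in re.split(r"[\s\W]+", text) if t]

# Abbreviations, prices and contractions are all split apart.
print(naive_tokenize("Dr. Smith paid $4.50 for rock 'n' roll tickets, didn't he?"))
# ['Dr', 'Smith', 'paid', '4', '50', 'for', 'rock', 'n', 'roll', 'tickets', 'didn', 't', 'he']
```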
What are some tokenization issues?
Punctuation can be internal, for example abbreviations, prices, times etc
We can have multiword expressions (New York, rock ‘n’ roll)
Clitic contractions (we’re, I’m, etc.)
Numeric expressions
What tokenization method is better than splitting on spaces and punctuation?
Pattern tokenization
What is pattern tokenization?
It is the use of regular expressions or other pattern matching styles to define tokenization rules
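A small sketch of pattern tokenization with a hand-written regular expression; the pattern and its rules are illustrative, not a standard:

```python
import re

# Keeps prices and clitic contractions together instead of splitting them.
TOKEN_PATTERN = re.compile(r"""
      \$?\d+(?:\.\d+)?      # prices and numbers, e.g. $4.50
    | \w+(?:'\w+)?          # words, optionally with a clitic ('m, 're, n't)
    | [^\w\s]               # any remaining punctuation as its own token
""", re.VERBOSE)

def pattern_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(pattern_tokenize("I'm paying $4.50, aren't you?"))
# ["I'm", 'paying', '$4.50', ',', "aren't", 'you', '?']
```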
What are some problems with pattern tokenization?
Rules will be corpus specific and probably overfitted to achieve accuracy for a very specific task
Besides splitting on punctuation/spaces and pattern tokenization, what is another way to tokenize text?
We can learn common patterns from the corpus itself, or a similar training corpus
Words can be split into subword units (morphemes, significant punctuation)
How can we split words?
Lemmatization is a way of determining that words with different superficial forms share the same root
Words are composed of subword units called morphemes
Morphemes are word parts with recognised meanings: stems and affixes
What is stemming?
Stemming is a lemmatization process that removes suffixes by applying a sequence of rewrites
What is a popular stemmer?
The Porter Stemmer
What are some issues with the Porter Stemmer?
It is very language specific and is very crude
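A quick sketch of the Porter Stemmer via NLTK (assuming NLTK is installed), which also shows how crude the suffix rewrites can be:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "relational", "universal", "university"]:
    print(word, "->", stemmer.stem(word))
# "universal" and "university" both collapse to the same stem "univers",
# illustrating how crude the rule-based rewrites are.
```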
What is BPE?
Byte-Pair Encoding
How does BPE work?
It starts with a vocabulary consisting of the set of individual characters and learns new tokens by merging symbols. The merge step is repeated k times, examining the corpus each time.
What happens each time BPE repeats?
It selects the two symbols that are most frequently adjacent. (A, B)
It adds the new merged symbol to the vocabulary (AB)
It replaces every adjacent A B in the corpus with the new AB
What happens when BPE is run with thousands of merges on a very large corpus?
It represents most words as full symbols; only rare and unknown words will have to be represented by their parts
Explain what the image shows
It shows BPE in progress. We start with the character vocabulary plus an end-of-word symbol. BPE then looks for the most frequent pair of adjacent tokens, which is 'er'. We create the 'er' token, rewrite every occurrence of that adjacent pair in the corpus as the merged token, and then repeat the process. This continues until we have enough symbols
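A minimal sketch of the BPE learning loop described above; it follows the common textbook presentation with an end-of-word marker '_' (an assumption) and toy data:

```python
from collections import Counter

def learn_bpe(words, k):
    """Learn k BPE merges from a list of words."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = Counter(tuple(word) + ('_',) for word in words)
    merges = []
    for _ in range(k):
        # Count every pair of adjacent symbols across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Take the most frequent adjacent pair and merge it everywhere.
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy corpus: five occurrences of "lower" and two of "low".
print(learn_bpe(["lower"] * 5 + ["low"] * 2, k=3))  # prints the learned merges
```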
What are some alternatives to BPE?
WordPiece is based on an n-gram language model, using multiple adjacent words as single tokens
SentencePiece extends these as a simple, language-independent text tokenizer
How do we compute if words are similar?
Levenshtein Distance
Minimum Edit Distance
How does Levenshtein distance work?
It is the length of the shortest sequence of edits needed to transform one string into another. In the image, we make 3 substitutions, a deletion and an insertion, for a total of 5 edits
How does Minimum Edit Distance work?
It works by writing down the set of edits needed to go from one word to another. Its applications include plagiarism analysis, alignments in parallel corpora, and similarities between different generations of software updates
How can the Levenshtein metric be calculated?
It can be calculated recursively
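A minimal recursive sketch (memoised for efficiency), using substitution cost 1 so that "intention" to "execution" gives the 5 edits from the earlier card:

```python
from functools import lru_cache

def levenshtein(a: str, b: str) -> int:
    """Recursive Levenshtein distance between strings a and b."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j                    # insert the remaining j characters
        if j == 0:
            return i                    # delete the remaining i characters
        sub_cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1,                 # deletion
                   d(i, j - 1) + 1,                 # insertion
                   d(i - 1, j - 1) + sub_cost)      # substitution or match
    return d(len(a), len(b))

print(levenshtein("intention", "execution"))  # 5
```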