Low-Level Analysis Flashcards
Text Preprocessing
Break up the text:
- Sentence segmentation (sentence tokenization). Treat each sentence as a token.
- Lexical analysis (word tokenization). Treat each word as a token.
Sentence Segmentation
Treating each sentence as a token. A period, exclamation point, or question mark can be used to segment sentences. Periods in abbreviations need special handling so they aren't treated as sentence breaks; sentence tokenizers take these into account.
- Use case determines whether sentence segmentation is needed or whether you can go straight to word tokenization.
- 95% of use cases can segment sentences with three rules (sketched in code below):
1) A period ends a sentence.
2) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.
3) If the next token is capitalized, then it ends a sentence.
Sentence tokenizers: Solr StandardTokenizer, OpenNLP english.Tokenizer, OpenNLP SimpleTokenizer
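A minimal sketch of the three-rule heuristic above, assuming a hand-compiled abbreviation list (the ABBREVIATIONS set here is an illustrative stand-in, not a real resource):

```python
# Stand-in for the hand-compiled abbreviation list mentioned above.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "st.", "e.g.", "i.e.", "etc."}

def segment_sentences(text):
    """Apply the three rules above to whitespace-separated tokens."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith((".", "!", "?")):
            # Rule 2: a period after a known abbreviation is not a break.
            if tok.lower() in ABBREVIATIONS:
                continue
            # Rules 1 and 3: break at end of text or before a capitalized token.
            if i + 1 == len(tokens) or tokens[i + 1][0].isupper():
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(segment_sentences("Dr. Smith arrived. He met Mrs. Jones. They talked."))
# ['Dr. Smith arrived.', 'He met Mrs. Jones.', 'They talked.']
```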
Word Tokenization
Once sentences have been tokenized, words can be tokenized. Contractions like "didn't" need special handling so they aren't split apart. Punctuation shouldn't be included in tokens, e.g., "jumping!" should yield "jumping". You end up with a bag of words (BoW), from which you can compute word frequencies.
- Cutting on spaces is a problem for multiword names like New Mexico, South America, etc. (entities that should be tokenized together).
- Examples that get interesting: small words (xp, ma, j lo), hyphenated words (e-bay, wal-mart, t-shirts), special characters (URLs, code), capitalized words (Bush vs. Apple), apostrophes (can't, 80's, master's degree), numbers (nokia 3250, united 93, etc.), periods (I.B.M., Ph.D.). A bag-of-words sketch follows below.
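A minimal bag-of-words sketch, assuming a simple regex tokenizer that keeps internal apostrophes and hyphens (it does not handle the multiword or special-character cases above):

```python
import re
from collections import Counter

def tokenize(text):
    # Keep letters/digits plus internal apostrophes and hyphens, so
    # "can't", "80's", and "e-bay" survive as single tokens.
    return re.findall(r"[a-z0-9]+(?:['\-][a-z0-9]+)*", text.lower())

text = "Didn't I see you jumping? Jumping again; can't stop!"
bag_of_words = Counter(tokenize(text))
print(bag_of_words.most_common(3))
# e.g., [('jumping', 2), ("didn't", 1), ('i', 1)]
```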
Text Normalization
This involves cleaning or adjusting text so that it can be compared to other documents. Some examples include:
- Contractions (we expand them)
- Removal of stop words (the, and, a)
- Resolving misspellings
- Stemming, if needed (so that related words like run, running, ran, etc. are not treated as unassociated). Stemming reduces a word to its root form. An algorithmic stemmer uses a program to decide whether two words are related based on word suffixes; a dictionary-based stemmer relies on pre-created dictionaries of related words. The Porter stemmer is an example of an algorithmic stemmer.
- Lemmatization: Uses lexical knowledge bases such as WordNet to obtain word base forms.
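A sketch of a small normalization pipeline, assuming NLTK's WordNet lemmatizer is available; the CONTRACTIONS map and STOP_WORDS set are illustrative, not exhaustive:

```python
from nltk.stem import WordNetLemmatizer  # requires: nltk.download('wordnet')

# Illustrative (not exhaustive) normalization resources.
CONTRACTIONS = {"didn't": "did not", "can't": "can not", "we're": "we are"}
STOP_WORDS = {"the", "and", "a", "is", "are", "not", "did", "can", "we"}

def normalize(tokens):
    lemmatizer = WordNetLemmatizer()
    expanded = []
    for tok in tokens:
        expanded.extend(CONTRACTIONS.get(tok, tok).split())   # expand contractions
    content = [t for t in expanded if t not in STOP_WORDS]    # remove stop words
    return [lemmatizer.lemmatize(t, pos="v") for t in content]  # verb base forms

print(normalize(["we're", "running", "and", "ran", "the", "race"]))
# ['run', 'run', 'race'] -- lemmatization maps irregular 'ran' to 'run'
```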
Function words
Function words are a closed class and lack content; they connect content words. They don't answer the 6 Ws (who, what, why, when, where, how). Examples: is, am, are, was, were, he, she, you, we, they, if, then, therefore, possibly. "Y'all" was the last new function word. The default list of stop words is the function words of a language, but it depends on the case at hand; content words can also be included as stop words.
Content Words
Content words are newly formed all the time and are considered an open class. There are many, many more content words than function words.
Handling misspellings
Two fundamental approaches:
- Edit-distance method: Counts how many single-character edits (add one character, delete one character, replace one character) would turn the misspelling into a properly spelled word, and returns the dictionary word with the lowest edit distance. Stops after 3 edits.
- Fuzzy string comparison: Characters in common as a percentage of total characters.
Special short-circuits: some simple corrections can be made before invoking the spell correctors, like collapsing repeated characters ("mostllly") or fixing a space accidentally inserted inside a word. Both core approaches are sketched below.
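A minimal sketch of both approaches, with an illustrative DICTIONARY standing in for a real word list:

```python
from difflib import SequenceMatcher

def edit_distance(a, b):
    """Levenshtein distance: one add, delete, or replace per edit."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete a character
                           cur[j - 1] + 1,             # add a character
                           prev[j - 1] + (ca != cb)))  # replace a character
        prev = cur
    return prev[-1]

DICTIONARY = ["spell", "speller", "spelling", "smell"]  # illustrative

def correct(word, max_edits=3):
    """Return the dictionary word with the lowest edit distance, capped at 3."""
    best = min(DICTIONARY, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_edits else word

print(correct("speling"))  # -> 'spelling' (one edit away)
print(SequenceMatcher(None, "speling", "spelling").ratio())  # fuzzy: ~0.93
```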
Stemming
Stemming breaks a word into morphemes (the morphemes of "cats" are "cat" and "s"). Words consist of stems plus affixes (prefixes and suffixes): in "reimagining", "imagine" is the stem and "re-" is an affix. Affixes can't be words by themselves.
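A quick demo with NLTK's Porter stemmer; note that an algorithmic suffix stemmer catches "running" but not the irregular "ran", which is lemmatization's job:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "ran", "reimagining"]:
    print(word, "->", stemmer.stem(word))
# running -> run
# runs -> run
# ran -> ran               (irregular forms are beyond suffix stripping)
# reimagining -> reimagin  (strips '-ing'; prefixes like 're-' are left alone)
```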
Low-level document feature extraction
Primary features:
- We are examining only the document itself
- Word frequencies, collocations (n-grams)
Secondary features:
- Requires us to compare features of the document to those of other documents
- Differential frequency analysis (TF/IDF)
- Relative lexical diversity
- Reading level
Terminology Extraction
- Frequency-based extraction; remove stop words first
- Get collocations (bigrams: two-word sequences; trigrams; or n-grams generally), as sketched below
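A minimal sketch of frequency-based term extraction with collocations; the STOP_WORDS set is illustrative:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "and", "of", "in", "is"}  # illustrative list

def extract_terms(tokens, n=2):
    unigrams = Counter(t for t in tokens if t not in STOP_WORDS)
    # Build n-grams over the raw token stream so collocations stay adjacent,
    # then drop any gram that contains a stop word.
    ngrams = Counter(
        gram for gram in zip(*(tokens[i:] for i in range(n)))
        if not any(w in STOP_WORDS for w in gram)
    )
    return unigrams, ngrams

tokens = "the history of new mexico and the rivers of new mexico".split()
unigrams, bigrams = extract_terms(tokens)
print(bigrams.most_common(1))  # [(('new', 'mexico'), 2)]
```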
Differential Frequency Analysis
- Analyze a term's frequency in this document compared with its frequency in other texts (the "turtle" example: a word that appears far more often here than in typical documents is likely important to this document)
Term Frequency Inverse Document Frequency (TF-IDF)
This is a way to determine a word's importance in a document relative to how common it is across a corpus overall, computed by multiplying term frequency by inverse document frequency.
- Term frequency: How often a term occurs in a document
- Document frequency: How common the term is within a domain represented by a corpus of documents
- Inverse document frequency: Divide the total number of documents in the corpus by the number of documents containing our target term, and apply a log scale: idf(t) = log(N / df(t)).
If a term is rare across the corpus, its IDF is bigger; if a word is common, it will be smaller. The higher the TF-IDF score, the more distinctive that word is for the document. A worked sketch follows below.
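A worked sketch built directly from the definitions above:

```python
import math

def tf_idf(term, doc, corpus):
    """Score 'term' in 'doc' against 'corpus' (a list of token lists)."""
    tf = doc.count(term) / len(doc)                  # term frequency
    df = sum(1 for d in corpus if term in d)         # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse document frequency
    return tf * idf

corpus = [
    "the turtle swam to the turtle pond".split(),
    "the stock market fell today".split(),
    "the pond froze overnight".split(),
]
print(round(tf_idf("turtle", corpus[0], corpus), 3))  # 0.314: rare -> high score
print(round(tf_idf("the", corpus[0], corpus), 3))     # 0.0: common everywhere
```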
Lexical Diversity
How broad the vocabulary is compared to the length of the document: the same words over and over, or a varied vocabulary? As a secondary feature, compare one document's diversity against others.
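One common way to operationalize this (an assumption; the notes don't name a formula) is the type-token ratio:

```python
def lexical_diversity(tokens):
    """Type-token ratio: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens)

repetitive = "the cat saw the cat and the cat ran".split()
varied = "the quick brown fox jumps over a lazy sleeping dog".split()
print(round(lexical_diversity(repetitive), 2))  # 0.56 -- same words repeated
print(lexical_diversity(varied))                # 1.0  -- every word distinct
```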
Readability Formula
- Must be a secondary feature: readability is assessed by comparing against other documents or reference word lists
- Compare the text against a list of words known and mastered by 80% of 4th graders (the Dale-Chall readability formula; sketched below)
- readabilityformulas.com
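A sketch of the (new) Dale-Chall formula using its published coefficients; the EASY_WORDS set here is a tiny stand-in for the real list of roughly 3,000 familiar words:

```python
EASY_WORDS = {"the", "cat", "sat", "on", "a", "mat", "and", "ran", "away"}

def dale_chall(tokens, num_sentences):
    difficult = [t for t in tokens if t not in EASY_WORDS]
    pct_difficult = 100.0 * len(difficult) / len(tokens)
    avg_sentence_len = len(tokens) / num_sentences
    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_len
    if pct_difficult > 5.0:
        score += 3.6365  # adjustment applied to harder texts
    return score

tokens = "the cat sat on a mat and the cat ran away".split()
print(round(dale_chall(tokens, num_sentences=2), 2))  # 0.27 -- an easy text
```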
Automatic Tagging
Tag depends on the word and its context within a sentence.
- Lookup tagger: Find the x most frequent words and store each one's most likely tag. If a word isn't among them, fall back to a default tagger.
- N-gram tagging: Considered the standard way to tag.
Context is the current word together with the POS tags of the n-1 preceding tokens (e.g., with n=3, the two preceding tags).
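A sketch of both taggers using NLTK, assuming the treebank sample is installed; the bigram tagger backs off to a lookup-style unigram tagger, which backs off to a default tag:

```python
import nltk  # requires: nltk.download('treebank')
from nltk.corpus import treebank

train = treebank.tagged_sents()[:3000]

# Lookup-style tagging: most likely tag per word, default NN for unknowns.
default = nltk.DefaultTagger("NN")
unigram = nltk.UnigramTagger(train, backoff=default)

# N-gram tagging (n=2): context is the word plus the previous token's tag.
bigram = nltk.BigramTagger(train, backoff=unigram)

print(bigram.tag("The turtle swam to the pond".split()))
```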