B03 Text Analytics II Flashcards
Text Pre-Processing Steps
- Tokenize Text
- Normalize Spelling
- Remove Stop Words
- Stem Words
- Normalize Case
- Tag Parts of Speech
- Detect Sentence Boundaries
Pre-Processing Definitions: Tokenization
- This involves breaking up text data into individual units, or tokens; in other words, creating a bag of words from a document.
- Tokens are most often single words, but can also be n-grams, sentences, or paragraphs.
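A minimal sketch of word-level tokenization using NLTK (an assumed tool choice; the flashcards do not name a library). It requires `nltk` to be installed and the "punkt" tokenizer models downloaded:

```python
import nltk

# One-time setup (assumption: network access to fetch the model):
# nltk.download("punkt")

text = "The cow jumped over the moon."
tokens = nltk.word_tokenize(text)  # splits the string into word tokens
print(tokens)
# ['The', 'cow', 'jumped', 'over', 'the', 'moon', '.']
```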
Pre-Processing Definitions: n-gram
An n-gram is a contiguous sequence of n words within a text. For example, the bigrams (n = 2) of "The cow jumped over the moon" are: "the cow", "cow jumped", "jumped over", "over the", "the moon".
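A small pure-Python sketch of n-gram extraction; the `ngrams` helper is hypothetical, written here just for illustration:

```python
def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cow jumped over the moon".split()
print(ngrams(tokens, 2))
# [('the', 'cow'), ('cow', 'jumped'), ('jumped', 'over'),
#  ('over', 'the'), ('the', 'moon')]
```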
Pre-Processing Definitions: Part-Of-Speech (POS) Tagging
- This involves labeling each word in a text as belonging to a particular part of speech, such as a noun, verb, or adjective.
- Part-of-speech tagging is also known as "grammatical tagging" or "word-category disambiguation".
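A minimal sketch using NLTK's off-the-shelf tagger (again an assumed tool choice); the tag names follow the Penn Treebank convention:

```python
import nltk

# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
tokens = nltk.word_tokenize("The cow jumped over the moon.")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('cow', 'NN'), ('jumped', 'VBD'), ('over', 'IN'),
#  ('the', 'DT'), ('moon', 'NN'), ('.', '.')]
```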
Pre-Processing Definitions: Stop Words
- This involves removing words that occur frequently but carry very little analytic value.
- Common examples of stop words include "a", "an", "and", "but", "by", "if", "it", "that", and "the".
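A minimal sketch of stop-word removal using NLTK's built-in English stop-word list (an assumption; any curated list would do):

```python
from nltk.corpus import stopwords

# One-time setup: nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

tokens = ["the", "cow", "jumped", "over", "the", "moon"]
kept = [t for t in tokens if t not in stop_words]
print(kept)  # ['cow', 'jumped', 'moon']  ("over" is also on NLTK's list)
```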
Pre-Processing Definitions: Stemming
- The process of normalizing related word tokens into a single form.
- This typically includes the identification and removal of prefixes, suffixes, and inappropriate pluralizations.
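A minimal sketch using the Porter stemmer from NLTK (an assumed choice; the flashcards do not prescribe an algorithm):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["jumped", "jumping", "jumps", "ponies"]:
    print(word, "->", stemmer.stem(word))
# jumped -> jump, jumping -> jump, jumps -> jump, ponies -> poni
# Note the last result: stems need not be dictionary words.
```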
Pre-Processing Definitions: Lemmatization
- A more advanced form of stemming that attempts to group words based on their core concept, or lemma.
- It uses both the context surrounding the word and additional grammatical information, such as part of speech, to determine the lemma.
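A minimal sketch using NLTK's WordNet lemmatizer (assumed); note how the part-of-speech hint changes the result, which a plain stemmer cannot do:

```python
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("ran", pos="v"))     # run  (treated as a verb)
print(lemmatizer.lemmatize("better", pos="a"))  # good (treated as an adjective)
```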
Pre-Processing Definitions: Spelling Normalization
This involves resolving spelling mistakes or eliminating spelling variations. Approaches include using:
1. A dictionary-based approach.
2. Fuzzy matching algorithms (see the sketch below).
3. Word clustering and concept expansion techniques.
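A minimal sketch of the fuzzy-matching approach (item 2) using Python's standard-library `difflib`; the mini-dictionary of correct spellings is purely illustrative:

```python
from difflib import get_close_matches

# Illustrative dictionary of known-good spellings (an assumption)
vocabulary = ["analytics", "tokenize", "normalize", "sentence"]

for misspelled in ["analitics", "tokenise", "normalze"]:
    match = get_close_matches(misspelled, vocabulary, n=1, cutoff=0.7)
    print(misspelled, "->", match[0] if match else "(no match)")
# analitics -> analytics, tokenise -> tokenize, normalze -> normalize
```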
Pre-Processing Definitions: Sentence Boundary Detection
- The process of breaking an entire document into its individual grammatical sentences.
- For English text, this is almost, but not quite, as easy as finding every occurrence of punctuation marks such as ".", "?", or "!"; abbreviations like "Dr." are the main complication.
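A minimal sketch using NLTK's punkt sentence tokenizer (assumed), which handles common abbreviations better than naive punctuation splitting:

```python
import nltk

# One-time setup: nltk.download("punkt")
text = "Dr. Smith arrived. Was the experiment ready? It was!"
print(nltk.sent_tokenize(text))
# ['Dr. Smith arrived.', 'Was the experiment ready?', 'It was!']
# A naive split on "." would wrongly break after "Dr."
```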
Pre-Processing Definitions: Case Normalization
- This involves converting the entire document to either completely lower-case or completely upper-case characters.
- While mixed-case text may help humans differentiate between common nouns and proper nouns, it is not always useful for algorithms.
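Case normalization is a one-liner in most languages; in Python:

```python
text = "The Cow jumped over The Moon."
print(text.lower())  # "the cow jumped over the moon."
print(text.upper())  # "THE COW JUMPED OVER THE MOON."
```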
Define Vector-space Model
- In the vector-space model, the rows represent documents and the columns represent tokens; each row is therefore a vector describing one document.
- The elements of the model record the occurrence of tokens within the text.
- A commonly used vector-space model in text analytics is the document-term matrix.
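A minimal sketch of building a document-term matrix with scikit-learn (an assumed tool choice); the two documents are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cow jumped over the moon",   # document 1
    "the moon is full",               # document 2
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # rows = documents, columns = tokens

print(vectorizer.get_feature_names_out())
# ['cow' 'full' 'is' 'jumped' 'moon' 'over' 'the']
print(dtm.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 0 1]]
```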
Define the ‘bag-of-words assumption’
The vector-space model makes an implicit assumption that the order of words (or tokens) in a document does not matter. This is known as the bag-of-words assumption.
The Vector-space Model can take on 3 forms; these are:
1. Binary representation
2. Frequency count
3. Float-valued weighted vector (e.g., tf-idf weights)
Ex. Binary Representation
Each element is 1 if the token appears anywhere in the document and 0 otherwise, regardless of how many times it occurs.
Ex. Frequency Count
Each element is the number of times the token occurs in the document; for example, "the" scores 2 in "the cow jumped over the moon" rather than 1.
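A sketch of all three forms on the same two illustrative documents, again assuming scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cow jumped over the moon", "the moon is full"]

# 1. Binary representation: 1 if the token occurs in the document, else 0
print(CountVectorizer(binary=True).fit_transform(docs).toarray())
# [[1 0 0 1 1 1 1]
#  [0 1 1 0 1 0 1]]

# 2. Frequency count: how many times each token occurs
print(CountVectorizer().fit_transform(docs).toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 0 1]]

# 3. Float-valued weighted vector: e.g. tf-idf weights
print(TfidfVectorizer().fit_transform(docs).toarray().round(2))
```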