Chapter 5 - Predictive Analytics II: Text, Web, and Social Media Flashcards
What is Text Mining?
The semiautomated process of extracting patterns from large amounts of unstructured data sources.
What are the Seven (7) Application Areas of Text Mining?
- Information Extraction
- Topic Tracking
- Summarization
- Categorization
- Clustering
- Concept Linking
- Question Answering
What are the Fourteen (1-5) Text Mining Terms we need to know?
- Unstructured Data - Data that does not have a predetermined format and is stored as textual documents.
- Corpus - A large and structured set of texts prepared for the purpose of conducting knowledge discovery.
- Terms - Single word or phrase extracted directly from the corpus
- Concepts - Features generated from a collection of documents
- Stemming - Reducing inflected words to their base or root form
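Example (stemming): a minimal sketch of the idea, assuming a made-up suffix list. The name naive_stem and the SUFFIXES tuple are illustrative assumptions; production stemmers such as Porter's apply far more rules.

```python
# Naive suffix stripping: an illustrative assumption, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical suffix list, checked in order


def naive_stem(word):
    """Strip the first matching suffix, keeping a root of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Inflected forms of the same word collapse to one root.
stems = [naive_stem(w) for w in ["Mining", "mined", "mines", "text"]]
```

The root produced here ("min") is not a dictionary word, which is typical of stemming: the goal is a shared key for related forms, not a readable word.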
What are the Fourteen (6-10) Text Mining Terms we need to know?
- Stop Words - Words that are filtered out prior to or after processing of natural language data.
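- Synonyms and Polysemes - Synonyms are words that are spelled differently but have identical or similar meanings (e.g., movie and film). Polysemes, also called homonyms, are words that are spelled exactly the same but have different meanings (e.g., bow).
- Tokenizing - Categorizing blocks of text in a sentence (known as tokens) according to the function they perform.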
- Term Dictionary - Collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus
- Word Frequency - Number of times a word is found in a specific document.
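Example (stop words and word frequency): a rough sketch, assuming a tiny hand-picked stop-word list and a simple regular expression for splitting text. STOP_WORDS, split_into_words, and word_frequency are hypothetical names. The splitting step only breaks text into words; it does not categorize them by function as the Tokenizing card describes.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}  # hypothetical, tiny list


def split_into_words(text):
    """Lowercase the text and return its runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequency(text):
    """Count each word in one document after filtering out stop words."""
    words = [w for w in split_into_words(text) if w not in STOP_WORDS]
    return Counter(words)


freq = word_frequency("The corpus is a set of texts, and the texts form the corpus.")
```

Here the stop words are filtered before counting; filtering after counting would give the same frequencies for the remaining words.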
What are the Fourteen (11-14) Text Mining Terms we need to know?
- Part-of-Speech Tagging - Marking up the words in a text as corresponding to a particular part of speech based on a word’s definition and the context in which it is used.
- Morphology - Studies the internal structure of words
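- Term-By-Document Matrix (Occurrence Matrix) - Matrix that records how often each term occurs in each document.
- Singular Value Decomposition (Latent Semantic Indexing) - Dimensionality reduction method that transforms the term-by-document matrix into a lower-dimensional space.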
What does NLP Stand For and How is it Defined?
Natural Language Processing studies the problem of “understanding” the natural human language, with the view of converting depictions of human language into more formal representations that are easier for computer programs to manipulate.
What are some of the Challenges Related to NLP? (6)
- Part-Of-Speech Tagging
- Text Segmentation
- Word Sense Disambiguation
- Syntactic Ambiguity
- Imperfect or Irregular Input
- Speech Acts
What is Deception Detection as it Relates to Text Mining?
An application of text mining in which prediction models are built to differentiate deceptive statements from truthful ones.
What is Part-Of-Speech Tagging?
Tokenized terms (words) are each marked with a part of speech, based on the term’s definition and the context in which it is used.
What are the Three (3) Steps/Tasks for Text Mining?
- Establish the Corpus - Collect all documents related to the context being studied and transform them so that they are all in the same representational form for computer processing.
- Create the Term-Document Matrix - Rows represent documents and columns represent terms. Relationships between the terms and documents are characterized by indices.
- Extract the Knowledge - Main extraction methods are Classification, Clustering, Association, and Trend Analysis.
What is a TDM?
A Term-Document Matrix that indexes the relationships between terms and documents.
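Example (TDM): a small sketch of building the matrix with raw counts as the indices, following the layout on the steps card (rows are documents, columns are terms). The function name build_tdm and the two sample documents are assumptions for illustration; words are split on whitespace only.

```python
from collections import Counter


def build_tdm(documents):
    """Return (terms, matrix) where matrix[i][j] counts terms[j] in documents[i]."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))  # one column per distinct term
    matrix = [[c[t] for t in terms] for c in counts]  # missing terms count as 0
    return terms, matrix


docs = ["text mining finds patterns", "web mining mines web data"]
terms, tdm = build_tdm(docs)
```

Raw counts are only one choice of index; other weightings can be substituted without changing the row and column layout.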
What is SVD?
Singular Value Decomposition reduces the overall dimensionality of the input matrix to a lower-dimensional space where each consecutive dimension represents the largest degree of variability between words and documents.
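Example (SVD): a rough sketch of the first dimension only, found by power iteration in plain Python. This is a swapped-in technique for illustration, not a full SVD routine; top_singular_triplet and the toy 2-by-3 matrix are assumptions, and the function assumes a non-zero matrix.

```python
import math


def top_singular_triplet(matrix, iterations=100):
    """Approximate the largest singular value and its vectors by power iteration."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # fixed start vector for repeatability
    for _ in range(iterations):
        # u = A v, then w = A^T u, then renormalize w to get the next v
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


toy_tdm = [[1, 1, 0], [0, 1, 1]]  # 2 documents (rows) by 3 terms (columns)
sigma, u, v = top_singular_triplet(toy_tdm)

# Rank-1 approximation: the part of the matrix explained by the first dimension.
rank1 = [[sigma * ui * vj for vj in v] for ui in u]
```

The first dimension captures the most variability; keeping only the first few dimensions is what reduces the matrix to a lower-dimensional space.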
What is Sentiment Analysis?
Sentiment analysis tries to answer the question “What do people feel about a certain topic?” by digging into opinions using a variety of automated tools.
What are the Seven (7) Discrete Sentiment Analysis Applications Stated by the Author?
- Voice of the Customer (VOC)
- Voice of the Market (VOM)
- Voice of the Employee (VOE)
- Brand Management
- Financial Markets
- Politics
- Government Intelligence
What is the Sentiment Analysis Process?
- Sentiment Detection
- N-P Polarity Classification
- Target Identification
- Collection and Aggregation
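Example (sentiment analysis process): a toy, lexicon-based sketch of the four steps above. The word lists POSITIVE, NEGATIVE, and TARGETS, the function names, and the sample reviews are all made-up assumptions; real systems use much larger lexicons or trained classifiers.

```python
from collections import defaultdict

POSITIVE = {"great", "love", "fast"}  # hypothetical opinion lexicon
NEGATIVE = {"poor", "hate", "slow"}
TARGETS = {"battery", "screen"}       # hypothetical targets of interest


def analyze(review):
    """Return (polarity, targets) for an opinionated review, or None if objective."""
    words = review.lower().replace(".", " ").replace(",", " ").split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos == 0 and neg == 0:
        return None  # Step 1, sentiment detection: no opinion words found
    polarity = 1 if pos > neg else -1 if neg > pos else 0  # Step 2, N-P polarity
    targets = sorted({w for w in words if w in TARGETS})   # Step 3, target identification
    return polarity, targets


def aggregate(reviews):
    """Step 4, collection and aggregation: sum polarity per target."""
    totals = defaultdict(int)
    for review in reviews:
        result = analyze(review)
        if result is None:
            continue
        polarity, targets = result
        for target in targets:
            totals[target] += polarity
    return dict(totals)


reviews = [
    "I love the battery.",
    "The screen is slow, poor screen.",
    "It arrived on Tuesday.",
]
scores = aggregate(reviews)
```

The third review contains no opinion words, so it is dropped at the detection step and does not affect the totals.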