Chapter 5 - Predictive Analytics II: Text, Web, and Social Media Flashcards

1
Q

What is Text Mining?

A

The semiautomated process of extracting patterns from large amounts of unstructured data sources.

2
Q

What are the Seven (7) Application Areas of Text Mining?

A
  1. Information Extraction
  2. Topic Tracking
  3. Summarization
  4. Categorization
  5. Clustering
  6. Concept Linking
  7. Question Answering
3
Q

What are the Fourteen (1-5) Text Mining Terms we need to know?

A
  1. Unstructured Data - Data that does not have a predetermined format and is stored as textual documents.
  2. Corpus - A large and structured set of texts prepared for the purpose of conducting knowledge discovery.
  3. Terms - A single word or multiword phrase extracted directly from the corpus.
  4. Concepts - Features generated from a collection of documents
  5. Stemming - Reducing inflected words to their base or root form
4
Q

What are the Fourteen (6-10) Text Mining Terms we need to know?

A
  6. Stop Words - Words that are filtered out prior to or after processing of natural language data.
  7. Synonyms and Polysemes - Synonyms are different words with the same or similar meaning. Polysemes, also called homonyms, are words spelled exactly the same but with different meanings.
  8. Tokenizing - Assignment of meaning to blocks of text (also known as tokens)
  9. Term Dictionary - Collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus
  10. Word Frequency - Number of times a word is found in a specific document
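
Several of these terms can be seen working together in a small sketch. The Python below is an illustration only, not a method from the chapter: the stop-word list and the suffix-stripping stemmer are deliberately tiny, hypothetical stand-ins for real resources, and the tokenizer only splits text into word tokens without categorizing them.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def tokenize(text):
    """Split text into lowercase word tokens (no categorization is attempted)."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def word_frequency(text):
    """Tokenize, drop stop words, stem, and count the remaining terms."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(crude_stem(t) for t in tokens)

freq = word_frequency("The miners are mining text and the miner mined patterns")
print(freq)
```

Note that the toy stemmer maps "mining" and "mined" to "min" but "miners" to "miner", which shows why production stemmers use more careful rules.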
5
Q

What are the Fourteen (11-14) Text Mining Terms we need to know?

A
  11. Part-of-Speech Tagging - Marking up the words in a text as corresponding to a particular part of speech based on a word’s definition and the context in which it is used.
  12. Morphology - Studies the internal structure of words
  13. Term-By-Document Matrix (Occurrence Matrix)
  14. Singular Value Decomposition (Latent Semantic Indexing)
6
Q

What does NLP Stand For and How is it Defined?

A

Natural Language Processing studies the problem of “understanding” the natural human language, with the view of converting depictions of human language into more formal representations that are easier for computer programs to manipulate.

7
Q

What are some of the Challenges Related to NLP? (6)

A

Part-Of-Speech Tagging
Text Segmentation
Word Sense Disambiguation
Syntactic Ambiguity
Imperfect or Irregular Input
Speech Acts

8
Q

What is Deception Detection as it Relates to Text Mining?

A

An application of text mining in which prediction models are built to differentiate deceptive statements from truthful ones.

9
Q

What is Part-Of-Speech Tagging?

A

Tokenized terms (words) are matched and interpreted against the text based on each term’s definition and the context in which it is used.

10
Q

What are the Three (3) Steps/Tasks for Text Mining?

A
  1. Establish the Corpus - Collect all documents related to the context being studied and transform them in a manner that they are all in the same representational form for computer processing.
  2. Create the Term-Document Matrix - Rows represent documents and columns represent terms. Relationships between the terms and documents are characterized by indices.
  3. Extract the Knowledge - Main extraction methods are Classification, Clustering, Association, and Trend Analysis.
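
Step 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's procedure: the two-document corpus is made up, tokens are split on whitespace only, and the indices are raw term counts (other indexing schemes are possible). Rows are documents and columns are terms, as described above.

```python
from collections import Counter

def build_tdm(documents):
    """Return (terms, matrix) where matrix[i][j] counts terms[j] in documents[i]."""
    tokenized = [doc.lower().split() for doc in documents]
    # Columns: every distinct term in the corpus, in alphabetical order.
    terms = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[t] for t in terms])
    return terms, matrix

# Hypothetical two-document corpus.
corpus = ["text mining finds patterns", "web mining uses web links"]
terms, tdm = build_tdm(corpus)

print(terms)
for row in tdm:
    print(row)
```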
11
Q

What is a TDM?

A

A Term-Document Matrix that indexes the relationships between terms and documents.

12
Q

What is SVD?

A

Singular Value Decomposition reduces the overall dimensionality of the input matrix to a lower-dimensional space where each consecutive dimension represents the largest degree of variability between words and documents.
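
To make "the largest degree of variability" concrete, the sketch below estimates only the first (largest) singular value and its vectors using power iteration in plain Python. This is a swapped-in illustrative technique, not a full SVD and not the chapter's method; in practice a numerical library would be used. The function names and the sample matrix are hypothetical, and the code assumes a non-zero input matrix.

```python
import math

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def leading_singular_triplet(A, iterations=100):
    """Approximate (u, sigma, v) for the largest singular value of A.

    Repeatedly applies A^T A to a start vector and renormalizes, so the
    vector converges toward the dominant right singular vector v.
    """
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))   # length of A v
    u = [x / sigma for x in Av]                 # matching left singular vector
    return u, sigma, v

# Hypothetical 2-document x 3-term count matrix.
A = [[2.0, 1.0, 0.0],
     [0.0, 1.0, 2.0]]
u, sigma, v = leading_singular_triplet(A)
print(sigma)
```

Keeping only the leading dimensions found this way is what reduces the matrix to a lower-dimensional space.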

13
Q

What is Sentiment Analysis?

A

Sentiment analysis is trying to answer the question “What do people feel about a certain topic?” by digging into opinions using a variety of automated tools.

14
Q

What are the Seven (7) Discrete Sentiment Analysis Applications Stated by the Author?

A
  1. Voice of the Customer (VOC)
  2. Voice of the Market (VOM)
  3. Voice of the Employee (VOE)
  4. Brand Management
  5. Financial Markets
  6. Politics
  7. Government Intelligence
15
Q

What is the Sentiment Analysis Process?

A
  1. Sentiment Detection
  2. N-P Polarity Classification
  3. Target Identification
  4. Collection and Aggregation
16
Q

What are the Three (3) Different Elements of Sentiment Analysis?

A

Polarity Identification
Identifying Semantic Orientation of Sentences and Phrases
Identifying Semantic Orientation of Documents

17
Q

What is Polarity Identification?

A

The process of classifying a sentiment under one of two opposing polarities, or locating its position along the continuum between the polarities.

18
Q

What are the Two (2) Methods of Polarity Identification?

A
  1. Using a lexicon as a reference library
  2. Using a collection of training documents as the source of knowledge about the polarity of terms within a specific domain.
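
The first method, a lexicon used as a reference library, might look like the sketch below. The lexicon entries and their scores are invented for illustration; real sentiment lexicons are far larger, and this toy version ignores negation and context.

```python
# Hypothetical mini-lexicon: positive scores for positive terms, negative for negative.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

def polarity_score(text):
    """Sum the lexicon scores of all words; unknown words count as 0."""
    tokens = text.lower().split()
    return sum(LEXICON.get(t.strip(".,!?"), 0) for t in tokens)

def polarity_label(text):
    """Map the summed score onto one of the two polarities, or neutral."""
    score = polarity_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity_label("I love this great phone!"))
print(polarity_label("Terrible battery, bad screen."))
```

The raw score places a text along the continuum between the polarities, while the label collapses it to one side.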
19
Q

What is Web Mining?

A

The process of discovering intrinsic relationships from Web data, which are expressed in the form of textual, linkage, or usage information.

20
Q

What are Web Crawlers?

A

Also known as spiders, Web crawlers are programs used to read through the content of a website automatically.

21
Q

What is an Authoritative Page?

A

A Web page identified as particularly popular based on the links pointing to it from other pages. Such pages serve as a relevance index that improves the search results and rankings of relevant pages.

22
Q

What is HITS?

A

Hyperlink-Induced Topic Search. It is a link-analysis algorithm that rates Web pages using the hyperlink information contained within them.
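
A minimal sketch of the usual hub and authority iteration is shown below, assuming the link graph is given as a dictionary from each page to the pages it links to. The three-page graph, the function name, and the fixed iteration count are illustrative choices, not details from the chapter.

```python
import math

def hits(links, iterations=50):
    """Return (hubs, auths) score dictionaries for a directed link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auths = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auths[t] += hubs[src]
        norm = math.sqrt(sum(a * a for a in auths.values())) or 1.0
        auths = {p: a / norm for p, a in auths.items()}
        # Hub score: sum of the authority scores of pages linked to.
        hubs = {p: sum(auths[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(h * h for h in hubs.values())) or 1.0
        hubs = {p: h / norm for p, h in hubs.items()}
    return hubs, auths

# Hypothetical graph: pages A and B both link to page C.
graph = {"A": ["C"], "B": ["C"], "C": []}
hubs, auths = hits(graph)
print(max(auths, key=auths.get))  # the page with the highest authority score
```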

23
Q

What is Web Structure Mining and Why is it Important?

A

Web structure mining is the process of extracting useful information from the links embedded in Web documents. It is used to identify authoritative pages and hubs, which are the cornerstones of the page-rank algorithms relied upon by Google and other search engines.

24
Q

What is SEO?

A

Search Engine Optimization is the intentional activity of affecting the visibility of a website in a search engine’s natural search results.

25
Q

What is Clickstream Analysis?

A

The analysis of the information collected by Web servers, which helps us better understand user behavior. It is used to discern interesting patterns from clickstreams.
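
One simple pattern that can be pulled from clickstreams is how often visitors move from one page to the next. The sketch below assumes each session has already been reduced to an ordered list of page names; the session data and the function name are hypothetical.

```python
from collections import Counter

def transition_counts(sessions):
    """Count page-to-page transitions across all sessions."""
    counts = Counter()
    for pages in sessions:
        # Pair each page with the page visited immediately after it.
        counts.update(zip(pages, pages[1:]))
    return counts

# Hypothetical sessions, each an ordered list of visited pages.
sessions = [
    ["home", "search", "product"],
    ["home", "search", "cart"],
    ["home", "product"],
]
counts = transition_counts(sessions)
print(counts.most_common(1))
```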

26
Q

What is Social Analytics?

A

Monitoring, analyzing, measuring, and interpreting digital interactions and relationships of people, topics, ideas, and content.

27
Q

What are the Three (3) Social Network Categories?

A

Connections
Distributions
Segmentation

28
Q

What are the Subcategories for Connections? (5)

A

Homophily - Actors form ties with similar vs. dissimilar others
Multiplexity - Number of content forms contained in a tie
Mutuality/Reciprocity - How much two actors reciprocate interaction or friendship
Network Closure - Measure of the completeness of relational triads
Propinquity - Tendency to have more ties with geographically close others

29
Q

What are the Subcategories for Distributions? (6)

A

Bridge - An individual whose weak ties fill a structural hole, providing the only link between two individuals or clusters
Centrality - Metrics that aim to quantify the importance of a particular node within a network
Density - Proportion of direct ties in a network relative to the total number possible
Distance - Minimum number of ties needed to connect two actors
Structural Holes - Absence of ties between two parts of a network
Tie Strength - Linear combination of time, emotional intensity, intimacy, and reciprocity
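
Density and distance are easy to compute on a small example. The sketch below assumes an undirected network stored as a dictionary of neighbor sets; the four-actor chain is made up for illustration.

```python
from collections import deque

def density(adj):
    """Direct ties divided by the number of ties possible among n actors."""
    n = len(adj)
    ties = sum(len(nbrs) for nbrs in adj.values()) / 2  # each tie is stored twice
    possible = n * (n - 1) / 2
    return ties / possible if possible else 0.0

def distance(adj, start, goal):
    """Minimum number of ties connecting two actors (breadth-first search)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, d + 1))
    return None  # the two actors are not connected

# Hypothetical chain of four actors: a - b - c - d.
network = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(density(network))
print(distance(network, "a", "d"))
```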

30
Q

What are the Subcategories for Segmentation? (3)

A

Cliques and Social Circles - Groups are cliques if every individual is directly tied to every other individual, and social circles if direct contact is less stringent
Clustering Coefficient - Likelihood that two associates of a node are themselves associates. A higher clustering coefficient indicates greater cliquishness
Cohesion - Degree to which actors are connected directly to each other by cohesive bonds
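
The clustering coefficient can be illustrated for a single node as the share of its neighbor pairs that are themselves tied. The sketch below uses that common local definition on an undirected network stored as neighbor sets; the network and the function name are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of pairs of the node's neighbors that are tied to each other."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbors means no pairs to compare
    linked = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return linked / (k * (k - 1) / 2)

# Hypothetical network: a, b, c form a triangle; d is tied only to a.
network = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(local_clustering(network, "a"))
print(local_clustering(network, "b"))
```

In this example b's two neighbors are tied to each other, so b sits in a clique-like group, while only one of a's three neighbor pairs is tied.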