Chapter 7: Text and Web Mining Flashcards
Association
A popular and well-researched technique for discovering interesting relationships among variables in large databases.
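As a rough illustration of what an "interesting relationship" can look like, the sketch below computes support and confidence, two standard measures for scoring association rules. The transactions and the function names are hypothetical and chosen only for this example.

```python
# Hypothetical toy data: each set is one market basket (transaction).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Score the candidate rule {bread} -> {milk}.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here the rule {bread} -> {milk} is supported by half of the baskets, and two of the three baskets containing bread also contain milk.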
Authoritative pages
Pages whose importance is indicated by the collective endorsement (incoming links) they receive from different developers on the Web (Miller, 2005).
Classification
A task to classify a given data instance into a predetermined set of categories (or classes).
Clickstream Analysis
Analysis of the information collected by Web servers (the record of the pages each visitor requests) to understand how users behave on a site.
Clustering
An unsupervised process whereby objects are classified into “natural” groups called clusters.
Corpus
A large and structured set of texts prepared for the purpose of conducting knowledge discovery.
Customer Experience Management (CEM)
Application designed to provide a more qualitative view of online visitor behavior, report on overall user experience, and report direct feedback given by visitors and customers.
Deception Detection
Applying text mining to a large set of real-world criminal statements to develop prediction models that differentiate deceptive statements from truthful ones.
Hubs
One or more Web pages that provide a collection of links to authoritative pages.
Hyperlink-induced topic search (HITS)
Originally developed by Kleinberg (1999), HITS is a link-analysis algorithm that rates Web pages using the hyperlink information contained within them.
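The mutual reinforcement between hubs and authoritative pages can be sketched as a simple iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The link graph, the function name `hits`, and the fixed iteration count below are assumptions made for illustration; this is a minimal sketch, not Kleinberg's full procedure (which first builds a query-specific subgraph).

```python
def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a graph given as page -> linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of every page that links to p.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum the authority scores of every page that p links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph: "portal" and "blog" link out; "a" and "b" only receive links.
links = {"portal": ["a", "b"], "blog": ["a"], "a": [], "b": []}
hub, auth = hits(links)
```

In this toy graph, "a" ends up with the highest authority score because both linking pages endorse it, and "portal" is the stronger hub because it points to both authorities.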
Inverse Document Frequency
A common and very useful transformation that reflects both the specificity of words and the overall frequency of their occurrences (Manning and Schutze, 2009).
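One common form of this transformation is idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The sketch below assumes that unsmoothed form, whitespace tokenization, and a made-up three-document corpus; other variants add smoothing or multiply by term frequency.

```python
import math

# Hypothetical mini-corpus, one string per document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]

def inverse_document_frequency(term, docs):
    """Unsmoothed IDF: log(N / number of documents containing the term)."""
    n_containing = sum(term in doc.split() for doc in docs)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(docs) / n_containing)

idf_the = inverse_document_frequency("the", docs)        # appears in every document
idf_cat = inverse_document_frequency("cat", docs)        # appears in two documents
idf_market = inverse_document_frequency("market", docs)  # appears in one document
```

A word that occurs in every document ("the") gets a weight of zero, while rarer, more specific words ("market") get larger weights.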
Natural Language Processing
A study of the problem of “understanding” the natural human language, with the view of converting depictions of human language into more formal representations that are easier for computer programs to manipulate.
Part-of-speech tagging
Also known as grammatical tagging; the process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
Polyseme
A word that is syntactically identical to another (spelled exactly the same) but has a different meaning; also called a homonym.
Sentiment Analysis
A technique used to detect favorable and unfavorable opinions toward specific products and services using many textual data sources.