Chapter 7: Text and Web Mining Flashcards
Association
A popular and well-researched technique for discovering interesting relationships among variables in large databases.
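As a rough illustration, association rules are commonly scored with support and confidence. The sketch below assumes a tiny made-up set of shopping transactions; the item names and the helper functions `support` and `confidence` are hypothetical and not taken from the chapter.

```python
# Hypothetical market-basket data; the items are invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Score the candidate rule {bread} -> {milk}.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here the rule covers half of all transactions, and two of the three bread purchases also include milk.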
Authoritative pages
Pages whose importance is indicated by the collective endorsement they receive from different developers on the Web (Miller, 2005).
Classification
A task to classify a given data instance into a predetermined set of categories (or classes).
Clickstream Analysis
Analysis of the information collected by Web servers.
Clustering
An unsupervised process whereby objects are classified into “natural” groups called clusters.
Corpus
A large and structured set of texts prepared for the purpose of conducting knowledge discovery.
Customer Experience Management (CEM)
Application designed to provide a more qualitative view of online visitor behavior, report on overall user experience, and report direct feedback given by visitors and customers.
Deception Detection
Applying text mining to a large set of real-world criminal statements to develop prediction models that differentiate deceptive statements from truthful ones.
Hubs
One or more Web pages that provide a collection of links to authoritative pages.
Hyperlink-induced topic search (HITS)
Originally developed by Kleinberg (1999), HITS is a link-analysis algorithm that rates Web pages using the hyperlink information contained within them.
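A minimal sketch of the hub and authority idea behind HITS, assuming a small hypothetical link graph held in a dictionary. The page names, the fixed iteration count, and the function `hits` are illustrative assumptions, not Kleinberg's original formulation in full.

```python
import math

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "A": ["C", "D"],
    "B": ["C", "D"],
    "C": [],
    "D": ["C"],
}

def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of the authority scores of pages that p links to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

hub_scores, auth_scores = hits(links)
top_authority = max(auth_scores, key=auth_scores.get)
```

In this toy graph page C receives links from every other page, so it ends up as the top authority, while A and B act as the strongest hubs.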
Inverse Document Frequency
A common and very useful transformation that reflects both the specificity of words and the overall frequency of their occurrences (Manning and Schutze, 2009).
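One common way to compute this quantity is the logarithm of the number of documents divided by the number of documents containing the term. The sketch below assumes that unsmoothed variant and a tiny invented corpus; other variants add smoothing terms.

```python
import math

# Hypothetical corpus: each document is already split into terms.
docs = [
    ["web", "mining", "text"],
    ["text", "mining"],
    ["text", "corpus"],
]

def inverse_document_frequency(term, docs):
    """log(N / df), where df is the number of documents containing the term."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df) if df else 0.0

idf_text = inverse_document_frequency("text", docs)      # appears in every document
idf_mining = inverse_document_frequency("mining", docs)  # appears in two documents
idf_corpus = inverse_document_frequency("corpus", docs)  # appears in one document
```

A term found in every document scores zero, and rarer terms score higher, which is the "specificity" the definition refers to.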
Natural Language Processing
A study of the problem of “understanding” the natural human language, with the view of converting depictions of human language into more formal representations that are easier for computer programs to manipulate.
Part-of-speech tagging
The process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. It is often a first step toward shallow parsing.
Polyseme
Also known as homonyms, polysemes are syntactically identical words (spelled the same) with different meanings.
Sentiment Analysis
A technique used to detect favorable and unfavorable opinions toward specific products and services using many textual data sources.
Sequence Discovery
Finding statistically relevant patterns between data examples where the values are delivered in a sequence.
Singular Value Decomposition
A technique that reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower-dimensional space, where each consecutive dimension represents the largest degree of variability possible (Manning and Schutze, 1999).
Stemming
The process of reducing inflected words to their stem form.
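To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length of three characters are assumptions chosen only for illustration.

```python
# Illustrative suffixes, checked in this order; real stemmers use richer rules.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["connected", "connecting", "connects", "connect"]]
```

All four inflected forms collapse to the same stem, so they would be counted as one term in later steps.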
Stop Words
Words that are filtered out prior to or after processing of natural language data.
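A minimal sketch of stop-word filtering, assuming a very small hand-picked stop list. Real stop lists are longer and depend on the language and the application.

```python
# Small illustrative stop list; not an authoritative one.
STOP_WORDS = {"a", "an", "the", "of", "is", "to", "and"}

tokens = "the process of reducing a word to its stem".split()

# Keep only the tokens that are not in the stop list.
filtered = [t for t in tokens if t.lower() not in STOP_WORDS]
```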
Term-Document Matrix (TDM)
A matrix created in the second task of the text mining process from the digitized and organized documents. The rows represent the documents and the columns represent the terms. The relationships between the terms and documents are characterized by indices.
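The layout can be sketched with raw term counts as the indices. The two documents below are invented, and splitting on whitespace is a simplifying assumption; rows are documents and columns are terms, as in the definition.

```python
from collections import Counter

# Hypothetical documents, already lowercased.
documents = {
    "doc1": "text mining finds patterns in text",
    "doc2": "web mining uses web data",
}

# Columns: the sorted vocabulary across all documents.
terms = sorted({word for text in documents.values() for word in text.split()})

# Rows: one list of term counts per document, aligned with `terms`.
tdm = {}
for name, text in documents.items():
    counts = Counter(text.split())
    tdm[name] = [counts[term] for term in terms]
```

Raw counts are only one choice of index; they are often replaced by transformed values such as log or inverse document frequencies.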
Text Mining
A semi-automated process of extracting patterns from large amounts of unstructured data.
Tokenizing
The process of splitting text into tokens. A token is a categorized block of text, and the block of text corresponding to the token is categorized according to the function it performs.
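A small sketch of a tokenizer that labels each block of text with a category. The three categories (WORD, NUMBER, PUNCT) and the regular expression are assumptions made for illustration; practical tokenizers use richer rules.

```python
import re

# Each named group is one hypothetical token category.
TOKEN_PATTERN = re.compile(
    r"(?P<NUMBER>\d+(?:\.\d+)?)|(?P<WORD>[A-Za-z]+)|(?P<PUNCT>[^\w\s])"
)

def tokenize(text):
    """Return (category, text) pairs in the order they appear."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(text)]

tokens = tokenize("HITS was published in 1999.")
```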
Trend Analysis
A method of analysis that allows the comparison of a collection of documents to predict future trends.
Unstructured Data
Data that does not have a predetermined format and is stored in the form of textual documents.
Voice of Customer
See Customer Experience Management (CEM)
Web Analytics
The underlying technology that provides the ability to log, parse, and report on the clickstream behavior of site visitors.
Web Content Mining
A reference to the extraction of useful information from Web pages.
Web Crawler
A program used to read through the content of a Web site automatically.
Web Mining
The process of discovering intrinsic relationships from Web data, which are expressed in the form of textual, linkage, or usage information.
Web Structure Mining
The process of extracting useful information from the links embedded in Web documents. It is used to identify authoritative pages and hubs.
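The first step, collecting the links embedded in a document, can be sketched with Python's standard `html.parser` module. The HTML snippet and the class name `LinkExtractor` are hypothetical; the collected links would then feed a link-analysis method.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Invented page content for illustration.
html = (
    '<p>See <a href="/hubs.html">hubs</a> and '
    '<a href="https://example.com/auth">authorities</a>.</p>'
)

extractor = LinkExtractor()
extractor.feed(html)
links = extractor.links
```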
Web Usage Mining
The extraction of useful information from data generated through Web page visits and transactions.
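As a rough example, page views and distinct visitors can be counted from Web server log lines. The log entries below are invented and assumed to follow the Common Log Format; the helper `requested_path` is hypothetical.

```python
from collections import Counter

# Hypothetical access-log lines; addresses and paths are made up.
log_lines = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /home HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2023:13:56:01 +0000] "GET /products HTTP/1.1" 200 5120',
    '198.51.100.7 - - [10/Oct/2023:13:57:12 +0000] "GET /home HTTP/1.1" 200 2326',
]

def requested_path(line):
    """Pull the path out of the quoted request, e.g. 'GET /home HTTP/1.1'."""
    request = line.split('"')[1]
    return request.split()[1]

# Views per page, and the set of distinct client addresses.
page_views = Counter(requested_path(line) for line in log_lines)
visitors = {line.split()[0] for line in log_lines}
```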