Chapter 7: Text and Web Mining Flashcards by Joel Sanborn

Association

A popular and well researched technique for discovering interesting relationships among variables in large databases.
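
As a rough illustration, the strength of a relationship between items is often scored with support and confidence. These two measures, the function names, and the toy baskets are assumptions made for this sketch; the card itself does not name them.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

Here the rule "bread implies milk" holds in two of the three baskets that contain bread.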

Authoritative pages

Web pages whose importance is indicated by the collective endorsement (incoming links) they receive from different developers on the Web (Miller, 2005).

Classification

A task to classify a given data instance into a predetermined set of categories (or classes).

Clickstream Analysis

Analysis of the information Web servers collect about visitor activity, such as the sequence of pages a visitor requests.

Clustering

An unsupervised process whereby objects are classified into “natural” groups called clusters.

Corpus

A large and structured set of texts prepared for the purpose of conducting knowledge discovery.

Customer Experience Management (CEM)

An application designed to provide a more qualitative view of online visitor behavior, report on overall user experience, and report direct feedback given by visitors and customers.

Deception Detection

Applying text mining to a large set of real-world criminal statements to develop prediction models that differentiate deceptive statements from truthful ones.

Hubs

One or more Web pages that provide a collection of links to authoritative pages.

Hyperlink-induced topic search (HITS)

Originally developed by Kleinberg (1999), HITS is a link-analysis algorithm that rates Web pages using the hyperlink information contained within them.
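
A minimal sketch of the idea, assuming the usual iterative formulation: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The function name, the iteration count, and the four-page graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link graph, invented for illustration.
web = {"portal": ["news", "docs"], "blog": ["docs"], "news": [], "docs": []}
hub, auth = hits(web)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
```

In this toy graph "docs" receives the most endorsement and "portal" links to the most endorsed pages.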

Inverse Document Frequency

A common and very useful transformation that reflects both the specificity of words and the overall frequency of their occurrences (Manning and Schutze, 2009).
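
The card gives no formula, so the sketch below assumes one common variant, idf(term) = log(N / df), where N is the number of documents and df is the number of documents containing the term. Whitespace tokenization and the sample documents are also assumptions.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: log(N / df)} for every term in the documents."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document, however often it repeats.
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical mini-corpus, invented for illustration.
docs = ["the cat sat", "the dog barked", "the cat purred"]
idf = inverse_document_frequency(docs)
```

A term that appears in every document ("the") scores zero, while rarer and therefore more specific terms score higher.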

Natural Language Processing

A study of the problem of “understanding” the natural human language, with the view of converting depictions of human language into more formal representations that are easier for computer programs to manipulate.

Part-of-speech tagging

The process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. Sometimes loosely equated with shallow parsing.

Polyseme

Also known as homonyms; syntactically identical words (spelled exactly the same) that have different meanings.

Sentiment Analysis

A technique used to detect favorable and unfavorable opinions toward specific products and services using many textual data sources.

Sequence Discovery

Finding statistically relevant patterns between data examples where the values are delivered in a sequence.

Singular Value Decomposition

Reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower dimensional space, where each consecutive dimension represents the largest degree of variability possible (Manning and Schutze, 1999).

Stemming

The process of reducing inflected words to their stem form.
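
A deliberately naive sketch of the idea: strip a few common English suffixes. This is not the Porter algorithm or any published stemmer; the suffix list and the minimum stem length of three characters are assumptions chosen for illustration.

```python
# Illustrative suffix list, checked longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ("walking", "walked", "walks")]
```

All three inflected forms reduce to "walk". Real stemmers add many more rules to handle cases this toy version gets wrong.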

Stop Words

Words that are filtered out prior to or after processing of natural language data.
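
A small sketch of filtering before processing. The stop list here is a short, hypothetical one; real stop lists are longer and usually domain specific.

```python
# Hypothetical stop list, kept short for illustration.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "to", "in"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


kept = remove_stop_words("The extraction of patterns in a corpus")
```

Only the content-bearing words "extraction", "patterns", and "corpus" remain.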

Term-Document Matrix (TDM)

A matrix created in the second task of the text mining process from the digitized and organized documents. The rows represent the documents and the columns represent the terms. The relationship between terms and documents is characterized by indices, such as the number of times a term occurs in a document.
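
A minimal sketch of building such a matrix with raw occurrence counts as the indices. Whitespace tokenization, the function name, and the two sample documents are assumptions.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (terms, matrix): one row per document, one column per term."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Columns are the sorted vocabulary across all documents.
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


# Hypothetical documents, invented for illustration.
docs = ["web mining", "text mining and web mining"]
terms, tdm = term_document_matrix(docs)
```

Each cell holds how many times the column's term occurs in the row's document; other indices, such as weighted frequencies, could be substituted.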

Text Mining

A semi-automated process of extracting patterns from large amounts of unstructured data.

Tokenizing

A token is a categorized block of text, categorized according to the function it performs. Assigning this meaning to blocks of text is known as tokenizing.
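
One way to picture this is a regular expression that splits text into blocks and labels each one. The three categories used here (WORD, NUMBER, PUNCT) are a simplifying assumption; the card does not specify which categories to use.

```python
import re

# Each named group is one hypothetical token category.
TOKEN_PATTERN = re.compile(
    r"(?P<NUMBER>\d+(?:\.\d+)?)|(?P<WORD>[A-Za-z]+)|(?P<PUNCT>[^\w\s])"
)


def tokenize(text):
    """Return a list of (category, block of text) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(text)]


tokens = tokenize("HITS appeared in 1999.")
```

Each pair couples a block of text with the category assigned to it, for example ("NUMBER", "1999").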

Trend Analysis

A method of analysis that allows the comparison of a collection of documents to predict future trends.

Unstructured Data

Data that does not have a predetermined format and is stored in the form of textual documents.

Voice of Customer

See Customer Experience Management (CEM)

Web Analytics

The underlying technology that provides the ability to log, parse, and report on the clickstream behavior of site visitors.

Web Content Mining

A reference to the extraction of useful information from Web pages.

Web Crawler

A program used to read through the content of a Web site automatically.

Web Mining

The process of discovering intrinsic relationships from Web data, which are expressed in the form of textual, linkage, or usage information.

Web Structure Mining

The process of extracting useful information from the links embedded in Web documents. It is used to identify authoritative pages and hubs.

Web Usage Mining

The extraction of useful information from data generated through Web page visits and transactions.

(31 cards)