Chapter 7: Text and Web Mining Flashcards

1
Q

Association

A

A popular and well researched technique for discovering interesting relationships among variables in large databases.

2
Q

Authoritative pages

A

Web pages identified as particularly popular based on the links they receive from other Web pages. The collective endorsement of a given page by different developers on the Web may indicate the importance of the page (Miller, 2005).

3
Q

Classification

A

A task to classify a given data instance into a predetermined set of categories (or classes).

4
Q

Clickstream Analysis

A

Analysis of the information collected by Web servers.

5
Q

Clustering

A

An unsupervised process whereby objects are classified into “natural” groups called clusters.

6
Q

Corpus

A

A large and structured set of texts prepared for the purpose of conducting knowledge discovery.

7
Q

Customer Experience Management (CEM)

A

Application designed to provide a more qualitative view of online visitor behavior, report on overall user experience, and report direct feedback given by visitors and customers.

8
Q

Deception Detection

A

Applying text mining to a large set of real-world criminal statements to develop prediction models that differentiate deceptive statements from truthful ones.

9
Q

Hubs

A

One or more Web pages that provide a collection of links to authoritative pages.

10
Q

Hyperlink-induced topic search (HITS)

A

Originally developed by Kleinberg (1999), HITS is a link-analysis algorithm that rates Web pages using the hyperlink information contained within them.
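
A minimal sketch of the hub and authority idea behind HITS, assuming the link graph is given as a plain dictionary. The function name `hits`, the toy graph, and the fixed iteration count are illustrative assumptions, not details from the card.

```python
from math import sqrt

def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score sets so they stay bounded across iterations.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical toy graph: two directory pages link to the same two sites.
toy_links = {
    "dir1": ["siteA", "siteB"],
    "dir2": ["siteA", "siteB"],
    "siteA": [],
    "siteB": ["siteA"],
}
hub, auth = hits(toy_links)
```

In this toy graph the directory pages end up with the highest hub scores and `siteA` with the highest authority score, since it receives the most links from good hubs.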

11
Q

Inverse Document Frequency

A

A common and very useful transformation of raw term counts that reflects both the specificity of words (how few documents they appear in) and the overall frequency of their occurrences (Manning and Schutze, 2009).
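
As a rough illustration, one common form of this transformation multiplies a log-scaled term count by log(N / df), where N is the number of documents and df is the number of documents containing the term. The exact formula, the function names, and the toy documents below are assumptions for the sketch; the card itself does not give a formula.

```python
from math import log

def inverse_document_frequency(docs):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: log(n_docs / count) for term, count in df.items()}

def weight(term, doc, idf):
    """Log-scaled term frequency times IDF; 0 if the term is absent."""
    wf = doc.count(term)
    return (1 + log(wf)) * idf[term] if wf >= 1 else 0.0

# Hypothetical tokenized documents.
docs = [
    ["web", "mining", "text"],
    ["text", "mining", "text"],
    ["web", "crawler", "text"],
]
idf = inverse_document_frequency(docs)
```

A term that appears in every document ("text" here) gets an IDF of zero, so it contributes nothing no matter how often it occurs, while a rare term ("crawler") gets the largest weight.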

12
Q

Natural Language Processing

A

A study of the problem of “understanding” the natural human language, with the view of converting depictions of human language into more formal representations that are easier for computer programs to manipulate.

13
Q

Part-of-speech tagging

A

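The process of marking up each word in a text (corpus) as corresponding to a particular part of speech (noun, verb, adjective, and so on), based on both its definition and its context. It is sometimes loosely called shallow parsing, although that term more precisely refers to grouping tagged words into phrases (chunking).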

14
Q

Polyseme

A

Polysemes, also called homonyms, are syntactically identical words (spelled exactly the same) with different meanings.

15
Q

Sentiment Analysis

A

A technique used to detect favorable and unfavorable opinions toward specific products and services using many textual data sources.

16
Q

Sequence Discovery

A

Finding statistically relevant patterns between data examples where the values are delivered in a sequence.

17
Q

Singular Value Decomposition

A

Reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower dimensional space, where each consecutive dimension represents the largest degree of variability possible (Manning and Schutze, 1999).
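
To make "the largest degree of variability" concrete, the sketch below approximates only the first singular dimension of a small documents-by-terms matrix using power iteration. This is a simplified stand-in chosen to stay dependency-free, not a full SVD; in practice a library routine would normally be used. The function name and the toy matrix are assumptions.

```python
from math import sqrt

def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its vectors by power iteration."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols  # starting guess for the term-space direction
    for _ in range(iterations):
        # u = A v
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        # v = A^T u, then rescale to unit length
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

# Hypothetical matrix: rows are documents, columns are terms.
tdm = [
    [2, 2, 0],
    [2, 2, 0],
    [0, 0, 1],
]
sigma, u, v = leading_singular_triplet(tdm)
# Coordinates of each document along the first reduced dimension.
doc_coords = [sigma * x for x in u]
```

Here the first dimension lines up with the two terms that co-occur in the first two documents, so those documents receive the same coordinate and the third document sits near zero.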

18
Q

Stemming

A

The process of reducing inflected words to their stem form.
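
A deliberately naive sketch of the idea, assuming simple suffix stripping is enough for the example words. The suffix list and the function name `naive_stem` are made up for illustration; real stemmers (for example the Porter stemmer) apply far more careful rules.

```python
# Hypothetical suffix list, checked in order; longer suffixes come first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["crawling", "crawled", "crawls", "crawl"]}
```

All four inflected forms map to the same stem, which is what lets a term-document matrix count them as one term.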

19
Q

Stop Words

A

Words that are filtered out prior to or after processing of natural language data.
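
A small sketch of stop-word filtering, assuming the text has already been split into tokens. The stop list here is a tiny hypothetical one; real lists are longer and often domain specific.

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "is", "and", "to", "in"}

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The extraction of useful information".split()
filtered = remove_stop_words(tokens)
```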

20
Q

Term-Document Matrix (TDM)

A

A matrix created in the second task of the text mining process from the digitized and organized documents. The rows represent the documents and the columns represent the terms. The relationships between the terms and the documents are characterized by indices (for example, how often a term occurs in a document).
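
A minimal sketch of building such a matrix, assuming the documents are already tokenized and using raw occurrence counts as the indices. The function name and toy documents are illustrative assumptions.

```python
def term_document_matrix(docs):
    """Return (terms, matrix): one row per document, one column per term."""
    terms = sorted({t for doc in docs for t in doc})
    matrix = [[doc.count(t) for t in terms] for doc in docs]
    return terms, matrix

# Hypothetical tokenized documents.
docs = [
    ["web", "mining", "web"],
    ["text", "mining"],
]
terms, tdm = term_document_matrix(docs)
```

Each cell `tdm[i][j]` holds how many times `terms[j]` occurs in document `i`; other indices (binary presence, log counts, inverse document frequencies) can be substituted for the raw counts.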

21
Q

Text Mining

A

A semi-automated process of extracting patterns from large amounts of unstructured data.

22
Q

Tokenizing

A

A token is a categorized block of text. The block of text corresponding to the token is categorized according to the function it performs. Assigning this kind of meaning to blocks of text is known as tokenizing.

23
Q

Trend Analysis

A

A method of analysis that allows the comparison of a collection of documents to predict future trends.

24
Q

Unstructured Data

A

Data that does not have a predetermined format and is stored in the form of textual documents.

25
Q

Voice of Customer

A

See Customer Experience Management (CEM)

26
Q

Web Analytics

A

The underlying technology that provides the ability to log, parse, and report on the clickstream behavior of site visitors.

27
Q

Web Content Mining

A

The extraction of useful information from Web pages.

28
Q

Web Crawler

A

A program (also called a spider) used to read through the content of a Web site automatically.

29
Q

Web Mining

A

The process of discovering intrinsic relationships from Web data, which are expressed in the form of textual, linkage, or usage information.

30
Q

Web Structure Mining

A

The process of extracting useful information from the links embedded in Web documents. It is used to identify authoritative pages and hubs.

31
Q

Web Usage Mining

A

The extraction of useful information from data generated through Web page visits and transactions.