Chapter 7 Flashcards
authoritative pages
Web pages that are identified as particularly popular based on links by other Web pages and directories.
clickstream analysis
The analysis of data that occur in the Web environment.
clustering
Partitioning a database into segments in which the members of a segment share similar qualities.
corpus
In linguistics, a large and structured set of texts (now usually stored and processed electronically) prepared for the purpose of conducting knowledge discovery.
deception detection
A way of identifying deception (intentionally propagating beliefs that are not true) in voice, text, and/or body language of humans.
hubs
One or more Web pages that provide a collection of links to authoritative pages.
hyperlink-induced topic search
(HITS) The most popular publicly known and referenced algorithm in Web mining used to discover hubs and authorities.
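As a rough illustration of how hub and authority scores reinforce each other, the sketch below runs the usual iterative update on a tiny link graph. The page names, the graph itself, and the iteration count are assumptions made up for this example, not part of the flashcards.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) score dicts for a directed link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: "portal" links to both content pages, "blog" to one.
example_links = {
    "portal": ["page_a", "page_b"],
    "blog": ["page_a"],
    "page_a": [],
    "page_b": [],
}
hub_scores, auth_scores = hits(example_links)
```

In this toy graph the page with the most outgoing links to well-cited pages ends up with the highest hub score, and the page with the most incoming links from good hubs ends up with the highest authority score.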
polarity identification
Given an opinionated piece of text, the goal is to classify the opinion as falling under one of two opposing sentiment polarities or to locate its position on the continuum between these two polarities.
At the word/term level, there are two approaches: 1) use a lexicon as a reference library, or 2) use a collection of training documents.
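The lexicon approach can be sketched as a lookup of each word's polarity score followed by an average. The mini-lexicon and its scores below are invented for illustration; a real project would draw them from a resource such as SentiWordNet.

```python
# Hypothetical mini-lexicon: word -> polarity score in [-1, 1].
LEXICON = {
    "good": 0.7,
    "great": 0.9,
    "bad": -0.7,
    "terrible": -0.9,
    "slow": -0.4,
}

def polarity(text, lexicon=LEXICON):
    """Average the lexicon scores of the words found in text.

    Returns 0.0 (neutral) when no word in the text is in the lexicon.
    """
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)
```

A positive result places the text toward the favorable end of the continuum, a negative result toward the unfavorable end. This simple average ignores negation and context, which is one reason the training-document approach is also used.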
polyseme
Also called homonyms, polysemes are syntactically identical words (i.e., spelled exactly the same) with different meanings (e.g., bow can mean “to bend forward,” “the front of the ship,” “the weapon that shoots arrows,” or “a kind of tied ribbon”).
search engine
A program that finds and lists Web sites or pages (designated by URLs) that match some user-selected criteria.
sentiment analysis
The technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources (customer feedback in the form of Web postings).
SentiWordNet
An extension of WordNet used for sentiment identification.
singular value decomposition
A method, closely related to principal components analysis, that reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower-dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents).
social media analytics
The systematic and scientific way to consume the vast amount of content created by Web-based social media outlets, tools, and techniques for the betterment of an organization’s competitiveness.
social network analysis
(SNA) The mapping and measuring of relationships and information flows among people, groups, organizations, computers, and other information- or knowledge-processing entities. The nodes in the network are the people and groups, whereas the links show relationships or flows between the nodes.
spider
An application used to read through the content of a Web site automatically (Web Crawler).
stemming
A process of reducing words to their respective root forms in order to better represent them in a text mining project.
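To make the idea concrete, here is a deliberately naive suffix-stripping sketch. The suffix list and the minimum stem length are assumptions chosen for this example; established stemmers such as the Porter stemmer apply far more rules.

```python
# Assumed suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this sketch, "jumping", "jumped", and "jumps" all reduce to "jump", so a text mining project would count them as one term.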
stop words
Words that are filtered out prior to or after processing of natural language data (i.e., text).
term-document matrix
A frequency matrix created from digitized and organized documents (the corpus) where the columns represent the terms while rows represent the individual documents.
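A minimal sketch of building such a matrix, with rows as documents and columns as terms, might look like the following. The two sample documents, the tiny stop-word list, and the simple letters-only word splitting are all assumptions for illustration.

```python
import re
from collections import Counter

# Assumed, very small stop-word list.
STOP_WORDS = {"the", "a", "is", "of", "and", "on"}

def term_document_matrix(docs, stop_words=STOP_WORDS):
    """Return (terms, matrix) where matrix[i][j] counts terms[j] in docs[i]."""
    # Split each document into lowercase words and drop stop words.
    tokenized = [
        [t for t in re.findall(r"[a-z]+", d.lower()) if t not in stop_words]
        for d in docs
    ]
    # Columns: every distinct remaining term, in alphabetical order.
    terms = sorted({t for tokens in tokenized for t in tokens})
    # Rows: one frequency vector per document.
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

docs = ["The cat sat on the mat.", "The dog chased the cat."]
terms, matrix = term_document_matrix(docs)
# terms  -> ['cat', 'chased', 'dog', 'mat', 'sat']
# matrix -> [[1, 0, 0, 1, 1], [1, 1, 1, 0, 0]]
```

Each row is the numeric index vector for one document, which is the form that data mining algorithms (or a dimensionality reduction step) would then consume.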
text mining
The application of data mining to nonstructured or less structured text files. It entails the generation of meaningful numeric indices from the unstructured text and then processing those indices using various data mining algorithms.
tokenizing
Categorizing a block of text (token) according to the function it performs.
trend analysis
The collecting of information and attempting to spot a pattern, or trend, in the information.
voice of the customer (VOC)
Applications that focus on “who and how” questions by gathering and reporting direct feedback from site visitors, by benchmarking against other sites and offline channels, and by supporting predictive modeling of future visitor behavior.
Web analytics
The application of business analytics activities to Web-based processes, including e-commerce.