Week 5 Flashcards by Boris Philippo

What does it mean that the miner is often the end user?

Data mining is carried out by knowledgeable people within the different business units, who also use the results themselves.

What is the output of regression in data mining (a prediction method)?

A number (a continuous numeric value).

What does link analysis try to achieve?

To find patterns in how items relate (link) to each other.

What does the robustness of data mining refer to?

Its ability to make reasonably accurate predictions despite noisy data.

What does the accuracy of data mining refer to?

Its ability to predict the outcome of a previously unknown data set accurately.

In estimating the accuracy of data mining (or other) classification models, the true positive rate is

the number of correctly classified positives divided by the sum of correctly classified positives and false negatives (positives incorrectly classified as negative), that is, TP / (TP + FN).
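
As a minimal sketch of this ratio, the function below counts true positives and false negatives from two lists of labels. The function name, the `positive` parameter, and the sample labels are assumptions made for illustration only.

```python
def true_positive_rate(actual, predicted, positive=1):
    """Return TP / (TP + FN) for the given positive class label."""
    # True positives: actual positives that were predicted positive.
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    # False negatives: actual positives that were predicted as something else.
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive cases in 'actual'")
    return tp / (tp + fn)


# Illustrative labels: four actual positives, three of them found by the model.
actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 1]
print(true_positive_rate(actual, predicted))  # 3 / (3 + 1) = 0.75
```

Note that the one negative wrongly predicted as positive (a false positive) does not affect this rate.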

When would the iteration of steps 3 and 4 stop in K-means clustering?

Two answers:

When the recalculation of center points does not lead to a reassignment of data points anymore.
When a pre-defined number of iterations have been carried out.

Would the K-means algorithm always show the same results if we keep K and all other parameters the same?

No, because the initial cluster center points are selected randomly.

What is the output variable in Association?

There is no output variable.

What is the Euclidean distance?

The ordinary straight-line distance between two points, as one would measure with a ruler.

What is the Manhattan distance?

The rectilinear (or taxicab) distance between two points.

It is the total travel distance if one can only move along grid lines.
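
To make the contrast between the two distance measures concrete, here is a small sketch that computes both for the same pair of points. The function names and the sample points are illustrative assumptions.

```python
import math


def euclidean(p, q):
    """Straight-line ("ruler") distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Rectilinear (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

For these points the grid route (3 across, 4 up) is longer than the straight line, which is the typical case: the Manhattan distance is never shorter than the Euclidean distance.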

What can cluster analysis be used for?

Cluster analysis can be used for automatic identification of natural groupings of things.

What kind of learning does cluster analysis use?

Unsupervised learning

What kind of data set does supervised learning use?

A labeled data set

How does clustering work?

It works by learning the clusters of things from past data, then assigning new instances to those clusters.

What are some of the use cases of clustering?

Identify natural groupings of customers;
Identify rules for assigning new cases to classes for targeting/diagnostic purposes;
Provide characterization, definition, labeling of populations;
Decrease the size and complexity of problems for other data mining methods;
Identify outliers in a specific domain.

How is the optimal number of clusters determined?

There is no optimal way to calculate the number of clusters, hence heuristics are often used.

What does K-means clustering mean?

K-means clustering means that there is a pre-determined number of clusters.

What are the steps of K-means clustering?

1. Determine the value of k;
2. Randomly generate k points as initial cluster centers;
3. Assign each point to the nearest cluster center;
4. Re-compute the new cluster centers;
5. Repeat steps 3 and 4 until some convergence criterion is met.
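
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation. Assumptions: the initial centers are drawn from the data points themselves (one common way to pick random starting centers), squared Euclidean distance is used, and the names `k_means`, `max_iter`, and `seed` are made up for this example.

```python
import random


def k_means(points, k, max_iter=100, seed=None):
    """Cluster tuples of numbers into k groups; returns (centers, assignment)."""
    rng = random.Random(seed)
    # Step 2: choose k distinct data points as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):  # stop after a pre-defined number of iterations
        # Step 3: assign each point to the nearest center (squared distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Convergence: recomputing the centers no longer moves any point.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = k_means(points, k=2, seed=0)
print(centers)
print(assignment)
```

Both stopping rules from the earlier card appear here: the loop ends when assignments stop changing or when `max_iter` is reached. Changing `seed` changes the starting centers, which on less clearly separated data can change the final clusters.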

When is the convergence criterion met?

The convergence criterion is usually met when the assignment of points to clusters becomes stable.

What is association rule mining?

It finds interesting relationships (affinities) between variables (items or events).

What kind of machine learning does association utilize?

Unsupervised.

What is the input of association?

Point-of-sale transaction data

What is the output of association?

Most frequent affinities among items.

What are the use cases for association for business and medicine?

Business: cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales/promotion configuration.
Medicine: relationships between symptoms and illnesses, between diagnosis and patient characteristics and treatments, and between genes and their functions.

What is support in association?

How often X and Y go together: the number of baskets containing both X and Y divided by the total number of baskets.

What is the confidence in association?

How often Y goes together with X: Supp(X → Y) divided by Supp(X).
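
Support and confidence can be computed directly from a list of baskets. The sketch below uses four made-up baskets; the item names and the function names are assumptions for illustration.

```python
# Each basket is the set of items in one (hypothetical) transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Share of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(x, y, baskets):
    """Supp(X -> Y) / Supp(X): how often Y appears in baskets that contain X."""
    return support(set(x) | set(y), baskets) / support(x, baskets)


print(support({"bread", "milk"}, baskets))       # 2 of 4 baskets = 0.5
print(confidence({"bread"}, {"milk"}, baskets))  # 0.5 / 0.75, about 0.67
```

Here bread and milk appear together in half of all baskets, and two of the three baskets containing bread also contain milk.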

Why is text mining needed to stay competitive?

Because 85-90 percent of corporate data is unstructured.

What is text mining?

A semi-automatic process of extracting knowledge from unstructured data sources.

What is the difference between data mining and text mining?

Text mining works on unstructured data (text), whereas data mining works on structured data.

What does text mining do?

Text mining imposes structure on the data and then mines the structured data.

What is structured data?

Composed of clearly defined data types whose pattern makes them easily searchable.

What is unstructured data?

Everything that is not structured data. It may have some internal structure but is not organized via pre-defined models or schemas. Usually not easily searchable.

What is semi-structured data?

Data that have some internal structure but also contain unstructured elements (e.g., emails, XML files).

What is the corpus?

A large and structured set of texts prepared for the purpose of conducting knowledge discovery.

What is tokenizing?

A token is a categorized block of text in a sentence. Assigning meaning to blocks of text is known as tokenizing.

What are terms?

A term is a single word or multi word phrase extracted directly from the corpus of a specific domain by means of natural language processing (NLP) methods.

What are concepts?

Concepts are features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology. Compared to terms, concepts are the result of higher-level abstraction.

What is stemming?

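The process of reducing inflected words to their stem (or base or root) form.

What are stop words (and include words)?

Stop words are words that are filtered out prior to or after processing of natural language data. They are usually common words that carry little meaning on their own, such as articles and auxiliary verbs (the, of, is, are). Include words are the opposite: a pre-chosen list of terms to keep.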

What are synonyms?

Syntactically different words with identical or similar meanings.

What are polysemes?

Syntactically identical words with different meanings.

What is the term dictionary?

A collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus.

What is word frequency?

The number of times a word is found in a specific document.

What is natural language processing (NLP)?

A subfield of AI and computational linguistics that studies the understanding of the natural human language.

What is bag-of-words?

A representation of text that disregards grammar and word order and keeps only which words occur and how often.
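
A short sketch shows what ignoring word order means in practice. Splitting on whitespace is a simplifying assumption (real tokenizers also handle punctuation), and the function name and sentences are illustrative.

```python
from collections import Counter


def bag_of_words(text):
    """Count each lowercase word; grammar and word order are thrown away."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
print(a["the"])  # 2
print(a == b)    # True: same words and counts, different meaning
```

The two sentences mean different things but produce identical bags, which is exactly the information this representation gives up.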

What are the applications of NLP?

Language translation applications;
Checking grammatical accuracy;
Personal assistant applications.

What are the steps in the text mining process?

1. Establish the corpus
2. Create the term-document matrix
3. Extract patterns/knowledge

What is the term-by-document matrix (TDM)?

This matrix describes how often each term is mentioned in each document. Building it involves decisions such as which words to include (stop words, synonyms/homonyms, and stemming) and which kind of indices to use. The dimensionality of the TDM can be reduced manually by an expert, or the matrix can be transformed using singular value decomposition (SVD).
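
As a rough sketch, a small TDM with raw counts as the index can be built like this. The two documents and the one-word stop list are made up for illustration, and stemming, synonym handling, and SVD are left out.

```python
from collections import Counter

# Hypothetical two-document corpus and stop-word list.
docs = {
    "doc1": "data mining finds patterns in data",
    "doc2": "text mining finds patterns in text",
}
stop_words = {"in"}

# Word counts per document, with stop words filtered out.
counts = {
    name: Counter(w for w in text.lower().split() if w not in stop_words)
    for name, text in docs.items()
}

# Rows are terms, columns are documents (in the order of docs).
terms = sorted(set().union(*counts.values()))
tdm = {term: [counts[name][term] for name in docs] for term in terms}

for term, row in tdm.items():
    print(f"{term:10s}", row)
```

Each row shows how often one term occurs in each document; terms missing from a document get a count of 0.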

(49 cards)