Week 5 Flashcards

1
Q

What does it mean that the miner is often the end user?

A

Data mining is carried out by knowledgeable people within the different business units, who also use the results themselves.

1
Q

What is the output of regression in data mining (a type of prediction)?

A

A number

2
Q

What does link analysis try to achieve?

A

To find patterns in how items relate (link) to each other.

3
Q

What does the robustness of data mining refer to?

A

Its ability to overcome noisy data to make somewhat accurate predictions.

4
Q

What does the accuracy of data mining refer to?

A

Its ability to predict the outcome of a previously unknown data set accurately.

5
Q

In estimating the accuracy of data mining (or other) classification models, the true positive rate is

A

the number of correctly classified positives (true positives) divided by the sum of true positives and false negatives (actual positives that were incorrectly classified as negative).
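
As a rough illustration of this ratio, the sketch below computes TP / (TP + FN) from two lists of labels. The function name `true_positive_rate` and the sample data are made up for this example and are not from the course material.

```python
def true_positive_rate(actual, predicted):
    """Return TP / (TP + FN) for two equal-length lists of booleans."""
    # True positives: actual positives that were predicted positive.
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    # False negatives: actual positives that were predicted negative.
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    if tp + fn == 0:
        raise ValueError("no actual positives in the data")
    return tp / (tp + fn)


# Hypothetical labels: four actual positives, three of them found.
actual = [True, True, True, True, False, False]
predicted = [True, True, True, False, True, False]
rate = true_positive_rate(actual, predicted)  # 3 / (3 + 1) = 0.75
```

Note that the false positive in the fifth position does not affect the true positive rate; only actual positives enter the ratio.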

6
Q

When would the iteration of steps 3 and 4 stop in K-means clustering?

(2 answers)

A
  1. When the recalculation of center points does not lead to a reassignment of data points anymore.
  2. When a pre-defined number of iterations has been carried out.
7
Q

Would the algorithm always show the same results if we keep K the same and all other parameters the same?

A

No, because the initial selection of cluster center points is random.

8
Q

What is the output variable in Association?

A

There is no output variable.

9
Q

What is the Euclidean distance?

A

The ordinary straight-line distance between two points, as one would measure with a ruler.

10
Q

What is the Manhattan distance?

A

The rectilinear distance (or taxicab distance) between two points.

It's the total travel distance if one can only move along grid lines.
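
To make the contrast between the two distance measures concrete, here is a minimal sketch. The function names `euclidean` and `manhattan` and the sample points are chosen for this example only.

```python
import math


def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Grid distance: sum of the absolute differences per coordinate."""
    return sum(abs(a - b) for a, b in zip(p, q))


p, q = (0, 0), (3, 4)
d_euclidean = euclidean(p, q)  # sqrt(3^2 + 4^2) = 5.0
d_manhattan = manhattan(p, q)  # 3 + 4 = 7
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, since moving along grid lines cannot beat the straight line.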

11
Q

What can cluster analysis be used for?

A

Cluster analysis can be used for automatic identification of natural groupings of things.

12
Q

What kind of learning does cluster analysis use?

A

Unsupervised learning

13
Q

What kind of data set does supervised learning use?

A

A labeled data set

14
Q

How does clustering work?

A

It works by learning the clusters of things from past data, then assigning new instances to those clusters.

15
Q

What are some of the use cases of clustering?

A
  • Identify natural groupings of customers;
  • Identify rules for assigning new cases to classes for targeting/diagnostic purposes;
  • Provide characterization, definition, and labeling of populations;
  • Decrease the size and complexity of problems for other data mining methods;
  • Identify outliers in a specific domain.
16
Q

How is the optimal number of clusters determined?

A

There is no optimal way to calculate the number of clusters, so heuristics are often used.

17
Q

What does K-means clustering mean?

A

K-means clustering means that there is a pre-determined number of clusters.

18
Q

What are the steps of K-means clustering?

A
  1. Determine the value of k;
  2. Randomly generate k points as initial cluster centers;
  3. Assign each point to the nearest cluster center;
  4. Re-compute the new cluster centers;
  5. Repeat steps 3 and 4 until some convergence criterion is met.
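
The five steps above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the function name `k_means`, the `max_iterations` and `seed` parameters, and the choice to draw the initial centers from the data points themselves are all assumptions made for this example.

```python
import random


def k_means(points, k, max_iterations=100, seed=None):
    """Cluster a list of coordinate tuples into k groups (k is step 1)."""
    rng = random.Random(seed)
    # Step 2 (assumption): use k randomly chosen data points as initial centers.
    centers = rng.sample(points, k)
    assignment = None
    # Stop condition 2: a pre-defined number of iterations.
    for _ in range(max_iterations):
        # Step 3: assign each point to the nearest center (squared Euclidean).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Stop condition 1: no point changed cluster, so the result is stable.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers, assignment


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels = k_means(points, k=2, seed=0)
```

Because the initial centers are random, the cluster numbering (and, on less clearly separated data, the clusters themselves) can differ between runs unless a seed is fixed.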
19
Q

When is the convergence criterion met?

A

The convergence criterion is usually met when the assignment of points to clusters becomes stable.

20
Q

What is association rule mining?

A

It finds interesting relationships (affinities) between variables (items or events).

21
Q

What kind of machine learning does association utilize?

A

Unsupervised.

22
Q

What is the input of association?

A

Point-of-sale transaction data

23
Q

What is the output of association?

A

Most frequent affinities among items.

24
Q

What are the use cases for association for business and medicine?

A

Business: cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales/promotion configuration.

Medicine: relationships between symptoms and illnesses, diagnosis and patient characteristics and treatments, and genes and their functions.

25
Q

What is support in association?

A

How often X and Y go together

Baskets with both X and Y divided by total baskets

26
Q

What is the confidence in association?

A

How often Y appears in the baskets that contain X

Supp(X → Y) divided by Supp(X)
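
Support and confidence can be worked through on a handful of baskets. The sketch below is only an illustration; the function names `support` and `confidence` and the basket contents are invented for this example.

```python
def support(baskets, items):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= set(basket)) / len(baskets)


def confidence(baskets, x, y):
    """Supp(X -> Y) / Supp(X): how often Y appears in baskets containing X."""
    return support(baskets, set(x) | set(y)) / support(baskets, x)


# Hypothetical point-of-sale data: four baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
supp_rule = support(baskets, {"bread", "milk"})       # 2 of 4 baskets = 0.5
conf_rule = confidence(baskets, {"bread"}, {"milk"})  # 0.5 / 0.75, about 0.67
```

Here bread appears in three of four baskets, and two of those three also contain milk, which is where the confidence of roughly 0.67 comes from.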

27
Q

Why is text mining needed to stay competitive?

A

Because 85-90 percent of corporate data is unstructured.

28
Q

What is text mining?

A

A semi-automatic process of extracting knowledge from unstructured data sources.

29
Q

What is the difference between data mining and text mining?

A

Text mining uses unstructured data, whereas data mining works on structured data.

30
Q

What does text mining do?

A

Text mining imposes structure on the data and then mines the structured data.

31
Q

What is structured data?

A

Composed of clearly defined data types whose pattern makes them easily searchable.

32
Q

What is unstructured data?

A

Everything that is not structured data. It may have some internal structure, but it is not organized via pre-defined models or schemas. Usually not easily searchable.

33
Q

What is semi-structured data?

A

Data that has some internal structure but also unstructured elements (e.g., emails, XML files).

34
Q

What is the corpus?

A

A large and structured set of texts prepared for the purpose of conducting knowledge discovery.

35
Q

What is tokenizing?

A

A token is a categorized block of text in a sentence. The assignment of meaning to blocks of text is known as tokenizing.

36
Q

What are terms?

A

A term is a single word or multi word phrase extracted directly from the corpus of a specific domain by means of natural language processing (NLP) methods.

37
Q

What are concepts?

A

Concepts are features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology. Compared to terms, concepts are the result of higher-level abstraction.

38
Q

What is stemming?

A

The process of reducing inflected words to their stem (or base or root) form.

39
Q

What are stop words (and include words)?

A

Words that are filtered out prior to or after processing of natural language data. Usually a list of very common words, such as articles and auxiliary verbs, that are removed.

40
Q

What are synonyms?

A

Syntactically different words with identical or similar meanings.

41
Q

What are polysemes?

A

Syntactically identical words with different meanings.

42
Q

What is the term dictionary?

A

A collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus.

43
Q

What is word frequency?

A

The number of times a word is found in a specific document.

44
Q

What is natural language processing (NLP)?

A

A subfield of AI and computational linguistics that studies the understanding of the natural human language.

45
Q

What is bag-of-words?

A

Representing a text as an unordered collection of its words, disregarding grammar and word order.
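
A small sketch of what discarding word order means in practice. The function name `bag_of_words` and the whitespace-based splitting are simplifying assumptions for this example; real tools tokenize more carefully.

```python
from collections import Counter


def bag_of_words(text):
    """Count each lowercase word; word order and grammar are thrown away."""
    return Counter(text.lower().split())


# Two sentences with opposite meanings but the same words.
bag_a = bag_of_words("the dog bites the man")
bag_b = bag_of_words("the man bites the dog")
same_bag = bag_a == bag_b  # True: both contain the same word counts
```

The example also shows the main limitation of the approach: two sentences with different meanings can end up with an identical representation.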

46
Q

What are the applications of NLP?

A
  • Language translation applications;
  • Checking grammatical accuracy;
  • Personal assistant applications.
47
Q

What are the steps in the text mining process?

A
  1. Establish the corpus
  2. Create the term-document matrix
  3. Extract patterns/knowledge
48
Q

What is the term-by-document Matrix (TDM)?

A

This matrix describes how often terms are mentioned in each document. Building it involves decisions such as which words to include (stop words, synonyms/homonyms, and stemming) and which kind of indices to use.

The dimensionality of the TDM can be reduced manually by an expert, or the matrix can be transformed using singular value decomposition (SVD).
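
A term-by-document matrix with raw counts can be sketched as below. This is a simplified illustration: the function name `term_document_matrix`, the tiny `STOP_WORDS` list, and the sample documents are assumptions made for this example, and stemming, synonym handling, and SVD are left out.

```python
from collections import Counter

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "of", "and"}


def term_document_matrix(documents):
    """Return (terms, matrix) with one row per term and one column per document."""
    # Count the non-stop words in each document.
    counts = [
        Counter(w for w in doc.lower().split() if w not in STOP_WORDS)
        for doc in documents
    ]
    # The rows are all distinct terms across the corpus, sorted alphabetically.
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


corpus = [
    "the project risk is high",
    "risk and cost of the project",
    "cost cost cost",
]
terms, tdm = term_document_matrix(corpus)
# terms -> ['cost', 'high', 'project', 'risk']
# tdm   -> [[0, 1, 3], [1, 0, 0], [1, 1, 0], [1, 1, 0]]
```

Each row shows how often one term occurs in each of the three documents; a real corpus would produce a much larger and sparser matrix, which is why dimensionality reduction matters.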