Chapter 4 Flashcards

1
Q

Apriori algorithm

A

The most commonly used algorithm to discover association rules by recursively identifying frequent itemsets
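
As a rough illustration of the definition above, here is a minimal sketch of level-wise frequent-itemset discovery. It is written as a loop rather than recursion, and the function name `apriori`, the `min_support` parameter, and the sample baskets are assumptions made for this example, not part of the original card.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent

# Hypothetical market baskets, for illustration only.
baskets = [
    {"milk", "bread"},
    {"milk", "diapers"},
    {"milk", "bread", "diapers"},
    {"bread"},
]
frequent = apriori(baskets, min_support=0.5)
```

Association rules would then be generated from the frequent itemsets in a separate step, which this sketch leaves out.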

2
Q

area under the ROC curve

A

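A graphical assessment technique for binary classification models in which the true positive rate is plotted on the Y-axis and the false positive rate on the X-axis. The area under the resulting curve summarizes performance in a single number: 1.0 is a perfect classifier and 0.5 is no better than chance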
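
One way to make this concrete is to build the curve from scored examples and integrate it with the trapezoidal rule. This is a minimal sketch; the names `roc_points` and `auc` and the sample labels and scores are assumptions for illustration, and it assumes both classes are present.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a tied-score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))
```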

3
Q

association

A

A category of data mining algorithm that establishes relationships about items that occur together in a given record.

4
Q

bootstrapping

A

A sampling technique where a fixed number of instances from the original data is sampled (with replacement) for training and the rest of the data set is used for testing
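
A minimal sketch of such a split, assuming the training sample defaults to the size of the original data set. The function name `bootstrap_split` and its parameters are hypothetical.

```python
import random

def bootstrap_split(data, n_samples=None, seed=0):
    """Draw a training sample with replacement; unsampled rows form the test set."""
    rng = random.Random(seed)
    n = len(data) if n_samples is None else n_samples
    picked = [rng.randrange(len(data)) for _ in range(n)]
    chosen = set(picked)
    train = [data[i] for i in picked]  # may contain duplicates
    test = [data[i] for i in range(len(data)) if i not in chosen]
    return train, test

# Hypothetical data set of 20 instances.
data = list(range(20))
train, test = bootstrap_split(data)
```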

5
Q

categorical data

A

Data that represent the labels of multiple classes used to divide a variable into specific groups

6
Q

classification

A

Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior.

7
Q

clustering

A

Partitioning a database into segments whose members share similar qualities. It is an unsupervised method used to find natural groupings in the data

8
Q

confidence

A

In association rules, the conditional probability of finding the RHS of the rule present in a list of transactions where the LHS of the rule already exists

9
Q

CRISP-DM

A

Cross-Industry Standard Process for Data Mining
1) business understanding
2) data understanding
3) data preparation (preprocessing)
4) model building
5) testing and evaluation
6) deployment

10
Q

data mining

A

A process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases

11
Q

decision tree

A

A graphical presentation of a sequence of interrelated decisions to be made under assumed risk

12
Q

distance measure

A

A method used to calculate the closeness between pairs of items in most cluster analysis methods (Euclidean, Manhattan)
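
The two measures named above can be sketched as follows for points given as equal-length numeric sequences. The function names are chosen for this example.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7.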

13
Q

ensemble

A

These are combinations of the outcomes produced by two or more analytics models into a compound output.
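
Majority voting is one simple way to combine classifier outputs. The sketch below assumes each model's predictions arrive as a list of class labels in the same instance order; `majority_vote` is a hypothetical helper name, and other combination schemes (averaging, weighting) exist.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per model; returns one label per instance."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Hypothetical outputs of three models on three instances.
combined = majority_vote([
    ["a", "b", "a"],
    ["a", "a", "b"],
    ["b", "b", "a"],
])
```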

14
Q

entropy

A

A metric that measures the extent of uncertainty or randomness in a data set
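
For class labels, this is usually computed as Shannon entropy, the negative sum of p * log2(p) over the class proportions p. A minimal sketch, with `entropy` as an illustrative function name:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```

An evenly split two-class set has entropy 1 bit, and a set with a single class has entropy 0.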

15
Q

Gini index

A

A metric that is used in economics to measure the diversity of the population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute/variable
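
In decision-tree use, the index is commonly computed as one minus the sum of squared class proportions, and a candidate split is scored by the size-weighted average over its branches. The sketch below assumes that formulation; `gini` and `gini_split` are illustrative names.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(groups):
    """Size-weighted Gini impurity of the branches produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)
```

A split that separates the classes completely scores 0, so lower values indicate purer branches.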

16
Q

information gain

A

The splitting mechanism used in ID3 (a popular decision-tree algorithm)
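
Information gain is the drop in entropy obtained by splitting on an attribute: the entropy of the parent set minus the size-weighted entropy of the resulting subsets. The sketch below assumes records are dictionaries of categorical attributes; the names and the tiny data set are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical records: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
]
labels = ["no", "no", "yes", "yes"]
```

An ID3-style learner would branch on the attribute with the highest gain, here "outlook".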

17
Q

interval data

A

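Variables that can be measured on interval scales, where differences between values are meaningful but there is no true zero (e.g. temperature in degrees Celsius)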

18
Q

k-fold cross validation

A

A popular accuracy assessment technique for prediction models where the complete data set is randomly split into k mutually exclusive subsets of approximately equal size. The classification model is trained and tested k times. Each time it is trained on all but one fold and then tested on the remaining single fold. The cross-validation estimate of the overall accuracy of a model is calculated by simply averaging the k individual accuracy measures
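
The procedure above can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for "train on these indices, test on those, return an accuracy"; it is not tied to any particular library.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, k, evaluate, seed=0):
    """Average evaluate(train_idx, test_idx) over the k train/test rounds."""
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k
```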

19
Q

KNIME

A

An open-source, free-of-charge, platform-agnostic analytics software tool

20
Q

knowledge discovery in databases (KDD)

A

A machine-learning process that performs rule induction or a related procedure to establish knowledge from large databases

21
Q

lift

A

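A measure used to answer “Are all association rules interesting and useful?” It is the ratio of a rule's confidence to the support of its RHS; a value above 1 means the LHS and RHS occur together more often than would be expected if they were independent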
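
Lift builds on support and confidence, so a small sketch can compute all three from a list of transactions. The function names and the sample baskets are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimated P(RHS | LHS): support of LHS and RHS together over support of LHS."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Confidence of the rule divided by the support of its RHS."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

# Hypothetical market baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread"},
    {"milk"},
    {"bread"},
    {"eggs"},
]
```

For the rule milk -> bread in these baskets, support is 2/5, confidence is 2/3, and lift is 10/9, slightly above 1.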

22
Q

link analysis

A

The linkage among many objects of interest is discovered automatically, such as the link between Web pages and referential relationships among groups of academic publication authors

23
Q

Microsoft Enterprise Consortium

A

A consortium that serves as the worldwide source for access to Microsoft’s SQL Server software suite for academic purposes (teaching and research)

24
Q

Microsoft SQL Server

A

A database platform with data mining capabilities in which the data and the models are stored in the same relational database environment

25
Q

nominal data

A

Data whose values can be reduced to labels with no inherent order, e.g. Marital Status (1: Single, 2: Married, 3: Divorced, etc.)

26
Q

numeric data

A

A type of data that represent the numeric values of specific variables

27
Q

ordinal data

A

Similar to nominal data, except that the values have a natural hierarchy or order, e.g. Credit Score (1: low, 2: medium, 3: high)

28
Q

prediction

A

The act of telling about the future

29
Q

RapidMiner

A

A popular, open-source, free-of-charge data mining software suite that employs a graphically enhanced user interface, a rather large number of algorithms, and a variety of data visualization features

30
Q

regression

A

A data mining method for real-world prediction problems where the predicted values (i.e., the output variable or dependent variable) are numeric (e.g., predicting the temperature for tomorrow as 68°F)

31
Q

SEMMA

A

A data mining process developed by the SAS Institute: Sample, Explore, Modify, Model, Assess

32
Q

sensitivity analysis

A

A study of the effect of a change in one or more input variables on a proposed solution

33
Q

sequence mining

A

A pattern discovery method where relationships among the things are examined in terms of their order of occurrence to identify associations over time

34
Q

simple split

A

Data are partitioned into two mutually exclusive subsets called a training set and a test set (or holdout set). It is common to designate two-thirds of the data as the training set and the remaining one-third as the test set
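
A minimal sketch of a two-thirds/one-third holdout split, assuming the records should be shuffled before they are cut. `simple_split` and its parameters are hypothetical names.

```python
import random

def simple_split(data, train_fraction=2 / 3, seed=0):
    """Shuffle the data, then cut it into a training set and a test (holdout) set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data set of 9 instances.
train, test = simple_split(range(9))
```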

35
Q

support

A

The measure of how often products and/or services appear together in the same transaction; that is, the proportion of transactions in the data set that contain all of the products and/or services mentioned in a specific rule

36
Q

Weka

A

A popular, free-of-charge, open-source suite of machine-learning software written in Java, developed at the University of Waikato

37
Q

data mining task / method taxonomy

A

1) Prediction: Classification, Regression, Time-Series
2) Association: Market Basket, Link Analysis, Sequence Analysis
3) Segmentation: Clustering, Outlier Analysis