Chapter 4 Flashcards
Apriori algorithm
The most commonly used algorithm to discover association rules by recursively identifying frequent itemsets
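A minimal sketch of the level-wise idea, assuming transactions are given as collections of item names and that min_support is a fraction between 0 and 1. The function name and the toy baskets are illustrative assumptions, not part of the chapter.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Toy data (hypothetical).
baskets = [{"milk", "bread"}, {"milk", "diapers"},
           {"milk", "bread", "diapers"}, {"bread"}]
frequent = apriori(baskets, min_support=0.5)
```

The prune step is what keeps the search manageable: an itemset cannot be frequent unless all of its subsets are.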
area under the ROC curve
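A graphical assessment technique for binary classification models where the true positive rate is plotted on the Y-axis and the false positive rate is plotted on the X-axis. The area under the resulting curve summarizes performance: 1.0 indicates a perfect classifier and 0.5 indicates performance no better than chance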
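A rough sketch of how the curve and its area could be computed by hand, assuming labels are 0/1, higher scores mean "more likely positive", and no two instances share a score. The function names and sample values are illustrative assumptions.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs for a threshold sweep."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one instance at a time, from highest score to lowest.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC points using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical labels and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))
```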
association
A category of data mining algorithm that establishes relationships among items that occur together in a given record.
bootstrapping
A sampling technique where a fixed number of instances from the original data are sampled (with replacement) for training and the rest of the data set is used for testing
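A small sketch of one bootstrap split, assuming the data set is a list and the default sample size equals the size of the data set. The function name and the seed parameter are illustrative assumptions.

```python
import random

def bootstrap_split(data, n_samples=None, seed=0):
    """Draw a training sample with replacement; unpicked instances form the test set."""
    rng = random.Random(seed)
    n = n_samples or len(data)
    # Sampling with replacement: the same index may be picked more than once.
    picked = [rng.randrange(len(data)) for _ in range(n)]
    chosen = set(picked)
    train = [data[i] for i in picked]
    test = [data[i] for i in range(len(data)) if i not in chosen]
    return train, test

train, test = bootstrap_split(list(range(10)))
```

Because instances can repeat in the training sample, some instances are usually never picked, and those are the ones left for testing.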
categorical data
Data that represent the labels of multiple classes used to divide a variable into specific groups
classification
Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior.
clustering
An unsupervised technique that partitions a database into segments whose members share similar qualities; used to find natural groupings
confidence
In association rules, the conditional probability of finding the RHS of the rule present in a list of transactions where the LHS of the rule already exists
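As a worked illustration, confidence can be computed by counting: among the transactions that contain the LHS, what fraction also contain the RHS? The function name and the toy baskets below are illustrative assumptions.

```python
def confidence(transactions, lhs, rhs):
    """Estimate P(RHS present | LHS present) from a list of transactions."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [set(t) for t in transactions if lhs <= set(t)]
    if not with_lhs:
        return 0.0  # The rule never applies, so report zero by convention here.
    return sum(rhs <= t for t in with_lhs) / len(with_lhs)

# Toy data (hypothetical).
baskets = [{"milk", "bread"}, {"milk", "diapers"},
           {"milk", "bread", "diapers"}, {"bread"}]
conf_milk_bread = confidence(baskets, {"milk"}, {"bread"})
```

Here three baskets contain milk and two of those also contain bread, so the rule milk -> bread has a confidence of 2/3.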
CRISP-DM
Cross-Industry Standard Process for Data Mining
1) business understanding
2) data understanding
3) data preparation (preprocessing)
4) model building
5) test and evaluate
6) deploy
data mining
A process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases
decision tree
A graphical presentation of a sequence of interrelated decisions to be made under assumed risk
distance measure
A method used to calculate the closeness between pairs of items in most cluster analysis methods (Euclidean, Manhattan)
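A short sketch of the two distance measures named above, assuming each item is a numeric tuple of the same length. The function names are illustrative.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
```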
ensemble
A combination of the outcomes produced by two or more analytics models into a compound output.
entropy
A metric that measures the extent of uncertainty or randomness in a data set
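One common way to quantify this is Shannon entropy over the class labels, sketched below under the assumption that the data set is a list of labels. The function name is illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

mixed = entropy(["yes"] * 5 + ["no"] * 5)   # evenly mixed classes
pure = entropy(["yes"] * 10)                # a single class
```

An evenly mixed two-class set has the maximum entropy of 1 bit, while a set containing a single class has entropy 0.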
Gini index
A metric that is used in economics to measure the diversity of the population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute/variable
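In decision trees this idea is usually computed as Gini impurity, one minus the sum of squared class proportions. The sketch below assumes a node is represented by the list of class labels that reach it; the function name is illustrative.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p^2) over the class proportions at a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

mixed = gini_impurity(["yes"] * 5 + ["no"] * 5)  # evenly mixed node
pure = gini_impurity(["yes"] * 10)               # pure node
```

A pure node scores 0, and an evenly mixed two-class node scores 0.5, so lower values indicate a cleaner split.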
information gain
The splitting mechanism used in ID3 (a popular decision-tree algorithm); it measures the reduction in entropy achieved by splitting the data on a given attribute
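A sketch of the calculation, assuming information gain is the parent's entropy minus the size-weighted entropy of the child subsets produced by a split. The function names and the toy labels are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
clean_split = information_gain(parent, [["yes", "yes"], ["no", "no"]])
useless_split = information_gain(parent, [["yes", "no"], ["yes", "no"]])
```

A split that separates the classes perfectly recovers all of the parent's entropy, while a split that leaves each child as mixed as the parent gains nothing.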
interval data
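Variables that can be measured on interval scales, where differences between values are meaningful but there is no true zero point (e.g., temperature in degrees Celsius)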
k-fold cross validation
A popular accuracy assessment technique for prediction models where the complete data set is randomly split into k mutually exclusive subsets of approximately equal size. The classification model is trained and tested k times. Each time it is trained on all but one fold and then tested on the remaining single fold. The cross-validation estimate of the overall accuracy of a model is calculated by simply averaging the k individual accuracy measures
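The procedure can be sketched as follows, assuming the caller supplies a function that trains on one list and returns accuracy on another. The names k_fold_accuracy, train_and_score, and majority_score are hypothetical, and the majority-class model is only a stand-in for a real classifier.

```python
import random
from collections import Counter

def k_fold_accuracy(data, k, train_and_score, seed=0):
    """Average the accuracy of k train/test rounds over k disjoint folds."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)        # random split
    folds = [indices[i::k] for i in range(k)]   # k folds of near-equal size
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in indices if i not in held]  # all but one fold
        test = [data[i] for i in held_out]                   # the remaining fold
        scores.append(train_and_score(train, test))
    return sum(scores) / k

def majority_score(train, test):
    # Stand-in model: always predict the most common training label.
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)

# Hypothetical data: (feature, label) pairs that all share one label.
data = [(i, "a") for i in range(10)]
estimate = k_fold_accuracy(data, 5, majority_score)
```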
KNIME
An open-source, free-of-charge, platform-agnostic analytics software tool
knowledge discovery in databases (KDD)
A machine-learning process that performs rule induction or a related procedure to establish knowledge from large databases
lift
A measure used to answer “Are all association rules interesting and useful?”; it is the ratio of a rule's confidence to its expected confidence (the support of the RHS)
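Lift is commonly computed as the support of the LHS and RHS together divided by the product of their individual supports. The sketch below assumes transactions are sets of item names; the function names and toy baskets are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def lift(transactions, lhs, rhs):
    """support(LHS and RHS) / (support(LHS) * support(RHS))."""
    joint = support(transactions, set(lhs) | set(rhs))
    return joint / (support(transactions, lhs) * support(transactions, rhs))

# Toy data (hypothetical).
baskets = [{"milk", "bread"}, {"milk", "diapers"},
           {"milk", "bread", "diapers"}, {"bread"}]
lift_milk_bread = lift(baskets, {"milk"}, {"bread"})
lift_diapers_milk = lift(baskets, {"diapers"}, {"milk"})
```

A lift above 1 suggests the items occur together more often than independence would predict, a lift near 1 suggests no real association, and a lift below 1 suggests they tend to appear apart.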
link analysis
The linkage among many objects of interest is discovered automatically, such as the link between Web pages and referential relationships among groups of academic publication authors
Microsoft Enterprise Consortium
Serves as the worldwide source for access to Microsoft’s SQL Server software suite for academic purposes (teaching and research)
Microsoft SQL Server
A database platform whose data mining capabilities keep the data and the models in the same relational database environment