Chapter 5: Data Mining for Business Intelligence Flashcards
Apriori algorithm
The most frequently used algorithm for finding association rules. It identifies itemsets that occur in at least a minimum number of transactions, then extends the frequent itemsets one item at a time: one-item subsets grow to two-item subsets, then three-item subsets, and so on, until no further successful extensions are found.
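The level-wise search above can be sketched in a few lines of Python. This is a minimal illustration, not a full Apriori implementation: the classic candidate-pruning step (discarding candidates with an infrequent subset) is replaced by direct counting, and the function and data names are made up for the example.

```python
from itertools import combinations  # noqa: F401  (handy if you extend the sketch)

def apriori(transactions, min_support):
    """Return frequent itemsets (frozensets) occurring in >= min_support transactions."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Start from frequent one-item subsets...
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: s for c, s in count(items).items() if s >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # ...and extend the frequent itemsets one item at a time.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = {c: s for c, s in count(candidates).items() if s >= min_support}
        result.update(frequent)
        k += 1
    return result

baskets = [frozenset(t) for t in
           [{"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread", "eggs"}, {"bread"}]]
freq = apriori(baskets, min_support=2)
```

Here `{milk, bread}` and `{milk, eggs}` survive with support 2, while `{bread, eggs}` (support 1) is dropped, so no three-item extension can be frequent and the search stops.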
Area under the ROC curve
The ROC curve is a graphical plot in which the true positive rate is plotted on the Y-axis and the false positive rate is plotted on the X-axis. The area under the ROC curve provides an accuracy measure for a classifier: a value of 1 indicates a perfect classifier, while a value of 0.5 indicates performance no better than random chance.
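One way to compute this area, shown below as a hedged sketch, uses the equivalent rank interpretation of AUC: the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one (ties count as half). The function name and toy labels are illustrative.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that ranks every positive above every negative scores 1.0; one that assigns everyone the same score scores exactly 0.5.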
Associations (Association rule learning in data mining)
It is a research technique used to identify relationships among variables in a database. In the retail industry, association rule mining is called market basket analysis. Link analysis and sequence mining are derivatives of association rule mining.
Bootstrapping
is when a fixed number of instances from the original data is sampled (with replacement) for training, and the rest of the dataset is used for testing. This process can be repeated as needed.
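The sample-with-replacement split can be sketched as below; the function name is invented for the example. Instances never drawn into the training sample (the "out-of-bag" instances) form the test set.

```python
import random

def bootstrap_split(data, seed=None):
    """One bootstrap resample: train = n draws with replacement,
    test = the instances never drawn (out-of-bag)."""
    rng = random.Random(seed)
    n = len(data)
    train_idx = [rng.randrange(n) for _ in range(n)]
    oob = set(range(n)) - set(train_idx)
    train = [data[i] for i in train_idx]
    test = [data[i] for i in sorted(oob)]
    return train, test

data = list(range(10))
train, test = bootstrap_split(data, seed=0)
```

Repeating the call with different seeds gives the repeated splits the definition mentions, whose test-set error estimates can then be averaged.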
Categorical data
are labels for classes that are used to divide a variable into specific groups. Categorical data is also called discrete data; it represents a finite number of values with no continuum between them.
Classification
or supervised induction. It is a very common data mining task that analyzes historical data and generates a model that can predict future behavior. The model consists of generalizations over the records of a training dataset, which help distinguish predefined classes. The expectation is that the model can be used to predict the classes of other unclassified records, and even to predict actual future events.
Clustering
is the process of partitioning a collection of objects, events, etc., presented in a dataset into natural groups (sub-classes) whose members share similar characteristics. Commonly used clustering techniques are k-means (in statistics) and self-organizing maps (in machine learning).
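The k-means technique mentioned above can be sketched in plain Python: repeatedly assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster. This is a minimal illustration (no convergence check, naive random initialization), with invented names and toy data.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch over tuples of coordinates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init: k random data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign p to the centroid with the smallest squared distance
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        new_centroids = []
        for i, cl in enumerate(clusters):
            if cl:  # recompute centroid as the cluster mean
                new_centroids.append(tuple(sum(xs) / len(xs) for xs in zip(*cl)))
            else:   # keep the old centroid if a cluster emptied out
                new_centroids.append(centroids[i])
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, k=2)
```

On these two well-separated groups the algorithm settles on centroids at (0, 0.5) and (10, 10.5) regardless of which points it starts from.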
Confidence
is one of the metrics that association rule mining uses to answer the question: "Are all association rules interesting and useful?" Confidence measures how often the consequent appears in transactions that contain the antecedent.
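Concretely, confidence of a rule A → B is support(A ∪ B) divided by support(A). A small sketch, with an invented function name and toy baskets:

```python
def confidence(transactions, antecedent, consequent):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_a

baskets = [frozenset(t) for t in
           [{"milk", "bread"}, {"milk"}, {"milk", "bread"}, {"bread"}]]
c = confidence(baskets, {"milk"}, {"bread"})
```

Here milk appears in three baskets and bread accompanies it in two of them, so the rule milk → bread has confidence 2/3.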
CRISP-DM
(Cross Industry Standard Process for Data Mining) is a general process for doing data mining projects.
Data mining
is the discovery of patterns and significant knowledge in large quantities of data.
Decision tree
builds classification or regression models in the form of a tree structure, breaking the data down into smaller and smaller subsets. Input variables in a decision tree are called attributes. A tree has branches and nodes: a branch represents the outcome of a test used to classify a pattern, and each leaf node holds a class label. The topmost node in the tree is the root node.
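A one-level tree (a "decision stump") shows the branch/leaf structure in miniature: one test at the root, one leaf per outcome, each leaf holding the majority class of the training rows that reach it. The function name and toy rows below are illustrative.

```python
from collections import Counter

def stump(rows, attribute, target):
    """One-level decision tree: branch on one attribute; each leaf holds
    the majority class among the rows reaching it."""
    leaves = {}
    for row in rows:
        leaves.setdefault(row[attribute], []).append(row[target])
    return {value: Counter(labels).most_common(1)[0][0]
            for value, labels in leaves.items()}

rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain",  "play": "yes"},
]
tree = stump(rows, "outlook", "play")
```

A full decision tree algorithm grows such splits recursively, choosing at each node the attribute that best separates the classes (see the entropy, Gini index, and information gain entries below).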
Discovery-driven data mining
is a technique used to find patterns, associations, and other relationships hidden within datasets. It usually discovers facts that the organization had not previously known.
Distance measure
is used in cluster analysis methods to calculate the closeness between pairs of items. Well-known distance measures are the Euclidean distance (the straight-line distance between two points, as could be measured with a ruler) and the Manhattan distance (the rectilinear, or taxicab, distance between two points).
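Both measures are one-liners over coordinate tuples; the function names here are just for illustration.

```python
def euclidean(p, q):
    """Straight-line distance: sqrt of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Rectilinear (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 (the 3-4-5 triangle) while the Manhattan distance is 7 (three blocks east, four blocks north).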
Entropy
measures the extent of uncertainty or randomness in a dataset.
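For a set of class labels with proportions p_i, entropy is H = -Σ p_i log2(p_i). A small sketch (invented function name):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(p * log2(p)) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())
```

A perfectly mixed 50/50 set has entropy 1 bit (maximum uncertainty for two classes); a pure set has entropy 0.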
Gini index
can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable.
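The Gini index for a set of labels with class proportions p_i is 1 - Σ p_i², so 0 means a perfectly pure node. A sketch with an invented function name:

```python
from collections import Counter

def gini(labels):
    """Gini impurity = 1 - sum(p^2) over the class proportions; 0 = pure node."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())
```

A pure node scores 0; a 50/50 two-class node scores 0.5, the maximum for two classes.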
Hypothesis-driven data mining
it is a technique that begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition.
Information gain
is the splitting mechanism used in ID3, the most well-known decision tree algorithm.
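Information gain is the entropy of the parent node minus the weighted average entropy of the child nodes produced by a split; ID3 picks the attribute with the highest gain. A sketch (invented names; entropy repeated here so the block is self-contained):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(p * log2(p)) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) - weighted average of H(child) over the split."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["y", "y", "n", "n"]
ig = information_gain(parent, [["y", "y"], ["n", "n"]])
```

A split that separates the classes perfectly, as above, recovers the full 1 bit of parent entropy; a split that leaves each child as mixed as the parent gains nothing.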