Descriptive Data Mining Flashcards
Antecedent
The item set corresponding to the "if" portion of an if-then association rule.
Association rule
An if-then statement describing the relationship between item sets.
Centroid linkage
Uses the averaging concept of cluster centroids to define between-cluster similarity.
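
A minimal sketch of the idea, assuming numeric observations and Euclidean distance between centroids as the (dis)similarity measure. The clusters and the helper names `centroid` and `centroid_linkage` are made up for illustration.

```python
import math

def centroid(cluster):
    """Average each coordinate over all observations in the cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_linkage(cluster_a, cluster_b):
    """Distance between the two cluster centroids (Euclidean, by assumption)."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Hypothetical clusters of 2-D observations.
cluster_a = [(0, 0), (2, 0)]   # centroid (1, 0)
cluster_b = [(3, 4), (5, 4)]   # centroid (4, 4)
print(centroid_linkage(cluster_a, cluster_b))  # 5.0
```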
Complete linkage
Measure of dissimilarity between clusters that considers only the two most dissimilar observations, one from each cluster.
Confidence
The conditional probability that the consequent of an association rule occurs given the antecedent occurs.
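
As a rough illustration, confidence can be estimated from transaction data as the number of transactions containing both the antecedent and the consequent, divided by the number containing the antecedent. The transactions and the helper name `confidence` are hypothetical.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from transaction counts."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return n_both / n_antecedent

# Rule: if bread, then milk. Bread appears in 3 transactions, 2 of which include milk.
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3
```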
Consequent
The item set corresponding to the "then" portion of an if-then association rule.
Dendrogram
A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering.
Dimension reduction
Process of reducing the number of variables to consider in a data-mining approach.
Euclidean distance
Geometric measure of dissimilarity between observations based on the Pythagorean theorem.
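
A short sketch, assuming each observation is a sequence of numeric values of equal length. The function name `euclidean_distance` is chosen for illustration.

```python
import math

def euclidean_distance(u, v):
    """Square root of the sum of squared coordinate differences."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# A 3-4-5 right triangle: the distance is the hypotenuse.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```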
Group average linkage
Measure of dissimilarity between clusters computed as the average of the distances between each pair of observations, one from each cluster.
Hierarchical clustering
Process of agglomerating observations into a series of nested groups based on a measure of similarity.
Jaccard’s coefficient
Measure of similarity between observations consisting solely of binary categorical variables that considers only matches of nonzero entries.
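
A minimal sketch, assuming observations are coded as lists of 0/1 values. Returning 0.0 when neither observation has a nonzero entry is an assumption of this sketch, not part of the definition.

```python
def jaccard_coefficient(u, v):
    """Matching nonzero entries divided by positions where either entry is nonzero."""
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    # Assumption: treat two all-zero observations as having similarity 0.
    return both / either if either else 0.0

# Hypothetical binary observations: one shared 1, three positions with at least one 1.
u = [1, 0, 1, 0]
v = [1, 1, 0, 0]
print(jaccard_coefficient(u, v))  # 1/3; the shared 0 in the last position is ignored
```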
k-means clustering
Process of organizing observations into one of k groups based on a measure of similarity.
Lift ratio
The ratio of the confidence of an association rule to the benchmark confidence.
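
A sketch of the calculation, taking the benchmark confidence to be the share of all transactions that contain the consequent. The transactions and the helper names are hypothetical.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support_count(items, transactions):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)

def lift_ratio(antecedent, consequent, transactions):
    """Confidence of the rule divided by the benchmark confidence."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Assumes the antecedent and consequent each occur at least once.
    confidence = (support_count(antecedent | consequent, transactions)
                  / support_count(antecedent, transactions))
    benchmark = support_count(consequent, transactions) / len(transactions)
    return confidence / benchmark

# Rule: if bread, then milk. Confidence 2/3, benchmark 3/4, lift 8/9.
print(lift_ratio({"bread"}, {"milk"}, transactions))
```

A lift ratio above 1 suggests the antecedent makes the consequent more likely than it is overall; here the value is below 1.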
Market basket analysis
Analysis of items frequently co-occurring in transactions (such as purchases).
Matching coefficient
Measure of similarity between observations based on the number of matching values of categorical variables.
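
A small sketch, assuming the coefficient is the number of variables with matching values divided by the total number of variables. The observations below are made up.

```python
def matching_coefficient(u, v):
    """Share of categorical variables on which the two observations agree."""
    if len(u) != len(v) or not u:
        raise ValueError("observations must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)

# Hypothetical observations with three categorical variables; two of them match.
u = ["red", "S", "yes"]
v = ["red", "M", "yes"]
print(matching_coefficient(u, v))  # 2/3
```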
McQuitty’s method
Measure that computes the dissimilarity between a cluster AB (formed by merging clusters A and B) and a cluster C by averaging the distance between A and C and the distance between B and C.
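
The update can be sketched as a one-line function. The distances passed in are hypothetical values, and the name `mcquitty_distance` is chosen for illustration.

```python
def mcquitty_distance(d_ac, d_bc):
    """Dissimilarity between merged cluster AB and cluster C: the simple
    average of d(A, C) and d(B, C), regardless of cluster sizes."""
    return (d_ac + d_bc) / 2

# Hypothetical distances: d(A, C) = 4 and d(B, C) = 6.
print(mcquitty_distance(4, 6))  # 5.0
```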
Median linkage
Method that computes the similarity between two clusters as the median of the similarities between each pair of observations in the two clusters.
Missing at random
The case when data for a variable is missing due to a relationship between other variables.
Missing completely at random
The case when data for a variable is missing purely due to random chance.
Missing not at random
The case when data for a variable is missing for a reason related to the value that would have been recorded.
Observation
A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database.
Single linkage
Measure of dissimilarity between clusters that considers only the two most similar observations, one from each cluster.
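
To contrast single linkage with complete and group average linkage, the sketch below computes all three from the same pairwise distances. It assumes Euclidean distance and uses made-up clusters; `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """Distances between every pair of observations, one from each cluster."""
    return [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))   # closest pair

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))   # farthest pair

def group_average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)                 # mean over all pairs

# Hypothetical clusters on a line; the pairwise distances are 4, 6, 3 and 5.
cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (6, 0)]
print(single_linkage(cluster_a, cluster_b))         # 3.0
print(complete_linkage(cluster_a, cluster_b))       # 6.0
print(group_average_linkage(cluster_a, cluster_b))  # 4.5
```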
Support count
The number of times that a collection of items occurs together in a transaction data set.
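
A minimal sketch of counting support, assuming each transaction is stored as a set of items. The data and the function name `support_count` are hypothetical.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support_count(items, transactions):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)

print(support_count({"bread", "milk"}, transactions))  # 2
```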
Unsupervised learning
Category of data-mining techniques in which an algorithm explains relationships without an outcome variable to guide the process.
Ward’s method
Procedure that partitions observations so as to obtain clusters with the least information loss due to aggregation.