Data Mining Flashcards
What is data mining used for?
to make decisions, classifications, diagnoses, and recommendations that affect our lives
Is data meaningful?
no – data is meaningless on its own; someone has to analyze it and make sense of it
What is data mining?
process of looking for patterns in large data sets
What are two data mining tasks?
- classification (decision tree method) – supervised technique
- clustering (k-means method) – unsupervised technique
What is classification?
using previously categorized data to determine how to categorize new data
putting things into groups that already exist
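As a rough illustration of classifying new data from previously categorized data, the sketch below fits a one-level decision tree (a "decision stump") on a single numeric feature. The training values, the labels "small" and "large", and the function names are all hypothetical, and the sketch assumes at least two distinct feature values.

```python
from collections import Counter

def fit_stump(samples):
    """Pick the threshold on one numeric feature that misclassifies the fewest samples."""
    values = sorted({x for x, _ in samples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None  # (errors, threshold, left_label, right_label)
    for t in candidates:
        left = Counter(label for x, label in samples if x <= t)
        right = Counter(label for x, label in samples if x > t)
        # Each side predicts its majority label.
        left_label, left_hits = left.most_common(1)[0]
        right_label, right_hits = right.most_common(1)[0]
        errors = len(samples) - left_hits - right_hits
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    return best[1:]

def predict(stump, x):
    """Assign a new value to one of the already existing groups."""
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label

# Hypothetical labelled (previously categorized) data.
training = [(1.0, "small"), (2.0, "small"), (3.0, "small"),
            (7.0, "large"), (8.0, "large"), (9.0, "large")]
stump = fit_stump(training)
new_label = predict(stump, 4.0)
```

The groups ("small", "large") exist before any new value arrives; the stump only decides which one a new value belongs to.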
What is clustering?
partitioning a set of items into subgroups so that certain quality measures are met (i.e., ‘similar’ items are grouped together)
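A minimal sketch of the k-means method on unlabelled 2D points follows. It assumes the first k points are used as the starting centroids (a simplification chosen to keep the example deterministic) and a fixed number of iterations; the sample points are made up.

```python
import math

def kmeans(points, k, iterations=10):
    """Partition points into k clusters by repeatedly assigning and re-averaging."""
    # Assumption: the first k points act as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabelled data: two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

No labels are supplied; the two groups emerge only from the distances between points.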
Compare classification vs. clustering.
- data
- output groups
- goal
data:
- classification: labelled
- clustering: unlabelled
output groups:
- classification: known
- clustering: unknown
goal:
- classification: used to predict future observations
- clustering: used to understand or describe observations
Why might clustering be used?
explore data for hidden patterns or correlations
- once you see something interesting, you can delve further by other means
- quickly see if there are any possible missed relationships
helps organize data
reduces the number of data points (i.e., a cluster can be reduced to a representative data point)
results might be fed into other data mining techniques
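One way to reduce a cluster to a representative point is to take its centroid (the coordinate-wise mean). This is a sketch with hypothetical cluster names and points. The centroid is only one possible choice and need not be an actual member of the cluster.

```python
def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

# Hypothetical clusters, e.g. as produced by an earlier clustering step.
clusters = {
    "a": [(1.0, 2.0), (3.0, 4.0)],
    "b": [(10.0, 10.0), (12.0, 14.0), (14.0, 12.0)],
}

# Five data points are reduced to two representatives.
representatives = {name: centroid(pts) for name, pts in clusters.items()}
```

The reduced set of representatives could then be passed on to another data mining technique.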
What is clustering by numbers?
clustering points in a numeric space (2D or, more typically, higher-dimensional)
What is the goal in clustering data?
to group points that are ‘near’ each other into the same cluster
What are the 3 possible criteria used to measure cluster quality?
- intra-class (intra-cluster) similarity
- inter-class (inter-cluster) dissimilarity
- size similarity
What is intra-class (intra-cluster) similarity?
- points within same cluster are close to each other (or at least to their closest neighbours)
- distances are minimized
- items in same cluster are similar
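Intra-cluster similarity can be quantified in several ways. The sketch below uses the mean pairwise Euclidean distance within a cluster, which is one assumed choice of measure; the two example clusters are made up, and each cluster needs at least two points.

```python
import math
from itertools import combinations

def mean_intra_distance(cluster):
    """Average distance over all pairs of points in one cluster (lower = tighter)."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical clusters with the same shape but different spread.
tight = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
loose = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)]

tight_score = mean_intra_distance(tight)
loose_score = mean_intra_distance(loose)
```

By this measure the tighter cluster scores lower, which is what minimizing distances within a cluster aims for.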
What is inter-class (inter-cluster) dissimilarity?
- points in two different clusters are far from each other (or at least far from their closest neighbours in other clusters)
- distances are maximized
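Inter-cluster dissimilarity based on closest neighbours in other clusters can be sketched as the smallest distance between any point of one cluster and any point of another. The function name and sample points are hypothetical, and other measures (for example, distance between centroids) are equally valid.

```python
import math

def min_inter_distance(cluster_a, cluster_b):
    """Distance between the closest pair of points drawn from two different clusters."""
    return min(math.dist(a, b) for a in cluster_a for b in cluster_b)

# Hypothetical clusters lying on the x-axis.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(4.0, 0.0), (6.0, 0.0)]

gap = min_inter_distance(left, right)  # closest pair is (1, 0) and (4, 0)
```

A larger gap means the clusters are better separated.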
What is size similarity?
clusters have similar size
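A simple way to check size similarity is the ratio of the smallest cluster size to the largest. This ratio is an assumed measure for illustration, not one the cards prescribe, and it requires non-empty input.

```python
def size_balance(clusters):
    """Ratio of smallest to largest cluster size: 1.0 means perfectly balanced."""
    sizes = [len(c) for c in clusters]
    return min(sizes) / max(sizes)

# Hypothetical clusterings of small integer items.
balanced = [[1, 2, 3], [4, 5, 6]]
skewed = [[1], [2, 3, 4, 5]]

balanced_score = size_balance(balanced)
skewed_score = size_balance(skewed)
```

Values near 1.0 indicate clusters of similar size; values near 0 indicate that one cluster dominates.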
What are some limitations of data mining?
dirty data (obsolete, inaccurate, or missing info) – results built on it are not meaningful