Data Mining Flashcards
What is data mining used for?
to make decisions, classifications, diagnoses, and recommendations that affect our lives
Is data meaningful?
no – all information is meaningless without someone to analyze it and make sense of it
What is data mining?
process of looking for patterns in large data sets
What are two data mining tasks?
- classification (decision tree method) – supervised technique
- clustering (k-means method) – unsupervised technique
What is classification?
using previously categorized data to determine how to categorize new data
putting things into groups that already exist
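A minimal sketch of classification, assuming a nearest-neighbour rule rather than the decision tree method named above (it is swapped in only because it fits in a few lines). The data, the labels, and the `classify` helper are hypothetical.

```python
import math

# Hypothetical labelled training data: (height_cm, weight_kg) -> existing group
labelled = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]

def classify(point, training):
    """Return the label of the training example closest to `point`."""
    nearest = min(training, key=lambda example: math.dist(point, example[0]))
    return nearest[1]

# New, uncategorized points are put into groups that already exist.
print(classify((152.0, 52.0), labelled))
print(classify((182.0, 88.0), labelled))
```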
What is clustering?
partitioning a set of items into subgroups to ensure certain measures of quality (i.e., ‘similar’ items are grouped together)
Compare classification vs. clustering.
- data
- output groups
- goal
data:
- classification: labelled
- clustering: unlabelled
output groups:
- classification: known
- clustering: unknown
goal:
- classification: used to predict future observations
- clustering: used to understand or describe observations
Why might clustering be used?
explore data for hidden patterns or correlations
- once you see something interesting, you can delve further by other means
- quickly see if there are any possible missed relationships
helps organize data
reduces number of data points (i.e., can reduce a cluster to a representative data point)
results might be fed into other data mining techniques
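One way to reduce a cluster to a representative data point is to take its centroid (the coordinate-wise mean). This is a sketch under that assumption; the `centroid` helper and the sample cluster are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

# Hypothetical cluster of three 2D points, replaced by one representative.
cluster = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
representative = centroid(cluster)
print(representative)  # (2.0, 4.0)
```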
What is clustering by numbers?
clustering numeric points in a space of any dimension (2D or multidimensional)
What is the goal in clustering data?
to group together points that are ‘near’ each other
What are the 3 possible criteria used to measure cluster quality?
- intra-class (intra-cluster) similarity
- inter-class (inter-cluster) dissimilarity
- size similarity
What is intra-class (intra-cluster) similarity?
- points within same cluster are close to each other (or at least to their closest neighbours)
- distances are minimized
- items in same cluster are similar
What is inter-class (inter-cluster) dissimilarity?
- points in two different clusters are far from each other (or at least from their closest neighbours in other clusters)
- distances are maximized
What is size similarity?
clusters have similar size
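The three criteria can be made concrete with simple distance measurements. The sketch below assumes Euclidean distance, mean pairwise distance for intra-cluster similarity, and closest-pair distance for inter-cluster dissimilarity; these are illustrative choices, and the function names and data are hypothetical.

```python
import math
from itertools import combinations

def mean_intra_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def min_inter_distance(cluster_a, cluster_b):
    """Distance between the closest pair of points taken from two clusters."""
    return min(math.dist(a, b) for a in cluster_a for b in cluster_b)

# Hypothetical clusters: tight internally, far apart, and equal in size.
a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

print("intra-cluster (smaller is better):", mean_intra_distance(a))
print("inter-cluster (larger is better):", min_inter_distance(a, b))
print("sizes:", len(a), len(b))
```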
What are some limitations of data mining?
dirty data (obsolete, inaccurate, and missing info) – results are not meaningful
What do data scientists call cleaning up dirty data?
data wrangling
data munging
data janitor work
What are issues with dirty data?
- cleaning dirty data up is an increasingly important and overlooked job that can help prevent costly mistakes
- data scientists spend 50–80% of their time on the labor of collecting and preparing unruly digital data before it can be explored for useful nuggets
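A small sketch of what this cleanup can look like, assuming records arrive as dictionaries with a text `age` field. The records, the field names, and the `clean` helper are all hypothetical; real cleanup rules depend on the data.

```python
# Hypothetical raw records with typical problems.
raw_records = [
    {"name": "Ann", "age": "34"},
    {"name": "Bob", "age": ""},      # missing value
    {"name": "Cy", "age": "abc"},    # inaccurate (unparseable) value
    {"name": "Dee", "age": " 41 "},  # stray whitespace
]

def clean(records):
    """Keep only records whose age parses as a whole number."""
    cleaned = []
    for rec in records:
        text = rec["age"].strip()
        if not text.isdigit():
            continue  # drop records with missing or unparseable ages
        cleaned.append({"name": rec["name"], "age": int(text)})
    return cleaned

print(clean(raw_records))
```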
What can the K-means clustering algorithm be used on?
any number of data points – works for 2D and multidimensional data
What is K in the K-means clustering algorithm?
number of clusters you want to end up with – you choose it
depending on the data points, the cluster assignments may take a long time to stop changing, so you should pick the maximum number of iterations you want to run
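A minimal sketch of the K-means idea, assuming Euclidean distance, random data points as starting centroids, and a `max_iters` cap on the number of passes. The function names, parameters, and data are hypothetical, not a reference implementation.

```python
import math
import random

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as a start
    assignments = None
    for _ in range(max_iters):
        # Assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the clusters have settled
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = mean_point(members)
    return centroids, assignments

# Works the same way for 2D or higher-dimensional tuples.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points share a number.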
What are some limitations of K-means clustering?
algorithm may give different cluster solutions depending on how initial centroids are chosen
- studies will do many runs with different initial centroids
not always clear how to choose K (number of clusters)
- if size of data set is small, different values of K can be tried and compared
- OR, large value of K can be chosen, then clusters can be merged to yield hierarchical cluster structure
Can the final clusters assigned depend on initial positions of centroids in k-means clustering?
yes
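A small demonstration of this dependence, assuming Euclidean distance and hand-picked starting centroids. The `k_means_from` helper and the four-point data set are hypothetical, chosen so that two different starts settle on two different answers.

```python
import math

def k_means_from(points, initial_centroids, max_iters=100):
    """Run K-means from explicit starting centroids; return cluster numbers."""
    centroids = list(initial_centroids)
    k = len(centroids)
    assignments = None
    for _ in range(max_iters):
        new_assignments = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments

# Four corners of a wide, short rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

left_right = k_means_from(points, [(0.0, 0.0), (4.0, 0.0)])
top_bottom = k_means_from(points, [(0.0, 0.0), (0.0, 1.0)])

print(left_right)  # [0, 0, 1, 1] -> left pair vs. right pair
print(top_bottom)  # [0, 1, 0, 1] -> bottom pair vs. top pair
```

Both results are stable, but the left/right split has much tighter clusters, which is why multiple runs with different starts are compared.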