Data Mining Flashcards

Question 1

Q

What is data mining used for?

Answer

A

to make decisions, classifications, diagnoses, and recommendations that affect our lives

Question 2

Q

Is data meaningful?

Answer

A

no – all information is meaningless without someone to analyze it and make sense of it

Question 3

Q

What is data mining?

Answer

A

process of looking for patterns in large data sets

Question 4

Q

What are two data mining tasks?

Answer

A

classification (decision tree method) – supervised technique
clustering (k-means method) – unsupervised technique

Question 5

Q

What is classification?

Answer

A

using previously categorized data to determine how to categorize new data

putting things into groups that already exist
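
A minimal sketch of "using previously categorized data to categorize new data". The data points and labels are made up for illustration, and the rule used here is a simple 1-nearest-neighbour lookup, not the decision tree method named in these cards (it is swapped in only because it fits in a few lines):

```python
import math

# Hypothetical previously categorized data: (features, label)
labelled = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]

def classify(point, training):
    """Return the label of the training example closest to `point`."""
    nearest = min(training, key=lambda example: math.dist(point, example[0]))
    return nearest[1]

# A new, uncategorized point is placed into one of the existing groups.
new_label = classify((2.0, 1.5), labelled)
```

The groups ("small", "large") exist before the new point arrives; the classifier only decides which one it belongs to.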

Question 6

Q

What is clustering?

Answer

A

partitioning a set of items into subgroups so that certain measures of quality are met (e.g., ‘similar’ items are grouped together)

Question 7

Q

Compare classification vs. clustering.

data
output groups
goal

Answer

A

data:

classification: labelled
clustering: unlabelled

output groups:

classification: known
clustering: unknown

goal:

classification: used to predict future observations
clustering: used to understand or describe observations

Question 8

Q

Why might clustering be used?

Answer

A

explore data for hidden patterns or correlations

once you see something interesting, you can delve further by other means
quickly see if there are any possible missed relationships

helps organize data

reduces the number of data points (e.g., a cluster can be reduced to a single representative data point)

results might be fed into other data mining techniques
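
One way to picture reducing a cluster to a representative point is to take the mean of its members. This is a sketch under that assumption; the `centroid` helper and the sample cluster are hypothetical, and other representatives (such as a median point) are also possible:

```python
def centroid(points):
    """Mean of each coordinate across the points of one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
representative = centroid(cluster)  # (3.0, 4.0)
```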

Question 9

Q

What is clustering by numbers?

Answer

A

clustering numeric points in a coordinate space (2D or higher-dimensional)

Question 10

Q

What is the goal in clustering data?

Answer

A

to group together points that are ‘near’ each other

Question 11

Q

What are the 3 possible criteria used to measure cluster quality?

Answer

A

intra-class (intra-cluster) similarity
inter-class (inter-cluster) dissimilarity
size similarity

Question 12

Q

What is intra-class (intra-cluster) similarity?

Answer

A

points within same cluster are close to each other (or at least to their closest neighbours)
distances are minimized
items in same cluster are similar
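
A rough way to put a number on this criterion is the average pairwise distance inside a cluster: the smaller it is, the tighter the cluster. The function name and the choice of Euclidean distance are assumptions made for this sketch, not something the cards specify:

```python
import math
from itertools import combinations

def mean_intra_distance(cluster):
    """Average pairwise Euclidean distance between points in one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # a single point has no pairwise distances
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

tight = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
loose = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)]

# The tight cluster scores lower (better) than the loose one.
tight_score = mean_intra_distance(tight)
loose_score = mean_intra_distance(loose)
```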

Question 13

Q

What is inter-class (inter-cluster) dissimilarity?

Answer

A

points in two different clusters are far from each other (or at least far from their closest neighbours in other clusters)
distances are maximized
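
The "closest neighbours in other clusters" reading can be sketched as the smallest distance between any point in one cluster and any point in another; a larger value means better separation. The function name and the use of Euclidean distance are assumptions for illustration:

```python
import math

def min_inter_distance(cluster_a, cluster_b):
    """Smallest Euclidean distance between a point in A and a point in B."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (6.0, 0.0)]
gap = min_inter_distance(a, b)  # 3.0, between (1, 0) and (4, 0)
```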

Question 14

Q

What is size similarity?

Answer

A

clusters have similar size

Question 15

Q

What are some limitations of data mining?

Answer

A

dirty data (obsolete, inaccurate, and missing info) – results are not meaningful

Question 16

Q

What do data scientists call cleaning up dirty data?

Answer

A

data wrangling
data munging
data janitor work

Question 17

Q

What are issues with dirty data?

Answer

A

cleaning up dirty data is an increasingly important but often overlooked job that can help prevent costly mistakes
data scientists spend 50-80% of their time on the labor of collecting and preparing unruly digital data before it can be explored for useful nuggets
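
A small sketch of what this preparation work can look like in practice. The record layout, the `clean` helper, and the 0 to 120 age range are all hypothetical choices made for the example; real cleaning rules depend on the data set:

```python
def clean(records, required=("name", "age")):
    """Keep records that have every required field and a plausible age."""
    kept = []
    for rec in records:
        if any(rec.get(field) in (None, "") for field in required):
            continue  # missing info
        try:
            age = int(rec["age"])
        except (TypeError, ValueError):
            continue  # inaccurate info: age is not a number
        if not 0 <= age <= 120:
            continue  # inaccurate info: implausible value
        kept.append({**rec, "age": age})
    return kept

raw = [
    {"name": "Ana", "age": "34"},
    {"name": "", "age": "29"},      # missing name
    {"name": "Bo", "age": "abc"},   # age is not numeric
    {"name": "Cy", "age": "-5"},    # age out of range
]
cleaned = clean(raw)
```

Only the first record survives; the other three are dropped for the reasons noted in the comments.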

Question 18

Q

What can the K-means clustering algorithm be used on?

Answer

A

any number of data points – works for 2D and multidimensional data

Question 19

Q

What is K in the K-means clustering algorithm?

Answer

A

number of clusters you want to end up with – you choose it

depending on the data points, the algorithm may not settle on a final answer within a reasonable number of passes, so you should also pick the max number of iterations you want it to run
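
A minimal sketch of a basic k-means loop, showing where the chosen K and the cap on the number of passes enter. The function names, the `max_iters` and `seed` parameters, the random choice of starting centroids, and the sample points are assumptions made for illustration:

```python
import math
import random

def centroid(points):
    """Mean of each coordinate across a list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def k_means(points, k, max_iters=100, seed=0):
    """Return (centroids, assignments) for `points` split into `k` clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # initial centroids: k random points
    assignments = None
    for _ in range(max_iters):              # cap on the number of passes
        # Assign each point to the index of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:  # nothing changed: stop early
            break
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:                     # keep the old centroid if the cluster is empty
                centroids[i] = centroid(members)
    return centroids, assignments

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
```

With these two well-separated groups, the first three points end up sharing one cluster index and the last three share the other.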

Question 20

Q

What are some limitations of K-means clustering?

Answer

A

algorithm may give different cluster solutions depending on how initial centroids are chosen
- studies typically do many runs with different initial centroids

not always clear how to choose K (number of clusters)

if the data set is small, several different values of K can be tried
OR, a large value of K can be chosen, then clusters can be merged to yield a hierarchical cluster structure
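
The "many runs" idea can be sketched as follows: run k-means from several different starting centroids and keep the run with the lowest total squared distance from points to their centroids. The scoring function, the helper names, and the use of seeds to vary the starting centroids are assumptions for this sketch, not a method the cards prescribe:

```python
import math
import random

def k_means(points, k, seed, max_iters=100):
    """Basic k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # starting centroids vary with the seed
    assignments = None
    for _ in range(max_iters):
        new = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
        if new == assignments:
            break
        assignments = new
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

def within_cluster_ss(points, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))

def best_of_runs(points, k, runs=10):
    """Run k-means with several seeds and keep the lowest-scoring result."""
    results = []
    for seed in range(runs):
        centroids, assignments = k_means(points, k, seed)
        score = within_cluster_ss(points, centroids, assignments)
        results.append((score, centroids, assignments))
    return min(results, key=lambda r: r[0])

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
score_k2, centroids_k2, assignments_k2 = best_of_runs(points, k=2)
score_k1, _, _ = best_of_runs(points, k=1)
```

Comparing scores across values of K, as in the last two lines, gives a rough feel for how K affects the fit. The score tends to drop as K grows, so it does not pick K on its own.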

Question 21

Q

Can the final clusters assigned depend on initial positions of centroids in k-means clustering?