Data Mining Flashcards

1
Q

What is data mining used for?

A

to make decisions, classifications, diagnoses, and recommendations that affect our lives

2
Q

Is data meaningful?

A

no – data is meaningless until someone analyzes it and makes sense of it

3
Q

What is data mining?

A

process of looking for patterns in large data sets

4
Q

What are two data mining tasks?

A
  • classification (decision tree method) – supervised technique
  • clustering (k-means method) – unsupervised technique
5
Q

What is classification?

A

using previously categorized data to determine how to categorize new data

putting things into groups that already exist
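
A minimal sketch of the idea, assuming made-up 2D points and labels. It uses a nearest-neighbour rule rather than the decision tree method named in these cards, purely because it is short; `classify` and `labelled` are hypothetical names.

```python
import math

# Previously categorized (labelled) data: (features, label). Values are invented.
labelled = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]

def classify(point, examples):
    """Return the label of the labelled example closest to `point`."""
    _, nearest_label = min(examples, key=lambda ex: math.dist(point, ex[0]))
    return nearest_label

# A new, uncategorized point is placed into one of the groups that already exist.
new_label = classify((2.0, 1.0), labelled)
print(new_label)
```

The groups ("small", "large") come from the labelled data; the code never invents a new group.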

6
Q

What is clustering?

A

partitioning a set of items into subgroups so that certain quality measures are met (i.e., ‘similar’ items are grouped together)

7
Q

Compare classification vs. clustering.

  • data
  • output groups
  • goal
A

data:

  • classification: labelled
  • clustering: unlabelled

output groups:

  • classification: known
  • clustering: unknown

goal:

  • classification: used to predict future observations
  • clustering: used to understand or describe observations
8
Q

Why might clustering be used?

A

explore data for hidden patterns or correlations

  • once you see something interesting, you can delve further by other means
  • quickly see if there are any possible missed relationships

helps organize data

reduces number of data points (i.e., can reduce a cluster to a representative data point)

results might be fed into other data mining techniques
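
One common choice of representative point is the centroid (coordinate-wise mean) of the cluster. A small sketch, assuming numeric points of equal dimension; `centroid` is a hypothetical helper name.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Three points in one cluster are reduced to a single representative point.
cluster = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
representative = centroid(cluster)
print(representative)
```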

9
Q

What is clustering by numbers?

A

clustering numeric points in a coordinate space (2D or higher-dimensional)

10
Q

What is the goal in clustering data?

A

to group points that are ‘near’ each other into the same cluster

11
Q

What are the 3 possible criteria used to measure cluster quality?

A
  • intra-class (intra-cluster) similarity
  • inter-class (inter-cluster) dissimilarity
  • size similarity
12
Q

What is intra-class (intra-cluster) similarity?

A
  • points within same cluster are close to each other (or at least to their closest neighbours)
  • distances are minimized
  • items in same cluster are similar
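
One way to put a number on this, as a sketch: average the Euclidean distance over all pairs of points in a cluster. The choice of mean pairwise distance is an assumption (other measures exist), and `mean_intra_cluster_distance` is a hypothetical name.

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(cluster):
    """Average Euclidean distance over all pairs of points in one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # a single point has no pairs to compare
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

tight = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
loose = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)]

# The smaller value indicates the cluster with higher intra-cluster similarity.
print(mean_intra_cluster_distance(tight), mean_intra_cluster_distance(loose))
```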
13
Q

What is inter-class (inter-cluster) dissimilarity?

A
  • points in two different clusters are far from each other (or at least far from their closest neighbours in other clusters)
  • distances are maximized
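
A sketch of the "closest neighbours in other clusters" reading: take the smallest distance between any point in one cluster and any point in another. Using the minimum is an assumption made for illustration; `closest_inter_cluster_distance` is a hypothetical name.

```python
import math

def closest_inter_cluster_distance(cluster_a, cluster_b):
    """Smallest Euclidean distance between a point in A and a point in B."""
    return min(math.dist(a, b) for a in cluster_a for b in cluster_b)

left = [(0.0, 0.0), (1.0, 0.0)]
right = [(4.0, 0.0), (6.0, 0.0)]

# A larger gap means the two clusters are better separated.
gap = closest_inter_cluster_distance(left, right)
print(gap)
```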
14
Q

What is size similarity?

A

clusters have similar size

15
Q

What are some limitations of data mining?

A

dirty data (obsolete, inaccurate, or missing info) – results built on it are not meaningful

16
Q

What do data scientists call cleaning up dirty data?

A

data wrangling
data munging
data janitor work
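
A tiny sketch of what this kind of cleanup can look like, assuming invented records and an assumed valid age range of 0 to 120; `is_clean` is a hypothetical helper, and real cleanup is usually far more involved.

```python
# Invented records; some are "dirty".
records = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": None},  # missing info
    {"name": "Cy", "age": -5},     # inaccurate info
    {"name": "Di", "age": 51},
]

def is_clean(record):
    """Keep a record only if its age is present and within the assumed range."""
    age = record.get("age")
    return age is not None and 0 <= age <= 120

clean = [r for r in records if is_clean(r)]
print([r["name"] for r in clean])
```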

17
Q

What are issues with dirty data?

A
  • cleaning up dirty data is an increasingly important but overlooked job that can help prevent costly mistakes
  • data scientists spend 50-80% of their time on the labor of collecting and preparing unruly digital data before it can be explored for useful nuggets
18
Q

What can the K-means clustering algorithm be used on?

A

any number of data points – works for 2D and multidimensional data

19
Q

What is K in the K-means clustering algorithm?

A

number of clusters you want to end up with – you choose it

depending on the data points, the algorithm may not settle on a final answer in a reasonable number of passes, so you should also pick the maximum number of iterations you want it to run
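
A minimal sketch of K-means, assuming the usual assign-then-update loop: K is the number of initial centroids you pass in, and `max_iterations` caps the number of passes. The names `assign`, `centroid`, and `k_means` and the sample points are all hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, initial_centroids, max_iterations=100):
    """Return (centroids, clusters); K is len(initial_centroids)."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its old centroid.
        new_centroids = [centroid(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # nothing moved, so the answer is stable
        centroids = new_centroids
    return centroids, assign(points, centroids)

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
# K = 2: here the first two data points serve as initial centroids.
centroids, clusters = k_means(points, points[:2], max_iterations=10)
print(clusters)
```

If the cap is reached first, the function returns whatever centroids it has at that point rather than looping further.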

20
Q

What are some limitations of K-means clustering?

A

algorithm may give different cluster solutions depending on how initial centroids are chosen
  • studies will do many runs with different initial centroids

not always clear how to choose K (number of clusters)

  • if size of data set is small, different values of K can be tried
  • OR, large value of K can be chosen, then clusters can be merged to yield hierarchical cluster structure
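
The dependence on initial centroids can be seen on four invented points at the corners of a rectangle. This sketch assumes the usual assign-then-update loop and compares runs by total squared distance to the centroids, which is one common scoring choice and not something these cards specify; all function names are hypothetical.

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        new_centroids = [centroid(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

def total_squared_distance(centroids, clusters):
    """Sum of squared distances from each point to its cluster's centroid."""
    return sum(math.dist(p, c) ** 2
               for c, cluster in zip(centroids, clusters) for p in cluster)

points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
starts = [
    [(0.0, 0.0), (0.0, 1.0)],  # leads to a bottom/top split
    [(0.0, 0.0), (4.0, 0.0)],  # leads to a left/right split
]
runs = [k_means(points, start) for start in starts]

# Keep the run whose points sit closest to their centroids.
best_centroids, best_clusters = min(
    runs, key=lambda run: total_squared_distance(*run))
print(best_clusters)
```

Both runs stop at a stable answer, yet the answers differ, which is why several starts are tried and the best-scoring one is kept.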
21
Q

Can the final clusters assigned depend on initial positions of centroids in k-means clustering?

A

yes