Data Mining Flashcards

1
Q

What is data mining?

A

The analysis of data to discover structure, patterns and relationships

2
Q

What is the formula for covariance?

A

σ^(m,n) = 1/(T-1) ∑_t (x_t^m - μ^m)(x_t^n - μ^n)
where μ^m is the mean of dimension m over the T samples
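
Worked example: a minimal Python sketch of the sample covariance with the 1/(T-1) normalisation, assuming the mean of each dimension is taken over all T samples. The function name and the sample data are illustrative only.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (1/(T-1) normalisation)."""
    T = len(xs)
    if T != len(ys) or T < 2:
        raise ValueError("need two sequences of equal length T >= 2")
    mean_x = sum(xs) / T
    mean_y = sum(ys) / T
    # Sum the products of deviations from each mean, then normalise by T - 1.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (T - 1)


# Hypothetical data: the second dimension is exactly twice the first.
dim_m = [1.0, 2.0, 3.0, 4.0]
dim_n = [2.0, 4.0, 6.0, 8.0]
print(covariance(dim_m, dim_n))
```

Calling covariance(dim_m, dim_m) gives the variance of that dimension, which is the diagonal entry of the covariance matrix.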

3
Q

What is the formula for the covariance matrix?

A

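Σ = UDU^T (the eigendecomposition of the covariance matrix)
D is a diagonal matrix of eigenvalues
U is an orthogonal matrix of eigenvectors and corresponds to a rotation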
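
Worked example: a sketch of the UDU^T decomposition for the 2x2 case only, where the rotation angle has a closed form. The helper names are hypothetical, and a general N-dimensional version would normally rely on a linear algebra library instead.

```python
import math


def decompose_2x2(a, b, c):
    """Return (U, D) such that [[a, b], [b, c]] = U D U^T (symmetric 2x2 input)."""
    centre = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    d1, d2 = centre + radius, centre - radius   # diagonal of D, largest first
    theta = 0.5 * math.atan2(2.0 * b, a - c)    # rotation angle of the main axis
    U = [[math.cos(theta), -math.sin(theta)],
         [math.sin(theta), math.cos(theta)]]
    D = [[d1, 0.0], [0.0, d2]]
    return U, D


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


# Hypothetical covariance matrix with equal variances and positive correlation.
U, D = decompose_2x2(2.0, 1.0, 2.0)
rebuilt = matmul(matmul(U, D), transpose(U))
print(D, rebuilt)
```

The first column of U is the direction of greatest variance, and the matching entry of D is the variance along it.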

4
Q

How are the results of principal component analysis interpreted?

A

The larger a diagonal value in D, the larger the variance along the corresponding direction (column of U)

5
Q

What is agglomerative clustering?

A

Start by assuming each data point belongs to its own unique cluster
Clusters are combined until the required number of clusters is reached
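
Worked example: a sketch of agglomerative clustering. The card does not say which clusters get combined, so this version assumes the two clusters with the closest centroids are merged at each step; that merge rule and the function name are assumptions.

```python
import math


def agglomerate(points, n_clusters):
    """Merge clusters pairwise until only n_clusters remain."""
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    # Start with every data point in its own cluster.
    clusters = [[p] for p in points]

    def centroid(cluster):
        return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

    while len(clusters) > n_clusters:
        best = None
        # Assumed merge rule: find the pair of clusters with the closest centroids.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical data: two well-separated pairs of points.
print(agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)], 2))
```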

6
Q

What is divisive clustering?

A

Start by assuming there’s one centroid

At each step the centroids are replaced by two centroids

7
Q

What does K represent in K-means clustering?

A

The required number of centroids (clusters)

8
Q

What are the steps to K-means clustering?

A
  1. Initialise K centroids
  2. Assign each point to the closest centroid
  3. Calculate the mean of the points allocated to each centroid
  4. Set each new centroid to the corresponding mean
  5. Repeat steps 2 to 4 until the centroids stop changing
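
Worked example: the steps above as a short Python sketch. Seeding the centroids with the first K points and stopping when they no longer move are assumptions; the card does not specify either.

```python
import math


def k_means(points, k, max_iters=100):
    """Return k centroids for a list of equal-length coordinate tuples."""
    # Step 1 (assumed initialisation): use the first k points as centroids.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Steps 3 and 4: each new centroid is the mean of its group.
        new_centroids = []
        for i, group in enumerate(groups):
            if group:
                new_centroids.append(
                    tuple(sum(coord) / len(group) for coord in zip(*group)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place
        # Repeat until nothing changes (assumed stopping rule).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


# Hypothetical data: two well-separated pairs of points.
print(k_means([(0, 0), (0, 1), (10, 10), (10, 11)], 2))
```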
9
Q

What is the formula for a 1-dimensional Gaussian PDF?

A

p(x) = 1/√(2πσ^2) exp(-(x-μ)^2/(2σ^2))
where μ is the mean and σ is the standard deviation
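
Worked example: a sketch of the 1-dimensional Gaussian PDF, assuming σ denotes the standard deviation (so the variance is σ^2). The function name is illustrative.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    variance = sigma ** 2
    normaliser = math.sqrt(2.0 * math.pi * variance)
    return math.exp(-((x - mu) ** 2) / (2.0 * variance)) / normaliser


# The density peaks at the mean and is symmetric around it.
print(gaussian_pdf(0.0, 0.0, 1.0))
print(gaussian_pdf(1.0, 0.0, 1.0), gaussian_pdf(-1.0, 0.0, 1.0))
```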

10
Q

What is the formula for an N dimensional Gaussian PDF?

A

p(x) = 1/√((2π)^N |Σ|) exp(-1/2 (x-μ)^T Σ^(-1) (x-μ))
where μ is the mean vector and Σ is the covariance matrix
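
Worked example: a sketch of the formula for the special case N = 2, where the determinant and inverse of the covariance matrix can be written out by hand. Restricting to two dimensions is an assumption made to keep the example free of library dependencies.

```python
import math


def gaussian_pdf_2d(x, mean, cov):
    """Density of a 2-D Gaussian; cov is a 2x2 covariance matrix as nested lists."""
    (a, b), (c, d) = cov
    det = a * d - b * c                      # |Σ|
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x-mean)^T Σ^(-1) (x-mean), using the 2x2 inverse
    # Σ^(-1) = 1/det * [[d, -b], [-c, a]].
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    normaliser = math.sqrt((2.0 * math.pi) ** 2 * det)
    return math.exp(-0.5 * quad) / normaliser


# With an identity covariance the peak density is 1/(2π).
print(gaussian_pdf_2d((0.0, 0.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]))
```

With a diagonal covariance the result equals the product of two 1-dimensional Gaussian densities, which is a quick sanity check.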

11
Q

What is a Gaussian mixture model?

A

A weighted sum of several component Gaussian PDFs, with non-negative weights that sum to 1
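
Worked example: a sketch of a 1-dimensional Gaussian mixture density, assuming the weights are non-negative and sum to 1 and that each component is described by a mean and a standard deviation. All names and values are illustrative.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a single 1-D Gaussian component."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(
        2.0 * math.pi * sigma ** 2)


def mixture_pdf(x, weights, means, sigmas):
    """Weighted combination of the component densities at x."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * gaussian_pdf(x, mu, s)
               for w, mu, s in zip(weights, means, sigmas))


# Hypothetical two-component mixture.
print(mixture_pdf(0.0, [0.3, 0.7], [0.0, 5.0], [1.0, 2.0]))
```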

12
Q

What is the E-M algorithm used for?

A

Estimating the parameters (such as the μ and σ of each Gaussian component) that best fit the given data

13
Q

What are the steps of the E-M algorithm?

A
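  1. Choose a number of components, M, and initial parameters
  2. E-step: for each sample y and each component m, calculate P(m|y)
  3. M-step: define new values of μ and σ, using P(m|y) as weights
  4. Repeat steps 2 and 3 until the parameters converge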
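
Worked example: the steps above sketched for a 1-dimensional Gaussian mixture. The update of the component weights, the small floor on the variance, and the fixed iteration count are assumptions that the card does not mention.

```python
import math


def normal_pdf(y, mu, sigma):
    return math.exp(-((y - mu) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(
        2.0 * math.pi * sigma ** 2)


def em_step(data, weights, means, sigmas):
    """One E-M iteration; returns updated (weights, means, sigmas)."""
    M = len(weights)
    # Step 2: P(m|y) for every sample y and component m.
    posteriors = []
    for y in data:
        joint = [weights[m] * normal_pdf(y, means[m], sigmas[m]) for m in range(M)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    # Step 3: new parameters as posterior-weighted averages.
    new_weights, new_means, new_sigmas = [], [], []
    for m in range(M):
        mass = sum(p[m] for p in posteriors)
        mu = sum(p[m] * y for p, y in zip(posteriors, data)) / mass
        var = sum(p[m] * (y - mu) ** 2 for p, y in zip(posteriors, data)) / mass
        new_weights.append(mass / len(data))
        new_means.append(mu)
        new_sigmas.append(math.sqrt(max(var, 1e-6)))  # assumed variance floor
    return new_weights, new_means, new_sigmas


def fit(data, means, sigmas, iterations=50):
    """Step 1 is the caller's choice of M = len(means) and initial values."""
    weights = [1.0 / len(means)] * len(means)
    for _ in range(iterations):
        weights, means, sigmas = em_step(data, weights, means, sigmas)
    return weights, means, sigmas


# Hypothetical data drawn around 0 and around 10.
samples = [0.0, 0.2, -0.2, 10.0, 10.2, 9.8]
print(fit(samples, [1.0, 9.0], [1.0, 1.0]))
```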
14
Q

What is dynamic programming used for?

A

Finding the optimal alignment path between two sequences

15
Q

What is the formula for ad(i, j)?

A

ad(i, j) is the sum of distances along the best path from (1, 1) to (i, j)
ad(i,j)=min{ad(i-1,j) + K_del + d(i,j), ad(i-1,j-1) + d(i,j), ad(i,j-1) + K_ins + d(i,j)}
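
Worked example: a sketch of the recurrence above, filled in row by row. The boundary condition ad(1, 1) = d(1, 1), the use of |a_i - b_j| as the local distance d(i, j), and the zero-based indexing are assumptions for illustration.

```python
def accumulated_distance(a, b, k_del=0.0, k_ins=0.0):
    """Return the full table ad, where ad[i][j] is the best path cost to (i, j)."""
    rows, cols = len(a), len(b)
    ad = [[float("inf")] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            local = abs(a[i] - b[j])          # assumed local distance d(i, j)
            if i == 0 and j == 0:
                ad[i][j] = local              # assumed start of every path
                continue
            candidates = []
            if i > 0:
                candidates.append(ad[i - 1][j] + k_del + local)
            if i > 0 and j > 0:
                candidates.append(ad[i - 1][j - 1] + local)
            if j > 0:
                candidates.append(ad[i][j - 1] + k_ins + local)
            ad[i][j] = min(candidates)
    return ad


# Hypothetical sequences: the second repeats one element of the first.
table = accumulated_distance([1, 2, 3], [1, 2, 2, 3])
print(table[-1][-1])
```

The bottom-right entry is the total cost of the best path; raising K_del or K_ins makes the diagonal step relatively cheaper.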

16
Q

What are the three distances used in dynamic programming?

A

d_1 = city block
d_2 = Euclidean
d_∞ = Chebyshev (maximum)

17
Q

What is the formula for city block distance?

A

∑_n |x_n - y_n|

18
Q

What is the formula for Euclidean distance?

A

√(∑_n (x_n - y_n)^2)

19
Q

What is the formula for d_∞?

A

max_n |x_n - y_n|
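
Worked example: the three distances (city block, Euclidean and d_∞) side by side as a Python sketch, assuming x and y are equal-length sequences of numbers. The function names are illustrative.

```python
import math


def city_block(x, y):
    """d_1: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """d_2: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def d_infinity(x, y):
    """d_∞: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical pair of points forming a 3-4-5 right triangle.
x, y = (0, 0), (3, 4)
print(city_block(x, y), euclidean(x, y), d_infinity(x, y))
```

For any pair of points, d_∞ is at most d_2, which is at most d_1.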

20
Q

What is distortion?

A

A measure of how well a set of centroids models a set of data (lower is better)

21
Q

How is distortion calculated?

A

The sum of distances between each data point and its nearest centroid
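
Worked example: a sketch of the distortion calculation, assuming Euclidean distance between points and centroids. The card does not name the metric, so that choice is an assumption.

```python
import math


def distortion(points, centroids):
    """Sum, over all points, of the distance to the nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points)


# Hypothetical data: two points share one centroid, a third sits on the other.
data = [(0, 0), (0, 2), (10, 0)]
print(distortion(data, [(0, 1), (10, 0)]))
```

Moving a centroid away from the points it serves increases the total.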