Data Mining Flashcards

Question 1

Q

What is data mining?

Answer

A

The analysis of data to discover structure, patterns and relationships

Question 2

Q

What is the formula for covariance?

Answer

A

σ^(m,n) = 1/(T-1) ∑_t (x_t^m - μ^m)(x_t^n - μ^n)
where μ^m is the mean of dimension m over the T samples
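
As a rough illustration of the covariance formula, the sketch below computes the sample covariance of two dimensions with the 1/(T-1) normalisation. The function name `covariance` and its arguments are illustrative choices, not part of the original card.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (1/(T-1) normalisation)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 samples")
    T = len(xs)
    # One mean per dimension, taken over all T samples.
    mean_x = sum(xs) / T
    mean_y = sum(ys) / T
    # Sum of products of deviations from the means, divided by T - 1.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (T - 1)
```

Calling `covariance(xs, xs)` gives the sample variance of `xs`.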

Question 3

Q

What is the formula for the covariance matrix?

Answer

A

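Σ = U D U^T
D is a diagonal matrix containing the eigenvalues of Σ
U is an orthogonal matrix whose columns are the eigenvectors of Σ; it corresponds to a rotation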
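
To make the decomposition concrete, here is a minimal sketch for the 2x2 case only, where the rotation can be written in closed form as a single angle. The function names and the restriction to 2x2 symmetric matrices are assumptions made for illustration; a general implementation would use a numerical eigensolver.

```python
import math

def eig_decompose_2x2(a, b, c):
    """Decompose the symmetric matrix [[a, b], [b, c]] as U D U^T.

    Returns (theta, (d1, d2)), where U is the rotation by angle theta
    and D = diag(d1, d2) with d1 >= d2.
    """
    theta = 0.5 * math.atan2(2 * b, a - c)   # rotation angle of U
    mid = (a + c) / 2
    r = math.hypot((a - c) / 2, b)
    return theta, (mid + r, mid - r)

def reconstruct_2x2(theta, d):
    """Rebuild (a, b, c) of [[a, b], [b, c]] from U D U^T."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    d1, d2 = d
    a = d1 * cos_t ** 2 + d2 * sin_t ** 2
    b = (d1 - d2) * cos_t * sin_t
    c = d1 * sin_t ** 2 + d2 * cos_t ** 2
    return a, b, c
```

The larger entry of D is the variance along the direction obtained by rotating the first axis by theta.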

Question 4

Q

How are the results of principal component analysis interpreted?

Answer

A

The larger a value in D, the larger the variance along the corresponding direction (principal component)

Question 5

Q

What is agglomerative clustering?

Answer

A

Start by assuming each data point belongs to its own unique cluster
Clusters are combined until the required number of clusters is reached
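
A minimal sketch of this procedure is shown below. The card does not say how the distance between two clusters is measured, so the sketch assumes single linkage (the smallest Euclidean distance between any pair of members); the function name and this linkage choice are illustrative only.

```python
import math

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Assumption: single-linkage distance with Euclidean point distance.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]

    def linkage(c1, c2):
        return min(math.dist(p, q) for p in c1 for q in c2)

    while len(clusters) > n_clusters:
        best = None
        # Find the closest pair of clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Combine the pair into a single cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```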

Question 6

Q

What is divisive clustering?

Answer

A

Start by assuming there’s one centroid
At each step, each centroid is replaced by two centroids

Question 7

Q

What does K represent in K-means clustering?

Answer

A

The required number of centroids (clusters)

Question 8

Q

What are the steps to K-means clustering?

Answer

A

Initialise K centroids
Assign each point to the closest centroid
Calculate the mean of the points allocated to each centroid
Set each new centroid to the corresponding mean
Repeat from the assignment step until the centroids stop changing
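
The steps above can be sketched as follows. The card does not specify how the centroids are initialised or when to stop, so this sketch assumes the caller supplies the initial centroids, uses Euclidean distance, and stops when the centroids no longer change or after `max_iters` passes. All names are illustrative.

```python
import math

def k_means(points, initial_centroids, max_iters=100):
    """Plain K-means; K is the number of initial centroids supplied."""
    centroids = [tuple(c) for c in initial_centroids]
    groups = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assign each point to its closest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group
        # (a centroid with no points is left where it is).
        new_centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups
```

The result depends on the initial centroids, so different starting points can give different clusterings.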

Question 9

Q

What is the formula for a 1-dimensional Gaussian PDF?

Answer

A

p(x) = 1/(√(2π) σ) exp(-(x-μ)^2 / (2σ^2))
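
For reference, a small sketch that evaluates the 1-dimensional Gaussian PDF in its standard form, with σ as the standard deviation and σ^2 in the exponent. The function name is an illustrative choice.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    norm = math.sqrt(2 * math.pi) * sigma          # normalising constant
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / norm
```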

Question 10

Q

What is the formula for an N dimensional Gaussian PDF?

Answer

A

p(x)=1/√((2π)^N |Σ|) exp(-1/2 (x-m)^T Σ^(-1) (x-m))
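
As an illustration, the sketch below evaluates this formula for the special case N = 2, where the determinant and inverse of Σ have simple closed forms. The restriction to two dimensions and the function name are assumptions made to keep the example dependency-free.

```python
import math

def gaussian_pdf_2d(x, m, cov):
    """Density of a 2-D Gaussian with mean m and 2x2 covariance matrix cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c                      # |Σ|
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx0, dx1 = x[0] - m[0], x[1] - m[1]
    # (x-m)^T Σ^(-1) (x-m), using Σ^(-1) = (1/det) [[d, -b], [-c, a]]
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** 2 * det)
```

With a diagonal covariance matrix the result equals the product of two 1-dimensional Gaussian densities.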

Question 11

Q

What is a Gaussian mixture model?

Answer

A

A weighted average of several component Gaussian PDFs
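
A minimal 1-dimensional sketch of this idea: the mixture density is the sum of the component densities, each multiplied by its weight, where the weights are non-negative and sum to 1. The function names and argument layout are illustrative.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def gmm_pdf(x, weights, mus, sigmas):
    """Density of a 1-D Gaussian mixture: weighted sum of component PDFs."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("component weights must sum to 1")
    return sum(w * gaussian_pdf(x, mu, s)
               for w, mu, s in zip(weights, mus, sigmas))
```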

Question 12

Q

What is the E-M algorithm used for?

Answer

A

Estimating the parameters of a Gaussian mixture model (such as μ and σ of each component) that best fit the given data

Question 13

Q

What are the steps of the E-M algorithm?

Answer

A

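Choose a number of components, M, and initial parameters
For each sample y and each component m, calculate P(m|y)
Re-estimate μ and σ for each component, weighting each sample by P(m|y)
Repeat the last two steps until the parameters converge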
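
These steps can be sketched for a 1-dimensional Gaussian mixture as below. This is a simplified illustration under stated assumptions: the component weights are re-estimated along with μ and σ, a fixed number of iterations stands in for a convergence test, and σ is floored at a small value to avoid a collapsed component. All names are hypothetical.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_step(data, weights, mus, sigmas):
    """One E-M iteration for a 1-D Gaussian mixture."""
    M = len(weights)
    # E-step: P(m|y) for every sample y and component m.
    # (Assumes every sample has non-zero density under some component.)
    resp = []
    for y in data:
        joint = [weights[m] * gaussian_pdf(y, mus[m], sigmas[m]) for m in range(M)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate parameters, weighting each sample by P(m|y).
    new_w, new_mu, new_sigma = [], [], []
    for m in range(M):
        n_m = sum(r[m] for r in resp)
        mu = sum(r[m] * y for r, y in zip(resp, data)) / n_m
        var = sum(r[m] * (y - mu) ** 2 for r, y in zip(resp, data)) / n_m
        new_w.append(n_m / len(data))
        new_mu.append(mu)
        new_sigma.append(max(math.sqrt(var), 1e-6))  # floor is an assumption
    return new_w, new_mu, new_sigma

def em_fit(data, weights, mus, sigmas, n_iters=20):
    """Run a fixed number of E-M iterations from the given initial parameters."""
    for _ in range(n_iters):
        weights, mus, sigmas = em_step(data, weights, mus, sigmas)
    return weights, mus, sigmas
```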

Question 14

Q

What is dynamic programming used for?

Answer

A

Finding the optimal alignment path between two sequences

Question 15

Q

What is the formula for ad(i, j)?

Answer

A

ad(i, j) is the sum of distances along the best path from (1, 1) to (i, j)
ad(i,j)=min{ad(i-1,j) + K_del + d(i,j), ad(i-1,j-1) + d(i,j), ad(i,j-1) + K_ins + d(i,j)}
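
The recurrence can be sketched as a table fill, shown below with 0-based indices. The base case ad(1, 1) = d(1, 1), the handling of the first row and column (only the moves that exist are considered), and the function and parameter names are assumptions made for illustration.

```python
def accumulated_distance(xs, ys, dist, k_del=0.0, k_ins=0.0):
    """Fill the accumulated-distance table ad for sequences xs and ys.

    dist(x, y) is the local distance d(i, j); k_del and k_ins are the
    deletion and insertion penalties. ad[i][j] uses 0-based indices.
    """
    I, J = len(xs), len(ys)
    INF = float("inf")
    ad = [[INF] * J for _ in range(I)]
    for i in range(I):
        for j in range(J):
            d = dist(xs[i], ys[j])
            if i == 0 and j == 0:
                ad[i][j] = d               # assumed base case
                continue
            best = INF
            if i > 0:
                best = min(best, ad[i - 1][j] + k_del)      # deletion
            if i > 0 and j > 0:
                best = min(best, ad[i - 1][j - 1])          # match
            if j > 0:
                best = min(best, ad[i][j - 1] + k_ins)      # insertion
            ad[i][j] = best + d
    return ad
```

The bottom-right entry, `ad[-1][-1]`, is the total cost of the best path through the whole table.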

Question 16

Q

What are the three distances used in dynamic programming?

Answer

A

d_1 = city block
d_2 = Euclidean
d_∞ = maximum (Chebyshev)

Question 17

Q

What is the formula for city block distance?

Answer

A

d_1(x, y) = ∑_n |x_n - y_n|

Question 18

Q

What is the formula for Euclidean distance?

Answer

A

d_2(x, y) = √(∑_n (x_n - y_n)^2)

Question 19

Q

What is the formula for d_∞?

Answer

A

d_∞(x, y) = max_n |x_n - y_n|
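
The city block, Euclidean and d_∞ distances from the preceding cards can be written side by side as a quick sketch. The function names are illustrative; each takes two vectors of equal length.

```python
import math

def city_block(x, y):
    """d_1: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """d_2: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def d_infinity(x, y):
    """d_∞: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))
```

For any pair of vectors, d_∞ ≤ d_2 ≤ d_1.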

Question 20

Q

What is distortion?

Answer

A

A measure of how well a set of centroids models a set of data

Question 21

Q

How is distortion calculated?

Answer

A

The sum of distances between each data point and its nearest centroid
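
A minimal sketch of this calculation, assuming Euclidean distance between points and centroids (the card does not name the distance measure). The function name is illustrative.

```python
import math

def distortion(points, centroids):
    """Sum over all points of the distance to the nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points)
```

A lower value means the centroids sit closer to the data they represent.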
