Data Mining Flashcards
What is data mining?
The analysis of data to discover structure, patterns and relationships
What is the formula for covariance?
σ^(m,n) = 1/(T-1) ∑_t (x_t^m - μ^m)(x_t^n - μ^n), where μ^m is the mean of dimension m over the T samples
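As a rough illustration of the covariance formula, the sketch below computes the sample covariance of two dimensions in plain Python. The function name `covariance` and the toy data are assumptions made for this example only.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by T - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    T = len(xs)
    mean_x = sum(xs) / T  # the mean does not depend on t
    mean_y = sum(ys) / T
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (T - 1)


# Hypothetical toy data: the second dimension is exactly twice the first.
dim_m = [1.0, 2.0, 3.0, 4.0]
dim_n = [2.0, 4.0, 6.0, 8.0]
print(covariance(dim_m, dim_n))  # covariance between the two dimensions
print(covariance(dim_m, dim_m))  # covariance of a dimension with itself is its variance
```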
How can the covariance matrix be decomposed?
Σ = UDU^T
D is a diagonal matrix of eigenvalues
U is an orthogonal matrix of eigenvectors and corresponds to a rotation
How are the results of principal component analysis interpreted?
The larger a value in D, the larger the variance along the corresponding direction (column of U)
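To make the interpretation concrete, here is a minimal sketch for the 2-dimensional case only, where the entries of D and the rotation angle of U have a closed form. The function name `pca_2d` and the example matrix are assumptions for illustration. A general implementation would use a numerical eigensolver instead.

```python
import math


def pca_2d(cov):
    """Eigendecomposition of a symmetric 2x2 covariance matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    centre = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    variances = (centre + radius, centre - radius)    # diagonal of D, largest first
    theta = 0.5 * math.atan2(2 * b, a - c)            # rotation angle described by U
    first_axis = (math.cos(theta), math.sin(theta))   # first column of U
    return variances, theta, first_axis


# Hypothetical covariance matrix with equal variances and positive correlation.
variances, theta, first_axis = pca_2d([[2.0, 1.0], [1.0, 2.0]])
print(variances)    # most of the variance lies along the first axis
print(first_axis)   # direction of largest variance (the diagonal x = y)
```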
What is agglomerative clustering?
Start by assuming each data point belongs to its own unique cluster
At each step the two closest clusters are merged, until the required number of clusters is reached
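The merging procedure can be sketched as follows for 1-dimensional points. The card does not say how the distance between clusters is measured, so single linkage (distance between the closest pair of members) is an assumption here, as are the function name `agglomerative` and the sample data.

```python
def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (1-D, single linkage)."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[p] for p in points]  # every point starts in its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Assumed linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters


print(agglomerative([1.0, 1.5, 2.0, 8.0, 9.0], 2))
```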
What is divisive clustering?
Start by assuming all data points belong to a single cluster with one centroid
At each step each centroid is split into two centroids, until the required number is reached
What does K represent in K-means clustering?
The required number of centroids (clusters)
What are the steps to K-means clustering?
- Initialise K centroids
- Assign each point to the closest centroid
- Calculate the mean of the points allocated to each centroid
- Set each new centroid to the corresponding mean
- Repeat from the assignment step until the centroids stop changing
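The steps above can be sketched in plain Python as shown below. Several choices are assumptions rather than part of the card: the first K points are used as the initial centroids, squared Euclidean distance decides which centroid is closest, an empty group keeps its old centroid, and the helper names are made up for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(group):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(group)
    return tuple(sum(coords) / n for coords in zip(*group))


def kmeans(points, k, max_iters=100):
    centroids = list(points[:k])  # assumption: first k points as initial centroids
    groups = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each point joins the group of its closest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        new_centroids = [
            mean_point(g) if g else centroids[c] for c, g in enumerate(groups)
        ]
        if new_centroids == centroids:  # no change, so the algorithm has converged
            break
        centroids = new_centroids
    return centroids, groups  # groups come from the last assignment step


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, groups = kmeans(points, 2)
print(centroids)
```

The result depends on the initial centroids, so a practical version would usually try several random initialisations.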
What is the formula for a 1-dimensional Gaussian PDF?
p(x) = 1/√(2πσ^2) exp(-(x-μ)^2/(2σ^2)), where σ is the standard deviation
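A small sketch of evaluating the 1-dimensional Gaussian PDF, assuming `sigma` is the standard deviation (so `sigma ** 2` is the variance). The function name is illustrative.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    variance = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


print(gaussian_pdf(0.0, 0.0, 1.0))  # peak of the standard normal, 1 / sqrt(2 * pi)
print(gaussian_pdf(2.0, 0.0, 1.0))  # lower density two standard deviations away
```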
What is the formula for an N dimensional Gaussian PDF?
p(x)=1/√((2π)^N |Σ|) exp(-1/2 (x-m)^T Σ^(-1) (x-m))
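For N = 2 the determinant |Σ| and the inverse Σ^(-1) can be written out by hand, which gives the following sketch. It covers the 2-dimensional case only, and the function name and argument layout are assumptions for illustration.

```python
import math


def gaussian_pdf_2d(x, m, cov):
    """Density of a 2-D Gaussian with mean m and covariance [[a, b], [c, d]]."""
    (a, b), (c, d) = cov
    det = a * d - b * c  # |Σ|
    if det <= 0:
        raise ValueError("covariance matrix must have a positive determinant")
    dx0, dx1 = x[0] - m[0], x[1] - m[1]
    # (x-m)^T Σ^(-1) (x-m), using Σ^(-1) = 1/det * [[d, -b], [-c, a]]
    quad = (dx0 * (d * dx0 - b * dx1) + dx1 * (-c * dx0 + a * dx1)) / det
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** 2 * det)


# With an identity covariance the peak density is 1 / (2 * pi).
print(gaussian_pdf_2d((0.0, 0.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]))
```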
What is a Gaussian mixture model?
A weighted sum of several component Gaussian PDFs, where the weights are non-negative and sum to 1
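A minimal sketch of a 1-dimensional Gaussian mixture density, assuming each component is described by a weight, a mean and a standard deviation. The function names and the example parameters are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    variance = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def gmm_pdf(x, weights, mus, sigmas):
    """Mixture density: weighted sum of the component densities."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("component weights must sum to 1")
    return sum(w * gaussian_pdf(x, mu, s) for w, mu, s in zip(weights, mus, sigmas))


# Hypothetical two-component mixture with peaks near 0 and 5.
print(gmm_pdf(0.0, [0.3, 0.7], [0.0, 5.0], [1.0, 1.0]))
print(gmm_pdf(5.0, [0.3, 0.7], [0.0, 5.0], [1.0, 1.0]))
```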
What is the E-M algorithm used for?
Iteratively estimating the parameters of a Gaussian mixture model (component weights, μ and σ) that best fit the given data
What are the steps of the E-M algorithm?
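- Choose a number of components, M, and initial parameters
- For each sample y and each component m calculate P(m|y)
- Re-estimate the weight, μ and σ of each component, weighting each sample by P(m|y)
- Repeat the last two steps until the parameters converge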
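One iteration of these steps might look like the sketch below for a 1-dimensional mixture. It assumes σ is a standard deviation, re-estimates the component weights along with μ and σ, and applies a small floor to σ to avoid a collapsed component. The floor, the function names and the toy data are assumptions, not part of the card.

```python
import math


def gaussian_pdf(x, mu, sigma):
    variance = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def em_step(data, weights, mus, sigmas, min_sigma=1e-6):
    """One E-M iteration for a 1-D Gaussian mixture; returns updated parameters."""
    M = len(weights)
    # E-step: P(m|y) for every sample y and component m.
    resp = []
    for y in data:
        joint = [weights[m] * gaussian_pdf(y, mus[m], sigmas[m]) for m in range(M)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate each component, weighting samples by P(m|y).
    new_weights, new_mus, new_sigmas = [], [], []
    for m in range(M):
        n_m = sum(r[m] for r in resp)  # effective number of samples for component m
        mu = sum(r[m] * y for r, y in zip(resp, data)) / n_m
        var = sum(r[m] * (y - mu) ** 2 for r, y in zip(resp, data)) / n_m
        new_weights.append(n_m / len(data))
        new_mus.append(mu)
        new_sigmas.append(max(math.sqrt(var), min_sigma))  # assumed floor on sigma
    return new_weights, new_mus, new_sigmas


# Hypothetical data drawn around 0 and 5, with deliberately rough initial guesses.
data = [0.0, 0.2, -0.2, 5.0, 5.2, 4.8]
weights, mus, sigmas = [0.5, 0.5], [1.0, 4.0], [1.0, 1.0]
for _ in range(20):
    weights, mus, sigmas = em_step(data, weights, mus, sigmas)
print(weights, mus, sigmas)
```

A fuller version would stop when the log-likelihood stops improving rather than after a fixed number of iterations.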
What is dynamic programming used for?
Finding the optimal (lowest-cost) alignment path between two sequences
What is the formula for ad(i, j)?
ad(i, j) is the sum of distances along the best path from (1, 1) to (i, j)
ad(i,j)=min{ad(i-1,j) + K_del + d(i,j), ad(i-1,j-1) + d(i,j), ad(i,j-1) + K_ins + d(i,j)}
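The recurrence can be sketched as a table-filling routine. The code uses 0-based indices, so the card's (1, 1) is `ad[0][0]`. Three things are assumptions for illustration: the local distance d(i, j) is the absolute difference of two numbers, the start cell is set to d(1, 1), and cells on the first row or column use only the predecessors that exist.

```python
def accumulated_distance(a, b, k_del=1.0, k_ins=1.0):
    """Fill the table ad[i][j] for two non-empty numeric sequences a and b."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    ad = [[0.0] * len(b) for _ in a]
    for i in range(len(a)):
        for j in range(len(b)):
            d = abs(a[i] - b[j])  # assumed local distance d(i, j)
            if i == 0 and j == 0:
                ad[i][j] = d  # the path starts here
                continue
            candidates = []
            if i > 0:
                candidates.append(ad[i - 1][j] + k_del)   # step with deletion penalty
            if i > 0 and j > 0:
                candidates.append(ad[i - 1][j - 1])       # diagonal step, no penalty
            if j > 0:
                candidates.append(ad[i][j - 1] + k_ins)   # step with insertion penalty
            ad[i][j] = min(candidates) + d
    return ad


# Hypothetical sequences: b repeats one element of a, so one penalised step is needed.
table = accumulated_distance([1, 2, 3], [1, 2, 2, 3])
print(table[-1][-1])  # accumulated distance along the best path to the final cell
```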