Oleg Flashcards

1
Q

Principal component analysis (PCA). Definition, main goals, and steps.

A
  • Dimensionality reduction
  • Data visualization
  • Feature extraction
  1. Standardize the range of continuous initial variables.
  2. Compute the covariance matrix to identify correlations.
  3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components.
  4. Create a feature vector to decide which principal components to keep.
  5. Recast the data along the principal component axes (see the sketch below).
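A minimal sketch of these steps in Python, assuming scikit-learn is available; the data, variable names, and choice of two components are illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative data: 100 observations of 5 continuous variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Step 1: standardize the variables.
X_std = StandardScaler().fit_transform(X)

# Steps 2-4: PCA computes the covariance structure, its
# eigenvectors/eigenvalues, and keeps the requested components.
pca = PCA(n_components=2)

# Step 5: recast the data along the principal component axes.
scores = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)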
2
Q

Optimization of PCA dimension by cumulative proportion of explained variance (elbow rule)

A
  • A scree plot displays how much variation each principal component captures from the data.
  • Choose the number of components at which the curve flattens out (the "elbow").
  • Proportion of variance plot: the selected PCs should together describe at least 80% of the variance (see the sketch below).
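A small sketch of the cumulative-variance rule in Python, assuming scikit-learn; the data are illustrative and the 80% threshold follows the card above:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(1).normal(size=(200, 10))  # illustrative data

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of PCs that together explain at least 80% of the variance.
k = int(np.argmax(cum_var >= 0.80)) + 1
print(k, cum_var[:k])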
3
Q

Multidimensional scaling (MDS). Definition, main goals, and steps.

A

In general, metric MDS calculates the distances between each pair of points in the original high-dimensional space and then maps them to a lower-dimensional space, preserving those distances as well as possible.

  • Assign the points to initial coordinates in n-dimensional space.
  • Calculate the Euclidean distances for all pairs of points.
  • Compare this similarity matrix with the original input matrix.
  • Adjust the coordinates, if necessary, to minimize the stress (see the sketch below).
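A brief sketch of metric MDS, assuming scikit-learn; the data are illustrative:

import numpy as np
from sklearn.manifold import MDS

X = np.random.default_rng(2).normal(size=(50, 8))  # illustrative data

# Metric MDS: embed the pairwise Euclidean distances into 2-D
# coordinates by iteratively adjusting them to minimize the stress.
mds = MDS(n_components=2, random_state=0)
X_2d = mds.fit_transform(X)
print(X_2d.shape, mds.stress_)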
4
Q

Classification. Main goals and steps (discrimination, validation, testing).

A

{discrimination}
* Identify relevant features for the classification problem and propose models and methods that allow one to develop reasonable classification rules.

{validation}
* Validate how these methods perform on actual data sets and decide on the optimal method.

{test}
* Test how the optimal method performs on a data set that was not used for the discrimination and method-selection stages (see the sketch below).
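A minimal sketch of the three stages in Python, assuming scikit-learn; the data set, classifier, and split sizes are illustrative:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set that stays untouched until the final stage.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Discrimination + validation: compare candidate methods by cross-validation.
clf = LinearDiscriminantAnalysis()
print(cross_val_score(clf, X_dev, y_dev, cv=5).mean())

# Test: evaluate the chosen method once on the held-out data.
clf.fit(X_dev, y_dev)
print(clf.score(X_test, y_test))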

5
Q

Linear discriminant analysis (LDA)

A
  • Dimension reduction and classification technique.
  • The data in each class must be normally (Gaussian) distributed, with a common covariance matrix across classes.

+ Only a few parameters to estimate, so the estimates are accurate.

− Less flexible (linear decision boundary).
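A short sketch of fitting LDA, assuming scikit-learn; the data set is illustrative:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# One covariance matrix shared by all classes -> linear boundaries.
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict(X[:5]), lda.score(X, y))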
6
Q

Quadratic discriminant analysis (QDA)

A

Quadratic discriminant analysis is quite similar to linear discriminant analysis, except that we relax the assumption that the covariance matrices of all the classes are equal. Therefore, we have to estimate a covariance matrix separately for each class.

− Many parameters to estimate, so less accurate estimates.

+ More flexible (quadratic decision boundary).
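A matching sketch for QDA, again assuming scikit-learn and an illustrative data set:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# A separate covariance matrix per class -> quadratic boundaries.
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(qda.score(X, y))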
7
Q
Unsupervised learning (PCA and cluster analysis). Clustering methods. Main goals and steps.
A
  • PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.
  • Clustering looks for homogeneous subgroups among the observations.
  • (Agglomerative) hierarchical clustering builds a nested tree of clusters.
  • (Divisive) K-means clustering (incl. K-medoids (PAM)) seeks to partition the observations into a pre-specified number of clusters.
8
Q

Hierarchical (Agglomerative) clustering.

A
  • (Agglomerative) hierarchical clustering is used when we do not know in advance how many clusters we want; we end up with a tree-like visual representation of the observations, called a dendrogram, which shows at once the clusterings obtained for each possible number of clusters, from 1 to n.
  • Observations that are grouped together at some point cannot be separated again later.
  • By cutting the tree at a certain height, one obtains a given number of clusters.
  • Results depend on how we measure distances between observations and between clusters (see the sketch below).

+ Obtains the solution for all possible numbers of clusters at once.

− Slow.
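A brief sketch using SciPy's agglomerative clustering; the data, linkage choice, and number of clusters are illustrative:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(3).normal(size=(30, 4))  # illustrative data

# Agglomerative clustering with average linkage builds the dendrogram.
Z = linkage(X, method="average")

# Cutting the tree: here into a fixed number of clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)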
9
Q

Dissimilarities between groups of data, linkages (single, complete, average, centroid)

A

How do we measure the distance between two clusters A and B?

* Single linkage: $d_{AB} = \min_{i \in A,\, j \in B} d_{ij}$
  (minimal distance over all element pairs of the two clusters)
  Suitable for finding stretched-out clusters.

* Complete linkage: $d_{AB} = \max_{i \in A,\, j \in B} d_{ij}$
  (maximal distance over all element pairs of the two clusters)
  Suitable for finding compact but not well-separated clusters.

* Average linkage: $d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A,\, j \in B} d_{ij}$
  (average distance over all element pairs of the two clusters)
  Suitable for finding well-separated, potato-shaped clusters.

* Centroid linkage: the distance between the centroids (mean points) of the two clusters (see the sketch below).
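A short sketch comparing the four linkages with SciPy; the data are illustrative (centroid linkage is given the raw observations, since it assumes Euclidean geometry):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.random.default_rng(4).normal(size=(20, 2))  # illustrative data
D = pdist(X)  # pairwise distances d_ij

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X if method == "centroid" else D, method=method)
    print(method, Z[-1, 2])  # height of the final merge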
10
Q

Partitioning (Divisive) clustering (K-means, K-medoids (PAM))

A
  • (Divisive) In K-means clustering (incl. K-medoids (PAM)), we seek to partition the observations into a pre-specified number of clusters (see the sketch below).

+ Fast and scales well to large data.

− No underlying model.
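A minimal K-means sketch, assuming scikit-learn; the data and K = 3 are illustrative:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(5).normal(size=(150, 2))  # illustrative data

# Partition the observations into a pre-specified number of clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])
print(km.cluster_centers_)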
11
Q

Within-cluster variation.

A

Goal: minimize the within-cluster variation

$W(C_k) = \frac{1}{|C_k|} \sum_{i,\,l \in C_k} \| x_i - x_l \|^2$
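A tiny sketch computing this quantity directly in Python; the data and cluster labels are illustrative:

import numpy as np

def within_cluster_variation(X, labels, k):
    # W(C_k) = (1/|C_k|) * sum over all pairs i, l in C_k of ||x_i - x_l||^2
    C = X[labels == k]
    diffs = C[:, None, :] - C[None, :, :]
    return (diffs ** 2).sum() / len(C)

X = np.random.default_rng(6).normal(size=(30, 2))  # illustrative data
labels = np.repeat([0, 1, 2], 10)                  # illustrative clusters

# K-means minimizes the sum of W(C_k) over all clusters.
print(sum(within_cluster_variation(X, labels, k) for k in range(3)))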

12
Q

Interpretation of clustering results and model checking. Selecting the number of clusters.

A

Option 1:
* Look at the positions of cluster centers or cluster representatives (especially easy in PAM).

Option 2:

  • Apply a dimension reduction technique (such as PCA).
  • Plot the reduced-dimensional data (e.g., PC scores).
  • Label/color the points according to the cluster they belong to.

Quality of clustering: Silhouette plot

  • S(i) large: well clustered.
  • S(i) small: badly clustered.
  • S(i) negative: assigned to the wrong cluster.
  • A cluster average of S above 0.5 is acceptable (see the sketch below).
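A short sketch of the silhouette check, assuming scikit-learn; the data and candidate numbers of clusters are illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(7).normal(size=(150, 2))  # illustrative data

# Compare candidate numbers of clusters by the average silhouette width.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))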
13
Q

Supervised learning

A
  • Supervision: the training data (observations or measurements) are accompanied by labels indicating the classes to which they belong.
  • New data are classified based on the model built from the training set.
14
Q

Unsupervised learning (clustering)

A
  • The class labels of the training data are unknown.

  • Given a set of observations or measurements, establish the possible existence of classes or clusters in the data.

15
Q

Classification or numeric prediction

A
{Classification}
* Predict categorical class labels (discrete or nominal).
* Construct a model based on the training set and the class labels (the values of a classifying attribute) and use it to classify new data.

{Numeric prediction}
* Model continuous-valued functions (i.e., predict unknown or missing values).

16
Q

Optimal classification (Bayes rule)

A

Let $f_0$ and $f_1$ be the class densities and let $c$ be a threshold determined by the prior probabilities and misclassification costs. The optimal classification rule that minimizes the expected misclassification risk $E(C)$ assigns $x$ to the regions

$R_1: \frac{f_1(x)}{f_0(x)} \ge c, \qquad R_0: \frac{f_1(x)}{f_0(x)} < c$

For a prior distribution that gives $c = 1$, the Bayes rule is the same as the maximum likelihood rule (see the sketch below).
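A minimal numeric sketch of the rule for two univariate Gaussian classes, assuming SciPy; the class densities are illustrative, and c = 1 corresponds to equal priors and costs:

from scipy.stats import norm

# Illustrative class densities: class 0 ~ N(0, 1), class 1 ~ N(2, 1).
f0 = norm(loc=0.0, scale=1.0)
f1 = norm(loc=2.0, scale=1.0)

def bayes_classify(x, c=1.0):
    # Assign to class 1 (region R1) iff the likelihood ratio f1(x)/f0(x) >= c.
    return int(f1.pdf(x) / f0.pdf(x) >= c)

# With c = 1 the rule reduces to maximum likelihood: the boundary is at x = 1.
for x in [0.0, 0.9, 1.1, 2.0]:
    print(x, bayes_classify(x))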