Oleg Flashcards
Principal component analysis (PCA). Definition and main goals and steps.
- Dimension reduction
- Data visualization
- Feature extraction
- Standardize the range of continuous initial variables.
- Compute the covariance matrix to identify correlations.
- Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components.
- Create a feature vector to decide which principal components to keep.
- Recast the data along the principal component axes.
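A minimal NumPy sketch of these steps (the toy data and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy data: 100 observations, 5 variables

# 1. Standardize the variables.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data.
C = np.cov(Z, rowvar=False)

# 3. Eigenvectors / eigenvalues of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)   # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Feature vector: keep the first k principal components.
k = 2
W = eigvecs[:, :k]

# 5. Recast the data along the principal component axes (scores).
scores = Z @ W
print(scores.shape)                    # (100, 2)
```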
Optimization of PCA dimension by cumulative proportion of explained variance (elbow rule)
- A scree plot displays how much variation each principal component captures from the data.
- Choose the number of components at the point where the curve flattens out ("the elbow").
- Proportion of variance plot: the selected PCs should be able to describe at least 80% of the variance.
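A possible scikit-learn sketch, assuming a data matrix X; the 80% threshold follows the rule above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(size=(100, 8))    # placeholder data

pca = PCA().fit(StandardScaler().fit_transform(X))

# Proportion of variance explained by each PC (the scree plot values)
# and the smallest number of PCs explaining at least 80% of the variance.
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_pc = int(np.argmax(cum_var >= 0.80)) + 1
print(pca.explained_variance_ratio_.round(3), n_pc)
```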
Multidimensional scaling (MDS). Definition and main goals and steps.
In general, metric MDS calculates the distances between each pair of points in the original high-dimensional space and then maps them to a lower-dimensional space that preserves those distances as well as possible.
- Assign a number of points to coordinates in n-dimensional space.
- Calculate Euclidean distances for all pairs of points.
- Compare the similarity matrix with the original input matrix.
- Adjust coordinates, if necessary, to minimize stress.
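A minimal metric-MDS sketch with scikit-learn (toy data; stress_ is the quantity minimized in the last step):

```python
import numpy as np
from sklearn.manifold import MDS

X = np.random.default_rng(2).normal(size=(50, 10))   # high-dimensional toy data

mds = MDS(n_components=2, metric=True, random_state=0)
X_low = mds.fit_transform(X)       # coordinates in 2-D space

print(X_low.shape)                 # (50, 2)
print(mds.stress_)                 # residual stress after fitting
```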
Classification. Main goals and steps (discrimination, validation, testing).
{discrimination}
* Identify relevant features for the classification problem and propose models and methods that allow us to develop reasonable classification rules.
{validation}
* Validate how these methods perform on actual data sets and decide on the optimal method.
{test}
* Test how the optimal method performs on a data set that was not used in the discrimination and method-selection stages.
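One common way to organize the three stages is a train/validation/test split; the sketch below assumes LDA and QDA as the candidate methods and uses the iris data as a stand-in (illustrative choices, not course code):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set, then split off a validation part from the rest.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Discrimination: fit candidate classification rules on the training data.
candidates = {"LDA": LinearDiscriminantAnalysis(), "QDA": QuadraticDiscriminantAnalysis()}
for clf in candidates.values():
    clf.fit(X_train, y_train)

# Validation: pick the method that performs best on the validation data.
best_name = max(candidates, key=lambda name: candidates[name].score(X_val, y_val))

# Test: report performance of the chosen method on data not used so far.
print(best_name, candidates[best_name].score(X_test, y_test))
```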
Linear discriminant analysis (LDA)
- Dimension reduction technique
- Data must be normally (Gaussian) distributed.
+ Only a few parameters to estimate, so estimates are accurate.
- Less flexible (linear decision boundary).
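A short sketch showing LDA used both as a classifier and as a dimension-reduction technique (scikit-learn; the iris data is only an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

print(lda.score(X, y))            # classification accuracy (linear boundaries)
print(lda.transform(X).shape)     # (150, 2): data projected onto the discriminants
```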
Quadratic discriminant analysis (QDA)
Quadratic discriminant analysis is quite similar to linear discriminant analysis, except that we relax the assumption that the covariance matrices of all classes are equal; therefore, a covariance matrix must be estimated separately for each class.
- Many parameters to estimate, so estimates are less accurate.
+ More flexible (quadratic decision boundary).
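A sketch contrasting QDA with LDA on the same illustrative data; QDA estimates one covariance matrix per class, hence more parameters and a quadratic boundary:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)

# QDA stores one covariance matrix per class; LDA pools them into one.
print(len(qda.covariance_), qda.covariance_[0].shape)   # 3 classes, 4x4 each
print(lda.score(X, y), qda.score(X, y))
```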
Unsupervised learning (PCA and cluster analysis). Clustering methods. Main goals and steps.
- PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.
- Clustering looks for homogeneous subgroups among the observations.
- (Agglomerative) hierarchical clustering.
- (Divisive) In K-means clustering (incl. K-medoids(PAM)), we seek to partition the observations into a pre-specified number of clusters.
Hierarchical (Agglomerative) clustering.
- (Agglomerative) hierarchical clustering is used when we do not know in advance how many clusters we want; we end up with a tree-like visual representation of the observations, called a dendrogram, which shows at once the clusterings obtained for every possible number of clusters, from 1 to n.
- Observations that have been grouped together at some point cannot be separated again later.
- By cutting the tree at a certain height, one obtains a number of clusters.
- Results depend on how we measure distances between observations and between clusters.
+ Obtain solutions for all possible numbers of clusters at once.
- Slow.
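A minimal agglomerative sketch with SciPy (toy data; average linkage is just an example): the linkage matrix encodes the whole dendrogram, and cutting it yields a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(3).normal(size=(30, 2))     # toy observations

Z = linkage(X, method="average")                      # build the full merge tree
labels = fcluster(Z, t=3, criterion="maxclust")       # "cut" the tree into 3 clusters
print(labels)

# dendrogram(Z)   # uncomment (with matplotlib available) to draw the tree
```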
Dissimilarities between groups of data, linkages (single, complete, average, centroid)
How do we measure the distance between two clusters A and B?
* Single linkage: $d_{AB} = \min_{i \in A,\, j \in B} d_{ij}$
(minimal distance over all element pairs of the two clusters)
Suitable for finding stretched-out clusters.
* Complete linkage: $d_{AB} = \max_{i \in A,\, j \in B} d_{ij}$
(maximal distance over all element pairs of the two clusters)
Suitable for finding compact but not well separated clusters.
* Average linkage: $d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A,\, j \in B} d_{ij}$
(average distance over all element pairs of the two clusters)
Suitable for finding well separated, potato-shaped clusters.
* Centroid linkage: $d_{AB} = \lVert \bar{x}_A - \bar{x}_B \rVert$
(distance between the centroids, i.e. the mean points, of the two clusters)
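A small NumPy sketch of the four between-cluster distances above, computed directly from the pairwise distances d_ij (the clusters A and B are toy sets):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 2))             # points in cluster A
B = rng.normal(loc=3.0, size=(4, 2))    # points in cluster B

# All pairwise distances d_ij between points of A and points of B.
D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

d_single   = D.min()                                   # minimum over all pairs
d_complete = D.max()                                   # maximum over all pairs
d_average  = D.mean()                                  # average over all pairs
d_centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

print(d_single, d_complete, d_average, d_centroid)
```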
Partitioning (Divisive) clustering (K-means, K-medoids (PAM))
- In K-means clustering (incl. K-medoids (PAM)), we seek to partition the observations into a pre-specified number of clusters.
+ Fast and scales well to large data.
- No underlying model
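A minimal K-means sketch with scikit-learn (K = 3 is only an example; K-medoids/PAM works analogously with medoids as cluster representatives, e.g. via the scikit-learn-extra package):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(5).normal(size=(200, 2))    # toy data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])        # cluster assignments of the first observations
print(km.cluster_centers_)    # cluster centers
print(km.inertia_)            # total within-cluster sum of squares
```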
Within cluster variation.
Goal: minimize the total within-cluster variation.
$W(C_k) = \frac{1}{|C_k|} \sum_{i,\, l \in C_k} \lVert x_i - x_l \rVert^2$
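A direct NumPy translation of this formula (illustrative helper, not course code):

```python
import numpy as np

def within_cluster_variation(X_k):
    """W(C_k) = (1/|C_k|) * sum over all pairs i, l in C_k of ||x_i - x_l||^2."""
    diffs = X_k[:, None, :] - X_k[None, :, :]     # all pairwise differences
    return (diffs ** 2).sum() / len(X_k)          # summed squared distances / |C_k|

X_k = np.random.default_rng(6).normal(size=(10, 2))   # toy cluster
print(within_cluster_variation(X_k))
```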
Interpretation of clustering results and model checking. Selecting the number of clusters.
Option 1:
* Look at position of cluster centers or cluster representatives (especially easy in PAM).
Option 2:
- Apply a dimension reduction technique (such as PCA).
- Plot the reduced dimensional data (e.g., PC scores).
- Label/color the points according to the cluster they belong to.
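A sketch of Option 2 with scikit-learn and matplotlib (the data, K = 3 clusters, and 2 PCs are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(7).normal(size=(150, 6))    # toy data

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Reduce to 2 dimensions and plot the PC scores colored by cluster.
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
plt.scatter(scores[:, 0], scores[:, 1], c=labels)
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()
```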
Quality of clustering: Silhouette plot
- S(i) large: well clustered.
- S(i) small: badly clustered.
- S(i) negative: assigned to wrong cluster.
- A cluster average S above 0.5 is acceptable.
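A silhouette sketch with scikit-learn (toy data); the per-cluster averages correspond to the 0.5 guideline above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.default_rng(8).normal(size=(200, 2))    # toy data

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

S = silhouette_samples(X, labels)      # S(i) for every observation
print(silhouette_score(X, labels))     # overall average silhouette width
for k in range(3):                     # per-cluster averages (>= 0.5 is acceptable)
    print(k, S[labels == k].mean())
```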
Supervised learning
- Supervision: the training data, such as observations or measurements, are accompanied by labels indicating the classes to which they belong.
- New data is classified based on the models built from the training set.
Unsupervised learning (clustering)
- The class labels of training data are unknown.
* Given a set of observations or measurements, establish the possible existence of classes or clusters in the data.
Classification or numeric prediction
{Classification} * Predict categorical class labels (discrete or nominal).
- Construct a model based on the training set and the class labels (the values of a classifying attribute) and use it to classify new data.
{Numeric prediction}
* Model continuous-valued functions (i.e., predict unknown or missing values).