Data Transformations and Unsupervised Learning Techniques (10-20%) Flashcards

1
Q

Define a variable.

A

A variable is a measurement that is recorded and makes up part of the original dataset before any transformation takes place (where we do not consider data cleaning to be a transformation). This is a machine learning point of view, where variables are more closely associated with the raw data.
- Represents the predictors in the model

2
Q

Define a feature.

A

A feature is derived from the original variables and serves as a final input into the model (providing an alternative view of the information contained in the data).

3
Q

What’s the distinction between a variable and a final input?

A

Variables are the measurements that make up the raw data, while final inputs (features) are derived from those variables and fed into the model.

Note: a raw variable (assuming it has undergone data cleaning) can itself be considered a feature and used in the final model with no transformation

4
Q

Unstructured and structured data features?

A

Unstructured data - take the raw text variable and generate features derived from it, providing an alternative, easier-to-model view of the data.

Structured data - feature generation is more easily understood in the context of text, image, and audio data, but it still applies to traditional structured data. Each feature is useful information that could be predictive and has a more direct relationship with the target variable of interest, but it isn't immediately represented by the raw data.

5
Q

Limitations of modeling algorithms?

A
  1. The curse of dimensionality (leading to statistical insignificance)
  2. The need to represent the signal in a meaningful and interpretable way (or alternatively, capturing the signal in an efficient way)
  3. The need to capture complex signals in the data accurately
  4. Computational feasibility when the number of variables/features gets large
6
Q

Need for feature generation and feature selection?

A

Feature generation and selection aim to strike a balance between complexity and interpretability. Instead of letting the model do all the work and become overly complex, we try to transform the data by extracting the important features, allowing us to build a much simpler model.

7
Q

Define principle of parsimony.

A

When you have two comparably effective models, the simpler one is better (a feature selection principle), i.e., the model with the smaller number of variables.
- also called “Occam’s razor”

When a transformation is applied to the original data, we can see it from a different viewpoint (i.e., the feature space), from which a simpler model may achieve the same or even more predictive power than a complex model built in the original input space (especially true for classification models).

8
Q

Define feature generation.

A

The process of deriving new features based on underlying variables or features that serve as the final input into a model.
- A straightforward approach is to apply a transformation to a raw variable or other feature, e.g., deriving age from birthdate or calculating the change in a variable over time, such as a stock price movement
- A more complex approach is creating multiple features from a single variable or feature, e.g., binarization
- Combine multiple variables into a single feature using a transformation, e.g., combining sex and smoking status into a single feature indicating whether someone is a female smoker, male smoker, and so on (more examples in notes; a sketch follows below)
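A minimal pandas sketch of these ideas; the birth_date, sex, and smoker columns and the reference date are made-up examples, not from the source:

    import pandas as pd

    # Hypothetical raw data (column names are made up for illustration)
    df = pd.DataFrame({
        "birth_date": pd.to_datetime(["1980-05-01", "1992-11-15"]),
        "sex": ["F", "M"],
        "smoker": ["Y", "N"],
    })

    # Simple transformation of a single variable: age in years from birthdate
    df["age"] = (pd.Timestamp("2024-01-01") - df["birth_date"]).dt.days // 365

    # Combining two variables into one feature: sex and smoking status
    df["sex_smoker"] = df["sex"] + "_" + df["smoker"]   # e.g., "F_Y" = female smoker
    print(df)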

9
Q

Define binarization.

A

Binarization turns a categorical variable into several binary variables, one per level. Each binary variable indicates whether the observation takes that specific value, which not only allows the model to focus on that particular value, but also allows non-useful values to be filtered out as part of the feature selection process.
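As a rough illustration, binarization can be done with pandas.get_dummies; the vehicle_type column here is hypothetical:

    import pandas as pd

    df = pd.DataFrame({"vehicle_type": ["car", "truck", "motorcycle", "car"]})

    # Binarization: one binary column per level of the categorical variable
    binarized = pd.get_dummies(df["vehicle_type"], prefix="vehicle")
    print(binarized)
    # Individual columns (vehicle_car, vehicle_motorcycle, vehicle_truck) can now
    # be kept or dropped during feature selection.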

10
Q

Types of data transformations?

A

For an exponential relationship, perform a log transformation to make the effect look more linear (i.e., simpler).

11
Q

Explain how to address non-linear relationships and skewed distributions.

A
  • Identifying relationships between variables can be difficult when there are skewed distributions.
  • To address skewed distributions, apply a log transformation to the skewed data. After the transformation, the points are more evenly spread out so that patterns, if any, can be identified.

Caution: when modeling with transformed variables, it is important to remember to transform the resulting predictions back to the original (untransformed) scale, as in the sketch below.
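A hedged sketch of this workflow using numpy and scikit-learn on simulated data: fit on the log scale, then exponentiate the predictions to return to the original scale.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=(200, 1))
    y = np.exp(0.4 * x[:, 0] + rng.normal(scale=0.2, size=200))   # right-skewed target

    model = LinearRegression().fit(x, np.log(y))   # fit on the log scale
    pred = np.exp(model.predict(x))                # transform predictions back
    print(pred[:5])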

12
Q

Explain the steps for cleaning data and then selecting the features for use in future modeling.

A
  1. Obvious adjustments and checks - e.g., check the min and max values for each variable to ensure they are somewhat realistic; at this point we are checking for errors
  2. Check for outliers - note: variables that use numbers to represent factor levels cannot have outliers
  3. Make appropriate transformations - normally categorical variables are not transformed other than to conduct a binarization if needed
  4. Create appropriate new features - PCA and clustering can be used to create new features
  5. Final datasets (a rough sketch of these checks follows this list)
    - might consider making scatter plots or calculating correlations
    - may decide a few of the variables have no predictive power and could be eliminated; alternatively, use automated methods for removing variables (e.g., lasso) or retain all existing variables for future use
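A rough pandas sketch of the first and last steps, assuming a hypothetical data.csv with numeric columns:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("data.csv")   # hypothetical dataset

    # Step 1: obvious checks - are the min/max values for each variable realistic?
    print(df.describe())

    # Step 2: flag potential outliers, e.g., values more than 3 std devs from the mean
    num = df.select_dtypes(include=np.number)
    outliers = (np.abs(num - num.mean()) > 3 * num.std()).any(axis=1)

    # Step 3: example transformation of a skewed numeric variable (column name made up)
    # df["log_income"] = np.log(df["income"])

    # Step 5: look at correlations to judge predictive power
    print(num.corr())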
13
Q

Types of unsupervised learning?

A
  1. Principal component analysis (PCA)
  2. Cluster analysis
14
Q

Define Principal Component Analysis (PCA).

A
  • Objective is to summarize high-dimensional data into fewer variables such that we retain as much info as possible. The kind of info that PCA attempts to preserve is the spread of the data. That is, lots of data gathered close together in a domain tells us less info about that domain than data that are spread broadly across the domain.
  • A technique that finds directions of maximum variance in the data that are mutually orthogonal (perpendicular).
  • PCA attempts to make composite variables (principal components, “PCs”) that more directly reflect the underlying pattern
15
Q

Define Principal Components.

A
  • Goal of principal components is to create a variable with the largest possible variance.
  • Advantage of PCs is that most of the variance is now concentrated in the first one.
  • Each variable should be centered, i.e., the mean of each variable is subtracted so that the average is zero
  • It is not necessary to standardize (scale) each variable; whether you do depends on whether you use a correlation or covariance matrix
  • A PC turns each record into a single number by taking a linear combination of the variable values.
  • The coefficients are called loadings or weights
16
Q

What is the purpose of loadings for PCs?

A

The loadings are determined so that the variance is maximized subject to the constraint that the sum of the squares of the loadings is equal to one. Without this constraint, the variance could simply be increased by increasing the loadings.
- The loadings of each subsequent PC must be orthogonal to the loadings of the previous PCs
- Note: if the first component has over 99% of the variance, this means that the variables are highly correlated, and there is only one dimension here
- Can be visualized through a biplot
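A brief scikit-learn sketch on simulated data, showing that each PC's loadings have a sum of squares equal to one and that the first PC captures the most variance:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    X[:, 2] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)   # make two columns highly correlated

    X_std = StandardScaler().fit_transform(X)   # center and scale each variable
    pca = PCA().fit(X_std)

    print(pca.components_)                      # loadings (one row per PC)
    print((pca.components_ ** 2).sum(axis=1))   # sum of squared loadings is 1 for each PC
    print(pca.explained_variance_ratio_)        # first PC explains the most variance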

17
Q

General properties of PCA.

A
  1. The maximum number of PCs is the minimum of (number of variables, number of data points -1)
  2. The more variance of data explained by a principal component (PC), the higher that PC is ranked:
    - More variance/most information -> first PC
    - Second-most variance -> second PC; it’s mathematically guaranteed not to overlap with the first PC
18
Q

When to use PCA?

A

PCA can be considered as:
1. Feature transformation - the new principal components are linear combinations of the original variables
2. Feature extraction - you produce new variables, namely the principal components
3. Feature selection - hopefully, in most cases, fewer components are needed to capture most of the information

19
Q

Summary of PCA.

A
  • PCA can only be applied on numeric data. Categorical variables have to be converted beforehand.
  • Data have to be centered, but not necessarily standardized, depending on whether you use a correlation or covariance matrix.
  • PCA is a systematic way to transform variables into components.
  • PCs are a linear combination of original variables.
  • The maximum number of PCs is the minimum of the number of variables and the number of data points minus one.
  • The more variance of data explained by a PC, the higher that PC is ranked:
  • Most variance/most information -> first PC
  • Second-most variance -> second PC; it’s mathematically guaranteed to be independent of the first PC (to not overlap).
  • PCs can be used as new variables in the data.
  • PCs are great at finding latent variables (latent variables underlie the measurable variables and are not measured directly).
20
Q

What are the K-Means Clustering algorithm steps?

A
  1. Pick k random starting positions for the cluster centers
  2. Assign each observation to the closest cluster center
  3. Calculate the new centers of each cluster formed in step 2
  4. If the difference in the objective functions between the old and new centers is below some threshold, stop. Otherwise, return to step 2.

This algorithm guarantees a reduction in the objective function at each iteration (i.e., it is guaranteed to converge to a locally optimal solution), although it doesn’t guarantee that we will find the best (global) solution each time. That is why it is often important to try different starting points.
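A minimal scikit-learn illustration on simulated data; the n_init argument re-runs the algorithm from several random starting positions and keeps the best result, which addresses the local-optimum issue noted above:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 1, size=(50, 2)),
                   rng.normal(5, 1, size=(50, 2))])

    # n_init=10 tries ten random starting positions and keeps the best result
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)   # final centers after convergence
    print(km.inertia_)           # objective function: within-cluster sum of squares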

21
Q

Describe the curse of Dimensionality for K-Means clustering.

A
  • Visualization of the results of clustering become problematic
  • As the number of dimensions increases, the data points become, on average, roughly the same distance away from each other. So, if you choose a single data point, the closest data point and the furthest data point to your chosen point will be nearly the same distance away! Because clustering depends on the notion that some points are closer than others, in high-dimensional spaces clustering techniques become almost meaningless

Unfortunately, there is not much we can do about this problem other than be aware of it and try to reduce the dimensionality as much as we can (e.g., by using PCA) before using a clustering technique.
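A small numpy experiment on simulated uniform data illustrating this distance concentration: the ratio of the nearest to the furthest distance moves toward 1 as the number of dimensions grows.

    import numpy as np

    rng = np.random.default_rng(3)
    for d in (2, 10, 100, 1000):
        X = rng.uniform(size=(500, d))
        dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from the first point
        print(d, dists.min() / dists.max())            # ratio approaches 1 as d grows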

22
Q

How to choose the number of clusters, k, in k-means clustering method.

A

k is assumed to be chosen before the algorithm is run. We usually don’t know what the value of k should be; it isn’t obvious just by looking at the data how many clusters there should be, but the elbow method is useful here.

23
Q

Define the Elbow Method for Selecting k for the k-means clustering method.

A

The elbow method is based on the idea that each cluster models the data by replacing each data point with the center of the cluster it is assigned to. New data points are assigned to the cluster whose center is closest. In this way, clustering acts in a similar way to a predictive model: we are predicting which cluster the data point belongs to, although the key difference is that there isn’t a right answer.

We know from previous study of linear regression models that better models explain more of the variance in the data; in the clustering case, a good model is one whose cluster centers explain a lot of the variance in the data points. We are essentially calculating the proportion of variance explained (the between-cluster sum of squares divided by the total sum of squares). The elbow method tracks this percentage of variance explained as each new cluster is added. As soon as the amount of additional variance explained by a new cluster drops off (increases less significantly), we say that we have reached an appropriate number of clusters to represent the data.
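A sketch of the elbow calculation with scikit-learn on simulated data; the proportion of variance explained is computed as 1 minus the within-cluster sum of squares divided by the total sum of squares:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 4, 8)])

    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    for k in range(1, 7):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        explained = 1 - km.inertia_ / total_ss   # proportion of variance explained
        print(k, round(explained, 3))            # look for where the gains level off (the elbow)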

24
Q

What is the purpose of clustering application?

A

Variable reduction or generation.

25
Q

Define hierarchical clustering.

A

Building a hierarchy of clusters without specifying the number of clusters in advance. The cluster hierarchy naturally forms a tree, and a dissimilarity/distance measure can be used to plot the distance between nodes at each level to form a dendrogram. Dendrograms provide a convenient way to view the hierarchy.

Like all unsupervised learning techniques, hierarchical clustering is used to explore the data to look for meaningful groupings without considering a specific label.

26
Q

Ways to approach hierarchical clustering.

A

There are 2 ways to approach the hierarchical clustering:
1. Agglomerative - clustering starts by considering each observation as its own cluster, then gradually grouping nearby clusters together at each stage until only one cluster is left (i.e., bottom-up)
2. Divisive - clustering starts by considering all the observations as a single cluster and then progressively splitting it into subclusters recursively (i.e., top-down)

27
Q

Define single-linkage and complete-linkage.

A

Single-linkage - the distance between two clusters is the minimum distance between their elements, i.e., the distance between the closest pair of points, one from each cluster (nearest neighbours).

Complete-linkage - the distance between two clusters is the distance between their two furthest elements (furthest neighbours).
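A short scipy sketch on simulated data contrasting the two linkage rules and drawing the dendrogram mentioned earlier:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(0, 1, size=(20, 2)),
                   rng.normal(6, 1, size=(20, 2))])

    Z_single = linkage(X, method="single")       # merge based on closest members
    Z_complete = linkage(X, method="complete")   # merge based on furthest members

    dendrogram(Z_complete)   # visualize the agglomerative hierarchy
    plt.show()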

28
Q

What is the purpose of clustering?

A

To see structures and relationships in the data (unsupervised learning).

29
Q

What is the difference between k-means clustering and hierarchical clustering?

A

K-means clustering - we provide the number of clusters and let the algorithm define the shape of the individual clusters (which data points are included in each).

Hierarchical clustering - we sequentially group or split the data into clusters.

Both techniques provide us with tools to explore the patterns in our data that could be used for dimensionality reduction. In other words, the group assignments created during clustering can be used as features in a supervised model.
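For example, k-means labels could be appended as a categorical feature and then binarized before supervised modeling; a sketch with made-up data:

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(6)
    df = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x1", "x2"])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df[["x1", "x2"]])
    df["cluster"] = km.labels_.astype(str)         # cluster assignment as a categorical feature
    df = pd.get_dummies(df, columns=["cluster"])   # binarize it before supervised modeling
    print(df.head())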