Data Transformations and Unsupervised Learning Techniques (10-20%) Flashcards
Define a variable.
A variable is a measurement that is recorded and makes up part of the original dataset before any transformation takes place (where we do not consider data cleaning to be a transformation). This is a machine learning point of view, where variables are more closely associated with the raw data.
- Represents the predictors in the model
Define a feature.
Derived from the original variables and serves as a final input into the model (provides an alternative view of the info contained in the data)
What’s the distinction between a variable and a final input?
Variables are the measurements that make up the raw data, while final inputs (features) are what actually enter the model, possibly after transformation.
Note: a raw variable (assuming it has undergone data cleaning) can itself be considered a feature and used in the final model with no transformation
Unstructured and structured data features?
Unstructured data - taking the raw text variable and then generating features derived from it provides an alternative, easier-to-model view of the data
Structured data - although features are more easily understood in the context of text, image, and audio data, the idea still applies to traditional structured data. Each feature is useful info that could be predictive and have a more direct relationship with the target variable of interest, but it isn't immediately represented by the raw data
Limitations of modeling algorithms?
- The curse of dimensionality (leading to statistical insignificance)
- The need to represent the signal in a meaningful and interpretable way (or alternatively, capturing the signal in an efficient way)
- The need to capture complex signals in the data accurately
- Computational feasibility when the number of variables/features gets large
Need for feature generation and feature selection?
Feature generation and selection aim to strike a balance between complexity and interpretability. Instead of letting the model do all the work and become overly complex, we try to transform the data by extracting the important features, allowing us to build a much simpler model.
Define principle of parsimony.
When you have two comparably effective models, the simpler one is better (feature selection principle), i.e. the model with the smaller number of variables.
- also called “Occam’s razor”
When a transformation is applied to the original data, we can see it from a different viewpoint (i.e. the feature space), which may provide a viewpoint from which a simpler model can achieve the same or even more predictive power than a complex model built in the original input space (especially true for classification models)
Define feature generation.
The process of deriving new features, based on underlying variables or other features, that serve as the final inputs into a model.
- A straightforward approach is to apply a transformation to a raw variable or other feature, e.g. deriving age from birthdate or calculating the change in a variable over time, such as a stock price movement
- A more complex approach is creating multiple features from a single variable or feature, e.g. binarization
- Combine multiple variables into a single feature using a transformation, e.g. combining sex and smoking status into a single feature indicating whether someone is a female smoker, male smoker, and so on (more examples in notes; see the sketch below)
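A minimal pandas sketch of the first and third approaches; the column names, dates, and reference date are hypothetical:

```python
import pandas as pd

# Hypothetical raw variables.
df = pd.DataFrame({
    "birthdate": pd.to_datetime(["1980-05-01", "1992-11-15"]),
    "sex": ["F", "M"],
    "smoker": ["Y", "N"],
})

# Transform a single variable: approximate age (in years) from birthdate.
df["age"] = (pd.Timestamp("2024-01-01") - df["birthdate"]).dt.days // 365

# Combine multiple variables into one feature: sex x smoking status.
df["sex_smoker"] = df["sex"] + "_" + df["smoker"]  # e.g. "F_Y" = female smoker

print(df)
```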
Define binarization.
Where a categorical variable is turned into several binary variables, each indicating whether the observation has a specific value for the variable or not. This not only allows the model to focus on that particular value, but also allows values with no useful information to be filtered out as part of the feature selection process.
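A hedged sketch of binarization using pandas.get_dummies; the variable and its levels are made up:

```python
import pandas as pd

df = pd.DataFrame({"vehicle": ["car", "truck", "suv", "car"]})

# One 0/1 indicator variable per factor level. The model can now focus on
# individual levels, and feature selection can drop the uninformative ones.
binarized = pd.get_dummies(df["vehicle"], prefix="vehicle", dtype=int)
print(binarized)  # columns: vehicle_car, vehicle_suv, vehicle_truck
```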
Types of data transformations?
e.g. for an exponential relationship, perform a log transformation to make the effect look more linear, i.e. simpler (see the sketch below).
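A quick numpy sketch (with arbitrary constants) showing that a log transformation turns an exponential relationship into a linear one:

```python
import numpy as np

x = np.linspace(0, 5, 100)
y = 2.0 * np.exp(0.8 * x)           # exponential relationship between x and y

log_y = np.log(y)                   # log(y) = log(2.0) + 0.8*x, linear in x
print(np.corrcoef(x, log_y)[0, 1])  # 1.0: perfectly linear after the transform
```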
Explain how to address non-linear relationships and skewed distributions.
- Identifying relationships between variables can be difficult when there are skewed distributions.
- To address skewed distributions, apply a log transformation to the skewed data. After the transformation, the points are more spread out so that patterns, if any, can be identified.
Caution: when modeling with transformed variables, it is important to remember to transform the resulting predictions back to the original (untransformed) scale, as in the sketch below.
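A sketch of that caution with illustrative data: fit on the log scale, then exponentiate the predictions to return them to the original scale:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 50)
y = 3.0 * np.exp(0.4 * x) * rng.lognormal(sigma=0.1, size=50)  # skewed target

# Fit a straight line to log(y).
slope, intercept = np.polyfit(x, np.log(y), deg=1)

pred_log = intercept + slope * x
pred = np.exp(pred_log)  # back-transform: predictions on the original scale
print(pred[:3])
```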
Explain the steps for cleaning data and then selecting the features for use in future modeling.
- Obvious adjustments and checks - e.g. check the min and max values for each variable to ensure they are somewhat realistic; at this point we are checking for errors
- Check for outliers - note: variables that use numbers to represent factor levels cannot have outliers
- Make appropriate transformations - normally categorical variables are not transformed other than to conduct a binarization if needed
- Create appropriate new features - PCA and clustering can be used to create new features
- Construct the final dataset
- might consider making scatter plots or calculating correlations
- may decide a few of the variables have no predictive power and could be eliminated; either use automated methods for removing variables (e.g. the lasso) or retain all existing variables for future use (a sketch of the first two checks follows)
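A hedged sketch of the first two checks; the DataFrame and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical data with one clearly erroneous age.
df = pd.DataFrame({
    "age": [25, 41, 37, 29, 52, 48, 33, 230],
    "income": [52, 61, 58, 47, 73, 66, 55, 60],  # in $000s
})

# Obvious adjustments and checks: min/max per variable to spot errors.
print(df.describe().loc[["min", "max"]])

# Check for outliers using the 1.5 * IQR rule; flags the age of 230.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outliers = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
print(df[outliers.any(axis=1)])
```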
Types of unsupervised learning?
- Principal component analysis (PCA)
- Cluster analysis
Define Principal Component Analysis (PCA).
- Objective is to summarize high-dimensional data into fewer variables such that we retain as much info as possible. The kind of info that PCA attempts to preserve is the spread of the data. That is, lots of data gathered close together in a domain tells us less info about that domain than data that are spread broadly across the domain.
- A technique that finds directions of maximum variance in the data that are mutually orthogonal (perpendicular).
- PCA attempts to make composite variables (principal components, "PCs") that more directly reflect the underlying patterns
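A minimal scikit-learn sketch of this idea; the dataset (iris) and the choice of 2 components are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                   # 150 records, 4 variables
Z = StandardScaler().fit_transform(X)  # center and scale each variable

pca = PCA(n_components=2)
scores = pca.fit_transform(Z)          # 150 records, now 2 composite variables

# Proportion of the total spread (variance) retained by the first 2 PCs.
print(pca.explained_variance_ratio_.sum())  # roughly 0.96 for this data
```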
Define Principal Components.
- The goal of principal components is to create variables with the largest possible variance.
- Advantage of PCs is that most of the variance is now concentrated in the first one.
- Each variable should be centered and scaled, i.e. the mean for each variable has been subtracted, so the average is zero (due to centering)
- It is not strictly necessary to standardize (scale) each variable; doing so corresponds to using the correlation matrix rather than the covariance matrix
- A PC turns each record into a single number by taking a linear combination of the variable values (see the sketch below).
- The coefficients are called loadings or weights
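A sketch (with synthetic data) showing centering/scaling, the loadings, and a PC score computed as a linear combination of the scaled variable values:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.5, size=200)])  # correlated pair

Z = StandardScaler().fit_transform(X)  # centered (mean 0) and scaled (sd 1)

pca = PCA()
scores = pca.fit_transform(Z)

print(pca.components_)                # rows are the loadings (weights) of each PC
print(pca.explained_variance_ratio_)  # most of the variance sits in the first PC

# Each record's PC1 score is a linear combination of its scaled values.
manual_pc1 = Z @ pca.components_[0]
print(np.allclose(manual_pc1, scores[:, 0]))  # True
```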