Modules Flashcards
Permutation Feature Importance
Feature Selection
Permutation feature importance works by randomly shuffling the values of each feature column, one column at a time, and then re-evaluating the model to see how much its performance changes.
The rankings provided by permutation feature importance are often different from the ones you get from Filter Based Feature Selection, which calculates scores before a model is created.
This is because permutation feature importance doesn’t measure the association between a feature and a target value, but instead captures how much influence each feature has on predictions from the model.
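The shuffle-and-rescore idea can be illustrated with a minimal sketch. This is not the Studio module's implementation; the function names (permutation_importance, toy_predict, accuracy) and the toy data are hypothetical, and the "model" is a plain function that only looks at the first feature.

```python
import random

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean drop in `metric` when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = metric(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - metric(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical "trained model": it ignores feature 1 entirely.
def toy_predict(row):
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_predict(row) for row in X]
scores = permutation_importance(toy_predict, X, y, accuracy)
```

Shuffling the feature the model relies on hurts accuracy, while shuffling the ignored feature changes nothing, which is why the ranking reflects the model rather than the raw data.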
Filter Based Feature Selection
Feature Selection
The Filter Based Feature Selection module provides multiple feature selection algorithms to choose from, including correlation methods such as Pearson’s or Kendall’s correlation, mutual information scores, and chi-squared values.
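As a rough illustration of a filter method, the sketch below scores each column by the absolute Pearson correlation with the target before any model is trained, then keeps the top-scoring columns. The helper names (pearson, rank_features) and the sample columns are assumptions made for this example only.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)

def rank_features(columns, target, top_k):
    """Return the names of the top_k columns by |Pearson correlation|."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

columns = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
    "constant": [2.0, 2.0, 2.0, 2.0, 2.0],
}
target = [2.1, 3.9, 6.2, 8.1, 9.8]
selected = rank_features(columns, target, top_k=2)
```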
Fisher Linear Discriminant Analysis
Feature Selection
Identifies the linear combination of feature variables that can best group data into separate classes.
Captures the combination of features that best separates two or more classes.
This method is often used for dimensionality reduction, because it projects a set of features onto a smaller feature space while preserving the information that discriminates between classes. This not only reduces computational costs for a given classification task, but can help prevent overfitting.
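For two classes and two features, the discriminant direction can be sketched as w = Sw^-1 (m1 - m0), where Sw is the within-class scatter matrix and m0, m1 are the class means. The code below is a hand-rolled 2x2 illustration with hypothetical names and made-up data, not the module's implementation; it assumes Sw is invertible.

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter(rows, m):
    """2x2 scatter matrix of rows around the mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class0, class1):
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # assumed non-zero
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(row, w):
    # Reduce a 2-feature row to a single discriminant score.
    return row[0] * w[0] + row[1] * w[1]

class0 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class1 = [[6.0, 5.0], [7.0, 8.0], [8.0, 6.0], [7.0, 7.0]]
w = fisher_direction(class0, class1)
proj0 = [project(r, w) for r in class0]
proj1 = [project(r, w) for r in class1]
```

After projection, each row is described by one number instead of two, and in this toy data the two classes no longer overlap on that single axis.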
Synthetic Minority Oversampling Technique (SMOTE)
Manipulation
Use the SMOTE module in Azure Machine Learning Studio to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.
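The core idea of SMOTE, as generally described, is to create each synthetic case by interpolating between a minority-class sample and one of its nearest minority-class neighbors. The sketch below is a simplified illustration; the function name smote, its parameters, and the sample points are assumptions for this example.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base` by squared Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [3.0, 3.0]]
new_points = smote(minority, n_new=6)
```

Because every new point lies on a line segment between two real cases, the synthetic rows are new values rather than exact copies.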
Vowpal Wabbit
Text Analytics
Vowpal Wabbit (VW) is a fast, parallel machine learning framework that was developed for distributed computing by Yahoo! Research. Later it was ported to Windows and adapted by John Langford (Microsoft Research) for scientific computing in parallel architectures.
Features of Vowpal Wabbit that are important for machine learning include continuous learning (online learning), dimensionality reduction, and interactive learning. Vowpal Wabbit is also a solution for problems when you cannot fit the model data into memory.
Root Mean Square Error
Evaluate Model - Regression
Root mean squared error (RMSE) creates a single value that summarizes the error in the model. By squaring the difference, the metric disregards the difference between over-prediction and under-prediction.
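A short sketch of the calculation, using made-up numbers, shows why the sign of the error does not matter: over-predicting by 2 and under-predicting by 2 give the same RMSE.

```python
import math

def rmse(y_true, y_pred):
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [10.0, 20.0]
over = rmse(actual, [12.0, 22.0])   # every prediction is 2 too high
under = rmse(actual, [8.0, 18.0])   # every prediction is 2 too low
```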
R-Squared
Evaluate Model - Regression
Coefficient of determination, often referred to as R², represents the predictive power of the model as a value between 0 and 1. Zero means the model is random (explains nothing); 1 means there is a perfect fit. However, caution should be used in interpreting R² values, as low values can be entirely normal and high values can be suspect.
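One common way to compute the coefficient of determination is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses that definition with made-up values; note that under this definition a model that does worse than always predicting the mean produces a negative value.

```python
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y, y)                        # predictions match exactly
mean_only = r_squared(y, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
worse = r_squared(y, [4.0, 3.0, 2.0, 1.0])       # predictions reversed
```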
F1 score
Evaluate Model - Classification
The F1 score is computed as the harmonic mean of precision and recall. It ranges from 0 to 1, where the ideal value is 1.
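A minimal sketch of the F1 calculation for a binary problem, using the standard harmonic-mean form 2 * precision * recall / (precision + recall). The labels below are made up for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)  # precision = recall = 2/3
```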
k-fold cross-validation
Cross Validate Module - Regression / Classification
Cross-validation is a technique often used in machine learning to assess both the variability of a dataset and the reliability of any model trained through that data.
The Cross Validate Model module takes as input a labeled dataset, together with an untrained classification or regression model. It divides the dataset into some number of subsets (folds), then for each fold trains a model on the remaining folds, tests it on the held-out fold, and returns a set of accuracy statistics for that fold. By comparing the accuracy statistics for all the folds, you can interpret the quality of the dataset and understand whether the model is susceptible to variations in the data.
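The fold mechanics can be sketched in a few lines. This is a generic k-fold illustration, not the module's code: the helper names, the toy "learner" (which just predicts the mean training label), and the data are all assumptions for this example.

```python
import statistics

def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def cross_validate(X, y, k, train, evaluate):
    """Train on k-1 folds, score on the held-out fold, repeat for each fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(
            evaluate(model, [X[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return scores

# Toy learner: the "model" is just the mean of the training labels.
def train_mean(X_train, y_train):
    return statistics.mean(y_train)

def mean_abs_error(model, X_test, y_test):
    return statistics.mean(abs(t - model) for t in y_test)

X = [[i] for i in range(10)]
y = [float(i) for i in range(10)]
fold_scores = cross_validate(X, y, 5, train_mean, mean_abs_error)
spread = statistics.pstdev(fold_scores)  # how much the folds disagree
```

A large spread across fold scores suggests the model is sensitive to which rows it was trained on.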
Assign Data to Clusters
xxx
Load Trained Model
xxx
Partition and Sample
xxx
Tune Model Hyperparameters
Integrated train and tune: You configure a set of parameters to use, and then let the module iterate over multiple combinations, measuring accuracy until it finds a "best" model. With most learner modules, you can choose which parameters should be changed during the training process, and which should remain fixed. We recommend that you use Cross-Validate Model to establish the goodness of the model given the specified parameters. Use Tune Model Hyperparameters to identify the optimal parameters.
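Iterating over parameter combinations and keeping the best one can be sketched as an exhaustive grid sweep. The module itself offers its own sweep modes; this example only shows the general pattern. The function toy_score is a hypothetical stand-in for "train a model with these parameters and measure its accuracy", and the parameter names are made up.

```python
import itertools

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function: peaks at learning_rate=0.1, depth=4.
def toy_score(params):
    return -(params["learning_rate"] - 0.1) ** 2 - (params["depth"] - 4) ** 2

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_params, best_score = grid_search(grid, toy_score)
```

Parameters that should stay fixed are simply given a single value in the grid.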
Build Counting Transform
Use the Build Counting Transform module in Azure Machine Learning Studio to analyze training data. From this data, the module builds a count table as well as a set of count-based features that can be used in a predictive model.
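A rough sketch of the idea for a binary label: count how often each categorical value appears with each class, then turn those counts into numeric features. The feature names (neg_count, pos_count, pos_rate) and the add-one smoothing are assumptions chosen for this illustration, not the module's actual output columns.

```python
from collections import defaultdict

def build_count_table(values, labels):
    """Map each categorical value to [negative_count, positive_count]."""
    table = defaultdict(lambda: [0, 0])
    for value, label in zip(values, labels):
        table[value][label] += 1  # label is assumed to be 0 or 1
    return dict(table)

def count_features(value, table):
    neg, pos = table.get(value, [0, 0])  # unseen values get zero counts
    return {
        "neg_count": neg,
        "pos_count": pos,
        # Add-one smoothing keeps the rate defined for unseen values.
        "pos_rate": (pos + 1) / (neg + pos + 2),
    }

countries = ["US", "US", "UK", "US", "UK"]
clicked = [1, 0, 1, 1, 0]
table = build_count_table(countries, clicked)
us_features = count_features("US", table)
unseen_features = count_features("FR", table)
```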
Missing Values Scrubber
The Missing Values Scrubber module is deprecated.
Feature Hashing
Feature hashing is used for linguistics and works by converting unique tokens into integers.
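A minimal sketch of the hashing trick: each token is hashed to an integer, and that integer picks a slot in a fixed-length count vector. MD5 is used here only because it ships with the Python standard library; it is an assumption of this example, and production implementations typically use a faster non-cryptographic hash.

```python
import hashlib

def hash_features(tokens, n_buckets=16):
    """Convert a list of tokens into a fixed-length vector of counts."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_buckets
        vector[index] += 1
    return vector

vec = hash_features(["the", "cat", "sat", "the"], n_buckets=8)
```

The vector length stays the same no matter how large the vocabulary grows, at the cost of occasional collisions where two tokens share a slot.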
Clean Missing Data
Use this module to remove, replace, or infer missing values.
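The remove and replace options can be sketched for a single numeric column, with None standing in for a missing value. The function name and mode strings are assumptions for this example; inferring values from other columns is not covered here.

```python
import statistics

def clean_missing(values, mode="mean"):
    """Handle None entries: drop them, or fill with the mean or median."""
    present = [v for v in values if v is not None]
    if mode == "remove":
        return present
    if mode == "mean":
        fill = statistics.mean(present)
    elif mode == "median":
        fill = statistics.median(present)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [fill if v is None else v for v in values]

ages = [20.0, None, 40.0, 30.0, None]
removed = clean_missing(ages, mode="remove")
filled = clean_missing(ages, mode="mean")
```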
Replace Discrete Values
The Replace Discrete Values module in Azure Machine Learning Studio is used to generate a probability score that can be used to represent a discrete value. This score can be useful for understanding the information value of the discrete values.
Import Data
xxx
Latent Dirichlet Allocation
Use the Latent Dirichlet Allocation module in Azure Machine Learning Studio to group otherwise unclassified text into a number of categories. Latent Dirichlet Allocation (LDA) is often used in natural language processing (NLP) to find texts that are similar. Another common term is topic modeling.
Partition and Sample
xxx
Convert to Indicator Values
Use the Convert to Indicator Values module in Azure Machine Learning Studio to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model.
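The conversion is often called one-hot encoding. A small sketch, with a hypothetical helper name and an assumed "column-value" naming pattern for the new columns:

```python
def to_indicator_columns(values, column_name):
    """Expand one categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {
        f"{column_name}-{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

colors = ["red", "green", "red", "blue"]
indicators = to_indicator_columns(colors, "color")
```

Each original row ends up with exactly one 1 across the new columns.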
Clean Missing Data
xxx
Remove Duplicate Rows
xxx