Modules Flashcards
Permutation Feature Importance
Feature Selection
Permutation feature importance works by randomly changing the values of each feature column, one column at a time, and then evaluating the model.
The rankings provided by permutation feature importance are often different from the ones you get from Filter Based Feature Selection, which calculates scores before a model is created.
This is because permutation feature importance doesn’t measure the association between a feature and a target value, but instead captures how much influence each feature has on predictions from the model.
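As an illustrative sketch (not the Studio module's API; the function and metric names here are made up), the shuffle-and-rescore loop looks like this in NumPy:

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=5, seed=0):
    """Score drop when each feature column is shuffled, one at a time."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break this feature's link to y
            scores.append(metric(y, model.predict(Xp)))
        importances[j] = baseline - np.mean(scores)  # big drop = important feature
    return importances
```

Features whose permutation barely changes the score receive an importance near zero, matching the description above.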
Filter Based Feature Selection
Feature Selection
The Filter Based Feature Selection module provides multiple feature selection algorithms to choose from, including correlation methods such as Pearson's or Kendall's correlation, mutual information scores, and chi-squared values.
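A minimal sketch of the filter idea in NumPy, scoring features by absolute Pearson correlation with the target before any model exists (the function name is illustrative):

```python
import numpy as np

def filter_select(X, y, k):
    """Rank features by |Pearson r| against the target; return top-k column indices."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(scores)[::-1][:k]
```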
Fisher Linear Discriminant Analysis
Feature Selection
Identifies the linear combination of feature variables that can best group data into separate classes.
Captures the combination of features that best separates two or more classes.
This method is often used for dimensionality reduction, because it projects a set of features onto a smaller feature space while preserving the information that discriminates between classes. This not only reduces computational costs for a given classification task, but can help prevent overfitting.
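Outside Studio, scikit-learn exposes this projection as LinearDiscriminantAnalysis; a small sketch on made-up two-class data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# two 5-dimensional Gaussian classes with different means
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(2, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)

# at most (number of classes - 1) discriminant directions exist
lda = LinearDiscriminantAnalysis(n_components=1)
Z = lda.fit_transform(X, y)  # 5 features projected onto 1 discriminative axis
```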
Synthetic Minority Oversampling Technique (SMOTE)
Manipulation
Use the SMOTE module in Azure Machine Learning Studio to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.
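The core of SMOTE is interpolating between a minority sample and one of its nearest minority-class neighbours, rather than copying it. A simplified NumPy sketch (the production implementation lives in packages such as imbalanced-learn; names here are illustrative):

```python
import numpy as np

def smote_like(X_minority, n_new, k=3, seed=0):
    """Create synthetic minority samples by interpolating between a point
    and one of its k nearest minority-class neighbours (SMOTE's core idea)."""
    rng = np.random.default_rng(seed)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]            # k nearest, excluding itself
        j = rng.choice(nbrs)
        lam = rng.random()                       # random point on the segment
        new.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(new)
```

Because each synthetic point lies on a segment between two real minority points, it stays inside the minority region instead of being an exact duplicate.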
Vowpal Wabbit
Text Analytics
Vowpal Wabbit (VW) is a fast, parallel machine learning framework that was developed for distributed computing by Yahoo! Research. Later it was ported to Windows and adapted by John Langford (Microsoft Research) for scientific computing in parallel architectures.
Features of Vowpal Wabbit that are important for machine learning include continuous learning (online learning), dimensionality reduction, and interactive learning. Vowpal Wabbit is also a solution for problems when you cannot fit the model data into memory.
Root Mean Square Error
Evaluate Model - Regression
Root mean squared error (RMSE) creates a single value that summarizes the error in the model. By squaring the difference, the metric disregards the difference between over-prediction and under-prediction.
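Computed directly, assuming simple array inputs:

```python
import numpy as np

def rmse(y_true, y_pred):
    # squaring removes the sign, so over- and under-prediction count equally
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```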
R-Squared
Evaluate Model - Regression
Coefficient of determination, often referred to as R2, represents the predictive power of the model as a value between 0 and 1. Zero means the model is random (explains nothing); 1 means there is a perfect fit. However, caution should be used in interpreting R2 values, as low values can be entirely normal and high values can be suspect.
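The definition can be written out directly (a sketch; library versions such as scikit-learn's r2_score behave the same way on ordinary inputs):

```python
import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot
```

A model that only ever predicts the mean of the targets scores exactly 0, which is one way to see why low values are not automatically a failure.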
F1 score
Evaluate Model - Classification
The F1 score is the harmonic mean of precision and recall, bounded between 0 and 1, where the ideal F1 value is 1.
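From a confusion matrix's counts, a small sketch:

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```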
k-fold cross-validation
Cross Validate Module - Regression / Classification
Cross-validation is a technique often used in machine learning to assess both the variability of a dataset and the reliability of any model trained through that data.
The Cross Validate Model module takes as input a labeled dataset, together with an untrained classification or regression model. It divides the dataset into some number of subsets (folds), builds a model on each fold, and then returns a set of accuracy statistics for each fold. By comparing the accuracy statistics for all the folds, you can interpret the quality of the data set. You can then understand whether the model is susceptible to variations in the data.
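A minimal hand-rolled version of the fold loop (illustrative names; train_fn and score_fn stand in for whatever learner and metric you plug in):

```python
import numpy as np

def k_fold_scores(X, y, k, train_fn, score_fn, seed=0):
    """Split the data into k folds; each fold serves once as the held-out set."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])         # fit on k-1 folds
        scores.append(score_fn(model, X[test], y[test]))  # score on the held-out fold
    return scores  # one accuracy/error statistic per fold
```

Large variation between the per-fold scores is the signal that the model is sensitive to which slice of the data it sees.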
Assign Data to Clusters
xxx
Load Trained Model
xxx
C. Partition and Sample
xxx
D. Tune Model Hyperparameters
Integrated train and tune: You configure a set of parameters to use, and then let the module iterate over multiple combinations, measuring accuracy until it finds a "best" model. With most learner modules, you can choose which parameters should be changed during the training process, and which should remain fixed. We recommend that you use Cross-Validate Model to establish the goodness of the model given the specified parameters. Use Tune Model Hyperparameters to identify the optimal parameters.
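The "iterate over combinations, keep the best" loop can be sketched as a plain grid search (the parameter names and the evaluate callback are illustrative):

```python
from itertools import product

def grid_search(grid, evaluate):
    """Try every combination of the listed parameter values and keep the
    one with the best (highest) evaluation score."""
    best_params, best_score = None, float("-inf")
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)   # e.g. cross-validated accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```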
Build Counting Transform
Use the Build Counting Transform module in Azure Machine Learning Studio to analyze training data. From this data, the module builds a count table as well as a set of count-based features that can be used in a predictive model.
Missing Values Scrubber
The Missing Values Scrubber module is deprecated
Feature Hashing
Feature hashing is used for linguistics, and works by converting unique tokens into integers.
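A toy version of the hashing trick (the bucket count and hash function are illustrative; real implementations use stronger hashes such as MurmurHash):

```python
def hash_features(tokens, n_buckets=16):
    """Map each token to a bucket index via a hash; collisions share a slot."""
    vec = [0] * n_buckets
    for tok in tokens:
        h = 0
        for ch in tok:                      # simple deterministic string hash
            h = (h * 31 + ord(ch)) % n_buckets
        vec[h] += 1                         # count tokens landing in each bucket
    return vec
```

The vector length is fixed regardless of vocabulary size, which is what makes the technique attractive for large text corpora.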
Clean Missing Data
Use the Clean Missing Data module to remove, replace, or infer missing values.
Replace Discrete Values
The Replace Discrete Values module in Azure Machine Learning Studio is used to generate a probability score that can be used to represent a discrete value. This score can be useful for understanding the information value of the discrete values.
Import Data
xxx
Latent Dirichlet Allocation
Use the Latent Dirichlet Allocation module in Azure Machine Learning Studio to group otherwise unclassified text into a number of categories. Latent Dirichlet Allocation (LDA) is often used in natural language processing (NLP) to find texts that are similar. Another common term is topic modeling.
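Outside Studio, the same technique is available in scikit-learn; a tiny sketch on made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats purr and meow", "dogs bark and fetch",
        "kittens meow softly", "puppies bark loudly"]

counts = CountVectorizer().fit_transform(docs)          # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic proportions
```

Each row of doc_topics is that document's mixture over the two discovered topics, which is what lets similar texts be grouped without labels.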
Partition and Sample
xxx
Convert to Indicator Values
Use the Convert to Indicator Values module in Azure Machine Learning Studio. The purpose of this module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model.
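The same transformation is commonly called one-hot encoding; in pandas it is a one-liner (the column and category names below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red"]})
# one 0/1 indicator column per category value
indicators = pd.get_dummies(df, columns=["color"])
```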
Clean Missing Data
xxx
Remove Duplicate Rows
xxx
Synthetic Minority Oversampling Technique (SMOTE)
xxx
Stratified split
xxx
Compute Linear Correlation
The Compute Linear Correlation module in Azure Machine Learning Studio is used to compute a set of Pearson correlation coefficients for each possible pair of variables in the input dataset.
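NumPy computes the same pairwise Pearson matrix directly (toy data; the second column is built to track the first):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
data = np.column_stack([x,
                        2 * x + rng.normal(scale=0.1, size=100),  # strongly correlated
                        rng.normal(size=100)])                    # independent noise

corr = np.corrcoef(data, rowvar=False)  # Pearson r for every pair of columns
```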
B. Export Count Table
The Export Count Table module is provided for backward compatibility with experiments that use the Build Count Table (deprecated) and Count Featurizer (deprecated) modules.
C. Execute Python Script
With Python, you can perform tasks that aren’t currently supported by existing Studio modules such as:
Visualizing data using matplotlib
Using Python libraries to enumerate datasets and models in your workspace
Reading, loading, and manipulating data from sources not supported by the Import Data module
D. Convert to Indicator Values
The purpose of the Convert to Indicator Values module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model.
E. Summarize Data
Summarize Data statistics are useful when you want to understand the characteristics of the complete dataset. For example, you might need to know:
How many missing values are there in each column?
How many unique values are there in a feature column?
What is the mean and standard deviation for each column?
The module calculates the relevant statistics for each column, and returns a row of summary statistics for each variable (data column) provided as input.
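With pandas, the same three questions are answered in a few calls (toy frame):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, None, 40],
                   "city": ["A", "B", "A", None]})

missing = df.isna().sum()   # missing values per column
unique = df.nunique()       # unique (non-null) values per column
stats = df.describe()       # mean, std, quartiles for numeric columns
```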
Test Hypothesis Using t-Test
xxx
Remove stop words
Remove words to optimize information retrieval.
Remove stop words: Select this option if you want to apply a predefined stopword list to the text column. Stop word removal is performed before any other processes.
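A sketch of the filtering step (the stop word list here is a tiny illustrative subset; real predefined lists are much longer):

```python
STOP_WORDS = {"the", "a", "an", "and", "is", "of"}

def remove_stop_words(text):
    # drop high-frequency, low-information words before further processing
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)
```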
Lemmatization
Ensure that multiple related words map to a single canonical form.
Lemmatization converts multiple related words to a single canonical form.
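A toy lookup-based sketch of the idea (real lemmatizers, e.g. NLTK's WordNetLemmatizer, use full dictionaries and part-of-speech information rather than a hand-written table):

```python
# toy lemma table mapping inflected forms to their canonical form
LEMMAS = {"ran": "run", "running": "run", "runs": "run",
          "better": "good", "mice": "mouse"}

def lemmatize(tokens):
    # words not in the table are returned unchanged
    return [LEMMAS.get(t, t) for t in tokens]
```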
Remove special characters
Remove special characters: Use this option to replace any non-alphanumeric special characters with the pipe | character.
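A rough regex sketch of that replacement (whether whitespace and underscores count as "special" is an assumption here; both are preserved below):

```python
import re

def replace_special(text):
    # replace every character that is not alphanumeric, underscore, or
    # whitespace with the pipe | character
    return re.sub(r"[^\w\s]", "|", text)
```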
Group data into bins
xxx
Group data into bins
xxx
Synthetic Minority Oversampling Technique (SMOTE)
xxx
Scale and Reduce
xxx
Boosted Decision Tree Regression
xxx
Online Gradient Descent
xxx
Bayesian Linear Regression
xxx
Neural Network Regression
xxx
Linear Regression
xxx
Decision Forest Regression
xxx
Clean Missing Data
xxx
Multiple Imputation by Chained Equations (MICE)
xxx
Equal Width with Custom Start and Stop binning
xxx
Entropy MDL binning mode
xxx
Apply a Quantiles binning mode with a PQuantile normalization
xxx
Entropy MDL binning mode
xxx
Synthetic Minority Oversampling Technique (SMOTE)
xxx
Last Observation Carried Forward (LOCF)
xxx
Multiple Imputation by Chained Equations (MICE)
xxx
Permutation Feature Importance
xxx
Edit Metadata
xxx
Filter Based Feature Selection
xxx
Execute Python Script
xxx
Latent Dirichlet Allocation
xxx