Use Cases Flashcards
https://www.examtopics.com/exams/microsoft/dp-100/view/23/ - Question #4
“Testing -
You must produce multiple partitions of a dataset based on sampling using the Partition and Sample module in Azure Machine Learning Studio.”
“Cross-validation -
You must create three equal partitions for cross-validation. You must also configure the cross-validation process so that the rows in the test and training datasets are divided evenly by properties that are near each city’s main river. You must complete this task before the data goes through the sampling process.”
Question
You need to identify the methods for dividing the data according to the testing requirements. Which properties should you select? To answer, select the appropriate options in the answer area.
Scenario: Testing -
Box 1: Assign to folds -
Use the Assign to folds option when you want to divide the dataset into subsets of the data. This option is also useful when you want to create a custom number of folds for cross-validation, or to split rows into several groups.
Not Head: Use Head mode to get only the first n rows. This option is useful if you want to test a pipeline on a small number of rows, and don’t need the data to be balanced or sampled in any way.
Not Sampling: The Sampling option supports simple random sampling or stratified random sampling. This is useful if you want to create a smaller representative sample dataset for testing.
Box 2: Partition evenly -
Specify the partitioner method: Indicate how you want data to be apportioned to each partition, using these options:
✑ Partition evenly: Use this option to place an equal number of rows in each partition. To specify the number of output partitions, type a whole number in the “Specify number of folds to split evenly into” text box.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/partition-and-sample
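The idea behind "Assign to folds" with "Partition evenly" can be sketched in plain Python. This is only an illustration of even, stratified fold assignment, not the Studio module's implementation; the `near_river` field and the `assign_to_folds` helper are hypothetical names made up for the example.

```python
import random
from collections import defaultdict

def assign_to_folds(rows, n_folds, strata_key, seed=0):
    """Deal rows round-robin into n_folds partitions, one stratum at a time."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[strata_key(row)].append(row)
    folds = [[] for _ in range(n_folds)]
    i = 0
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)          # random order inside each stratum
        for row in members:
            folds[i % n_folds].append(row)
            i += 1
    return folds

# Hypothetical data: 12 properties, half of them near the river
rows = [{"id": k, "near_river": k % 2 == 0} for k in range(12)]
folds = assign_to_folds(rows, n_folds=3, strata_key=lambda r: r["near_river"])

for n, fold in enumerate(folds):
    near = sum(r["near_river"] for r in fold)
    print(f"fold {n}: {len(fold)} rows, {near} near the river")
```

Each fold then serves once as the test set while the other two form the training set.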
https://www.examtopics.com/exams/microsoft/dp-100/view/23/ - Question #5
An initial investigation shows that the datasets are identical in structure apart from the MedianValue column. The smaller Paris dataset contains the MedianValue in text format, whereas the larger London dataset contains the MedianValue in numerical format.
In each case, the predictor of the dataset is the column named MedianValue. You must ensure that the datatype of the MedianValue column of the Paris dataset matches the structure of the London dataset.
Question:
You need to configure the Edit Metadata module so that the structure of the datasets match.
Which configuration options should you select? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
- Launch the column selector and choose the column or set of columns to work with
- Select the Data type option if you need to ** assign a different data type to the selected columns **
- Set ‘Categorical’ to ‘Uncategorical’
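As a rough picture of what changing a column's data type from text to numeric does, here is a minimal sketch in plain Python. It is not how Edit Metadata is implemented; the sample values and the `to_float_column` helper are assumptions made for the example.

```python
def to_float_column(values):
    """Convert text values to floats; unparseable text becomes None."""
    converted = []
    for v in values:
        try:
            converted.append(float(str(v).strip()))
        except ValueError:
            converted.append(None)   # keep the row, mark the value as missing
    return converted

# Hypothetical MedianValue entries stored as text in the Paris dataset
paris_median_value = ["24.0", " 21.6", "34.7", "n/a"]
converted = to_float_column(paris_median_value)
print(converted)
```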
https://www.examtopics.com/exams/microsoft/dp-100/view/23/ - Question #1
“You must produce a Receiver Operating Characteristic (ROC) curve to conduct a diagnostic test evaluation of the model. You need to select appropriate methods for producing the ROC curve in Azure Machine Learning Studio to compare the Two-Class Decision Forest and the Two-Class Decision Jungle modules with one another.”
Question
You need to define an evaluation strategy for the crowd sentiment models.
Which three actions should you perform in sequence?
To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
- Add new features for retraining supervised models
- Evaluate the changes in correlation between model error rate and centroid distance
- Filter labeled cases for retraining using the shortest distance from centroids.
- Impute unavailable features with centroid aligned models
- Filter labeled cases for retraining using the longest distance from centroids.
- Remove features before retraining supervised models
- Add new features for retraining supervised models
- Evaluate the changes in correlation between model error rate and centroid distance
- Filter labeled cases for retraining using the shortest distance from centroids.
????
https://www.examtopics.com/exams/microsoft/dp-100/view/23/ - Question #2
“Data scientists must build notebooks in a local environment using automatic feature engineering and model building in machine learning pipelines.
Experiments for local crowd sentiment models must combine local penalty detection data.
All shared features for local models are continuous variables.”
Question
You need to implement a feature engineering strategy for the crowd sentiment local models.
What should you do?
Correct Answer: D
The linear discriminant analysis method works only on continuous variables, not categorical or ordinal variables. Linear discriminant analysis is similar to analysis of variance (ANOVA) in that it works by comparing the means of the variables.
https://www.examtopics.com/exams/microsoft/dp-100/view/24/ - Question #1
Model training -
Permutation Feature Importance -
Given a trained model and a test dataset, you must compute the Permutation Feature Importance scores of feature variables. You must determine the absolute fit for the model.
Question
You need to set up the Permutation Feature Importance module according to the model training requirements.
Which properties should you select? To answer, select the appropriate options in the answer area.
Metric for measuring performance for classification
- F-score
- Precision
- Recall
- Accuracy
Metric for measuring performance for regression
- Root of mean squared error
- R-squared
- Mean zero-one error
- Mean absolute error
Box 2: R-Squared
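To make the regression metrics in the list above concrete, the following sketch computes mean absolute error, root mean squared error, and R-squared by hand. It only illustrates the standard formulas; the sample numbers are made up and nothing here reflects the module's internals.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R-squared for paired actual/predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return {
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }

metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(metrics)
```

MAE and RMSE are expressed in the units of the target column, whereas R-squared is unitless.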
https://www.examtopics.com/exams/microsoft/dp-100/view/24/ - Question #2
Experiment requirements -
You must set up the experiment to cross-validate the Linear Regression and Bayesian Linear Regression modules to evaluate performance. In each case, the predictor of the dataset is the column named MedianValue. You must ensure that the datatype of the MedianValue column of the Paris dataset matches the structure of the London dataset.
You must prioritize the columns of data for predicting the outcome.
** You must use non-parametric statistics to measure relationships. **
* You must use a feature selection algorithm to analyze the relationship between the MedianValue and AvgRoomsInHouse columns.*
Question
You need to configure the Filter Based Feature Selection module based on the experiment requirements and datasets.
How should you configure the module properties? To answer, select the appropriate options in the dialog box in the answer area. NOTE: Each correct selection is worth one point.
Feature Scoring Method?
- Fisher
- Chi-squared
- Mutual Information
- Counts
Target Column?
- MedianValue
- AvgRoomsInHouse
Box 1: Mutual Information.
The mutual information score is particularly useful in feature selection because it maximizes the mutual information between the joint distribution and target variables in datasets with many dimensions.
Box 2: MedianValue -
MedianValue is the column to select as the target; the scenario describes it as the predictor of the dataset.
Scenario: The MedianValue and AvgRoomsInHouse columns both hold data in numeric format. You need to select a feature selection algorithm to analyze the relationship between the two columns in more detail.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/filter-based-feature-selection
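For intuition about the mutual information score, here is a minimal sketch for two discrete sequences. It assumes the numeric columns have already been binned into labels (the "low"/"high" values are invented for the example) and does not claim to match how the Studio module scores features.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # sum over observed pairs of p(a, b) * log( p(a, b) / (p(a) * p(b)) )
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )

rooms = ["low", "low", "high", "high"]
value_dependent = ["low", "low", "high", "high"]    # fully determined by rooms
value_independent = ["low", "high", "low", "high"]  # unrelated to rooms

mi_dependent = mutual_information(rooms, value_dependent)
mi_independent = mutual_information(rooms, value_independent)
print(mi_dependent, mi_independent)
```

A score of zero means the feature tells you nothing about the target; larger scores mean stronger dependence of any form, not only linear.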
https://www.examtopics.com/exams/microsoft/dp-100/view/24/ - Question #3
Question:
You need to select a feature extraction method.
Which method should you use? A. Mutual information B. Mood's median test C. Kendall correlation D. Permutation Feature Importance
C. Kendall correlation
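Kendall correlation is a rank-based (non-parametric) measure, which is why it fits the "non-parametric statistics" requirement. A small sketch of the tau-a variant follows; the room and value figures are made-up sample data, and tied pairs are simply counted as neither concordant nor discordant.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1      # both columns order the pair the same way
        elif sign < 0:
            discordant += 1      # the columns disagree on the order
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

avg_rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
median_value = [20.0, 18.0, 30.0, 27.0, 40.0]
tau = kendall_tau(avg_rooms, median_value)
print(tau)
```

Because only the ordering of values matters, the result does not change under any monotonic rescaling of either column.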
https://www.examtopics.com/exams/microsoft/dp-100/view/24/ - Question #4
Model training -
Permutation Feature Importance -
Given a trained model and a test dataset, you must compute the Permutation Feature Importance scores of feature variables. You must determine the absolute fit for the model.
Question
You need to configure the Permutation Feature Importance module for the model training requirements.
What should you do? To answer, select the appropriate options in the dialog box in the answer area.
NOTE: Each correct selection is worth one point.
Box 1: 500 -
For Random seed, type a value to use as seed for randomization. If you specify 0 (the default), a number is generated based on the system clock.
A seed value is optional, but you should provide a value if you want reproducibility across runs of the same experiment.
Here we must replicate the findings.
Box 2: Mean Absolute Error -
Scenario: Given a trained model and a test dataset, you must compute the Permutation Feature Importance scores of feature variables. You need to set up the Permutation Feature Importance module to select the correct metric to investigate the model’s accuracy and replicate the findings.
Regression. Choose one of the following: Precision, Recall, Mean Absolute Error, Root Mean Squared Error, Relative Absolute Error, Relative Squared Error, Coefficient of Determination.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/permutation-feature-importance
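The role of the random seed and the metric can be seen in a small hand-rolled version of permutation importance. This is a sketch of the general technique, not the Studio module: the `model` function stands in for a trained model, the data is synthetic, and the importance score here is defined as the increase in mean absolute error after shuffling one column.

```python
import random

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, seed=500):
    """Increase in MAE when each feature column is shuffled in turn."""
    rng = random.Random(seed)   # a fixed non-zero seed makes runs repeatable
    baseline = mean_absolute_error(y, [predict(row) for row in X])
    scores = []
    for col in range(len(X[0])):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        permuted = mean_absolute_error(y, [predict(row) for row in X_perm])
        scores.append(permuted - baseline)
    return scores

def model(row):
    # Hypothetical trained model that only uses the first feature
    return 2.0 * row[0]

X = [[float(i), float(i % 3)] for i in range(10)]
y = [2.0 * row[0] for row in X]

scores_a = permutation_importance(model, X, y, seed=500)
scores_b = permutation_importance(model, X, y, seed=500)
print(scores_a, scores_a == scores_b)
```

Running twice with the same seed gives identical scores, which is the reproducibility the scenario asks for; the unused second feature scores exactly zero.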
https://www.examtopics.com/exams/microsoft/dp-100/view/22/ - Question #2
Question
HOTSPOT -
You need to use the Python language to build a sampling strategy for the global penalty detection models.
How should you complete the code segment?
To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point.
ddd
https://www.examtopics.com/exams/microsoft/dp-100/view/22/ - Question #1
Missing values -
The AccessibilityToHighway column in both datasets contains missing values. The missing data must be replaced with new data so that it is modeled conditionally using the other variables in the data before filling in the missing values.
Question
You need to replace the missing data in the AccessibilityToHighway columns. How should you configure the Clean Missing Data module?
To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point.
Cleaning Mode
- Replace using MICE
- Replace using Mean
- Replace using Median
- Replace using Mode
Cols with all missing Values
- ?
Box 1: Replace using MICE -
Replace using MICE: For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as “Multivariate Imputation using Chained Equations” or “Multiple Imputation by Chained Equations”. With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values.
Scenario: The AccessibilityToHighway column in both datasets contains missing values. The missing data must be replaced with new data so that it is modeled conditionally using the other variables in the data before filling in the missing values.
Box 2: Propagate -
The “Cols with all missing values” option indicates whether columns that contain only missing values should be preserved in the output.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data
https://www.examtopics.com/exams/microsoft/dp-100/view/22/ - Question #2
Data visualization -
You need to provide the test results to the Fabrikam Residences team. You create data visualizations to aid in presenting the results.
You must produce a Receiver Operating Characteristic (ROC) curve to conduct a diagnostic test evaluation of the model. You need to select appropriate methods for producing the ROC curve in Azure Machine Learning Studio to compare the Two-Class Decision Forest and the Two-Class Decision Jungle modules with one another.
Question
You need to produce a visualization for the diagnostic test evaluation according to the data visualization requirements.
Which three modules should you recommend be used in sequence? To answer, move the appropriate modules from the list of modules to the answer area and arrange them in the correct order.
Select and Place:
- Score Matchbox Recommender
- Apply Transformation
- Evaluate Recommender
- Evaluate Model
- Train Model
- Sweep Clustering
- Score Model
- Load Training Model
- Load Training Model
- Score Model
- Evaluate Model
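What the evaluation step produces for a two-class model can be sketched by hand: sweep a threshold over the scored probabilities, record the false and true positive rates, and compare the area under each curve. The labels and scores below are invented, and `forest_scores` / `jungle_scores` are merely stand-ins for the scored output of the two models being compared.

```python
def roc_points(y_true, scores):
    """(false positive rate, true positive rate) points from a threshold sweep."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # emit a point only after the last item in a run of tied scores
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]
forest_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
jungle_scores = [0.6, 0.9, 0.8, 0.4, 0.7, 0.1]

forest_auc = auc(roc_points(y_true, forest_scores))
jungle_auc = auc(roc_points(y_true, jungle_scores))
print(forest_auc, jungle_auc)
```

Plotting both point lists on the same axes gives the side-by-side ROC comparison; the curve closer to the top-left corner belongs to the better ranking model.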
https://www.examtopics.com/exams/microsoft/dp-100/view/22/ - Question #3
Question
You need to visually identify whether outliers exist in the Age column and quantify the outliers before the outliers are removed.
Which three Azure Machine Learning Studio modules should you use? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Create Scatterplot B. Summarize Data C. Clip Values D. Replace Discrete Values E. Build Counting Transform
Correct Answer: ABC
B: To have a global view, the summarize data module can be used. Add the module and connect it to the data set that needs to be visualized.
A: One way to quickly identify outliers visually is to create scatter plots.
C: The easiest way to treat the outliers in Azure ML is to use the Clip Values module. It can identify and optionally replace data values that are above or below a specified threshold. This is useful when you want to remove outliers or replace them with a mean, a constant, or other substitute value.
References:
https://blogs.msdn.microsoft.com/azuredev/2017/05/27/data-cleansing-tools-in-azure-machine-learning/
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clip-values
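Quantifying and clipping outliers can be illustrated in a few lines. The sketch below assumes the common 1.5 × IQR rule for the thresholds, which is a choice made for this example and not something the question or the Clip Values module prescribes; the ages are made-up sample data.

```python
def percentile(values, q):
    """Percentile with linear interpolation, q between 0 and 100."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def clip_values(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest threshold."""
    return [min(max(v, lower), upper) for v in values]

ages = [23, 25, 27, 29, 31, 33, 35, 37, 39, 120]

q1, q3 = percentile(ages, 25), percentile(ages, 75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # assumed outlier thresholds

outliers = [a for a in ages if a < lower or a > upper]
clipped = clip_values(ages, lower, upper)
print(f"{len(outliers)} outlier(s): {outliers}; thresholds=({lower}, {upper})")
```

Counting the outliers first and clipping afterwards mirrors the order in the question: quantify, then remove or replace.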
https://www.examtopics.com/exams/microsoft/dp-100/view/26/
xxx
https://www.examtopics.com/exams/microsoft/dp-100/view/31/
xxx
https://www.examtopics.com/exams/microsoft/dp-100/view/21/
xxx