Questions set 1 Flashcards

1
Q

[EXAM- UDEMY] You are asked to solve a classification task.

You must evaluate your model on a limited data sample by using k-fold cross-validation. You start by configuring a k parameter as the number of splits.

You need to configure the k parameter for the cross-validation.

Which value should you use?

k = 10
k = 0.9
K = 0.5
K = 1
A

Use k = 10.

Leave-One-Out (LOO) cross-validation:

Setting K = n (the number of observations) yields n-fold cross-validation and is called leave-one-out cross-validation (LOO); this is a special case of the K-fold approach.

LOO CV is sometimes useful but typically doesn’t shake up the data enough: the estimates from each fold are highly correlated, so their average can have high variance.

This is why the usual choice is K = 5 or K = 10, which provides a good compromise in the bias-variance tradeoff.
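As a concrete illustration (my own sketch, not from the exam; it assumes scikit-learn, and the dataset and model are just placeholders), 10-fold cross-validation looks like this:

# 10-fold cross-validation sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # any classification dataset works here
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=10, shuffle=True, random_state=42)   # k = 10 folds
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())   # average accuracy and spread across the 10 folds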

2
Q

[PERSONAL] what is the purpose of K-fold cross validation

A
  • Maximize the use of the available data for training and then testing a model.
  • Assess model performance, as it provides a range of accuracy scores across (somewhat) different subsets of the data.
3
Q

[PERSONAL] what is the purpose of cross-validation?

A

Cross-validation (CV) is a technique used to test the effectiveness of a machine learning model; it is also a resampling procedure used to evaluate a model when data is limited. To perform CV, we set aside a sample/portion of the data that is not used to train the model, and later use this sample for testing/validation.

4
Q

[PERSONAL] Give the variations on cross-validation

A
Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that a single train/test split is created to evaluate the model.
LOOCV: Taken to another extreme, k may be set to the total number of observations in the dataset such that each observation is given a chance to be the held out of the dataset. This is called leave-one-out cross-validation, or LOOCV for short.
Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.
Repeated: This is where the k-fold cross-validation procedure is repeated n times, where importantly, the data sample is shuffled prior to each repetition, which results in a different split of the sample.
Nested: This is where k-fold cross-validation is performed within each fold of cross-validation, often to perform hyperparameter tuning during model evaluation. This is called nested cross-validation or double cross-validation.
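The variations above map roughly onto scikit-learn splitter classes; the sketch below is my own illustration (assumes scikit-learn; dataset and parameters are placeholders):

# Cross-validation variations sketched with scikit-learn splitters.
from sklearn.datasets import load_iris
from sklearn.model_selection import (KFold, LeaveOneOut, StratifiedKFold,
                                     RepeatedKFold, GridSearchCV, cross_val_score)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

two_fold = KFold(n_splits=2)                                        # the k = 2 extreme described above
loo = LeaveOneOut()                                                 # k = number of observations (LOOCV)
stratified = StratifiedKFold(n_splits=5)                            # preserves class proportions per fold
repeated = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)   # reshuffles and re-splits 3 times

# Nested CV: hyperparameter tuning (inner loop) evaluated by an outer k-fold loop.
inner_search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)
nested_scores = cross_val_score(inner_search, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=1))
print(nested_scores.mean())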
5
Q

[EXAM- UDEMY] Your manager asked you to analyze a numerical dataset which contains missing values in several columns.

You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set.

You need to analyze a full dataset to include all values.

Solution:

Use the Last Observation Carried Forward (LOCF) method to impute the missing data points.

A

Explanation
Instead of using the Last Observation Carried Forward method, you need to use the Multiple Imputation by Chained Equations (MICE) method.

Replace using MICE: For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as “Multivariate Imputation using Chained Equations” or “Multiple Imputation by Chained Equations”. With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values.

Note:

Last observation carried forward (LOCF) is a method of imputing missing data in longitudinal studies. If a person drops out of a study before it ends, then his or her last observed score on the dependent variable is used for all subsequent (i.e., missing) observation points. LOCF is used to maintain the sample size and to reduce the bias caused by the attrition of participants in a study.

6
Q

[PERSONAL] Pro’s and Cons of mean/median imputation

A

Pros:
Easy and fast.
Works well with small numerical datasets.
Cons:
Doesn’t factor in correlations between features; it only works at the column level.

Will give poor results on encoded categorical features (do NOT use it on categorical features).

Not very accurate.
Doesn’t account for the uncertainty in the imputations.

7
Q

[PERSONAL] Pro’s and Cons of Most Frequent or Zero/Constant values

A

Pros:
Works well with categorical features.

Cons:
It also doesn’t factor the correlations between features.

It can introduce bias in the data.
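A minimal sketch of the simple strategies from the two cards above (assumes scikit-learn; the tiny arrays are made up):

# Mean/median and most-frequent/constant imputation with SimpleImputer.
import numpy as np
from sklearn.impute import SimpleImputer

X_num = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
X_cat = np.array([["red"], ["blue"], [np.nan]], dtype=object)

mean_imputer = SimpleImputer(strategy="mean")                # or strategy="median"
frequent_imputer = SimpleImputer(strategy="most_frequent")   # usable on categorical columns
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)

print(mean_imputer.fit_transform(X_num))
print(frequent_imputer.fit_transform(X_cat))
print(constant_imputer.fit_transform(X_num))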

8
Q

[PERSONAL] Pro’s and Cons

Imputation Using k-NN

A

Pros:
Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).
Cons:
Computationally expensive.
KNN works by storing the whole training dataset in memory.

K-NN is quite sensitive to outliers in the data (unlike SVM)
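A minimal k-NN imputation sketch (assumes scikit-learn; the matrix is made up):

# Each missing value is filled using the n_neighbors rows closest on the observed features.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))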

9
Q

[PERSONAL] Pro’s and Cons

Imputation Using Multivariate Imputation by Chained Equation (MICE)

A

Pros:

  • More accurate than single imputation
  • Flexible: can handle variables of different data types
  • Can handle complexities such as bounds or survey skip patterns

This type of imputation works by filling in the missing data multiple times. Multiple Imputations (MIs) are much better than a single imputation because they capture the uncertainty of the missing values in a better way. The chained equations approach is also very flexible and can handle different variables of different data types (i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns.
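For illustration, scikit-learn’s IterativeImputer is a MICE-inspired imputer that models each column from the others in round-robin fashion; the sketch below rests on that assumption (the matrix is made up) and is not the Azure ML Studio module itself:

# MICE-style imputation sketch with scikit-learn's (experimental) IterativeImputer.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, 9.0],
              [np.nan, 8.0, 1.0]])

imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
print(imputer.fit_transform(X))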

10
Q

[PERSONAL] what is Hot-Deck imputation

A

Works by randomly choosing the missing value from a set of related and similar variables.

11
Q

[PERSONAL] what is Extrapolation and Interpolation imputation?

A

It tries to estimate values from other observations within the range of a discrete set of known data points.

12
Q

[PERSONAL] what is Stochastic regression imputation

A

It is quite similar to regression imputation which tries to predict the missing values by regressing it from other related variables in the same dataset plus some random residual value

13
Q

[EXAM - UDEMY]
You are a senior data scientist of your company and you use Azure Machine Learning Studio.

You are asked to normalize values to produce an output column into bins to predict a target column.

Solution:

Apply a Quantiles normalization with a QuantileIndex normalization.

Does the solution meet the goal?

A
No, the solution does not meet the goal.

Quantile Normalization:
Summary of YT video:
- sort both distributions (start with the highest value)
- calculate the mean across distributions at each rank
- set the elements of the different distributions at that rank to the mean

https://www.youtube.com/watch?reload=9&v=ecjN6Xpv6SE

In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile-normalize a test distribution to a reference distribution of the same length, sort the test distribution and sort the reference distribution, then assign each entry of the test distribution the value of the entry with the same rank in the reference distribution.

Quantile normalization has nothing to do with bins, so it does not produce the required binned output column.

Entropy MDL (the mode you need instead): This method requires that you select the column you want to predict and the column or columns that you want to group into bins. It then makes a pass over the data and attempts to determine the number of bins that minimizes the entropy. In other words, it chooses a number of bins that allows the data column to best predict the target column. It then returns the bin number associated with each row of your data in a column named quantized.

14
Q

[EXAM - UDEMY]
You are analyzing a raw dataset that requires cleaning.

You must perform transformations and manipulations by using Azure Machine Learning Studio.

You need to identify the correct module to perform the below transformation.

Which module should you choose?

Scenario:

Remove potential duplicates from a dataset

  • remove duplicate rows
  • SMOTE
  • Convert to indicator values
  • Clean missing data
  • Threshold filter
A

Use the Remove Duplicate Rows module in Azure Machine Learning Studio (classic), to remove potential duplicates from a dataset.

15
Q

[PERSONAL]

What are all the categories in the data transformation category?

A

  • Data Transformation - Filter
  • Data Transformation - Learning with Counts
  • Data Transformation - Manipulation
  • Data Transformation - Sample and Split
  • Data Transformation - Scale and Reduce

16
Q

[PERSONAL] Data Transformation - Filter

Give al the types of filters and what they do

A

Apply Filter: Applies a filter to specified columns of a dataset.
FIR Filter: Creates an FIR filter for signal processing.
IIR Filter: Creates an IIR filter for signal processing.
Median Filter: Creates a median filter that’s used to smooth data for trend analysis.
Moving Average Filter: Creates a moving average filter that smooths data for trend analysis.
Threshold Filter: Creates a threshold filter that constrains values.
User-Defined Filter: Creates a custom FIR or IIR filter.

17
Q

[PERSONAL] Data Transformation - Learning with Counts

A
The basic idea of count-based featurization is that by calculating counts, you can quickly and easily get a
summary of what columns contain the most important information. The module counts the number of times a
value appears, and then provides that information as a feature for input to a model.

Build Counting Transform: Creates a count table and count-based features from a dataset, and then saves
the table and features as a transformation.

Export Count Table: Exports a count table from a counting transform. This module supports backward
compatibility with experiments that create count-based features by using Build Count Table (deprecated)
and Count Featurizer (deprecated).

Import Count Table: Imports an existing count table. This module supports backward compatibility with
experiments that create count-based features by using Build Count Table (deprecated) and Count Featurizer
(deprecated). The module supports conversion of count tables to count transformations.

Merge Count Transform: Merges two sets of count-based features.

Modify Count Table Parameters: Modifies count-based features that are derived from an existing count table.

18
Q

[PERSONAL]
Data Transformation - Manipulation
Give some modules in this category.

A

Add Columns: Adds a set of columns from one dataset to another.

Add Rows: Appends a set of rows from an input dataset to the end of another dataset.

Apply SQL Transformation: Runs a SQLite query on input datasets to transform the data.

Clean Missing Data: Specifies how to handle values that are missing from a dataset. This module replaces
Missing Values Scrubber, which has been deprecated.

Convert to Indicator Values: Converts categorical values in columns to indicator values.
Edit Metadata: Edits metadata that’s associated with columns in a dataset.

Group Categorical Values: Groups data from multiple categories into a new category.

Join Data: Joins two datasets.

Remove Duplicate Rows: Removes duplicate rows from a dataset.

Select Columns in Dataset: Selects columns to include in a dataset or exclude from a dataset in an operation.

Select Columns Transform: Creates a transformation that selects the same subset of columns as in a
specified dataset.

SMOTE: Increases the number of low-incidence examples in a dataset by using synthetic minority
oversampling.

19
Q

[PERSONAL] Data Transformation - Sample and Split

Give the two modules and what they do.

A

Partition and Sample: Creates multiple partitions of a dataset based on sampling.

Split Data: Partitions the rows of a dataset into two distinct sets.

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/data-transformation-sample-and-split

20
Q

[PERSONAL] Data Transformation - Scale and Reduce

A

Clip Values: Detects outliers, and then clips or replaces their values.
Group Data into Bins: Puts numerical data into bins.
Normalize Data: Rescales numeric data to constrain dataset values to a standard range.
Principal Component Analysis: Computes a set of features that have reduced dimensionality for more efficient
learning.

21
Q

[EXAM - UDEMY]

You are a data scientist using Azure Machine Learning Studio.

You are performing a filter-based feature selection for a dataset to build a multi-class classifier by using Azure Machine Learning Studio.

The dataset contains categorical features that are highly correlated to the output label column.

You need to select the appropriate feature scoring statistical method to identify the key predictors.

Which method should you use?

  • spearman correlation
  • Kendall correlation
  • Chi-squared
  • Pearson correlation
A

Explanation
The chi-square statistic is used to show whether or not there is a relationship between two categorical variables

Incorrect Answer:

Pearson’s correlation coefficient (r) is used to demonstrate whether two variables are correlated or related to each other.
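A minimal sketch of chi-squared filter-based feature selection (assumes scikit-learn; the digits dataset is just a convenient non-negative example):

# Keep the k features with the highest chi-squared score against the label.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)            # chi2 requires non-negative features
selector = SelectKBest(score_func=chi2, k=20)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)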

22
Q

[PERSONAL]

Explain CHI-squared test, for what is it used?

A

is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance.

How likely is it that two sets of observations arose from the same distribution?

YT: https://www.youtube.com/watch?v=2QeDRsxSF9M

23
Q

[PERSONAL]

spearman correlation

A

Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a Spearman correlation to evaluate whether the order in which employees complete a test exercise is related to the number of months they have been employed

Spearman’s Rank correlation coefficient is a technique which can be used to summarise the strength and direction (negative or positive) of a relationship between two variables. The result will always be between 1 and minus 1.

A Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear. This means that all data points with greater x values than that of a given data point will have greater y values as well. In contrast, this does not give a perfect Pearson correlation.

24
Q

[PERSONAL]

Kendall correlation

A

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall’s τ coefficient (after the Greek letter τ, tau), is a statistic used to measure the ordinal association between two measured quantities. Both Kendall’s τ and Spearman’s ρ can be formulated as special cases of a more general correlation coefficient.

In the normal case, the Kendall correlation is preferred over the Spearman correlation because of a smaller gross error sensitivity (GES) (more robust) and a smaller asymptotic variance (AV) (more efficient).

25
Q

[PERSONAL] Pearson correlation

A

is a statistic that measures linear correlation between two variables X and Y. It has a value between +1 and −1. A value of +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.

Correlation is a technique for investigating the relationship between two quantitative, continuous variables
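For reference, the three coefficients from the last few cards can be computed with SciPy (a sketch; the arrays are made up):

# Pearson (linear), Spearman (rank/monotonic) and Kendall (rank/ordinal) correlation.
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 7, 8]

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)
print(pearson_r, spearman_rho, kendall_tau)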

26
Q

[EXAM - UDEMY]

You are a data scientist and you use Azure Machine Learning Studio for your experiments.

You are creating a new experiment in Azure Machine Learning Studio.

One class has a much smaller number of observations than the other classes in the training set.

You need to select an appropriate data sampling strategy to compensate for the class imbalance.

Solution:

You use the Principal Components Analysis (PCA) sampling mode.

Does the solution meet the goal?

A

Explanation
Instead of using Principal Components Analysis, use the Synthetic Minority Oversampling Technique (SMOTE) sampling mode.

Note:

SMOTE is used to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.

Incorrect Answers:

The Principal Component Analysis module in Azure Machine Learning Studio (classic) is used to reduce the dimensionality of your training data. The module analyzes your data and creates a reduced feature set that captures all the information contained in the dataset, but in a smaller number of features.
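Outside Azure ML Studio, the same idea can be sketched with the third-party imbalanced-learn package (an assumption; the synthetic dataset is made up):

# SMOTE oversampling of the minority class.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))   # minority class synthetically oversampled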

28
Q

[PERSONAL] Explain PCA

A

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
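A minimal PCA sketch (assumes scikit-learn; the digits dataset stands in for any wide feature set):

# Keep enough principal components to explain 95% of the variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)     # 64 features per sample
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape, pca.explained_variance_ratio_.sum())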

29
Q

[EXAM - UDEMY] - (duplicate question)
You are a data scientist using Azure Machine Learning Studio.

You are using Azure Machine Learning Studio to perform feature engineering on a dataset.

You need to normalize values to produce a feature column grouped into bins.

Solution:

Apply an Entropy Minimum Description Length (MDL) binning mode.

Does the solution meet the goal?

A

Explanation
Yes, the solution meets the goal.

Entropy MDL binning mode:

This method requires that you select the column you want to predict and the column or columns that you want to group into bins. It then makes a pass over the data and attempts to determine the number of bins that minimizes the entropy. In other words, it chooses a number of bins that allows the data column to best predict the target column. It then returns the bin number associated with each row of your data in a column named quantized.

30
Q

[EXAM - UDEMY]
HOTSPOT
You are a data scientist of your company

You are working on a classification task.

You have a dataset indicating whether a student would like to play soccer and associated attributes.

The dataset includes the following columns:

isPlayerSoccer: boolean
Gender: M or F
PrevExamMarks: stores values from 0 - 100
Height: in centimeters
Weight: in kilograms

Which are continuous variables?

A

Too obvious :)

  • Height
  • Weight
  • PrevExamMarks
31
Q

[EXAM - UDEMY]
HOTSPOT

Your manager has asked you to create a binary classification model to predict whether a person has a disease.

You need to detect possible classification errors.

Which error type should you choose for below description?

A person has a disease. The model classifies the case as having no disease.

A

False negative

A false negative is an outcome where the model incorrectly predicts the negative class.

Note:

Let’s make the following definitions:

“Wolf” is a positive class.

“No wolf” is a negative class.

We can summarize our “wolf-prediction” model using a 2x2 confusion matrix that depicts all four possible outcomes: true positive (wolf predicted, wolf present), false positive (wolf predicted, no wolf), false negative (no wolf predicted, wolf present), and true negative (no wolf predicted, no wolf).

32
Q

[EXAM - UDEMY]
You use the Azure Machine Learning service to create a tabular dataset named training_data. You plan to use this dataset in a training script.

You create a variable that references the dataset using the following code:

training_ds = workspace.datasets.get("training_data")

You define an estimator to run the script.

You need to set the correct property of the estimator to ensure that your script can access the training_data dataset.

Which property should you set?

  • source_directory = training_ds
  • inputs = [training_ds.as_named_input('training_ds')]
  • environment_definition = {'training_ds': training_ds}
  • script_params = {'--training_ds': training_ds}
A

inputs = [training_ds.as_named_input('training_ds')]

Estimator. Represents a generic estimator to train data using any supplied framework. This class is designed for use with machine learning frameworks that do not already have an Azure Machine Learning pre-configured estimator. Pre-configured estimators exist for Chainer, PyTorch, TensorFlow, and SKLearn

  • inputs (list):
    A list of DataReference or DatasetConsumptionConfig objects to use as input.
33
Q

[PERSONAL]

What is an estimator?

A

Estimator. Represents a generic estimator to train data using any supplied framework. This class is designed for use with machine learning frameworks that do not already have an Azure Machine Learning pre-configured estimator. Pre-configured estimators exist for Chainer, PyTorch, TensorFlow, and SKLearn

The Estimator class wraps run configuration information to help simplify the tasks of specifying how a script is executed. It supports single-node as well as multi-node execution. Running the estimator produces a model in the output directory specified in your training script.

34
Q

[PERSONAL] what are the parameters of an estimator

A

Parameters for estimator
source_directory (str)
A local directory containing experiment configuration and code files needed for a training job.
compute_target (AbstractComputeTarget or str)
The compute target where training will happen. This can either be an object or the string “local”.
vm_size (str)
The VM size of the compute target that will be created for the training. Supported values: Any Azure VM size.
vm_priority (str)
The VM priority of the compute target that will be created for the training. If not specified, ‘dedicated’ is used. Supported values: ‘dedicated’ and ‘lowpriority’. This takes effect only when the vm_size parameter is specified in the input.
entry_script (str)
The relative path to the file used to start training.
script_params (dict)
A dictionary of command-line arguments to pass to the training script specified in entry_script.
node_count (int)
The number of nodes in the compute target used for training. If greater than 1, an MPI distributed job will be run.
process_count_per_node (int)
The number of processes (or “workers”) to run on each node. If greater than 1, an MPI distributed job will be run. Only the AmlCompute target is supported for distributed jobs.
distributed_backend (str)
The communication backend for distributed training.
DEPRECATED. Use the distributed_training parameter.
Supported values: ‘mpi’. ‘mpi’ represents MPI/Horovod.
This parameter is required when node_count or process_count_per_node > 1.
When node_count == 1 and process_count_per_node == 1, no backend will be used unless the backend is explicitly set. Only the AmlCompute target is supported for distributed training.
distributed_training (Mpi)
Parameters for running a distributed training job.

For running a distributed job with MPI backend, use Mpi object to specify process_count_per_node.
use_gpu (bool)
Indicates whether the environment to run the experiment should support GPUs. If true, a GPU-based default Docker image will be used in the environment. If false, a CPU-based image will be used. Default Docker images (CPU or GPU) will be used only if the custom_docker_image parameter is not set. This setting is used only in Docker enabled compute targets.
use_docker (bool)
Specifies whether the environment to run the experiment should be Docker-based.
custom_docker_base_image (str)
The name of the Docker image from which the image to use for training will be built.
DEPRECATED. Use the custom_docker_image parameter.
If not set, a default CPU-based image will be used as the base image.
custom_docker_image (str)
The name of the Docker image from which the image to use for training will be built. If not set, a default CPU-based image will be used as the base image. Only specify images available in public docker repositories (Docker Hub). To use an image from a private docker repository, use the constructor’s environment_definition parameter instead.
image_registry_details (ContainerRegistry)
The details of the Docker image registry.
user_managed (bool)
Specifies whether Azure ML reuses an existing Python environment. If false, a Python environment is created based on the conda dependencies specification.
conda_packages (list)
A list of strings representing conda packages to be added to the Python environment for the experiment.
pip_packages (list)
A list of strings representing pip packages to be added to the Python environment for the experiment.
conda_dependencies_file_path (str)
The relative path to the conda dependencies yaml file. If specified, Azure ML will not install any framework related packages
DEPRECATED. Use the conda_dependencies_file parameter.

Specify either conda_dependencies_file_path or conda_dependencies_file. If both are specified, conda_dependencies_file is used.
pip_requirements_file_path (str)
The relative path to the pip requirements text file.
DEPRECATED. Use the pip_requirements_file parameter.
This parameter can be specified in combination with the pip_packages parameter. Specify either pip_requirements_file_path or pip_requirements_file. If both are specified, pip_requirements_file is used.
conda_dependencies_file (str)
The relative path to the conda dependencies yaml file. If specified, Azure ML will not install any framework related packages.
pip_requirements_file (str)
The relative path to the pip requirements text file. This parameter can be specified in combination with the pip_packages parameter.
environment_variables (dict)
A dictionary of environment variables names and values. These environment variables are set on the process where user script is being executed.
environment_definition (Environment)
The environment definition for the experiment. It includes PythonSection, DockerSection, and environment variables. Any environment option not directly exposed through other parameters to the Estimator construction can be set using this parameter. If this parameter is specified, it will take precedence over other environment-related parameters like use_gpu, custom_docker_image, conda_packages, or pip_packages. Errors will be reported on invalid combinations.
inputs (list)
A list of DataReference or DatasetConsumptionConfig objects to use as input.
source_directory_data_store (Datastore)
The backing data store for the project share.
shm_size (str)
The size of the Docker container’s shared memory block. If not set, the default azureml.core.environment._DEFAULT_SHM_SIZE is used. For more information, see Docker run reference.
resume_from (DataPath)
The data path containing the checkpoint or model files from which to resume the experiment.

max_run_duration_seconds (int)
The maximum allowed time for the run. Azure ML will attempt to automatically cancel the run if it takes longer than this value.

35
Q

[PERSONAL] Write code for an estimator that uses the remote compute.

A
# Get the training dataset
diabetes_ds = ws.datasets.get("Diabetes Dataset")
# Create an estimator that uses the remote compute
hyper_estimator = SKLearn(source_directory=experiment_folder,
                           inputs=[diabetes_ds.as_named_input('diabetes')], # Pass the dataset as an input
                           compute_target = cpu_cluster,
                           conda_packages=['pandas','ipykernel','matplotlib'],
                           pip_packages=['azureml-sdk','argparse','pyarrow'],
                           entry_script='diabetes_training.py') 

source (this is a good source for general setup): https://notebooks.azure.com/GraemeMalcolm/projects/azureml-primers/html/04%20-%20Optimizing%20Model%20Training.ipynb

36
Q

[EXAM - UDEMY]
You are creating a new experiment in Azure Machine Learning Studio.

You have a small dataset that has missing values in many columns.

The data does not require the application of predictors for each column.

You plan to use the Clean Missing Data. You need to select a data cleaning method.

Which method should you use?

  • SMOTE ( synthetic minority oversampling technique)
  • Replace using probabilistic PCA
  • Replace using MICE
  • Normalization
A

Use the Replace using Probabilistic PCA cleaning method (one of the options of the Clean Missing Data module).

Replace using Probabilistic PCA: Compared to other options, such as Multiple Imputation using Chained Equations (MICE), this option has the advantage of not requiring the application of predictors for each column. Instead, it approximates the covariance for the full dataset. Therefore, it might offer better performance for datasets that have missing values in many columns.

37
Q

[PERSONAL]

Replace using Probabilistic PCA

A

YT: https://www.youtube.com/watch?v=6z6yipdfe3o

Replaces the missing values by using a linear model that analyzes the correlations between the columns and estimates a low-dimensional approximation of the data, from which the full data is reconstructed. The underlying dimensionality reduction is a probabilistic form of Principal Component Analysis (PCA), and it implements a variant of the model proposed in the Journal of the Royal Statistical Society, Series B 21(3), 611–622 by Tipping and Bishop.

Compared to other options, such as Multiple Imputation using Chained Equations (MICE), this option has the advantage of not requiring the application of predictors for each column. Instead, it approximates the covariance for the full dataset. Therefore, it might offer better performance for datasets that have missing values in many columns.

The key limitations of this method are that it expands categorical columns into numerical indicators and computes a dense covariance matrix of the resulting data. It also is not optimized for sparse representations. For these reasons, datasets with large numbers of columns and/or large categorical domains (tens of thousands) are not supported due to prohibitive space consumption.

38
Q

[PERSONAL] Pro’s and cons for using Probabilistic PCA

A

Pros:
  • Does not require the application of predictors for each column; it approximates the covariance for the full dataset, so it might offer better performance for datasets that have missing values in many columns.

Cons:
  • Computes a dense covariance matrix (it expands categorical columns into numerical indicators).
  • Not optimized for sparse representations.
  • Not good for datasets with large numbers of columns and/or large categorical domains.
39
Q

[EXAM - UDEMY]
You are a data scientist using Azure Machine Learning Studio.

You are evaluating a completed binary classification machine learning model.

You need to use the precision as the evaluation metric.

Which visualization should you use?

  • box-plot
  • binary classification confusion matrix
  • violin plot
  • gradient descent
A

Explanation
Correct answer: the binary classification confusion matrix. Precision = TP / (TP + FP), and both counts can be read directly from the confusion matrix.

Incorrect Answers:

1) A violin plot is a visual that traditionally combines a box plot and a kernel density plot.
2) Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
3) A box plot lets you see basic distribution information about your data, such as median, mean, range and quartiles, but doesn’t show you how your data looks throughout its range.
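To see how precision is read off the matrix, here is a small sketch (assumes scikit-learn; the labels are made up):

# precision = TP / (TP + FP), taken from the binary confusion matrix.
from sklearn.metrics import confusion_matrix, precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fp))                    # precision from the matrix
print(precision_score(y_true, y_pred))   # same value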

40
Q

[EXAM - UDEMY] You are analyzing a raw dataset that requires cleaning.

You must perform transformations and manipulations by using Azure Machine Learning Studio.

You need to identify the correct module to perform the below transformation.

Which module should you choose?

Scenario:

Replace missing values by removing rows and columns

  • clean missing data
  • convert to indicator values
  • remove duplicate rows
  • threshold filter
  • smote
A

Clean missing data

Each time that you apply the Clean Missing Data module to a set of data, the same cleaning operation is applied to all columns that you select. Therefore, if you need to clean different columns using different methods, use separate instances of the module.

Add the Clean Missing Data module to your pipeline, and connect the dataset that has missing values.

For Columns to be cleaned, choose the columns that contain the missing values you want to change. You can choose multiple columns, but you must use the same replacement method in all selected columns. Therefore, typically you need to clean string columns and numeric columns separately.

For example, to check for missing values in all numeric columns:

Select the Clean Missing Data module, and click on Edit column in the right panel of the module.

For Include, select Column types from the dropdown list, and then select Numeric.

Any cleaning or replacement method that you choose must be applicable to all columns in the selection. If the data in any column is incompatible with the specified operation, the module returns an error and stops the pipeline.

41
Q

[EXAM - UDEMY]
You are a data scientist using Azure Machine Learning Studio.

You are creating a machine learning model.

You need to identify outliers in the data.

Which two visualizations can you use?

  • random forest diagram
  • Venn diagram
  • Scatter plot
  • ROC-curve
  • BOX-plot
A

Explanation
The box-plot algorithm can be used to display outliers.

One other way to quickly identify Outliers and represent visually is to create scatter plots.

42
Q

[PERSONAL]

ROC-curve

A

The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR). Classifiers that give curves closer to the top-left corner indicate a better performance. As a baseline, a random classifier is expected to give points lying along the diagonal (FPR = TPR). The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
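A sketch of plotting a ROC curve (assumes scikit-learn and matplotlib; the scores are made up):

# ROC curve and its area under the curve (AUC).
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]   # predicted probabilities

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")   # the 45-degree "random classifier" diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()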

43
Q

[PERSONAL] Fisher score

A

Fisher score is one of the most widely used supervised feature selection methods. However, it selects each feature independently according to their scores under the Fisher criterion, which leads to a suboptimal subset of features.

In mathematical statistics, the Fisher information (sometimes simply called information) is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X.

Extra information: https://towardsdatascience.com/overview-of-feature-selection-methods-a2d115c7a8f7

44
Q

[PERSONAL]

Mutual Information

A

Mutual Information

The mutual information score is particularly useful in feature selection because it maximizes the mutual information between the joint distribution and target variables in datasets with many dimensions.

45
Q

[EXAM - UDEMY]

You are data science instructor of your company

You plan to deliver a hands-on workshop to several students.

The workshop will focus on creating data visualizations using Python.

Each student will use a device that has internet access.

Student devices are not configured for Python development.

Students do not have administrator access to install software on their devices.

Azure subscriptions are not available for students.

You need to ensure that students can run Python-based data visualization code.

  • azure notebooks
  • Azure ML service
  • Anaconda data science platform
  • Azure Batch AI
A
  • azure notebooks
46
Q

[EXAM - UDEMY]
Your supervisor asked you to preprocess text from CSV files.

You load the Azure Machine Learning Studio default stop words list.

You need to configure the Preprocess Text module to meet the following requirements:

§ Ensure that multiple related words map to a single canonical form.

§ Remove pipe characters from text.

§ Remove words to optimize information retrieval.

Which three options should you select?

A
  • remove stop words
  • lemmatization
  • remove special characters
47
Q

[PERSONAL] Lemmatisation (or lemmatization) and difference with stemming.

A

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

48
Q

[PERSONAL] Stemming

A

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
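The contrast between the two can be sketched with NLTK (an assumption; the wordnet corpora must be downloaded once):

# Stemming strips suffixes blindly; lemmatization returns the dictionary form for a given part of speech.
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "better"]:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))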

49
Q

[EXAM - UDEMY]

You plan to explore demographic data for home ownership in various cities. The data is in a CSV file with the following format:

age,city,income,home_owner 
21,Chicago,50000,0 
35,Seattle,120000,1 
23,Seattle,65000,0 
45,Seattle,130000,1 
18,Chicago,48000,0
You need to run an experiment in your Azure Machine Learning workspace to explore the data and log the results. The experiment must log the following information:
  • the number of observations in the dataset
  • a box plot of income by home_owner
  • a dictionary containing the city names and the average income for each city

You need to use the appropriate logging methods of the experiment’s run object to log the required information.

How should you complete the code?

A

log
log_image
log_table

Explanation
Box 1: log The number of observations in the dataset.

run.log(name, value, description='') Scalar values: Log a numerical or string value to the run with the given name. Logging a metric to a run causes that metric to be stored in the run record in the experiment. You can log the same metric multiple times within a run, the result being considered a vector of that metric.

Example: run.log("accuracy", 0.95)

Box 2: log_image A box plot of income by home_owner.

log_image Log an image to the run record. Use log_image to log a .PNG image file or a matplotlib plot to the run. These images will be visible and comparable in the run record. Example: run.log_image("ROC", plot=plt)

Box 3: log_table A dictionary containing the city names and the average income for each city. log_table: Log a dictionary object to the run with the given name.
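Putting the three calls together, a sketch of the training script (assumes the azureml-core SDK from the question; the file name homes.csv is hypothetical and the column names follow the CSV sample above):

# Log a scalar, an image and a table to the current Azure ML run.
import matplotlib.pyplot as plt
import pandas as pd
from azureml.core import Run

run = Run.get_context()
df = pd.read_csv("homes.csv")             # hypothetical path to the CSV described above

run.log("observations", len(df))          # 1) scalar: number of observations

fig = plt.figure()                        # 2) image: box plot of income by home_owner
df.boxplot(column="income", by="home_owner", ax=fig.gca())
run.log_image("income by home_owner", plot=fig)

city_income = df.groupby("city")["income"].mean().to_dict()   # 3) table: city -> average income
run.log_table("average income by city",
              {"city": list(city_income.keys()), "avg_income": list(city_income.values())})

run.complete()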

50
Q

[EXAM - UDEMY]

You are a data scientist using Azure Machine Learning Studio.

You are creating a machine learning model in Python.

The provided dataset contains several numerical columns and one text column.

The text column represents a product’s category.

The product category will always be one of the following:

§ Bikes

§ Cars

§ Vans

§ Boats

You are building a regression model using the scikit-learn Python package.

You need to transform the text data to be compatible with the scikit-learn Python package.

How should you complete the code segment? To answer, select the appropriate options in the answer area.

A

Import pandas as df
Use the .map() method of the DataFrame with a category-to-number mapping

Explanation
Box 1: pandas as df

Pandas takes data like a CSV or TSV file, or a SQL database and creates a Python object with rows and columns called data frame that looks very similar to table in a statistical software (think Excel or SPSS for example).

Box 2: map[ProductCategoryMapping]
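A sketch of that mapping approach (assumes pandas; the frame and the mapping name are illustrative):

# Map the category strings to numbers so scikit-learn can consume the column.
import pandas as pd

df = pd.DataFrame({"price": [120, 540, 310],
                   "category": ["Bikes", "Cars", "Boats"]})

product_category_mapping = {"Bikes": 0, "Cars": 1, "Vans": 2, "Boats": 3}
df["category"] = df["category"].map(product_category_mapping)
print(df)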

51
Q

[EXAM - UDEMY]

DRAG DROP

You have a dataset that contains over 150 features.

You use the dataset to train a Support Vector Machine (SVM) binary classifier.

You need to use the Permutation Feature Importance module in Azure Machine Learning Studio to compute a set of feature importance scores for the dataset.

In which order should you perform the actions?

Add a Two-Class Support Vector Machine module to initialize the SVM classifier.

Add a dataset to the experiment

Set the Metric for measuring performance property to Classification - Accuracy and then run the experiment.

Add a Split Data module to create training and test dataset.

Add a Permutation Feature Importance module and connect to the trained model and test dataset.

A

Step 1: Add a Two-Class Support Vector Machine module to initialize the SVM classifier.

Step 2: Add a dataset to the experiment

Step 3: Add a Split Data module to create training and test dataset.

Step 4: Add a Permutation Feature Importance module and connect to the trained model and test dataset.

Step 5: Set the Metric for measuring performance property to Classification - Accuracy and then run the experiment.

52
Q

[EXAM - UDEMY]

You are a data scientist and you use Azure Machine Learning Studio.

You use Azure Machine Learning Studio to build a machine learning experiment.

You need to divide data into two distinct datasets.

Which module should you use?

A

Split data

53
Q

[EXAM - UDEMY]

You are a data scientist using Azure Machine Learning Studio.

You are using the Azure Machine Learning Service to automate hyperparameter exploration of your neural network classification model.

You must define the hyperparameter space to automatically tune hyperparameters using random sampling according to following requirements:

§ The learning rate must be selected from a normal distribution with a mean value of 10 and a standard deviation of 3.

§ Batch size must be 16, 32 and 64.

§ Keep probability must be a value selected from a uniform distribution between the range of 0.05 and 0.1.

You need to use the param_sampling method of the Python API for the Azure Machine Learning Service.

How should you complete the code segment?

param_sampling = RandomParameterSampling({"learning_rate": ?, "batch_size": ?, "keep_probability": ?})

A

normal(10,3)
batch_size = choice(16,32,64)
keep_probability = uniform(0.05,0.1)

Explanation

Random sampling allows the search space to include both discrete and continuous hyperparameters.

In random sampling, hyperparameter values are randomly selected from the defined search space.
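The completed code segment, sketched with the HyperDrive parameter expressions (assumes the azureml.train.hyperdrive package):

# Random sampling over a normal, a discrete and a uniform hyperparameter.
from azureml.train.hyperdrive import RandomParameterSampling, choice, normal, uniform

param_sampling = RandomParameterSampling({
    "learning_rate": normal(10, 3),          # mean 10, standard deviation 3
    "batch_size": choice(16, 32, 64),        # discrete set of values
    "keep_probability": uniform(0.05, 0.1),  # continuous range
})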

54
Q

[EXAM - UDEMY]

You are a data scientist using Azure Machine Learning Studio.

You are analyzing a dataset by using Azure Machine Learning Studio.

You need to generate a statistical summary that contains the p-value and the unique count for each feature column.

Which two modules can you use?

  • export count table
  • computer linear correlation
  • execute python script
  • summarize data
  • convert to indicator values
A
Explanation
Correct answers: Summarize Data and Execute Python Script.

The Export Count Table module is provided for backward compatibility with experiments that use the Build Count Table (deprecated) and Count Featurizer (deprecated) modules, and is not the right choice here.

Summarize Data statistics are useful when you want to understand the characteristics of the complete dataset. For example, you might need to know:

§ How many missing values are there in each column?

§ How many unique values are there in a feature column?

§ What is the mean and standard deviation for each column?

The module calculates the important scores for each column, and returns a row of summary statistics for each variable (data column) provided as input.

Incorrect Answers:

The Compute Linear Correlation module in Azure Machine Learning Studio is used to compute a set of Pearson correlation coefficients for each possible pair of variables in the input dataset.

With Python, you can perform tasks that aren’t currently supported by existing Studio modules such as:

§ Visualizing data using matplotlib

§ Using Python libraries to enumerate datasets and models in your workspace

§ Reading, loading, and manipulating data from sources not supported by the Import Data module

The purpose of the Convert to Indicator Values module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model.

55
Q

[EXAM - UDEMY]

You are building a regression model for estimating the number of calls during an event hosting by your company.

You need to determine whether the feature values achieve the conditions to build a Poisson regression model.

Which two conditions must the feature set contain? Each correct answer presents part of the solution.

  • sign of the label data?
  • type of number?
A

Label data must be positive whole numbers

Poisson regression is intended for use in regression models that are used to predict numeric values, typically counts. Therefore, you should use this module to create your regression model only if the values you are trying to predict fit the following conditions:

§ The response variable has a Poisson distribution.

§ Counts cannot be negative. The method will fail outright if you attempt to use it with negative labels.

§ A Poisson distribution is a discrete distribution; therefore, it is not meaningful to use this method with non-whole numbers.

References:

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/poisson-regression

56
Q

[Personal]

Poisson distribution

A

is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

57
Q

[Personal]

Melt function in pandas:

A

pd.melt(dataframe, id_vars='shop', value_vars=['2017', '2018'])

Pandas melt() function is used to change the DataFrame format from wide to long. It’s used to create a specific format of the DataFrame object where one or more columns work as identifiers. All the remaining columns are treated as values and unpivoted to the row axis and only two columns – variable and value.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html
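A small sketch of melt in action (the frame is made up):

# Wide table (one column per year) melted into long format.
import pandas as pd

sales = pd.DataFrame({"shop": ["A", "B"],
                      "2017": [100, 150],
                      "2018": [120, 170]})

long_form = pd.melt(sales, id_vars="shop", value_vars=["2017", "2018"])
print(long_form)   # columns: shop, variable (the year), value (the figure)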

58
Q

[EXAM - UDEMY]
You are senior data scientist of your company

You are evaluating a completed binary classification machine learning model.

You need to use the precision as the evaluation metric.

Which visualization should you use?

A

Explanation
Receiver operating characteristic (or ROC) is a plot of the correctly classified labels vs. the incorrectly classified labels for a particular model.

Incorrect Answers:

A violin plot is a visual that traditionally combines a box plot and a kernel density plot.

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

A scatter plot graphs the actual values in your data against the values predicted by the model. The scatter plot displays the actual values along the X-axis, and displays the predicted values along the Y-axis. It also displays a line that illustrates the perfect prediction, where the predicted value exactly matches the actual value.

59
Q

[EXAM - UDEMY]

You are employee of digiTechClouds and you are analyzing the asymmetry in a statistical distribution.

The following image contains two density curves that show the probability distribution of two datasets.

A

https://www.statisticshowto.com/probability-and-statistics/skewed-distribution

60
Q

[EXAM - UDEMY]

You are junior data scientist of your company.

You are building a machine learning model for translating English language textual content into French language textual content.

You need to build and train the machine learning model to learn the sequence of the textual content.

Which type of neural network should you use?

  • Recurrent neural networks
  • Convolutional neural networks
  • Multilayer perceptrons
  • Generative adversarial networks
A

Recurrent neural networks
You need to build a recurrent neural network (RNN) to translate a corpus of English text to French.

Note:

RNNs are designed to take sequences of text as inputs or return sequences of text as outputs, or both. They’re called recurrent because the network’s hidden layers have a loop in which the output and cell state from each time step become inputs at the next time step. This recurrence serves as a form of memory.

It allows contextual information to flow through the network so that relevant outputs from previous time steps can be applied to network operations at the current time step.

61
Q

[PERSONAL]

Multi-layer perceptron neural networks

A

https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/

MLPs are suitable for classification prediction problems where inputs are assigned a class or label.

They are also suitable for regression prediction problems where a real-valued quantity is predicted given a set of inputs. Data is often provided in a tabular format, such as you would see in a CSV file or a spreadsheet.

Use MLPs For:

Tabular datasets
Classification prediction problems
Regression prediction problems

As such, if your data is in a form other than a tabular dataset, such as an image, document, or time series, I would recommend at least testing an MLP on your problem. The results can be used as a baseline point of comparison to confirm that other models that may appear better suited add value.

Try MLPs On:

Image data
Text Data
Time series data
Other types of data

62
Q

[PERSONAL] Convolutional Neural Networks

A

Convolutional Neural Networks, or CNNs, were designed to map image data to an output variable.

They have proven so effective that they are the go-to method for any type of prediction problem involving image data as an input.

The benefit of using CNNs is their ability to develop an internal representation of a two-dimensional image. This allows the model to learn position- and scale-invariant structures in the data, which is important when working with images.

Use CNNs For:

  • Image data
  • Classification prediction problems
  • Regression prediction problems

More generally, CNNs work well with data that has a spatial relationship.

The CNN input is traditionally two-dimensional, a field or matrix, but can also be changed to be one-dimensional, allowing it to develop an internal representation of a one-dimensional sequence.

This allows the CNN to be used more generally on other types of data that has a spatial relationship. For example, there is an order relationship between words in a document of text. There is an ordered relationship in the time steps of a time series.

Although not specifically developed for non-image data, CNNs achieve state-of-the-art results on problems such as document classification used in sentiment analysis and related problems.

63
Q

[PERSONAL] Recurrent Neural Networks

A

Recurrent Neural Networks, or RNNs, were designed to work with sequence prediction problems.

Sequence prediction problems come in many forms and are best described by the types of inputs and outputs supported.

Some examples of sequence prediction problems include:

One-to-Many: An observation as input mapped to a sequence with multiple steps as an output.
Many-to-One: A sequence of multiple steps as input mapped to class or quantity prediction.
Many-to-Many: A sequence of multiple steps as input mapped to a sequence with multiple steps as output

RNNs in general and LSTMs in particular have received the most success when working with sequences of words and paragraphs, generally called natural language processing.

This includes both sequences of text and sequences of spoken language represented as a time series. They are also used as generative models that require a sequence output, not only with text, but on applications such as generating handwriting.

Use RNNs For:

Text data
Speech data
Classification prediction problems
Regression prediction problems
Generative models 

Don’t Use RNNs For:

Tabular data
Image data

64
Q

[PERSONAL] Generative adversarial network

A

They help to solve such tasks as image generation from descriptions, getting high resolution images from low resolution ones, predicting which drug could treat a certain disease, retrieving images that contain a given pattern, etc

65
Q

[EXAM - UDEMY]
You are creating a model to predict the price of a student’s artwork depending on the following variables: the student’s length of education, degree type, and art form.

You start by creating a linear regression model.

You need to evaluate the linear regression model.

Solution:

Use the following metrics: Mean Absolute Error, Root Mean Absolute Error, Relative Absolute Error, Relative Squared Error, and the Coefficient of Determination.

Does the solution meet the goal?

A

Yes, the solution meets the goal. The following metrics are reported for evaluating regression models. When you compare models, they are ranked by the metric you select for evaluation.

Mean absolute error (MAE) measures how close the predictions are to the actual outcomes; thus, a lower score is better.

Root mean squared error (RMSE) creates a single value that summarizes the error in the model. By squaring the difference, the metric disregards the difference between over-prediction and under-prediction.

Relative absolute error (RAE) is the relative absolute difference between expected and actual values; relative because the mean difference is divided by the arithmetic mean.

Relative squared error (RSE) similarly normalizes the total squared error of the predicted values by dividing by the total squared error of the actual values.

Mean Zero One Error (MZOE) indicates whether the prediction was correct or not. In other words: ZeroOneLoss(x,y) = 1 when x!=y; otherwise 0.

Coefficient of determination, often referred to as R2, represents the predictive power of the model as a value between 0 and 1. Zero means the model is random (explains nothing); 1 means there is a perfect fit. However, caution should be used in interpreting R2 values, as low values can be entirely normal and high values can be suspect.

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/evaluate-model
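As a rough illustration of how these regression metrics relate to each other, here is a minimal sketch using scikit-learn and NumPy (the y_true/y_pred arrays are made-up example values, not data from the question):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # actual values (example data)
y_pred = np.array([2.5, 5.5, 7.0, 11.0])   # model predictions (example data)

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Relative errors are normalized by a naive model that always predicts the mean of y_true
rae = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - y_true.mean()))
rse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} RAE={rae:.3f} RSE={rse:.3f} R2={r2:.3f}")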

66
Q

[EXAM - UDEMY]
You are a junior data scientist at your company and you use Azure ML Studio for your work.

You create a binary classification model by using Azure Machine Learning Studio.

You must tune hyperparameters by performing a parameter sweep of the model.

The parameter sweep must meet the following requirements:
§ iterate all possible combinations of hyperparameters

§ minimize computing resources required to perform the sweep

You need to perform a parameter sweep of the model.

Which parameter sweep mode should you use?

  • entire grid
  • sweep clustering
  • random sweep
  • random grid
A

Explanation
Maximum number of runs on random grid:

This option also controls the number of iterations over a random sampling of parameter values, but the values are not generated randomly from the specified range; instead, a matrix is created of all possible combinations of parameter values and a random sampling is taken over the matrix. This method is more efficient and less prone to regional oversampling or undersampling.

For Random seed, type a number to use when initializing the parameter sweep.

If you are training a model that supports an integrated parameter sweep, you can also set a range of seed values to use and iterate over the random seeds as well. This is optional, but can be useful for avoiding bias introduced by seed selection.

Incorrect Answers:

If you are building a clustering model, use Sweep Clustering to automatically determine the optimum number of clusters and other parameters.

Entire grid: When you select this option, the module loops over a grid predefined by the system, to try different combinations and identify the best learner. This option is useful for cases where you don’t know what the best parameter settings might be and want to try all possible combination of values.

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters

67
Q

[PERSONAL]

Integrated train and tune

A

Integrated train and tune: You configure a set of parameters to use, and then let the module iterate over multiple combinations, measuring accuracy until it finds a “best” model. With most learner modules, you can choose which parameters should be changed during the training process, and which should remain fixed.

Depending on how long you want the tuning process to run, you might decide to exhaustively test all combinations, or you could shorten the process by establishing a grid of parameter combinations and testing a randomized subset of the parameter grid.

68
Q

[PERSONAL]

Cross validation with tuning

A

With this option, you divide your data into some number of folds and then build and test models on each fold. This method provides the best accuracy and can help find problems with the dataset; however, it takes longer to train.

69
Q

[PERSONAL] all types of parameter sweeps

A

Random sweep: This option trains a model using a set number of iterations.

You specify a range of values to iterate over, and the module uses a randomly chosen subset of those values. Values are chosen with replacement, meaning that numbers previously chosen at random are not removed from the pool of available numbers. Thus, the chance of any value being selected remains the same across all passes.

Grid sweep: This option creates a matrix, or grid, that includes every combination of the parameters in the value range you specify. When you start tuning with this module, multiple models are trained using combinations of these parameters.

Entire grid: The option to use the entire grid means just that: each and every combination is tested. This option can be considered the most thorough, but requires the most time.

Random grid: If you select this option, the matrix of all combinations is calculated and values are sampled from the matrix, over the number of iterations you specified.

Recent research has shown that random sweeps can perform better than grid sweeps.
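Outside of the Studio module, the same entire-grid vs. random trade-off can be seen with scikit-learn's GridSearchCV and RandomizedSearchCV. A minimal sketch (the parameter grid, data, and iteration count are arbitrary examples):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}

# Entire grid: every one of the 16 combinations is trained and scored
grid = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)

# Random sweep: only n_iter randomly sampled combinations are tried
rand = RandomizedSearchCV(SVC(), param_grid, n_iter=5, cv=3, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)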

70
Q

[PERSONAL]

Define accuracy

A

The proportion of true results (both true positives and true negatives) to total cases.

71
Q

[PERSONAL]

Define Precision

A

The proportion of positive predictions that are true positives: TP / (TP + FP).

72
Q

[PERSONAL]

Define Recall

A

The fraction of actual positive cases that the model correctly identifies: TP / (TP + FN).

73
Q

[PERSONAL]

Define F-score

A

A measure that balances precision and recall (the F1 score is their harmonic mean).

74
Q

[PERSONAL]

Define AUC

A

A value that represents the area under the curve when false positives are plotted on the x-axis and true positives are plotted on the y-axis
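All of the classification metrics in cards 70-74 can be computed with scikit-learn; a minimal sketch on made-up labels and scores:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual classes (example data)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                    # predicted classes
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]    # predicted probabilities for class 1

print("accuracy :", accuracy_score(y_true, y_pred))   # true results / total cases
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print("auc      :", roc_auc_score(y_true, y_score))   # area under the ROC curve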

75
Q

[PERSONAL]

Define Average Log Loss

A

The difference between two probability distributions: the true one, and the one in the model.

76
Q

[PERSONAL]

Define Train Log Loss

A

The improvement provided by the model over a random prediction.

77
Q

[EXAM - UDEMY]
You are data scientist of your company and you use Azure Machine Learning Studio

You are developing a linear regression model in Azure Machine Learning Studio. You run an experiment to compare different algorithms.

The following image displays the results dataset output:

Which approach should you use to find the best parameters for a linear regression model for the online gradient descent method?

  • Set the decrease learning rate option to true
  • Set the decrease learning rate option to false
  • Set the create trainer mode option to parameter range
  • Increase the number of epochs
  • Decrease the number of epochs
A

Set the create trainer mode option to parameter range

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/linear-regression

For Create trainer mode, indicate whether you want to train the model with a predefined set of parameters, or if you want to optimize the model by using a parameter sweep.

Single Parameter: If you know how you want to configure the linear regression network, you can provide a specific set of values as arguments.

Parameter Range: If you want the algorithm to find the best parameters for you, set Create trainer mode option to Parameter Range. You can then specify multiple values for the algorithm to try.

78
Q

[EXAM - UDEMY]
You are data scientist of your company and you use Azure Machine Learning Studio

You are developing a linear regression model in Azure Machine Learning Studio. You run an experiment to compare different algorithms.

The following image displays the results dataset output:

which approach should you use to find the best parameters for a linear regression model for the online gradient descent method?

  • Set the decrease learning rate option to true
  • Set the decrease learning rate option to false
  • Set the create trainer mode option to parameter range
  • Increase the number of epochs
  • Decrease the number of epochs
A

Set the create trainer mode option to parameter range

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/linear-regression

For Create trainer mode, indicate whether you want to train the model with a predefined set of parameters, or if you want to optimize the model by using a parameter sweep.

Single Parameter: If you know how you want to configure the linear regression network, you can provide a specific set of values as arguments.

Parameter Range: If you want the algorithm to find the best parameters for you, set Create trainer mode option to Parameter Range. You can then specify multiple values for the algorithm to try.

79
Q

[PERSONAL] what is the number of training epochs parameter with linear regression?

A

a value that indicates how many times the algorithm should iterate through examples. For datasets with a small number of examples, this number should be large to reach convergence.

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/linear-regression

80
Q

[EXAM - UDEMY]

You are working as a ML professional for your company.

Your manager has asked you to build a recurrent neural network to perform a binary classification.

You review the training loss, validation loss, training accuracy, and validation accuracy for each training epoch.

You need to analyze model performance.

You need to identify whether the classification model is overfitted.

Which of the following is correct?

  • the training loss stays constant and the validation loss stays on a constant value and close to the training loss value when training the model
  • the training loss increases while the validation loss decreases when training the model
  • the training loss decreases while the validation loss increases when training the model.
  • the training loss stays constant and the validation loss decreases when training the model.
A

Explanation
An overfit model is one where performance on the train set is good and continues to improve, whereas performance on the validation set improves to a point and then begins to degrade.

  • the training loss decreases while the validation loss increases when training the model.
    https://machinelearningmastery.com/diagnose-overfitting-underfitting-lstm-models/
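
A minimal sketch of what that overfitting pattern looks like in per-epoch loss values (the numbers are invented for illustration):

# Hypothetical per-epoch losses for an overfitting run
train_loss = [0.90, 0.60, 0.40, 0.30, 0.20, 0.15]
val_loss   = [0.95, 0.70, 0.60, 0.62, 0.70, 0.85]

best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Overfitting signature: training loss keeps falling while validation loss rises after its minimum
overfitting = train_loss[-1] < train_loss[best_epoch] and val_loss[-1] > val_loss[best_epoch]
print(f"validation loss bottomed out at epoch {best_epoch}; overfitting = {overfitting}")
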
81
Q

[EXAM - UDEMY]
You are creating a model to predict the price of a student’s artwork depending on the following variables: the student’s length of education, degree type, and art form.

You start by creating a linear regression model.

You need to evaluate the linear regression model.

Solution:

Use the following metrics: Accuracy, Precision, Recall, F1 score, and AUC.

Does the solution meet the goal?

A

Explanation
The provided options are metrics for evaluating classification models. Instead of those you can use Mean Absolute Error, Root Mean Absolute Error, Relative Absolute Error, Relative Squared Error, and the Coefficient of Determination.

82
Q

[EXAM - UDEMY]

You are a data scientist of the company named digiTechClouds

You are building a deep convolutional neural network (CNN) for image classification.

The CNN model you build shows signs of overfitting.

You need to reduce overfitting and converge the model to an optimal fit.

Which two actions should you perform? Each correct answer presents a complete solution.

  • add L1/L2 regularization
  • reduce the amount of training data
  • use training data augmentation
  • add an additional dense layer with 64 input units
  • add an additional dense layer with 512 input units
A

Explanation
Weight regularization provides an approach to reduce the overfitting of a deep learning neural network model on the training data and improve the performance of the model on new data, such as the holdout test set.

Keras provides a weight regularization API that allows you to add a penalty for weight size to the loss function.

Three different regularizer instances are provided; they are:

§ L1: Sum of the absolute weights.

§ L2: Sum of the squared weights.

§ L1L2: Sum of the absolute and the squared weights.

https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-learning-with-weight-regularization/
https://en.wikipedia.org/wiki/Convolutional_neural_network

L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to multiplicative interactions between weights and inputs this has the useful property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.

L1 regularization is another common form. It is possible to combine L1 with L2 regularization (this is called Elastic net regularization). The L1 regularization leads the weight vectors to become sparse during optimization. In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the noisy inputs.

Data augmentation is a technique to artificially create new training data from existing training data. This is done by applying domain-specific techniques to examples from the training data that create new and different training examples.
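A minimal Keras sketch of adding an L2 weight penalty to a layer (the layer sizes, input shape, and the 0.01 factor are arbitrary examples, not values from the question):

from tensorflow.keras import layers, models, regularizers

# A small CNN where the dense layer's weights are penalized by an L2 term added to the loss
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])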

83
Q

[PERSONAL]

Data augmentation

A

Data augmentation is a technique to artificially create new training data from existing training data. This is done by applying domain-specific techniques to examples from the training data that create new and different training examples. You can use Keras for this

Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data. Data augmentation techniques such as cropping, padding, and horizontal flipping are commonly used to train large neural networks
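A minimal sketch of image augmentation with Keras' ImageDataGenerator (the specific transformations and ranges are arbitrary examples):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Randomly rotate, shift, zoom and flip training images to create new, slightly different examples
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
)
# train_images / train_labels are assumed to be existing NumPy arrays of image data:
# augmented_batches = datagen.flow(train_images, train_labels, batch_size=32)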

84
Q

[EXAM - UDEMY]
You are an employee of digTechClouds.

You are performing feature engineering on a dataset.

You must add a feature named CityName and populate the column value with the text London.

You need to add the new feature to the dataset.

Which Azure Machine Learning Studio module should you use?

  • execute python script
  • filter based feature selection
  • Latent Dirichlet allocation
  • edit metadata
A

To add a new column you can use the Execute Python Script module. All new columns are labeled as "feature" by default. The Edit Metadata module cannot be used to add a new feature or column.

https://www.tensorflow.org/tutorials/structured_data/feature_columns
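A minimal sketch of what the Execute Python Script module's entry point could look like for this task. Studio (classic) calls a function named azureml_main that receives the input dataset as a pandas DataFrame; the body shown here is an illustrative assumption, not the official answer code:

import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # Add the new feature and populate every row with the constant value 'London'
    dataframe1['CityName'] = 'London'
    # Return the modified dataset; Studio treats the new column as a feature by default
    return dataframe1,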

85
Q

[PERSONAL]

Latent Dirichlet allocation (LDA)

A

In natural language processing, the latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model and belongs to the machine learning toolbox and in wider sense to the artificial intelligence toolbox.
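For intuition, a minimal scikit-learn sketch of fitting an LDA topic model on a toy corpus (the documents and topic count are arbitrary examples):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular household pets",
    "stocks fell sharply as the markets closed lower",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))   # per-document topic mixture (rows sum to 1)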

86
Q

[EXAM - UDEMY]

You are working as a Machine Learning professional of your company.

You are evaluating a Python NumPy array that contains six data points defined as follows: data = [10, 20, 30, 40, 50, 60]

You must generate the following output by using the k-fold algorithm implementation in the Python scikit-learn machine learning library:

train: [10 40 50 60], test: [20 30]
train: [20 30 40 60], test: [10 50]
train: [10 20 30 50], test: [40 60]

You need to implement a cross-validation to generate the output.

How should you complete the code segment?

To answer, select the appropriate code segment in the dialog box in the answer area.

A

from numpy import array
from sklearn.model_selection import KFold

data = array([10, 20, 30, 40, 50, 60])
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))

Explanation
K-Folds cross-validation provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default).

The parameter n_splits (int, default=3) is the number of folds; it must be at least 2.

87
Q

[EXAM - UDEMY]
Your manager has asked you to perform feature engineering on a dataset.

You must add a feature named CityName and populate the column value with the text London.

You need to add the new feature to the dataset.

Which Azure Machine Learning Studio module should you use?

  • preprocess text
  • Extract N-gram features from text
  • apply sql transformation
  • edit metadata
A

Explanation
Editing the metadata only allows you to rename or change the data type of existing columns; it cannot add a new column.

Since a new column/feature is being added, the correct answer is Apply SQL Transformation. For example, the module (which uses SQLite syntax and exposes the input dataset as t1) can add the column with: select *, 'London' as CityName from t1;

88
Q

[EXAM - UDEMY]

Your manager has provided you with a dataset created for multiclass classification tasks that contains a normalized numerical feature set with 10,000 data points and 150 features.

You use 75 percent of the data points for training and 25 percent for testing.

You are using the scikit-learn machine learning library in Python. You use X to denote the feature set and Y to denote class labels.

You create the following Python data frames:

X_train: training feature set
Y_train training class labels
x_test: testing feature set
y_test testing class labels

You need to apply the Principal Component Analysis (PCA) method to reduce the dimensionality of the feature set to 10 features in both training and testing sets.

How should you complete the code segment?

A

from sklearn.decomposition import PCA

pca = PCA(n_components=10)
X_train = pca.fit_transform(X_train)   # fit PCA on the training features, then transform them
x_test = pca.transform(x_test)         # reuse the fitted PCA to transform the test features
89
Q

[EXAM - UDEMY]

You are data scientist of your company and you use Azure Machine Learning Studio.

You are developing a linear regression model in Azure Machine Learning Studio. You run an experiment to compare different algorithms.

which algorithm minimizes differences between actual and predicted value

A

Choose algorithm that minimizes

Mean absolute error (MAE) measures how close the predictions are to the actual outcomes; thus, a lower score is better.

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/evaluate-model

90
Q

[EXAM - UDEMY]
You are trainee of digiTechClouds company

Your instructor asked you to perform clustering by using the K-means algorithm.

You need to define the possible termination conditions.

Which three conditions can you use? Each correct answer presents a complete solution.

  • centroids do not change between iterations
  • the residual sum of squares (RSS) rises above a threshold
  • the sum of distances between centroids reaches a maximum
  • A fixed number of iterations is executed.
  • The residual sum of squares (RSS) falls below a threshold
A
  • Centroids do not change between iterations
  • A fixed number of iterations is executed
  • The residual sum of squares (RSS) falls below a threshold.

Explanation
The algorithm terminates when the centroids stabilize or when a specified number of iterations are completed.

A measure of how well the centroids represent the members of their clusters is the residual sum of squares or RSS, the squared distance of each vector from its centroid summed over all vectors. RSS is the objective function and our goal is to minimize it.

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/k-means-clustering
https://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html
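A minimal NumPy sketch of a k-means loop that checks all three termination conditions (toy data, k=2, and an arbitrary RSS threshold; this illustrates the conditions, not Azure's implementation):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])   # toy 2-cluster data
k, max_iter, rss_threshold = 2, 100, 150.0
centroids = X[rng.choice(len(X), k, replace=False)]

for iteration in range(max_iter):                       # condition: fixed number of iterations
    labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    rss = ((X - new_centroids[labels]) ** 2).sum()      # residual sum of squares
    if np.allclose(new_centroids, centroids):           # condition: centroids do not change
        break
    centroids = new_centroids
    if rss < rss_threshold:                             # condition: RSS falls below a threshold
        break

print(f"stopped after {iteration + 1} iterations, RSS = {rss:.1f}")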

91
Q

[EXAM - UDEMY]
You are a lead data scientist for a project that tracks the health and migration of birds. You create a multi-image classification deep learning model that uses a set of labeled bird photos collected by experts. You plan to use the model to develop a cross-platform mobile app that predicts the species of bird captured by app users.

You must test and deploy the trained model as a web service. The deployed model must meet the following requirements:

  • An authenticated connection must not be required for testing.
  • The deployed model must perform with low latency during inferencing.
  • The REST endpoints must be scalable and should have a capacity to handle large number of requests when multiple end users are using the mobile application.

You need to verify that the web service returns predictions in the expected JSON format when a valid REST request is submitted.

Which compute resources should you use?

A

Explanation
ds-workstation notebook VM: An authenticated connection must not be required for testing.

On a Microsoft Azure virtual machine (VM), including a Data Science Virtual Machine (DSVM), you create local user accounts while provisioning the VM. Users then authenticate to the VM by using these credentials.

https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-common-identity

92
Q

[EXAM - UDEMY]
Are the following statements true or false?
Some code was given.

  • If a training cluster already exists in the workspace, it will be deleted and replaced
  • the wait_for_completion() method will not return until the aml-cluster compute has four active nodes.
  • if the code creates a new aml-cluster compute target, it may be preempted due to capacity constraints
  • the aml-cluster compute target is deleted from the workspace after the training completes.
A

Box 1: No If a training cluster already exists it will be used.

Box 2: Yes The wait_for_completion method waits for the current provisioning operation to finish on the cluster.

Box 3: Yes Low Priority VMs use Azure’s excess capacity and are thus cheaper but risk your run being pre-empted.

Box 4: No Need to use training_compute.delete() to deprovision and delete the AmlCompute target.

https://notebooks.azure.com/azureml/projects/azureml-getting-started/html/how-to-use-azureml/training/train-on-amlcompute/train-on-amlcompute.ipynb
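The question's code is not reproduced here, but a typical snippet it refers to looks roughly like this (the cluster name, VM size, and node counts below are assumptions for illustration, not the exam's code):

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# Low-priority VMs use spare Azure capacity, so runs can be pre-empted
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS2_V2',
                                                       vm_priority='lowpriority',
                                                       min_nodes=0,
                                                       max_nodes=4)
training_compute = ComputeTarget.create(ws, 'aml-cluster', compute_config)
training_compute.wait_for_completion(show_output=True)

# The cluster is not removed automatically; it must be deprovisioned explicitly:
# training_compute.delete()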

93
Q

[PERSONAL]

Types of feature scaling

A
  • standardscaler: assumes your data is normally distributed within each feature and will scale them such that the distribution is now centred around 0, with a standard deviation of 1.
  • min max scaler: It essentially shrinks the range such that the range is now between 0 and 1 (or -1 to 1 if there are negative values).

This scaler works better for cases in which the standard scaler might not work so well. If the distribution is not Gaussian or the standard deviation is very small, the min-max scaler works better.

  • robust scaler
    The RobustScaler uses a similar method to the Min-Max scaler, but it instead uses the interquartile range rather than the min-max, so that it is robust to outliers.
  • normalizer
    The normalizer scales each sample by dividing each value by the sample's magnitude in n-dimensional space (for n features). Each point is then within 1 unit of the origin on this Cartesian coordinate system.

https://benalexkeen.com/feature-scaling-with-scikit-learn/
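A minimal scikit-learn sketch of the four scalers side by side (the sample matrix is arbitrary example data):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])   # second feature has an outlier

print(StandardScaler().fit_transform(X))   # zero mean, unit variance per feature
print(MinMaxScaler().fit_transform(X))     # each feature rescaled to [0, 1]
print(RobustScaler().fit_transform(X))     # centered on the median, scaled by the IQR
print(Normalizer().fit_transform(X))       # each row scaled to unit length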

94
Q

[EXAM - UDEMY]
Your manager has asked you to perform feature engineering on a dataset.

You must add a feature named CityName and populate the column value with the text London.

You need to add the new feature to the dataset.

Which Azure Machine Learning Studio module should you use?

  • preprocess text
  • extract n-gram features from text
  • apply sql transformation
A

Explanation
Editing the metadata only allows you to rename or change the data type of existing columns; it cannot add a new column.

Since a new column/feature is being added, the correct answer is Apply SQL Transformation.

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/apply-sql-transformation

95
Q

[EXAM - UDEMY]
You are data scientist of your company

You are creating a binary classification by using a two-class logistic regression model.

You need to evaluate the model results for imbalance.

Which evaluation metric should you use?

  • Mean absolute error
  • Root Mean Square Error
  • Accuracy
  • Relative absolute error
  • AUC curve
  • Relative squared error
A

One can inspect the true positive rate vs. the false positive rate in the Receiver Operating Characteristic (ROC) curve and the corresponding Area Under the Curve (AUC) value. The closer this curve is to the upper left corner, the better the classifier’s performance is (that is maximizing the true positive rate while minimizing the false positive rate). Curves that are close to the diagonal of the plot, result from classifiers that tend to make predictions that are close to random guessing.

https://docs.microsoft.com/en-us/azure/machine-learning/studio/evaluate-model-performance#evaluating-a-binary-classification-model

96
Q

[EXAM - UDEMY]

You have a model with a large difference between the training and validation error values. You must create a new model and perform cross-validation. You need to identify a parameter set for the new model using Azure Machine Learning Studio.

Which module you should use for below step?

Step:

Define the parameter scope

  • partition and sample
  • two-class boosted decision tree
  • tune model hyperparameters
  • split data
A

Split data

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data

97
Q

[EXAM - UDEMY]
You have a model with a large difference between the training and validation error values. You must create a new model and perform cross-validation. You need to identify a parameter set for the new model using Azure Machine Learning Studio.

Which module you should use for below step?

Step:

Define the cross-validation settings

  • partition and sample
  • two-class boosted decision tree
  • tune model hyperparameters
  • split data
A

Explanation
Cross validation randomly divides the training data into a number of partitions, also called folds.

The algorithm defaults to 10 folds if you have not previously partitioned the dataset.

To divide the dataset into a different number of folds, you can use the Partition and Sample module and indicate how many folds to use.

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/cross-validate-model

98
Q

[PERSONAL]

KENDALL rank correlation coefficient

A

is a statistic used to measure the ordinal association between two measured quantities.

It is a supported method of the Azure Machine Learning Feature selection.
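For reference, the coefficient can also be computed outside Studio with SciPy; a minimal sketch on made-up rankings:

from scipy.stats import kendalltau

judge_a = [1, 2, 3, 4, 5]   # example rankings from two raters
judge_b = [2, 1, 4, 3, 5]

tau, p_value = kendalltau(judge_a, judge_b)
print(tau, p_value)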

99
Q

[EXAM - UDEMY]
You are working with a time series dataset in Azure Machine Learning Studio.

You need to split your dataset into training and testing subsets by using the Split Data module.

Which splitting mode should you use?

  • recommender split
  • split rows with the randomized split parameter set to true
  • relative expression split
  • regular expression split
A

Explanation
Relative Expression Split: Use this option whenever you want to apply a condition to a number column. The number can be a date/time field, a column that contains age or dollar amounts, or even a percentage. For example, you might want to divide your dataset based on the cost of the items, group people by age ranges, or separate data by a calendar date.

https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/split-data

100
Q

[PERSONAL]

Explain all the splitting modes

A
  • Split Rows: Use this option if you just want to divide the data into two parts. You can specify the percentage of data to put in each split. By default, the data is divided 50/50.
  • Regular Expression Split: Choose this option when you want to divide your dataset by testing a single column for a value.

For example, if you’re analyzing sentiment, you can check for the presence of a particular product name in a text field. You can then divide the dataset into rows with the target product name and rows without the target product name.

  • Relative Expression Split: Use this option whenever you want to apply a condition to a number column. The number can be a date/time field, a column that contains age or dollar amounts, or even a percentage. For example, you might want to divide your dataset based on the cost of the items, group people by age ranges, or separate data by a calendar date.
101
Q
[EXAM - UDEMY] 
You create a multi-class image classification deep learning experiment by using the PyTorch framework. You plan to run the experiment on an Azure compute cluster that has nodes with GPUs.

You need to define an Azure Machine Learning service pipeline to perform the monthly retraining of the image classification model. The pipeline must run with minimal cost and minimize the time required to train the model.

Which three pipeline steps should you run in sequence?

Given several options…

A

The PyTorch estimator provides a simple way to launch a PyTorch training job on a compute target

Step 1: Configure a DataTransferStep() to fetch new image data from public web portal, running on the cpu-compute compute target.

Step 6: Configure a PythonScriptStep() to run image_resize.py on the cpu-compute compute target.

Step 4: Configure an EstimatorStep() to run an estimator that runs the bird_classifier_train.py model training script on the gpu_compute compute target as GPUs are faster for computing than CPUs.

102
Q

[EXAM - UDEMY]
You are Azure Machine Learning expert.

You are using a decision tree algorithm. You have trained a model that generalizes well at a tree depth equal to 10.

You need to select the bias and variance properties of the model with varying tree depth values.

Which properties should you select for each tree depth?

Bias for 5? high, low, identical
variance for 5? high, low, identical
Bias for 15? high, low, identical
variance for 15? high, low, identical

A

Bias for 5 -> low,
variance for 5 -> high,
Bias for 15 -> high
variance for 15 -> low

Explanation
In decision trees, the depth of the tree determines the variance. A complicated decision tree (e.g. deep) has low bias and high variance.

https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/

103
Q
[EXAM - UDEMY] 
You are asked to use C-Support Vector classification to do a multi-class classification with an unbalanced training dataset.

The C-Support Vector classification Python code is shown below:

from sklearn.svm import SVC
import numpy as np

svc = SVC(kernel='linear', class_weight='balanced', C=1.0, random_state=0)
model = svc.fit(x_train, y)

You need to evaluate the C-Support Vector classification code.

Which evaluation statement should you use?

A
Explanation
Automatically adjust weights inversely proportional to class frequencies in the input data

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

Penalty parameter

Parameter: C : float, optional (default=1.0)

Penalty parameter C of the error term.

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

104
Q

[PERSONAL]

What is the objective of a support vector machine algorithm?

A

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space(N — the number of features) that distinctly classifies the data points.

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

105
Q

[EXAM - UDEMY]

You need to configure the estimator for the experiment so that the script can read the data from a data reference named data_ref that references the csv_files folder in the training_data datastore.

Which code should you use to configure the estimator?

A

from azureml.core import Workspace, Datastore, Experiment
from azureml.train.sklearn import SKLearn

ws = Workspace.from_config()
exp = Experiment(workspace=ws, name='csv_training')
ds = Datastore.get(ws, datastore_name='training_data')
data_ref = ds.path('csv_files')
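
The data reference is then wired into the estimator; a sketch of how that might look via script_params (the script name, compute target, and argument name below are assumptions for illustration):

estimator = SKLearn(source_directory='training_folder',
                    entry_script='train.py',
                    compute_target='local',
                    script_params={'--data-folder': data_ref.as_download(path_on_compute='csv_files')})
run = exp.submit(config=estimator)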

Explanation
Besides passing the dataset through the inputs parameter in the estimator, you can also pass the dataset through script_params and get the data path (mounting point) in your training script via arguments. This way, you can keep your training script independent of azureml-sdk. In other words, you will be able use the same training script for local debugging and remote training on any cloud platform.

https://docs.microsoft.com/es-es/azure/machine-learning/how-to-train-with-datasets

106
Q

[EXAM - UDEMY]

You create a binary classification model for your company.

You need to evaluate the model performance.

Which metrics can you use?

A

The evaluation metrics available for binary classification models are: Accuracy, Precision, Recall, F1 Score, and AUC.

107
Q

[EXAM - UDEMY]

You create a binary classification model for your company.

You need to evaluate the model performance.

Which metrics can you use?

A

The evaluation metrics available for binary classification models are: Accuracy, Precision, Recall, F1 Score, and AUC.

https://docs.microsoft.com/en-us/azure/machine-learning/studio/evaluate-model-performance

108
Q

[EXAM - UDEMY]

You are determining if two sets of data are significantly different from one another by using Azure Machine Learning Studio.

Estimated values in one set of data may be more than or less than reference values in the other set of data.

You must produce a distribution that has a constant Type I error as a function of the correlation.

You need to produce the distribution.

Which type of distribution should you produce?

A

Explanation
Use a paired t-test with a two-tail option. Choose a one-tail or two-tail test; the default is a two-tailed test. This is the most common type of test, in which the expected distribution is symmetric around zero.

Paired because they are estimated and reference values of the same thing (or at least I took that as implied). Thus, they are related and should vary together.

Example: Type I error of unpaired and paired two-sample t-tests as a function of the correlation. The simulated random numbers originate from a bivariate normal distribution with a variance of 1.

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/test-hypothesis-using-t-test
https://en.wikipedia.org/wiki/Student%27s_t-test

109
Q

[EXAM - UDEMY]

You use the Two-Class Neural Network module in Azure Machine Learning Studio to build a binary classification model.

You use the Tune Model Hyperparameters module to tune accuracy for the model.

You need to configure the Tune Model Hyperparameters module.

Which two values should you use? Each correct answer presents part of the solution

A

For Number of learning iterations, specify the maximum number of times the algorithm should process the training cases.

At the accuracy-tuning stage the model design is complete, which means the hidden-layer specification is already fixed; the remaining hyperparameters can be explored with a sweep.

All 5 values mentioned in the questions can be set, however only the “Learning Rate” and “Number of Learning Iterations” can be set as a range.

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters

110
Q

[EXAM - UDEMY]

You plan to provision an Azure Machine Learning Basic edition workspace for a data science project.

You need to identify the tasks you will be able to perform in the workspace.

Which three tasks will you be able to perform? Each correct answer presents a complete solution.

  • Use automated machine learning user interface to train a model
  • Create an Azure Kubernetes Service (AKS) inference cluster
  • Create a tabular dataset that supports versioning
  • Create a compute instance and use it to run code in Jupyter notebooks
  • Use the designer to train a model by dragging and dropping pre-defined modules

A

  • Create an Azure Kubernetes Service (AKS) inference cluster
  • Create a tabular dataset that supports versioning
  • Create a compute instance and use it to run code in Jupyter notebooks

https://azure.microsoft.com/en-us/pricing/details/machine-learning/
https://azure.github.io/azureml-sdk-for-r/reference/create_workspace.html

111
Q

[EXAM - UDEMY]
You are working as a lead data scientist for your company.

The project you are working in tracks the health and migration of birds. You create a multi-class image classification deep learning model that uses a set of labeled bird photographs collected by experts.

You have 100,000 photographs of birds. All photographs use the JPG format and are stored in an Azure blob container in an Azure subscription.

You need to access the bird photograph files in the Azure blob container from the Azure Machine Learning service workspace that will be used for deep learning model training.

You must minimize data movement. What should you do?

A

Register the Azure blob storage container holding the bird photographs as a datastore in the Azure Machine Learning service workspace.

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data

112
Q

[EXAM - UDEMY]
You are lead data scientist for your project

You use the Azure Machine Learning service to create a tabular dataset named training_data.

You plan to use this dataset in a training script.

You create a variable that references the dataset using the following code:

training_ds = workspace.datasets.get(“training_data”)

You define an estimator to run the script.

You need to set the correct property of the estimator to ensure that your script can access the training_data dataset.

Which property should you set?

A

inputs = [training_ds.as_named_input('training_ds')]

example:
diabetes_ds = ws.datasets.get("Diabetes Dataset")
hyper_estimator = SKLearn(source_directory=experiment_folder, inputs=[diabetes_ds.as_named_input('diabetes')])

https://notebooks.azure.com/GraemeMalcolm/projects/azureml-primers/html/04%20-%20Optimizing%20Model%20Training.ipynb

113
Q

[EXAM - UDEMY]
You have been asked to build a model for image recognition.

You create a deep learning model for image recognition on Azure Machine Learning service using GPU-based training.

You must deploy the model to a context that allows for real-time GPU-based inferencing.

You need to configure compute resources for model inferencing.

Which compute type should you use?

A

You can use Azure Machine Learning to deploy a GPU-enabled model as a web service. Deploying a model on Azure Kubernetes Service (AKS) is one option.

The AKS cluster provides a GPU resource that is used by the model for inference.

Inference, or model scoring, is the phase where the deployed model is used to make predictions. Using GPUs instead of CPUs offers performance advantages on highly parallelizable computation.

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-inferencing-gpus

114
Q

[PERSONAL]

How do you deploy models with Azure ML?

A

- Register the model (optional, see below).
- Prepare an inference configuration (unless using no-code deployment). An inference configuration describes how to set up the web service containing your model; it's used later, when you deploy the model.
- Prepare an entry script (unless using no-code deployment).
- Choose a compute target.
- Deploy the model to the compute target.
- Test the resulting web service.

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where?tabs=azcli
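A minimal sketch of those steps in the SDK (the workspace config, registered model name, environment file, entry script name, and ACI sizing below are assumptions for illustration):

from azureml.core import Workspace, Environment
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name='my-model')                       # a previously registered model (assumed name)

env = Environment.from_conda_specification('inference-env', 'environment.yml')
inference_config = InferenceConfig(entry_script='score.py', environment=env)      # entry script
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)  # compute target: ACI

service = Model.deploy(ws, 'my-service', [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.run('{"data": [[1, 2, 3]]}'))              # test the resulting web service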

115
Q

[PERSONAL] how to load a registered scikit-learn model and score it with numpy data

A

import json
import numpy as np
import os
from sklearn.externals import joblib

def init():
    # Called once when the service starts: load the registered model from the
    # path Azure ML exposes through the AZUREML_MODEL_DIR environment variable.
    global model
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'sklearn_mnist_model.pkl')
    model = joblib.load(model_path)

def run(data):
    # Called for each scoring request: parse the JSON payload into a NumPy array
    # and return the model's predictions.
    try:
        data = np.array(json.loads(data))
        result = model.predict(data)
        # You can return any data type, as long as it is JSON serializable.
        return result.tolist()
    except Exception as e:
        error = str(e)
        return error
116
Q

[PERSONAL]

What are all the possible compute targets?

A
  • local web services
  • Azure ML compute instance web service
  • AKS -> GPU
  • Azure Container Instances
  • Azure ML compute clusters -> GPU
  • Azure Functions
  • Azure IoT Edge
  • Azure Data Box Edge
    https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where?tabs=azcli
117
Q

[EXAM - UDEMY]
You are an Azure Machine Learning expert at your company, and your manager has provided you with a training set.

You are building a binary classification model by using a supplied training set.

The training set is imbalanced between two classes.

You need to resolve the data imbalance.

What are three possible ways to achieve this goal? Each correct answer presents a complete solution.

A
  • penalize classification
  • generate synthetic samples in minority class
  • resample the dataset using under or oversampling

Explanation
Generate synthetic samples in the minority class: Try Generate Synthetic Samples

A simple way to generate synthetic samples is to randomly sample the attributes from instances in the minority class.

Resample the dataset using undersampling or oversampling: You can change the dataset that you use to build your predictive model to have more balanced data.

This change is called sampling your dataset and there are two main methods that you can use to even-up the classes:

§ Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more)

§ Consider testing over-sampling when you don’t have a lot of data (tens of thousands of records or less)

Try Penalized Models:

You can use the same algorithms but give them a different perspective on the problem.

Penalized classification imposes an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class.

References:

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
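A minimal sketch of the resampling idea, random oversampling of the minority class with NumPy (the toy arrays are arbitrary examples; libraries such as imbalanced-learn provide SMOTE for generating synthetic samples):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)            # heavily imbalanced labels: 90 vs 10

# Randomly duplicate minority-class rows until both classes have the same count
minority_idx = np.flatnonzero(y == 1)
extra = rng.choice(minority_idx, size=(y == 0).sum() - minority_idx.size, replace=True)

X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])
print(np.bincount(y_balanced))               # now 90 and 90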

118
Q

[EXAM - UDEMY]

The finance team asks you to train a model using data in an Azure Storage blob container named finance-data.

You need to register the container as a datastore in an Azure Machine Learning workspace and ensure that an error will be raised if the container does not exist.

How should you complete the code?

To answer, select the appropriate options in the answer area.

A
datastore = Datastore.register_azure_blob_container(workspace=ws,
    datastore_name='...',
    container_name='...',
    account_name='...',
    account_key='...',
    create_if_not_exists=False)

https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py

Box 1: register_azure_blob_container registers an Azure blob container to the datastore.

Box 2: create_if_not_exists=False. The container is only created when this is set to True; with the default of False, an error is raised if the container does not exist.

119
Q

[EXAM - UDEMY]

You are working in a bank as a senior data scientist

You are creating a classification model for a banking company to identify possible instances of credit card fraud.

You plan to create the model in Azure Machine Learning by using automated machine learning.

The training dataset that you are using is highly unbalanced.

You need to evaluate the classification model.

Which primary metric should you use?

A

AUC-weighted

Explanation
AUC_weighted is a Classification metric.

AUC is the Area under the Receiver Operating Characteristic Curve. Weighted is the arithmetic mean of the score for each class, weighted by the number of true instances in each class.

Incorrect Answers:

§ normalized_mean_absolute_error is a regression metric, not a classification metric.

§ When comparing approaches to imbalanced classification problems, consider using metrics beyond accuracy such as recall, precision, and AUROC. It may be that switching the metric you optimize for during parameter selection or model selection is enough to provide desirable performance detecting the minority class.

§ normalized_root_mean_squared_error is a regression metric, not a classification metric.

Reference:

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml
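A sketch of where the primary metric is set when configuring automated ML through the SDK (the training dataset, label column, and fold count below are assumptions for illustration):

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',   # robust choice for an imbalanced dataset
                             training_data=train_dataset,     # assumed TabularDataset
                             label_column_name='is_fraud',    # assumed label column name
                             n_cross_validations=5)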

120
Q

[EXAM - UDEMY]

You use Azure Machine Learning Studio for analysis.

You are working with a time series dataset in Azure Machine Learning Studio.

You need to split your dataset into training and testing subsets by using the Split Data module.

Which splitting mode should you use?

A

Relative Expression Split: Use this option whenever you want to apply a condition to a number column. Time-series data means you should split the data by date, otherwise you may have information leaking. The number can be a date/time field, a column that contains age or dollar amounts, or even a percentage. For example, you might want to divide your dataset based on the cost of the items, group people by age ranges, or separate data by a calendar date.

121
Q

[EXAM - UDEMY]

You are familiar with deep learning models.

You create a multi-class image classification deep learning model that uses a set of labeled images.

You create a script file named train.py that uses the PyTorch 1.3 framework to train the model.

You must run the script by using an estimator. The code must not require any additional Python libraries to be installed in the environment for the estimator.

The time required for model training must be minimized.

You need to define the estimator that will be used to run the script. Which estimator type should you use?

  • Pytorch
  • Estimator
  • SKlearn
  • TensorFlow
A

Explanation

For PyTorch, TensorFlow and Chainer tasks, Azure Machine Learning provides respective PyTorch, TensorFlow, and Chainer estimators to simplify using these frameworks.

Reference:

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-ml-models
https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.estimator?view=azure-ml-py

What is an estimator
Represents a generic estimator to train data using any supplied framework.

This class is designed for use with machine learning frameworks that do not already have an Azure Machine Learning pre-configured estimator. Pre-configured estimators exist for Chainer, PyTorch, TensorFlow, and SKLearn. To create an Estimator that is not preconfigured, see Train models with Azure Machine Learning using estimator.

The Estimator class wraps run configuration information to help simplify the tasks of specifying how a script is executed. It supports single-node as well as multi-node execution. Running the estimator produces a model in the output directory specified in your training script.

122
Q

[EXAM - UDEMY]

You are familiar with Azure Machine Learning Studio.

You are creating a new Azure Machine Learning pipeline using the designer.

The pipeline must train a model using data in a comma-separated values (CSV) file that is published on a website. You have not created a dataset for this file.

You need to ingest the data from the CSV file into the designer pipeline using the minimal administrative effort.

Which module should you add to the pipeline in Designer?

  • enter Data manually
  • import Data
  • Convert to CSV
  • Dataset
A
Explanation
The preferred way to provide data to a pipeline is a Dataset object. The Dataset object points to data that lives in or is accessible from a datastore or at a Web URL. The Dataset class is abstract, so you will create an instance of either a FileDataset (referring to one or more files) or a TabularDataset that's created from one or more files with delimited columns of data.

Example:

from azureml.core import Dataset
iris_tabular_dataset = Dataset.Tabular.from_delimited_files([(def_blob_store, 'train-dataset/iris.csv')])

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline

123
Q

[PERSONAL]

how to set up a pipeline?

A
  • Create a workspace
  • Set up machine learning resources
    • set up a datastore
  • Configure data with Dataset and PipelineData objects
  • set up a compute target
  • Configure the training run’s environment
  • Construct your pipeline steps
  • Caching & reuse
  • Submit the pipeline
124
Q

[EXAM - UDEMY]

You must identify the output files that are generated by the experiment run. You need to add code to retrieve the output file names. Which code segment should you add to the script?

A

files = run.get_file_names()
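
In context, the call might look like this (the experiment name is an assumption, and ws/estimator are assumed to be a Workspace and run configuration defined earlier):

from azureml.core import Experiment

experiment = Experiment(workspace=ws, name='training-experiment')
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)

files = run.get_file_names()    # names of all output files generated by the run
print(files)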

125
Q

[EXAM - UDEMY]
You are responsible for creating different Azure Machine Learning models for your company.

You are creating a binary classification by using a two-class logistic regression model.

You need to evaluate the model results for imbalance.

Which evaluation metric should you use?

A

Explanation
One can inspect the true positive rate vs. the false positive rate in the Receiver Operating Characteristic (ROC) curve and the corresponding Area Under the

Curve (AUC) value. The closer this curve is to the upper left corner, the better the classifier’s performance is (that is maximizing the true positive rate while minimizing the false positive rate). Curves that are close to the diagonal of the plot, result from classifiers that tend to make predictions that are close to random guessing.

References:

https://docs.microsoft.com/en-us/azure/machine-learning/studio/evaluate-model-performance#evaluating-a-binary-classification-model

126
Q

[EXAM - UDEMY]
You are expert in Azure Machine Learning

You are tuning a hyperparameter for an algorithm. The following table shows a dataset with different hyperparameter values and their training and validation errors.

(table image not reproduced)

Select the answer from the following boxes:

A

Explanation
Difference between validation and training errors for each hyperparameter value:

105 - 95 = 10

200 - 85 = 115

250 - 100 = 150

105 - 100 = 5 -> smallest gap with low errors, so this is the best H value: 4 for Q1

400 - 50 = 350 -> largest gap, so the poorest H value: 5 for Q2

4:

Choose the value that has low training and validation error and also the closest match between them.

Minimize the variance (the difference between the validation error and the training error).

5:

The poorest value is the one with the largest gap between validation and training error (the highest variance).

127
Q

[EXAM - UDEMY]
You are responsible for creating different deep learning models for your company.

You create a multi-class image classification deep learning model.

You train the model by using PyTorch version 1.2.

You need to ensure that the correct version of PyTorch can be identified for the inferencing environment when the model is deployed.

What should you do?

A

Explanation
framework_version: The PyTorch version to be used for executing training code. PyTorch.get_supported_versions() returns a list of the versions supported by the current SDK.

https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.dnn.pytorch?view=azure-ml-py
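A sketch of pinning the framework version on the (now legacy) PyTorch estimator (the folder, script, and compute names below are assumptions for illustration):

from azureml.train.dnn import PyTorch

print(PyTorch.get_supported_versions())      # list the PyTorch versions the current SDK supports

estimator = PyTorch(source_directory='training_folder',
                    entry_script='train.py',
                    compute_target='gpu-cluster',
                    use_gpu=True,
                    framework_version='1.2')  # match the version the model was trained with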

128
Q

[EXAM - UDEMY]
You are senior Azure Machine Learning Associate of your company.

You are building a recurrent neural network to perform a binary classification.

The training loss, validation loss, training accuracy, and validation accuracy of each training epoch has been provided.

You need to identify whether the classification model is overfitted.

Which of the following is correct?

A

Explanation
An overfit model is one where performance on the train set is good and continues to improve, whereas performance on the validation set improves to a point and then begins to degrade.

- performance on the train set is good and continues to improve: the training loss keeps decreasing
- the validation set improves to a point and then begins to degrade: the validation loss starts to increase

References:

https://machinelearningmastery.com/diagnose-overfitting-underfitting-lstm-models/

129
Q

[EXAM - UDEMY]
A set of CSV files contains sales records. All the CSV files have the same data schema.

Each CSV file contains the sales record for a particular month and has the filename sales.csv. Each file is stored in a folder that indicates the month and year when the data was recorded. The folders are in an Azure blob container for which a datastore has been defined in an Azure Machine Learning workspace. The folders are organized in a parent folder named sales to create the following hierarchical structure:

(folder hierarchy image not reproduced)

At the end of each month, a new folder with that month’s sales file is added to the sales folder.

You plan to use the sales data to train a machine learning model based on the following requirements:

  • You must define a dataset that loads all of the sales data to date into a structure that can be easily converted to a dataframe.
  • You must be able to create experiments that use only data that was created before a specific previous month, ignoring any data that was added after that month.
  • You must register the minimum number of datasets possible.

You need to register the sales data as a dataset in Azure Machine Learning service workspace.

What should you do?

A

Explanation
Specify the path.

Example: The following code gets the workspace existing workspace and the desired datastore by name. And then passes the datastore and file locations to the path parameter to create a new TabularDataset, weather_ds.

from azureml.core import Workspace, Datastore, Dataset

datastore_name = 'your datastore name'
# get existing workspace
workspace = Workspace.from_config()
# retrieve an existing datastore in the workspace by name
datastore = Datastore.get(workspace, datastore_name)

# create a TabularDataset from 3 file paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'), (datastore, 'weather/2018/12.csv'), (datastore, 'weather/2019/*.csv')]

weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)


130
Q

[EXAM - UDEMY]
You are lead data scientist of your project and you use Azure Machine Learning Studio

You have a comma-separated values (CSV) file containing data from which you want to train a classification model.

You are using the Automated Machine Learning interface in Azure Machine Learning studio to train the classification model.

You set the task type to Classification.

You need to ensure that the Automated Machine Learning process evaluates only linear models.

What should you do?

A

https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml

131
Q

[EXAM - UDEMY]
You are senior data scientist of your company and you use Azure Machine Learning Model.

You train a model and register it in your Azure Machine Learning workspace.

You are ready to deploy the model as a real-time web service.

You deploy the model to an Azure Kubernetes Service (AKS) inference cluster, but the deployment fails because an error occurs when the service runs the entry script that is associated with the model deployment.

You need to debug the error by iteratively modifying the code and reloading the service, without requiring a re-deployment of the service for each code update.

A

Explanation
If you encounter problems deploying a model to ACI or AKS, try deploying it as a local web service. Using a local web service makes it easier to troubleshoot problems. The Docker image containing the model is downloaded and started on your local system.

Deployment and runtime errors can be easier to diagnose by deploying the service as a container in a local Docker instance.

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-troubleshoot-deployment

132
Q

[EXAM - UDEMY]
You have a model with a large difference between the training and validation error values. You must create a new model and perform cross-validation. You need to identify a parameter set for the new model using Azure Machine Learning Studio.

Which module you should use for below step?

Step:

Train, evaluate, and compare

A
Explanation
Tune Model Hyperparameters Integrated train and tune: You configure a set of parameters to use, and then let the module iterate over multiple combinations, measuring accuracy until it finds a "best" model. With most learner modules, you can choose which parameters should be changed during the training process, and which should remain fixed.

Reference:

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters

133
Q

[EXAM - UDEMY]
You are creating a binary classification by using a two-class logistic regression model.
You need to evaluate the model results for imbalance.
Which evaluation metric should you use?

A

Explanation
One can inspect the true positive rate vs. the false positive rate in the Receiver Operating Characteristic (ROC) curve and the corresponding Area Under the Curve (AUC) value. The closer this curve is to the upper left corner, the better the classifier's performance is (that is, maximizing the true positive rate while minimizing the false positive rate). Curves that are close to the diagonal of the plot result from classifiers that tend to make predictions that are close to random guessing.

References:

https://docs.microsoft.com/en-us/azure/machine-learning/studio/evaluate-model-performance#evaluating-a-binary-classification-model

134
Q

[EXAM - UDEMY]
You are developing a deep learning model by using TensorFlow. You plan to run the model training workload on an Azure Machine Learning Compute Instance. You must use CUDA-based model training. You need to provision the Compute Instance. Which two virtual machine sizes can you use?

A

CUDA is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs (graphics processing units). CUDA enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelizable part of the computation.
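As an illustration only (the question's answer options are not reproduced here), CUDA-based training requires an NVIDIA GPU VM size such as the NC-series. A minimal SDK v1 sketch of provisioning such a compute instance, assuming an existing workspace ws and treating the name and size as placeholders:

from azureml.core.compute import ComputeInstance, ComputeTarget

# STANDARD_NC6 is one example of an NVIDIA GPU (CUDA-capable) VM size
compute_config = ComputeInstance.provisioning_configuration(vm_size='STANDARD_NC6')
instance = ComputeTarget.create(ws, 'gpu-instance', compute_config)
instance.wait_for_completion(show_output=True)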

Reference:

https://www.infoworld.com/article/3299703/what-is-cuda-parallel-programming-for-gpus.html

135
Q

[EXAM - UDEMY]
You are creating a machine learning model in Azure Machine Learning Studio for your company.

You have a dataset that contains null rows.

You need to use the Clean Missing Data module in Azure Machine Learning Studio to identify and resolve the null and missing data in the dataset.

Which parameter should you use?

A

Explanation
Remove entire row: Completely removes any row in the dataset that has one or more missing values. This is useful if the missing value can be considered randomly missing.

References:

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data

136
Q

[EXAM - UDEMY]
You are a senior data scientist working for a retail bank and have used Azure ML to train and register a machine learning model that predicts whether a customer is likely to repay a loan or not.

You want to understand how your model is making selections and must be sure that the model does not violate government regulations such as denying loans based on where an applicant lives.

You need to determine the extent to which each feature in the customer data is influencing predictions.

What should you do?

A

Use the interpretability package to generate an explainer for the model.

Explanation
When you compute model explanations and visualize them, you’re not limited to an existing model explanation for an automated ML model. You can also get an explanation for your model with different test data. The steps in this section show you how to compute and visualize engineered feature importance based on your test data.
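A minimal sketch using the azureml-interpret package, assuming a fitted model model, training features x_train, test features x_test, and a list feature_names; the class labels shown are hypothetical:

from interpret.ext.blackbox import TabularExplainer

explainer = TabularExplainer(model, x_train, features=feature_names,
                             classes=['repaid', 'defaulted'])  # hypothetical class labels

# engineered feature importance computed on the test data
global_explanation = explainer.explain_global(x_test)
print(global_explanation.get_feature_importance_dict())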

Incorrect Answers:

§ In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It is one of the top reasons where model accuracy degrades over time, thus monitoring data drift helps detect model performance issues.

§ A confusion matrix is used to describe the performance of a classification model. Each row displays the instances of the true, or actual class in your dataset, and each column represents the instances of the class that was predicted by the model.

§ Hyperparameters are adjustable parameters you choose for model training that guide the training process. The HyperDrive package helps you automate choosing these parameters.

Reference:

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability-automl

137
Q

[EXAM - UDEMY]
You are a junior Machine Learning associate at your company.

You are implementing a machine learning model to predict stock prices.

The model uses a PostgreSQL database and requires GPU processing.

You need to create a virtual machine that is pre-configured with the required tools.

What should you do?

A

Explanation
Incorrect Answers:

DLVM is a template on top of the DSVM image. The packages, GPU drivers, etc. are all already present in the DSVM image; the DLVM mainly exists for convenience during creation, as it can only be created on GPU VM instances on Azure.

PostgreSQL (CentOS) is only available in the Linux Edition.

The Azure Geo AI Data Science VM (Geo-DSVM) delivers geospatial analytics capabilities from Microsoft’s Data Science VM. Specifically, this VM extends the AI and data science toolkits in the Data Science VM by adding ESRI’s market-leading ArcGIS Pro Geographic Information System.

References:

https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/overview

138
Q

[EXAM - UDEMY]
Your manager asked you to create a deep learning model.

You create a multi-class image classification deep learning model that uses the PyTorch deep learning framework.

You must configure Azure Machine Learning Hyperdrive to optimize the hyperparameters for the classification model.

You need to define a primary metric to determine the hyperparameter values that result in the model with the best accuracy score.

Which three actions must you perform? Each correct answer presents part of the solution.

A

Explanation
primary_metric_name="accuracy",

primary_metric_goal=PrimaryMetricGoal.MAXIMIZE

Optimize the runs to maximize “accuracy”. Make sure to log this value in your training script.

The training script calculates the val_accuracy and logs it as “accuracy”, which is used as the primary metric.

Note:

primary_metric_name: The name of the primary metric to optimize. The name of the primary metric needs to exactly match the name of the metric logged by the training script.

primary_metric_goal: It can be either PrimaryMetricGoal.MAXIMIZE or PrimaryMetricGoal.MINIMIZE and determines whether the primary metric will be maximized or minimized when evaluating the runs.
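A minimal sketch of wiring these settings together, assuming an existing ScriptRunConfig named script_config for the PyTorch training script and a hypothetical --learning-rate argument:

from azureml.train.hyperdrive import (HyperDriveConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, uniform)

param_sampling = RandomParameterSampling({'--learning-rate': uniform(0.001, 0.1)})

hd_config = HyperDriveConfig(run_config=script_config,
                             hyperparameter_sampling=param_sampling,
                             primary_metric_name='accuracy',  # must match the name logged by the script
                             primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                             max_total_runs=20)

# inside the training script, log the validation accuracy under the same name:
# run.log('accuracy', float(val_accuracy))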

Reference:

https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriverunconfig?view=azure-ml-py

139
Q

[EXAM - UDEMY]
You are familiar with Azure Machine Learning Studio

You create a binary classification model by using Azure Machine Learning Studio.

You must tune hyperparameters by performing a parameter sweep of the model.

The parameter sweep must meet the following requirements:

§ iterate all possible combinations of hyperparameters

§ minimize computing resources required to perform the sweep

You need to perform a parameter sweep of the model. Which parameter sweep mode should you use?

A

Random grid

Explanation
Random Grid:

Maximum number of runs on random grid: This option also controls the number of iterations over a random sampling of parameter values, but the values are not generated randomly from the specified range; instead, a matrix is created of all possible combinations of parameter values and a random sampling is taken over the matrix. This method is more efficient and less prone to regional oversampling or undersampling.

If you are training a model that supports an integrated parameter sweep, you can also set a range of seed values to use and iterate over the random seeds as well. This is optional, but can be useful for avoiding bias introduced by seed selection.

You can also reduce the size of the grid and run a random grid sweep. Research has shown that this method yields the same results, but is more efficient computationally.

References:

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters

140
Q

[EXAM - UDEMY]
You are very much familiar with Azure Machine Learning Studio

You are performing a classification task in Azure Machine Learning Studio.

You must prepare balanced testing and training samples based on a provided data set.

You need to split the data with a 0.75:0.25 ratio.

Which value should you use for each parameter?

  • splitting mode
  • fraction of rows in the first output dataset
  • randomized split
  • stratified split
A

Explanation
Box 1: Split rows

Use the Split Rows option if you just want to divide the data into two parts. You can specify the percentage of data to put in each split, but by default, the data is divided 50-50.

You can also randomize the selection of rows in each group, and use stratified sampling. In stratified sampling, you must select a single column of data for which you want values to be apportioned equally among the two result datasets.

Box 2: 0.75

If you specify a number as a percentage, or if you use a string that contains the “%” character, the value is interpreted as a percentage. All percentage values must be within the range (0, 100), not including the values 0 and 100.

Box 3: True

To ensure splits are balanced.

Box 4: True

It is asking for balanced data. Stratified split should be true in order to be balanced.

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data-using-split-rows

141
Q

[EXAM - UDEMY]
HOTSPOT

You have a dataset that contains 2,000 rows. You are building a machine learning classification model by using Azure Machine Learning Studio. You add a Partition and Sample module to the experiment.

You need to configure the module. You must meet the following requirements:

✑ Divide the data into subsets

✑ Assign the rows into folds using a round-robin method

✑ Allow rows in the dataset to be reused

How should you configure the module?
Partition or sample mode:
  - Assign to Folds
  - Pick Folds
  - Sampling
  - Head
Use replacement in the partitioning
Randomized split
A

Use the Split data into partitions option when you want to divide the dataset into subsets of the data. This option is also useful when you want to create a custom number of folds for cross-validation, or to split rows into several groups.

  1. Add the Partition and Sample module to your experiment in Studio (classic), and connect the dataset.
  2. For Partition or sample mode, select Assign to Folds.
  3. Use replacement in the partitioning: Select this option if you want the sampled row to be put back into the pool of rows for potential reuse. As a result, the same row might be assigned to several folds.
  4. If you do not use replacement (the default option), the sampled row is not put back into the pool of rows for potential reuse. As a result, each row can be assigned to only one fold.
  5. Randomized split: Select this option if you want rows to be randomly assigned to folds.

If you do not select this option, rows are assigned to folds using the round-robin method.

References:

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/partition-and-sample

142
Q

[EXAM - UDEMY]
You have to create many machine learning models on a daily basis.

So you plan to use automated machine learning to train a regression model.

You have data that has features which have missing values, and categorical features with few distinct values.

You need to configure automated machine learning to automatically impute missing values and encode categorical features as part of the training task.

Which parameter and value pair should you use in the AutoMLConfig class?

A

Explanation
featurization (str or FeaturizationConfig)

Values: ‘auto’ / ‘off’ / FeaturizationConfig

Indicator for whether featurization step should be done automatically or not, or whether customized featurization should be used.

Column type is automatically detected. Based on the detected column type preprocessing/featurization is done as follows:

§ Categorical: Target encoding, one hot encoding, drop high cardinality categories, impute missing values.

§ Numeric: Impute missing values, cluster distance, weight of evidence.

§ DateTime: Several features such as day, seconds, minutes, hours etc.

§ Text: Bag of words, pre-trained Word embedding, text target encoding.
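A minimal sketch of the parameter and value pair in question, assuming a TabularDataset train_data with a label column named 'target' and an existing compute target (both names are placeholders):

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='regression',
                             training_data=train_data,
                             label_column_name='target',
                             featurization='auto',  # impute missing values and encode categorical features automatically
                             primary_metric='normalized_root_mean_squared_error',
                             compute_target=compute_target)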

Reference:

https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py

143
Q

[EXAM - UDEMY]
You are very familiar with the Azure Machine Learning workspace.

You train and register a model in your Azure Machine Learning workspace.

You must publish a pipeline that enables client applications to use the model for batch inferencing. You must use a pipeline with a single ParallelRunStep step that runs a Python inferencing script to get predictions from the input data.

You need to create the inferencing script for the ParallelRunStep pipeline step.

Which two functions should you include? Each correct answer presents part of the solution.

A

init()
run(mini_batch)
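A minimal entry-script sketch for these two functions, assuming the registered model is a scikit-learn pickle registered under the hypothetical name 'my-model' and the input is a TabularDataset, so each mini_batch arrives as a pandas DataFrame of feature columns:

import joblib
import pandas as pd
from azureml.core.model import Model

def init():
    # called once per worker process: load the model into a global
    global model
    model = joblib.load(Model.get_model_path('my-model'))

def run(mini_batch):
    # called once per mini-batch; must return a list or a pandas DataFrame of results
    predictions = model.predict(mini_batch)
    return pd.DataFrame({'prediction': predictions})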

Explanation
Reference:

https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/machine-learning-pipelines/parallel-run

144
Q

[EXAM - UDEMY]
An organization uses Azure Machine Learning service and wants to expand their use of machine learning.

You have the following compute environments. The organization does not want to create another compute environment.

Environment name | Compute type
nb_server | Compute Instance
aks_cluster | Azure Kubernetes Service
mlc_cluster | Machine Learning Compute

You need to determine which compute environment to use for the following scenarios.

Which compute types should you use? To answer, drag the appropriate compute environments to the correct scenarios. Each compute environment may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

A

Explanation
Box 1: mlc_cluster

With Azure Machine Learning, you can train your model on a variety of resources or environments, collectively referred to as compute targets. A compute target can be a local machine or a cloud resource, such as Azure Machine Learning Compute, Azure HDInsight, or a remote virtual machine.

Box 2: aks_cluster

Real-time endpoints must be deployed to an Azure Kubernetes Service cluster.

https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-training-targets
https://docs.microsoft.com/en-us/azure/machine-learning/concept-designer