Questions set 1 Flashcards
[EXAM- UDEMY] You are asked to solve a classification task.
You must evaluate your model on a limited data sample by using k-fold cross-validation. You start by configuring a k parameter as the number of splits.
You need to configure the k parameter for the cross-validation.
Which value should you use?
- k = 10
- k = 0.9
- k = 0.5
- k = 1
Leave One Out (LOO) cross-validation
Setting k = n (the number of observations) yields n folds and is called leave-one-out cross-validation (LOO); this is a special case of the k-fold approach.
LOO CV is sometimes useful but typically doesn't shake up the data enough: the estimates from each fold are highly correlated, so their average can have high variance.
This is why the usual choice is k = 5 or k = 10, which provides a good compromise in the bias-variance tradeoff. Of the options listed, only k = 10 is a valid fold count.
[PERSONAL] what is the purpose of K-fold cross validation
- Maximize the use of the available data for training and then testing a model.
- Assess model performance, as it provides a range of accuracy scores across (somewhat) different data sets (see the sketch below).
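A minimal scikit-learn sketch of 10-fold cross-validation (the dataset and model are placeholders):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=10)  # k = 10 folds
print(scores.mean(), scores.std())            # average accuracy and its spread across the folds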
[PERSONAL] what is the purpose of cross-validation?
Cross-validation (CV) is a technique used to test the effectiveness of machine learning models; it is also a resampling procedure used to evaluate a model when we have limited data. To perform CV, we keep aside a sample/portion of the data that is not used to train the model, and later use this sample for testing/validating.
[PERSONAL] Give the variations on cross-validation
- Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that a single train/test split is created to evaluate the model.
- LOOCV: Taken to the other extreme, k may be set to the total number of observations in the dataset such that each observation gets a chance to be held out of the dataset. This is called leave-one-out cross-validation, or LOOCV for short.
- Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.
- Repeated: The k-fold cross-validation procedure is repeated n times, where, importantly, the data sample is shuffled prior to each repetition, which results in a different split of the sample.
- Nested: k-fold cross-validation is performed within each fold of cross-validation, often to perform hyperparameter tuning during model evaluation. This is called nested cross-validation or double cross-validation.
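For reference, a scikit-learn sketch of the splitters that correspond to several of these variations:
from sklearn.model_selection import KFold, LeaveOneOut, RepeatedKFold, StratifiedKFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)          # standard k-fold
loo = LeaveOneOut()                                           # k = number of observations (LOOCV)
skf = StratifiedKFold(n_splits=5)                             # keeps class proportions equal in each fold
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)  # reshuffles and repeats the k-fold split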
[EXAM- UDEMY] Your manager asked you to analyze a numerical dataset which contains missing values in several columns.
You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set.
You need to analyze a full dataset to include all values.
Solution:
Use the Last Observation Carried Forward (LOCF) method to impute the missing data points.
Explanation
Instead of using the Last Observation Carried Forward (LOCF) method, you need to use the Multiple Imputation by Chained Equations (MICE) method.
Replace using MICE: For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as “Multivariate Imputation using Chained Equations” or “Multiple Imputation by Chained Equations”. With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values.
Note:
Last observation carried forward (LOCF) is a method of imputing missing data in longitudinal studies. If a person drops out of a study before it ends, then his or her last observed score on the dependent variable is used for all subsequent (i.e., missing) observation points. LOCF is used to maintain the sample size and to reduce the bias caused by the attrition of participants in a study.
[PERSONAL] Pro’s and Cons of mean/median imputation
Pros:
Easy and fast.
Works well with small numerical datasets.
Cons:
Doesn’t factor the correlations between features. It only works on the column level.
Will give poor results on encoded categorical features (do NOT use it on categorical features).
Not very accurate.
Doesn’t account for the uncertainty in the imputations.
[PERSONAL] Pro’s and Cons of Most Frequent or Zero/Constant values
Pros:
Works well with categorical features.
Cons:
It also doesn’t factor the correlations between features.
It can introduce bias in the data.
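A minimal sketch of these simple imputation strategies (mean/median and most frequent/constant) with scikit-learn's SimpleImputer; the data is made up:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

num = pd.DataFrame({'age': [21.0, np.nan, 45.0], 'income': [50000.0, 65000.0, np.nan]})
print(SimpleImputer(strategy='mean').fit_transform(num))        # or strategy='median'

cat = pd.DataFrame({'city': ['Chicago', 'Seattle', np.nan]})
print(SimpleImputer(strategy='most_frequent').fit_transform(cat))
print(SimpleImputer(strategy='constant', fill_value='missing').fit_transform(cat))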
[PERSONAL] Pro’s and Cons
Imputation Using k-NN
Pros:
Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).
Cons:
Computationally expensive.
KNN works by storing the whole training dataset in memory.
K-NN is quite sensitive to outliers in the data (unlike SVM)
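A minimal sketch of k-NN imputation with scikit-learn's KNNImputer (placeholder data):
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])
imputer = KNNImputer(n_neighbors=2)   # each missing value is averaged from the 2 most similar rows
print(imputer.fit_transform(X))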
[PERSONAL] Pro’s and Cons
Imputation Using Multivariate Imputation by Chained Equation (MICE)
Pros:
- Better than a single imputation, as it measures the uncertainty of the missing values.
- Flexible: can handle different data types.
- Can handle complexities such as bounds or survey skip patterns.
This type of imputation works by filling in the missing data multiple times. Multiple Imputations (MIs) are much better than a single imputation because they measure the uncertainty of the missing values in a better way. The chained equations approach is also very flexible and can handle variables of different data types (i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns.
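scikit-learn's IterativeImputer is inspired by MICE (it returns a single imputation rather than multiple); a minimal sketch on placeholder data:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 - enables the experimental API
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, 8.0], [np.nan, 3.0], [7.0, np.nan]])
# each column with missing values is modelled conditionally on the other columns;
# sample_posterior=True adds the random draw that multiple-imputation methods rely on
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
print(imputer.fit_transform(X))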
[PERSONAL] what is Hot-Deck imputation
Works by randomly choosing the missing value from a set of related and similar variables.
[PERSONAL] what is Extrapolation and Interpolation imputation?
It tries to estimate values from other observations within the range of a discrete set of known data points.
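A minimal pandas sketch of linear interpolation over a numeric series (placeholder values):
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])
print(s.interpolate(method='linear'))   # each gap is estimated from the neighbouring known points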
[PERSONAL] what is Stochastic regression imputation
It is quite similar to regression imputation: it tries to predict the missing values by regressing them on other related variables in the same dataset, plus some random residual value.
[EXAM - UDEMY]
You are a senior data scientist of your company and you use Azure Machine Learning Studio.
You are asked to normalize values to produce an output column into bins to predict a target column.
Solution:
Apply a Quantiles normalization with a QuantileIndex normalization.
Does the solution meet the goal?
Quantile Normalization (summary of the YouTube video below): rank the values of each distribution, calculate the mean of the values that share the same rank across the distributions, and put the elements of the different distributions on that mean, so the distributions end up with identical statistical properties.
https://www.youtube.com/watch?reload=9&v=ecjN6Xpv6SE
In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile-normalize a test distribution to a reference distribution of the same length, sort the test distribution and sort the reference distribution.
No, the solution does not meet the goal: quantile normalization has nothing to do with producing bins. Use the Entropy MDL binning mode instead.
Entropy MDL: This method requires that you select the column you want to predict and the column or columns that you want to group into bins. It then makes a pass over the data and attempts to determine the number of bins that minimizes the entropy. In other words, it chooses a number of bins that allows the data column to best predict the target column. It then returns the bin number associated with each row of your data in a column named quantized.
[EXAM - UDEMY]
You are analyzing a raw dataset that requires cleaning.
You must perform transformations and manipulations by using Azure Machine Learning Studio.
You need to identify the correct module to perform the below transformation.
Which module should you choose?
Scenario:
Remove potential duplicates from a dataset
- remove duplicate rows
- SMOTE
- Convert to indicator values
- Clean missing data
- Threshold filter
Use the Remove Duplicate Rows module in Azure Machine Learning Studio (classic), to remove potential duplicates from a dataset.
[PERSONAL]
What are all the categories in the data transformation category?
Data Transformation - Filter
Learning with Counts
Data Transformation - Manipulation
Data Transformation - Sample and Split
Data Transformation - Scale and Reduce
[PERSONAL] Data Transformation - Filter
Give all the types of filters and what they do.
Apply Filter: Applies a filter to specified columns of a dataset.
FIR Filter: Creates an FIR filter for signal processing.
IIR Filter: Creates an IIR filter for signal processing.
Median Filter: Creates a median filter that’s used to smooth data for trend analysis.
Moving Average Filter: Creates a moving average filter that smooths data for trend analysis.
Threshold Filter: Creates a threshold filter that constrains values.
User-Defined Filter: Creates a custom FIR or IIR filter.
[PERSONAL] Data Transformation - Learning with Counts
The basic idea of count-based featurization is that by calculating counts, you can quickly and easily get a summary of what columns contain the most important information. The module counts the number of times a value appears, and then provides that information as a feature for input to a model.
Build Counting Transform: Creates a count table and count-based features from a dataset, and then saves the table and features as a transformation.
Export Count Table: Exports a count table from a counting transform. This module supports backward compatibility with experiments that create count-based features by using Build Count Table (deprecated) and Count Featurizer (deprecated).
Import Count Table: Imports an existing count table. This module supports backward compatibility with experiments that create count-based features by using Build Count Table (deprecated) and Count Featurizer (deprecated). The module supports conversion of count tables to count transformations.
Merge Count Transform: Merges two sets of count-based features.
Modify Count Table Parameters: Modifies count-based features that are derived from an existing count table.
[PERSONAL]
Data Transformation - Manipulation
Give some modules of this module.
Add Columns: Adds a set of columns from one dataset to another.
Add Rows: Appends a set of rows from an input dataset to the end of another dataset.
Apply SQL Transformation: Runs a SQLite query on input datasets to transform the data.
Clean Missing Data: Specifies how to handle values that are missing from a dataset. This module replaces Missing Values Scrubber, which has been deprecated.
Convert to Indicator Values: Converts categorical values in columns to indicator values.
Edit Metadata: Edits metadata that’s associated with columns in a dataset.
Group Categorical Values: Groups data from multiple categories into a new category.
Join Data: Joins two datasets.
Remove Duplicate Rows: Removes duplicate rows from a dataset.
Select Columns in Dataset: Selects columns to include in a dataset or exclude from a dataset in an operation.
Select Columns Transform: Creates a transformation that selects the same subset of columns as in a specified dataset.
SMOTE: Increases the number of low-incidence examples in a dataset by using synthetic minority oversampling.
[PERSONAL] Data Transformation - Sample and Split
Give the two modules and what they do.
Partition and Sample: Creates multiple partitions of a dataset based on sampling.
Split Data: Partitions the rows of a dataset into two distinct sets.
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/data-transformation-sample-and-split
[PERSONAL] Data Transformation - Scale and Reduce
Clip Values: Detects outliers, and then clips or replaces their values.
Group Data into Bins: Puts numerical data into bins.
Normalize Data: Rescales numeric data to constrain dataset values to a standard range.
Principal Component Analysis: Computes a set of features that have reduced dimensionality for more efficient learning.
[EXAM - UDEMY]
You are a data scientist using Azure Machine Learning Studio.
You are performing a filter-based feature selection for a dataset to build a multi-class classifier by using Azure Machine Learning Studio.
The dataset contains categorical features that are highly correlated to the output label column.
You need to select the appropriate feature scoring statistical method to identify the key predictors.
Which method should you use?
- spearman correlation
- Kendall correlation
- Chi-squared
- Pearson correlation
Explanation
The chi-square statistic is used to show whether or not there is a relationship between two categorical variables.
Incorrect Answer:
Pearson’s correlation coefficient (r) is used to demonstrate whether two variables are correlated or related to each other.
[PERSONAL]
Explain CHI-squared test, for what is it used?
is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance.
How likely is it that two sets of observations arose from the same distribution?
YT: https://www.youtube.com/watch?v=2QeDRsxSF9M
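A minimal SciPy sketch of a chi-squared test of independence between two categorical variables (the contingency-table counts are made up):
import numpy as np
from scipy.stats import chi2_contingency

# 2x2 contingency table of two categorical variables (made-up counts)
table = np.array([[30, 10],
                  [20, 40]])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)   # a small p-value suggests the two variables are unlikely to be independent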
[PERSONAL]
spearman correlation
Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a Spearman correlation to evaluate whether the order in which employees complete a test exercise is related to the number of months they have been employed
Spearman’s Rank correlation coefficient is a technique which can be used to summarise the strength and direction (negative or positive) of a relationship between two variables. The result will always be between 1 and minus 1.
A Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear. This means that all data points with greater x values than that of a given data point will have greater y values as well. In contrast, this does not give a perfect Pearson correlation.
[PERSONAL]
Kendall correlation
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient (after the Greek letter τ, tau), is a statistic used to measure the ordinal association between two measured quantities. … can be formulated as special cases of a more general correlation coefficient.
In the normal case, the Kendall correlation is preferred over the Spearman correlation because of a smaller gross error sensitivity (GES) (more robust) and a smaller asymptotic variance (AV) (more efficient).
[PERSONAL] Pearson correlation
is a statistic that measures linear correlation between two variables X and Y. It has a value between +1 and −1. A value of +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
Correlation is a technique for investigating the relationship between two quantitative, continuous variables
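A quick SciPy sketch comparing the three coefficients on the same (made-up) data:
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])
print(pearsonr(x, y))     # linear correlation
print(spearmanr(x, y))    # rank-based (monotonic) correlation
print(kendalltau(x, y))   # rank correlation from concordant/discordant pairs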
[EXAM - UDEMY]
You are a data scientist and you use Azure Machine Learning Studio for your experiments.
You are creating a new experiment in Azure Machine Learning Studio.
One class has a much smaller number of observations than the other classes in the training set.
You need to select an appropriate data sampling strategy to compensate for the class imbalance.
Solution:
You use the Principal Components Analysis (PCA) sampling mode.
Does the solution meet the goal?
Explanation
Instead of using Principal Components Analysis, use the Synthetic Minority Oversampling Technique (SMOTE) sampling mode.
Note:
SMOTE is used to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.
Incorrect Answers:
The Principal Component Analysis module in Azure Machine Learning Studio (classic) is used to reduce the dimensionality of your training data. The module analyzes your data and creates a reduced feature set that captures all the information contained in the dataset, but in a smaller number of features.
[PERSONAL] Explain PCA
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
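A minimal scikit-learn sketch of PCA on placeholder data (scaling first, since PCA is sensitive to feature scale):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 10)                    # 100 samples, 10 features (placeholder data)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=3)                      # keep the 3 directions with the most variance
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)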
[EXAM - UDEMY] - (duplicate question)
You are a data scientist using Azure Machine Learning Studio.
You are using Azure Machine Learning Studio to perform feature engineering on a dataset.
You need to normalize values to produce a feature column grouped into bins.
Solution:
Apply an Entropy Minimum Description Length (MDL) binning mode.
Does the solution meet the goal?
Explanation
Entropy MDL binning mode:
This method requires that you select the column you want to predict and the column or columns that you want to group into bins. It then makes a pass over the data and attempts to determine the number of bins that minimizes the entropy. In other words, it chooses a number of bins that allows the data column to best predict the target column. It then returns the bin number associated with each row of your data in a column named quantized.
[EXAM - UDEMY]
HOTSPOT
You are a data scientist of your company
You are working on a classification task.
You have a dataset indicating whether a student would like to play soccer and associated attributes.
The dataset includes the following columns:
- isPlayerSoccer: boolean
- Gender: M or F
- PrevExamMarks: stores values from 0 - 100
- Height: in centimeters
- Weight: in kilograms
Which are continuous variables?
Too obvious :)
- height
- weight
- PrevExamMarks
[EXAM - UDEMY]
HOTSPOT
Your manager has asked you to create a binary classification model to predict whether a person has a disease.
You need to detect possible classification errors.
Which error type should you choose for below description?
A person has a disease. The model classifies the case as having no disease.
False negative
A false negative is an outcome where the model incorrectly predicts the negative class.
Note:
Let’s make the following definitions:
“Wolf” is a positive class.
“No wolf” is a negative class.
We can summarize our "wolf-prediction" model using a 2x2 confusion matrix that depicts all four possible outcomes: true positive (a wolf is present and the model predicts "wolf"), false positive (no wolf is present but the model predicts "wolf"), false negative (a wolf is present but the model predicts "no wolf"), and true negative (no wolf is present and the model predicts "no wolf").
[EXAM - UDEMY]
You use the Azure Machine Learning service to create a tabular dataset named training_data. You plan to use this dataset in a training script.
You create a variable that references the dataset using the following code:
training_ds = workspace.datasets.get("training_data")
You define an estimator to run the script.
You need to set the correct property of the estimator to ensure that your script can access the training_data dataset.
Which property should you set?
- source_directory = training_ds
- inputs = [training_ds.as_named_input('training_ds')]
- environment_definition = {'training_ds': training_ds}
- script_params = {'--training_ds': training_ds}
Answer: inputs = [training_ds.as_named_input('training_ds')]
Estimator. Represents a generic estimator to train data using any supplied framework. This class is designed for use with machine learning frameworks that do not already have an Azure Machine Learning pre-configured estimator. Pre-configured estimators exist for Chainer, PyTorch, TensorFlow, and SKLearn
- inputs (list):
A list of DataReference or DatasetConsumptionConfig objects to use as input.
[PERSONAL]
What is an estimator?
Estimator. Represents a generic estimator to train data using any supplied framework. This class is designed for use with machine learning frameworks that do not already have an Azure Machine Learning pre-configured estimator. Pre-configured estimators exist for Chainer, PyTorch, TensorFlow, and SKLearn
The Estimator class wraps run configuration information to help simplify the tasks of specifying how a script is executed. It supports single-node as well as multi-node execution. Running the estimator produces a model in the output directory specified in your training script.
[PERSONAL] what are the parameters of an estimator
Parameters for estimator
source_directory (str)
A local directory containing experiment configuration and code files needed for a training job.
compute_target (AbstractComputeTarget or str)
The compute target where training will happen. This can either be an object or the string “local”.
vm_size (str)
The VM size of the compute target that will be created for the training. Supported values: Any Azure VM size.
vm_priority (str)
The VM priority of the compute target that will be created for the training. If not specified, 'dedicated' is used. Supported values: 'dedicated' and 'lowpriority'. This takes effect only when the vm_size parameter is specified in the input.
entry_script (str)
The relative path to the file used to start training.
script_params (dict)
A dictionary of command-line arguments to pass to the training script specified in entry_script.
node_count (int)
The number of nodes in the compute target used for training. If greater than 1, an MPI distributed job will be run.
process_count_per_node (int)
The number of processes (or “workers”) to run on each node. If greater than 1, an MPI distributed job will be run. Only the AmlCompute target is supported for distributed jobs.
distributed_backend (str)
The communication backend for distributed training.
DEPRECATED. Use the distributed_training parameter.
Supported values: ‘mpi’. ‘mpi’ represents MPI/Horovod.
This parameter is required when node_count or process_count_per_node > 1.
When node_count == 1 and process_count_per_node == 1, no backend will be used unless the backend is explicitly set. Only the AmlCompute target is supported for distributed training.
distributed_training (Mpi)
Parameters for running a distributed training job.
For running a distributed job with MPI backend, use Mpi object to specify process_count_per_node.
use_gpu (bool)
Indicates whether the environment to run the experiment should support GPUs. If true, a GPU-based default Docker image will be used in the environment. If false, a CPU-based image will be used. Default Docker images (CPU or GPU) will be used only if the custom_docker_image parameter is not set. This setting is used only in Docker enabled compute targets.
use_docker (bool)
Specifies whether the environment to run the experiment should be Docker-based.
custom_docker_base_image (str)
The name of the Docker image from which the image to use for training will be built.
DEPRECATED. Use the custom_docker_image parameter.
If not set, a default CPU-based image will be used as the base image.
custom_docker_image (str)
The name of the Docker image from which the image to use for training will be built. If not set, a default CPU-based image will be used as the base image. Only specify images available in public docker repositories (Docker Hub). To use an image from a private docker repository, use the constructor’s environment_definition parameter instead.
image_registry_details (ContainerRegistry)
The details of the Docker image registry.
user_managed (bool)
Specifies whether Azure ML reuses an existing Python environment. If false, a Python environment is created based on the conda dependencies specification.
conda_packages (list)
A list of strings representing conda packages to be added to the Python environment for the experiment.
pip_packages (list)
A list of strings representing pip packages to be added to the Python environment for the experiment.
conda_dependencies_file_path (str)
The relative path to the conda dependencies yaml file. If specified, Azure ML will not install any framework related packages.
DEPRECATED. Use the conda_dependencies_file parameter.
Specify either conda_dependencies_file_path or conda_dependencies_file. If both are specified, conda_dependencies_file is used.
pip_requirements_file_path (str)
The relative path to the pip requirements text file.
DEPRECATED. Use the pip_requirements_file parameter.
This parameter can be specified in combination with the pip_packages parameter. Specify either pip_requirements_file_path or pip_requirements_file. If both are specified, pip_requirements_file is used.
conda_dependencies_file (str)
The relative path to the conda dependencies yaml file. If specified, Azure ML will not install any framework related packages.
pip_requirements_file (str)
The relative path to the pip requirements text file. This parameter can be specified in combination with the pip_packages parameter.
environment_variables (dict)
A dictionary of environment variables names and values. These environment variables are set on the process where user script is being executed.
environment_definition (Environment)
The environment definition for the experiment. It includes PythonSection, DockerSection, and environment variables. Any environment option not directly exposed through other parameters to the Estimator construction can be set using this parameter. If this parameter is specified, it will take precedence over other environment-related parameters like use_gpu, custom_docker_image, conda_packages, or pip_packages. Errors will be reported on invalid combinations.
inputs (list)
A list of DataReference or DatasetConsumptionConfig objects to use as input.
source_directory_data_store (Datastore)
The backing data store for the project share.
shm_size (str)
The size of the Docker container’s shared memory block. If not set, the default azureml.core.environment._DEFAULT_SHM_SIZE is used. For more information, see Docker run reference.
resume_from (DataPath)
The data path containing the checkpoint or model files from which to resume the experiment.
max_run_duration_seconds (int)
The maximum allowed time for the run. Azure ML will attempt to automatically cancel the run if it takes longer than this value.
[PERSONAL] Write code for an estimator that uses the remote compute.
# Get the training dataset
diabetes_ds = ws.datasets.get("Diabetes Dataset")

# Create an estimator that uses the remote compute
hyper_estimator = SKLearn(source_directory=experiment_folder,
                          inputs=[diabetes_ds.as_named_input('diabetes')],  # Pass the dataset as an input
                          compute_target=cpu_cluster,
                          conda_packages=['pandas', 'ipykernel', 'matplotlib'],
                          pip_packages=['azureml-sdk', 'argparse', 'pyarrow'],
                          entry_script='diabetes_training.py')
source (this is a good source for general setup): https://notebooks.azure.com/GraemeMalcolm/projects/azureml-primers/html/04%20-%20Optimizing%20Model%20Training.ipynb
[EXAM - UDEMY]
You are creating a new experiment in Azure Machine Learning Studio.
You have a small dataset that has missing values in many columns.
The data does not require the application of predictors for each column.
You plan to use the Clean Missing Data module. You need to select a data cleaning method.
Which method should you use?
- SMOTE ( synthetic minority oversampling technique)
- Replace using probabilistic PCA
- Replace using MICE
- Normalization
In the Clean Missing Data module, select Replace using Probabilistic PCA as the cleaning method.
Replace using Probabilistic PCA: Compared to other options, such as Multiple Imputation using Chained Equations (MICE), this option has the advantage of not requiring the application of predictors for each column. Instead, it approximates the covariance for the full dataset. Therefore, it might offer better performance for datasets that have missing values in many columns.
[PERSONAL]
Replace using Probabilistic PCA
YT: https://www.youtube.com/watch?v=6z6yipdfe3o
Replaces the missing values by using a linear model that analyzes the correlations between the columns and estimates a low-dimensional approximation of the data, from which the full data is reconstructed. The underlying dimensionality reduction is a probabilistic form of Principal Component Analysis (PCA), and it implements a variant of the model proposed in the Journal of the Royal Statistical Society, Series B 21(3), 611–622 by Tipping and Bishop.
Compared to other options, such as Multiple Imputation using Chained Equations (MICE), this option has the advantage of not requiring the application of predictors for each column. Instead, it approximates the covariance for the full dataset. Therefore, it might offer better performance for datasets that have missing values in many columns.
The key limitations of this method are that it expands categorical columns into numerical indicators and computes a dense covariance matrix of the resulting data. It also is not optimized for sparse representations. For these reasons, datasets with large numbers of columns and/or large categorical domains (tens of thousands) are not supported due to prohibitive space consumption.
[PERSONAL] Pro’s and cons for using Probabilistic PCA
Pros:
- Does not require the application of predictors for each column; it approximates the covariance for the full dataset, so it might offer better performance for datasets that have missing values in many columns.
Cons:
- Expands categorical columns into numerical indicators and computes a dense covariance matrix.
- Not optimized for sparse representations.
- Not suitable for datasets with large numbers of columns and/or large categorical domains (tens of thousands).
[EXAM - UDEMY]
You are a data scientist using Azure Machine Learning Studio.
You are evaluating a completed binary classification machine learning model.
You need to use the precision as the evaluation metric.
Which visualization should you use?
- box-plot
- binary classification confusion matrix
- violin plot
- gradient descent
Explanation
Correct answer: the binary classification confusion matrix. Precision = TP / (TP + FP), and both counts can be read directly from the confusion matrix.
Incorrect Answers:
1) A violin plot is a visual that traditionally combines a box plot and a kernel density plot.
2) Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
3) A box plot lets you see basic distribution information about your data, such as median, mean, range and quartiles but doesn’t show you how your data looks throughout its range.
[EXAM - UDEMY] You are analyzing a raw dataset that requires cleaning.
You must perform transformations and manipulations by using Azure Machine Learning Studio.
You need to identify the correct module to perform the below transformation.
Which module should you choose?
Scenario:
Replace missing values by removing rows and columns
- clean missing data
- convert to indicator values
- remove duplicate rows
- threshold filter
- smote
Clean missing data
Each time that you apply the Clean Missing Data module to a set of data, the same cleaning operation is applied to all columns that you select. Therefore, if you need to clean different columns using different methods, use separate instances of the module.
Add the Clean Missing Data module to your pipeline, and connect the dataset that has missing values.
For Columns to be cleaned, choose the columns that contain the missing values you want to change. You can choose multiple columns, but you must use the same replacement method in all selected columns. Therefore, typically you need to clean string columns and numeric columns separately.
For example, to check for missing values in all numeric columns:
Select the Clean Missing Data module, and click on Edit column in the right panel of the module.
For Include, select Column types from the dropdown list, and then select Numeric.
Any cleaning or replacement method that you choose must be applicable to all columns in the selection. If the data in any column is incompatible with the specified operation, the module returns an error and stops the pipeline.
[EXAM - UDEMY]
You are a data scientist using Azure Machine Learning Studio.
You are creating a machine learning model.
You need to identify outliers in the data.
Which two visualizations can you use?
- random forest diagram
- Venn diagram
- Scatter plot
- ROC-curve
- BOX-plot
Explanation
The box-plot algorithm can be used to display outliers.
One other way to quickly identify Outliers and represent visually is to create scatter plots.
[PERSONAL]
ROC-curve
The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR). Classifiers that give curves closer to the top-left corner indicate a better performance. As a baseline, a random classifier is expected to give points lying along the diagonal (FPR = TPR). The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
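A minimal scikit-learn sketch that computes the points of a ROC curve and the area under it, on synthetic data:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)   # points of the ROC curve (FPR vs TPR)
print(roc_auc_score(y_te, probs))               # area under the curve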
[PERSONAL] Fisher score
Fisher score is one of the most widely used supervised feature selection methods. However, it selects each feature independently according to their scores under the Fisher criterion, which leads to a suboptimal subset of features.
In mathematical statistics, the Fisher information (sometimes simply called information[1]) is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X.
Extra information: https://towardsdatascience.com/overview-of-feature-selection-methods-a2d115c7a8f7
[PERSONAL]
Mutual Information
The mutual information score is particularly useful in feature selection because it maximizes the mutual information between the joint distribution and target variables in datasets with many dimensions.
[EXAM - UDEMY]
You are a data science instructor at your company.
You plan to deliver a hands-on workshop to several students.
The workshop will focus on creating data visualizations using Python.
Each student will use a device that has internet access.
Student devices are not configured for Python development.
Students do not have administrator access to install software on their devices.
Azure subscriptions are not available for students.
You need to ensure that students can run Python-based data visualization code.
- azure notebooks
- Azure ML service
- Anaconda data science platform
- Azure Batch AI
- azure notebooks
[EXAM - UDEMY]
Your supervisor asked you to preprocess text from CSV files.
You load the Azure Machine Learning Studio default stop words list.
You need to configure the Preprocess Text module to meet the following requirements:
§ Ensure that multiple related words map to a single canonical form.
§ Remove pipe characters from text.
§ Remove words to optimize information retrieval.
Which three options should you select?
- remove stop words
- lemmatization
- remove special characters
[PERSONAL] Lemmatisation (or lemmatization) and difference with stemming.
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
[PERSONAL] Stemming
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
[EXAM - UDEMY]
You plan to explore demographic data for home ownership in various cities. The data is in a CSV file with the following format:
age,city,income,home_owner
21,Chicago,50000,0
35,Seattle,120000,1
23,Seattle,65000,0
45,Seattle,130000,1
18,Chicago,48000,0
You need to run an experiment in your Azure Machine Learning workspace to explore the data and log the results. The experiment must log the following information:
- the number of observations in the dataset
- a box plot of income by home_owner
- a dictionary containing the city names and the average income for each city
You need to use the appropriate logging methods of the experiment’s run object to log the required information.
How should you complete the code?
log
log_image
log_table
Explanation
Box 1: log - The number of observations in the dataset.
run.log(name, value, description='') Scalar values: Log a numerical or string value to the run with the given name. Logging a metric to a run causes that metric to be stored in the run record in the experiment. You can log the same metric multiple times within a run, the result being considered a vector of that metric.
Example: run.log("accuracy", 0.95)
Box 2: log_image - A box plot of income by home_owner.
log_image: Log an image to the run record. Use log_image to log a .PNG image file or a matplotlib plot to the run. These images will be visible and comparable in the run record. Example: run.log_image("ROC", plot=plt)
Box 3: log_table - A dictionary containing the city names and the average income for each city. log_table: Log a dictionary object to the run with the given name.
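A sketch of how the three calls might look inside the training script (the DataFrame reuses the sample rows above; this assumes the script runs in an Azure ML run context obtained with Run.get_context()):
from azureml.core import Run
import matplotlib.pyplot as plt
import pandas as pd

run = Run.get_context()
data = pd.DataFrame({'age': [21, 35, 23, 45, 18],
                     'city': ['Chicago', 'Seattle', 'Seattle', 'Seattle', 'Chicago'],
                     'income': [50000, 120000, 65000, 130000, 48000],
                     'home_owner': [0, 1, 0, 1, 0]})

run.log('observations', len(data))                 # scalar value

data.boxplot(column='income', by='home_owner')
run.log_image('income_by_home_owner', plot=plt)    # matplotlib plot

avg = data.groupby('city')['income'].mean()
run.log_table('avg_income_by_city',
              {'city': list(avg.index), 'avg_income': list(avg.values)})   # dictionary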
[EXAM - UDEMY]
You are a data scientist using Azure Machine Learning Studio.
You are creating a machine learning model in Python.
The provided dataset contains several numerical columns and one text column.
The text column represents a product’s category.
The product category will always be one of the following:
§ Bikes
§ Cars
§ Vans
§ Boats
You are building a regression model using the scikit-learn Python package.
You need to transform the text data to be compatible with the scikit-learn Python package.
How should you complete the code segment? To answer, select the appropriate options in the answer area.
Import pandas as dataframe
Use .map method of the dataframe
Explanation
Box 1: pandas as df
Pandas takes data like a CSV or TSV file, or a SQL database and creates a Python object with rows and columns called data frame that looks very similar to table in a statistical software (think Excel or SPSS for example).
Box 2: map[ProductCategoryMapping]
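A minimal pandas sketch of the mapping approach (the column name ProductCategory and the numeric codes are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({'ProductCategory': ['Bikes', 'Cars', 'Vans', 'Boats', 'Cars']})
category_mapping = {'Bikes': 0, 'Cars': 1, 'Vans': 2, 'Boats': 3}
df['ProductCategory'] = df['ProductCategory'].map(category_mapping)   # text -> numeric codes scikit-learn can consume
print(df)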
[EXAM - UDEMY]
DRAG DROP
You have a dataset that contains over 150 features.
You use the dataset to train a Support Vector Machine (SVM) binary classifier.
You need to use the Permutation Feature Importance module in Azure Machine Learning Studio to compute a set of feature importance scores for the dataset.
In which order should you perform the actions?
Add a Two-Class Support Vector Machine module to initialize the SVM classifier.
Add a dataset to the experiment
Set the Metric for measuring performance property to Classification - Accuracy and then run the experiment.
Add a Split Data module to create training and test dataset.
Add a Permutation Feature Importance module and connect to the trained model and test dataset.
Step 1: Add a Two-Class Support Vector Machine module to initialize the SVM classifier.
Step 2: Add a dataset to the experiment
Step 3: Add a Split Data module to create training and test dataset.
Step 4: Add a Permutation Feature Importance module and connect to the trained model and test dataset.
Step 5: Set the Metric for measuring performance property to Classification - Accuracy and then run the experiment.
[EXAM - UDEMY]
You are a data scientist and you use Azure Machine Learning Studio.
You use Azure Machine Learning Studio to build a machine learning experiment.
You need to divide data into two distinct datasets.
Which module should you use?
Split data
[EXAM - UDEMY]
You are a data scientist using Azure Machine Learning Studio.
You are using the Azure Machine Learning Service to automate hyperparameter exploration of your neural network classification model.
You must define the hyperparameter space to automatically tune hyperparameters using random sampling according to following requirements:
§ The learning rate must be selected from a normal distribution with a mean value of 10 and a standard deviation of 3.
§ Batch size must be 16, 32 and 64.
§ Keep probability must be a value selected from a uniform distribution between the range of 0.05 and 0.1.
You need to use the param_sampling method of the Python API for the Azure Machine Learning Service.
How should you complete the code segment?
param_sampling = RandomParameterSampling({"learning_rate": ?, "batch_size": ?, "keep_probability": ?})
learning_rate = normal(10, 3)
batch_size = choice(16, 32, 64)
keep_probability = uniform(0.05, 0.1)
Explanation
Random sampling allows the search space to include both discrete and continuous hyperparameters.
In random sampling, hyperparameter values are randomly selected from the defined search space.
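With imports, the completed search-space definition might look like this (a sketch assuming the azureml.train.hyperdrive package):
from azureml.train.hyperdrive import RandomParameterSampling, choice, normal, uniform

param_sampling = RandomParameterSampling({
    'learning_rate': normal(10, 3),          # normal distribution, mean 10, standard deviation 3
    'batch_size': choice(16, 32, 64),        # discrete choices
    'keep_probability': uniform(0.05, 0.1)   # uniform between 0.05 and 0.1
})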
[EXAM - UDEMY]
You are a data scientist using Azure Machine Learning Studio.
You are analyzing a dataset by using Azure Machine Learning Studio.
You need to generate a statistical summary that contains the p-value and the unique count for each feature column.
Which two modules can you use?
- export count table
- computer linear correlation
- execute python script
- summarize data
- convert to indicator values
Explanation The Export Count Table module is provided for backward compatibility with experiments that use the Build Count Table (deprecated) and Count Featurizer (deprecated) modules.
Summarize Data statistics are useful when you want to understand the characteristics of the complete dataset. For example, you might need to know:
§ How many missing values are there in each column?
§ How many unique values are there in a feature column?
§ What is the mean and standard deviation for each column?
The module calculates the important scores for each column, and returns a row of summary statistics for each variable (data column) provided as input.
Incorrect Answers:
The Compute Linear Correlation module in Azure Machine Learning Studio is used to compute a set of Pearson correlation coefficients for each possible pair of variables in the input dataset.
With Python, you can perform tasks that aren’t currently supported by existing Studio modules such as:
§ Visualizing data using matplotlib
§ Using Python libraries to enumerate datasets and models in your workspace
§ Reading, loading, and manipulating data from sources not supported by the Import Data module
The purpose of the Convert to Indicator Values module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model.
[EXAM - UDEMY]
You are building a regression model for estimating the number of calls during an event hosted by your company.
You need to determine whether the feature values achieve the conditions to build a Poisson regression model.
Which two conditions must the feature set contain? Each correct answer presents part of the solution.
- sign of the label data?
- type of number?
Label data must be positive whole numbers
Poisson regression is intended for use in regression models that are used to predict numeric values, typically counts. Therefore, you should use this module to create your regression model only if the values you are trying to predict fit the following conditions:
§ The response variable has a Poisson distribution.
§ Counts cannot be negative. The method will fail outright if you attempt to use it with negative labels.
§ A Poisson distribution is a discrete distribution; therefore, it is not meaningful to use this method with non-whole numbers.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/poisson-regression
[Personal]
Poisson distribution
is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.[1]
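Its probability mass function for a mean rate λ (a standard result, added for reference):
P(X = k) = (λ^k * e^(-λ)) / k!, for k = 0, 1, 2, …; both the mean and the variance equal λ.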
[Personal]
Melt function in pandas:
pd.melt(dataframe, id_vars='shop', value_vars=['2017', '2018'])
Pandas melt() function is used to change the DataFrame format from wide to long. It is used to create a specific format of the DataFrame object where one or more columns work as identifiers. All the remaining columns are treated as values and unpivoted to the row axis, leaving only two columns: variable and value.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html
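A small worked example of the call shown above (the shop and year values are made up):
import pandas as pd

sales = pd.DataFrame({'shop': ['A', 'B'], '2017': [100, 150], '2018': [120, 180]})
long_format = pd.melt(sales, id_vars='shop', value_vars=['2017', '2018'])
print(long_format)   # columns: shop, variable (the year) and value (the figure for that year)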