Questions (subset) Flashcards
Many but not all confirmed...
You are developing a hands-on workshop to introduce Docker for Windows to attendees. You need to ensure that workshop attendees can install Docker on their devices.
Which two prerequisite components should attendees install on the devices? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Microsoft Hardware-Assisted Virtualization Detection Tool
B. Kitematic
C. BIOS-enabled virtualization
D. VirtualBox
E. Windows 10 64-bit Professional
Correct Answer: CE
C - BIOS-enabled virtualization: Make sure your Windows system supports Hardware Virtualization Technology and that virtualization is enabled. Ensure that hardware virtualization support is turned on in the BIOS settings.
E - Windows 10 64-bit Professional: Docker Desktop for Windows requires a 64-bit edition of Windows 10 Professional or Enterprise, because it relies on Hyper-V. (The older Docker Toolbox supported 64-bit Windows 7 or higher.)
References:
https://docs.docker.com/toolbox/toolbox_install_windows/
https://blogs.technet.microsoft.com/canitpro/2015/09/08/step-by-step-enabling-hyper-v-for-use-on-windows-10/
Your team is building a data engineering and data science development environment.
The environment must support the following requirements:
✑ support Python and Scala
✑ compose data storage, movement, and processing services into automated data pipelines
✑ the same tool should be used for the orchestration of both data engineering and data science
✑ support workload isolation and interactive workloads
✑ enable scaling across a cluster of machines
You need to create the environment.
What should you do?
A. Build the environment in Apache Hive for HDInsight and use Azure Data Factory for orchestration.
B. Build the environment in Azure Databricks and use Azure Data Factory for orchestration.
C. Build the environment in Apache Spark for HDInsight and use Azure Container Instances for orchestration.
D. Build the environment in Azure Databricks and use Azure Container Instances for orchestration.
Correct Answer: B
Azure Databricks is fully integrated with Azure Data Factory. In Azure Databricks you can create two different types of clusters: Standard clusters (the default) support Python, R, Scala, and SQL; High Concurrency clusters add workload isolation for interactive workloads.
Incorrect Answers:
D: Azure Container Instances is good for development or testing. Not suitable for production workloads.
References:
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/data-science-and-machine-learning
You train a model and register it in your Azure Machine Learning workspace. You are ready to deploy the model as a real-time web service.
You deploy the model to an Azure Kubernetes Service (AKS) inference cluster, but the deployment fails because an error occurs when the service runs the entry script that is associated with the model deployment.
You need to debug the error by iteratively modifying the code and reloading the service, without requiring a re-deployment of the service for each code update.
What should you do?
A. Modify the AKS service deployment configuration to enable application insights and re-deploy to AKS.
B. Create an Azure Container Instances (ACI) web service deployment configuration and deploy the model on ACI.
C. Add a breakpoint to the first line of the entry script and redeploy the service to AKS.
D. Create a local web service deployment configuration and deploy the model to a local Docker container.
E. Register a new version of the model and update the entry script to load the new version of the model from its registered path.
Correct Answer: D
A local web service deployment configuration runs the model in a Docker container on your own machine. This is the documented way to debug entry-script failures: after you modify the scoring code, LocalWebservice.reload() picks up the change without rebuilding the image or re-deploying the service, which satisfies the requirement to iterate without re-deployment. Deploying to ACI (option B) is useful for dev/test, but every entry-script change still requires a full re-deployment.
The recommended and most up-to-date approach for model deployment is the Model.deploy() API with an Environment object as an input parameter. The service then creates a base Docker image during the deployment stage and mounts the required models, all in one call. The basic deployment tasks are:
- Register the model in the workspace model registry.
- Define an inference configuration:
  a. Create an Environment object based on the dependencies you specify in an environment YAML file, or use one of the curated environments.
  b. Create an inference configuration (InferenceConfig object) based on the environment and the scoring script.
- Deploy the model locally, to Azure Container Instances (ACI), or to Azure Kubernetes Service (AKS).
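A sketch of the local-deployment debug loop described above. The model name 'loan-model', the files score.py and env.yml, and the port are placeholder assumptions, not values from the question:

```python
# Hypothetical sketch: deploy a registered model to a local Docker container,
# then iterate on the entry script with reload() instead of re-deploying.
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import LocalWebservice

ws = Workspace.from_config()
model = Model(ws, name='loan-model')                      # assumed model name
env = Environment.from_conda_specification('debug-env', 'env.yml')
inference_config = InferenceConfig(entry_script='score.py', environment=env)

deployment_config = LocalWebservice.deploy_configuration(port=8890)
service = Model.deploy(ws, 'local-debug', [model], inference_config, deployment_config)

# After editing score.py, pick up the change without rebuilding the image:
service.reload()
print(service.run(input_data='{"data": [[1, 2, 3]]}'))
```

The same InferenceConfig can later be reused unchanged for the AKS deployment once the entry script works locally.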
You are creating a classification model for a banking company to identify possible instances of credit card fraud. You plan to create the model in Azure Machine Learning by using automated machine learning. The training dataset that you are using is highly unbalanced.
You need to evaluate the classification model. Which primary metric should you use?
A. normalized_mean_absolute_error
B. AUC_weighted
C. accuracy
D. normalized_root_mean_squared_error
E. spearman_correlation
Correct Answer: B
AUC_weighted is a classification metric. "Weighted" is the arithmetic mean of the score for each class, weighted by the number of true instances in each class.
Incorrect Answers:
A: normalized_mean_absolute_error is a regression metric, not a classification metric.
C: When comparing approaches to imbalanced classification problems, consider using metrics beyond accuracy such as recall, precision, and AUROC. It may be that switching the metric you optimize for during parameter selection or model selection is enough to provide desirable performance detecting the minority class.
D: normalized_root_mean_squared_error is a regression metric, not a classification metric.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml
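A toy illustration (not the AutoML implementation) of why accuracy misleads on an unbalanced fraud dataset while AUC does not:

```python
# 95 legitimate transactions, 5 fraudulent ones: a model that never flags fraud
# still scores 95% accuracy, but its AUC exposes it as useless.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive example outranks a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0] * 95 + [1] * 5
print(accuracy(y_true, [0] * 100))   # 0.95 -- looks great, yet detects no fraud
print(auc(y_true, [0.1] * 100))      # 0.5  -- identical scores, no ranking power
```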
You are a data scientist working for a bank and have used Azure ML to train and register a machine learning model that predicts whether a customer is likely to repay a loan.
You want to understand how your model is making selections and must be sure that the model does not violate government regulations such as denying loans based on where an applicant lives.
You need to determine the extent to which each feature in the customer data is influencing predictions.
What should you do?
A. Enable data drift monitoring for the model and its training dataset.
B. Score the model against some test data with known label values and use the results to calculate a confusion matrix.
C. Use the Hyperdrive library to test the model with multiple hyperparameter values.
D. Use the interpretability package to generate an explainer for the model.
E. Add tags to the model registration indicating the names of the features in the training dataset.
Correct Answer: D
Interpretability is critical for data scientists, auditors, and business decision makers alike to ensure compliance with company policies, industry standards, and government regulations:
Data scientists need the ability to explain their models to executives and stakeholders, so they can understand the value and accuracy of their findings. They also require interpretability to debug their models and make informed decisions about how to improve them.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability
Incorrect Answers:
A: In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It is one of the top reasons where model accuracy degrades over time, thus monitoring data drift helps detect model performance issues.
B: A confusion matrix is used to describe the performance of a classification model. Each row displays the instances of the true, or actual class in your dataset, and each column represents the instances of the class that was predicted by the model.
C: Hyperparameters are adjustable parameters you choose for model training that guide the training process. The HyperDrive package helps you automate choosing these parameters.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability-automl
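A sketch using the azureml-interpret package; the fitted model, the frames X_train and X_test, and the feature_names list are assumed to exist already:

```python
# Generate a model explainer and inspect global feature importance, e.g. to
# verify that a location-related feature is not driving loan decisions.
from interpret.ext.blackbox import TabularExplainer

explainer = TabularExplainer(model, X_train, features=feature_names)
global_explanation = explainer.explain_global(X_test)

# Feature -> importance mapping, sorted by influence on predictions:
print(global_explanation.get_feature_importance_dict())
```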
You create a multi-class image classification deep learning model that uses the PyTorch deep learning framework. You must configure Azure Machine Learning Hyperdrive to optimize the hyperparameters for the classification model.
You need to define a primary metric to determine the hyperparameter values that result in the model with the best accuracy score.
Which three actions must you perform? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point.
A. Set the primary_metric_goal of the estimator used to run the bird_classifier_train.py script to maximize.
B. Add code to the bird_classifier_train.py script to calculate the validation loss of the model and log it as a float value with the key loss.
C. Set the primary_metric_goal of the estimator used to run the bird_classifier_train.py script to minimize.
D. Set the primary_metric_name of the estimator used to run the bird_classifier_train.py script to accuracy.
E. Set the primary_metric_name of the estimator used to run the bird_classifier_train.py script to loss.
F. Add code to the bird_classifier_train.py script to calculate the validation accuracy of the model and log it as a float value with the key accuracy.
Correct Answer: ADF
AD:
primary_metric_name='accuracy',
primary_metric_goal=PrimaryMetricGoal.MAXIMIZE
Make sure to log this value in your training script.
primary_metric_name: The name of the primary metric to optimize. The name of the primary metric needs to exactly match the name of the metric logged by the training script.
primary_metric_goal: It can be either PrimaryMetricGoal.MAXIMIZE or PrimaryMetricGoal.MINIMIZE and determines whether the primary metric will be maximized or minimized when evaluating the runs.
F: The training script calculates the val_accuracy and logs it as “accuracy”, which is used as the primary metric.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters
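A configuration sketch tying options A, D, and F together; the estimator object, the sampling range, and the val_accuracy variable are assumptions for illustration:

```python
# HyperDrive configuration: the primary metric name must exactly match the
# key logged by the training script.
from azureml.train.hyperdrive import (HyperDriveConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, uniform)

param_sampling = RandomParameterSampling({'learning_rate': uniform(0.001, 0.1)})
hd_config = HyperDriveConfig(estimator=estimator,  # runs bird_classifier_train.py
                             hyperparameter_sampling=param_sampling,
                             primary_metric_name='accuracy',                 # D
                             primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, # A
                             max_total_runs=20)

# F: inside bird_classifier_train.py, log the metric under the same key:
from azureml.core import Run
Run.get_context().log('accuracy', float(val_accuracy))
```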
You plan to use automated machine learning to train a regression model. You have data that has features which have missing values, and categorical features with few distinct values.
You need to configure automated machine learning to automatically impute missing values and encode categorical features as part of the training task.
Which parameter and value pair should you use in the AutoMLConfig class?
A. featurization = 'auto'
B. enable_voting_ensemble = True
C. task = 'classification'
D. exclude_nan_labels = True
E. enable_tf = True
Correct Answer: A
Indicator for whether featurization step should be done automatically or not, or whether customized featurization should be used. Column type is automatically detected. Based on the detected column type preprocessing/featurization is done as follows:
- Categorical: Target encoding, one hot encoding, drop high cardinality categories, impute missing values.
- Numeric: Impute missing values, cluster distance, weight of evidence.
- DateTime: Several features such as day, seconds, minutes, hours etc.
- Text: Bag of words, pre-trained Word embedding, text target encoding.
Reference:
https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig
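A minimal AutoMLConfig sketch for this scenario; the dataset, label column name, and compute target are assumed placeholders:

```python
# Automated ML regression task with automatic featurization: missing values
# are imputed and categorical features encoded without extra configuration.
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='regression',
                             training_data=train_dataset,   # assumed dataset
                             label_column_name='price',     # assumed label
                             featurization='auto',          # impute + encode
                             primary_metric='normalized_root_mean_squared_error')
```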
You create a multi-class image classification deep learning model that uses a set of labeled images. You create a script file named train.py that uses the PyTorch 1.3 framework to train the model.
You must run the script by using an estimator. The code must not require any additional Python libraries to be installed in the environment for the estimator. The time required for model training must be minimized.
You need to define the estimator that will be used to run the script.
Which estimator type should you use?
A. TensorFlow
B. PyTorch
C. SKLearn
D. Estimator
Correct Answer: B
For PyTorch, TensorFlow and Chainer tasks, Azure Machine Learning provides respective PyTorch, TensorFlow, and Chainer estimators to simplify using these frameworks.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-ml-models
You are a lead data scientist for a project that tracks the health and migration of birds.
You create a multi-class image classification deep learning model that uses a set of labeled bird photographs collected by experts.
You have 100,000 photographs of birds. All photographs use the JPG format and are stored in an Azure blob container in an Azure subscription.
You need to access the bird photograph files in the Azure blob container from the Azure Machine Learning service workspace that will be used for deep learning model training. You must minimize data movement.
What should you do?
A. Create an Azure Data Lake store and move the bird photographs to the store.
B. Create an Azure Cosmos DB database and attach the Azure Blob containing bird photographs storage to the database.
C. Create and register a dataset by using TabularDataset class that references the Azure blob storage containing bird photographs.
D. Register the Azure blob storage containing the bird photographs as a datastore in Azure Machine Learning service.
E. Copy the bird photographs to the blob datastore that was created with your Azure Machine Learning service workspace.
Correct Answer: D
We recommend creating a datastore for an Azure Blob container. When you create a workspace, an Azure blob container and an Azure file share are automatically registered to the workspace.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data
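A registration sketch for option D; the datastore name, container name, account name, and key are placeholders:

```python
# Register the existing blob container as a datastore. No data is copied:
# the datastore is only a reference, which minimizes data movement.
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()
bird_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='bird_photos',        # assumed name
    container_name='birds',              # the existing blob container
    account_name='birdstorageaccount',   # placeholder
    account_key='<access-key>')
```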
You use the Azure Machine Learning service to create a tabular dataset named training_data. You plan to use this dataset in a training script.
You create a variable that references the dataset using the following code:
training_ds = workspace.datasets.get("training_data")
You define an estimator to run the script.
You need to set the correct property of the estimator to ensure that your script can access the training_data dataset.
Which property should you set?
A. environment_definition = {"training_data":training_ds}
B. inputs = [training_ds.as_named_input('training_ds')]
C. script_params = {"--training_ds":training_ds}
D. source_directory = training_ds
Correct Answer: B
Example:
# Get the training dataset
diabetes_ds = ws.datasets.get("Diabetes Dataset")

# Create an estimator that uses the remote compute
hyper_estimator = SKLearn(source_directory=experiment_folder,
                          inputs=[diabetes_ds.as_named_input('diabetes')],  # Pass the dataset as an input
                          compute_target=cpu_cluster,
                          conda_packages=['pandas', 'ipykernel', 'matplotlib'],
                          pip_packages=['azureml-sdk', 'argparse', 'pyarrow'],
                          entry_script='diabetes_training.py')
Reference:
https://notebooks.azure.com/GraemeMalcolm/projects/azureml-primers/html/04%20-%20Optimizing%20Model%20Training.ipynb
You are creating a new Azure Machine Learning pipeline using the designer.
The pipeline must train a model using data in a comma-separated values (CSV) file that is published on a website. You have not created a dataset for this file.
You need to ingest the data from the CSV file into the designer pipeline using the minimal administrative effort.
Which module should you add to the pipeline in Designer?
A. Convert to CSV
B. Enter Data Manually
C. Import Data
D. Dataset
Correct Answer: C
Notes:
- "... using the minimal administrative effort."
- "Dataset" is not a module in the designer.
- Import Data supports URL via HTTP.
However:
The preferred way to provide data to a pipeline is a Dataset object. The Dataset object points to data that lives in or is accessible from a datastore or at a Web URL.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-designer-import-data
https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/import-data
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline
You have a comma-separated values (CSV) file containing data from which you want to train a classification model.
You are using the Automated Machine Learning interface in Azure Machine Learning studio to train the classification model. You set the task type to Classification.
You need to ensure that the Automated Machine Learning process evaluates only linear models.
What should you do?
A. Add all algorithms other than linear ones to the blocked algorithms list.
B. Set the Exit criterion option to a metric score threshold.
C. Clear the option to perform automatic featurization.
D. Clear the option to enable deep learning.
E. Set the task type to Regression.
Correct Answer: A
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-automated-ml-for-ml-models
You create a deep learning model for image recognition on Azure Machine Learning service using GPU-based training.
You must deploy the model to a context that allows for real-time GPU-based inferencing.
You need to configure compute resources for model inferencing.
Which compute type should you use?
A. Azure Container Instance
B. Azure Kubernetes Service
C. Field Programmable Gate Array
D. Machine Learning Compute
Correct Answer: B
You can use Azure Machine Learning to deploy a GPU-enabled model as a web service. Deploying a model on Azure Kubernetes Service (AKS) is one option.
The AKS cluster provides a GPU resource that is used by the model for inference.
Inference, or model scoring, is the phase where the deployed model is used to make predictions. Using GPUs instead of CPUs offers performance advantages on highly parallelizable computation.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-inferencing-gpus
You train and register a model in your Azure Machine Learning workspace.
You must publish a pipeline that enables client applications to use the model for batch inferencing. You must use a pipeline with a single ParallelRunStep step that runs a Python inferencing script to get predictions from the input data.
You need to create the inferencing script for the ParallelRunStep pipeline step.
Which two functions should you include? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point.
A. run(mini_batch)
B. main()
C. batch()
D. init()
E. score(mini_batch)
Correct Answer: AD
Reference:
https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/machine-learning-pipelines/parallel-run
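A skeleton of a ParallelRunStep entry script showing the two required functions. The model load is replaced here by a toy threshold function so the init()/run() shape can be demonstrated end to end; a real script would load the registered model from AZUREML_MODEL_DIR inside init():

```python
# ParallelRunStep entry-script contract: init() runs once per worker,
# run(mini_batch) runs once per mini-batch and must return the results.
model = None

def init():
    global model
    # Real scripts load the registered model here (e.g. via joblib/torch
    # from the AZUREML_MODEL_DIR path); a threshold function stands in.
    model = lambda x: 1 if x >= 0.5 else 0

def run(mini_batch):
    # Must return a list (or DataFrame) with one result per input item.
    return [model(item) for item in mini_batch]

init()
print(run([0.2, 0.7, 0.5]))   # [0, 1, 1]
```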
You create a multi-class image classification deep learning model. You train the model by using PyTorch version 1.2.
You need to ensure that the correct version of PyTorch can be identified for the inferencing environment when the model is deployed.
What should you do?
A. Save the model locally as a .pt file, and deploy the model as a local web service.
B. Deploy the model on computer that is configured to use the default Azure Machine Learning conda environment.
C. Register the model with a .pt file extension and the default version property.
D. Register the model, specifying the model_framework and model_framework_version properties.
Correct Answer: D
FRAMEWORK_NAME = 'PyTorch'
FRAMEWORK_VERSION = '1.2'
(The SDK's default PyTorch framework_version is 1.4, so version 1.2 must be specified explicitly.)
Reference:
https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.dnn.pytorch?view=azure-ml-py
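A registration sketch for option D; the workspace, model path, and model name are assumed placeholders:

```python
# Registering with model_framework and model_framework_version lets the
# inferencing environment identify the correct PyTorch version (1.2).
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()
model = Model.register(workspace=ws,
                       model_path='outputs/model.pt',      # assumed path
                       model_name='bird-classifier',       # assumed name
                       model_framework=Model.Framework.PYTORCH,
                       model_framework_version='1.2')
```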
DRAG DROP -
You have a model with a large difference between the training and validation error values.
You must create a new model and perform cross-validation.
You need to identify a parameter set for the new model using Azure Machine Learning Studio. Which module you should use for each step? To answer, drag the appropriate modules to the correct steps.
Each module may be used once or more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Partition and Sample
Tune Model Hyperparameters
Split data
Two-Class Boosted Decision Tree
Correct Answer:
Box 1: Split data
Box 2: Partition and Sample -
Box 3: Two-Class Boosted Decision Tree
Box 4: Tune Model Hyperparameters
Integrated train and tune: You configure a set of parameters to use, and then let the module iterate over multiple combinations, measuring accuracy until it finds a “best” model. With most learner modules, you can choose which parameters should be changed during the training process, and which should remain fixed.
We recommend that you use Cross-Validate Model to establish the goodness of the model given the specified parameters. Use Tune Model Hyperparameters to identify the optimal parameters.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/partition-and-sample
You plan to provision an Azure Machine Learning Basic edition workspace for a data science project.
You need to identify the tasks you will be able to perform in the workspace.
Which three tasks will you be able to perform?
Each correct answer presents a complete solution.
NOTE: Each correct selection is worth one point.
A. Create a Compute Instance and use it to run code in Jupyter notebooks.
B. Create an Azure Kubernetes Service (AKS) inference cluster.
C. Use the designer to train a model by dragging and dropping pre-defined modules.
D. Create a tabular dataset that supports versioning.
E. Use the Automated Machine Learning user interface to train a model.
Correct Answer: ABD
Reference:
https://azure.microsoft.com/en-us/pricing/details/machine-learning/
You are creating a binary classification by using a two-class logistic regression model.
You need to evaluate the model results for imbalance.
Which evaluation metric should you use?
A. Relative Absolute Error
B. AUC Curve
C. Mean Absolute Error
D. Relative Squared Error
Correct Answer: B
“AUC is a good general summary of the predictive power of a classifier, especially when the dataset is imbalanced.”
"If a model has class imbalance, the confusion matrix will help to detect a biased model" "Micro-average is preferable if there is class imbalance present in the dataset."
References:
https://adatis.co.uk/evaluating-models-in-azure-machine-learning-part-1-classification/
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml
https://docs.microsoft.com/en-us/azure/machine-learning/studio/evaluate-model-performance#evaluating-a-binary-classification-model
You are building a recurrent neural network to perform a binary classification.
The training loss, validation loss, training accuracy, and validation accuracy of each training epoch has been provided.
You need to identify whether the classification model is overfitted.
Which of the following is correct?
A. The training loss stays constant and the validation loss stays on a constant value and close to the training loss value when training the model.
B. The training loss decreases while the validation loss increases when training the model.
C. The training loss stays constant and the validation loss decreases when training the model.
D. The training loss increases while the validation loss decreases when training the model.
Correct Answer: B
References:
https://www.tensorflow.org/tutorials/keras/overfit_and_underfit
"We saw that the accuracy of our model on the validation data would peak after training for a number of epochs, and would then stagnate or start decreasing. In other words, our model would overfit to the training data."
Remember: a lower validation loss indicates a better model.
https://machinelearningmastery.com/diagnose-overfitting-underfitting-lstm-models/
An overfit model is one where performance on the train set is good and continues to improve, whereas performance on the validation set improves to a point and then begins to degrade.
Note: there is a common misconception that if test accuracy on unseen data is lower than training accuracy, the model is over-fitted. However, test accuracy is typically somewhat lower than training accuracy; the distinction between an appropriately fit and an over-fit model comes down to how much lower it is.
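The diagnostic in answer B can be sketched as a toy check over per-epoch losses; the window size and thresholds here are illustrative assumptions, not a standard recipe:

```python
# Flag overfitting when training loss keeps falling while validation loss
# rises over the last few epochs (pattern B above).
def is_overfitting(train_loss, val_loss, window=3):
    recent_train = train_loss[-window:]
    recent_val = val_loss[-window:]
    train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
    return train_falling and val_rising

print(is_overfitting([0.9, 0.6, 0.4, 0.3], [0.8, 0.6, 0.65, 0.7]))   # True
print(is_overfitting([0.9, 0.6, 0.4, 0.3], [0.8, 0.6, 0.5, 0.45]))   # False
```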
You use the Two-Class Neural Network module in Azure Machine Learning Studio to build a binary classification model.
You use the Tune Model Hyperparameters module to tune accuracy for the model. You need to select the hyperparameters that should be tuned using the Tune Model Hyperparameters module.
Which two hyperparameters should you use? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Number of hidden nodes
B. Learning Rate
C. The type of the normalizer
D. Number of learning iterations
E. Hidden layer specification
Note: the hyperparameters set in this question are shown at https://www.examtopics.com/exams/microsoft/dp-100/view/11/
Correct Answer: DE
D: For Number of learning iterations, specify the maximum number of times the algorithm should process the training cases.
E: For Hidden layer specification, select the type of network architecture to create. Between the input and output layers you can insert multiple hidden layers. Most predictive tasks can be accomplished easily with only one or a few hidden layers.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/two-class-neural-network
You create a binary classification model by using Azure Machine Learning Studio.
You must tune hyperparameters by performing a parameter sweep of the model. The parameter sweep must meet the following requirements:
✑ iterate all possible combinations of hyperparameters
✑ minimize computing resources required to perform the sweep
You need to perform a parameter sweep of the model.
Which parameter sweep mode should you use?
A. Random sweep
B. Sweep clustering
C. Entire grid
D. Random grid
E. Random seed
Correct Answer: D
Explanation
Entire grid: When you select this option, the module loops over a grid predefined by the system, to try different combinations and identify the best learner. This option is useful for cases where you don't know what the best parameter settings might be and want to try all possible combinations of values. You can also reduce the size of the grid and run a random grid sweep. Research has shown that this method yields the same results, but is more efficient computationally.
Maximum number of runs on random grid: This option also controls the number of iterations over a random sampling of parameter values, but the values are not generated randomly from the specified range; instead, a matrix is created of all possible combinations of parameter values and a random sampling is taken over the matrix. This method is more efficient and less prone to regional oversampling or undersampling.
If you are training a model that supports an integrated parameter sweep, you can also set a range of seed values to use and iterate over the random seeds as well. This is optional, but can be useful for avoiding bias introduced by seed selection.
Incorrect Answers:
B: If you are building a clustering model, use Sweep Clustering to automatically determine the optimum number of clusters and other parameters.
C: Entire grid: When you select this option, the module loops over a grid predefined by the system, to try different combinations and identify the best learner. This option is useful for cases where you don’t know what the best parameter settings might be and want to try all possible combination of values.
E: If you choose a random sweep, you can specify how many times the model should be trained, using a random combination of parameter values.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters
You are creating a machine learning model. You have a dataset that contains null rows.
You need to use the Clean Missing Data module in Azure Machine Learning Studio to identify and resolve the null and missing data in the dataset.
Which parameter should you use?
A. Replace with mean
B. Remove entire column
C. Remove entire row
D. Hot Deck
Correct Answer: C
Remove entire row: Completely removes any row in the dataset that has one or more missing values. This is useful if the missing value can be considered randomly missing.
Replace with mean works only for numeric (integer, double, and boolean) columns.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data
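Minimal stand-ins for two of the Clean Missing Data options, using None to mark a missing value; these are illustrative re-implementations, not the Studio module's code:

```python
# "Remove entire row": drop any row that contains at least one missing value.
def remove_entire_row(rows):
    return [r for r in rows if None not in r]

# "Replace with mean": impute missing values in one column with the column mean
# (only meaningful for numeric columns, as noted above).
def replace_with_mean(rows, col):
    values = [r[col] for r in rows if r[col] is not None]
    mean = sum(values) / len(values)
    return [[mean if (i == col and v is None) else v
             for i, v in enumerate(r)] for r in rows]

data = [[1.0, 2.0], [None, 4.0], [5.0, None]]
print(remove_entire_row(data))      # [[1.0, 2.0]]
print(replace_with_mean(data, 0))   # [[1.0, 2.0], [3.0, 4.0], [5.0, None]]
```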
HOTSPOT -
You plan to preprocess text from CSV files. You load the Azure Machine Learning Studio default stop words list.
You need to configure the Preprocess Text module to meet the following requirements:
✑ Ensure that multiple related words map to a single canonical form.
✑ Remove pipe characters from text.
✑ Remove words to optimize information retrieval.
Which three options should you select? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Correct Answer:
Box 1: Remove stop words -
Remove words to optimize information retrieval.
Remove stop words: Select this option if you want to apply a predefined stopword list to the text column. Stop word removal is performed before any other processes.
Box 2: Lemmatization -
Ensure that multiple related words map to a single canonical form.
Lemmatization converts multiple related words to a single canonical form.
Box 3: Remove special characters
Remove special characters: Use this option to replace any non-alphanumeric special characters with the pipe | character.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/preprocess-text
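The three selected options can be sketched as a toy pipeline. The stop-word set and lemma table below are tiny stand-ins for the Studio default lists, and unlike the Studio module (which replaces special characters with '|'), this demo simply removes them, matching the question's requirement:

```python
import re

STOP_WORDS = {'the', 'a', 'is'}                            # stand-in stopword list
LEMMAS = {'running': 'run', 'ran': 'run', 'runs': 'run'}   # toy lemma table

def preprocess(text):
    text = re.sub(r'[^0-9A-Za-z ]', '', text)              # remove special chars (incl. '|')
    words = [LEMMAS.get(w, w) for w in text.lower().split()]  # lemmatization
    return [w for w in words if w not in STOP_WORDS]       # remove stop words

print(preprocess('The dog is running | ran | runs'))   # ['dog', 'run', 'run', 'run']
```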
You are creating a binary classification by using a two-class logistic regression model.
You need to evaluate the model results for imbalance.
Which evaluation metric should you use?
A. Relative Absolute Error
B. AUC Curve
C. Mean Absolute Error
D. Relative Squared Error
Correct Answer: B
The evaluation metrics available for binary classification models are: Accuracy, Precision, Recall, F1 Score, and AUC.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio/evaluate-model-performance#evaluating-a-binary-classification-model
You are using a decision tree algorithm. You have trained a model that generalizes well at a tree depth equal to 10.
You need to select the bias and variance properties of the model with varying tree depth values.
Which properties should you select for each tree depth? To answer, select the appropriate options in the answer area.
Hot Area:
Tree Depth 5: Bias = High/Low? Variance = High/Low?
Tree Depth 15: Bias = High/Low? Variance = High/Low?
Correct Answer:
Tree Depth 5: Bias = High; Variance = Low
Tree Depth 15: Bias = Low; Variance = High
In decision trees, the depth of the tree determines the variance. A complicated decision tree (e.g. deep) has low bias and high variance.
Note: In statistics and machine learning, the bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa. Increasing the bias will decrease the variance. Increasing the variance will decrease the bias.
References:
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
You are implementing a machine learning model to predict stock prices.
The model uses a PostgreSQL database and requires GPU processing.
You need to create a virtual machine that is pre-configured with the required tools.
What should you do?
A. Create a Data Science Virtual Machine (DSVM) Windows edition.
B. Create a Geo Al Data Science Virtual Machine (Geo-DSVM) Windows edition.
C. Create a Deep Learning Virtual Machine (DLVM) Linux edition.
D. Create a Deep Learning Virtual Machine (DLVM) Windows edition.
E. Create a Data Science Virtual Machine (DSVM) Linux edition.
Correct Answer: E
https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/linux-dsvm-walkthrough#other-tools
In the DSVM, your training models can use deep learning algorithms on hardware that’s based on graphics processing units (GPUs).
You can switch to a GPU-based VM when you’re training large models, or when you need high-speed computations while keeping the same OS disk.
You can choose any of the N series GPU enabled virtual machine SKUs with DSVM. Please note Azure free accounts do not support GPU enabled virtual machine SKUs.
The Windows editions of the DSVM come pre-installed with GPU drivers, frameworks, and GPU versions of deep learning frameworks.
** On the Linux edition, deep learning on GPUs is enabled on the Ubuntu DSVMs **
https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/overview
Incorrect Answers:
A, C: PostgreSQL (CentOS) is only available in the Linux Edition.
B: The Azure Geo AI Data Science VM (Geo-DSVM) delivers geospatial analytics capabilities from Microsoft’s Data Science VM. Specifically, this VM extends the
AI and data science toolkits in the Data Science VM by adding ESRI’s market-leading ArcGIS Pro Geographic Information System.
D: DLVM is a template on top of DSVM image. In terms of the packages, GPU drivers etc are all there in the DSVM image. Mostly it is for convenience during creation where we only allow DLVM to be created on GPU VM instances on Azure.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/overview
You create an experiment in Azure Machine Learning Studio. You add a training dataset that contains 10,000 rows. The first 9,000 rows represent class 0 (90 percent).
The remaining 1,000 rows represent class 1 (10 percent). The training set is imbalanced between the two classes. You must increase the number of training examples for class 1 to 4,000 by using 5 data rows.
You add the Synthetic Minority Oversampling Technique (SMOTE) module to the experiment. You need to configure the module.
Which values should you use? To answer, select the appropriate options in the dialog box in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Percentage: 100 / 200 / 300
Rows: 1 / 2 / 4 / 5 / 10
Correct Answer:
Box 1: 300
If you type 300 (%), the module generates 3,000 synthetic minority cases (300% of the original 1,000), bringing the minority class to the required 4,000.
“you can set the value of SMOTE percentage, using multiples of 100”
Box 2: 5
We should use 5 data rows.
Use the Number of nearest neighbors option to determine the size of the feature space that the SMOTE algorithm uses when building new cases. A nearest neighbor is a row of data (a case) that is very similar to some target case. The distance between any two cases is measured by combining the weighted vectors of all features.
By increasing the number of nearest neighbors, you get features from more cases.
By keeping the number of nearest neighbors low, you use features that are more like those in the original sample.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote
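For intuition, here is a minimal re-implementation of the SMOTE idea in plain Python (the `smote` function and its parameters are my own sketch, not the Studio module's API): each synthetic case is interpolated between a minority case and one of its k nearest neighbors. Scaled down to 100 minority rows to keep it fast, a 300% setting still yields three synthetic rows per original:

```python
import random

random.seed(1)

def smote(minority, percentage=300, k=5):
    """Generate percentage% synthetic minority rows (multiples of 100), each
    interpolated between a case and one of its k nearest neighbors."""
    n_new = len(minority) * percentage // 100
    synthetic = []
    for _ in range(n_new):
        base = random.choice(minority)
        # k nearest neighbors by squared Euclidean distance (excluding base)
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: sum((a - b) ** 2
                                             for a, b in zip(base, p)))[:k]
        nb = random.choice(neighbors)
        gap = random.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, nb)))
    return synthetic

minority = [(random.random(), random.random()) for _ in range(100)]
new_rows = smote(minority, percentage=300, k=5)
print(len(minority) + len(new_rows))  # → 400
```

Because each new row lies on a segment between two existing minority rows, the synthetic cases stay inside the region the minority class already occupies.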
DRAG DROP -
You are creating an experiment by using Azure Machine Learning Studio.
You must divide the data into four subsets for evaluation. There is a high degree of missing values in the data. You must prepare the data for analysis.
You need to select appropriate methods for producing the experiment.
Which three modules should you run in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.
NOTE:
Select and Place:
- Build Counting Transform
- Partition and Sample
- Replace discrete values
- Import data
- Latent Direchlet Transformation
- Clean Missing Data
- Missing Value Scrubber
Correct Answer
- Import data
- Clean Missing Data
- Partition and Sample
“Partition and Sample creates multiple partitions of a dataset based on sampling”
Incorrect Answers:
✑ Latent Direchlet Transformation: Latent Dirichlet Allocation module in Azure Machine Learning Studio, to group otherwise unclassified text into a number of categories. Latent Dirichlet Allocation (LDA) is often used in natural language processing (NLP) to find texts that are similar. Another common term is topic modeling.
✑ Build Counting Transform: Build Counting Transform module in Azure Machine Learning Studio, to analyze training data. From this data, the module builds a count table as well as a set of count-based features that can be used in a predictive model.
✑ Missing Value Scrubber: The Missing Values Scrubber module is deprecated.
✑ Feature hashing: Feature hashing is used for linguistics, and works by converting unique tokens into integers.
✑ Replace discrete values: the Replace Discrete Values module in Azure Machine Learning Studio is used to generate a probability score that can be used to represent a discrete value. This score can be useful for understanding the information value of the discrete values
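The three-module sequence can be mimicked in plain Python (an inline list stands in for "Import Data", mean imputation for "Clean Missing Data", and shuffle-and-slice for "Partition and Sample"; the column names are invented for the example):

```python
import random

random.seed(2)

# Step 1: "Import Data" (inline rows stand in for an actual data source)
rows = [{"age": 34,   "income": 52_000},
        {"age": None, "income": 48_000},
        {"age": 29,   "income": None},
        {"age": 41,   "income": 61_000},
        {"age": 38,   "income": 57_000},
        {"age": None, "income": 44_000}]

# Step 2: "Clean Missing Data" (replace missing values with the column mean)
for col in ("age", "income"):
    present = [r[col] for r in rows if r[col] is not None]
    mean = sum(present) / len(present)
    for r in rows:
        if r[col] is None:
            r[col] = mean

# Step 3: "Partition and Sample" (split the cleaned rows into four subsets)
random.shuffle(rows)
folds = [rows[i::4] for i in range(4)]
print([len(f) for f in folds])  # → [2, 2, 1, 1]
```

The order matters: imputation must see the full imported dataset before partitioning, which is why Clean Missing Data sits between the other two modules.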
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You are creating a model to predict the price of a student’s artwork depending on the following variables: the student’s length of education, degree type, and art form.
You start by creating a linear regression model.
You need to evaluate the linear regression model.
Solution: Use the following metrics: Mean Absolute Error, Root Mean Absolute Error, Relative Absolute Error, Accuracy, Precision, Recall, F1 score, and AUC.
Does the solution meet the goal?
A. Yes
B. No
Correct Answer: B
Accuracy, Precision, Recall, F1 score, and AUC are metrics for evaluating classification models.
Note: Mean Absolute Error, Root Mean Absolute Error, Relative Absolute Error are OK for the linear regression model.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/evaluate-model
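For reference, the regression metrics that do apply to a linear model are straightforward to compute by hand (the example values below are invented):

```python
import math

actual    = [250.0, 300.0, 180.0, 420.0, 260.0]
predicted = [245.0, 310.0, 200.0, 400.0, 270.0]

n = len(actual)
errors = [p - a for p, a in zip(predicted, actual)]

# Mean Absolute Error: average magnitude of the errors
mae = sum(abs(e) for e in errors) / n

# Root Mean Squared Error: square root of the average squared error
rmse = math.sqrt(sum(e * e for e in errors) / n)

# Relative Absolute Error: total absolute error relative to the error of
# always predicting the mean of the actual values
mean_a = sum(actual) / n
rae = sum(abs(e) for e in errors) / sum(abs(a - mean_a) for a in actual)

print(mae, rmse, rae)
```

Accuracy, precision, recall, F1 and AUC have no counterpart here: they require discrete class labels (and, for AUC, class probabilities), which a regression model does not produce.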
HOTSPOT - You have a dataset that contains 2,000 rows. You are building a machine learning classification model by using Azure Machine Learning Studio. You add a Partition and Sample module to the experiment.
You need to configure the module. You must meet the following requirements:
✑ Divide the data into subsets
✑ Assign the rows into folds using a round-robin method
✑ Allow rows in the dataset to be reused
How should you configure the module? To answer, select the appropriate options in the dialog box in the answer area. NOTE: Each correct selection is worth one point.
Hot Area:
Partition or sample mode
- Assign to Folds
- Pick a fold
- Sampling
- Head
Use replacement in the partitioning?
Randomized split?
Correct Answer:
“Partition or sample mode” –> Assign to Folds
“Use replacement in the partitioning” –> selected (rows can be reused across folds)
“Randomized split” –> not selected (rows are then assigned by the round-robin method)
Use the Split data into partitions option when you want to divide the dataset into subsets of the data. This option is also useful when you want to create a custom number of folds for cross-validation, or to split rows into several groups.
- Add the Partition and Sample module to your experiment in Studio (classic), and connect the dataset.
- For Partition or sample mode, select ‘Assign to Folds’.
- ‘Use replacement in the partitioning’: Select this option if you want the sampled row to be put back into the pool of rows for potential reuse. As a result, the same row might be assigned to several folds. If you ** do not use replacement (the default option) **, the sampled row is not put back into the pool of rows for potential reuse. As a result, each row can be assigned to only one fold.
- “Randomized split”: Select this option if you want rows to be randomly assigned to folds. ** If you do not select this option, rows are assigned to folds using the round-robin method **
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/partition-and-sample
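A plain-Python sketch of the two checkbox behaviors (my own illustration, not the module's implementation):

```python
import random

random.seed(3)
rows = list(range(10))   # row indices of a small dataset
n_folds = 4

# Default behaviour (no randomized split, no replacement): pure round-robin,
# so every row lands in exactly one fold.
round_robin = [[] for _ in range(n_folds)]
for i, row in enumerate(rows):
    round_robin[i % n_folds].append(row)

# "Use replacement in the partitioning": each draw goes back into the pool,
# so the same row may appear in several folds.
with_replacement = [[random.choice(rows) for _ in range(3)]
                    for _ in range(n_folds)]

print(round_robin)        # every row appears exactly once
print(with_replacement)   # duplicates across folds are possible
```

This matches the requirements in the question: round-robin assignment comes for free by leaving Randomized split unchecked, and checking replacement is what allows rows to be reused.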
You are building a binary classification model by using a supplied training set.
The training set is imbalanced between two classes.
You need to resolve the data imbalance.
What are three possible ways to achieve this goal?
Each correct answer presents a complete solution.
NOTE: Each correct selection is worth one point.
A. Penalize the classification
B. Resample the dataset using undersampling or oversampling
C. Normalize the training feature set
D. Generate synthetic samples in the minority class
E. Use accuracy as the evaluation metric of the model
Correct Answer: ABD
A. Penalize the classification –> add a weight column
B. Resample the dataset using undersampling or oversampling
Note: Use a performance metric that deals better with imbalanced data, for example the F1 score.
https://docs.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls#handle-imbalanced-data
Note: The best way to prevent over-fitting is to follow ML best practices, including:
✑ Using more training data, and eliminating statistical bias
✑ Preventing target leakage
✑ Using fewer features
✑ Regularization and hyperparameter optimization
✑ Model complexity limitations
✑ Cross-validation
A: Try Penalized Models - You can use the same algorithms but give them a different perspective on the problem. Penalized classification imposes an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class.
B: You can change the dataset that you use to build your predictive model to have more balanced data.
This change is called sampling your dataset and there are two main methods that you can use to even-up the classes:
✑ Consider testing undersampling when you have a lot of data (tens or hundreds of thousands of instances or more)
✑ Consider testing oversampling when you don't have a lot of data (tens of thousands of records or less)
D: Try Generate Synthetic Samples
A simple way to generate synthetic samples is to randomly sample the attributes from instances in the minority class.
References:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
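A minimal sketch of tactics A and B on the 90/10 split from the earlier SMOTE question (plain Python; the inverse-frequency weighting formula is one common convention, not a quote from the cited article):

```python
import random

random.seed(4)

labels = [0] * 9000 + [1] * 1000          # 90% / 10% imbalance

# Tactic A (penalize the classification): weight each class inversely to its
# frequency, so mistakes on the minority class cost more during training.
n = len(labels)
counts = {c: labels.count(c) for c in (0, 1)}
class_weight = {c: n / (2 * counts[c]) for c in counts}

# Tactic B (resample): randomly oversample the minority class up to parity.
minority = [l for l in labels if l == 1]
oversampled = labels + random.choices(minority, k=counts[0] - counts[1])

print(class_weight)                        # minority class weighted ~9x higher
print(oversampled.count(0), oversampled.count(1))
```

Tactic D (synthetic samples such as SMOTE) differs from plain oversampling in that it interpolates new minority rows rather than duplicating existing ones.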
You are a data scientist building a deep convolutional neural network (CNN) for image classification.
The CNN model you build shows signs of overfitting.
You need to reduce overfitting and converge the model to an optimal fit.
Which two actions should you perform?
Each correct answer presents a complete solution.
NOTE: Each correct selection is worth one point.
A. Add an additional dense layer with 512 input units.
B. Add L1/L2 regularization.
C. Use training data augmentation.
D. Reduce the amount of training data.
E. Add an additional dense layer with 64 input units.
Correct Answer: BC
Regularization (e.g., L1/L2) is a process … to prevent overfitting.
… providing a convolutional network with more training examples can reduce overfitting
https://en.wikipedia.org/wiki/Convolutional_neural_network
Correct Answer (as originally published, disputed above): BD
B: Weight regularization provides an approach to reduce the overfitting of a deep learning neural network model on the training data and improve the performance of the model on new data, such as the holdout test set.
Keras provides a weight regularization API that allows you to add a penalty for weight size to the loss function.
Three different regularizer instances are provided; they are:
✑ L1: Sum of the absolute weights.
✑ L2: Sum of the squared weights.
✑ L1L2: Sum of the absolute and the squared weights.
D: Because a fully connected layer occupies most of the parameters, it is prone to overfitting. One method to reduce overfitting is dropout. At each training stage, individual nodes are either “dropped out” of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.
By avoiding training all nodes on all training data, dropout decreases overfitting.
References:
https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-learning-with-weight-regularization/ https://en.wikipedia.org/wiki/Convolutional_neural_network
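To see why an L2 penalty restrains a model, the toy below fits y = w·x by gradient descent with and without a weight penalty (a one-parameter stand-in for a network layer; the learning rate and penalty strength are arbitrary choices for the example):

```python
# Data with an exact linear relation y = 2x
data = [(x, 2.0 * x) for x in (-2, -1, 0, 1, 2)]

def fit(lam, steps=500, lr=0.05):
    """Minimize MSE + lam * w^2 for the model y_hat = w * x."""
    w = 0.0
    n = len(data)
    for _ in range(steps):
        # gradient of the MSE term plus the gradient of the L2 penalty
        grad = sum(2 * (w * x - y) * x for x, y in data) / n + 2 * lam * w
        w -= lr * grad
    return w

w_plain = fit(lam=0.0)    # recovers w close to 2.0
w_l2    = fit(lam=1.0)    # the penalty shrinks the weight toward zero
print(w_plain, w_l2)
```

The penalized fit trades a little training error for a smaller weight, which is exactly the mechanism that keeps a large network from memorizing its training set.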
You are working with a time series dataset in Azure Machine Learning Studio.
You need to split your dataset into training and testing subsets by using the Split Data module.
Which splitting mode should you use?
A. Recommender Split
B. Regular Expression Split
C. Relative Expression Split
D. Split Rows with the Randomized split parameter set to true
Correct Answer: C
Relative Expression Split: Use this option whenever you want to apply a condition to a number column. The number could be a ** date/time field **, a column containing age or dollar amounts, or even a percentage. For example, you might want to divide your data set depending on the cost of the items, group people by age ranges, or separate data by a calendar date.
Incorrect Answers:
B: Regular Expression Split: Choose this option when you want to divide your dataset by testing a single column for a value.
References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data
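A plain-Python sketch of what a Relative Expression Split on a date column accomplishes (the column names and cutoff date are invented for the example):

```python
from datetime import date

rows = [{"when": date(2023, m, 1), "value": m * 10} for m in range(1, 13)]

# Relative Expression Split: a condition on a number/date column sends the
# matching rows to the first output and the rest to the second. For a time
# series this gives a chronological train/test split, keeping later months
# for testing instead of shuffling rows at random.
cutoff = date(2023, 10, 1)
train = [r for r in rows if r["when"] < cutoff]
test  = [r for r in rows if r["when"] >= cutoff]

print(len(train), len(test))  # → 9 3
```

This is why a randomized Split Rows (option D) is wrong for time series: shuffling would leak future observations into the training set.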