Machine Learning with Viya® 3.4 Lesson 2: Data Preparation Flashcards
Prepare for SAS Machine Learning Specialist Exam
For a new project that you are creating in SAS Model Studio, you wish to use a SAS data set that is not in memory but exists on your local machine. In which tab would this data set be located?
Data sets that are located on your local machine can be found in Import.
For a new project that you are creating in SAS Model Studio, you wish to use a SAS data set that is not in memory but exists on your connected server (not the local machine). In which tab would this data set be located?
All data sets that exist on a connected server are found in Data Sources.
What are the four groups of Advanced project settings in Model Studio?
- Advisor Options
- Partition Data
- Event-based Sampling
- Node Configuration
What are the default proportions when data is partitioned in Model Studio?
- 60% Training
- 30% Validation
- 10% Test
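As a rough illustration of the 60/30/10 split above, here is a minimal Python sketch. It is not how Model Studio partitions data internally; the function name `partition` and the fixed seed are assumptions made for the example.

```python
import random

def partition(rows, train=0.6, validate=0.3, seed=0):
    """Shuffle rows and split them into training, validation, and test lists.

    Hypothetical helper for illustration only; the remainder after the
    training and validation shares becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validate)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))  # 60 30 10
```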
Where would you adjust the threshold for rejecting inputs with missing values?
Adjust the maximum percent missing under the Advisor Options group in the Advanced project settings.
What is the validation data that the Variable Selection node creates from the training data used for?
Model assessment during the modeling process
The Data Exploration node enables you to do what?
View the most important inputs (Importance) or flag suspicious variables (Screening).
What is the best practice for handling high-cardinality input variables?
binning
Which Model Studio setting determines whether a numeric input is designated as interval or nominal?
If a numeric input has more distinct values/levels than the interval cutoff value, it is declared interval. Otherwise, it is declared nominal.
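The interval cutoff rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the card; the function name `assign_level` and the cutoff value of 20 used as a default are assumptions for the example, not a documented API.

```python
def assign_level(values, interval_cutoff=20):
    """Return 'interval' if the numeric input has more distinct values
    than the cutoff, otherwise 'nominal'. Missing values (None) are ignored.

    The default cutoff of 20 is an assumption for this sketch.
    """
    distinct = {v for v in values if v is not None}
    return "interval" if len(distinct) > interval_cutoff else "nominal"

print(assign_level(range(100)))       # interval
print(assign_level([0, 1, 0, 1, 1]))  # nominal
```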
Where would you specify the interval cutoff in Model Studio?
In the Advisor Options group under Advanced Project settings
How would you define variable metadata and assign rules to modify variables?
You can perform these tasks using either the Data tab or the Manage Variables node.
What does the temporary table produced in a Save Data node following a Decision Tree contain?
Following a decision tree, the table contains predicted probabilities and leaf IDs
What can you do using the Manage Variables node after a pipeline is run?
Set up imputation and transformation rules.
What does the Manage Variables node enable you to do in Model Studio?
Modify the data within a pipeline, for example by changing the role of a variable or adding transformations. Event-based sampling must be turned on before a pipeline is run.
Which variable selection technique identifies the set of input variables that jointly explain the maximum amount of variance contained in the data when using the Variable Selection node in Model Studio?
Unsupervised Selection
What does the Text Mining node do?
Creates topics based on groups of terms that occur together in several documents
What is the purpose of the Feature Extraction node in Model Studio?
The Feature Extraction node transforms the existing features (variables) into a lower-dimensional space by generating new features that are composites of the original features.
What is the drawback to Feature Extraction?
Composite variables are no longer meaningful with respect to the original problem
What is another term for a feature in predictive modeling?
Input
How would you specify the threshold for rejecting categorical variables in Model Studio?
Set the maximum class levels under the Advisor Options group in Advanced Project settings
Where can you set up imputation and transformation rules in Model Studio?
In the Manage Variables node after a pipeline is run
For a new project that you are creating in SAS Model Studio, you wish to use a SAS data set loaded into memory. In which tab would this data set be located?
CAS tables loaded into memory are seen in Available.
Does the Variable Selection node use supervised methods or unsupervised methods to select inputs?
The Variable Selection node can perform input selection based on both supervised and unsupervised methods.
What is the curse of dimensionality?
The more inputs you use to build the model, the more cases are required to discover the relationship between the inputs and the target
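One informal way to see why more inputs demand more cases: if each input is divided into a fixed number of bins, the number of cells that the cases must cover grows exponentially with the number of inputs. The sketch below uses hypothetical helper names and an arbitrary choice of 10 bins per input.

```python
def grid_cells(n_inputs, bins_per_input=10):
    """Number of cells when every input is cut into bins_per_input bins."""
    return bins_per_input ** n_inputs

def cases_needed(n_inputs, cases_per_cell=5, bins_per_input=10):
    """Cases required to average cases_per_cell observations in each cell."""
    return cases_per_cell * grid_cells(n_inputs, bins_per_input)

for d in (1, 2, 3, 6):
    print(d, grid_cells(d), cases_needed(d))
```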
Why is it important to reduce the number of inputs during data preparation?
The more inputs you use to build the model, the more cases are required to discover the relationship between the inputs and the target
What does the Save Data node do?
The Save Data node produces a temporary table in a CAS library.
What does the Replacement node enable you to do?
replace outliers and unknown class values with specified values
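A minimal sketch of both kinds of replacement, assuming a standard-deviation limit for outliers and an explicit set of known class levels. The function names and the limit rule are illustrative assumptions, not the node's actual options.

```python
from statistics import mean, pstdev

def replace_outliers(values, n_std=3.0):
    """Clamp values lying more than n_std standard deviations from the mean."""
    center, spread = mean(values), pstdev(values)
    lower, upper = center - n_std * spread, center + n_std * spread
    return [min(max(v, lower), upper) for v in values]

def replace_unknown_levels(values, known_levels, replacement="Unknown"):
    """Swap any class level outside known_levels for a specified value."""
    return [v if v in known_levels else replacement for v in values]

print(replace_outliers([10] * 9 + [100], n_std=2.0)[-1])    # 73.0
print(replace_unknown_levels(["A", "B", "Z"], {"A", "B"}))  # ['A', 'B', 'Unknown']
```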
How do the transformations available in the Transformations node minimize bias in model predictions?
by reducing the effect of extreme or unusual input values
What are some of the techniques used in the Feature Extraction node in Model Studio?
principal component analysis (PCA), robust PCA, singular value decomposition (SVD), and autoencoders
What is binning?
Binning is a method of transformation that converts numeric inputs to categories or groups the levels of a high-cardinality input.
Which transformation creates bins for a numeric variable?
A quantile transformation creates bins for a numeric variable.
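To make quantile binning concrete, the sketch below assigns each value to one of `n_bins` roughly equal-sized bins using Python's standard library. The helper name `quantile_bins` is hypothetical, and the cut-point method is whatever `statistics.quantiles` uses by default, which may differ from Model Studio's.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

def quantile_bins(values, n_bins=4):
    """Return a bin index (0 .. n_bins - 1) for each value."""
    cuts = quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return [bisect_right(cuts, v) for v in values]

bins = quantile_bins(list(range(1, 101)), n_bins=4)
print(sorted(Counter(bins).items()))  # [(0, 25), (1, 25), (2, 25), (3, 25)]
```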
What would you use the Anomaly Detection node in Model Studio for?
The Anomaly Detection node is used to identify and exclude anomalies using the support vector data description (SVDD).
When would it be helpful to use the Anomaly Detection node?
when using a data set where most of the data belongs to one class and the other class is scarce or missing
What does the Filtering node do?
The Filtering node excludes certain observations such as rare or extreme values
Assume the Target has an event proportion of 2% in the original data. You want to build models where event-based sampling has been used such that the modeling data set will have a 50% event proportion. What are the two ways this can be done using Model Studio?
- While the project is being created, after the data source has been selected, click the Advanced button. Select the Event-Based Sampling option. Turn on event-based sampling by checking the check box.
- After the project is created but before a pipeline has been run, go into project settings, select the Event-Based Sampling option. Turn on event-based sampling by checking the check box.
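The effect of event-based sampling on a 2% event rate can be sketched as follows: keep every event and draw just enough non-events to reach the desired proportion. This is an illustrative assumption about the idea, not Model Studio's internal procedure; the function name and seed are hypothetical.

```python
import random

def event_based_sample(rows, is_event, event_proportion=0.5, seed=0):
    """Keep all events and sample non-events to hit event_proportion."""
    events = [r for r in rows if is_event(r)]
    non_events = [r for r in rows if not is_event(r)]
    wanted = round(len(events) * (1 - event_proportion) / event_proportion)
    wanted = min(wanted, len(non_events))
    sampled = random.Random(seed).sample(non_events, wanted)
    return events + sampled

# 1,000 cases with a 2% event rate: target is 1 for the first 20 rows.
data = [{"id": i, "target": 1 if i < 20 else 0} for i in range(1000)]
sample = event_based_sample(data, lambda r: r["target"] == 1)
print(len(sample), sum(r["target"] for r in sample))  # 40 20
```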
What are some reasons for performing transformations on your data?
- Stabilizing variances
- Removing non-linearity
- Correcting non-normality
What is the default maximum cardinality for determining whether or not to reject a nominal variable?
20
Is the Maximum percent missing option turned on by default?
Yes
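The two rejection thresholds (maximum percent missing and maximum class levels) can be pictured as a simple screening function. The sketch below is hypothetical: the function name is invented, and the 50% missing default is an assumption for the example, while the 20-level default follows the card above.

```python
def screen_input(values, is_nominal, max_pct_missing=50.0, max_class_levels=20):
    """Return 'input' or a rejection reason for one variable.

    Missing values are represented as None. Thresholds are illustrative.
    """
    n = len(values)
    missing = sum(1 for v in values if v is None)
    if n and 100.0 * missing / n > max_pct_missing:
        return "rejected: too many missing values"
    levels = {v for v in values if v is not None}
    if is_nominal and len(levels) > max_class_levels:
        return "rejected: too many class levels"
    return "input"

print(screen_input([None] * 6 + [1] * 4, is_nominal=False))
print(screen_input([f"zip{i}" for i in range(25)], is_nominal=True))
print(screen_input(["A", "B", "A"], is_nominal=True))
```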
What is Singular Value Decomposition (SVD)?
Singular value decomposition (SVD) projects the high-dimensional document and term spaces into a lower-dimensional space.
What is SVD used for when selecting model inputs?
The singular values can be thought of as providing a measure of importance used to decide how many dimensions to keep.
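One common heuristic for turning singular values into a dimension count is to keep the smallest number of dimensions whose squared singular values cover a chosen share of the total. The sketch below assumes that heuristic; the 90% share and the example singular values are made up for illustration and are not Model Studio defaults.

```python
def dimensions_to_keep(singular_values, share=0.9):
    """Smallest k whose first k squared singular values reach `share` of the total.

    Expects singular values sorted from largest to smallest.
    """
    energy = [s * s for s in singular_values]
    total = sum(energy)
    running = 0.0
    for k, e in enumerate(energy, start=1):
        running += e
        if running / total >= share:
            return k
    return len(energy)

# Illustrative values only: squared values are 100, 25, 4, 1 (total 130).
print(dimensions_to_keep([10.0, 5.0, 2.0, 1.0], share=0.9))  # 2
```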
What does the Transformation node enable you to do?
alter your data by replacing an input variable with some function of that variable
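As a small example of replacing an input with a function of itself, the sketch below applies a log transformation to a skewed variable. The choice of `log1p` and the income figures are assumptions for illustration; the node offers its own set of transformations.

```python
import math

def log_transform(values):
    """Replace each non-negative value v with log(1 + v)."""
    return [math.log1p(v) for v in values]

incomes = [20_000, 35_000, 50_000, 65_000, 5_000_000]  # one extreme value
transformed = log_transform(incomes)

# The extreme value dominates far less after the transformation.
print(max(incomes) / min(incomes))          # 250.0
print(max(transformed) / min(transformed))  # well under 2
```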