Test 2 Flashcards
How can a paired t-test be used to compare two models that have been developed for a classification problem?
A paired t-test could be used to compare two classification models by applying both models to the same test data (e.g., the same cross-validation folds) and comparing their error rates. The pairwise differences between the error rates are calculated, and the mean and standard deviation of those differences are used in the t-test to determine whether there is a statistically significant difference between the models.
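A minimal sketch of the test by hand (the per-fold error rates below are hypothetical; in practice they come from evaluating both models on the same folds or test splits):

```python
# Sketch: a paired t-test on per-fold error rates of two classifiers.
# The error rates are hypothetical example numbers.
import math
import statistics

errors_a = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13]
errors_b = [0.10, 0.13, 0.10, 0.12, 0.11, 0.14, 0.11, 0.13, 0.12, 0.11]

# Pairwise differences: the test works on their mean and standard deviation.
diffs = [a - b for a, b in zip(errors_a, errors_b)]
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)          # sample standard deviation
n = len(diffs)

t_stat = mean_d / (sd_d / math.sqrt(n))

# Two-tailed critical value for alpha = 0.05 with n - 1 = 9 degrees of freedom.
T_CRIT = 2.262
significant = abs(t_stat) > T_CRIT
```

If `significant` is true, the difference in error rates is unlikely to be due to chance at the 5% level.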
How might a data set be manipulated to simulate weighting instances?
By duplicating the instances you want to weight more heavily in the data set.
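A one-line sketch of the idea, with a toy dataset and hypothetical integer weights:

```python
# Sketch: simulating instance weights by duplication. Each instance is
# repeated according to its (integer) weight before training.
dataset = [("sunny", "yes"), ("rainy", "no"), ("overcast", "yes")]
weights = [1, 3, 2]   # hypothetical integer weights

weighted = [inst for inst, w in zip(dataset, weights) for _ in range(w)]
```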
Briefly describe how the back-propagation algorithm works in MLPs.
It works by adjusting the weights of the connections using gradient descent: the error at the output is propagated backward through the network to determine each weight's contribution to it. This occurs iteratively until the error is acceptably low.
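A minimal sketch of a single back-propagation step for a tiny network (2 inputs, 2 hidden units, 1 output, sigmoid activations; the weights, learning rate, and training example are all hypothetical — a real implementation would loop over many examples and epochs):

```python
# Sketch: one back-propagation step for a tiny MLP.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.5, 0.9]                        # input (hypothetical)
target = 1.0                          # desired output
w_hidden = [[0.1, 0.4], [0.8, 0.6]]   # w_hidden[j][i]: input i -> hidden j
w_out = [0.3, 0.9]                    # hidden j -> output
lr = 0.5                              # learning rate

def forward():
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    o = sigmoid(sum(w * hj for w, hj in zip(w_out, h)))
    return h, o

h, o = forward()
error_before = 0.5 * (target - o) ** 2

# Backward pass: error deltas, using the sigmoid derivative o * (1 - o).
delta_o = (o - target) * o * (1 - o)               # output-node delta
delta_h = [delta_o * w_out[j] * h[j] * (1 - h[j])  # hidden-node deltas
           for j in range(2)]

# Gradient-descent update: shift each weight against its gradient.
for j in range(2):
    w_out[j] -= lr * delta_o * h[j]
    for i in range(2):
        w_hidden[j][i] -= lr * delta_h[j] * x[i]

_, o_new = forward()
error_after = 0.5 * (target - o_new) ** 2
```

After one step, the output moves toward the target and the squared error shrinks; repeating this over the whole training set is the iterative process the answer describes.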
Why might we wish to weight some instances over others?
To compensate for underrepresented data, or to bias a model toward a certain outcome so as to avoid very costly results from improper classification.
What activation function is often used in an MLP?
Common choices are the sigmoid function, ReLU, and tanh.
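The three functions in plain Python:

```python
# Sketch: the three activation functions mentioned above.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))   # squashes to (0, 1)

def relu(z):
    return max(0.0, z)                  # zero for negatives, identity otherwise

def tanh(z):
    return math.tanh(z)                 # squashes to (-1, 1)
```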
How is a signal propagated from one layer of a multilayer perceptron (MLP) to the next?
Through a system of weights and biases between nodes. Each node in the next layer computes a weighted sum of the previous layer's outputs, adds its bias, and passes the result through an activation function.
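A sketch of one such layer-to-layer step (all values are hypothetical):

```python
# Sketch: propagating a signal from one layer of an MLP to the next.
import math

prev_layer = [0.2, 0.7, 0.5]           # outputs of the previous layer
weights = [[0.1, -0.4, 0.3],           # weights[j][i]: node i -> node j
           [0.6, 0.2, -0.1]]
biases = [0.05, -0.2]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Weighted sum + bias, then the activation function, for each next-layer node.
next_layer = [
    sigmoid(sum(w * o for w, o in zip(ws, prev_layer)) + b)
    for ws, b in zip(weights, biases)
]
```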
What is a bias node in an MLP? What purpose does it serve?
An additional node in each layer of an MLP. It has a constant output of 1 and is connected to all nodes in the next layer. This node helps the network learn more complex decision boundaries by shifting the activation functions horizontally.
What role do the training set, validation set, and test set play in model development?
The training set is the set the model is trained on. While training, the model is evaluated on the validation set (e.g., to tune parameters and decide when to stop); when development is complete, the model's final performance is measured on the test set.
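A sketch of carving a dataset into the three sets (the 60/20/20 proportions are a common but arbitrary choice):

```python
# Sketch: a 60/20/20 train/validation/test split of a toy dataset.
import random

data = list(range(100))          # stand-in for 100 instances
random.seed(0)
random.shuffle(data)             # shuffle so the split is random

n = len(data)
train = data[: int(0.6 * n)]                # fit model parameters here
valid = data[int(0.6 * n): int(0.8 * n)]    # tune/monitor during training
test = data[int(0.8 * n):]                  # final, untouched evaluation
```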
In the context of training and evaluation of ML models, what is holdout? What is cross-validation?
Holdout is a method where part of the data is set aside and used only to test the model. Cross-validation is a technique that splits the data into multiple subsets (folds), trains and tests on each in turn, and averages the results.
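A sketch of k-fold cross-validation by hand; `evaluate` is a hypothetical stand-in for training on the training folds and scoring on the held-out fold:

```python
# Sketch: k-fold cross-validation over instance indices.
def k_fold_indices(n, k):
    folds = []
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n
        test_idx = list(range(start, end))
        train_idx = [j for j in range(n) if j < start or j >= end]
        folds.append((train_idx, test_idx))
    return folds

def cross_validate(n, k, evaluate):
    # Train/test on each fold in turn, then average the fold scores.
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n, k)]
    return sum(scores) / k
```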
If your training data is very large and representative of the population to which the final model will be applied, should you perform
cross-validation?
Not necessary, as it would take too long and would not produce a much better model.
In a regression tree, we do not use information gain as a splitting criterion. Assuming all attributes are numeric, how then are splits performed? How do we know when to terminate the splitting process?
We use the variance of the target attribute values and choose splits that minimize it. Splitting terminates when the reduction in variance (or the number of instances in a node) falls below a set threshold.
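A sketch of choosing a split point for one numeric attribute by minimizing the size-weighted variance of the target values on either side (the data is hypothetical):

```python
# Sketch: variance-minimizing split for a regression tree.
import statistics

def variance(ys):
    return statistics.pvariance(ys) if len(ys) > 1 else 0.0

def best_split(xs, ys):
    """Try a threshold between each pair of adjacent x values; return the
    one with the lowest size-weighted variance of the two sides."""
    pairs = sorted(zip(xs, ys))
    best_t, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        score = (len(left) * variance(left)
                 + len(right) * variance(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Two clusters of target values: the best split should separate them.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.2, 4.9, 20.0, 19.5, 20.3]
```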
What are the distinctions between a regression tree and a model tree?
A regression tree predicts a single constant value at each leaf, while a model tree has a linear model at each leaf.
How could a ML algorithm such as C4.5 (J48) be used for attribute selection? How about linear regression?
By training a decision tree on a dataset and selecting the attributes that appear in the tree.
Linear regression can be used for attribute selection by fitting a linear model to the data and selecting attributes with the highest absolute coefficients or lowest p-values, indicating their significance in predicting the target variable.
In C4.5, how are instances with missing attribute-values utilized during training? During classification?
During training, a probabilistic approach is used: an instance with a missing value is notionally split among the branches in proportion to the distribution of known values. During classification, missing values are likewise propagated down multiple branches, and the final prediction is based on a weighted combination of the outcomes.
What are forward selection and backward selection in the context of attribute selection? Which is likely to produce a set containing
more features?
Forward starts with no attributes and then adds them one-by-one until a suitable set is found. Backward selection starts with all
attributes and then eliminates individual attributes. The latter will generally produce larger sets.
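A sketch of the forward-selection loop; `score` is a hypothetical stand-in for evaluating a model built from a candidate attribute set (e.g., cross-validated accuracy), with a toy additive score so the example runs:

```python
# Sketch: forward selection over a set of attributes.
def forward_selection(attributes, score):
    selected = []
    best = score(selected)
    while True:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        # Try adding each remaining attribute; keep the best improvement.
        top_score, top_attr = max((score(selected + [a]), a)
                                  for a in candidates)
        if top_score <= best:        # no attribute improves the score: stop
            break
        selected.append(top_attr)
        best = top_score
    return selected

# Toy score: two useful attributes, one useless one (hypothetical values).
USEFUL = {"age": 0.3, "income": 0.2, "zip": 0.0}

def toy_score(attrs):
    return 0.5 + sum(USEFUL[a] for a in attrs)
```

Backward selection is the mirror image: start with all attributes and repeatedly try removing the one whose removal hurts the score least.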
We sometimes want to discretize numeric attributes. Two methods to do that are equal-interval binning and equal-frequency
binning. Explain the basic ideas underlying each
Equal interval binning would have splits based upon the range of numeric values of the attributes (i.e., we divide the range of values into multiple intervals of the same size), while equal frequency binning would choose splits that result in sets of roughly
the same size.
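A sketch of both schemes on a small sample of hypothetical values, three bins each (note how a skewed distribution crowds the equal-interval bins but not the equal-frequency ones):

```python
# Sketch: equal-interval vs. equal-frequency binning, k = 3 bins.
values = [1, 2, 2, 3, 4, 5, 9, 15, 30]
k = 3

# Equal-interval: divide the overall range [min, max] into k equal spans.
lo, hi = min(values), max(values)
width = (hi - lo) / k
interval_bins = [min(int((v - lo) / width), k - 1) for v in values]

# Equal-frequency: sort and cut into k groups of (roughly) the same size.
ordered = sorted(values)
per_bin = len(ordered) // k
frequency_bins = [min(ordered.index(v) // per_bin, k - 1) for v in values]
```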
As discussed, C4.5 uses error on the training set rather than the test set to drive pruning. However, to avoid overfitting, an estimate is made. What is this estimate?
An upper bound on the error rate, derived from a confidence interval around the observed training error at each node (a pessimistic estimate).
What is meant by recursive feature elimination?
It’s an attribute selection method whereby one repeatedly applies an ML algorithm that provides coefficients for the attributes (linear regression is commonly used). The attribute with the lowest-magnitude coefficient is removed and the algorithm is applied again. This process is repeated until no attributes remain. In general, the scheme provides a way of ranking attributes.
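A sketch of the elimination loop; `fit` is a stand-in for any learner that returns one coefficient per remaining attribute, and the fixed coefficient magnitudes are hypothetical so the example runs:

```python
# Sketch: recursive feature elimination with a pluggable coefficient model.
def rfe_ranking(attributes, fit):
    remaining = list(attributes)
    ranking = []                     # filled from least to most important
    while remaining:
        coefs = fit(remaining)       # refit on the surviving attributes
        # Drop the attribute with the smallest absolute coefficient.
        weakest = min(remaining, key=lambda a: abs(coefs[a]))
        remaining.remove(weakest)
        ranking.append(weakest)
    return ranking[::-1]             # most important first

# Toy "model": fixed coefficient magnitudes per attribute (hypothetical).
TOY_COEFS = {"age": 0.9, "height": 0.1, "income": 0.5, "zip": 0.02}

def toy_fit(attrs):
    return {a: TOY_COEFS[a] for a in attrs}
```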
What is an ensemble learner? Describe one or more (simple) techniques for combining results from multiple models
A collection of learners that have been combined to solve a problem.
Voting: each base model makes a prediction, final prediction is the majority vote (for classification) or average (for regression)
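A sketch of majority voting over three hypothetical classifiers' predictions:

```python
# Sketch: combining classifier outputs by majority vote.
from collections import Counter

def majority_vote(predictions_per_model):
    """predictions_per_model: list of per-model prediction lists, all the
    same length; returns one combined prediction per instance."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

preds = [
    ["spam", "ham", "spam", "ham"],    # model 1
    ["spam", "spam", "spam", "ham"],   # model 2
    ["ham",  "ham",  "spam", "spam"],  # model 3
]
```

For regression, the analogous combiner is simply the mean of the models' numeric predictions.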
What are some of the methods used to evaluate the quality of a feature set in attribute selection?
Choose attributes that are individually correlated with the target. Alternatively, we could examine sets of attributes, looking for sets containing attributes that are individually correlated to the target but have low inter-correlation among themselves.
What do we mean when we say that an attribute selection method is scheme independent?
Attributes are selected based upon characteristics of the data set and not based on the performance of a machine learning scheme. For instance, we could use correlation with the target attribute to select attributes; this does not utilize any ML algorithm.
In the context of ensemble learning, what is a weak learner?
A model that performs only slightly better than random guessing.
What computationally cheaper technique might be used to derive an attribute set rather than PCA?
Choosing the projection vectors randomly (random projection) appears to work well in practice.
What is stacking?
An ensemble learning technique that combines multiple base models by training a meta-model to learn how to best combine their predictions.
In informal terms, how can principal component analysis be used to reduce the number of attributes used in a machine learning algorithm? What can be said about the attributes ultimately produced by PCA?
PCA will produce new attributes that are linear combinations of the original attributes. It works by choosing a vector onto which the projected instances have maximal variance. Subsequent vectors are chosen in the same way, but orthogonal to those previously selected. As such, the same information encoded in multiple original attributes can be encoded more concisely.
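A sketch of finding the first principal component of a tiny two-attribute dataset by power iteration on its covariance matrix. The data is hypothetical and perfectly correlated (y = 2x), so the component should point along (1, 2) and capture all of the variance:

```python
# Sketch: first principal component via power iteration.
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
mx, my = sum(xs) / 4, sum(ys) / 4

# 2x2 covariance matrix of the (mean-centred) attributes.
cxx = sum((x - mx) ** 2 for x in xs) / 4
cyy = sum((y - my) ** 2 for y in ys) / 4
cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / 4
C = [[cxx, cxy], [cxy, cyy]]

# Power iteration: repeatedly apply C to a vector and renormalise; the
# vector converges to the direction of maximal projected variance.
v = [1.0, 0.0]
for _ in range(50):
    w = [C[0][0] * v[0] + C[0][1] * v[1],
         C[1][0] * v[0] + C[1][1] * v[1]]
    norm = math.hypot(w[0], w[1])
    v = [w[0] / norm, w[1] / norm]
```

Projecting each instance onto `v` would replace the two original attributes with a single new one, with no information lost in this perfectly correlated case.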
What are bagging and boosting, and how are they different?
Techniques that combine multiple base models to improve performance. Bagging creates multiple subsets of the training data by sampling with replacement, trains a base model on each, and combines their predictions.
Boosting iteratively trains weak learners on weighted versions of the dataset where misclassified instances receive higher weights.
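A sketch of the bootstrap-sampling step at the heart of bagging; `bagging_samples` just draws the samples, and training any base learner on each is left as a stand-in:

```python
# Sketch: bootstrap samples for bagging (sampling with replacement).
import random

def bagging_samples(data, n_models, seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(n_models):
        # Draw len(data) instances *with replacement* for each base model.
        samples.append([rng.choice(data) for _ in data])
    return samples

data = list(range(20))
samples = bagging_samples(data, n_models=5)
```

Because sampling is done with replacement, each sample typically repeats some instances and omits others, which is what gives the base models their diversity.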