Test 2 Flashcards

1
Q

How could a paired t-test be used to compare two models that have been developed for a classification problem?

A

A paired t-test could be used to compare two classification models by applying both models to the same test sets (for example, the same cross-validation folds) and comparing their error rates. The difference between the two error rates is calculated for each test set, and the mean and standard deviation of those differences are used in the t-test to determine whether there is a statistically significant difference between the models.
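
A minimal sketch of the calculation, assuming both models were evaluated on the same five folds. The error rates are made-up numbers for illustration, and `paired_t_statistic` is a hypothetical helper name, not part of any library.

```python
import math
import statistics

def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for paired error rates of two models."""
    if len(errors_a) != len(errors_b):
        raise ValueError("both models must be evaluated on the same folds")
    # One difference per fold: this pairing is what makes the test "paired".
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

# Hypothetical per-fold error rates, made up for illustration only.
model_a = [0.20, 0.22, 0.19, 0.21, 0.23]
model_b = [0.18, 0.19, 0.18, 0.18, 0.20]
t, dof = paired_t_statistic(model_a, model_b)
print(f"t = {t:.2f} with {dof} degrees of freedom")
```

The resulting t value is compared against the critical value of the t distribution for the given degrees of freedom and chosen significance level.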

2
Q

How might a data set be manipulated to simulate weighting instances?

A

By duplicating the instances that should carry a higher weight in the data set.
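
A small sketch of the idea, assuming integer weights. The helper name `duplicate_by_weight` and the toy data are hypothetical.

```python
def duplicate_by_weight(instances, weights):
    """Repeat each instance weights[i] times (integer weights assumed)."""
    expanded = []
    for instance, weight in zip(instances, weights):
        expanded.extend([instance] * weight)
    return expanded

# Hypothetical toy data: (attribute value, class label).
data = [(1.0, "rare"), (2.0, "common"), (3.0, "common")]
# Give the "rare" instance three times the weight of the others.
weighted = duplicate_by_weight(data, [3, 1, 1])
print(weighted)
```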

3
Q

Briefly describe how the back-propagation algorithm works in MLPs.

A

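It works by adjusting the weights of the connections using gradient descent. The network's output error is propagated backward through the layers to determine how much each weight contributed to it, and each weight is then nudged in the direction that reduces the error. This occurs iteratively until the error is acceptably low.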

4
Q

Why might we wish to weight some instances over others?

A

To compensate for underrepresented data, or to bias a model toward a certain outcome when some misclassifications would have very bad consequences.

5
Q

What activation function is often used in an MLP?

A

Sigmoid function, ReLU, and tanh

5
Q

How is a signal propagated from one layer of a multilayer perceptron (MLP) to the next?

A

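Through a system of weights and biases between nodes. Each node multiplies its inputs by the corresponding weights, sums them, adds a bias, and passes the result through an activation function; that output becomes an input to the next layer.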
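
A minimal sketch of one layer's forward pass, assuming a sigmoid activation. The weights, biases, and the name `layer_forward` are hypothetical values chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """weights[j][i] connects input i to node j of the next layer."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        # Weighted sum of the inputs plus the node's bias.
        weighted_sum = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        # The activation function turns the sum into the node's output.
        outputs.append(sigmoid(weighted_sum))
    return outputs

# Hypothetical values: 2 inputs feeding a layer of 2 nodes.
hidden = layer_forward([1.0, 0.5], [[0.4, -0.2], [0.0, 0.0]], [0.1, 0.0])
print(hidden)
```

Feeding `hidden` into another call to `layer_forward` with the next layer's weights would propagate the signal one layer further.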

5
Q

What is a bias node in an MLP? What purpose does it serve?

A

An additional node in each layer of an MLP. It has a constant output of 1 and is connected to all nodes in the next layer. This node helps the network learn more complex decisions by shifting the activation functions horizontally.

6
Q

What roles do the training set, validation set, and test set play in model development?

A

The training set is the data the model is trained on. During development, the model is checked against the validation set to guide tuning and model selection. Once the model is finished, it is evaluated on the test set to estimate its performance on unseen data.

7
Q

In the context of training and evaluation of ML models, what is holdout? What is cross-validation?

A

Holdout is a method where part of the data is set aside and used to test the model. Cross-validation is a technique that splits the data into multiple subsets, uses each subset in turn for testing while training on the rest, and averages the results.

8
Q

If your training data is very large and representative of the population to which the final model will be applied, should you perform cross-validation?

A

It is not necessary, as it would take too long and would not provide a much better model.

8
Q

In a regression tree, we do not use information gain as a splitting criterion. Assuming all attributes are numeric, how then are splits performed? How do we know when to terminate the splitting process?

A

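We use the variance of the target attribute values and choose the split that minimizes it in the resulting subsets. Splitting terminates based on a set threshold, for example when the variance at a node is small enough or too few instances remain.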
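
A minimal sketch of choosing a split on one numeric attribute by minimizing the size-weighted variance of the target in the two subsets. The function name `best_split` and the data are hypothetical, and real implementations differ in details such as the exact criterion and stopping rule.

```python
import statistics

def best_split(values, targets):
    """Return (threshold, weighted_variance) for the best binary split."""
    pairs = sorted(zip(values, targets))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical attribute values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        # Average variance of the two subsets, weighted by subset size.
        score = (len(left) * statistics.pvariance(left)
                 + len(right) * statistics.pvariance(right)) / n
        if best is None or score < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, score)
    return best

# Hypothetical data with two clearly separated groups of target values.
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 5.5, 5.2, 20.0, 21.0, 19.5]
threshold, score = best_split(x, y)
print(threshold, score)
```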

9
Q

What are the distinctions between a regression tree and a model tree?

A

A regression tree predicts a single constant value at each leaf, while a model tree has a linear model at each leaf.

10
Q

How could an ML algorithm such as C4.5 (J48) be used for attribute selection? How about linear regression?

A

By training a decision tree on a dataset and selecting the attributes that appear in the tree.
Linear regression can be used for attribute selection by fitting a linear model to the data and selecting attributes with the highest absolute coefficients or lowest p-values, indicating their significance in predicting the target variable.

11
Q

In C4.5, how are instances with missing attribute-values utilized during training? During classification?

A

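By using a probabilistic approach based on the distribution of known values: during training, an instance with a missing value is split into fractional pieces that are sent down each branch in proportion to the known values at that node. For classification, the instance is likewise propagated down multiple branches and the final prediction is based on a weighted average of the outcomes.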

12
Q

What are forward selection and backward selection in the context of attribute selection? Which is likely to produce a set containing more features?

A

Forward selection starts with no attributes and then adds them one-by-one until a suitable set is found. Backward selection starts with all attributes and then eliminates individual attributes. The latter will generally produce larger sets.

12
Q

We sometimes want to discretize numeric attributes. Two methods to do that are equal-interval binning and equal-frequency binning. Explain the basic ideas underlying each.

A

Equal-interval binning would have splits based upon the range of numeric values of the attribute (i.e., we divide the range of values into multiple intervals of the same size), while equal-frequency binning would choose splits that result in sets of roughly the same size.
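
A minimal sketch contrasting the two schemes. The function names and data are hypothetical; the code assumes the values are not all identical, and the equal-frequency version ignores ties for simplicity.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / k
    # The maximum value is placed in the last bin rather than a (k+1)-th one.
    return [min(int((v - low) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

values = [1, 2, 3, 4, 5, 100]  # hypothetical data with one outlier
print(equal_width_bins(values, 3))      # [0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(values, 3))  # [0, 0, 1, 1, 2, 2]
```

With an outlier present, equal-width binning crowds most instances into one bin, while equal-frequency binning keeps the bins balanced.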

12
Q

As discussed, C4.5 uses error on the training set rather than the test set to drive pruning. However, to avoid overfitting, an estimate is made. What is this estimate?

A

The upper bound of a confidence interval on the training error rate, which serves as a pessimistic estimate of the true error.

13
Q

What is meant by recursive feature elimination?

A

It’s an attribute selection method whereby one repeatedly applies an ML algorithm that provides coefficients for attributes (linear regression is commonly used). The attribute with the lowest coefficient value is removed and the algorithm is applied again. This process is repeated until no attributes remain. In general, the scheme provides a way of ranking attributes.

14
Q

What is an ensemble learner? Describe one or more (simple) techniques for combining results from multiple models.

A

A collection of learners that have been combined to solve a problem.
Voting: each base model makes a prediction, final prediction is the majority vote (for classification) or average (for regression)
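
A minimal sketch of both combination rules. The function names are hypothetical, and ties in the majority vote are broken in favor of the label seen first, which is an arbitrary choice.

```python
from collections import Counter
import statistics

def majority_vote(predictions):
    """Return the most frequent class label among the base models."""
    return Counter(predictions).most_common(1)[0][0]

def average_vote(predictions):
    """Return the mean of the base models' numeric predictions."""
    return statistics.mean(predictions)

# Hypothetical predictions from three base models for one instance.
label = majority_vote(["spam", "ham", "spam"])
value = average_vote([2.0, 3.0, 4.0])
print(label, value)
```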

15
Q

What are some of the methods used to evaluate the quality of a feature set in attribute selection?

A

Choose attributes that are individually correlated with the target. Alternatively, we could examine sets of attributes, looking for sets containing attributes that are individually correlated to the target but with low inter-correlation

16
Q

What do we mean when we say that an attribute selection method is scheme independent?

A

Attributes are selected based upon characteristics of the data set and not based on the performance of a machine learning scheme. For instance, we could use correlation with the target attribute to select attributes; this does not utilize any ML algorithm.

17
Q

In the context of ensemble learning, what is a weak learner?

A

A model that performs only slightly better than random guessing.

17
Q

What computationally cheaper technique might be used to derive an attribute set rather than PCA?

A

Random projection: choosing the projection vectors randomly, rather than computing principal components, appears to work well in practice.

17
Q

What is stacking?

A

An ensemble learning technique that combines multiple base models by training a meta-model to learn how to best combine their predictions.

17
Q

In informal terms, how can principal component analysis be used to reduce the number of attributes used in a machine learning algorithm? What can be said about the attributes ultimately produced by PCA?

A

PCA will produce new attributes that are based on one or more of the original attributes. It works by choosing a vector along which the variance (of the instances, when they are projected onto it) is maximal. Subsequent vectors are chosen in the same way, but orthogonal to those previously selected. As such, the same information encoded in multiple original attributes could be encoded more concisely, and the low-variance components can be dropped.

17
Q

What are bagging and boosting, and how are they different?

A

Both are techniques that combine multiple base models to improve performance. Bagging creates multiple subsets of the training data by sampling with replacement, trains a base model on each subset, and combines the predictions.
Boosting iteratively trains weak learners on weighted versions of the dataset where misclassified instances receive higher weights.

18
Q

Does bagging utilize every instance in a training set? Explain.

A

Bagging uses sampling with replacement, so some instances are drawn more than once and others not at all. As such, it's possible for the re-sampled set to contain every instance of the underlying training set, but in realistic settings it is improbable.
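
A small sketch of why a bootstrap sample rarely covers the whole training set: when n instances are drawn with replacement, the expected fraction of distinct instances is 1 - (1 - 1/n)^n, which is about 0.632 for large n. The seed and set size below are arbitrary choices for illustration.

```python
import random

def bootstrap_sample(n, rng):
    """Draw n indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]

n = 1000
rng = random.Random(0)  # fixed seed so the run is repeatable
sample = bootstrap_sample(n, rng)
# Fraction of the original instances that appear at least once.
unique_fraction = len(set(sample)) / n
expected_fraction = 1 - (1 - 1 / n) ** n  # about 0.632 for large n
print(unique_fraction, expected_fraction)
```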

18
Q

In post-pruning of decision trees using subtree replacement, when is a subtree replaced with a leaf node?

A

When doing so improves performance, i.e., when the estimated error of the leaf is no worse than that of the subtree it replaces.

18
Q

Why is bagging recommended in cases where the underlying machine learning scheme is sensitive to variations in the data?

A

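Because such schemes produce noticeably different models from slightly different training sets. Combining the predictions of many models, each trained on a different bootstrap sample, lets those differences cancel out, which reduces variance and improves stability.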

18
Q

What is a paired T-test?

A

A statistical method used to compare two related samples or measurements to determine if there is a significant difference between their means.

18
Q

In stacking, why is it wrong to use the data used in developing the level-0 models to combine them?

A

Because the resulting learner will naturally prefer level-0 models that over-fit the training data used to produce them. If we use a separate set to develop the level-1 model, then this will give a better estimate of performance on unseen data.

18
Q

Suppose you have a data set with 1000 instances and 100 attributes. Discuss how random forests make use of training instances and their attributes to create a classifier.

A

Random forests use bagging and attribute selection to generate a set of simple decision trees. Each tree is developed using a bootstrap sample of the instances and only a subset of the full number of attributes. E.g., each tree could be developed using only 500 instances (possibly with duplicates) taken from the 1000 and only 20 randomly selected attributes from the original 100.

19
Q

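In AdaBoost.M1, what happens if the current classifier has an accuracy level of less than 50%?

A

An accuracy below 50% means the error is above 0.5, so the algorithm terminates without reweighting. If the error is greater than 0 but below 0.5, the instances are reweighted and the algorithm continues (another model is trained). It’s only when the error is 0 or ≥ 0.5 that the algorithm terminates.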

19
Q

What is a confidence interval?

A

Confidence intervals provide a range of values that are likely to contain the true population parameter with a certain level of confidence. They give us a way to estimate an unknown population parameter based on a sample statistic, while accounting for the uncertainty in our estimate due to sampling variability.

19
Q

Do bagging and boosting typically combine multiple, distinct ML schemes?

A

Typically not. Bagging uses the same ML scheme but varies the data sets used to develop models, while boosting generates a sequence of models based on a common scheme and using a common data set but re-weighting instances for each subsequent model.

20
Q

What is the Backpropagation Algorithm?

A

A supervised learning method used to train multi-layer perceptrons (MLPs) and other types of feedforward neural networks.

20
Q

Briefly describe how AdaBoost.M1 works. How are instances weighted?

A

It works by iteratively training weak learners (e.g., decision stumps) on weighted versions of the dataset. Initially, all instances have equal weights. In each iteration, a weak learner is trained on the weighted data, and its error rate is calculated. The weights of misclassified instances are increased, while the weights of correctly classified instances are decreased. This forces the next weak learner to focus more on the misclassified instances. The process is repeated for a specified number of iterations. The final classifier combines the weak learners’ predictions weighted by their accuracy.
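
A minimal sketch of a single reweighting step. It assumes one common formulation in which the weights of correctly classified instances are multiplied by e / (1 - e), where e is the weighted error, and all weights are then renormalized; other presentations scale the misclassified weights up instead, with the same relative effect. The function name `reweight` is hypothetical.

```python
import math

def reweight(weights, correct):
    """One boosting step; correct[i] is True if instance i was classified correctly."""
    total = sum(weights)
    # Weighted error rate of the current weak learner.
    error = sum(w for w, ok in zip(weights, correct) if not ok) / total
    if error == 0 or error >= 0.5:
        raise ValueError("boosting stops when error is 0 or >= 0.5")
    factor = error / (1 - error)
    # Shrink the weights of correctly classified instances.
    updated = [w * factor if ok else w for w, ok in zip(weights, correct)]
    scale = total / sum(updated)  # renormalize so the total weight is unchanged
    model_weight = -math.log(factor)  # vote weight of this classifier
    return [w * scale for w in updated], model_weight

# Four equally weighted instances; only the last one is misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
new_weights, alpha = reweight(weights, [True, True, True, False])
print(new_weights, alpha)
```

After renormalization the misclassified instance carries half of the total weight, so the next weak learner is pushed to get it right.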

20
Q

What is Cross-entropy?

A

A loss function commonly used in machine learning, particularly in classification problems. It measures the dissimilarity between two probability distributions: the true distribution of the data (usually represented by one-hot encoded class labels) and the predicted distribution (usually the output of a softmax function).
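
A minimal sketch of the calculation, H(p, q) = -sum of p_i * log(q_i), using the natural logarithm. The small `eps` floor is an added assumption to avoid taking log(0), and the probability vectors are made up for illustration.

```python
import math

def cross_entropy(true_dist, predicted_dist, eps=1e-12):
    """Cross-entropy between a true and a predicted distribution."""
    return -sum(p * math.log(max(q, eps))
                for p, q in zip(true_dist, predicted_dist))

one_hot = [0.0, 1.0, 0.0]  # the true class is the second one
confident = cross_entropy(one_hot, [0.05, 0.90, 0.05])
unsure = cross_entropy(one_hot, [0.30, 0.40, 0.30])
print(confident, unsure)
```

The loss is lower when the prediction puts more probability on the true class.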

20
Q

What is Softmax?

A

An activation function commonly used in the output layer of neural networks for multi-class classification problems. It converts a vector of real numbers into a probability distribution over multiple classes, ensuring that the outputs sum up to 1 and can be interpreted as class probabilities.
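
A minimal sketch of the function. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result; the input scores are arbitrary.

```python
import math

def softmax(scores):
    """Map real-valued scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```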

21
Q

What is Cross Validation?

A

A method for partitioning data and assessing model performance. In k-fold cross-validation, the data is divided into k equally sized folds. The model is trained and evaluated k times, using each fold once as the testing set and the remaining folds as the training set. The final performance estimate is the average of the k evaluations.
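
A minimal sketch of how the folds could be generated. The helper `k_fold_indices` is hypothetical; it does not shuffle, so in practice the instances would usually be shuffled (or stratified) first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each instance lands in exactly one test fold; the model would be trained and scored once per fold and the k scores averaged.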

21
Q

What is a training set?

A

The training set is the portion of the data used to train the machine learning model. The model learns from the examples in the training set by adjusting its parameters to minimize the loss function.

22
Q

What is stratified sampling?

A

When partitioning data, it’s important to ensure that the class distribution in each subset is representative of the original dataset. Stratified sampling techniques, such as stratified k-fold cross-validation, maintain the class proportions in each fold.

23
Q

What is a testing set?

A

The testing set is used to evaluate the performance of the trained model on unseen data. It provides an unbiased estimate of the model’s generalization ability, i.e., how well it performs on new, previously unseen examples.

24
Q

What is a validation set?

A

The validation set is used for model selection and hyperparameter tuning. It helps in choosing the best model architecture, regularization techniques, and other hyperparameters that maximize the model’s performance on unseen data. The validation set is used during the training process to guide the model selection and prevent overfitting.

24
Q

What is post-pruning?

A

After the tree has been fully grown, it involves recursively removing or collapsing subtrees that do not improve the model’s performance on a separate validation set or based on a complexity criterion. The subtrees are replaced with leaf nodes assigned the majority class or the average value of the target variable.

25
Q

What is pre-pruning?

A

Stopping the tree growth early, before it reaches its maximum depth or all the leaves are pure. This is done by setting a threshold on the minimum number of instances required to split a node, the maximum depth of the tree, or the minimum improvement in impurity reduction required for a split.

25
Q

What is Decision Tree pruning?

A

A technique used to simplify decision trees and prevent overfitting.

26
Q

What is Reduced error pruning?

A

In this post-pruning approach, each subtree is considered for pruning. If replacing a subtree with a leaf node reduces the error rate on a separate validation set, the subtree is pruned. This process is repeated until no further improvements can be made.

26
Q

What is cost-complexity pruning (weakest link pruning)?

A

This post-pruning method introduces a complexity parameter (α) that controls the trade-off between the tree’s complexity and its goodness of fit. The algorithm finds the subtrees that minimize the cost complexity, defined as the sum of the misclassification rate and the product of α and the number of leaf nodes. The optimal value of α is determined using cross-validation.

27
Q

What is PCA (principal component analysis)?

A

A linear transformation technique that reduces the dimensionality of the data by projecting it onto a lower-dimensional space while retaining the maximum amount of variance. PCA identifies the principal components, which are the directions of maximum variance in the data, and transforms the features into a new set of uncorrelated variables.

27
Q

What is Random Projections?

A

A dimensionality reduction technique that projects the high-dimensional data onto a lower-dimensional subspace using a randomly generated matrix. It approximately preserves the pairwise distances between data points and is computationally efficient.

28
Q

What is Recursive Feature Elimination?

A

A wrapper-based feature selection method that recursively removes the least important features from the dataset. It starts with all the features and iteratively trains a model, ranks the features based on their importance (e.g., using the model’s coefficients or feature importances), and eliminates the least important features. This process is repeated until the desired number of features is reached.

29
Q

What is Forward Stepwise Selection?

A

Forward Stepwise Selection is an iterative feature selection method that starts with an empty feature set and gradually adds the most relevant features one at a time. In each iteration, the feature that provides the greatest improvement in the model’s performance is added to the feature set. The process continues until a stopping criterion is met, such as a maximum number of features or no further improvement in performance.

30
Q

What is Backward Stepwise Selection?

A

Iteratively removes the least relevant features one at a time. In each iteration, the feature whose removal leads to the smallest decrease in the model’s performance is eliminated. The process continues until a stopping criterion is met.

31
Q

What is minimum description length pruning?

A

MDL pruning is based on the principle of finding the simplest model that encodes the data well. It considers the tree’s complexity and its ability to compress the data. The goal is to minimize the sum of the description length of the tree and the description length of the data given the tree.

31
Q

What is TF-IDF?

A

A numerical statistic used to reflect the importance of a word in a document within a collection of documents (corpus). It is commonly used as a feature transformation technique in text mining and information retrieval.

31
Q

What is Bagging?

A

Bagging is an ensemble method that creates multiple subsets of the training data by random sampling with replacement (bootstrap sampling). Each subset is used to train a separate base learner, typically a decision tree. The results are then combined, by voting for classification or averaging for regression, to get a final prediction.

32
Q

What is TF?

A

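Term frequency (TF) measures how often a word occurs in a single document. Inverse document frequency (IDF), by contrast, measures how rare the word is across the entire corpus. The TF-IDF score is the product of TF and IDF, which helps in identifying the words that are highly frequent in a document but rare in the corpus, indicating their importance.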
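
A minimal sketch of the product of TF and IDF, assuming the plain variant: TF as a word's share of the document and IDF as the natural log of (number of documents / number of documents containing the word). Many libraries use smoothed variants, so actual scores may differ. The corpus is a made-up example.

```python
import math

def tf_idf(term, document, corpus):
    """document is a list of words; corpus is a list of such documents."""
    tf = document.count(term) / len(document)
    containing = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / containing)  # assumes the term occurs somewhere
    return tf * idf

# Hypothetical three-document corpus.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "far"],
]
common = tf_idf("the", corpus[0], corpus)
distinctive = tf_idf("dog", corpus[1], corpus)
print(common, distinctive)
```

A word that appears in every document gets an IDF of zero, so it scores zero regardless of how often it occurs.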

32
Q

What is Discretizing?

A

The process of converting continuous or numeric attributes into discrete or categorical attributes. It involves dividing the range of the continuous attribute into a set of intervals or bins and assigning each interval a discrete value or label.

33
Q

What is Unsupervised Discretization?

A

Dividing the attribute range into intervals without considering the target variable. Equal-width binning creates bins of equal size, while equal-frequency binning creates bins with an equal number of instances in each bin.

34
Q

What is One-hot encoding?

A

A technique used to convert categorical attributes into a binary vector representation. It creates a new binary attribute for each unique category in the original attribute, where a value of 1 indicates the presence of the category and 0 indicates its absence.
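
A minimal sketch of the encoding. The function name and the color values are hypothetical, and the categories are sorted only to make the column order predictable.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one binary column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

categories, encoded = one_hot_encode(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```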

34
Q

What is ensemble learning?

A

A machine learning paradigm that combines multiple individual models, known as base learners or weak learners, to create a stronger and more accurate predictive model. The idea behind ensemble learning is that by combining the predictions of multiple models, we can reduce the errors and improve the overall performance compared to using a single model.

35
Q

What is Supervised Discretization?

A

Taking the target variable into account when creating the intervals. Supervised methods aim to find the optimal split points that maximize the information gain or minimize the impurity with respect to the target variable.

36
Q

What is Boosting?

A

Combines weak learners to create a strong learner. The weak learners are trained sequentially, with each learner focusing on the instances that were misclassified by the previous learners.

36
Q

What is AdaBoost?

A

A popular boosting algorithm that assigns weights to the training instances based on their difficulty. Initially, all instances have equal weights.

36
Q

What is stacking?

A

An ensemble method that combines the predictions of multiple base learners using a meta-learner. The base learners are trained on the original training data, and their predictions are used as input features for the meta-learner.

37
Q

What is a Random Forest?

A

An ensemble method that combines bagging with random feature selection. It creates multiple decision trees using bootstrap sampling of the training data. At each split in the decision trees, a random subset of features is considered for the split, introducing additional randomness and reducing the correlation between the trees.
The final prediction is obtained by majority voting (for classification) or averaging (for regression) of the predictions of all the decision trees.
Random Forest is known for its robustness, high accuracy, and ability to handle high-dimensional data.