Test 2 Flashcards

Question

In informal terms, how can principal component analysis be used to reduce the number of attributes used in a machine learning algorithm? What can be said about the attributes ultimately produced by PCA?

Answer 1

PCA produces new attributes, each of which is a combination of one or more of the original attributes. It works by choosing the vector along which the variance of the instances, when projected onto it, is maximal. Subsequent vectors are chosen in the same way, but orthogonal to those previously selected. Keeping only the first few of these new attributes reduces the attribute count, because the same information encoded in multiple original attributes can be encoded more concisely.

Answer 2

Techniques that combine multiple base models to improve performance. Bagging creates multiple subsets of the training data by sampling with replacement, trains a base model on each subset, and combines the predictions. Boosting iteratively trains weak learners on weighted versions of the dataset, where misclassified instances receive higher weights.

Answer 3

Bagging uses sampling with replacement. As such, it's possible for the re-sampled set to contain every instance of the underlying training set, but in realistic settings it is improbable.
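
To put a number on "improbable", here is a minimal Python sketch. It assumes the resample has the same size n as the training set and that each draw is uniform and independent; the function names are illustrative only. Under those assumptions the resample contains every instance only if it is a permutation of the original set, which has probability n!/n^n.

```python
import math

def prob_instance_missing(n: int) -> float:
    """Probability that one particular instance is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n

def prob_all_present(n: int) -> float:
    """Probability that n draws with replacement hit every one of the n instances (n! / n^n)."""
    return math.factorial(n) / n ** n

for n in (5, 10, 100):
    print(n, round(prob_instance_missing(n), 3), prob_all_present(n))
```

As n grows, the chance that a given instance is left out approaches 1/e (about 0.368), while the chance that nothing is left out shrinks rapidly toward zero.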

Answer 4

When doing so improves performance

Answer 5

Bagging is recommended when the underlying machine learning scheme is sensitive to variations in the data because it helps reduce variance and improve stability.

Answer 6

A statistical method used to compare two related samples or measurements to determine if there is a significant difference between their means.
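
As a rough illustration, the test statistic for paired samples can be computed from the per-pair differences. The sketch below assumes two equal-length lists of measurements (for example, two learners scored on the same folds); the function name is hypothetical and no p-value is computed.

```python
import math

def paired_t_statistic(a, b):
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

t = paired_t_statistic([3, 5, 7, 9], [1, 2, 3, 4])
print(round(t, 4))
```

The statistic would then be compared against a t distribution with n - 1 degrees of freedom.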

Answer 7

Because the resulting learner will naturally prefer level-0 models that over-fit the training data used to produce them. If we use a separate set to develop the level-1 model, then this will be a better estimate of performance on unseen data.

Answer 8

Random forests use bagging and attribute selection to generate a set of simple decision trees. Each tree is developed using only a subset of the full set of attributes. E.g., each tree could be developed using only 500 instances (possibly with duplicates) taken from the 1000 and only 20 randomly selected attributes from the original 100.

Answer 9

The instances are reweighted and the algorithm continues (another model is trained). It’s only when the error is 0 or ≥ 0.5 that the algorithm terminates.

Answer 10

Confidence intervals provide a range of values that are likely to contain the true population parameter with a certain level of confidence. They give us a way to estimate an unknown population parameter based on a sample statistic, while accounting for the uncertainty in our estimate due to sampling variability.
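
One common instance is an interval for a proportion such as a classifier's accuracy. The sketch below assumes the normal approximation to the binomial (reasonable only for fairly large n) and uses z = 1.96 for roughly 95% confidence; the function name is illustrative.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

low, high = proportion_confidence_interval(80, 100)
print(round(low, 4), round(high, 4))
```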

Answer 11

Typically not. Bagging uses the same ML scheme but varies the data sets used to develop models, while boosting generates a sequence of models based on a common scheme and using a common data set but re-weighting instances for each subsequent model.

Answer 12

A supervised learning method used to train multi-layer perceptrons (MLPs) and other types of feedforward neural networks.

Answer 13

It works by iteratively training weak learners (e.g., decision stumps) on weighted versions of the dataset. Initially, all instances have equal weights. In each iteration, a weak learner is trained on the weighted data, and its error rate is calculated. The weights of misclassified instances are increased, while the weights of correctly classified instances are decreased. This forces the next weak learner to focus more on the misclassified instances. The process is repeated for a specified number of iterations. The final classifier combines the weak learners' predictions weighted by their accuracy.
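
A single reweighting round can be sketched as follows. This assumes one common formulation (AdaBoost.M1 style) in which correctly classified instances are scaled by error / (1 - error) and the weights are then renormalized, which raises the relative weight of the misclassified instances; the stopping rule (error of 0 or at least 0.5) and the function name are assumptions of this sketch.

```python
def reweight(weights, correct):
    """One boosting round of instance reweighting.

    weights: current instance weights, assumed to sum to 1
    correct: booleans, True where the current model classified the instance correctly
    Returns (new_weights, error), or (None, error) when boosting should stop.
    """
    # Weighted error: total weight of the misclassified instances.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error == 0 or error >= 0.5:
        return None, error
    factor = error / (1 - error)  # below 1, so correct instances shrink
    scaled = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    # Renormalize so the weights sum to 1 again.
    return [w / total for w in scaled], error

new_weights, error = reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
print(error, new_weights)
```

After the update, the misclassified instances together carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.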

Answer 14

A loss function commonly used in machine learning, particularly in classification problems. It measures the dissimilarity between two probability distributions: the true distribution of the data (usually represented by one-hot encoded class labels) and the predicted distribution (usually the output of a softmax function).
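
For a single instance this can be sketched in a few lines of Python. The sketch assumes natural logarithms and clips predicted probabilities at a small epsilon to avoid log(0); both are conventions chosen here, not part of the definition above.

```python
import math

def cross_entropy(true_dist, predicted_dist, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i), using the natural log."""
    return -sum(p * math.log(max(q, eps)) for p, q in zip(true_dist, predicted_dist))

# One-hot label for class 1, and a predicted distribution over three classes.
print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

With a one-hot label, the loss reduces to the negative log of the probability assigned to the true class.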

Answer 15

An activation function commonly used in the output layer of neural networks for multi-class classification problems. It converts a vector of real numbers into a probability distribution over multiple classes, ensuring that the outputs sum up to 1 and can be interpreted as class probabilities.
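
A minimal Python sketch of the function follows. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the result unchanged; the function name is illustrative.

```python
import math

def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    m = max(scores)
    # exp(s - m) avoids overflow for large scores without changing the ratios.
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 3.0]))
```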

Answer 16

A method for partitioning data and assessing model performance. In k-fold cross-validation, the data is divided into k equally sized folds. The model is trained and evaluated k times, using each fold once as the testing set and the remaining folds as the training set. The final performance estimate is the average of the k evaluations.
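
The partitioning step might look like the sketch below. It assumes the instances are already shuffled and are referred to by index, and it makes the folds differ in size by at most one when n is not divisible by k; the function name is hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n instances."""
    indices = list(range(n))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(10, 3):
    print(test)
```

Each instance lands in exactly one testing fold, and the k per-fold scores would then be averaged.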

Answer 17

The training set is the portion of the data used to train the machine learning model. The model learns from the examples in the training set by adjusting its parameters to minimize the loss function.

Answer 18

When partitioning data, it's important to ensure that the class distribution in each subset is representative of the original dataset. Stratified sampling techniques, such as stratified k-fold cross-validation, maintain the class proportions in each fold.

Answer 19

The testing set is used to evaluate the performance of the trained model on unseen data. It provides an unbiased estimate of the model's generalization ability, i.e., how well it performs on new, previously unseen examples.

Answer 20

The validation set is used for model selection and hyperparameter tuning. It helps in choosing the best model architecture, regularization techniques, and other hyperparameters that maximize the model's performance on unseen data. The validation set is used during the training process to guide the model selection and prevent overfitting.

Answer 21

It involves recursively removing or collapsing subtrees that do not improve the model's performance on a separate validation set or based on a complexity criterion. The subtrees are replaced with leaf nodes assigned the majority class or the average value of the target variable.

Answer 22

Stopping the tree growth early, before it reaches its maximum depth or all the leaves are pure. This is done by setting a threshold on the minimum number of instances required to split a node, the maximum depth of the tree, or the minimum improvement in impurity reduction required for a split.

Answer 23

A technique used to simplify decision trees and prevent overfitting.

Answer 24

In this post-pruning approach, each subtree is considered for pruning. If replacing a subtree with a leaf node reduces the error rate on a separate validation set, the subtree is pruned. This process is repeated until no further improvements can be made.

Answer 25

This post-pruning method introduces a complexity parameter (α) that controls the trade-off between the tree's complexity and its goodness of fit. The algorithm finds the subtrees that minimize the cost complexity, defined as the sum of the misclassification rate and the product of α and the number of leaf nodes. The optimal value of α is determined using cross-validation.
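
Written as a formula, the quantity being minimized is the following. The symbols R(T) for the misclassification rate of tree T and |\tilde{T}| for its number of leaf nodes are notation assumed here; only α appears in the description above.

```latex
R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|
```

With α = 0 the full tree is preferred; larger values of α favor smaller trees.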

Answer 26

A linear transformation technique that reduces the dimensionality of the data by projecting it onto a lower-dimensional space while retaining the maximum amount of variance. PCA identifies the principal components, which are the directions of maximum variance in the data, and transforms the features into a new set of uncorrelated variables.
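
To make "direction of maximum variance" concrete, the sketch below finds the first principal component with power iteration on the covariance matrix. This is one simple way to obtain it, chosen here for self-containment; libraries typically use an eigendecomposition or SVD instead. It assumes the data are not constant and that the starting vector is not orthogonal to the answer.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the unit vector of maximal variance for a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Points lying on the line y = 2x: all the variance is along the direction (1, 2).
print(first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]]))
```

Projecting each centered instance onto this vector gives a single new attribute that, in this toy case, carries all the information in the original two.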

Answer 27

A dimensionality reduction technique that projects the high-dimensional data onto a lower-dimensional subspace using a randomly generated matrix. It preserves the pairwise distances between data points and is computationally efficient.

Answer 28

A wrapper-based feature selection method that recursively removes the least important features from the dataset. It starts with all the features and iteratively trains a model, ranks the features based on their importance (e.g., using the model's coefficients or feature importances), and eliminates the least important features. This process is repeated until the desired number of features is reached.

Answer 29

Forward Stepwise Selection is an iterative feature selection method that starts with an empty feature set and gradually adds the most relevant features one at a time. In each iteration, the feature that provides the greatest improvement in the model's performance is added to the feature set. The process continues until a stopping criterion is met, such as a maximum number of features or no further improvement in performance.
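
The loop can be sketched as below. The `score` argument is a hypothetical callable that maps a list of features to a performance number (higher is better), for example a cross-validated accuracy; how it is computed is left open here.

```python
def forward_selection(features, score, max_features=None):
    """Greedy forward selection.

    features: candidate feature names
    score: hypothetical callable mapping a list of features to a number (higher is better)
    """
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Try each remaining feature and keep the one that helps most.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no further improvement
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected

# Toy score: features "a" and "b" help, every feature costs a little.
useful = {"a": 3, "b": 2}
toy_score = lambda fs: sum(useful.get(f, 0) for f in fs) - 0.1 * len(fs)
print(forward_selection(["c", "b", "a"], toy_score))
```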

Answer 30

Iteratively removes the least relevant features one at a time. In each iteration, the feature whose removal leads to the smallest decrease in the model's performance is eliminated. The process continues until a stopping criterion is met.

Answer 31

MDL pruning is based on the principle of finding the simplest model that encodes the data well. It considers the tree's complexity and its ability to compress the data. The goal is to minimize the sum of the description length of the tree and the description length of the data given the tree.

Answer 32

A numerical statistic used to reflect the importance of a word in a document within a collection of documents (corpus). It is commonly used as a feature transformation technique in text mining and information retrieval.

Answer 33

Bagging is an ensemble method that creates multiple subsets of the training data by random sampling with replacement (bootstrap sampling). Each subset is used to train a separate base learner, typically a decision tree. The results are then averaged to get a final prediction.

Answer 34

Measures the importance of a word across the entire corpus. The TF-IDF score is the product of TF and IDF, which helps in identifying the words that are highly frequent in a document but rare in the corpus, indicating their importance.
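
A small Python sketch of the product follows. TF-IDF has several variants; this one assumes TF is the term count divided by document length and IDF is log(N / document frequency) with no smoothing, and it assumes every document is a non-empty list of tokens. The function name is illustrative.

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    scores = []
    for doc in docs:
        doc_scores = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            idf = math.log(n / doc_freq[term])
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores

scores = tf_idf([["the", "cat", "sat"], ["the", "dog", "barked"]])
print(scores[0])
```

A word that appears in every document, such as "the" here, gets an IDF of zero and therefore a TF-IDF score of zero.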

Answer 35

The process of converting continuous or numeric attributes into discrete or categorical attributes. It involves dividing the range of the continuous attribute into a set of intervals or bins and assigning each interval a discrete value or label.

Answer 36

Dividing the attribute range into intervals without considering the target variable. Equal-width binning creates bins of equal size, while equal-frequency binning creates bins with an equal number of instances in each bin.
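
The difference between the two schemes shows up clearly on skewed data. The sketch below assumes a list of numeric values that are not all equal and returns a bin index per value; function names are illustrative, and ties in the equal-frequency version are broken by position.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (nearly) equal numbers of instances."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

values = [1, 2, 3, 4, 5, 100]
print(equal_width_bins(values, 3))
print(equal_frequency_bins(values, 3))
```

With the outlier at 100, equal-width binning puts five of the six values in the first bin and leaves the middle bin empty, while equal-frequency binning places two values in each bin.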

Answer 37

A technique used to convert categorical attributes into a binary vector representation. It creates a new binary attribute for each unique category in the original attribute, where a value of 1 indicates the presence of the category and 0 indicates its absence.
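
A minimal Python sketch of the encoding is shown below. It assumes the categories are taken from the data itself and ordered alphabetically; the function name is illustrative.

```python
def one_hot(values):
    """Return (categories, rows) where each row has a single 1 marking its category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded

categories, rows = one_hot(["red", "green", "red", "blue"])
print(categories)
print(rows)
```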

Answer 38

A machine learning paradigm that combines multiple individual models, known as base learners or weak learners, to create a stronger and more accurate predictive model. The idea behind ensemble learning is that by combining the predictions of multiple models, we can reduce the errors and improve the overall performance compared to using a single model.

Answer 39

Taking the target variable into account when creating the intervals. They aim to find the optimal split points that maximize the information gain or minimize the impurity with respect to the target variable.

Answer 40

Combines weak learners to create a strong learner. The weak learners are trained sequentially, with each learner focusing on the instances that were misclassified by the previous learners.

Answer 41

A popular boosting algorithm that assigns weights to the training instances based on their difficulty. Initially, all instances have equal weights.

Answer 42

An ensemble method that combines the predictions of multiple base learners using a meta-learner. The base learners are trained on the original training data, and their predictions are used as input features for the meta-learner.

Answer 43

An ensemble method that combines bagging with random feature selection. It creates multiple decision trees using bootstrap sampling of the training data. At each split in the decision trees, a random subset of features is considered for the split, introducing additional randomness and reducing the correlation between the trees. The final prediction is obtained by majority voting (for classification) or averaging (for regression) of the predictions of all the decision trees. Random Forest is known for its robustness, high accuracy, and ability to handle high-dimensional data.

(67 cards)