Test 2 Flashcards

Question

How is the output of a node calculated?

Answer 1

The output of a node is calculated by applying the activation function to the weighted sum of its inputs.
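As a rough illustration of this calculation, the sketch below computes one node's output. The function names, the choice of sigmoid as the activation, and the sample weights are assumptions made for the example, not part of the flashcard.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, bias, activation=sigmoid):
    """Apply the activation function to the weighted sum of the inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Hypothetical node with two inputs; the weighted sum here is 0.5 - 0.5 = 0.
out = node_output([1.0, 2.0], [0.5, -0.25], bias=0.0)
```

With these made-up numbers the weighted sum is zero, so the sigmoid returns its midpoint value.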

Answer 2

The signal is propagated from one layer to the next in an MLP through the following steps:
1. Each node in the current layer calculates the weighted sum of its inputs.
2. The activation function is applied to the weighted sum to generate the node's output.
3. The outputs of the nodes in the current layer become the inputs for the nodes in the next layer.
4. The process is repeated for each layer until the output layer is reached.
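The propagation steps can be sketched as a small loop over layers. This is only a minimal illustration: the layer representation (a list of weight rows plus a list of biases), the sigmoid activation, and the toy weights are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weight_rows, biases):
    """Each node takes a weighted sum of the inputs, then applies the activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    ]

def mlp_forward(inputs, layers):
    """Feed the outputs of each layer into the next until the output layer."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer_forward(activations, weight_rows, biases)
    return activations

# Hypothetical 2-2-1 network with made-up weights.
layers = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # hidden layer: two nodes
    ([[1.0, -1.0]], [0.0]),                  # output layer: one node
]
result = mlp_forward([3.0, -2.0], layers)
```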

Answer 3

An activation function in an MLP is a mathematical function applied to the weighted sum of inputs of a node. It introduces non-linearity into the network, enabling it to learn and represent complex patterns.

Answer 4

The sigmoid activation function maps the input to a value between 0 and 1. It is defined as: f(x) = 1 / (1 + e^(-x)) where e is the mathematical constant approximately equal to 2.71828.

Answer 5

- Smoothly maps the input to a value between 0 and 1.
- Continuously differentiable, which is important for gradient-based optimization.
- Suffers from the vanishing gradient problem for very high or low input values.
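To see the vanishing gradient property concretely, the sketch below evaluates the sigmoid's derivative, f'(x) = f(x) * (1 - f(x)), at a few inputs. The sample inputs are arbitrary choices for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, expressed through its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks toward zero as |x| grows.
grads = {x: sigmoid_grad(x) for x in (-10.0, 0.0, 10.0)}
```

The gradient is largest at the origin and becomes tiny where the curve saturates, which is what slows learning for very high or low inputs.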

Answer 6

The hyperbolic tangent (tanh) activation function maps the input to a value between -1 and 1. It is defined as: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Answer 7

- Smoothly maps the input to a value between -1 and 1.
- Continuously differentiable.
- Suffers from the vanishing gradient problem for very high or low input values.
- Provides a wider output range compared to the sigmoid function.

Answer 8

The Rectified Linear Unit (ReLU) activation function maps the input to itself if it is positive, and to 0 if it is negative. It is defined as: f(x) = max(0, x)

Answer 9

- Simple and computationally efficient.
- Provides a sparse representation, as negative inputs result in a 0 output.
- Helps alleviate the vanishing gradient problem.
- Not differentiable at 0, where the slope jumps from 0 to 1.
- Can suffer from the "dying ReLU" problem, where neurons become permanently inactive.

Answer 10

- Leaky ReLU: A variant of ReLU that allows small negative outputs to address the "dying ReLU" problem.
- Exponential Linear Unit (ELU): Similar to ReLU but has a smooth transition for negative inputs.
- Swish: Defined as f(x) = x * sigmoid(x), providing a smooth and non-monotonic activation function.
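A minimal sketch of these variants next to plain ReLU follows. The default slope of 0.01 for Leaky ReLU and alpha of 1.0 for ELU are common choices but are assumptions here, since the flashcard does not fix them.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope (alpha is assumed)."""
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    """Smoothly approaches -alpha for large negative inputs (alpha is assumed)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def swish(x):
    """x * sigmoid(x), written as a single division."""
    return x / (1.0 + math.exp(-x))
```

Swish is non-monotonic: it dips below zero for moderately negative inputs and then rises back toward zero as the input becomes more negative.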

Answer 11

An activation function introduces non-linearity into the network, enabling it to learn and represent complex patterns by transforming the weighted sum of inputs of a node.

Answer 12

The most commonly used activation function in the hidden layers of an MLP is the Rectified Linear Unit (ReLU) function.

Answer 13

The ReLU activation function is defined as: f(x) = max(0, x) It maps the input to itself if it is positive, and to 0 if it is negative.

Answer 14

- Computationally efficient and simple to implement.
- Provides a sparse representation, as negative inputs result in a 0 output.
- Helps alleviate the vanishing gradient problem.
- Promotes faster convergence during training compared to sigmoid and tanh functions.

Answer 15

The ReLU function can suffer from the "dying ReLU" problem, where neurons become permanently inactive if their weighted sum of inputs is consistently negative during training, leading to a loss of gradient flow.

Answer 16

The sigmoid (logistic) activation function is commonly used in the output layer of an MLP for binary classification tasks. It maps the input to a value between 0 and 1, representing the probability of the positive class.

Answer 17

The softmax activation function is commonly used in the output layer of an MLP for multi-class classification tasks. It maps the inputs to a probability distribution over the classes, ensuring that the outputs sum up to 1.
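As a small illustration, softmax can be sketched as below. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result; the sample scores are made up.

```python
import math

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    largest = max(scores)  # shift for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```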

Answer 18

A bias node is an additional node in each layer of an MLP, except for the output layer. It has a constant value of 1 and is connected to all the nodes in the next layer.

Answer 19

The purpose of a bias node is to provide flexibility to the model by allowing it to shift the activation function of each node independently, enabling better fitting of the data.

Answer 20

A bias node allows the model to add a constant value to the weighted sum of inputs for each node in the next layer. This constant value can shift the activation function left or right, helping the model to better fit the data.

Answer 21

No, the bias node is not connected to the nodes in the previous layer. Its value is always set to 1, and it is only connected to the nodes in the next layer.

Answer 22

The bias node is connected to each node in the next layer through a weighted connection, just like the connections from the other nodes in the previous layer. The weights of these connections are learned during the training process.

Answer 23

Without a bias node, the weighted sum at each node in the next layer would be zero whenever all of its inputs are zero, so the node's response would be anchored at the origin and could not be shifted left or right. This limits the model's ability to fit the data well. The bias node provides the necessary flexibility to shift the activation functions and improve the model's performance.

Answer 24

No, bias nodes are typically not used in the output layer of an MLP. The output layer nodes directly produce the final output values based on the weighted sum of their inputs and the chosen activation function.

Answer 25

The back-propagation algorithm is used to train an MLP by adjusting the weights of the connections between nodes to minimize the difference between the predicted outputs and the actual targets.

Answer 26

The two main phases of the back-propagation algorithm are:
- Forward pass: The input is propagated through the network to compute the output.
- Backward pass: The error is propagated back through the network to update the weights.

Answer 27

The loss function measures the difference between the predicted outputs and the actual targets. It quantifies the error of the model's predictions. Common loss functions include mean squared error (MSE) for regression and cross-entropy for classification.

Answer 28

During the backward pass, the error is calculated as the derivative of the loss function with respect to the predicted outputs. This error is then propagated back through the network using the chain rule of calculus.

Answer 29

The chain rule is used to calculate the gradients of the loss function with respect to the weights. It allows the error to be propagated from the output layer back toward the input layer, calculating the contribution of each weight to the overall error.

Answer 30

The weights are updated using gradient descent. The gradients of the loss function with respect to the weights are calculated using the chain rule, and each weight is adjusted in the opposite direction of its gradient, scaled by a learning rate.

Answer 31

The learning rate is a hyperparameter that controls the step size at which the weights are updated during the backward pass. It determines the speed of convergence and the stability of the learning process. A higher learning rate leads to faster convergence but may overshoot the optimal solution, while a lower learning rate leads to slower convergence but may find a more stable solution.
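The effect of the learning rate on the gradient descent update can be seen in a toy example. The sketch assumes a single linear node with one weight and a squared-error loss, (w * x - y)^2, which is far simpler than a full MLP but uses the same update rule.

```python
def train_single_weight(x, y, w=0.0, learning_rate=0.1, steps=50):
    """Gradient descent on the loss (w * x - y) ** 2 for one weight."""
    for _ in range(steps):
        prediction = w * x
        grad = 2.0 * (prediction - y) * x  # d(loss)/dw via the chain rule
        w -= learning_rate * grad          # step against the gradient
    return w

# A moderate learning rate converges toward the ideal weight of 3.0.
w_good = train_single_weight(x=1.0, y=3.0, learning_rate=0.1)

# A learning rate that is too large overshoots further on every step.
w_bad = train_single_weight(x=1.0, y=3.0, learning_rate=1.1)
```

In this toy setting the error shrinks by a constant factor per step when the learning rate is small, and grows in magnitude when the rate is large enough to overshoot.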

Answer 32

- Vanishing gradients: As the error is propagated back through the network, the gradients can become very small, leading to slow learning in the earlier layers.
- Exploding gradients: In some cases, the gradients can become very large, causing the weights to update too much and leading to instability.
- Local minima: The back-propagation algorithm can get stuck in suboptimal local minima of the loss function, preventing the model from finding the global minimum.

Answer 33

The training set is used to train the model by adjusting its parameters to minimize the loss function. The model learns patterns and relationships from the training data.

Answer 34

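The validation set is used to tune the hyperparameters of the model and evaluate its performance during training. It helps detect overfitting by estimating the model's performance on data it was not trained on. Because hyperparameters are chosen against it, this estimate becomes somewhat optimistic, which is why a separate test set is kept for the final evaluation.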

Answer 35

The test set is used to assess the final performance of the trained model. It provides an unbiased estimate of how well the model generalizes to new, unseen data.

Answer 36

The training set is used during the training phase of the model development process. The model's parameters are iteratively updated based on the training data to minimize the loss function.

Answer 37

The validation set is used during the training phase to monitor the model's performance on unseen data. It is used to make decisions about hyperparameter tuning, model selection, and early stopping to prevent overfitting.

Answer 38

The test set is used after the model has been fully trained and optimized using the training and validation sets. It provides a final, unbiased evaluation of the model's performance on new, unseen data.

Answer 39

Keeping the test set separate ensures that the model's performance is evaluated on truly unseen data. If the test set is used during training or hyperparameter tuning, it may lead to overfitting and an overestimation of the model's generalization ability.

Answer 40

A common split ratio is:
- Training set: 60-80% of the data
- Validation set: 10-20% of the data
- Test set: 10-20% of the data
However, the exact split ratio can vary depending on the size of the dataset and the specific requirements of the problem.
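One way to produce such a split is sketched below. The 70/15/15 fractions, the function name, and the fixed seed are assumptions chosen for the example; any ratio in the ranges above would work the same way.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the rows, then cut them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is the test set
    return train, val, test

train, val, test = split_dataset(range(100))
```

Shuffling before cutting matters when the data is stored in some meaningful order, since otherwise the three sets may not share the same distribution.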

Answer 41

Cross-validation is a technique that involves splitting the data into multiple subsets, training and evaluating the model on different combinations of these subsets, and averaging the results. It provides a more robust estimate of the model's performance compared to using a single validation set. In cross-validation, the validation set is created multiple times from different portions of the data.

Answer 42

The holdout method is a technique where the dataset is split into two subsets: a training set and a test set (or holdout set). The model is trained on the training set and evaluated on the test set to assess its performance on unseen data.

Answer 43

The purpose of the holdout method is to provide an unbiased estimate of a model's performance on new, unseen data. By evaluating the model on data that was not used during training, we can assess how well the model generalizes.

Answer 44

The performance estimate can be sensitive to the specific split of the data into training and test sets. It may not provide a reliable estimate of the model's performance if the dataset is small, as the test set may not be representative of the overall data distribution.

Answer 45

Cross-validation is a technique that involves splitting the data into multiple subsets, training and evaluating the model on different combinations of these subsets, and averaging the results to obtain a more robust estimate of the model's performance.

Answer 46

The most common type of cross-validation is k-fold cross-validation, where the data is split into k equally-sized subsets called folds.

Answer 47

In k-fold cross-validation:
1. The data is split into k folds.
2. The model is trained and evaluated k times, using a different fold as the test set each time and the remaining folds as the training set.
3. The performance metrics from each iteration are averaged to provide a final estimate of the model's performance.
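The procedure can be sketched with index lists. This is a minimal illustration: the function names are hypothetical, `evaluate` stands in for whatever training-and-scoring routine is used, and the data is assumed to be already shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train_idx, test_idx) over all folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(10, 5))
```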

Answer 48

Cross-validation provides a more robust and reliable estimate of a model's performance by evaluating it on multiple subsets of the data. It reduces the impact of the specific split of the data on the performance estimate. It is particularly useful when the dataset is small, as it allows for more efficient use of the available data.

Answer 49

Common values for k in k-fold cross-validation are 5 and 10. However, the choice of k can depend on the size of the dataset and the computational resources available.

Answer 50

The purpose of cross-validation is to provide a more robust and reliable estimate of a model's performance by evaluating it on multiple subsets of the data, reducing the impact of the specific split of the data on the performance estimate.

Answer 51

Cross-validation is particularly useful when:
- The dataset is small, as it allows for more efficient use of the available data.
- There is uncertainty about the model's performance on unseen data.
- You want to compare different models or tune hyperparameters.

Answer 52

When the training data is very large and representative of the population, cross-validation may not be necessary. In this case, a single holdout validation set can provide a reliable estimate of the model's performance.

Answer 53

- Large datasets are less sensitive to the specific split of the data, as the holdout validation set is more likely to be representative of the overall data distribution.
- The computational cost of performing cross-validation on very large datasets can be high, and the benefit may not justify the additional time and resources required.

Answer 54

An alternative approach is to use a single holdout validation set, where a portion of the data (e.g., 10-20%) is reserved for evaluating the model's performance. This validation set should be representative of the overall data distribution.

Answer 55

You might still consider using cross-validation when:
- You want to compare the performance of different models or architectures.
- You are tuning hyperparameters and want to ensure the model's performance is robust across different subsets of the data.
- You have sufficient computational resources and want to obtain the most reliable estimate of the model's performance.

Answer 56

The key factor in deciding whether to use cross-validation with very large and representative training data is the trade-off between the potential improvement in performance estimation and the computational cost. If the benefit of cross-validation is small compared to the computational cost, a single holdout validation set may be sufficient.

Answer 57

A regression tree predicts a continuous numeric value, while a classification tree predicts a categorical class label.

Answer 58

In a regression tree, the splitting criterion is typically based on minimizing the sum of squared errors (SSE) or mean squared error (MSE) of the target variable within each resulting subset.

Answer 59

The sum of squared errors (SSE) is calculated as the sum of the squared differences between each data point and the mean value of the target variable within a subset: SSE = Σ(y_i - ȳ)^2 where y_i is the target value for each data point, and ȳ is the mean target value within the subset.

Answer 60

For each numeric attribute, the algorithm considers all possible split points. At each split point, it calculates the SSE for the resulting subsets. The split point that minimizes the total SSE (or MSE) of the resulting subsets is chosen as the best split for that attribute.
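The split search for one numeric attribute can be sketched as follows. Using midpoints between consecutive distinct values as the candidate split points is an assumption (a common convention), and the toy data is made up.

```python
def sse(values):
    """Sum of squared differences from the subset mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) for the split that minimizes total SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

threshold, total_sse = best_split(
    [1, 2, 3, 10, 11, 12],
    [5.0, 5.0, 5.0, 20.0, 20.0, 20.0],
)
```

Here the targets fall into two flat groups, so the best threshold lands between the groups and leaves no residual error on either side.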

Answer 61

The algorithm compares the best splits for each attribute and selects the attribute that results in the lowest total SSE (or MSE) after splitting. This attribute is used to create the split at the current node.

Answer 62

The splitting process in a regression tree is typically terminated when one of the following conditions is met:
- The maximum tree depth is reached.
- The number of data points in a node falls below a specified threshold.
- The reduction in SSE (or MSE) falls below a specified threshold.

Answer 63

When the splitting process is terminated, the node becomes a leaf node. The predicted value for a leaf node is typically the mean target value of the data points within that node.

Answer 64

You can control the complexity of a regression tree by adjusting the termination criteria:
- Setting a maximum tree depth to limit the number of splits.
- Increasing the minimum number of data points required in a node to create a split.
- Increasing the minimum reduction in SSE (or MSE) required to create a split.
Adjusting these criteria can help prevent overfitting and create a simpler, more interpretable tree.

Answer 65

A regression tree is a decision tree-based model that predicts a continuous numeric value. It recursively splits the input space into subregions based on the input features, and the predicted value for each subregion is the mean target value of the data points within that subregion.

Answer 66

A model tree is an extension of a regression tree that builds a linear regression model at each leaf node instead of using the mean target value. The linear regression model is built using the input features and the target variable of the data points within each leaf node.

Answer 67

In a regression tree, predictions are made by traversing the tree based on the input features until a leaf node is reached. The predicted value for a new data point is the mean target value of the training data points within that leaf node.

Answer 68

In a model tree, predictions are made by traversing the tree based on the input features until a leaf node is reached. The predicted value for a new data point is calculated using the linear regression model built at that leaf node, taking the input features of the new data point as input.
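The difference between the two kinds of leaf can be illustrated with a single leaf and one input feature. This sketch assumes ordinary least squares with one feature for the model-tree leaf; the function names and the data points are made up.

```python
def mean_leaf(ys):
    """Regression-tree leaf: always predicts the mean target value."""
    return sum(ys) / len(ys)

def fit_line(xs, ys):
    """Model-tree leaf: least-squares line through the leaf's data points."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Training points that reached this leaf (hypothetical): y = 2 * x.
leaf_xs, leaf_ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

constant_prediction = mean_leaf(leaf_ys)
slope, intercept = fit_line(leaf_xs, leaf_ys)
linear_prediction = slope * 4.0 + intercept  # prediction for a new point x = 4
```

The regression-tree leaf returns the same value for every point that reaches it, while the model-tree leaf follows the linear trend inside the leaf.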

Answer 69

The main advantage of a model tree is that it can capture more complex relationships between the input features and the target variable within each leaf node. By building a linear regression model at each leaf, a model tree can provide more accurate predictions, especially when there are linear relationships between the input features and the target variable.

Answer 70

The trade-off between a regression tree and a model tree is complexity versus interpretability. A model tree can provide more accurate predictions by capturing more complex relationships, but it may be less interpretable than a regression tree. A regression tree, on the other hand, is simpler and easier to interpret but may not capture complex relationships as well as a model tree.

Answer 71

The training process for a regression tree and a model tree is similar in terms of splitting the input space based on the input features. However, in a model tree, after the splitting process is completed, a linear regression model is built at each leaf node using the data points within that node. In a regression tree, the mean target value is used as the predicted value for each leaf node.

Answer 72

You might choose a regression tree over a model tree when:
- Interpretability is a priority, and you want a simpler, more easily understandable model.
- The relationships between the input features and the target variable are mostly non-linear or complex.
You might choose a model tree over a regression tree when:
- Prediction accuracy is the main priority, and you want to capture more complex relationships within the leaf nodes.
- There are linear relationships between the input features and the target variable within the subregions of the input space.

Answer 73

Attribute selection, also known as feature selection, is the process of selecting a subset of relevant features (attributes) from a larger set of features to use in model construction. The goal is to improve model performance, reduce complexity, and enhance interpretability.

Answer 74

C4.5 (J48) can be used for attribute selection by examining the decision tree structure it produces. The attributes that appear closer to the root of the tree and are used for splitting the data are considered more informative and relevant. Attributes that do not appear in the tree or appear in the lower levels of the tree are considered less important and can be potential candidates for removal.

Answer 75

The main idea behind using C4.5 (J48) for attribute selection is that the algorithm selects the most informative attributes for splitting the data based on information gain or gain ratio. By examining the decision tree structure, we can identify the attributes that are most useful in making predictions and discard the ones that contribute little to the model's performance.

Answer 76

Linear regression can be used for attribute selection by examining the coefficients of the regression model. Attributes with larger absolute coefficient values are considered more important in predicting the target variable, provided the attributes are on comparable scales (for example, after standardization). Attributes with coefficients close to zero or with high p-values (indicating low statistical significance) can be potential candidates for removal.

Answer 77

Some techniques used in linear regression for attribute selection include:
- Backward elimination: Start with all attributes and iteratively remove the least significant attribute until a stopping criterion is met.
- Forward selection: Start with no attributes and iteratively add the most significant attribute until a stopping criterion is met.
- Stepwise selection: Combination of backward elimination and forward selection, where attributes are added or removed based on their significance at each step.

Answer 78

The advantages of using linear regression for attribute selection include:
- It provides a quantitative measure of the importance of each attribute in predicting the target variable.
- It can help address multicollinearity, since highly correlated attributes can be identified and removed during selection (although multicollinearity itself makes individual coefficients less stable).
- It is computationally efficient and can be applied to datasets with a large number of attributes.

Answer 79

The limitations of using linear regression for attribute selection include:
- It assumes a linear relationship between the attributes and the target variable, which may not always be the case.
- It is sensitive to outliers and may be affected by extreme values in the data.
- It may not capture complex interactions between attributes, as it treats each attribute's effect as additive unless interaction terms are included.

Answer 80

You can combine C4.5 (J48) and linear regression for attribute selection by:
1. Using C4.5 (J48) to identify the most informative attributes based on the decision tree structure.
2. Using linear regression to further refine the attribute selection by examining the coefficients and significance of the selected attributes.
3. Iterating between the two methods to find the optimal subset of attributes that balances model performance and complexity.

Answer 81

During training, C4.5 uses a technique called "fractional instances" to handle missing attribute values. When an instance has a missing value for an attribute, C4.5 splits the instance into multiple fractional instances, each representing a possible value for the missing attribute. The weight of each fractional instance is proportional to the frequency of the corresponding attribute value in the training set.

Answer 82

The purpose of using fractional instances is to allow C4.5 to use all available information during training, even when some instances have missing attribute values. By splitting instances with missing values into fractional instances, C4.5 can consider all possible values for the missing attribute and their respective frequencies in the training set. This approach helps to build a more accurate and robust decision tree.

Answer 83

The weights of fractional instances are calculated based on the frequency of the corresponding attribute value in the training set. For example, if an attribute has two possible values, "A" and "B," and "A" appears in 60% of the instances and "B" in 40%, an instance with a missing value for this attribute will be split into two fractional instances: one with value "A" and weight 0.6, and another with value "B" and weight 0.4.
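The weight calculation in this example can be sketched as a frequency count. The function name is hypothetical, and the sketch only computes the weights; it does not reproduce the rest of C4.5.

```python
from collections import Counter

def fractional_weights(known_values):
    """Weight for each attribute value, proportional to its observed frequency."""
    counts = Counter(known_values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Ten instances with a known value: six "A" and four "B".
weights = fractional_weights(["A"] * 6 + ["B"] * 4)
```

An instance with a missing value would then be sent down the "A" branch with weight 0.6 and down the "B" branch with weight 0.4.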

Answer 84

During classification, when C4.5 encounters an instance with a missing attribute value, it explores all branches of the decision tree corresponding to the possible values of the missing attribute. The final classification is determined by combining the predictions from all the explored branches, weighted by the frequency of each attribute value in the training set.

Answer 85

The main idea behind C4.5's approach to handling missing attribute values during classification is to consider all possible outcomes based on the available information. By exploring all branches corresponding to the possible values of the missing attribute, C4.5 can make a more informed prediction that takes into account the uncertainty introduced by the missing value.

Answer 86

C4.5 combines the predictions from multiple branches by weighting each branch's class distribution, where the weights are proportional to the frequency of each attribute value in the training set. For example, if the branches corresponding to attribute values "A" and "B" predict classes "X" and "Y," respectively, and "A" appears in 60% of the instances and "B" in 40%, the combined distribution gives class "X" a weight of 0.6 and class "Y" a weight of 0.4, so "X" is the final prediction.
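The combination step for this example can be sketched as a weighted sum of per-branch class distributions. Representing each branch's prediction as a dictionary of class probabilities is an assumption made for the illustration.

```python
def combine_branches(branch_weights, branch_distributions):
    """Sum each branch's class probabilities, scaled by the branch weight."""
    combined = {}
    for value, weight in branch_weights.items():
        for label, prob in branch_distributions[value].items():
            combined[label] = combined.get(label, 0.0) + weight * prob
    return combined

combined = combine_branches(
    {"A": 0.6, "B": 0.4},                 # frequency of each attribute value
    {"A": {"X": 1.0}, "B": {"Y": 1.0}},   # class predicted down each branch
)
predicted = max(combined, key=combined.get)  # class with the largest total weight
```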

Answer 87

The advantages of C4.5's approach to handling missing attribute values include:
- It allows the algorithm to use all available information during training and classification, even when some instances have missing values.
- It takes into account the frequency of attribute values in the training set when making predictions, which can lead to more accurate results.
- It provides a principled way to handle missing values without the need for imputation or discarding instances with missing values.

Answer 88

Pruning is a technique used in decision tree learning to reduce the complexity of the tree and prevent overfitting. It involves removing branches or subtrees that do not significantly contribute to the model's performance, resulting in a simpler and more generalizable tree.

Answer 89

C4.5 uses error on the training set to drive pruning because it is readily available during the tree-building process. The training set is used to construct the initial decision tree, and the error on this set can be easily calculated. However, using the training set error alone can lead to overfitting, as the tree may become too complex and fit the noise in the training data.

Answer 90

The problem with using training set error for pruning is that it can lead to overfitting. A decision tree that perfectly fits the training data may not generalize well to new, unseen instances. The tree may become too complex and capture noise or irrelevant patterns in the training set, resulting in poor performance on the test set or real-world data.

Answer 91

To address the overfitting problem during pruning, C4.5 makes an estimate of the true error rate using a statistical technique called pessimistic pruning. Instead of relying solely on the training set error, C4.5 estimates the error rate of each subtree based on its complexity and the number of instances it covers.

Answer 92

Pessimistic pruning is a technique used by C4.5 to estimate the true error rate of a subtree during the pruning process. It takes into account the complexity of the subtree and the number of instances it covers to provide a more conservative estimate of the error rate, which helps to avoid overfitting.

Answer 93

Pessimistic pruning estimates the error rate of a subtree by adding a penalty to the training set error. The penalty is based on the confidence interval for the binomial distribution, which takes into account the number of instances covered by the subtree and the confidence level. In a common textbook formulation, the estimated error rate is the upper limit of that interval: e = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N) where f is the error rate observed on the training set, N is the number of instances covered, and z is the standard normal deviate corresponding to the chosen confidence level.
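As a hedged illustration, the sketch below computes a pessimistic estimate as the upper limit of a binomial confidence interval, e = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N), which is how textbook descriptions of C4.5 commonly present it. The function name, the default z of 0.69 (roughly a 25% confidence setting), and the sample counts are assumptions.

```python
import math

def pessimistic_error(errors, n_instances, z=0.69):
    """Upper confidence limit on the error rate observed at a node."""
    f = errors / n_instances  # error rate on the training set
    n = n_instances
    numerator = f + z * z / (2 * n) + z * math.sqrt(
        f / n - f * f / n + z * z / (4 * n * n)
    )
    return numerator / (1 + z * z / n)

# A node covering 6 training instances, 2 of them misclassified.
small_node = pessimistic_error(errors=2, n_instances=6)

# The same observed error rate over far more instances is penalized less.
large_node = pessimistic_error(errors=200, n_instances=600)
```

The estimate is always above the observed training error, and the gap narrows as the node covers more instances.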

Answer 94

The confidence interval in pessimistic pruning acts as a penalty term that increases the estimated error rate of a subtree. A larger confidence interval results in a higher estimated error rate, making the subtree more likely to be pruned. The confidence interval is determined by the confidence level, which is a parameter of the algorithm. A higher confidence level leads to a larger interval and more aggressive pruning. In J48 this is set through the confidence factor (default 0.25), where a smaller value corresponds to a higher confidence level and therefore more pruning.

Answer 95

C4.5 compares the estimated error rate of a subtree with the estimated error rate of a leaf node that would replace the subtree. If the estimated error rate of the leaf node is lower than or equal to the estimated error rate of the subtree, the subtree is pruned and replaced by the leaf node. This process is repeated recursively for each subtree until no further pruning can be done.

Answer 96

Recursive feature elimination (RFE) is a feature selection technique that recursively removes the least important features from a dataset until a desired number of features is reached. It is used to identify the most relevant features for a given machine learning task and improve model performance.

Answer 97

RFE works by the following steps:
1. Train a machine learning model on the initial set of features.
2. Evaluate the importance of each feature based on the model's coefficients or feature importances.
3. Remove the least important feature(s) from the dataset.
4. Repeat steps 1-3 until the desired number of features is reached.
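The RFE loop can be sketched independently of any particular model. In this minimal illustration, `importance_fn` is a hypothetical callable that stands in for "train a model on these features and return an importance score per feature"; the toy scores and feature names are made up.

```python
def recursive_feature_elimination(features, importance_fn, n_keep):
    """Drop the least important feature until only n_keep remain."""
    remaining = list(features)
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)  # stands in for steps 1 and 2
        weakest = min(remaining, key=lambda name: scores[name])
        remaining.remove(weakest)          # step 3
    return remaining

# Stand-in for a trained model: fixed, made-up importance scores.
fixed_scores = {"age": 0.9, "height": 0.1, "income": 0.6, "zip": 0.05}

def toy_importance(current_features):
    return {name: fixed_scores[name] for name in current_features}

selected = recursive_feature_elimination(list(fixed_scores), toy_importance, n_keep=2)
```

In a real setting the importance scores would change after each removal because the model is retrained on the reduced feature set, which is what distinguishes RFE from a one-shot ranking.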

Answer 98

The main idea behind RFE is that by recursively eliminating the least important features, the algorithm can identify a subset of features that are most relevant to the target variable. This subset of features can then be used to train a simpler and more interpretable model with improved performance and reduced overfitting.

Answer 99

RFE can be used with any machine learning model that provides a way to evaluate feature importance, such as:
- Linear models (e.g., linear regression, logistic regression)
- Decision trees and random forests
- Support vector machines (SVM)
- Gradient boosting machines (GBM)

Answer 100

In linear models, such as linear regression or logistic regression, RFE evaluates feature importance based on the absolute values of the model's coefficients. Features with larger absolute coefficient values are considered more important, while features with smaller absolute coefficient values are considered less important and are candidates for elimination. This comparison is only meaningful when the features are on comparable scales, for example after standardization.

Answer 101

In decision trees and random forests, RFE evaluates feature importance based on the decrease in impurity (e.g., Gini impurity or entropy) that each feature brings about when it is used for splitting the data. Features that lead to larger decreases in impurity are considered more important, while features with smaller decreases in impurity are considered less important and are candidates for elimination.

Answer 102

The advantages of using RFE for feature selection include:
- It can identify a subset of the most relevant features, leading to simpler and more interpretable models.
- It can improve model performance by reducing overfitting and focusing on the most informative features.
- It is a wrapper method, meaning it takes into account the interaction between features and the specific machine learning algorithm being used.

Answer 103

The limitations of RFE include:
- It can be computationally expensive, especially when dealing with a large number of features, as it requires training and evaluating the model multiple times.
- The optimal number of features to select may not be known in advance and may require experimentation or cross-validation to determine.
- It may not always select the globally optimal subset of features, as it makes greedy decisions based on the current set of features at each iteration.

Answer 104

Evaluating the quality of a feature set in attribute selection helps to determine the effectiveness of the selected features in improving model performance, reducing complexity, and enhancing interpretability. It allows you to compare different feature subsets and choose the one that best suits your machine learning task.

Answer 105

The two main categories of methods for evaluating the quality of a feature set are:
- Filter methods: Evaluate the quality of features independently of the machine learning algorithm.
- Wrapper methods: Evaluate the quality of features based on the performance of a specific machine learning algorithm.

Answer 106

Some common filter methods for evaluating the quality of a feature set include:
- Correlation-based methods: Evaluate features based on their correlation with the target variable and the absence of correlation with other features.
- Information gain: Measures the reduction in entropy achieved by using a feature to split the data.
- Chi-squared test: Assesses the independence between a feature and the target variable.
- Variance threshold: Removes features with low variance, as they may not contribute much to the model.

Answer 107

Correlation-based methods evaluate the quality of a feature set by considering two factors: The correlation between each feature and the target variable: Features with higher correlation are considered more relevant. The absence of correlation among the features themselves: A good feature set should have features that are not highly correlated with each other to avoid redundancy.

Answer 108

Some common wrapper methods for evaluating the quality of a feature set include: Recursive Feature Elimination (RFE): Recursively removes the least important features based on a model's feature importances. Forward Selection: Starts with an empty feature set and iteratively adds the most promising features based on model performance. Backward Elimination: Starts with all features and iteratively removes the least promising features based on model performance.

Answer 109

Wrapper methods evaluate the quality of a feature set by training and testing a specific machine learning model using different subsets of features. The performance of the model, such as accuracy or F1-score, is used as a measure of the quality of the feature set. The feature subset that leads to the best model performance is considered the optimal set.

Answer 110

The advantages of using wrapper methods include: They take into account the interaction between features and the specific machine learning algorithm being used. They can identify feature subsets that are optimized for the particular model and task at hand. They can lead to better model performance compared to filter methods.

Answer 111

The limitations of wrapper methods include: They can be computationally expensive, as they require training and evaluating the model multiple times for different feature subsets. They may be prone to overfitting, especially when the number of features is large compared to the number of instances. The selected feature subset may be specific to the model and may not generalize well to other models or tasks.

Answer 112

Scheme independence refers to the property of an attribute selection method that evaluates the quality of features independently of the specific machine learning algorithm that will be used to build the model. A scheme-independent method assesses the relevance of features based on their intrinsic properties and their relationship with the target variable, without considering the peculiarities of any particular learning scheme.

Answer 113

The two main categories of attribute selection methods based on scheme independence are: Scheme-independent methods: These methods evaluate the quality of features independently of the machine learning algorithm. Scheme-dependent methods: These methods evaluate the quality of features based on the performance of a specific machine learning algorithm.

Answer 114

Some examples of scheme-independent attribute selection methods include: Correlation-based methods: Evaluate features based on their correlation with the target variable and the absence of correlation with other features. Information gain: Measures the reduction in entropy achieved by using a feature to split the data. Chi-squared test: Assesses the independence between a feature and the target variable. Variance threshold: Removes features with low variance, as they may not contribute much to the model.

Answer 115

Some examples of scheme-dependent attribute selection methods include: Recursive Feature Elimination (RFE): Recursively removes the least important features based on a specific model's feature importances. Wrapper methods (e.g., forward selection, backward elimination): Evaluate feature subsets based on the performance of a specific machine learning model. Embedded methods (e.g., L1 regularization, decision tree feature importance): Perform feature selection as part of the model training process.

Answer 116

The advantages of using scheme-independent attribute selection methods include: They are computationally efficient, as they do not require training and evaluating a machine learning model multiple times. They provide a general assessment of feature relevance that can be used with various machine learning algorithms. They are less prone to overfitting, as they do not rely on the performance of a specific model.

Answer 117

The limitations of scheme-independent attribute selection methods include: They do not take into account the interaction between features and the specific machine learning algorithm being used. They may not always select the optimal feature subset for a particular model or task. They may not capture complex relationships between features and the target variable.

Answer 118

You might choose a scheme-independent attribute selection method over a scheme-dependent method when: You want to perform feature selection as a preprocessing step before trying different machine learning algorithms. You have limited computational resources and cannot afford to train and evaluate models multiple times. You want to gain a general understanding of the relevance of features without being tied to a specific model. You are working with a large number of features and want to quickly filter out irrelevant ones.

Answer 119

Forward selection is an iterative attribute selection method that starts with an empty feature set and gradually adds the most relevant features one at a time. At each iteration, the feature that leads to the greatest improvement in the model's performance is added to the feature set until a stopping criterion is met or no more improvements can be made.
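
The loop described above can be sketched as follows. This is a minimal illustration, assuming the caller supplies a `score` function (for example, cross-validated accuracy of some model) that maps a feature subset to a number; `toy_score`, `min_gain`, and the feature names are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_selection(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 1e-9,
) -> List[str]:
    """Greedy forward selection driven by a caller-supplied score function."""
    selected: List[str] = []
    best_score = score(frozenset())
    remaining = list(features)
    while remaining:
        # Score every candidate together with the features chosen so far.
        trials = [(score(frozenset(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(trials)
        # Stopping criterion: no candidate improves the score enough.
        if top_score - best_score < min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# Hypothetical score: "a" and "b" help, "noise" slightly hurts.
def toy_score(subset: FrozenSet[str]) -> float:
    value = {"a": 0.30, "b": 0.20, "noise": -0.01}
    return 0.5 + sum(value[f] for f in subset)


chosen = forward_selection(["a", "b", "noise"], toy_score)
print(chosen)  # ['a', 'b']
```

In practice the score would come from training and evaluating a model on each candidate subset, which is where most of the computational cost lies.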

Answer 120

The main idea behind forward selection is to start with a simple model and incrementally add features that contribute the most to the model's performance. This approach allows for the identification of a subset of relevant features while keeping the model complexity under control.

Answer 121

Backward elimination, also known as backward selection, is an iterative attribute selection method that starts with the full set of features and gradually removes the least relevant features one at a time. At each iteration, the feature whose removal leads to the smallest decrease in the model's performance is eliminated from the feature set until a stopping criterion is met or no more features can be removed without a significant drop in performance.
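
A minimal sketch of this procedure, under the assumption that a `score` function for feature subsets is available (for example, cross-validated accuracy). The `tolerance` parameter and `toy_score` are hypothetical choices for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence


def backward_elimination(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    tolerance: float = 0.005,
) -> List[str]:
    """Greedy backward elimination driven by a caller-supplied score function."""
    kept = list(features)
    current = score(frozenset(kept))
    while len(kept) > 1:
        # Try dropping each feature and find the removal that hurts least.
        trials = [(score(frozenset(kept) - {f}), f) for f in kept]
        top_score, drop = max(trials)
        # Stop when even the cheapest removal costs more than the tolerance.
        if current - top_score > tolerance:
            break
        kept.remove(drop)
        current = top_score
    return kept


# Hypothetical score: "a" and "b" help, "noise" slightly hurts.
def toy_score(subset: FrozenSet[str]) -> float:
    value = {"a": 0.30, "b": 0.20, "noise": -0.01}
    return 0.5 + sum(value[f] for f in subset)


kept = backward_elimination(["a", "b", "noise"], toy_score)
print(kept)  # ['a', 'b']
```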

Answer 122

The main idea behind backward elimination is to start with a complex model that includes all features and iteratively remove features that contribute the least to the model's performance. This approach allows for the identification of a subset of relevant features by progressively simplifying the model.

Answer 123

Forward selection is likely to produce a feature set containing fewer features compared to backward elimination. This is because forward selection starts with an empty set and adds features one at a time, stopping when no more improvements can be made. In contrast, backward elimination starts with the full set of features and removes them one at a time, often resulting in a larger final feature set.

Answer 124

The advantages of forward selection include: It tends to produce simpler models with fewer features, which can be more interpretable and computationally efficient. It is less prone to overfitting, as it only includes features that significantly improve the model's performance. It is computationally more efficient than backward elimination when the number of features is large.

Answer 125

The limitations of forward selection include: It may not always find the optimal feature subset, as it makes greedy decisions based on the current set of features. It may miss features that are only useful in combination, as it evaluates candidates one at a time against the current feature set. It may stop prematurely if the stopping criterion is not well-defined or if the improvement in performance is not significant enough.

Answer 126

The advantages of backward elimination include: It can identify feature interactions and dependencies that may be missed by forward selection. It starts with the full model, which can provide a better understanding of the overall feature space. It judges each feature in the context of all the other remaining features, although it is still a greedy search and does not consider every possible feature combination.

Answer 127

The limitations of backward elimination include: It can be computationally expensive, especially when the number of features is large, as it starts with the full model. It may produce larger feature subsets compared to forward selection, which can lead to more complex and less interpretable models. It may suffer from multicollinearity issues, where highly correlated features may be retained in the final model.

Answer 128

Discretization is the process of converting continuous numeric attributes into discrete or categorical attributes by dividing the range of values into a set of intervals or bins. This can help simplify the data, reduce noise, and improve the performance of certain machine learning algorithms.

Answer 129

Equal-interval binning is a discretization method that divides the range of a numeric attribute into a fixed number of intervals (bins) of equal width. The width of each bin is calculated by dividing the difference between the maximum and minimum values of the attribute by the desired number of bins.

Answer 130

In equal-interval binning, the bin boundaries are determined by the following formula: bin_width = (max_value - min_value) / number_of_bins. The lower boundary of the first bin is the minimum value, and the upper boundary of each bin is calculated by adding the bin width to its lower boundary. That upper boundary then serves as the lower boundary of the next bin.
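
The formula can be applied as in the sketch below. The function names and the sample ages are assumptions for illustration, and the code assumes the attribute has at least two distinct values (otherwise the bin width would be zero).

```python
from typing import List


def equal_width_edges(values: List[float], number_of_bins: int) -> List[float]:
    """Return the number_of_bins + 1 boundaries of equal-width bins."""
    min_value, max_value = min(values), max(values)
    bin_width = (max_value - min_value) / number_of_bins
    return [min_value + i * bin_width for i in range(number_of_bins + 1)]


def assign_bin(x: float, edges: List[float]) -> int:
    """Return the 0-based bin index; the maximum value falls in the last bin."""
    number_of_bins = len(edges) - 1
    bin_width = edges[1] - edges[0]
    index = int((x - edges[0]) / bin_width)
    return min(index, number_of_bins - 1)


ages = [20, 25, 30, 45, 60]
edges = equal_width_edges(ages, 4)
bins = [assign_bin(a, edges) for a in ages]
print(edges)  # [20.0, 30.0, 40.0, 50.0, 60.0]
print(bins)   # [0, 0, 1, 2, 3]
```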

Answer 131

The main advantage of equal-interval binning is its simplicity and ease of implementation. It creates bins of equal width, which can be easily interpreted and communicated. Equal-interval binning is useful when the distribution of the numeric attribute is roughly uniform.

Answer 132

A potential drawback of equal-interval binning is that it can create bins with uneven frequencies, especially when the distribution of the numeric attribute is skewed or has outliers. This can lead to some bins having very few or no instances, while others have a large number of instances, which may not be optimal for certain machine learning algorithms.

Answer 133

Equal-frequency binning, also known as quantile binning, is a discretization method that divides the range of a numeric attribute into a fixed number of intervals (bins) such that each bin contains approximately the same number of instances. The bin boundaries are determined by the quantiles of the attribute's distribution.

Answer 134

In equal-frequency binning, the bin boundaries are determined by the quantiles of the attribute's distribution. For example, if we want to create 4 bins, we would use the 25th, 50th, and 75th percentiles (quartiles) as the bin boundaries. Each bin would contain approximately 25% of the instances.
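
One simple way to realize this is to rank the values and cut the ranking into equal parts, as sketched below. The function name and the income values are hypothetical, and this version may split tied values across neighboring bins.

```python
from typing import List


def equal_frequency_bins(values: List[float], number_of_bins: int) -> List[int]:
    """Assign each value a 0-based bin index so bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # rank / n is the empirical quantile of values[i].
        bins[i] = rank * number_of_bins // len(values)
    return bins


incomes = [12, 15, 18, 22, 30, 41, 95, 400]
print(equal_frequency_bins(incomes, 4))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Note how the outlier 400 shares a bin with 95, whereas equal-interval binning would leave most bins nearly empty for this attribute.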

Answer 135

The main advantage of equal-frequency binning is that it creates bins with roughly equal numbers of instances, which can be beneficial for machine learning algorithms that are sensitive to sparsely populated or empty bins. It ensures that each bin has sufficient representation in the discretized data, even when the attribute's distribution is skewed.

Answer 136

A potential drawback of equal-frequency binning is that it can create bins with varying widths, which may not be intuitive or easily interpretable. The bin boundaries may not align with meaningful thresholds in the data, and the resulting discretization may not capture important patterns or relationships.

Answer 137

You might choose equal-interval binning over equal-frequency binning when: The distribution of the numeric attribute is roughly uniform. The bin boundaries need to be easily interpretable and communicated. The number of instances in each bin is less important than the consistency of bin widths.

Answer 138

The main goal of PCA is to reduce the dimensionality of a dataset by finding a new set of attributes (called principal components) that capture the maximum amount of variance in the original data while being uncorrelated with each other.

Answer 139

PCA reduces the number of attributes by creating a new set of attributes (principal components) that are linear combinations of the original attributes. These principal components are ordered by the amount of variance they explain in the data, and by selecting a subset of the top principal components, we can effectively reduce the dimensionality of the dataset.

Answer 140

The first principal component is the linear combination of the original attributes that captures the maximum amount of variance in the data. It represents the direction in the attribute space along which the data varies the most.

Answer 141

Each subsequent principal component is orthogonal (perpendicular) to the previous components and captures the maximum remaining variance in the data. The second principal component captures the second most variance, the third captures the third most, and so on.

Answer 142

The eigenvalues in PCA represent the amount of variance explained by each principal component. The larger the eigenvalue, the more variance the corresponding principal component captures. The sum of all eigenvalues equals the total variance in the original data.

Answer 143

There are several methods to decide how many principal components to retain, including: Scree plot: Plot the eigenvalues in descending order and look for an "elbow" point where the curve levels off. Retain the components up to this point. Cumulative explained variance: Retain the minimum number of components that cumulatively explain a desired percentage (e.g., 90%) of the total variance. Kaiser's criterion: Retain components with eigenvalues greater than 1. This applies when PCA is run on standardized attributes, where each original attribute has variance 1, so such components explain more variance than a single original attribute.
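
The cumulative explained variance rule and Kaiser's criterion can be sketched directly from a list of eigenvalues. The eigenvalues below are made up for illustration and are assumed to come from PCA on four standardized attributes (so they sum to 4).

```python
from typing import List


def components_for_variance(eigenvalues: List[float], target: float = 0.90) -> int:
    """Smallest number of components whose cumulative variance reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)


def kaiser_count(eigenvalues: List[float]) -> int:
    """Number of components with eigenvalue greater than 1."""
    return sum(1 for value in eigenvalues if value > 1.0)


eigenvalues = [2.5, 1.0, 0.3, 0.2]  # hypothetical values
print(components_for_variance(eigenvalues))  # 3
print(kaiser_count(eigenvalues))             # 1
```

The two rules can disagree, as they do here, so the choice is usually checked against downstream model performance.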

Answer 144

The attributes produced by PCA have the following properties: They are uncorrelated (orthogonal) with each other, meaning there is no redundancy in the information they capture. They are ordered by the amount of variance they explain in the data, with the first component explaining the most variance and the last component explaining the least. They are linear combinations of the original attributes, which may make them less interpretable than the original attributes.

Answer 145

After performing PCA, you can use the selected principal components as input features for your machine learning algorithm. By reducing the dimensionality of the data, you can potentially improve the algorithm's performance, reduce overfitting, and decrease computational complexity.

Answer 146

Some limitations of PCA include: It assumes that the relationships between attributes are linear, which may not always be the case. It is sensitive to the scale of the attributes, so it is important to standardize the data before applying PCA. The resulting principal components may be difficult to interpret, as they are linear combinations of the original attributes. It may not always capture the most discriminative information for a specific machine learning task, as it focuses on capturing the maximum variance in the data.

Answer 147

Random Projection is a dimensionality reduction technique that reduces the number of attributes by projecting the original high-dimensional data onto a lower-dimensional subspace using a randomly generated matrix. It is based on the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a much lower-dimensional space while approximately preserving the pairwise distances between the points.

Answer 148

Random Projection works by multiplying the original data matrix (n instances × d attributes) with a randomly generated projection matrix (d attributes × k dimensions), where k is the desired number of reduced dimensions. The resulting matrix (n instances × k dimensions) represents the data in the lower-dimensional space.
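
The matrix product described above can be sketched in plain Python. This is an illustration only, assuming a Gaussian projection matrix scaled by 1/sqrt(k) (one common choice); the function name, the seed, and the sample data are hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def random_projection(data: Matrix, k: int, seed: int = 0) -> Matrix:
    """Project n x d data to n x k using a Gaussian random matrix."""
    rng = random.Random(seed)
    d = len(data[0])
    # Assumed scaling: entries ~ N(0, 1/k), so squared lengths are
    # preserved on average.
    scale = 1.0 / math.sqrt(k)
    projection = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]
    return [
        [sum(row[j] * projection[j][c] for j in range(d)) for c in range(k)]
        for row in data
    ]


data = [[1.0, 0.0, 2.0, 1.0, 0.0], [0.0, 3.0, 1.0, 0.0, 1.0]]  # 2 x 5
reduced = random_projection(data, k=3)
print(len(reduced), len(reduced[0]))  # 2 3
```

Note that the projection matrix is generated without looking at the data values, only at the number of attributes.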

Answer 149

The main advantages of Random Projection compared to PCA are: It is computationally cheaper, as it does not require the calculation of the covariance matrix or the eigenvalue decomposition, which can be expensive for high-dimensional data. It is data-independent, meaning the projection matrix can be generated without knowledge of the actual data, making it suitable for streaming or online learning scenarios. It has strong theoretical guarantees for preserving pairwise distances and the structure of the data in the reduced space.

Answer 150

The limitations of Random Projection include: The resulting reduced dimensions are not interpretable, as they are random linear combinations of the original attributes. It may require a larger number of reduced dimensions compared to PCA to achieve similar performance, as it does not explicitly maximize the variance captured in the reduced space. The quality of the reduction depends on the choice of the random projection matrix, and different random matrices may yield different results.

Answer 151

Feature Hashing, also known as the hashing trick, is a dimensionality reduction technique that maps high-dimensional feature vectors to a lower-dimensional space using a hash function. It is particularly useful when dealing with sparse, high-dimensional data, such as text data in natural language processing tasks.

Answer 152

Feature Hashing works by applying a hash function to each feature in the original high-dimensional space and using the hash values to map the features to a fixed-size lower-dimensional vector. Collisions, where multiple features map to the same hash value, are allowed and can be handled by adding the feature values together.
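
A minimal sketch of this mapping, assuming MD5 from the standard library as the hash function (any stable hash would do) and a dictionary of feature names to values as input. The function name and the sample document are hypothetical.

```python
import hashlib
from typing import Dict, List


def hash_features(features: Dict[str, float], n_buckets: int) -> List[float]:
    """Map named features into a fixed-size vector via hashing."""
    vector = [0.0] * n_buckets
    for name, value in features.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_buckets
        # Colliding features simply add into the same bucket.
        vector[bucket] += value
    return vector


doc = {"the": 2.0, "cat": 1.0, "sat": 1.0}
hashed = hash_features(doc, n_buckets=8)
print(len(hashed), sum(hashed))  # 8 4.0
```

The output length depends only on `n_buckets`, never on how many distinct feature names have been seen.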

Answer 153

The main advantages of Feature Hashing compared to PCA are: It is computationally cheap, as it requires only a single pass over the data and does not involve any matrix operations. It can handle sparse, high-dimensional data efficiently, as it does not require building or storing a mapping from features to vector positions. It has a fixed memory footprint, as the size of the resulting lower-dimensional vector is determined by the chosen number of hash buckets and does not grow with the number of features.

Answer 154

The limitations of Feature Hashing include: Hash collisions can lead to information loss and reduced performance, especially if the number of hash buckets is too small relative to the number of original features. The choice of the hash function can impact the quality of the reduction, and different hash functions may yield different results. The resulting reduced dimensions are not interpretable, as they do not have any semantic meaning.

Answer 155

Post-pruning, also known as backward pruning, is a technique used to simplify a decision tree after it has been fully grown. The goal is to reduce overfitting and improve the tree's generalization performance on unseen data by removing or replacing subtrees that do not contribute significantly to the tree's accuracy.

Answer 156

Subtree replacement is a post-pruning technique where a subtree (a node and all its descendants) is replaced by a leaf node. The leaf node is assigned the majority class or the average value of the target variable in the case of classification or regression trees, respectively.

Answer 157

A subtree is replaced with a leaf node when the estimated error of the leaf node is lower than or equal to the estimated error of the subtree. In other words, if replacing the subtree with a leaf node leads to an improvement or no significant decrease in the tree's performance, the subtree is pruned.

Answer 158

The estimated error of a leaf node is calculated based on the number of misclassified instances (for classification trees) or the mean squared error (for regression trees) of the instances that reach the leaf node. Because the error measured on the training data is optimistic, the estimate is often adjusted pessimistically, for example by using the upper limit of a confidence interval on the observed error rate (as C4.5 does), which penalizes leaves that cover only a few instances.

Answer 159

The estimated error of a subtree is calculated by summing the estimated errors of all the leaf nodes in the subtree. This represents the total error of the subtree if it were to be kept in the decision tree.

Answer 160

The pruning criterion in subtree replacement is based on the comparison of the estimated error of the subtree and the estimated error of the leaf node that would replace it. If the estimated error of the leaf node is lower than or equal to the estimated error of the subtree, the subtree is pruned and replaced by the leaf node.
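
The criterion can be sketched as a bottom-up pass over the tree. In this illustration each node simply stores the estimated error it would have as a leaf (expressed as an expected number of errors, so that subtree errors add up); the `Node` class and the numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    leaf_error: float  # estimated error if this node were turned into a leaf
    children: List["Node"] = field(default_factory=list)


def subtree_error(node: Node) -> float:
    """Sum of the estimated errors of all leaves below (or at) this node."""
    if not node.children:
        return node.leaf_error
    return sum(subtree_error(child) for child in node.children)


def prune(node: Node) -> Node:
    """Bottom-up subtree replacement."""
    node.children = [prune(child) for child in node.children]
    if node.children and node.leaf_error <= subtree_error(node):
        node.children = []  # replace the subtree with a single leaf
    return node


inner = Node(leaf_error=4.0, children=[Node(2.5), Node(2.0)])
root = Node(leaf_error=10.0, children=[inner, Node(3.0)])
prune(root)
print(len(inner.children), len(root.children))  # 0 2
```

Here the inner subtree (error 4.5) is replaced by a leaf (error 4.0), while the root is kept because a single leaf (error 10.0) would be worse than the pruned subtree (error 7.0).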

Answer 161

Subtree replacement handles the trade-off between accuracy and simplicity by pruning subtrees that do not contribute significantly to the tree's performance. By replacing complex subtrees with simpler leaf nodes, the decision tree becomes more compact and easier to interpret, while maintaining a good level of accuracy.

Answer 162

The main advantage of post-pruning using subtree replacement is that it can significantly reduce the complexity of the decision tree and improve its generalization performance. By removing overly complex and potentially overfitting subtrees, the pruned tree is more likely to perform well on unseen data.

Answer 163

A potential limitation of subtree replacement is that it relies on the estimated error of the subtrees and leaf nodes, which may not always accurately reflect their true performance. The estimated error may be sensitive to the specific characteristics of the training data and the chosen error estimation method, such as cross-validation or a separate validation set.

Answer 164

An ensemble learner is a machine learning model that combines the predictions of multiple individual models (called base learners or weak learners) to make a final prediction. The goal of an ensemble learner is to improve the overall performance, stability, and robustness of the predictions compared to using a single model.

Answer 165

The main idea behind ensemble learning is that by combining the predictions of multiple models, the ensemble can capitalize on the strengths of each individual model while compensating for their weaknesses. This can lead to better generalization performance and reduced overfitting compared to using a single model.

Answer 166

Majority voting is a simple technique for combining the predictions of multiple classifiers in an ensemble. For each instance, the ensemble collects the predicted class labels from all the base classifiers and selects the class label that receives the most votes as the final prediction. In case of a tie, the ensemble can either randomly choose one of the tied class labels or use a predefined tie-breaking strategy.
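
A short sketch of the voting step for one instance. The tie-breaking rule used here (pick the alphabetically first label among the tied ones) is just one possible predefined strategy, chosen for the example.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the label predicted by the most base classifiers."""
    counts = Counter(predictions)
    top = max(counts.values())
    # Assumed tie-breaking strategy: alphabetically first tied label.
    return min(label for label, count in counts.items() if count == top)


print(majority_vote(["spam", "ham", "spam"]))  # spam
print(majority_vote(["spam", "ham"]))          # ham
```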

Answer 167

Averaging is a technique for combining the predictions of multiple regression models in an ensemble. For each instance, the ensemble collects the predicted numeric values from all the base models and calculates the average (mean) of these values as the final prediction. Averaging can help reduce the impact of individual model errors and produce more stable and accurate predictions.

Answer 168

Weighted averaging is an extension of the averaging technique, where each base model's prediction is assigned a weight based on its performance or importance. The final prediction is calculated as the weighted average of the base models' predictions, giving more influence to the models with higher weights. Weighted averaging can be useful when the base models have different levels of accuracy or when some models are more relevant to the problem at hand.
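
As a small worked example, the weighted average can be computed as below. The predictions and weights are invented for illustration; how the weights are obtained (for example, from validation accuracy) is left open.

```python
from typing import Sequence


def weighted_average(predictions: Sequence[float], weights: Sequence[float]) -> float:
    """Combine numeric predictions, giving more influence to larger weights."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total


# Three hypothetical base models; the third is trusted twice as much.
print(weighted_average([10.0, 20.0, 40.0], [1.0, 1.0, 2.0]))  # 27.5
```

With equal weights this reduces to plain averaging.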

Answer 169

Stacking, also known as stacked generalization, is an ensemble technique that combines the predictions of multiple base models using another machine learning model, called a meta-learner. The base models are trained on the original training data, and their predictions on a validation set are used as input features for the meta-learner. The meta-learner then learns how to optimally combine the base models' predictions to make the final prediction.

Answer 170

Some advantages of ensemble learning include: Improved accuracy and generalization performance compared to individual models. Reduced overfitting, as the ensemble can average out the noise and biases of individual models. Increased robustness to outliers and noisy data, as the ensemble can mitigate the impact of individual model errors. The ability to combine heterogeneous models, such as different algorithms or models trained on different subsets of the data.

Answer 171

Some limitations of ensemble learning include: Increased complexity and computational cost, as multiple models need to be trained and maintained. Reduced interpretability, as the ensemble's predictions are based on the combined outputs of multiple models, making it harder to understand the reasoning behind the predictions. Potential for diminishing returns, as adding more models to the ensemble may not always lead to significant performance improvements beyond a certain point.

Answer 172

A weak learner is a model that performs only slightly better than random guessing on a given task. It has an accuracy just slightly above 50% for binary classification problems.

Answer 173

Weak learners are used as the base models that are combined in ensemble methods like boosting and bagging. The ensemble techniques allow many weak learners to be combined to create a strong predictive model that outperforms any single weak learner.

Answer 174

A decision tree with only a few levels or splits could be considered a weak learner. It captures only a small part of the pattern in the data.

Answer 175

A weak learner just needs to perform better than random guessing. It does not need to be a strong or highly accurate model on its own.

Answer 176

Not necessarily. Some ensemble techniques like stacking can combine strong learners as well. But many boosting methods intentionally use weak learners as the base models.

Answer 177

Bagging (Bootstrap Aggregating) is an ensemble technique where multiple models are trained on different bootstrap samples of the training data. The predictions from the individual models are combined (e.g. by voting for classification or averaging for regression) to make the final prediction.

Answer 178

In bagging, each model is trained independently on a different bootstrap sample of the data. In boosting, the models are trained sequentially, with later models giving more weight to instances that previous models misclassified.

Answer 179

Boosting is an ensemble method that converts weak learners (slightly better than random guessing) into a strong learner by training a sequence of models. Each subsequent model pays more attention to the instances that were misclassified by previous models.

Answer 180

In bagging, the constituent models have equal weights and are combined via simple averaging (regression) or majority voting (classification). In boosting, each model's vote is weighted according to its accuracy on the weighted training data, so more accurate models have more influence.

Answer 181

Bagging tends to work better with unstable base models like decision trees or neural nets that have high variance. The variance is reduced by averaging multiple trees/nets trained on resampled data.

Answer 182

Stacking is an ensemble learning technique that combines multiple base models by training a higher-level meta-model to assign weights to or combine the predictions from the base models.

Answer 183

Stacking has two levels - level 0 where the base models are trained, and level 1 where a meta-model is trained to combine the base model predictions.

Answer 184

Any type of model like decision trees, neural nets, SVM etc. can be used as the base level 0 models in stacking.

Answer 185

Simple models like logistic regression, naive Bayes or linear regression are often used as the level 1 meta-model to combine the base model outputs.

Answer 186

The meta-model is trained on held-out data not used to train the base models. This avoids overfitting the meta-model to the base models.

Answer 187

If small changes in the training data lead to large changes in the model, then the model is said to have high variance or be unstable. Decision trees and neural networks often exhibit this behavior.

Answer 188

Bagging trains multiple models on different bootstrap samples of the training data. By averaging their predictions, the variance of the individual unstable models is reduced.

Answer 189

Bagging reduces the variance of unstable models like trees and neural nets, which helps prevent overfitting to the specific dataset used for training.

Answer 190

Bagging is most useful when the base model is unstable or sensitive to small changes in the training data. For stable low-variance models, the models trained on different bootstrap samples are nearly identical, so bagging offers little improvement over a single model.

Answer 191

Bagging reduces variance but does little to reduce bias. If the base models are too simple or constrained, the ensemble remains biased, because averaging preserves the bias shared across the models.

Answer 192

No. Each bootstrap sample is drawn with replacement, so it typically does not contain every instance from the original training set.

Answer 193

Bootstrap samples are created by drawing random samples with replacement from the original training set. This means some instances may be repeated in a bootstrap sample, while others may be left out.

Answer 194

On average, around 63.2% of the distinct instances from the original training set are included in a bootstrap sample when sampling with replacement. The remaining 36.8% of the instances are left out.
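
The 63.2% figure follows from the chance that a given instance is never picked in n draws, (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. A quick numeric check, with a hypothetical helper name:

```python
import math


def expected_unique_fraction(n: int) -> float:
    """Probability that a given instance is drawn at least once in n draws."""
    return 1.0 - (1.0 - 1.0 / n) ** n


for n in (10, 100, 10_000):
    print(n, round(expected_unique_fraction(n), 4))
print(round(1.0 - math.exp(-1.0), 4))  # 0.6321
```

For small training sets the fraction is slightly higher than 63.2%, and it settles toward that limit as n increases.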

Answer 195

The instances from the original training set that are left out of a particular bootstrap sample are called the out-of-bag (OOB) instances for that sample.
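
Drawing one bootstrap sample and collecting its out-of-bag instances might look like the sketch below. It works on instance indices rather than actual data, and the function name and seed are assumptions for the example.

```python
import random
from typing import List, Tuple


def bootstrap_sample(n: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Return (in-bag indices drawn with replacement, out-of-bag indices)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(42)
in_bag, out_of_bag = bootstrap_sample(10_000, rng)
print(len(in_bag))                 # 10000
print(len(out_of_bag) / 10_000)    # roughly 0.37
```

A model trained on `in_bag` has never seen the instances in `out_of_bag`, which is what makes them usable for evaluation.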

Answer 196

The OOB instances can be used as a validation set for that bootstrap sample's model. This allows the bagging technique to get an unbiased estimate of the model's generalization error.

Answer 197

No, in bagging the same ML algorithm (e.g. decision tree) is used to train all the base models, but on different bootstrap samples of the training data.

Answer 198

No, in boosting methods like AdaBoost, the same base ML algorithm (e.g. decision stumps) is used to sequentially train all the weak learner models.

Answer 199

Yes, it is possible to use different ML algorithms as the base models in bagging and boosting, but this is not the typical implementation.

Answer 200

The base models used are typically fast to train but have high variance/bias, like decision trees or stumps. This allows many models to be efficiently generated.

Answer 201

Stacking can combine very different ML algorithms like SVMs, neural nets, naive Bayes etc. as the level-0 base models by training a meta-model to combine their predictions.

Answer 202

Like bagging, random forests draw bootstrap samples from the original data, using sampling with replacement. On average, each tree is built using around 63.2% of the distinct instances, leaving the remaining 36.8% as out-of-bag instances.

Answer 203

Instead of considering all attributes to choose the best split at each node, random forests randomly select a subset of the attributes (e.g. sqrt(total_attributes)) and choose the best split from this subset.
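
The attribute subsampling at a single node can be sketched as follows, assuming the sqrt rule mentioned above with the result rounded down. The function name and attribute names are hypothetical.

```python
import math
import random
from typing import List


def candidate_attributes(attributes: List[str], rng: random.Random) -> List[str]:
    """Pick about sqrt(total_attributes) attributes to consider at one node."""
    k = max(1, int(math.sqrt(len(attributes))))
    return rng.sample(attributes, k)


attributes = [f"attr_{i}" for i in range(9)]
subset = candidate_attributes(attributes, random.Random(0))
print(len(subset))  # 3
```

A fresh subset is drawn at every node, so an attribute skipped at one split can still be used further down the same tree.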

Answer 204

Subsampling the attributes adds additional randomness and prevents strong attributes from being used in lots of trees. This reduces correlation between trees, increasing diversity in the overall forest.

Answer 205

Each tree in the forest makes a prediction (classification or regression) based on the instances propagating through its structure to a leaf node.

Answer 206

For classification, the forest chooses the class with the maximum votes from all trees. For regression, the forest averages the numeric predictions from individual trees.

Answer 207

AdaBoost.M1 works by iteratively training weak learners on re-weighted versions of the data, focusing more on instances that were previously misclassified.

Answer 208

All instances are initially given equal weights (1/N where N is the number of training instances).

Answer 209

Instances that were correctly classified get their weights decreased, while instances that were misclassified get their weights increased.

Answer 210

The amount an instance's weight is updated depends on the error rate of the newly trained weak learner model.

Answer 211

The weak learner models are combined into a weighted majority vote, where each model's vote is weighted by log((1 - e) / e), with e the error rate achieved on the weighted data, so more accurate models get larger votes.
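
One boosting round can be sketched as follows. This assumes a common textbook formulation of AdaBoost.M1: weights of correctly classified instances are multiplied by e / (1 - e), all weights are renormalized, and the vote weight is log((1 - e) / e). The function name and the choice to return None for an unusable learner are illustrative.

```python
import math

def adaboost_m1_round(weights, correct):
    # weights: current instance weights, summing to 1
    # correct: True where the weak learner classified the instance correctly
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error == 0 or error >= 0.5:
        return None  # boosting stops: learner is perfect or no better than chance
    factor = error / (1 - error)
    # Shrink the weights of correct instances; misclassified ones gain relative weight.
    scaled = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    new_weights = [w / total for w in scaled]
    vote_weight = math.log((1 - error) / error)
    return new_weights, vote_weight
```

With four equally weighted instances and one mistake (e = 0.25), the misclassified instance ends up holding half of the total weight.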

Answer 212

Each weak learner added to the boosted ensemble must have an accuracy greater than 50% on the weighted training data.

Answer 213

If the chosen weak learner has accuracy less than 50%, AdaBoost.M1 will fail and terminate the boosting process early.

Answer 214

AdaBoost.M1 gives higher weights to misclassified instances. If a weak learner is worse than 50%, it will keep misclassifying the higher weighted instances, leading to divergence.

Answer 215

Multi-class boosting variants such as SAMME can admit weak learners with errors greater than 50%: for K classes they only require each learner to beat random guessing (accuracy above 1/K), and they update the weights in a different way.

Answer 216

If each weak learner is slightly better than 50% accuracy, AdaBoost.M1 can combine them into an arbitrarily accurate strong learner as more weak learners are added.

Answer 217

Stacking has two levels - level 0 where base models are trained, and level 1 where a meta-model is trained to combine the base model predictions.

Answer 218

Using the same data to train both the level-0 base models and the level-1 meta-model will lead to overfitting and optimistic performance estimates.

Answer 219

The meta-model at level 1 would be trained on data that is not independent of the level-0 base models, violating a core machine learning principle.

Answer 220

The data is split into two disjoint sets - one used to train the level-0 base models, and a held-out set used to train the level-1 meta-model.

Answer 221

Common techniques include using out-of-bag instances from bagging at level-0, or using k-fold cross-validation to create non-overlapping train/test splits.

Answer 222

A confidence interval is a range of values used to estimate an unknown population parameter, constructed from a given set of sample data.

Answer 223

The confidence level is the long-run proportion of confidence intervals, constructed this way from repeated samples, that would contain the true population parameter (e.g. 95% of such intervals for a 95% level).

Answer 224

A confidence interval is calculated using the sample statistic, the sample size, and standard deviation or standard error of the sampling distribution.

Answer 225

As the confidence level increases (e.g. from 90% to 95%), the confidence interval becomes wider, reducing the precision but increasing the likelihood of capturing the true parameter.
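
A rough sketch of an interval for a mean, assuming a normal approximation (mean plus or minus z times the standard error). For small samples a t-distribution critical value would be more appropriate; the function name is illustrative.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    n = len(sample)
    m = mean(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * standard_error, m + z * standard_error
```

Raising `confidence` raises z, so the interval widens while the sample statistic stays the same.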

Answer 226

Confidence intervals are used to estimate population means, proportions, differences between groups, and regression coefficients, providing a range of plausible values.

Answer 227

A paired t-test is a statistical test used to compare two population means where the observations in the two samples are paired or related.

Answer 228

A paired t-test is used when each observation in one sample has a unique corresponding observation in the other sample, such as before-and-after measurements on the same subjects.

Answer 229

In a paired t-test, the two samples are dependent or related, whereas in an independent two-sample t-test, the two samples are independent and unrelated.

Answer 230

The null hypothesis for a paired t-test is that the true mean difference between the two related populations is zero.

Answer 231

A key assumption for a paired t-test is that the differences between the paired observations are normally distributed.

Answer 232

A paired t-test could be used to compare the mean weights of subjects before and after a diet program, where each subject's weight is measured twice.
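
The test statistic for such a before/after design can be sketched as below: the mean of the paired differences divided by its standard error, with n - 1 degrees of freedom. The function name is illustrative.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(before, after):
    # Work on the per-subject differences, not on the two samples separately.
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1  # statistic and degrees of freedom
```

With made-up weights before = [80, 75, 90, 85] and after = [78, 74, 86, 83], the differences are [2, 1, 4, 2] and t is about 3.58 on 3 degrees of freedom.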

Answer 233

Different types of errors (false positives vs false negatives) or predictions can have vastly different costs or consequences in real-world applications. Incorporating costs allows the model to make predictions that minimize the expected cost or risk.

Answer 234

In fraud detection, the cost of misclassifying a fraudulent transaction as legitimate (false negative) is much higher than misclassifying a valid transaction as fraudulent (false positive).
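
A minimal sketch of a minimum-expected-cost decision, assuming the model outputs class probabilities and a cost matrix is given. The cost values used in the note below are made up for illustration.

```python
def min_expected_cost_class(probs, cost):
    # probs[j]: predicted probability that the true class is j
    # cost[j][k]: cost of predicting class k when the true class is j
    n = len(probs)
    expected = [sum(probs[j] * cost[j][k] for j in range(n)) for k in range(n)]
    best = min(range(n), key=lambda k: expected[k])
    return best, expected
```

With classes 0 = legitimate and 1 = fraud, probs = [0.9, 0.1] and cost = [[0, 1], [20, 0]], predicting "legitimate" has expected cost 2.0 and predicting "fraud" has 0.9, so the transaction is flagged even though fraud is the less likely class.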

Answer 235

One way is to introduce instance weights based on misclassification costs during model training so that higher cost errors are given more importance.

Answer 236

A common approach is to modify the loss function used during training to unequally penalize over-predictions vs under-predictions based on the cost ratios.

Answer 237

Cost-sensitive decision tree induction algorithms like IDX-AL and CAID modify the tree splitting criteria to incorporate misclassification costs.

Answer 238

In addition to costs, the prior probabilities of different classes are also sometimes incorporated into cost-sensitive learning algorithms.

Answer 239

A Perceptron is a single-layer neural network used for binary classification. It computes a weighted sum of the inputs and applies a step activation function.

Answer 240

If the weighted sum is greater than a threshold, the Perceptron predicts the positive class, otherwise it predicts the negative class.
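
A sketch of this decision rule, with 1 standing for the positive class and 0 for the negative class (names and encoding are illustrative):

```python
def perceptron_predict(weights, inputs, threshold=0.0):
    # Weighted sum of the inputs followed by a step activation.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0
```

For example, weights [1, 1] with threshold 1.5 implement a logical AND of two binary inputs.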

Answer 241

An MLP is a neural network with one or more hidden layers between the input and output layers, allowing it to learn non-linear decision boundaries.

Answer 242

A simple Perceptron can only learn linearly separable patterns. It cannot model complex non-linear decision boundaries.

Answer 243

By introducing hidden layers with non-linear activation functions, an MLP can model complex non-linear relationships between inputs and outputs.

Answer 244

Common activation functions include sigmoid, tanh, and ReLU (rectified linear unit). These introduce non-linearity to the network.

Answer 245

Backpropagation is an algorithm used to train multi-layer neural networks by updating the network weights to minimize the output error.

Answer 246

The two main steps are: 1) Forward propagation to get output predictions, and 2) Backward propagation of errors to update weights.

Answer 247

Weights are updated using gradient descent by computing the gradient of the error with respect to each weight.

Answer 248

The chain rule from calculus is used to compute the gradients by back-propagating the error signal through the network layers.
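
The chain rule and the gradient descent update can be illustrated on the smallest possible case: a single sigmoid neuron with squared error E = 0.5 * (y - target)^2. This is a sketch for one training instance, not a full backpropagation implementation; all names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w, b, x, target, learning_rate):
    y = sigmoid(w * x + b)      # forward pass
    dE_dy = y - target          # derivative of 0.5 * (y - target)^2
    dy_dz = y * (1.0 - y)       # derivative of the sigmoid
    grad_w = dE_dy * dy_dz * x  # chain rule: dE/dw
    grad_b = dE_dy * dy_dz      # chain rule: dE/db
    # Move each parameter against its gradient.
    return w - learning_rate * grad_w, b - learning_rate * grad_b
```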

Answer 249

Backpropagation solves the credit assignment problem of how to distribute blame for errors and update weights in multi-layer networks.

Answer 250

Learning rate, momentum, batch size, number of epochs/iterations, weight initialization, and network architecture impact training performance.

Answer 251

The Softmax function is a generalization of the logistic sigmoid function that outputs a vector of values summing to 1, representing the predicted probability distribution over multiple classes.

Answer 252

The Softmax function is commonly used as the activation function in the output layer of multi-class neural network classifiers.

Answer 253

The Softmax function is calculated as: softmax(x)_i = exp(x_i) / sum_j(exp(x_j)) where x_i is the input value for class i and the sum is taken over all classes j.
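
The formula can be sketched directly. Subtracting the maximum input before exponentiating is a common numerical-stability trick assumed here; it does not change the result.

```python
import math

def softmax(xs):
    m = max(xs)  # shift inputs so the largest exponent is 0
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```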

Answer 254

The Softmax output is a valid probability distribution, meaning all values are between 0 and 1, and they sum to 1.

Answer 255

The class with the highest predicted probability in the Softmax output vector is typically chosen as the predicted class label.

Answer 256

The cross-entropy loss function is typically used in combination with the Softmax output layer for multi-class classification problems.

Answer 257

Cross-entropy is a loss function commonly used in classification problems, especially with neural networks and Softmax outputs.

Answer 258

Cross-entropy measures the performance of a classification model by quantifying the divergence between the predicted probability distributions and the true distributions (targets).

Answer 259

For binary classification: L = -[y * log(p) + (1-y) * log(1-p)] where y is the true label (0 or 1) and p is the predicted probability.

Answer 260

For multi-class: L = -sum(y_true * log(y_pred)) where y_true is a one-hot vector of true labels and y_pred is the predicted probability distribution.
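
Both forms can be sketched for a single instance. The sketch assumes predicted probabilities strictly between 0 and 1 (practical implementations usually clip them to avoid log(0)); function names are illustrative.

```python
import math

def binary_cross_entropy(y, p):
    # y is the true label (0 or 1), p the predicted probability of class 1.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def categorical_cross_entropy(y_true, y_pred):
    # y_true is one-hot, so only the true class contributes to the sum.
    return -sum(t * math.log(q) for t, q in zip(y_true, y_pred) if t > 0)
```

For a one-hot target the multi-class loss reduces to -log of the probability assigned to the true class.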

Answer 261

Cross-entropy increases as the predicted probability diverges from the true label. It reaches a minimum when the predicted distribution exactly matches the true distribution.

Answer 262

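Cross-entropy is convex in the logits (and in the weights of a linear model such as logistic regression), which gives well-behaved gradients for optimization, although the overall loss of a multi-layer network is not convex. It also extends naturally to multi-class problems. It is not robust to outliers: confidently wrong predictions, including those on mislabeled instances, are penalized very heavily.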

Answer 263

Partitioning data allows for training a model on one subset, tuning hyperparameters on another subset, and evaluating the generalization performance on a held-out test set.

Answer 264

The training set is used to fit the parameters of the machine learning model.

Answer 265

The validation set is used for tuning hyperparameters, model selection, and preventing overfitting during the training process.

Answer 266

The test set is used to evaluate the final model's performance on unseen data and estimate its ability to generalize.

Answer 267

Cross-validation is a technique that involves partitioning the data into multiple folds for training and validation, allowing all data points to be used for both purposes.
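
A sketch of how the folds can be formed, assuming instances are referred to by index and have already been shuffled if needed (the function name is illustrative):

```python
def k_fold_indices(n, k):
    # Fold sizes differ by at most one when n is not divisible by k.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    splits, start = [], 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        splits.append((training, validation))
        start += size
    return splits
```

Each index lands in exactly one validation fold and in the training part of the other k - 1 splits.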

Answer 268

Common techniques include k-fold cross-validation, leave-one-out cross-validation (LOOCV), and stratified cross-validation for imbalanced datasets.

Answer 269

Pruning is a technique used to prevent overfitting by removing sections of the tree that provide little power to classify instances.

Answer 270

The two main approaches are pre-pruning (stopping tree growth early) and post-pruning (removing subtrees from an overly large tree).

Answer 271

Pre-pruning stops growing a tree branch once a statistical test indicates that further splits will not add sufficient value to the model.

Answer 272

Post-pruning first grows a large, overly complex tree, then removes subtrees in a bottom-up fashion based on their contribution to the overall tree performance.

Answer 273

Common metrics include the number of instances misclassified after pruning or an estimate of the expected error rate after pruning.

Answer 274

Pre-pruning can be too aggressive: it may stop at a split that looks unhelpful on its own even though later splits further down that branch would be valuable (for example, when two attributes are only predictive in combination).

Answer 275

Feature selection is the process of choosing a subset of relevant features or attributes from the original set of features in order to improve model performance, efficiency, and interpretability.

Answer 276

Feature selection can improve model accuracy by removing irrelevant or redundant features, reduce overfitting, decrease training times, and enhance interpretability.

Answer 277

Common methods include filter methods (e.g., chi-squared test, mutual information), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., LASSO, decision tree importance).

Answer 278

A feature transformation is a process of deriving new features from the original set of features, often to improve their signal or discriminative ability.

Answer 279

One example is the TF-IDF transformation, which converts text data into a vector representation that reflects the importance of words in a document relative to a corpus.

Answer 280

Feature transformations can make patterns more evident to machine learning algorithms, increase model performance, and provide more meaningful representations than the raw input data.

Answer 281

PCA is a technique used to reduce the dimensionality of a dataset by projecting the original features onto a new set of uncorrelated features called principal components.

Answer 282

The principal components are calculated as the eigenvectors of the covariance matrix of the data, ordered by decreasing eigenvalues.
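
A sketch of finding the first principal component without a linear algebra library, using power iteration on the covariance matrix. Power iteration is an assumption chosen for brevity (libraries normally use an eigendecomposition or SVD), and it can fail if the start vector happens to be orthogonal to the leading eigenvector.

```python
import math

def covariance_matrix(rows):
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_principal_component(rows, iterations=100):
    cov = covariance_matrix(rows)
    d = len(cov)
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # data has no variance
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along v.
    eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, eigenvalue
```

For points lying on the line y = x, the component found points along (1, 1) / sqrt(2) and captures all of the variance.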

Answer 283

The principal components are orthogonal (uncorrelated) to each other and ordered such that the first few components capture the most variance in the data.

Answer 284

A random projection is a technique that projects high-dimensional data to a lower-dimensional subspace using randomly generated projection matrices.

Answer 285

Random projections provide a computationally efficient way to reduce dimensionality while approximately preserving the structure of the data, according to the Johnson-Lindenstrauss lemma.

Answer 286

Random projections are less computationally expensive than PCA but may not capture as much variance as PCA's principal components. However, they can still provide good dimensionality reduction for some applications.

Answer 287

RFE is a feature selection technique that recursively removes features from the initial set based on their importance scores, building a model on the remaining features at each iteration.

Answer 288

RFE can be used with any machine learning model that provides feature importance scores or weights, such as linear models, tree-based models, and neural networks.

Answer 289

RFE trains the specified model, ranks all features by importance, removes the least important features, and repeats this process until the desired number of features remains.

Answer 290

RFE automatically selects features tailored to the specific model being used and considers feature interactions. It can improve model performance and efficiency.

Answer 291

RFE can be computationally expensive, especially for complex models and large feature sets, since it must train the model multiple times.

Answer 292

The number of features to select can be specified directly, or RFE can be combined with cross-validation to automatically determine the optimal number of features.

Answer 293

Forward Feature Selection is an iterative method that starts with an empty feature set and sequentially adds the most useful feature at each step based on a performance metric.
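
The greedy loop can be sketched as below. The `score` callable is a hypothetical stand-in for whatever performance metric is used (for example, cross-validated accuracy of a model trained on the given feature subset).

```python
def forward_selection(features, score, max_features=None):
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Try each remaining feature and keep the one that helps most.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda t: t[0])
        if top_score <= best:
            break  # no remaining feature improves the metric
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best
```

Backward elimination follows the same pattern in reverse: start from all features and drop the one whose removal hurts the metric least.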

Answer 294

Backward Feature Elimination is an iterative method that starts with the full set of features and sequentially removes the least useful feature at each step based on a performance metric.

Answer 295

Forward Selection starts with no features and adds, while Backward Elimination starts with all features and removes. Forward Selection is faster for datasets with fewer relevant features.

Answer 296

A common metric is the increase/decrease in model performance (e.g. accuracy, R-squared) when a feature is added/removed.

Answer 297

Typical stopping criteria are that adding or removing any further feature no longer improves performance, or that a specified number of features has been reached.

Answer 298

They are simple, deterministic, and far cheaper than searching all possible feature subsets. Because they are greedy, however, they can still settle on a locally optimal subset.

Answer 299

PCA is a technique used for dimensionality reduction by projecting the original data onto a new set of orthogonal axes called principal components.

Answer 300

The principal components are calculated as the eigenvectors of the covariance matrix of the data, ordered by decreasing eigenvalues.

Answer 301

The principal components are orthogonal (uncorrelated) to each other, and ordered such that the first few components capture the most variance in the data.

Answer 302

By projecting the data onto the first few principal components, which capture most of the variance, the dimensionality can be reduced while preserving most of the important information.

Answer 303

PCA can improve algorithm performance, reduce overfitting, and provide insights into the sources of variability in the data.

Answer 304

PCA assumes linear relationships between variables, may not be suitable for non-numeric data, and the principal components may lack interpretability.

Answer 305

TF-IDF is a numerical technique used to quantify the importance or relevance of words in a document within a collection or corpus of documents.

Answer 306

TF-IDF consists of two parts: Term Frequency (TF) and Inverse Document Frequency (IDF).

Answer 307

Term Frequency measures how frequently a word appears in a given document. Words that occur more often in the document have a higher TF score.

Answer 308

Inverse Document Frequency measures how rare or common a word is across the entire document corpus. Rare words will have a higher IDF score.

Answer 309

TF-IDF = TF * IDF, where TF is the normalized term frequency, and IDF is the log of the inverse of the document frequency ratio.
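
A sketch of the basic, unsmoothed variant, assuming documents are already tokenized into lists of words: TF is the count divided by document length and IDF is log(N / document frequency). Libraries often use smoothed variants, so their numbers will differ.

```python
import math

def tf_idf(docs):
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    scores = []
    for doc in docs:
        doc_scores = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores
```

A word that appears in every document gets IDF = log(1) = 0, so its TF-IDF score is 0 regardless of how often it occurs.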

Answer 310

TF-IDF is commonly used in information retrieval, text mining, and as a way to represent text data for machine learning tasks like text classification and clustering.

Answer 311

Discretizing is the process of converting a continuous numeric attribute into a categorical attribute by creating bins or ranges of values.

Answer 312

Some machine learning algorithms work better with categorical data, and discretizing can make numeric attributes more interpretable while reducing sensitivity to outliers.

Answer 313

1) Equal-width binning divides the range into N bins of equal size. 2) Equal-frequency binning creates N bins with approximately the same number of instances in each.
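
Both schemes can be sketched as functions that return a bin index per value. Names are illustrative; in the equal-frequency version, tied values may be split across neighboring bins.

```python
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    # Rank the values, then cut the ranking into n_bins equal-sized groups.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins
```

On skewed data such as [1, 2, 3, 4, 100, 200], equal-width binning puts most values into the first bin, while equal-frequency binning spreads them evenly.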

Answer 314

One-hot encoding converts a categorical attribute into a vector of binary values, with one component being 1 and the rest 0, to represent each category.
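
A minimal sketch, assuming the category order is simply the sorted set of observed values (the function name is illustrative):

```python
def one_hot(values):
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    # One vector per value, with a single 1 at the position of its category.
    vectors = [[1 if index[v] == i else 0 for i in range(len(categories))]
               for v in values]
    return categories, vectors
```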

Answer 315

One-hot encoding is required when working with categorical data and machine learning algorithms that assume numeric inputs, such as neural networks.

Answer 316

If not one-hot encoded, algorithms may assume an ordinal relationship between categories, leading to incorrect results. One-hot avoids this assumption.

Answer 317

Ensemble Learning is a machine learning technique that combines predictions from multiple individual models to produce a more accurate and robust composite model.

Answer 318

Ensemble methods can improve predictive performance, reduce overfitting by combining multiple hypotheses, and provide better generalization on unseen data.

Answer 319

The two main types are: 1) Bagging (Bootstrap Aggregating) and 2) Boosting.

Answer 320

Bagging involves creating multiple models from different bootstrap samples of the training data and combining their predictions by majority vote (classification) or averaging (regression).

Answer 321

Boosting trains a sequence of weak models where each subsequent model gives more emphasis to instances that were misclassified by previous models.

Answer 322

Examples include Random Forests (Bagging), AdaBoost (Boosting), Gradient Boosting Machines, and Stacking (combining different model types).

Answer 323

Bagging is an ensemble learning technique that combines predictions from multiple models trained on different bootstrap samples of the training data.

Answer 324

Bootstrap samples are created by randomly drawing instances from the original training set with replacement, allowing for duplication of some instances.

Answer 325

For classification, the predicted class is the mode (most frequent) of the classes predicted by individual models. For regression, the predictions are averaged.

Answer 326

Bagging reduces variance and helps avoid overfitting by combining the predictions of multiple models trained on different samples of the data.

Answer 327

Bagging works best with unstable models that have high variance, such as decision trees and neural networks, as it helps reduce their sensitivity to specific data samples.

Answer 328

Random Forests is a popular Bagging ensemble method that combines multiple decision trees trained on bootstrap samples and random subsets of features.

Answer 329

Boosting is an ensemble learning technique that trains a sequence of weak learners in such a way that each subsequent model gives more emphasis to instances that were previously misclassified.

Answer 330

A weak learner is a simple model that performs slightly better than random guessing, such as a shallow decision tree or a decision stump.

Answer 331

Boosting combines multiple weak learners into a strong learner by focusing on the difficult instances and continuously adjusting the instance weights to improve the overall accuracy.

Answer 332

In Bagging, the base models are trained independently, while in Boosting, they are trained sequentially with each model learning from the mistakes of the previous models.

Answer 333

AdaBoost (Adaptive Boosting) is a widely used Boosting algorithm that iteratively trains weak learners on reweighted versions of the training data, giving more weight to misclassified instances.

Answer 334

Boosting models can be prone to overfitting the training data, especially with weak learners that are too complex or with too many iterations. Proper regularization and early stopping are important.

Answer 335

Stacking is an ensemble learning technique that combines predictions from multiple machine learning models by training a meta-model on the outputs of the base models.

Answer 336

Stacking involves training several base models (level 0) on the training data, then using their predictions on a holdout set as input features to train a meta-model (level 1).

Answer 337

Any type of machine learning model can be used as the base models (level 0) in Stacking, including decision trees, neural networks, SVMs, etc.

Answer 338

Simple linear models like logistic regression or linear regression are often used as the meta-model to combine the base model predictions.

Answer 339

Using the same data to train both the base models and meta-model would lead to overfitting. A separate holdout set ensures the meta-model generalizes better.

Answer 340

Stacking can leverage the strengths of diverse model types, reduce the risk of choosing a poorly performing single model, and improve predictive performance.

Answer 341

AdaBoost is a popular boosting algorithm that combines multiple weak learners into a strong ensemble model for classification problems.

Answer 342

AdaBoost trains weak learners sequentially, giving more weight to misclassified instances, so that subsequent learners focus on the harder examples.

Answer 343

Each weak learner must have an accuracy higher than 50% on the weighted training data. Otherwise, AdaBoost will fail.

Answer 344

The weak learners are combined through weighted majority voting, where each learner's vote weight grows with its accuracy on the weighted data: log((1 - e) / e) for weighted error rate e.

Answer 345

Initially all instances have equal weights. After each iteration, weights are increased for misclassified instances and decreased for correctly classified ones.

Answer 346

AdaBoost is simple yet powerful, achieves better accuracy than using a single model, and is resistant to overfitting when the weak learners are stumps/shallow trees.

Answer 347

A Random Forest is an ensemble learning method that constructs multiple decision trees and combines their predictions through majority voting (classification) or averaging (regression).

Answer 348

1) Bootstrap sampling of the training data to grow each tree, and 2) Random subsampling of features at each split point when growing the trees.

Answer 349

Randomly selecting a subset of features to evaluate at each split increases diversity among the individual trees, reducing correlation and improving the overall prediction accuracy.

Answer 350

Random Forests are robust to overfitting and can handle high-dimensional datasets with good predictive performance, even without much tuning of hyperparameters.

Answer 351

For classification, the majority vote from all decision trees is used. For regression, the mean/average of all tree predictions is used.

Answer 352

Random Forests excel at both classification and regression tasks, can handle mixed data types, and are effective for high-dimensional data with many irrelevant features.

Test 2 Flashcards

(376 cards)