Test 1 Flashcards

Question

How do you calculate the entropy of a decision tree?

Answer 1

Entropy = -Σ (ElementsInClass / TotalElementsInSet) * log2(ElementsInClass / TotalElementsInSet), summed over each class in the set. That is, compute this term for every class, add the terms together, and negate the sum.

Answer 2

Subtract the weighted average entropy of the child nodes in the split from the entropy of the parent node. For each child node, calculate its own entropy, then multiply it by the ratio of the number of elements in that child to the total number of elements in the parent. Add these weighted entropies together and subtract the sum from the parent's entropy; the result is the information gain of the split.

Answer 3

The gain ratio is information gain divided by intrinsic information. Intrinsic information is -Σ (|Sv| / |S|) * log2(|Sv| / |S|), where |Sv| is the number of instances in child node v, and |S| is the total number of instances in the parent node.

Answer 4

Information gain naturally favors attributes that split the data set into many disjoint sets, each containing only a few members. This, however, tends to generalize poorly to new data (i.e., the training data is overfitted). The gain ratio attempts to counter this.

Answer 5

Pick the attribute which will result in the least amount of impurity (entropy) in the leaf nodes

Answer 6

A rooted tree used to classify instances based on their attributes

Answer 7

Possible values for an attribute

Answer 8

A child for each possible value

Answer 9

Some scheme based on a quantification (i.e., a cutoff value)

Answer 10

That as one travels down a branch from the root, the set of classes an instance matches gets smaller and smaller

Answer 11

No, there is no way to deterministically classify a data set in which two distinct instances have identical input attribute values but belong to different classes

Answer 12

ID3 selects the best attribute for splitting at each node by information gain; gain ratio may be substituted as the criterion (as in the later C4.5 algorithm).

Answer 13

Entropy(S) = -Σ p(c) * log2(p(c)), where p(c) is the proportion of instances belonging to class c in the set S.
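
As a minimal sketch of the formula above, the function below computes entropy from a list of class labels. The label lists `mixed` and `pure` are hypothetical examples, not data from these cards.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes c of p(c) * log2(p(c))."""
    total = len(labels)
    counts = Counter(labels)
    # Classes with zero members never appear in the Counter, so log2(0) is avoided.
    return sum(-(n / total) * log2(n / total) for n in counts.values())

mixed = ["yes", "yes", "no", "no"]  # hypothetical: two classes, evenly split
pure = ["yes", "yes", "yes"]        # hypothetical: a single class

print(entropy(mixed))  # evenly split two-class set
print(entropy(pure))   # single-class set
```

The evenly split set gives the two-class maximum of 1 bit, and the single-class set gives 0.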

Answer 14

An entropy of 0 indicates perfect purity, meaning all instances in the set belong to the same class.

Answer 15

Maximum entropy indicates high impurity, meaning instances are equally distributed among classes.

Answer 16

Gain(S, A) = Entropy(S) - Σ ((|Sv| / |S|) * Entropy(Sv)), where Sv is the subset of instances in S with attribute A having value v.
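
The gain formula above can be sketched as follows. The row layout (a list of dictionaries) and the attribute names `outlook`, `windy`, and `play` are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv)."""
    subsets = defaultdict(list)
    for row in rows:
        # Group the target labels by the value the row has for this attribute.
        subsets[row[attribute]].append(row[target])
    parent = entropy([row[target] for row in rows])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return parent - weighted

# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" not at all.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

print(information_gain(rows, "outlook", "play"))
print(information_gain(rows, "windy", "play"))
```

On this toy data the split on `outlook` removes all of the parent's entropy, while the split on `windy` removes none.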

Answer 17

Information gain measures the reduction in entropy achieved by splitting the instances based on an attribute. The attribute with the highest information gain is considered the best split attribute.

Answer 18

Gain ratio is an extension of information gain that addresses the bias towards attributes with many values. It is calculated by dividing the information gain by the intrinsic information of the attribute.

Answer 19

For an unlabelled node, ID3 calculates the information gain (or gain ratio, if that criterion is used) for each attribute and selects the attribute with the highest value for splitting the instances at that node.

Answer 20

After selecting the best attribute, ID3 creates child nodes based on the possible values of the selected attribute and assigns the corresponding instances to each child node.

Answer 21

The recursive process stops when one of the following conditions is met: All instances in a node belong to the same class (pure node). There are no more attributes to split on. There are no more instances to split.

Answer 22

For each leaf node, the majority class label among the instances in that node is assigned. If there are no instances in a leaf node (empty node), the majority class label of its parent node is assigned.

Answer 23

To classify a new instance, traverse the decision tree from the root node to a leaf node based on the attribute values of the instance. The class label associated with the reached leaf node is assigned to the new instance.
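
A minimal sketch of this traversal, assuming the tree is stored as nested dictionaries in which internal nodes hold an `attribute` name and a `children` mapping, and leaves are plain class labels. This representation and the example tree are hypothetical.

```python
def classify(tree, instance):
    """Walk from the root to a leaf, following the branch for each attribute value."""
    node = tree
    while isinstance(node, dict):          # internal node: test an attribute
        value = instance[node["attribute"]]
        node = node["children"][value]     # follow the branch for that value
    return node                            # leaf: the class label

# Hypothetical tree: test "outlook" first, then "windy" on the rainy branch.
tree = {
    "attribute": "outlook",
    "children": {
        "sunny": "no",
        "rainy": {"attribute": "windy", "children": {"yes": "no", "no": "yes"}},
    },
}

print(classify(tree, {"outlook": "rainy", "windy": "no"}))
```

An attribute value with no matching branch raises a `KeyError` in this sketch; a fuller version would need a fallback such as the parent's majority class.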

Answer 24

The main goal of classification is to predict a categorical or nominal target variable, assigning instances to predefined classes or categories

Answer 25

The main goal of regression is to predict a continuous or numeric target variable, estimating the relationship between input features and the target variable

Answer 26

Classification predicts a categorical or nominal target variable, such as binary or multi-class outcomes

Answer 27

Regression predicts a continuous or numeric target variable such as price, age, temperature or any measurable quantity

Answer 28

An example of a classification problem is predicting whether an email is spam or not spam based on its content and other features

Answer 29

An example of a regression problem is predicting the price of a house based on its size, number of bedrooms, location, and other relevant features

Answer 30

The output of a classification model is a predicted class label or category for each input instance

Answer 31

The output of a regression model is a predicted numeric value for each input instance

Answer 32

Some common algorithms used for classification include decision trees, logistic regression, naive Bayes, support vector machines (SVM), and neural networks

Answer 33

Some common algorithms used for regression include linear regression, polynomial regression, decision trees, random forests, and neural networks

Answer 34

Classification deals with categorical or nominal target variables, while regression deals with continuous or numeric target variables

Answer 35

Classification predicts a class label or category for each instance, while regression predicts a numeric value for each instance

Answer 36

Yes, decision trees can be used for both classification and regression with slight variations in the algorithm

Answer 37

ŷ = θ₀ + θ₁ * x, where ŷ is the predicted target value, θ₀ is the bias term (intercept), θ₁ is the coefficient for the input attribute, and x is the input attribute value.

Answer 38

ŷ = θ₀ + θ₁ * x₁ + θ₂ * x₂ + ... + θₙ * xₙ, where ŷ is the predicted target value, θ₀ is the bias term, θ₁ to θₙ are the coefficients for the input attributes, and x₁ to xₙ are the corresponding input attribute values.

Answer 39

To calculate the predicted target value, substitute the instance's input attribute values into the linear model equation and compute the result using the given coefficients.

Answer 40

The purpose of gradient descent is to iteratively update the coefficients of the linear model to minimize the difference between the predicted target values and the actual target values

Answer 41

The general update rule for gradient descent is: θⱼ := θⱼ - α * (∂J(θ) / ∂θⱼ), where θⱼ is the j-th coefficient, α is the learning rate, and ∂J(θ) / ∂θⱼ is the partial derivative of the cost function J(θ) with respect to θⱼ.
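
The update rule can be sketched for the one-attribute model ŷ = θ₀ + θ₁ * x. The cards do not specify the cost function, so this sketch assumes the usual squared-error cost J(θ) = (1/2m) * Σ (ŷ(i) - y(i))², whose partial derivatives are the averages computed below. The data points and the learning rate are hypothetical.

```python
def gradient_step(theta0, theta1, xs, ys, alpha):
    """One batch update: theta_j := theta_j - alpha * dJ/dtheta_j."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m                             # dJ/dtheta0 under the assumed cost
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1 under the assumed cost
    return theta0 - alpha * grad0, theta1 - alpha * grad1

# Hypothetical data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

theta0, theta1 = 0.0, 0.0
for _ in range(5000):
    theta0, theta1 = gradient_step(theta0, theta1, xs, ys, alpha=0.1)

print(theta0, theta1)  # approaches the generating coefficients 1 and 2
```

Both coefficients are updated from the same set of errors, so the step uses the gradient at the current point rather than a partially updated one.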

Answer 42

To calculate the new coefficients, subtract the product of the learning rate (α) and the partial derivative of the cost function with respect to each coefficient from the current coefficient values.

Answer 43

The learning rate (α) determines the step size at which the coefficients are updated in each iteration of gradient descent. It controls the convergence speed and the stability of the algorithm.

Answer 44

If the learning rate is too small, the algorithm will converge slowly, requiring many iterations to reach the optimal coefficients.

Answer 45

If the learning rate is too large, the algorithm may overshoot the optimal coefficients and fail to converge, leading to unstable or divergent behavior.

Answer 46

The learning rate is typically chosen through experimentation or by using techniques like learning rate scheduling. It should be small enough to ensure convergence but large enough to achieve reasonable convergence speed.

Answer 47

Batch gradient descent updates the coefficients using the entire training dataset in each iteration, while stochastic gradient descent updates the coefficients using individual instances or small subsets (mini-batches) of the training dataset.

Answer 48

Logistic regression is used for binary classification problems, where the goal is to predict the probability of an instance belonging to one of two classes.

Answer 49

The hypothesis function in logistic regression is the sigmoid function, also known as the logistic function. It maps the input features to a probability value between 0 and 1.

Answer 50

The equation for the hypothesis function is: hθ(x) = 1 / (1 + e^(-z)), where z = θ₀ + θ₁ * x₁ + θ₂ * x₂ + ... + θₙ * xₙ, and θ₀ to θₙ are the coefficients (parameters) of the logistic regression model.
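
A minimal sketch of this hypothesis function, assuming the coefficients are passed as a list `theta` with the bias term θ₀ first. The coefficient values in the example are hypothetical.

```python
from math import exp

def hypothesis(theta, x):
    """h_theta(x) = 1 / (1 + e^(-z)), with z = theta0 + theta1*x1 + ... + thetan*xn."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return 1.0 / (1.0 + exp(-z))

print(hypothesis([0.0, 0.0], [5.0]))         # z = 0 gives a probability of 0.5
print(hypothesis([-1.0, 2.0, 0.5], [1.0, 2.0]))  # z = 2, so the probability is above 0.5
```

This sketch is not numerically hardened: a very large negative z would overflow `exp`.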

Answer 51

The output values of the hypothesis function in logistic regression range between 0 and 1, representing the probability of an instance belonging to the positive class.

Answer 52

The hypothesis function takes a linear combination of the input features (z) and applies the sigmoid function to map the result to a probability value. The coefficients (θ₀ to θₙ) determine the weight and impact of each feature on the predicted probability.

Answer 53

The cost function in logistic regression measures the difference between the predicted probabilities and the actual class labels. It is used to evaluate the performance of the logistic regression model and guide the optimization of the coefficients.

Answer 54

The equation for the cost function is: J(θ) = -(1/m) * Σ [y(i) * log(hθ(x(i))) + (1 - y(i)) * log(1 - hθ(x(i)))], where m is the number of instances, y(i) is the actual class label (0 or 1) of the i-th instance, and hθ(x(i)) is the predicted probability for the i-th instance.
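
The cost formula can be sketched directly, assuming the predicted probabilities have already been computed and lie strictly between 0 and 1 (a probability of exactly 0 or 1 would make the logarithm undefined). The example values are hypothetical.

```python
from math import log

def logistic_cost(probabilities, labels):
    """J = -(1/m) * sum of [y * log(h) + (1 - y) * log(1 - h)]."""
    m = len(labels)
    total = sum(y * log(p) + (1 - y) * log(1 - p)
                for p, y in zip(probabilities, labels))
    return -total / m

labels = [1, 0]
confident_right = logistic_cost([0.9, 0.1], labels)  # predictions close to the labels
confident_wrong = logistic_cost([0.1, 0.9], labels)  # predictions far from the labels

print(confident_right, confident_wrong)
```

The second cost is much larger than the first, matching the idea that predictions far from the actual labels are penalized heavily.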

Answer 55

The cost function assigns a high cost when the predicted probability is far from the actual class label. For example, if the actual class label is 1 and the predicted probability is close to 0, the cost will be high, indicating a misclassification.

Answer 56

The goal of minimizing the cost function in logistic regression is to find the optimal values for the coefficients (θ₀ to θₙ) that minimize the difference between the predicted probabilities and the actual class labels, thereby improving the accuracy of the logistic regression model.

Answer 57

The coefficients in logistic regression are typically updated using optimization algorithms such as gradient descent. The algorithms iteratively adjust the coefficients based on the gradients of the cost function to minimize the cost and improve the model's performance.

Answer 58

Entropy is a measure of impurity or uncertainty in a set of examples. It quantifies the average amount of information needed to classify an example in the set.

Answer 59

Entropy is calculated using the formula: Entropy(S) = -Σ p(c) * log2(p(c)), where S is the set of examples, c is a class label, and p(c) is the proportion of examples in S belonging to class c.

Answer 60

An entropy value of 0 indicates that the set of examples is completely homogeneous, meaning all examples belong to the same class. There is no impurity or uncertainty in the set

Answer 61

A high entropy value indicates that the set of examples is highly impure or uncertain. The examples are evenly distributed among different classes, making it difficult to classify them accurately.

Answer 62

Information gain is a measure of the reduction in entropy achieved by splitting a set of examples based on a particular feature. It quantifies how much the feature helps in reducing the impurity or uncertainty of the set.

Answer 63

Information gain is calculated using the formula: Gain(S, A) = Entropy(S) - Σ ((|Sv| / |S|) * Entropy(Sv)), where S is the set of examples, A is the feature, Sv is the subset of examples in S with A=v, and |Sv| and |S| are the cardinalities of Sv and S, respectively.

Answer 64

A high information gain value indicates that the feature is effective in reducing the impurity or uncertainty of the set of examples. Splitting the set based on this feature leads to a significant decrease in entropy.

Answer 65

The feature with the highest information gain value is selected as the best feature for splitting the set of examples at a particular node in the decision tree. This feature provides the most informative split and reduces the impurity the most.

Answer 66

Gain ratio is a modification of information gain that addresses the bias towards features with many distinct values. It normalizes the information gain by considering the intrinsic information of the feature.

Answer 67

Gain ratio is calculated using the formula: GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), where Gain(S, A) is the information gain and SplitInfo(S, A) is the intrinsic information of the feature A, calculated as SplitInfo(S, A) = -Σ ((|Sv| / |S|) * log2(|Sv| / |S|)).
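
A sketch of the gain ratio, using the fact that SplitInfo(S, A) has the same form as the entropy of the attribute's own values. Returning 0 when SplitInfo is 0 (an attribute with a single value) is an assumption of this sketch, and the data below is hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(items):
    total = len(items)
    return sum(-(n / total) * log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    gain = entropy(labels) - sum(len(g) / len(labels) * entropy(g)
                                 for g in groups.values())
    split_info = entropy(values)  # -sum of (|Sv|/|S|) * log2(|Sv|/|S|)
    return gain / split_info if split_info > 0 else 0.0

labels = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rainy", "rainy"]  # two values, perfect split
row_id = [1, 2, 3, 4]                           # unique per row, also a "perfect" split

print(gain_ratio(outlook, labels))
print(gain_ratio(row_id, labels))
```

Both attributes have the same information gain here, but the unique identifier has a larger SplitInfo, so its gain ratio is lower.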

Answer 68

Gain ratio helps mitigate the bias towards features with many distinct values. It penalizes features that split the set into many small subsets, even if they have high information gain. This helps in selecting more balanced and informative splits.

Answer 69

The feature with the highest gain ratio value is selected as the best feature for splitting the set of examples at a particular node in the decision tree. This feature provides a good balance between information gain and the number of distinct values.

Answer 70

A normally distributed random variable is a variable whose values follow a bell-shaped curve called the normal distribution. The distribution is symmetric, and the mean, median, and mode are equal.

Answer 71

The normal distribution is characterized by two parameters: the mean (µ) and the standard deviation (σ). The mean determines the center of the distribution, and the standard deviation determines the spread of the values.

Answer 72

A z-score is a measure of how many standard deviations a particular value is away from the mean of the distribution. It standardizes the values of a normal distribution to have a mean of 0 and a standard deviation of 1.

Answer 73

The z-score is calculated using the formula: z = (X - µ) / σ, where X is the value of interest, µ is the mean of the distribution, and σ is the standard deviation.

Answer 74

A standard z-table, also known as a standard normal table, is a statistical table that provides the probability of a z-score falling within a certain range in a standard normal distribution (mean = 0, standard deviation = 1).

Answer 75

To find the probability of X falling within a given range, follow these steps: Convert the endpoints of the range into z-scores using the formula z = (X - µ) / σ. Look up the probability associated with each z-score in the standard z-table. If the table lists cumulative probabilities P(Z ≤ z), subtract the probability of the lower z-score from the probability of the upper z-score. If the table instead lists the area between the mean and z, add the two areas when the range straddles the mean, and subtract the smaller area from the larger when both endpoints lie on the same side of the mean.

Answer 76

To find P(X ≤ a), where a is a specific value, calculate the z-score for a using z = (a - µ) / σ and look up the corresponding probability in the standard z-table.

Answer 77

To find P(X > a), where a is a specific value, calculate the z-score for a using z = (a - µ) / σ, look up the corresponding probability in the standard z-table, and subtract it from 1.

Answer 78

To find P(a < X < b), where a and b are specific values, calculate the z-scores for a and b using z = (a - µ) / σ and z = (b - µ) / σ, respectively. Look up the corresponding probabilities in the standard z-table and subtract the probability of the lower z-score from the probability of the higher z-score.
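
In place of a printed table, the cumulative probability for a z-score can be computed from the error function in Python's standard library. This is a sketch; the mean of 100 and standard deviation of 15 are hypothetical values chosen for the example.

```python
from math import erf, sqrt

def phi(z):
    """Cumulative probability P(Z <= z) for a standard normal Z, as a z-table lists."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_between(a, b, mu, sigma):
    """P(a < X < b): convert both endpoints to z-scores, then subtract."""
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    return phi(z_b) - phi(z_a)

# Hypothetical distribution: mean 100, standard deviation 15.
print(prob_between(85, 115, mu=100, sigma=15))  # within one standard deviation of the mean
print(1.0 - phi((130 - 100) / 15))              # P(X > 130), the upper tail
```

The first result is the familiar figure of roughly 68% for values within one standard deviation of the mean.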

Answer 79

The total area under the standard normal curve is equal to 1. This means that the sum of the probabilities of all possible values of a standard normal random variable is 1.

Answer 80

An ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classifier system. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds.

Answer 81

An ROC curve represents the trade-off between sensitivity (True Positive Rate) and specificity (1 - False Positive Rate) of a binary classifier. It shows how well the classifier can distinguish between the positive and negative classes.

Answer 82

The x-axis of an ROC curve represents the False Positive Rate (FPR), and the y-axis represents the True Positive Rate (TPR). Both axes range from 0 to 1.

Answer 83

The True Positive Rate (TPR), also known as sensitivity or recall, is the proportion of actual positive instances that are correctly classified as positive by the classifier. It is calculated as TPR = TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives.

Answer 84

The False Positive Rate (FPR) is the proportion of actual negative instances that are incorrectly classified as positive by the classifier. It is calculated as FPR = FP / (FP + TN), where FP is the number of false positives and TN is the number of true negatives.

Answer 85

To generate an ROC curve, follow these steps: Obtain the predicted probabilities of the positive class for each instance in the dataset. Sort the instances based on their predicted probabilities in descending order. Iterate through different classification thresholds from the highest probability to the lowest. At each threshold, calculate the TPR and FPR based on the classified instances. Plot the TPR against the FPR at each threshold to create the ROC curve.
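
The steps above can be sketched as follows, lowering the threshold past one instance at a time. The sketch assumes labels are 1 (positive) or 0 (negative), that both classes are present, and that no two scores are tied; the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per threshold, from highest score to lowest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    for _score, label in ranked:
        if label == 1:
            tp += 1        # this instance is now a true positive
        else:
            fp += 1        # this instance is now a false positive
        points.append((fp / negatives, tp / positives))
    return points

scores = [0.9, 0.8, 0.4, 0.3]  # hypothetical predicted probabilities
labels = [1, 0, 1, 0]          # hypothetical actual classes

for fpr, tpr in roc_points(scores, labels):
    print(fpr, tpr)
```

The curve always starts at (0, 0) and ends at (1, 1), where every instance is predicted positive.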

Answer 86

Each point on the ROC curve represents a specific classification threshold. The x-coordinate of the point represents the FPR, and the y-coordinate represents the TPR at that threshold.

Answer 87

The ideal point on an ROC curve is the top-left corner, where the TPR is 1 and the FPR is 0. This point represents a perfect classifier that correctly classifies all positive instances without any false positives.

Answer 88

A diagonal line on an ROC curve represents a random classifier, which performs no better than random guessing. Points above the diagonal line indicate better-than-random performance, while points below the diagonal line indicate worse-than-random performance.

Answer 89

To compare the performance of different classifiers using ROC curves, plot the ROC curves for each classifier on the same graph. The classifier with the ROC curve closest to the top-left corner (higher TPR and lower FPR) is considered to have better performance.

Answer 90

The Area Under the Curve (AUC) is a single scalar value that summarizes the overall performance of a classifier. It represents the probability that a randomly chosen positive instance will be ranked higher than a randomly chosen negative instance by the classifier. A perfect classifier has an AUC of 1, while a random classifier has an AUC of 0.5.
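
The ranking interpretation of AUC can be checked directly by comparing every positive instance with every negative one. Counting a tied pair as half a win is a common convention that this sketch assumes; the scores and labels are hypothetical.

```python
def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]

print(auc_by_ranking(scores, labels))  # 3 of the 4 pairs are ranked correctly
```

A classifier that ranks every positive above every negative scores 1 under this definition.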

Answer 91

A Naive Bayes classifier is a probabilistic machine learning algorithm used for classification tasks. It is based on Bayes' theorem and assumes that the features (attributes) of the input data are conditionally independent given the class label.

Answer 92

The main principle behind Naive Bayes classification is to calculate the posterior probability of each class given the input features and predict the class with the highest posterior probability.

Answer 93

Bayes' theorem describes the probability of an event based on prior knowledge of conditions related to the event. It is stated as: P(A|B) = (P(B|A) * P(A)) / P(B), where A and B are events, P(A|B) is the conditional probability of A given B, P(B|A) is the conditional probability of B given A, P(A) is the prior probability of A, and P(B) is the prior probability of B.

Answer 94

A Naive Bayes classifier calculates the posterior probability of a class using the following formula: P(Class|Features) = (P(Features|Class) * P(Class)) / P(Features), where P(Class|Features) is the posterior probability of the class given the input features, P(Features|Class) is the likelihood of the features given the class, P(Class) is the prior probability of the class, and P(Features) is the prior probability of the features.

Answer 95

The basic assumption used by a Naive Bayes classifier is the conditional independence assumption. It assumes that the features (attributes) of the input data are conditionally independent of each other given the class label. In other words, the presence or absence of a particular feature does not depend on the presence or absence of any other feature, given the class.

Answer 96

The conditional independence assumption allows the Naive Bayes classifier to simplify the calculation of P(Features|Class) by breaking it down into the product of individual conditional probabilities for each feature given the class. Instead of considering the joint probability of all features, it assumes that the features are independent, so P(Features|Class) = P(Feature1|Class) * P(Feature2|Class) * ... * P(FeatureN|Class).

Answer 97

The conditional independence assumption has several advantages: It simplifies the calculations and reduces the computational complexity of the classifier. It allows the classifier to handle high-dimensional data with many features efficiently. It makes the training process faster and requires less training data compared to other classifiers.

Answer 98

The conditional independence assumption has some limitations: In real-world scenarios, features may not always be conditionally independent, leading to suboptimal performance. The assumption may not hold for all datasets, especially those with strong correlations or dependencies between features. The classifier may be sensitive to irrelevant or redundant features, which can impact its accuracy.

Answer 99

A Naive Bayes classifier can handle continuous features by assuming a probability distribution for each continuous feature given the class. Common probability distributions used are Gaussian (normal) distribution for real-valued features and multinomial or Bernoulli distribution for discrete features. The parameters of these distributions (mean and variance for Gaussian, probabilities for multinomial/Bernoulli) are estimated from the training data.

Answer 100

Naive Bayes classifiers are commonly used in various applications, such as: text classification (e.g., spam email detection, sentiment analysis); document categorization; medical diagnosis; credit risk assessment; and multi-class classification problems.

Answer 101

The goal of a Naive Bayes classifier is to predict the most likely class label for a new instance based on the calculated posterior probabilities for each class, given the instance's attribute values.

Answer 102

To classify a new instance using a Naive Bayes classifier, you need: A small dataset with class labels and attribute values. The conditional probabilities for each attribute-value pair given each class. The attribute values of the new instance to be classified.

Answer 103

To calculate the posterior probability for a class given a new instance, use the following formula: P(Class|Instance) = P(Class) * P(Instance|Class) / P(Instance), where P(Class) is the prior probability of the class, P(Instance|Class) is the likelihood of the instance given the class, and P(Instance) is the prior probability of the instance.

Answer 104

To calculate the likelihood of an instance given a class, use the conditional independence assumption and multiply the conditional probabilities of each attribute-value pair given the class: P(Instance|Class) = P(Attribute1|Class) * P(Attribute2|Class) * ... * P(AttributeN|Class), where P(AttributeX|Class) is the conditional probability of the attribute-value pair for AttributeX given the class.

Answer 105

The prior probability of a class is the probability of the class occurring in the dataset, calculated as the number of instances belonging to the class divided by the total number of instances in the dataset.

Answer 106

The prior probability of an instance, P(Instance), is the same for every class being compared, so it can be omitted from the calculations since it does not affect the relative probabilities of the classes.

Answer 107

To classify a new instance, calculate a score proportional to the posterior probability for each class using P(Class|Instance) ∝ P(Class) * P(Instance|Class). Choose the class with the highest score as the predicted class for the new instance.
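
A minimal sketch of this procedure on a small categorical dataset, estimating every probability by counting. The dataset, the attribute names, and the function name `naive_bayes_scores` are hypothetical, and no smoothing is applied.

```python
from collections import Counter

def naive_bayes_scores(rows, labels, instance):
    """Return P(Class) * product of P(attribute value | Class) for each class."""
    scores = {}
    for cls, count in Counter(labels).items():
        score = count / len(labels)                   # prior P(Class)
        members = [r for r, y in zip(rows, labels) if y == cls]
        for attribute, value in instance.items():
            matches = sum(1 for r in members if r[attribute] == value)
            score *= matches / count                  # P(value | Class)
        scores[cls] = score
    return scores

rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
]
labels = ["yes", "no", "yes", "no", "yes"]

scores = naive_bayes_scores(rows, labels, {"outlook": "sunny", "windy": "no"})
prediction = max(scores, key=scores.get)  # class with the highest score
print(scores, prediction)
```

The class "no" scores exactly 0 here because "windy = no" never occurs with it in the toy data, which is the zero-probability situation that smoothing is meant to address.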

Answer 108

If the conditional probability for an attribute-value pair is zero, it means that the particular attribute value has not been observed with the given class in the training dataset. To avoid zero probabilities, you can apply smoothing techniques such as Laplace smoothing (adding a small constant to the counts) to assign non-zero probabilities to unseen attribute-value pairs.
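
As a sketch of Laplace (add-one) smoothing, the estimate below adds a constant k to each count and k times the number of possible attribute values to the denominator, so the smoothed probabilities for one attribute still sum to 1. The function name and the counts are hypothetical.

```python
def smoothed_probability(match_count, class_count, num_values, k=1):
    """(match_count + k) / (class_count + k * num_values); k = 1 is add-one smoothing."""
    return (match_count + k) / (class_count + k * num_values)

# Hypothetical counts: a class with 2 instances and an attribute with 2 possible values.
unseen = smoothed_probability(0, 2, 2)  # value never observed with this class
seen = smoothed_probability(2, 2, 2)    # value observed in both instances

print(unseen, seen)
```

The unseen value now receives a small non-zero probability instead of forcing the whole product to 0.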

Answer 109

Yes, a Naive Bayes classifier can handle missing attribute values by simply ignoring the attribute when calculating the likelihood of an instance given a class. The conditional probability for the missing attribute is not included in the product of conditional probabilities.

Answer 110

K-fold cross-validation is a resampling technique used to evaluate the performance of a machine learning model. It involves splitting the dataset into k equally sized subsets (folds), training and evaluating the model k times, each time using a different fold as the validation set and the remaining folds as the training set.
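
The splitting step can be sketched with plain index lists. This version assigns consecutive indices to each fold and does not shuffle, which is an assumption of the sketch; in practice the data is usually shuffled first. When n is not divisible by k, the first folds receive one extra instance.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n))
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread any remainder over early folds
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 5))
for train, validation in folds:
    print(validation, train)
```

Each index appears in exactly one validation fold, so every instance is used for validation once and for training k - 1 times.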

Answer 111

The common choice for the value of k in k-fold cross-validation is 5 or 10. A value of k=5 is often used as a good compromise between computational efficiency and reducing bias in the performance estimate. However, the choice of k can depend on the size of the dataset and the specific problem at hand.

Answer 112

When k is equal to the number of instances in the dataset, the k-fold cross-validation process becomes leave-one-out cross-validation (LOOCV). In LOOCV, each instance is used as the validation set once, and the model is trained on the remaining instances. This approach provides a nearly unbiased estimate of the model's performance, although the estimate can have high variance, and it can be computationally expensive for large datasets.

Answer 113

K-fold cross-validation can handle imbalanced datasets by using stratified k-fold cross-validation. In stratified k-fold cross-validation, the folds are created in a way that preserves the class distribution of the original dataset. This ensures that each fold has a representative proportion of instances from each class, mitigating the impact of class imbalance on the performance evaluation.

Answer 114

The advantages of using k-fold cross-validation include: More reliable performance estimate compared to a single train-test split. Reduced overfitting and bias in the performance evaluation. Better assessment of the model's generalization ability on unseen data. Provides a measure of the model's stability and consistency across different subsets of the data.

Answer 115

The limitations of k-fold cross-validation include: Increased computational overhead compared to a single train-test split, as the model needs to be trained and evaluated k times. May not be suitable for very large datasets due to the computational cost. The performance estimate can still have some variance, especially for small values of k. The choice of k can impact the bias-variance trade-off in the performance estimate.

Answer 116

The results of k-fold cross-validation can be interpreted as follows: The average performance metric across all k iterations provides an estimate of the model's expected performance on unseen data. The standard deviation or variance of the performance metric across the k iterations indicates the model's stability and consistency. If the performance metric is consistently high across all folds, it suggests that the model is robust and generalizes well. If there is a large variation in the performance metric across folds, it may indicate that the model is sensitive to the specific data split or has high variance.

Answer 117

A confusion matrix is a table that summarizes the performance of a classification model by comparing the predicted class labels with the actual class labels. It shows the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for each class.

Answer 118

True Positives (TP): The number of instances correctly predicted as positive by the classifier. True Negatives (TN): The number of instances correctly predicted as negative by the classifier. False Positives (FP): The number of instances incorrectly predicted as positive by the classifier. False Negatives (FN): The number of instances incorrectly predicted as negative by the classifier.

Answer 119

Accuracy = (TP + TN) / (TP + TN + FP + FN). Accuracy measures the overall correctness of the classifier's predictions. It represents the proportion of instances that are correctly classified.

Answer 120

True Positive Rate (TPR) or Recall = TP / (TP + FN). TPR or recall measures the proportion of actual positive instances that are correctly predicted as positive by the classifier. It represents the classifier's ability to identify positive instances.

Answer 121

Error Rate = (FP + FN) / (TP + TN + FP + FN). Error rate measures the overall misclassification rate of the classifier. It represents the proportion of instances that are incorrectly classified. The error rate is the complement of accuracy.

Answer 122

False Positive Rate (FPR) = FP / (FP + TN). FPR measures the proportion of actual negative instances that are incorrectly predicted as positive by the classifier. It represents the classifier's tendency to produce false alarms.

Answer 123

Precision = TP / (TP + FP). Precision measures the proportion of instances predicted as positive that are actually positive. It represents the classifier's ability to avoid false positives.

Answer 124

Specificity or True Negative Rate (TNR) = TN / (TN + FP). Specificity or TNR measures the proportion of actual negative instances that are correctly predicted as negative by the classifier. It represents the classifier's ability to identify negative instances.

Answer 125

F1 Score = 2 * (Precision * Recall) / (Precision + Recall). The F1 score is the harmonic mean of precision and recall. It provides a balanced measure of the classifier's performance, considering both precision and recall equally.
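
The confusion-matrix metrics from the preceding cards can be gathered into one sketch. The counts are hypothetical, and the function assumes none of the denominators is zero.

```python
def metrics(tp, tn, fp, fn):
    """Compute the standard metrics from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,              # also TPR or sensitivity
        "fpr": fp / (fp + tn),
        "specificity": tn / (tn + fp), # also TNR
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for 100 instances.
result = metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in result.items():
    print(name, round(value, 4))
```

Accuracy and error rate sum to 1, as do specificity and the false positive rate, which gives a quick consistency check on any hand calculation.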

Answer 126

Sensitivity and specificity are inversely related. As the classifier's sensitivity increases, its specificity typically decreases, and vice versa. This trade-off is often represented by the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate at different classification thresholds.

Answer 127

To compare the performance of different classifiers using confusion matrices: Calculate the relevant metrics (accuracy, precision, recall, F1 score, etc.) for each classifier based on their respective confusion matrices. Compare the metrics side by side to assess which classifier performs better overall or in specific aspects (e.g., higher accuracy, better balance between precision and recall). Consider the specific requirements and priorities of the problem domain when evaluating the classifiers' performance.


(151 cards)