Test 1 Flashcards
What does the model or hypothesis represent in a linear model?
A real-valued function from the instances to some target attribute
Each training instance can be represented as what?
A row vector x = <x1, x2, ..., xk>
In the linear model’s equation, each θj is a what?
A real valued constant (weight)
In the linear model’s equation, hθ(x) is what?
The estimated value of y for instance x
What really determines the linear model’s function?
The values we choose for each of the weights
For any training instance x the sum is what?
The dot product of the weight vector and the training instance (θ · x)
hθ(x) defines a k-dimensional what?
Hyperplane
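A minimal sketch of evaluating the hypothesis as a dot product with NumPy; the weights and feature values below are made-up examples, and x0 = 1 is assumed as the usual bias term.

```python
import numpy as np

# Hypothetical weights theta = <theta0, theta1, theta2> and one instance x.
# x0 = 1 is the conventional bias/intercept feature.
theta = np.array([0.5, 2.0, -1.0])
x = np.array([1.0, 3.0, 4.0])

# h_theta(x) is simply the dot product theta . x
h = np.dot(theta, x)
print(h)   # 0.5*1 + 2.0*3 + (-1.0)*4 = 2.5
```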
What is the residual?
The difference between the actual and predicted values for an instance, y(i) - hθ(x(i)); the cost function J sums the squared residuals (y(i) - hθ(x(i)))^2
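A small sketch of the squared-error cost over a training set, assuming the common mean-squared-error form J(θ) = 1/(2m) Σ (hθ(x(i)) - y(i))^2; the 1/(2m) scaling is a convention, not something stated on the card.

```python
import numpy as np

def cost_J(theta, X, y):
    """Squared-error cost: half the mean of the squared residuals."""
    m = len(y)
    residuals = X @ theta - y          # h_theta(x(i)) - y(i) for every instance i
    return (residuals ** 2).sum() / (2 * m)

# Tiny made-up data set; the first column of X is the constant x0 = 1 feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(cost_J(np.array([1.0, 1.0]), X, y))   # 0.0, since y = 1 + x fits exactly
```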
What is defined over the space of possible values for θ?
The error surface: the value of J(θ) at each point in the weight space
What does the gradient vector ∇J at a given point represent?
The direction of the greatest rate of increase in J at the point
What does the jth component of the gradient vector ∇J represent?
The slope of the error surface in the jth dimension at that point (the partial derivative ∂J/∂θj)
What is α?
A small real valued constant (learning rate)
If the gradient vector ∇J at a given point is 0 what does this mean?
No further updates occur, as a local minimum of J(θ) has been reached; gradient descent stops at this point
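A minimal batch gradient-descent loop for the linear model above; the learning rate, iteration count, and data are illustrative choices, not prescribed values.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=500):
    """Repeatedly step theta opposite the gradient of J until (near) convergence."""
    m, k = X.shape
    theta = np.zeros(k)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of J(theta)
        theta -= alpha * grad              # theta_j := theta_j - alpha * dJ/dtheta_j
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(gradient_descent(X, y))   # approaches [1.0, 1.0], i.e. y = 1 + x
```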
In the context of a linear regression the cost function of J is what?
A convex function
If J is convex what does this mean?
There is only one minimum and gradient descent can safely be used to find it
If the original function to be learned is not linear, will gradient descent still work?
Not reliably: there may be many local minima, and you are not guaranteed to find the global minimum.
What is Batch Gradient Descent?
All instances in the data set are examined before updates are made
What is Stochastic Gradient Descent?
A randomly chosen instance (or a small random sample) is used for each update instead of the entire data set
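A sketch contrasting the two update rules; sampling a single instance per stochastic update is an assumption about batch size (small mini-batches are also common).

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_step(theta, X, y, alpha):
    """Batch GD: examine every instance before making one update."""
    grad = X.T @ (X @ theta - y) / len(y)
    return theta - alpha * grad

def stochastic_step(theta, X, y, alpha):
    """SGD: update from a single randomly chosen instance."""
    i = rng.integers(len(y))
    grad = (X[i] @ theta - y[i]) * X[i]
    return theta - alpha * grad
```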
What are the benefits of stochastic gradient descent?
The error is reduced more quickly
What are the downsides of using stochastic gradient descent?
You may not get the minimum but only an approximation
If a good value for α is chosen then J should what?
Decrease with each iteration.
If α is too large what may happen?
J might not converge; it may increase without bound or oscillate between points.
If α is too small what may happen?
Gradient descent might take a very long time to converge
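A tiny numeric illustration on the one-dimensional cost J(θ) = θ^2, whose gradient is 2θ; the specific α values are chosen only to show divergence versus very slow progress.

```python
def run(alpha, steps=5, theta=1.0):
    """Follow theta := theta - alpha * J'(theta) for J(theta) = theta**2."""
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

print(run(alpha=1.5))    # |theta| doubles every step (2, 4, 8, ...): J diverges
print(run(alpha=0.001))  # theta barely moves after 5 steps: very slow convergence
print(run(alpha=0.5))    # theta jumps straight to the minimum at 0
```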
What is typically used to scale the inputs?
The standard score: each xj is replaced by (xj - μj) / σj, where μj and σj are the mean and standard deviation of attribute j
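A sketch of scaling inputs to their standard scores; the column means and standard deviations are computed from the (made-up) training data itself.

```python
import numpy as np

def standardize(X):
    """Replace each feature xj with (xj - mean_j) / std_j."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])
print(standardize(X))   # every column now has mean 0 and unit standard deviation
```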
How do you calculate the entropy of a decision tree?
Entropy = -Σ (ElementsInClass / TotalElementsInSet) * log2(ElementsInClass / TotalElementsInSet), summed over each class in the set
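A small sketch of that calculation from a list of class labels; the example labels are the classic 9-positive / 5-negative split.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes of p(c) * log2(p(c))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))   # ~0.940
```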
How do you calculate the Information Gain of a decision tree split?
Subtract from the entropy of the parent node the weighted average entropy of the nodes in the split. For each node in the split, compute its entropy and weight it by the fraction of the parent's instances that fall into that node; add these weighted entropies together and subtract the total from the entropy of the parent node.
How do you calculate the gain ratio?
The gain ratio is information gain divided by intrinsic information. Intrinsic information is -Σ (|Sv| / |S|) * log2(|Sv| / |S|), where |Sv| is the number of instances in child node v, and |S| is the total number of instances in the parent node.
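A sketch of information gain and gain ratio for a single candidate attribute, assuming the data is given as (attribute value, class label) pairs; the entropy helper repeats the formula above.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(pairs):
    """Gain = Entropy(parent) - weighted average entropy of the child nodes."""
    parent = [label for _, label in pairs]
    children = defaultdict(list)
    for value, label in pairs:
        children[value].append(label)
    weighted = sum(len(c) / len(pairs) * entropy(c) for c in children.values())
    return entropy(parent) - weighted

def gain_ratio(pairs):
    """Gain ratio = information gain / intrinsic information of the split."""
    # The intrinsic information is the entropy of the attribute's value
    # distribution: -sum (|Sv|/|S|) * log2(|Sv|/|S|).
    intrinsic = entropy([value for value, _ in pairs])
    return information_gain(pairs) / intrinsic
```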
What phenomenon does the use of the gain ratio attempt to overcome in the context of decision tree construction?
Information gain naturally favors attributes that split the data set into many disjoint sets, each containing only a few members. This, however, tends to generalize poorly to new data (i.e., the training data is overfitted). The gain ratio attempts to counter this.
What is the heuristic for deciding how to build a decision tree?
Pick the attribute which will result in the least amount of impurity (entropy) in the leaf nodes
What is a decision tree?
A rooted tree used to classify instances based on their attributes
What does each branch from a node indicate?
Possible values for an attribute
For nominal attributes a node will have what?
A child for each possible value
For numeric attributes a node will have what?
Children defined by some quantization of the attribute (e.g., a cutoff value splitting instances into those below and those above a threshold)
What is the main idea of a decision tree?
That as one travels down a branch from the root, the set of classes an instance can match gets smaller and smaller
Can the ideal (that a finished tree will correctly classify any given instance) always be achieved?
No. If a data set contains two distinct instances with identical values for the input attributes but different classes, there is no way to classify them deterministically into distinct classes.
What does ID3 use to select the best attribute for splitting at each node?
ID3 uses information gain or gain ratio to select the best attribute for splitting at each node.
How is entropy calculated for a set of instances?
Entropy(S) = -Σ p(c) * log2(p(c)), where p(c) is the proportion of instances belonging to class c in the set S.
What does an entropy of 0 indicate?
An entropy of 0 indicates perfect purity, meaning all instances in the set belong to the same class.
What does maximum entropy indicate?
Maximum entropy indicates high impurity, meaning instances are equally distributed among classes.
How is information gain calculated for an attribute A?
Gain(S, A) = Entropy(S) - Σ ((|Sv| / |S|) * Entropy(Sv)), where Sv is the subset of instances in S with attribute A having value v.
What is the purpose of calculating information gain?
Information gain measures the reduction in entropy achieved by splitting the instances based on an attribute. The attribute with the highest information gain is considered the best split attribute.
What is gain ratio, and why is it used?
Gain ratio is an extension of information gain that addresses the bias towards attributes with many values. It is calculated by dividing the information gain by the intrinsic information of the attribute.
How does ID3 handle an unlabelled node during the tree extension process?
For an unlabelled node, ID3 calculates the information gain or gain ratio for each attribute and selects the best attribute for splitting the instances at that node.
What happens after the best attribute is selected for splitting?
After selecting the best attribute, ID3 creates child nodes based on the possible values of the selected attribute and assigns the corresponding instances to each child node.
When does the recursive process of extending the tree stop?
The recursive process stops when one of the following conditions is met:
All instances in a node belong to the same class (pure node).
There are no more attributes to split on.
There are no more instances to split.
How are class labels assigned to the leaf nodes?
For each leaf node, the majority class label among the instances in that node is assigned. If there are no instances in a leaf node (empty node), the majority class label of its parent node is assigned.
How can the resulting decision tree be used to classify new instances?
To classify a new instance, traverse the decision tree from the root node to a leaf node based on the attribute values of the instance. The class label associated with the reached leaf node is assigned to the new instance.
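A compact ID3-style sketch under the assumptions above: categorical attributes only, information gain as the split criterion, and majority-vote leaves. Unseen attribute values at classification time fall back to a caller-supplied default, a simplification of the parent-majority rule on the card.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting these rows on one categorical attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

def id3(rows, labels, attrs):
    """Return a nested-dict decision tree, or a bare class label for a leaf."""
    if len(set(labels)) == 1:                 # pure node: all one class
        return labels[0]
    if not attrs:                             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    tree = {}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        tree[value] = id3([rows[i] for i in keep],
                          [labels[i] for i in keep],
                          [a for a in attrs if a != best])
    return {best: tree}

def classify(tree, row, default):
    """Walk from the root to a leaf using the instance's attribute values."""
    while isinstance(tree, dict):
        attr, children = next(iter(tree.items()))
        if row.get(attr) not in children:
            return default                    # unseen value: fall back to a default
        tree = children[row[attr]]
    return tree

# Made-up toy data: predict whether to play from the outlook alone.
rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
tree = id3(rows, labels, ["outlook"])
print(tree)                                              # {'outlook': {'sunny': 'no', 'rain': 'yes'}}
print(classify(tree, {"outlook": "rain"}, default="no"))  # yes
```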
What is the main goal of classification?
The main goal of classification is to predict a categorical or nominal target variable, assigning instances to predefined classes or categories
What is the main goal of regression?
The main goal of regression is to predict a continuous or numeric target variable, estimating the relationship between input features and the target variable
What type of target variable does classification predict?
Classification predicts a categorical or nominal target variable, such as binary or multi-class outcomes
What type of target variable does regression predict?
Regression predicts a continuous or numeric target variable such as price, age, temperature or any measurable quantity
Give an example of a classification problem
An example of a classification problem is predicting whether an email is spam or not spam based on its content and other features
Give an example of a regression problem
An example of a regression problem is predicting the price of a house based on its size, number of bedrooms, location, and other relevant features
What is the output of a classification model?
The output of a classification model is a predicted class label or category for each input instance
What is the output of a regression model?
The output of a regression model is a predicted numeric value for each input instance
What are some common algorithms used for classification?
Some common algorithms used for classification include decision trees, logistic regression, naive Bayes, support vector machines (SVM), and neural networks
What are some common algorithms used for regression?
Some common algorithms used for regression include linear regression, polynomial regression, decision trees, random forests, and neural networks
How do classification and regression differ in terms of the nature of the target variable?
Classification deals with categorical or nominal target variables, while regression deals with continuous or numeric target variables
How do classification and regression differ in terms of the predicted output?
Classification predicts a class label or category for each instance, while regression predicts a numeric value for each instance
Can a decision tree be used for both classification and regression?
Yes, decision trees can be used for both classification and regression with slight variations in the algorithm
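A brief illustration with scikit-learn (assuming it is installed); the tiny data sets are made up, and the point is only that the same tree-based approach comes in both a classifier and a regressor form.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1.0], [2.0], [3.0], [4.0]]

# Classification: the target is a nominal class label.
clf = DecisionTreeClassifier().fit(X, ["no", "no", "yes", "yes"])
print(clf.predict([[3.5]]))     # ['yes']

# Regression: the target is a numeric value.
reg = DecisionTreeRegressor().fit(X, [1.5, 2.5, 3.5, 4.5])
print(reg.predict([[3.5]]))     # a numeric prediction close to the nearby targets
```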