Final Flashcards
hyperparameter and examples
a parameter whose value is used to control the learning process
ex) batch size, number of epochs
parameter grid
specifies the search space: the set of hyperparameter combinations to try
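A minimal sketch of a parameter grid with scikit-learn's GridSearchCV; the SVC estimator, the grid values, and the iris dataset are illustrative choices, not from the cards:

```python
# Sketch: grid search tries every combination of hyperparameter values in the grid.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The parameter grid defines the search space (illustrative values).
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold cross-validation per combination
search.fit(X, y)
print(search.best_params_)
```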
probabilistic graphical models
graphical representations of probability distributions in which variables can depend on other variables
what are the benefits of graphical models?
learning dependencies, visualizing a probability model, graphical manipulations over latent variables, obtaining insights (like conditional independence)
conditional independence
2 events A and B are conditionally independent given a 3rd event C if, once C is known to have occurred, the occurrence of A and the occurrence of B are independent events.
How many types of probabilistic graphical models are there and what are they?
2 types: Bayesian Networks, Markov Networks
What is the difference between Bayesian Networks and Markov Networks?
Bayesian Networks have directed graphs and Markov Networks have undirected graphs
Bayesian network
directed edges between nodes that describe conditional dependencies
ex) sprinkler, rain, grass wet
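A minimal sketch of the sprinkler network's chain-rule factorization in plain Python; the conditional probability tables below are made-up numbers for illustration:

```python
# Sketch: the network factorizes the joint as P(R, S, W) = P(R) * P(S | R) * P(W | S, R).
# All probabilities below are illustrative, not from the cards.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},   # P(S | R=True)
               False: {True: 0.4, False: 0.6}}    # P(S | R=False)
P_wet = {(True, True): 0.99, (True, False): 0.9,  # P(W=True | S, R)
         (False, True): 0.8, (False, False): 0.0}

# Marginal P(grass wet) by summing the joint over rain and sprinkler (sum rule).
p_wet = sum(P_rain[r] * P_sprinkler[r][s] * P_wet[(s, r)]
            for r in (True, False) for s in (True, False))
print(p_wet)
```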
joint probability
Probability of 2 or more events happening at the same time. This uses the product/chain rule.
ex) Probability that a card drawn is red and a 4
marginal probability
probability of an event irrespective of the outcome of another variable (an unconditional probability). This is the probability of a single event, and it uses the sum rule.
ex) Probability that a card drawn is red
conditional probability
probability of one event given that one or more other events have occurred
ex) given that we drew a red card, what is the probability that it is a 4
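A worked sketch tying the three probabilities together on a standard 52-card deck:

```python
# Worked example: joint, marginal, and conditional probability on a 52-card deck.
from fractions import Fraction

total = 52
red_cards = 26   # hearts + diamonds
red_fours = 2    # 4 of hearts, 4 of diamonds

# Joint: P(red AND four)
p_red_and_four = Fraction(red_fours, total)   # 2/52 = 1/26

# Marginal: P(red), irrespective of rank
p_red = Fraction(red_cards, total)            # 26/52 = 1/2

# Conditional: P(four | red) = P(red AND four) / P(red)
p_four_given_red = p_red_and_four / p_red     # (1/26) / (1/2) = 1/13

print(p_red_and_four, p_red, p_four_given_red)
```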
Bayesian Networks
a directed acyclic graph (a graph having no cycles) that models dependencies between the variables of the data set. Vertices are variables and edges represent conditional dependencies. It allows us to capture variable dependencies within the data that we can't capture with linear or logistic regression. Bayesian networks use Bayesian inference.
Inference
Process of using a trained machine learning algorithm to make a prediction.
Posterior Probability
Probability of A (the hypothesis) occurring given that event B (the evidence) has already occurred
Likelihood
Probability of B (the evidence) being true given that A is true
Prior
Probability of A (the hypothesis) being true
Evidence
Probability of B (the evidence) being true
Probability Density Function
Describes the relative likelihood that a continuous random variable takes a given value; probabilities are obtained by integrating the density over an interval.
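A small sketch, assuming a standard normal distribution and SciPy, showing that the density itself is not a probability:

```python
# Sketch: the PDF gives a density, not a probability; probabilities come from
# integrating the density over an interval (here via the CDF).
from scipy.stats import norm

dist = norm(loc=0, scale=1)               # standard normal, an illustrative choice
density_at_0 = dist.pdf(0)                # ~0.3989, a density value, not a probability
p_interval = dist.cdf(1) - dist.cdf(-1)   # P(-1 <= X <= 1) ~ 0.6827
print(density_at_0, p_interval)
```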
What are two ways to build a classifier?
1) Calculate posterior probabilities for a sample and assign it to a class that has the highest probability
2) create a discriminant function
What would you use for a continuous random variable?
gaussian naive bayes
What would you use for a categorical random variable?
categorical naive bayes
What would you use for a multinomial distribution?
multinomial naive bayes
What would you use for a binary random variable?
bernoulli naive bayes
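A sketch mapping each feature type from the four cards above to its scikit-learn Naive Bayes variant; the iris fit is an illustrative example:

```python
# Sketch: each scikit-learn Naive Bayes variant matches a feature type.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB, MultinomialNB

variants = {
    "continuous": GaussianNB(),      # real-valued features
    "categorical": CategoricalNB(),  # discrete categories
    "counts": MultinomialNB(),       # e.g. word counts (multinomial)
    "binary": BernoulliNB(),         # 0/1 features
}

# Iris has continuous measurements, so the Gaussian variant applies.
X, y = load_iris(return_X_y=True)
print(variants["continuous"].fit(X, y).score(X, y))
```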
discriminant function
a function used to compare classes directly; since the evidence P(data) is the same for every class, we don't need to calculate it
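A minimal sketch with made-up numbers showing why dropping the evidence is safe: P(data) is constant across classes, so the ranking of classes is unchanged:

```python
# Sketch: a discriminant g(class) = P(data | class) * P(class) ranks classes
# identically to the posterior, without dividing by P(data).
likelihood = {"spam": 0.02, "ham": 0.005}   # P(data | class), made-up numbers
prior = {"spam": 0.3, "ham": 0.7}           # P(class), made-up numbers

scores = {c: likelihood[c] * prior[c] for c in prior}
prediction = max(scores, key=scores.get)    # no evidence term needed
print(scores, prediction)
```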
What is the difference between a bayesian network and a naive bayes classifier?
a bayesian network models dependencies between variables, whereas a naive bayes classifier assumes there is no dependency between variables (the input features are independent given the class)
If the formula for Naive Bayes is given by P(class|data) = [P(data|class) P(class)] / P(data),
then which of the components makes this algorithm "naive":
(a) P(class|data)
(b) P(data|class)
(c) P(class)
(d) P(data)
(b) P(data|class)
True or False: Naive Bayes assumes that the input features are independent
True
For continuous data, which formulation of Naive Bayes is appropriate:
(a) Binomial Naive Bayes
(b) Multinomial Naive Bayes
(c) Gaussian Naive Bayes
(d) None of the above
(c) Gaussian Naive Bayes
In the above problem, how would smoothing change the probability that Document 5 is Spam? (a) Increase the probability (b) Decrease the probability (c) No change
(a) Increase the probability
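A sketch of Laplace (add-one) smoothing with illustrative counts (the Document 5 problem itself is not reproduced here); without smoothing, a word unseen in a class zeroes the whole likelihood product:

```python
# Sketch of Laplace (add-one) smoothing for a word likelihood P(word | class).
# Counts below are illustrative, not from the original problem.
count = 0    # times the word appeared in spam training docs
total = 40   # total word tokens in spam docs
vocab = 10   # vocabulary size

p_unsmoothed = count / total                # 0.0 -> the whole posterior becomes 0
p_smoothed = (count + 1) / (total + vocab)  # 1/50 = 0.02 > 0
print(p_unsmoothed, p_smoothed)
```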
According to the lecture material, which of these is the typical loss function used for SVMs? (a) Mean Squared Error (b) Hinge Loss (c) Gini Coefficient (d) Cross Entropy
(b) Hinge Loss
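A small sketch of the hinge loss max(0, 1 - y*f(x)) for labels in {-1, +1}, with made-up scores:

```python
# Sketch: hinge loss penalizes points on the wrong side of the margin.
import numpy as np

y = np.array([1, -1, 1, -1])               # true labels (illustrative)
scores = np.array([2.0, -0.5, 0.3, 1.2])   # raw model outputs f(x)

hinge = np.maximum(0.0, 1.0 - y * scores)  # max(0, 1 - y * f(x))
print(hinge, hinge.mean())                 # [0. 0.5 0.7 2.2], mean 0.85
```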
SVMs are less effective when:
(a) The data is linearly separable
(b) The data is clean and ready to use
(c) The data is noisy and contains overlapping points
(c) The data is noisy and contains overlapping points
In SVM what is the meaning of a hard margin?
(a) The SVM allows very low error in classification
(b) The SVM allows high amount of error in classification
(c) None of the above
(a) The SVM allows very low error in classification
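A sketch, using scikit-learn's SVC, of how the C parameter trades margin hardness for tolerance; the blob dataset is illustrative:

```python
# Sketch: in scikit-learn's SVC, C controls how many margin violations are
# tolerated. A very large C approximates a hard margin; a small C gives a soft one.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

hard_ish = SVC(kernel="linear", C=1e6).fit(X, y)   # nearly hard margin
soft = SVC(kernel="linear", C=0.01).fit(X, y)      # soft margin, more tolerance
print(len(hard_ish.support_), len(soft.support_))  # soft margin keeps more support vectors
```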
True or False: SVM uses the kernel trick to classify non-linear data
True
True or False: Grid search can be used to optimize hyperparameters of a machine learning algorithm
True
What does a kernel function do?
(a) Transforms linearly inseparable data into separable data by transforming to a higher dimension
(b) Transforms linearly inseparable data into separable data by transforming to a lower dimension
(a) Transforms linearly inseparable data into separable data by transforming to a higher dimension
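A sketch on concentric circles (not linearly separable in 2-D), comparing a linear kernel to an RBF kernel; the dataset and parameters are illustrative:

```python
# Sketch: concentric circles are not linearly separable in 2-D, but an RBF
# kernel implicitly maps them to a higher-dimensional space where they are.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)
print(linear.score(X, y), rbf.score(X, y))  # rbf should score far higher
```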
True or False: The decision boundary in non-linear SVM must be linear.
True (the boundary is linear in the kernel-induced, higher-dimensional feature space, even though it appears non-linear in the original input space)
When fitting an SVM, we attempt to optimize:
(a) The normal vector of the decision boundary
(b) The margin between the decision boundary and the data
(c) The density of the data on either side of the decision boundary
(d) None of the above
(b) The margin between the decision boundary and the data
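A sketch, assuming a linear kernel, of the margin width 2/||w||: maximizing the margin is equivalent to minimizing the norm of the weight vector:

```python
# Sketch: for a linear SVM the margin width is 2 / ||w||, so maximizing the
# margin means minimizing the norm of the weight vector w.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                 # normal vector of the decision boundary
margin = 2 / np.linalg.norm(w)   # distance between the two margin hyperplanes
print(margin)
```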
Given the following decision trees grown from the same dataset, which is the most likely to be overfit? (a) f1-score 0.8, leaf count = 50 (b) f1-score 0.9, leaf count = 20 (c) f1-score 0.7, leaf count = 10
(a) f1-score 0.8, leaf count = 50
All else held equal, a decision tree with more leaves will have more complicated
decision boundaries, thus increasing our expectation that it will overfit to the data.
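A sketch, on an illustrative noisy dataset, of how leaf count tracks overfitting: the unconstrained tree memorizes the training data, while a leaf-capped tree generalizes better:

```python
# Sketch: capping max_leaf_nodes restrains tree complexity. With label noise
# (flip_y), the unconstrained tree overfits the training set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                 # many leaves
capped = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0).fit(X_tr, y_tr)

print(deep.get_n_leaves(), deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print(capped.get_n_leaves(), capped.score(X_tr, y_tr), capped.score(X_te, y_te))
```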