Final Review Flashcards

1
Q

Who developed the VC dimension?

A

Dr. Vladimir Vapnik and Dr. Alexey Chervonenkis

2
Q

What does VC stand for?

A

Vapnik-Chervonenkis

3
Q

Why was the VC dimension developed?

A

To help scientists develop machine learning models that are better at classifying data

4
Q

What is the VC dimension?

A

The Vapnik-Chervonenkis dimension of a hypothesis space is the maximum number of points that can be “shattered” by hypotheses from that space

5
Q

In terms of VC dimension, what does “shattering” mean?

A

“Shattering” means that for every possible way of labeling these points (with binary labels), there is a hypothesis in the space that correctly classifies the points according to that labeling

6
Q

Is either of these true? Why or why not?

The VC dimension of the hypothesis class of circles is lower than that of squares.

The VC dimension of the hypothesis class of squares is lower than that of circles.

A

No, because both have a VC dimension of 3.

7
Q

Is this true? Why or why not?

The VC dimension of the hypothesis class of rings is higher than that of circles.

A

Yes, because rings have a VC dimension of 4 while circles have a VC dimension of 3.

8
Q

What are 2 important notes when discussing the VC dimension of a hypothesis space?

A
  • Knowing if the hypothesis space is centered on the origin
  • Knowing that the VC dimension is NOT always equivalent to 1 + (number of dimensions)
9
Q

Is either of these true? Why or why not?

The VC dimension of circles is higher than that of squares.

The VC dimension of squares is higher than that of circles.

A

No, because both have a VC dimension of 3.

10
Q

What was the goal of Bayes’ Rule?

A

To solve the problem of inverse probability - inferring the probability of causes (hypotheses) from observed effects (data)

11
Q

What is Bayes’ Rule?

A

A fundamental theorem in probability theory for updating the probability of a hypothesis based on new evidence

12
Q

What is the formula for Bayes’ Rule?

A

P(h|D) = [P(D|h) * P(h)] / P(D)
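
To make the formula concrete, here is a minimal Python sketch of Bayes’ Rule over a tiny discrete hypothesis space (the priors and likelihoods below are made-up numbers for illustration):

```python
# Posterior via Bayes' Rule for a tiny discrete hypothesis space.
# All numbers below are made up for illustration.
priors = {"h1": 0.7, "h2": 0.3}          # P(h)
likelihoods = {"h1": 0.2, "h2": 0.9}     # P(D|h)

# P(D) = sum over all hypotheses of P(D|h) * P(h)
p_data = sum(likelihoods[h] * priors[h] for h in priors)

posteriors = {h: likelihoods[h] * priors[h] / p_data for h in priors}
print(posteriors)  # {'h1': ~0.341, 'h2': ~0.659}
```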

13
Q

In Bayes’ Rule, what is P(h|D) and what does it represent?

A

The posterior probability of the hypothesis given the data.

It represents our updated belief in the hypothesis after seeing the data.

14
Q

In Bayes’ Rule, what is P(D|h) and what does it represent?

A

P(D|h) is the likelihood, which is the probability of observing the data assuming the hypothesis is true.

It tells us how likely the observed data is under the assumption of the hypothesis.

15
Q

In Bayes’ Rule, what is P(h) and what does it represent?

A

P(h) is the prior probability of the hypothesis before observing any data.

It represents our belief in the hypothesis based on prior knowledge or assumptions.

16
Q

In Bayes’ Rule, what is P(D) and what does it represent?

A

P(D) is the marginal likelihood (evidence), which normalizes the posterior distribution by ensuring the total probability sums to 1.

It represents the total probability of observing the data across all possible hypotheses.

17
Q

For Bayes’ Rule, when would we ignore P(D)?

A

We can ignore P(D) when we’re only interested in finding the hypothesis that maximizes the posterior probability P(h|D), because P(D) is the same constant for every hypothesis

18
Q

What is MLE?

A

Maximum Likelihood Estimation is a way to estimate the parameters of a model by finding the values that make the observed data most likely

19
Q

What does MLE mean in the context of binary classification?

A

This means finding the parameters of the hypothesis h(x) that make the observed outcomes y most likely.

20
Q

What is MSE? How is it written?

A

Mean Squared Error
MSE = (1/n) * Σ(y - h(x))^2
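
A minimal Python sketch of this formula (the sample values are arbitrary):

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average squared difference between targets and predictions."""
    n = len(y_true)
    return sum((y - h) ** 2 for y, h in zip(y_true, y_pred)) / n

print(mse([3.0, 5.0, 2.0], [2.5, 5.0, 3.0]))  # 0.4166...
```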

21
Q

Why is MSE unsuitable for binary classification?

A
  • Binary classification treats targets as discrete (0 or 1), while MSE assumes targets are continuous values.
  • MSE doesn’t penalize incorrect predictions made with high confidence as effectively as cross-entropy loss does
  • MSE here can result in poor performance because it focuses on shortening distance between predicted and actual values, ignoring the probability estimates produced by models in classification tasks
22
Q

What function is typically used for a binary classification neural network’s output layer? Why?

A

A sigmoid activation function is typically used because it outputs a value between 0 and 1 which can be evaluated with cross-entropy loss to predict how well it matches the actual binary label.

23
Q

Describe Cross-Entropy Loss (CEL)

A
  • Used in binary classification
  • Measures the difference between the predicted probabilities and the true binary labels
  • Directly handles probabilistic predictions
  • Strongly penalizes wrong predictions made with high confidence
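
A minimal Python sketch of binary cross-entropy that illustrates the bullets above, in particular how much more heavily a confident wrong prediction is penalized (the values are arbitrary):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average cross-entropy between true binary labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A confident wrong prediction (0.95 for a true 0) is penalized far more
# than a mildly wrong one (0.6 for a true 0).
print(binary_cross_entropy([0], [0.95]))  # ~3.0
print(binary_cross_entropy([0], [0.6]))   # ~0.92
```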
24
Q

Describe Mean Squared Error (MSE)

A
  • Typically used in regression tasks where the target variable is continuous
  • Measures the squared difference between predicted values and true values
  • Treats output as continuous, making it LESS suitable for binary classification
25
Describe Neural Networks with Sum of Squared Error (SSE)
- If such a network is trained for a binary classification task, it may treat the output as continuous rather than probabilistic
- For binary classification, CEL (Cross-Entropy Loss) with a sigmoid output layer is used instead
26
What is a limitation of K-Means?
- K-Means struggles with non-spherical clusters
- K-Means requires a pre-defined number of clusters
- K-Means may converge to local minima
27
What are the steps of spectral clustering?
- Construct a similarity graph
- Compute the graph Laplacian
- Compute the eigenvalues and eigenvectors
- Perform clustering in the reduced eigenspace
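A minimal numpy sketch of these four steps, assuming an RBF (Gaussian) similarity graph, the unnormalized Laplacian L = D - W, and a plain k-means step in the eigenspace (sigma, the seed, and the iteration cap are illustrative choices):

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, iters=50, seed=0):
    # 1. Construct a similarity graph (RBF kernel on pairwise squared distances)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # 2. Compute the (unnormalized) graph Laplacian L = D - W
    L = np.diag(W.sum(axis=1)) - W
    # 3. Eigen-decompose; keep the k eigenvectors with the smallest eigenvalues
    _, vecs = np.linalg.eigh(L)          # eigh returns ascending eigenvalues
    emb = vecs[:, :k]
    # 4. Cluster in the reduced eigenspace (plain k-means with random init)
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((emb[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([emb[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels
```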
28
What is the kernel trick?
This trick maps data into a higher-dimensional space where linear separation is possible
29
Why does the kernel trick matter for concentric circles?
This trick allows spectral clustering to transform the original space, enabling the separation of non-linearly separable points, such as inner and outer circles
30
What is the primary advantage of Isomap over traditional MDS?
Isomap approximates geodesic distances, capturing global structure
31
Advantages and Limitations of Isomap
Advantage:
- Captures global structure effectively, especially for data with a natural manifold shape
Limitations:
- Computationally intensive on large datasets
- Struggles with non-manifold high-dimensional data
32
What is the core concept of Laplacian Eigenmaps?
Laplacian eigenmaps use spectral graph theory to focus on local neighborhood relationships, preserving local structure
33
What is the process for Laplacian Eigenmaps?
- Build a weighted similarity graph of the data
- Compute the graph's Laplacian matrix
- Perform eigenvalue decomposition to generate a low-dimensional embedding
34
Laplacian Eigenmaps are best suited for preserving what type of structure?
Local neighborhood structure
35
Name applications of Laplacian Eigenmaps
- Image Processing
- Clustering in High-Dimensional Data
- Sensor Data Analysis
- Gene Expression Data
36
Advantages and Limitations of Laplacian Eigenmaps
Advantage:
- Efficient for preserving local structures, which is useful for clustering tasks
Limitation:
- May lose global relationships because it focuses solely on the local level
37
Why might pruning be applied to a decision tree?
- To simplify the model and improve interpretability by removing less significant branches
- To reduce overfitting by removing branches that were fit to noise or other training data fluctuations
- To remove branches that provide little to no predictive power
38
Why are the following NOT reasons to prune a decision tree? - To ensure the tree is balanced - To always achieve the best accuracy - To increase tree depth
- The goal of pruning is not to balance a tree
- Pruning doesn't guarantee the BEST accuracy
- Pruning does the opposite of increasing tree depth
39
Why are the following statements around cross-validation true? - In k-fold cross-validation, the dataset is divided into k subsets, and each subset is used as a validation set - Cross-validation helps in estimating the generalization performance of a regression model
- The dataset is divided into k subsets (folds), and each fold is used as a validation set once, while the remaining k - 1 folds are used for training
- Testing on multiple validation sets helps to identify overfitting or underfitting
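A minimal numpy sketch of k-fold index splitting (the function name and seed are ours, for illustration):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs: each fold serves as the validation set once."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

for train, val in k_fold_indices(10, 5):
    print(len(train), len(val))  # 8 2, five times (each fold validates once)
```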
40
Is the following statement incorrect? Why? - Cross-validation is only applicable to regression tasks and not classification
Cross-validation is a general technique applicable to both regression and classification tasks
41
In the context of training neural networks, why is direct interpretability a challenge?
Neural networks are often referred to as "black-box" models because it is difficult to directly interpret how they arrive at predictions, especially as the number of layers and neurons increases
42
In the context of training neural networks, why is overfitting a challenge?
Overfitting occurs when a neural network learns the training data too well, including noise and irrelevant patterns, leading to poor generalization on unseen data
43
In the context of training neural networks, why is getting stuck in local minima a challenge?
While modern optimizers like stochastic gradient descent (SGD) with momentum or Adam help mitigate this, getting stuck in poor local minima or saddle points can still be a problem, especially for non-convex loss functions
44
In the context of training neural networks, why is the vanishing gradient problem a challenge?
The vanishing gradient problem occurs when gradients become very small during backpropagation, particularly in deep networks with activation functions like sigmoid or tanh. This slows down learning or even halts it in early layers
45
In the context of training neural networks, why is inefficient support vector calculations not a challenge?
Support vector calculations are specific to Support Vector Machines (SVMs), a different type of machine learning algorithm
46
In the context of training neural networks, why is collinearity among features not a challenge?
While collinearity among features can affect simpler models like linear regression, neural networks can often learn to handle correlated inputs effectively due to their ability to model complex relationships
47
Which of the following are characteristics of instance-based learning? - It always uses a probabilistic model for predictions - It's often sensitive to irrelevant or redundant features - It is always faster than model-based learning - New instances are classified based on similarity measures - Training is typically computationally intensive - The model "memorizes" the training instances
- It's often sensitive to irrelevant or redundant features because instance-based learning relies on similarity measures, which such features can skew
- New instances are classified based on similarity measures
- The model "memorizes" the training instances
48
Why are the following invalid reasons to use ensemble learning techniques? - To handle missing values in the data - To speed up training times for large datasets - To provide a more interpretable model
- Ensemble learning does not focus on handling missing values
- Ensemble training may actually increase training times because multiple models need to be trained
- Ensemble models are often less interpretable because they combine multiple models, making them very complex
49
Why is finding a hyperplane that maximizes the margin between classes in the transformed space a reason for using kernel methods in SVMs?
Kernel methods enable SVMs to find a separating hyperplane in a higher-dimensional transformed space when the data is not linearly separable in the original space
50
Why is handling missing values in data NOT a reason for using kernel methods in SVMs?
Kernel methods don't address missing data, but preprocessing can do so
51
Why is decreasing the number of support vectors in the model NOT a reason for using kernel methods in SVMs?
Kernel methods don't reduce the number of support vectors; that depends on the data and the model's parameters
52
Why is tackling non-linearly separable data a reason for using kernel methods in SVMs?
Kernel methods allow SVMs to handle non-linearly separable data by implicitly mapping it into a higher-dimensional space where separation IS possible
53
Why is reducing the computational complexity of the algorithm NOT a reason for using kernel methods in SVMs?
Kernel methods actually often increase the computational complexity because they require pairwise comparisons of ALL data points through the kernel function
54
Why is a fixed set of features to represent all possible inputs NOT an essential component of the PAC learning framework?
PAC is agnostic to the feature space and focuses on the hypothesis space, error bounds, and learning guarantees
55
Why is a confidence parameter an essential component in the PAC learning framework?
The confidence parameter represents the probability that the chosen hypothesis will not meet the desired accuracy
56
What does PAC stand for?
Probably Approximately Correct
57
What is the definition of PAC-learnable?
A concept class C is PAC-learnable by learner L using hypothesis space H if, with probability at least 1 - δ, L outputs a hypothesis h in H with true error at most ε, in time and with a number of samples polynomial in 1/ε, 1/δ, and the size of the problem
58
What makes up the "Approximately Correct" in PAC?
Requiring not that the true error be zero, but only that it be bounded by some constant ε that can be made arbitrarily small
59
What makes up the "Probably" in PAC?
Requiring that the learner's probability of failure be bounded by some constant that can be made arbitrarily small
60
Why is a hypothesis space essential to the PAC learning framework?
PAC assumes there exists some hypothesis space from which all possible models/functions can be selected by the learner to approximate the target
61
Why is the VC Dimension for all linear classifiers in a 2D space not 1?
Linear classifiers can shatter certain configurations of 3 points (e.g., a triangle), but no configuration of 4 points. This makes the VC Dimension for all linear classifiers in a 2D space 3.
62
Why is it not true that the number of features in the dataset is equal to the VC Dimension?
They aren't equivalent. Some models with nonlinear hypothesis spaces can have a VC Dimension greater than the number of features.
63
Why is a high VC Dimension an indicator of potential overfitting?
A higher VC Dimension means the model can fit more complex patterns, which means it can overfit to the given training data.
64
In the context of Bayesian learning, why is a higher likelihood NOT always indicative of a more probable hypothesis?
A higher likelihood means that the data is more probable given the hypothesis, but the overall probability of the hypothesis depends on the prior
65
In the context of Bayesian learning, why is probability NOT always uniform across all hypotheses?
Some hypotheses explain the data better than others, resulting in different likelihood values
66
In the context of Bayesian learning, why is it NOT true that the likelihood of a hypothesis is the same as the prior possibility of a hypothesis?
The likelihood P(Data | Hypothesis) is different from the prior P(Hypothesis). The prior represents the probability of the hypothesis before seeing the data, while the likelihood measures how well the data supports the hypothesis.
67
In Bayesian learning, what is MAP and its formula?
MAP is the maximum a posteriori hypothesis, representing an estimate of the most likely value of a variable. h_MAP = argmax_(h in H) Pr(h | D)
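A minimal Python sketch contrasting MAP with ML (next card) on a made-up two-hypothesis space; note how a strong prior can make h_MAP differ from h_ML:

```python
priors = {"h1": 0.9, "h2": 0.1}        # Pr(h): h1 is strongly favored a priori
likelihoods = {"h1": 0.5, "h2": 0.8}   # Pr(D | h): the data fits h2 better

h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])  # argmax Pr(D|h) * Pr(h)
h_ml = max(priors, key=lambda h: likelihoods[h])               # argmax Pr(D|h)
print(h_map, h_ml)  # h1 h2 -- the prior pulls MAP toward h1
```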
68
In Bayesian learning, what is ML and its formula?
ML is the maximum likelihood hypothesis; it is equivalent to MAP when all hypotheses are equally likely a priori (a uniform prior). h_ML = argmax_(h in H) Pr(D | h)
69
How do we measure information gain?
Gain(S, A) = Entropy(S) - Σ_(v in Values(A)) (|S_v| / |S|) * Entropy(S_v)
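A minimal Python sketch of these two formulas (the toy dataset is made up; a perfect split recovers the full entropy of S as gain):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum p_i * log2(p_i) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(x[attribute] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attribute] == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy example: the attribute "windy" splits the labels perfectly -> gain = Entropy(S)
X = [{"windy": True}, {"windy": True}, {"windy": False}, {"windy": False}]
y = ["play", "play", "stay", "stay"]
print(information_gain(X, y, "windy"))  # 1.0
```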
70
What is restriction bias?
Bias that automatically occurs when we decide our hypothesis set
71
What is preference bias?
Bias that tells us what sort of hypotheses from our hypothesis space we prefer
72
What types of decision trees are preferred?
Ones with good splits at the top because it means less traversal on average
73
In terms of Ensemble Learning, what is bagging?
Bagging is when we choose subsets of the data uniformly at random, train a weak learner on each, then combine the results of those learners. Bagging is the shortened name for Bootstrap Aggregation.
74
When Bagging in Ensemble Learning, why does taking the average of a set of weak learners trained on different subsets outperform a single learner trained on the entire dataset?
Overfitting. Overfitting a subset will not overfit the overall dataset, and the average will "smooth out" the specifics of each individual learner.
75
In Ensemble Learning, what is Boosting?
Boosting is focused on data that previous learners struggled with in order to form a cohesive picture of the entire dataset.
76
In the context of Ensemble Learning, describe the process of Boosting
- Initialize all training examples to be equally weighted
- After each round, identify the weakest learner
- Raise the weights of every example that weak learner misclassified
- Combine all the weak learners into a final learner composed of their weighted average
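A schematic Python sketch of these steps in the style of AdaBoost, assuming labels in {-1, +1} and a pre-built pool of threshold "stumps" as the weak learners (all names and data here are hypothetical):

```python
import math

def boost(X, y, stumps, rounds):
    """Schematic boosting: reweight examples the chosen weak learner misclassifies."""
    n = len(X)
    weights = [1.0 / n] * n                  # 1. equally weighted examples
    ensemble = []                            # list of (alpha, learner)
    for _ in range(rounds):
        # 2. pick the learner with the lowest weighted error this round
        def weighted_error(h):
            return sum(w for w, x, t in zip(weights, X, y) if h(x) != t)
        h = min(stumps, key=weighted_error)
        err = max(weighted_error(h), 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        # 3. raise the weights of examples this learner got wrong (and renormalize)
        weights = [w * math.exp(-alpha * t * h(x)) for w, x, t in zip(weights, X, y)]
        s = sum(weights)
        weights = [w / s for w in weights]
        ensemble.append((alpha, h))
    # 4. final learner: sign of the weighted average of the weak learners
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Hypothetical 1-D data and a pool of threshold stumps (both directions):
X, y = [1, 2, 3, 4, 5], [1, 1, -1, -1, 1]
stumps = [lambda x, t=t: 1 if x < t else -1 for t in (1.5, 2.5, 3.5, 4.5)] + \
         [lambda x, t=t: 1 if x >= t else -1 for t in (1.5, 2.5, 3.5, 4.5)]
final = boost(X, y, stumps, rounds=10)
print([final(x) for x in X])  # ensemble predictions on the training points
```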
77
What is a weak learner?
A weak learner is a model that performs better than chance for any distribution of data
78
When can you not create a weak learner?
If there is any distribution for which a set of hypotheses can't do better than random chance, then there is no way to create a weak learner.
79
What are two rules designed for training the weights on perceptron units/networks?
- Perceptron Rule (thresholded)
- Gradient Descent / Delta Rule (unthresholded)
80
Give the Perceptron Rule formula
activation = Σ(w_i * x_i) + b, where b is some bias constant, and y_hat = 1 if the activation >= 0, else 0. The weight update is w_i <- w_i + η * (y - y_hat) * x_i
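A minimal Python sketch of training with this rule, stopping once a full pass produces no weight changes (the learning rate and epoch cap are illustrative):

```python
def train_perceptron(X, y, lr=0.1, max_epochs=100):
    """Perceptron Rule: w_i <- w_i + lr * (y - y_hat) * x_i (same update for the bias).
    Stops once an entire pass makes no weight changes (the data has been separated)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        changed = False
        for x, target in zip(X, y):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            y_hat = 1 if activation >= 0 else 0
            error = target - y_hat      # 0, +1 (output too small), or -1 (too large)
            if error != 0:
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
                changed = True
        if not changed:
            break
    return w, b

# Linearly separable AND function: converges in a finite number of passes.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
print(train_perceptron(X, y))
```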
81
Why does the Perceptron Rule care about linearly separable data?
If the given data is linearly separable, the Perceptron Rule will find the dividing halfplane in a finite number of iterations
82
Describe the 4 main components of the Perceptron Rule
- The target y
- The output y_hat
- The learning rate η (how much we adjust)
- The input x
83
In the Perceptron Rule, what happens when the target and the output match?
When both match, we see no change in the weight because there is no error
84
In the Perceptron Rule, what happens when the difference between the target and output is negative (y - y_hat = -1)? When it is positive (y - y_hat = 1)?
When the difference is positive, our output is too small, so we increase the weight's size according to the learning rate and given input. When the difference is negative, our weighted sum is too large, so we decrease the weight's size according to the learning rate and given input.
85
What is the criteria to stop running the Perceptron Rule?
We stop the Perceptron Rule once the changes in the weights are all 0 (this means we've found a plane that separates the data)
86
What decides when to use the Perceptron Rule vs. Gradient Descent?
Whether the data is/is not linearly separable. If it is, use the Perceptron Rule. If not, use Gradient Descent.
87
Why not do Gradient Descent on y_hat (the thresholded activation) instead of a (the pure activation value)?
It's not differentiable. There is no way to take the derivative at a discontinuous point. Essentially, the function jumps in value once we meet the activation threshold, which makes it discontinuous.
88
Describe the behavior of the Sigmoid function
As x approaches negative infinity, e^(-x) becomes infinitely large (x = -100 --> e^100), so the function approaches 0 because the denominator is basically infinite. As x approaches positive infinity, e^(-x) becomes infinitely small (x = 100 --> e^-100), so the function approaches 1 because the denominator is basically 1.
89
Give the formula of the Sigmoid function
a(x) = 1 / (1 + e^(-x))
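A minimal Python sketch of the formula and its limiting behavior:

```python
import math

def sigmoid(x):
    """a(x) = 1 / (1 + e^(-x)); squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-100) < 1e-40)  # True: ~0 for very negative inputs
print(sigmoid(0.0))           # 0.5
print(sigmoid(100))           # ~1.0 for very positive inputs
```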
90
Backpropagation is dependent on what?
On ALL of the activation functions being differentiable. We cannot propagate the error information back up the layers if the functions cannot be differentiated
91
What is backpropagation?
It is the process of calculating the error between output and actual, then adjusting weights from the "bottom" of the network back up to the top
92
Backpropagation is also called what?
Error backpropagation
93
What are the 2 main issues with gradient descent?
- Activation functions need to be differentiable
- Can get stuck in local optima
94
Why does gradient descent have an issue with getting stuck in local optima?
As we combine more and more functions, the "landscape" of that combined error function becomes more complex and introduces more answers other than the best answer, the global optimum
95
With neural networks, what are the 3 ways to cause overfitting?
- Increasing the number of nodes
- Increasing the number of layers
- Increasing the size of the weights
96
What is the restriction bias of neural networks?
With a sufficiently complex network, there is no hypothesis we will not consider
97
What is the preference bias of neural networks?
Simpler, more correct network configurations
98
What is the restriction bias of decision trees?
Only hypotheses that can be represented as a decision tree (splits on the given features)
99
What is the preference bias of decision trees?
Shorter trees with good splits near the top, preferring correct trees over incorrect ones
100
What is the restriction bias of K-NN?
- Points closer to each other are more similar to each other
- We are expecting functions to act smoothly
- We assume ALL features have equal importance
101
What is the preference bias of K-NN?
102
What is the Curse of Dimensionality?
As we add more features to our data, the amount of data needed to cover it sufficiently grows exponentially
103
What is Ensemble Learning?
When you train multiple instances of the model on different subsets of the training data and then combine all of those resulting models into a singular model
104
What is the relationship between SVMs and the margin(s) of our data?
The bigger the margin, the less overfitting done by the SVM because it commits less to the specific data it was trained on
105
What actually is a support vector?
The points from the input data that are necessary for defining the largest margin
106
What is the kernel trick?
Projecting our currently given, non-linearly separable data into a higher dimensional space so that it becomes linearly separable
107
In SVMs, how is the kernel trick accomplished? What does it do?
The kernel trick is accomplished by turning the dot product into some other similarity metric. This gives us a way to embed domain knowledge.
108
What is the only condition for a kernel trick to be successful?
The Mercer condition: a kernel function is valid if and only if:
- It is symmetric (K(x, y) = K(y, x))
- It is positive semi-definite (c^T * K * c >= 0; essentially, any vector we put in gives out a positive number or zero as the answer)
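A minimal numpy sketch that checks both parts of the Mercer condition on a Gram matrix built from sample points (a necessary-condition check on finite data, not a full proof of validity):

```python
import numpy as np

def looks_like_valid_kernel(K, tol=1e-9):
    """Check the Mercer condition on a Gram matrix K:
    symmetric, and positive semi-definite (all eigenvalues >= 0)."""
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol)
    return symmetric and psd

# The Gram matrix of the linear kernel K(x, y) = x . y is always symmetric PSD.
X = np.random.default_rng(0).normal(size=(5, 3))
print(looks_like_valid_kernel(X @ X.T))  # True
```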
109
What happens to Boosting over time in SVMs?
Over time, Boosting results in the SVMs becoming more confident in their classifications, so we actually don't end up seeing any issues with overfitting
110
What is a consistent learner?
A model that produces a candidate hypothesis which is equal to the true hypothesis for every data point in the training set
111
When is a version space epsilon-exhausted (ε)?
A version space is ε-exhausted if and only if (iff) for every candidate hypothesis in the version space, the error of those hypotheses is less than or equal to ε where 0 <= ε <= 1/2
112
Describe how exploration is accomplished in Simulated Annealing
With temperature. As temperature increases, simulated annealing is willing to take initially worse steps, hoping that they will eventually lead to a better, or global, optimum
113
Describe how exploitation is accomplished in Simulated Annealing
With hill climbing. If we are solely focused on exploitation, then we reduce the temperature as close to 0 as possible so that we're only attempting steps that will improve our fitness function score
114
What is the inspiration behind Simulated Annealing?
That the same process used for forging metals, repeatedly changing the ordering of molecules via heating and cooling until a viable configuration is achieved, can be used for randomized optimization
115
How does single link clustering work?
- Initialize each data point as its own cluster
- Find the two clusters with the smallest distance between any two points belonging to those clusters
- Merge those two clusters into a single cluster
- Repeat steps 2 and 3 until only some defined number of clusters remains
116
How does k-means clustering work?
- Randomly initialize k centers
- Each center "claims" its closest points
- Re-compute the center of each cluster by averaging the clustered points
- Repeat steps 2 and 3 until the cluster assignments stop changing
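A minimal numpy sketch of these steps (the random seed is illustrative; ties and empty clusters are handled crudely by keeping the old center):

```python
import numpy as np

def k_means(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # 1. random initial centers
    while True:
        # 2. each center "claims" its closest points
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # 3. re-compute each center as the mean of its clustered points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):            # 4. repeat until stable
            return labels, centers
        centers = new_centers

X = np.random.default_rng(1).normal(size=(60, 2))
labels, centers = k_means(X, 3)
```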
117
What kinds of shapes does single link clustering produce? Why?
Long, chain-like clusters, because clusters are merged based only on the single closest pair of points between them
118
What kinds of shapes does k-means clustering produce? Why?
Compact, roughly spherical (convex) clusters, because each point simply joins its nearest center
119
How does soft clustering work?
Instead of assigning each point to exactly one cluster, each point is given a probability of belonging to each cluster (e.g., the data is modeled as draws from a mixture of Gaussians)
120
What kinds of shapes does soft clustering produce? Why?
121
How does the EM (Expectation Maximization) algorithm work?
- Expectation: compute the expected cluster assignments for each point given the current parameters
- Maximization: re-estimate the parameters to maximize the likelihood given those assignments
- Repeat until the values (nearly) stop changing
122
What are the properties of the EM (Expectation Maximization) algorithm?
- Monotonically non-decreasing likelihood (produces scores that never get lower)
- Does not converge (but gets close)
- Will not diverge
- Can get stuck in local optima
- Works with any distribution (so long as you can define "expectation" and "maximization")
123
Why does k-means clustering eventually converge?
There is a finite number of possible cluster configurations. As long as you have a means of always breaking ties, you will eventually converge on some configuration
124
What are the 3 clustering properties?
- Richness: the idea that we could achieve ANY cluster configuration available
- Scale Invariance: the idea that scaling all distances by a positive constant doesn't change the clustering, since the relative positions stay the same
- Consistency: the idea that shrinking intracluster distances and expanding intercluster distances doesn't change the clustering
125
What is the Kleinberg Impossibility Theorem?
There is no clustering algorithm that can achieve all three properties: richness, scale-invariance, and consistency
126
What are the two types of methods for feature selection?
- Filtering: we take the given features and narrow the list to a subset before handing that subset to some learning algorithm
- Wrapping: the search for features is wrapped around our learning algorithm as we continuously give it a subset, assess its performance, and adjust which features it's given the next time
127
What are the properties of Feature Selection via Filtering?
+ Generally increased speed
- Ignores the learning problem
128
What are the properties of Feature Selection via Wrapping?
+ Takes into account model bias
+ Takes into account the learning problem
- Very, very slow
129
What are the 2 linear transformation algorithms we reviewed?
- Principal Component Analysis (PCA)
- Independent Component Analysis (ICA)
130
What does PCA do?
- It finds the direction of maximum variance for some given set of training data
- It finds directions that are mutually orthogonal (perpendicular to each other)
131
What are the characteristics of PCA?
- It is an eigenproblem
- It gives you the best reconstruction (lowest reconstruction error) of the data
- Each principal component has its own eigenvalue (a non-negative value)
132
What does it mean if the eigenvalue of a principal component from PCA is 0?
If the eigenvalue is 0, then we can ignore that component (feature)
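A minimal numpy sketch of PCA as an eigenproblem, showing how a redundant feature yields an eigenvalue near 0 (the synthetic data is made up for illustration):

```python
import numpy as np

def pca(X, n_components):
    """PCA as an eigenproblem: eigenvectors of the covariance matrix are the
    mutually orthogonal directions of maximum variance, sorted by eigenvalue."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # largest variance first
    components = eigvecs[:, order[:n_components]]
    return Xc @ components, eigvals[order]   # projected data, sorted eigenvalues

X = np.random.default_rng(0).normal(size=(100, 3))
X = np.column_stack([X[:, 0], X[:, 1], X[:, 0] + X[:, 1]])  # third feature is redundant
proj, eigvals = pca(X, 2)
print(eigvals)  # smallest eigenvalue ~0: the redundant direction can be ignored
```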
133
What are the characteristics of ICA?
- Constructs new features that are mutually independent of each other
- Aims for the maximum mutual information between the original features and the constructed features
134
What is the Central Limit Theorem?
The sum (or average) of many independent random variables tends toward a Gaussian distribution, regardless of the variables' original distributions
135
What is the main difference between PCA and ICA?
ICA is highly directional. PCA doesn't care about being given the original or transposed features because they're all just rotations of the data
136
What are some alternatives to PCA and ICA?
- RCA (Random Component Analysis)
- LDA (Linear Discriminant Analysis)
137
What is the main advantage of RCA?
It is FAST, so we can run many, many iterations of it
138
How does RCA work?
RCA generates a random direction for projection, and works quite well if used for classification
139
How does LDA work?
LDA finds a projection that discriminates based on the given labels
140
What does Entropy capture?
Entropy captures the amount of information contained in a random variable
141
What is Joint Entropy?
Joint Entropy is a measure of randomness contained in two variables together
142
What is Mutual Information?
A measure of the reduction of randomness of a variable, given knowledge of another variable
143
Why is Mutual Information a particular case of Kullback-Leibler Divergence?
Mutual Information is the KL Divergence between the joint distribution P(X, Y) and the product of the marginals P(X)P(Y); it measures how far the two variables are from being independent
144
What is the Kullback-Leibler Divergence?
KL Divergence is a measure of the difference between two distributions, quantifying the information lost when one distribution is used to approximate another
145
What is Conditional Entropy?
Conditional Entropy is a measure of the randomness of one variable given another variable
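A minimal Python sketch tying the last few cards together for discrete distributions: entropy, KL Divergence, and Mutual Information as the KL Divergence between the joint and the product of the marginals (the example joint distribution is made up):

```python
import math

def entropy(p):
    """H(X) = -sum p(x) * log2(p(x))"""
    return -sum(px * math.log2(px) for px in p if px > 0)

def kl_divergence(p, q):
    """D_KL(P || Q): information lost when Q is used to approximate P."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def mutual_information(joint, px, py):
    """I(X;Y) = D_KL( P(X,Y) || P(X)P(Y) ): reduction in randomness of X given Y."""
    product = [pxi * pyj for pxi in px for pyj in py]
    flat = [joint[i][j] for i in range(len(px)) for j in range(len(py))]
    return kl_divergence(flat, product)

# Perfectly correlated binary variables: knowing Y removes all randomness in X.
joint = [[0.5, 0.0], [0.0, 0.5]]
print(entropy([0.5, 0.5]))                                # 1.0 bit
print(mutual_information(joint, [0.5, 0.5], [0.5, 0.5]))  # 1.0 bit
```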