Course session 12 (Algorithm-independent machine learning) Flashcards

1
Q

What is the no free lunch theorem?

A

No single optimization algorithm or learning algorithm is universally superior for all problems. It is the type of problem, prior distribution, and other information that determine which form of classifier should provide the best performance.
Averaged over all possible problems, all optimization or learning algorithms (even random guessing) perform equally well (or poorly).

2
Q

What is cross validation?

A

It is a method for estimating a model's generalization performance when the sample size is small.
The data is split into k groups; k-1 groups are used to train a set of models, which are then evaluated on the remaining group. This process is repeated for all k possible choices of the left-out group, and the results are averaged. It is computationally expensive, as the number of training runs to be performed increases by a factor of k.
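
A minimal sketch of k-fold cross-validation, assuming a scikit-learn-style estimator with fit and score methods (model_factory and all names here are illustrative):

```python
import numpy as np

def k_fold_cv(model_factory, X, y, k=5, seed=0):
    """Average a model's held-out score over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)  # k disjoint groups
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = model_factory()                # a fresh model for each fold
        model.fit(X[train_idx], y[train_idx])  # train on the k-1 groups
        scores.append(model.score(X[test_idx], y[test_idx]))  # evaluate on the left-out group
    return float(np.mean(scores))
```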

3
Q

What is mixture of experts?

A

It is a combination of models. Each of its component classifiers is highly trained (i.e., an “expert”) in a different (probabilistic) region of the feature space, thus granting good performance over the entire feature space. The outputs of the components are combined, typically weighted by a gating network that estimates how responsible each expert is for a given input.
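
A minimal numeric sketch of the gated combination at prediction time, assuming linear experts and a linear gating network (the linear forms, weights, and names are illustrative assumptions; in practice the experts and gate are trained jointly):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_predict(x, expert_weights, gate_weights):
    """Gated combination: each expert predicts, the gate weights the predictions."""
    expert_outputs = np.array([w @ x for w in expert_weights])  # one prediction per expert
    gate = softmax(gate_weights @ x)   # responsibilities, summing to 1
    return gate @ expert_outputs       # convex combination of expert outputs

# e.g., two linear experts in 2-D (weights are made-up numbers):
x = np.array([1.0, 2.0])
experts = [np.array([0.5, -0.1]), np.array([-0.3, 0.8])]
gates = np.array([[1.0, 0.0], [0.0, 1.0]])
print(moe_predict(x, experts, gates))
```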

4
Q

What is a decision tree?

A

It is a hierarchical, non-parametric model for supervised learning. The input space is partitioned into regions, and a simple model is assigned to each region; only one model is responsible at any given point in the input space.
A decision tree recursively performs binary partitioning of the input space, with the partitions described by the corresponding tree structure.
It is useful for both classification and regression.

5
Q

Explain the internal decision nodes of decision trees.

A

The decision nodes can be either univariate or multivariate. A univariate node uses a single attribute x_i and can be further divided into:
* Numeric x_i: for a binary split, test whether x_i > w_m
* Discrete x_i: the node creates one branch for each unique value of the categorical (discrete) feature.

The leaves of a decision tree also depend on the task: for classification, the leaves contain class labels; for regression, they contain numeric values.
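
A small structural sketch of how a univariate node routes an instance, covering both node types (the Node fields and route function are hypothetical names for illustration):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class Node:
    feature: Optional[int] = None                 # index of the attribute x_i tested here
    threshold: Optional[float] = None             # w_m for a numeric binary split
    branches: Optional[Dict[Any, "Node"]] = None  # discrete: one child per value
    left: Optional["Node"] = None                 # numeric: x_i > w_m is false
    right: Optional["Node"] = None                # numeric: x_i > w_m is true
    value: Any = None                             # leaf: class label or numeric estimate

def route(node: Node, x) -> Any:
    """Follow univariate tests from the root until a leaf is reached."""
    while node.value is None:
        if node.threshold is not None:            # numeric test: x_i > w_m
            node = node.right if x[node.feature] > node.threshold else node.left
        else:                                     # discrete test: branch per value
            node = node.branches[x[node.feature]]
    return node.value
```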

6
Q

How does a decision tree choose a split?

A

The goodness of a split is quantified by an impurity measure. For node m, N_m instances reach the node, and N_m^i of them belong to class C_i. For an instance that reaches m, the estimated probability of class C_i is
P(C_i | x, m) = p_m^i = N_m^i / N_m.
The node is pure if all p_m^i are either 0 or 1. A measure of the impurity is the entropy,
I_m = - \sum_{i=1}^K p_m^i \log_2 p_m^i.
If node m is pure, generate a leaf and stop; otherwise, split and continue recursively. For each attribute, calculate the impurity of the resulting split and choose the one with minimum entropy.
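
A minimal sketch of entropy-based split selection on one numeric attribute, assuming X is a NumPy feature matrix and y a label vector (function names are illustrative):

```python
import numpy as np

def entropy(labels):
    """I_m = -sum_i p_m^i log2 p_m^i, with p_m^i estimated from class counts."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_binary_split(X, y, feature):
    """Pick the threshold w_m on one numeric attribute that minimizes the
    instance-weighted entropy of the two children after the split x_i > w_m."""
    best_impurity, best_w = np.inf, None
    for w in np.unique(X[:, feature])[:-1]:       # candidate thresholds
        right = X[:, feature] > w
        n, n_right = len(y), right.sum()
        weighted = (n_right * entropy(y[right])
                    + (n - n_right) * entropy(y[~right])) / n
        if weighted < best_impurity:
            best_impurity, best_w = weighted, w
    return best_impurity, best_w
```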

7
Q

How is the goodness of a split in a regression tree measured?

A

The goodness of a split is measured by the mean squared error from the estimated value. Let
b_m(x) = 1 if x ∈ X_m (i.e., x reaches node m), 0 otherwise.
Then
E_m = (1/N_m) \sum_t (r^t - g_m)^2 b_m(x^t),
where the node's estimate is the mean of the responses reaching it,
g_m = [\sum_t b_m(x^t) r^t] / [\sum_t b_m(x^t)].
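
A direct sketch of these two formulas, assuming b_m has already selected the responses r^t reaching node m (the function name is illustrative):

```python
import numpy as np

def node_error(r):
    """E_m for the responses r^t of the instances reaching node m,
    using their mean as the node estimate g_m."""
    g_m = r.mean()                   # g_m = sum_t b_m(x^t) r^t / sum_t b_m(x^t)
    return np.mean((r - g_m) ** 2)   # E_m = (1/N_m) sum_t (r^t - g_m)^2 b_m(x^t)

# e.g., responses reaching a node:
print(node_error(np.array([1.0, 2.0, 4.0])))  # squared error around the node mean
```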

8
Q

What are the characteristics of the decision boundary of a decision tree?

A

The decision tree’s boundary can only partition the space parallel to the feature axes, which can limit its ability to model complex patterns. The boundary is influenced by earlier splits in the tree, and its shape can depend heavily on the order of splits chosen during training.

9
Q

What is inductive bias?

A

It is the set of assumptions of the learning algorithm. To introduce inductive bias, assume a hypothesis class, H, e.g., a linear function or a Gaussian model. If a model is overfitted, H is more complex than the underlying function of the data, f. If a model is underfitted, H is less complex than f.

10
Q

What is the triple trade-off?

A

There is a trade-off between
* Complexity of H, c(H)
* Training set size, N
* Generalization error, E
As N increases, E decreases. As c(H) increases, first E decreases and then E increases.
