Exam Flashcards
What are the ingredients of machine learning?
- Tasks
- Models
- Features
__________ are the output of machine learning.
Models
_________ are the problems that can be solved with machine learning.
Tasks
_________ are the workhorses of machine learning.
Features
What is the most fundamental concept in machine learning?
Generalisation
_________ __________ are solved by learning algorithms that produce _________.
Learning problems/tasks; models
In linear classification, ___________ are attributed to the features and the class is assigned based on a previously defined ______________.
weights; threshold
Match the items correctly.
- ML models
- ML algorithms
A. Coded algorithm that learns the weights from being fed the training set.
B. The output of the ML algorithm having been given the training data.
- B
- A
Overfitting is the only possible reason for poor performance on new data. True or False?
False. Sometimes it is just that the training data are not representative. A solution could be using different training data that exhibits the same characteristics.
Trying too hard to achieve a good performance on the training data can easily lead to _____________.
overfitting
What is underfitting?
Underfitting happens when the model is unable to accurately capture the relationship between the input features and the target, generating a high error rate on both the training set and unseen data.
A Bayesian classifier maintains a vocabulary of words and phrases for which statistics are collected from a training set. True or False?
True
The Bayesian classification scheme does not allow repetition if there is further evidence/information. True or False?
False
Bayes’ Rule assumes that the two pieces of evidence are independent. However, the Naïve Bayes classifier assumes ______________ independence.
conditional
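As an illustration of the Bayesian-classifier cards above, a minimal sketch in Python with scikit-learn; the tiny corpus and labels are invented purely for this example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus: word counts are the statistics collected from the
# training set, and MultinomialNB assumes the words are conditionally
# independent given the class.
train_texts = ["win money now", "cheap pills offer", "meeting at noon", "lunch tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()              # builds the vocabulary and word counts
X_train = vectorizer.fit_transform(train_texts)

clf = MultinomialNB().fit(X_train, train_labels)
print(clf.predict(vectorizer.transform(["cheap money offer"])))  # likely ['spam']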
A rule-based classifier is a __________ model.
logical
Rule-based classifiers work on a case-by-case basis. Cases can be defined by several __________ features.
nested
Effective and efficient algorithms exist for identifying the most ___________ feature combinations and organising them as rules or ________.
predictive; trees
Tasks are addressed by __________, whereas learning problems are solved by learning algorithms that produce __________.
models; models
Define learning problem.
Obtaining a mapping from training data is what constitutes a learning problem.
Mathematically, features are ___________.
functions
A single feature is not enough to build a model. True or False?
False
In unsupervised learning, there is a label space involved. True or False?
False
In classification, the output space is a set of ____________. In regression, it is a set of real ___________.
classes; numbers
Name a few supervised learning tasks addressed by predictive models.
- Classification
- Scoring and ranking
- Probability estimation
- Regression
Label noise occurs when a label is corrupted; instance noise occurs when the observed instance x’ is itself _____________.
corrupted
What is the main consequence of working with noisy data?
It is generally not advisable to try to match the training data exactly, as this may lead to overfitting the noise.
Some of the labelled data is usually set aside for ___________ a classifier.
testing
Binary classification is also called ____________ _____________.
concept learning
When the number of classes is higher than ____ the task is called multi-class classification.
2
The indicator function is 1 if its argument evaluates to true. True or False?
True
Accuracy and error rate should sum to ____.
1
Define accuracy.
Accuracy can be seen as the probability that an arbitrary instance is classified correctly.
True positives (TP) and true negatives (TN) are correctly classified ____________ and _____________, respectively.
positives; negatives
False positives (FP) and false negatives (FN) are incorrectly classified ___________ and _______________, respectively.
negatives; positives
The true positive rate (tpr) is also called _____________.
sensitivity
The true negative rate (tnr) is also called _____________.
specificity
If the test set contains equal number of positives and negatives, then what can we conclude about the accuracy of the model?
The accuracy is the average of the tpr and the tnr (and the error rate is the average of the false positive rate (fpr) and the false negative rate (fnr)).
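A quick numeric check of that relationship with scikit-learn's confusion matrix; the label vectors are invented so that positives and negatives are balanced:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 4 positives, 4 negatives
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])

# confusion_matrix for labels [0, 1] is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)                           # sensitivity
tnr = tn / (tn + fp)                           # specificity
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tpr, tnr, accuracy, (tpr + tnr) / 2)     # accuracy == (tpr + tnr) / 2 here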
In a scoring classifier, scores on which class predictions are based are computed. They indicate how likely it is that a specific class label applies. True or False?
True
If we only have 2 __________, it usually suffices to consider the score for only one of the classes.
classes
How can we turn the ranking into a classifier?
By selecting a split point in the ranking.
Define Class Probability Estimation.
It is a scoring classifier that outputs probability vectors over classes.
In scoring classifiers, we have direct access to respective probabilities of each class. True or False?
False
In Class Probability Estimation, the more frequent the instances of a certain class, the less confident we should be in our belief that a similar instance belongs to that class as well. True or False?
False
Assessing accuracy is always enough to evaluate performance on a multiple-class classifier. True or False?
False
Precision relates the correctly classified instances of a class to the total number of instances predicted to be of that class. True or False?
True
Recall relates the correctly classified instances of a class to the total number of instances whose true class is that same class. True or False?
True
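A short sketch of both definitions using scikit-learn's metrics; the labels are invented:

from sklearn.metrics import precision_score, recall_score

y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]

# Precision: of the instances predicted as 'dog', how many truly are 'dog'?
print(precision_score(y_true, y_pred, labels=["dog"], average=None))
# Recall: of the instances whose true class is 'dog', how many were found?
print(recall_score(y_true, y_pred, labels=["dog"], average=None))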
Decision trees are an ML model that cannot handle any number of classes quite naturally. True or False?
False
Some ML models, such as linear classifiers, can only handle more than two classes by combining the classification of multiple ___________ classifiers, since they are primarily designed to separate ______ classes.
binary; 2
What are the 2 options when we want to build a k-class classifier but are only able to train two-class ones?
- One-versus-rest: we train k binary classifiers, separately for each class, where C_i is treated as positive and all remaining classes are negative.
- One-versus-one: we train [k(k-1)]/2 classifiers, one for each pair of classes C_i and C_j, treating them as positive and negative, respectively (see the sketch below).
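Both schemes are available as wrappers in scikit-learn; a minimal sketch on the built-in iris data (the choice of base binary classifier is illustrative):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)   # 3 classes

# One-versus-rest: trains k = 3 binary classifiers
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
# One-versus-one: trains k(k-1)/2 = 3 binary classifiers
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)

print(len(ovr.estimators_), len(ovo.estimators_))  # 3, 3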
A regressor predicts ___________.
values
In supervised learning, we learn a mapping from instance space to output space using ____________ learning examples. There is a __________ variable in the training data, which has to be supplied with some knowledge about the true labelling function. The models are called ____________ since their outputs are either direct estimates of the _________ variable or provide further information about its most likely value.
labelled; target; predictive; target
In unsupervised learning, the goal is to learn a _____________ model. The examples are not ____________. Since it is unlikely that a ‘ground truth’ or a ‘gold standard’ is available to test the descriptive models against, evaluating descriptive models is much ________ straightforward than evaluating ____________ models. In this case, the task is to produce a descriptive model of the data.
descriptive; labelled; less; predictive
There is a separate training set for unsupervised learning models. True or False?
False
Machine Learning is the systematic study of ___________ and ____________ that improve their knowledge or performance with experience.
algorithms; systems
A model is a ____________ from the ____________ space to the output space.
mapping; instance
features = attributes = predictor variables = explanatory variables = independent variables
True or False?
True
In unsupervised learning, the task coincides with the learning problem. True or False?
True
C4.5 uses _________ _________ and ID3 uses _______________ _________.
Gain Ratio; Information Gain
CART uses the Gini Index. True or False?
True
C4.5 produces decision trees that are forced to be binary. True or False?
False
In general, the ___________ the train set, the better the capability of the model to _______________, and consequently the better its performance when classifying new (unseen) examples.
larger; generalise
Decision trees can’t handle both categorical and numerical features in the same dataset. True or False?
False
What are the main strengths of decision trees?
The fact that they can handle both categorical and numerical features in the same dataset and their interpretability.
CART requires data transformations to deal with categorical features. True or False?
True
C4.5 can deal with categorical features natively, making it transparent to the user how the different attribute types are managed by the algorithm. True or False?
True
Scikit-learn uses an optimised version of the _________ algorithm for decision trees.
CART
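A minimal sketch of that implementation in use; the dataset and split are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# CART-style tree; criterion="gini" matches the Gini Index card above
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))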
In Scikit-learn, for example, all learning algorithms are implemented to receive as input numeric values, preventing the use of certain types of features, such as categorical, without data transformations. True or False?
True
CART allows binary, categorical and numerical features; can be used for binary and multiclass classification together with regression; and learns _____________ trees. However, Scikit-Learn’s implementation does not support _____________ variables for now.
binary; categorical
Silhouette analysis can be used to study the ______________ distance between the resulting clusters and select the _______ number of clusters when no domain knowledge about the number of groups is available.
separation; best
A __________ Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores: the mean ____________ between a sample and all other points in the same class and the mean distance between a sample and all other points in the next _____________ cluster. The Silhouette Coefficient for a set of samples is given as the __________ of the Silhouette Coefficient for each sample. The score is bounded between -1 for incorrect clustering and +1 for highly ________ clustering. Scores around zero indicate _______________ clusters. The score is higher when clusters are _________ and well _______________, which relates to a standard concept of a cluster.
higher; distance; nearest; mean; dense; overlapping; dense; separated
If the ground truth labels are not known, the Calinski-Harabasz Index, also known as the ____________ ________ Criterion, can be used to evaluate the model, where a higher Calinski-Harabasz score relates to a model with __________ __________ clusters. The score is higher when clusters are _________ and well _______________, which relates to a standard concept of a cluster.
Variance Ratio; better defined; dense; separated
If the ground truth labels are not known, the Davies-Bouldin index can be used to evaluate the model, where a _______ Davies-Bouldin index relates to a model with better separation between the clusters.
This index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the ________ of the clusters themselves.
Zero is the lowest possible score. Values closer to zero indicate a better ______________.
lower; size; partition
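All three internal (label-free) metrics above are one call each in scikit-learn; a sketch on invented blob data clustered with k-means:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))          # higher (towards +1) is better
print(calinski_harabasz_score(X, labels))   # higher is better
print(davies_bouldin_score(X, labels))      # lower (towards 0) is better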
Match the items correctly.
- Homogeneity
- Completeness
- V-measure
A. Harmonic mean between homogeneity and completeness.
B. All members of a given class are assigned to the same cluster.
C. Each cluster contains only members of a single class.
- C
- B
- A
Homogeneity and Completeness range from ___ to ___.
0; 1
Given the knowledge of the ground truth class assignments of the samples, it is possible to define some intuitive metric using conditional ____________ analysis.
entropy
The contingency matrix reports the intersection _____________ for every true/predicted cluster pair. The contingency matrix provides sufficient statistics for all clustering metrics where the samples are ______________ and _______________ distributed and one doesn’t need to account for some instances not being clustered.
cardinality; independent; identically
The original Fowlkes-Mallows index (FMI) was intended to measure the similarity between two _______________ results, which is inherently an unsupervised comparison. The supervised adaptation of the Fowlkes-Mallows index can be used when the ground truth class assignments of the samples are known. The FMI is defined as the _______________ __________ of the pairwise precision and recall. The score ranges from 0 to 1. A high value indicates a good ________________ between two clusters.
clustering; geometric mean; similarity
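A sketch of the ground-truth-based clustering metrics from the last few cards, on invented label vectors:

from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score, fowlkes_mallows_score)

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print(homogeneity_score(labels_true, labels_pred))
print(completeness_score(labels_true, labels_pred))
print(v_measure_score(labels_true, labels_pred))      # harmonic mean of the two
print(fowlkes_mallows_score(labels_true, labels_pred))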
The intuition behind SVM is that rather than simply drawing a zero-width line between the classes, we can draw around each _______ a __________ of some width, up to the nearest point. The line that _______________ this margin is the one we will choose as the ___________ model.
Support vector machines are thus an example of ______________ ____________ _______________.
line; margin; maximises; optimal; maximum margin estimator
A key to this classifier’s success is that for the classification, only the ____________ of the _____________ ___________ matter, that is, any points further from the margin which are on the correct side do not _____________ the classification.
Theoretically, this is because these points do not _____________ to the loss function used to fit the model, so their position and number do not matter as long as they do not cross the margin.
position; support vectors; modify; contribute
This insensitivity to the exact behavior of distant points is one of the limitations of the SVM model. True or False?
False
In SVM, when data is not linearly separable, we would like to somehow automatically find the best ________ functions to use.
One strategy to this end is to compute a radial basis function centered at every point in the dataset, and let the SVM algorithm sift through the results. This type of basis function transformation is known as a _________ _________________, as it is based on a similarity relationship (or kernel) between each pair of points.
A potential problem with this strategy (projecting N points into N dimensions) is that it might become very computationally _______________ as N grows large.
However, because of a neat little procedure known as the kernel trick, a fit on kernel-transformed data can be done ________________ (without ever building the full N-dimensional representation of the _________ _____________). This kernel trick is built into the SVM, and is one of the reasons the method is so powerful.
basis; kernel transformation; intensive; implicitly; kernel projection
The SVM implementation has a bit of a fudge-factor which “softens” the margin: that is, it allows some of the points to creep into the __________ if that allows a __________ fit.
The hardness of the margin is controlled by a tuning parameter, most often known as ____. For very large C, the margin is ________, and points _________ lie in it. For smaller C, the margin is _______, and can grow to encompass some points.
margin; better; C; hard; cannot; softer
The optimal value of the parameter C depends on the dataset and should be tuned using ______________ cross-validation, or a similar procedure.
grid-search
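A minimal grid-search sketch; the parameter grid and dataset are illustrative (in a full protocol you would also hold out a separate test set):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.01, 0.1, 1, 10, 100],
              "gamma": [0.001, 0.01, 0.1, 1]}

# 5-fold cross-validation over every (C, gamma) combination
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)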
Linear models for regression can be characterised as _______________ _________ for which the prediction is a ______ for a single feature, a ________ when using two features, or a _______________ in higher dimensions (that is when having more features).
It is a strong assumption that our target is a linear combination of the features. But looking at one-dimensional data gives a somewhat skewed perspective.
For datasets with many features, linear models can be very _____________. In particular, if you have more features than _____________ _______ points, any target can be perfectly modeled (on the training set) as a __________ function.
regression models; line; plane; hyperplane; powerful; training data; linear
What are the differences between the different linear models for regression?
The difference between these models is how the model parameters w and b are learned from the training data, and how model complexity can be controlled.
Linear regression or ________________ ________ ___________ (OLS) is the simplest and most classic linear method for regression.
Linear regression finds the parameters w and b that minimize the _________ __________ ________ between predictions and the true regression targets on the training set, which is the average of the squared differences between the predictions and the true values. Linear regression has ____ parameters, which is a benefit, but it also has no way to control model ______________.
Ordinary Least Squares; mean squared error; no; complexity
In linear regression, for one-dimensional datasets, there is little danger of _________________, as the model is very simple (or ______________). However, with higher dimensional datasets (meaning a large number of features), linear models become more powerful and there is a higher chance of ___________________.
overfitting; restricted; overfitting
One way to control overfitting in a linear regression is to try to find a model that allows us to control complexity. True or False?
True
One of the most commonly used alternatives to standard linear regression is __________ ______________.
Ridge Regression (L2)
In __________ regression, the coefficients are chosen so that they predict well on the training data, but there is an additional constraint.
We also want the magnitude of coefficients to be as __________ as possible (close to 0). Intuitively, this means each feature should have as __________ effect on the outcome as possible (which translates to having a small slope), while still predicting well. This constraint is an example of what is called __________________.
__________________ means explicitly restricting a model to avoid overfitting.
Ridge; small; little; regularisation; Regularisation
Mathematically, Ridge penalizes the _____ norm of the coefficients, or the ___________ length of _____.
L2; Euclidean; w
Ridge is a more restricted model, so the _______ set score should be lower and the _______ set score higher (trade-off), in comparison with a linear regression model.
train; test
How much importance the Ridge Regression model places on simplicity versus train set performance can be specified by the user, using the __________ parameter. Increasing alpha forces coefficients to move more ___________ zero, which ______________ training set performance, but might help generalization.
For very small values of alpha, coefficients are ___________ restricted, and we end up with a model that resembles ________ _______________.
alpha; towards; decreases; barely; linear regression
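A sketch of the alpha trade-off on synthetic data (more features than samples, so unregularised least squares would overfit; the alpha values are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=60, n_features=100, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in (0.01, 1.0, 100.0):
    # small alpha: barely restricted, close to plain linear regression;
    # large alpha: coefficients pushed towards zero
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha, ridge.score(X_train, y_train), ridge.score(X_test, y_test))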
The Lasso penalizes the L1 norm of the coefficient vector, or in other words the sum of the absolute values of the coefficients. True or False?
True
L__ regularisation can be seen as a form of automatic feature selection. Having some coefficients be exactly ________ often makes a model _________ to interpret, and can reveal the most important features of the model.
1; zero; easier
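A sketch of that selection effect; the synthetic data has only a few informative features, and the alpha value is illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
# L1 drives most coefficients to exactly zero, leaving the important features
print(np.sum(lasso.coef_ != 0), "of", X.shape[1], "features used")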
Lasso regression is usually the first choice between the Ridge and Lasso regression models. However, if there is a large number of features and you expect only a few of them to be important, Ridge might be a better choice. True or False?
False
The formula of the linear models for classification looks very similar to the one for linear regression, but instead of just returning the weighted sum of the features, we _____________ the predicted value at _______. If the function is smaller than ________, we predict the class -1; if it is larger than ________, we predict the class +1.
threshold; zero; zero; zero
For linear models for classification, the decision boundary is a linear function of the input. In other words, a (binary) linear classifier is a classifier that separates two classes using a line, a plane or a hyperplane. True or False?
True
Linear regression and logistic regression are equivalent algorithms. True or False?
False. Despite its name, logistic regression is a classification algorithm, not a regression algorithm; linear regression predicts continuous values.
For logistic regression and SVM, lower C (more regularisation) put more emphasis on finding a coefficient vector that is close to zero. True or False?
True
Using low values of C will cause the algorithms to try to adjust to the “majority” of data points, while using a higher value of C stresses the importance that each individual data point be classified correctly. True or False?
True
Guarding against overfitting becomes increasingly important when considering fewer features. True or False?
False
Logistic Regression applies L__ regularisation by default.
2
A large alpha or a small C means _________ models. In particular for the regression models, tuning this parameter is quite important. Usually C and alpha are searched for on a ________________ scale.
simple; logarithmic
L___ can also be useful if _________________ of the model is important. As L1 will use only a few features, it is easier to explain which features are important to the model, and what the effect of these features is.
1; interpretability
Another strength of linear models is that they make it relatively easy to understand how a prediction is made, using the formulas above. Unfortunately, it is often not entirely clear why coefficients are the way they are. This is particularly true if your dataset has highly correlated features; in these cases, the coefficients might be hard to interpret. True or False?
True
Linear models often perform well when the number of features is __________ compared to the number of samples.
They are also often used on very _________ datasets, simply because other models are not feasible to train.
However, on __________ datasets, other models might yield better generalization performance.
large; large; smaller
Linear models are very ________ to train, and ________ at predicting. They scale to very large datasets and work well with __________ data. If your data consists of hundreds of thousands or millions of samples, you might want to investigate SGDClassifier and SGDRegressor, which implement even more scalable versions of the linear models described above.
fast; fast; sparse
Support vector machines (SVMs) are a set of (popular) _______________ learning methods used for classification, regression and ____________ detection.
supervised; outliers
What are the advantages of support vector machines?
Effective in high dimensional spaces. Still effective in cases where number of dimensions is greater than the number of samples.
Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, such as Linear, Polynomial, Radial Basis Functions (RBF) and Sigmoid, but it is also possible to specify custom kernels.
What are the disadvantages of support vector machines?
- If the number of features is much greater than the number of samples, the choice of Kernel functions and regularisation term is crucial to avoid overfitting.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
Using _______ values of C will cause the algorithms to try to adjust to the “majority” of data points, while using a _________ value of C stresses the importance that each individual data point be classified correctly.
low; higher
A common technique to extend a binary classification algorithm to a multi-class classification algorithm is the one-vs-rest approach.
In the one-vs-rest approach, a binary model is learned for each __________, which tries to separate this class from all of the other classes, resulting in as many binary models as there are __________. To make a prediction, all binary classifiers are run on a test point. The classifier that has the highest score on its single class “wins” and this class label is returned as prediction.
class; classes
Linear models can be quite limiting in low dimensional spaces, as lines or hyperplanes have limited _____________.
One way to make a linear model more flexible is by adding more ____________, for example by adding interactions or ______________ of the input features.
flexibility; features; polynomials
The Kernel Trick works by directly computing the _______________ (more precisely, the ________ products) of the data points for the ____________ feature representation, without ever actually computing the expansion.
distance; scalar; expanded
The Polynomial Kernel computes all possible _________________ up to a certain degree of the original features.
The Radial Basis Function (RBF) kernel, also known as the ____________ Kernel, considers (explained in a simple way) all possible polynomials of all degrees, but the importance of the features _____________ for higher degrees.
polynomials; Gaussian; decreases
The ____________ parameter controls the width of the Gaussian Kernel. It determines the scale of what it means for points to be close together. The ____ parameter is a regularization parameter similar to the linear models. It limits the importance of each point.
gamma; C
A ___________ gamma means a large radius for the Gaussian kernel, which means that many points are considered close-by. This is reflected in very smooth decision boundaries. A low value of __________ means that the decision boundary will vary slowly, which yields a model of low complexity, while a high value of gamma yields a more complex model.
low; gamma
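A sketch of gamma's effect on an RBF SVM; the growing gap between train and test accuracy at high gamma hints at overfitting (the data and values are illustrative):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in (0.1, 1, 10, 100):
    # low gamma: smooth boundary, simple model; high gamma: complex model
    svm = SVC(kernel="rbf", C=1, gamma=gamma).fit(X_train, y_train)
    print(gamma, svm.score(X_train, y_train), svm.score(X_test, y_test))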
Features having different orders of magnitude can be a problem for other models (like linear models), but it has devastating effects for the kernel SVM. True or False?
True
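A common remedy is to rescale the features before the kernel SVM, for example with MinMaxScaler in a pipeline; a minimal sketch on a built-in dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # features on very different scales
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale every feature to [0, 1] before fitting the kernel SVM
pipe = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))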
SVC implements the “one-against-one” approach. True or False?
True
What are the strengths of SVMs?
Kernelized support vector machines are very powerful models and perform very well on a variety of datasets.
SVMs allow for very complex decision boundaries, even if the data has only a few features.
SVMs work well on low-dimensional and high-dimensional data (few and many features), but don’t scale very well with the number of samples. Running on data with up to 10000 samples might work well, but working with datasets of size 100000 or more can become challenging in terms of runtime and memory usage.
What is the main downside of SVMs?
They require careful preprocessing of data and tuning of parameters. Furthermore, SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, and it might be tricky to explain the model to a non-expert. Still, it might be worth trying SVMs, particularly if all of our features represent measurements in similar units (e.g. all are pixel intensities) and are on similar scales.
Gamma and C should be adjusted together. True or False?
True
Tree models are limited to classification. True or False?
False
K-Means++’s initialisation will still pick random points, but with probability proportional to square distance from the previously assigned centroids. True or False?
True. Points that are further away will have a higher probability of being selected as starting centroids. Consequently, if there is a group of points, the probability that a point from the group will be selected also gets higher as their probabilities add up, mitigating the outlier problem of purely random initialisation.
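In scikit-learn, k-means++ is the default initialisation for KMeans; a minimal sketch on invented blob data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# init="k-means++" spreads the starting centroids out, with selection
# probability proportional to squared distance from already chosen centroids
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)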