Exam Flashcards

1
Q

Feature Transformation is the process of obtaining new features (output) from existing features (input). Which of the following is not an example of that?

(a) Shift from Cartesian to polar coordinates.
(b) Scaling the data to a specific interval as in normalizing.
(c) Computing the distance between two features with a given distance metric.
(d) Dropping some of the features in the original data set.

A

D

2
Q

Which of the following statements about Principal Component Analysis (PCA) is true?

(a) PCA is a high-dimensional clustering method.
(b) PCA is an unsupervised learning method.
(c) PCA reduces dimensionality.
(d) PCA enhances prediction performance.

A

B

3
Q
Which of the following is/are true?

(I) Using too many features might result in learning exceptions that are specific to the training data, which might not generalize to the real world.

(II) Pearson’s coefficient being 0 implies that there is no dependency between the two input variables.

(III) The death rate among intensive care patients is high. However, we cannot deduce from this that being in intensive care causes the deaths, because correlation does not imply causation.

(a) Only I
(b) I and II
(c) I and III
(d) I, II, and III

A

C

II is not right: if the variables are independent, Pearson’s correlation coefficient is 0, but the converse does not hold, because the correlation coefficient detects only linear dependencies between two variables.

https://en.wikipedia.org/wiki/Correlation_and_dependence
See Correlation and independence

OR/AND

The issue may also lie in the phrase “two input variables”: input variables are X variables, while correlation concerns an X and a Y variable.
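The point that a zero Pearson coefficient does not rule out dependency can be checked in a few lines (a sketch, assuming NumPy is available; the data is made up):

```python
import numpy as np

# y depends perfectly on x, but the relationship is quadratic, not linear.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# Pearson's r only detects linear dependence, so it comes out as 0 here.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 10))  # 0.0
```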

4
Q

Which of the following are true about k-means clustering?

(I) k refers to the size of each cluster.

(II) For a given data set, the ideal number of clusters is independent of the problem statement or relevant features.

(III) Imagine a data set has weight (in kg), size (in m³) and value (in Euros) for each parcel for a cargo carrier. It is likely that the chosen number of clusters for insurance purposes (based on value only) can be different from the number of clusters based on size (weight and volume).

(IV) If the average silhouette coefficient in a cluster is close to 1, then the points in the cluster are tightly grouped together.

(a) II, IV
(b) I, II
(c) III, IV
(d) I, II, III, IV

A

C
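Statement (IV) can be illustrated with scikit-learn’s `silhouette_score` (a sketch assuming scikit-learn and NumPy are available; the data is made up):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tightly grouped, well-separated clusters.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Tight, well-separated clusters give an average silhouette close to 1.
print(silhouette_score(X, labels))
```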

5
Q

Which of the following is required by k-means clustering?

(a) a defined distance metric
(b) number of clusters
(c) initial guess as to cluster centroids
(d) all of the above.

A

D
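All three requirements show up directly in a typical call (a sketch assuming scikit-learn and NumPy are available; the data is made up):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])

# The requirements appear as arguments: the number of clusters (n_clusters),
# an initial guess for the centroids (init), and the distance metric, which
# for standard k-means is fixed to (squared) Euclidean distance.
km = KMeans(n_clusters=2, init=np.array([[1.0, 1.0], [8.0, 8.0]]), n_init=1)
labels = km.fit_predict(X)
print(labels)  # [0 0 1 1]
```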

6
Q

Which of the following is true about feature vectors?

(a) Prediction performance improves with the number of features in the feature vector.
(b) Prediction performance worsens if the number of features included in the feature vector decreases.
(c) Prediction performance depends on the balance between too few and too many features.
(d) Prediction performance benefits from the curse of dimensionality.

A

C

7
Q

What is the result of a Principal Component Analysis transformation?

(a) A reduced set of features by linearly ranking the most important features.
(b) A new set of correlated features extracted by weighting the feature distances.
(c) A new set of features extracted by linearly recombining the original features.
(d) A reduced set of features extracted by rotating the feature matrix.

A

C
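That the PCA output is a linear recombination of the original features can be verified numerically (a sketch assuming scikit-learn and NumPy are available; the random data is made up):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

# Each new feature is a linear recombination of the original (centered)
# features, with weights given by the principal components.
Z_manual = (X - X.mean(axis=0)) @ pca.components_.T
print(np.allclose(Z, Z_manual))  # True
```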

8
Q
What is the Curse of Dimensionality?

(a) The vector distances between your instances decrease with the number of features.
(b) The number of required features grows almost exponentially with the amount of data.
(c) The amount of required data grows almost exponentially with the number of features.
(d) In high-dimensional feature space, data tends to be normally distributed.

A

C

9
Q

When using a wrapper strategy, what is the best motivation for including feature selection
as a hyperparameter of a predictive model?

(a) It helps different models select the same set of optimal features.
(b) A certain set of optimal features can be determined independently of the model.
(c) Reducing dimensions guarantees better generalization.
(d) A certain set of optimal features can be determined dependent on the model.

A

D

10
Q
You are testing various predictive algorithms and found that they are very likely to overfit due to the curse of dimensionality. Which strategy below would most likely solve this problem?

(a) Dimensionality reduction
(b) Feature selection
(c) Both (a) and (b) may solve this problem.
(d) Neither (a) nor (b) can be applied to solve this problem.
A

C

11
Q
What is the difference between dimensionality reduction and clustering?

(a) Clustering groups objects together based on their similarity in the feature space, while dimensionality reduction rotates them in the feature space.
(b) Clustering is a type of classification, while dimensionality reduction is a particular way of carrying out regression.
(c) Clustering simplifies data by grouping data points together based on their similarity in the feature space, while dimensionality reduction simplifies the data by projecting the data points onto a smaller-dimensional space.
(d) Clustering is supervised, while dimensionality reduction is unsupervised.
A

C

12
Q
Given the data points x1 = (1, 1) and x2 = (8, 8), what is the Euclidean and the cosine distance between these points?

(a) Euclidean = 9.8995, cosine = 0
(b) Euclidean = 3.3416, cosine = 0
(c) Euclidean = 9.8995, cosine = 9
(d) Euclidean = 3.3416, cosine = 9
A

A
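The answer can be verified directly (a sketch assuming NumPy is available):

```python
import numpy as np

x1 = np.array([1.0, 1.0])
x2 = np.array([8.0, 8.0])

# Euclidean distance: sqrt((8-1)^2 + (8-1)^2) = sqrt(98) ≈ 9.8995.
euclidean = np.linalg.norm(x1 - x2)

# Cosine distance: both vectors point in the same direction, so it is 0.
cosine = 1 - x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))

print(round(euclidean, 4), round(cosine, 4))  # 9.8995 0.0
```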

13
Q
For a classification task, you apply Principal Component Analysis (PCA) to a data set with 10-dimensional feature vectors. You notice that the performance on your validation set is smaller for 9 than for 10 components. Which one of the following statements applies?

(a) The classifier trained on 9 components is overfitting.
(b) The 10-th component contains information relevant for prediction.
(c) The 10 components contain considerable amounts of noise.
(d) All of the above.
A

B

14
Q
What is the main difference between the decision boundary generated by logistic regression and the decision boundary generated by a linear Support Vector Machine (SVM)?

(a) In contrast to the decision boundary of logistic regression, the decision boundary of the SVM can be nonlinear.
(b) In contrast to the decision boundary of logistic regression, the decision boundary of the SVM can be linear.
(c) In contrast to the decision boundary of logistic regression, the decision boundary of the SVM is high dimensional.
(d) In contrast to the decision boundary of logistic regression, the decision boundary of the SVM is optimal.

A

D

15
Q

Consider a multiclass e-mail classification task that tries to predict calendar categories (i.e. meeting, festival, delivery, etc.) based on the content of an e-mail. We train a Naive Bayes classifier for this task. The category festival occurs infrequently compared to the other categories. Which statement is true regarding this category?

(a) If all words in an e-mail would have equal probabilities between classes, the overall probability for this category would be lower.

(b) Regardless of the probabilities for the words in an e-mail, this category will generally always have low prediction probabilities compared to the others.

(c) For this category to get classified by the model, the e-mail would require high-frequency words to occur in an e-mail.

(d) As the probabilities for the words under this class will all be low, this category will generally always have low prediction probabilities compared to the others.

A

A

The overall probability will always be lower due to the prior: the categories do not occur equally often.

16
Q

Consider the features and prediction descriptions below. Which one of these is an example of information leakage?

(a) Predicting tweet sentiment (positive or negative) and using words such as good and bad as features.
(b) Predicting survival rates for a sinking ship and using date of birth of the passenger as one of the features.

(c) Predicting the severity of an incoming hurricane and using the money worth of
damages it caused as one of the features.

(d) Predicting daily ticket sales for a theme park and using yesterday’s amount of visitors and ticket price as one of the features.

A

C

Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know, and in turn invalidate the estimated performance of the model being constructed.

If you want to predict a hurricane and you use the damage the hurricane caused after it has passed: that is information you cannot know yet at prediction time.

17
Q

When doing regression, which step can be deemed as incorrect practice?

(a) Scaling features using standardization or normalization.
(b) Using the average target value as a baseline for model performance.
(c) Comparing different regression algorithms on the test set.
(d) Evaluating using several error metrics.

A

C

18
Q

Which description matches Random Decision Forests (RDFs)?

(a) RDFs consist of multiple identically trained decision trees.
(b) RDFs consist of multiple decision trees. After training the best tree is selected.
(c) RDFs consist of multiple decision trees each trained on subsets of features.
(d) RDFs consist of random decision trees the predictions of which are pooled.

A

C

19
Q

Which statement is true regarding Support Vector Machines (SVMs)?

(a) SVMs are less likely to tightly fit decision boundaries around class edges by using the kernel trick.
(b) SVMs can fit multiple decision boundaries by using the maximum margin between different classes.
(c) SVMs can fit linear decision boundaries for non-linearly separable problems by using the kernel trick.
(d) SVMs can fit non-linear decision boundaries by using the maximum margin between different classes.

A

C

20
Q

A k Nearest Neighbour classifier is trained on a student performance prediction task. Each
student is represented by a feature vector consisting of their grades for 8 courses (ranging between 1.0 and 10.0), their age (years), their weight (kg) and height (cm). What kind of preprocessing is least likely to help?

(a) Converting features to binary variables.
(b) Normalization.
(c) Feature selection.
(d) Removing outliers.

A

A

21
Q

A company wants to apply data mining to its client database but requires that the induced model’s most important features are as transparent as possible. Which algorithm should be applied?

(a) k Nearest Neighbours.
(b) Decision tree.
(c) Clustering.
(d) Principal Component Analysis.

A

B

22
Q

Consider a binary e-mail classification task for spam (i.e. where the goal is to detect spam e-mails). We train Naive Bayes for this task. When we take the top 200 words with the highest probability allocated by Naive Bayes under the ‘spam’ label, this means that:

(a) These words occur with high probability under the positive label, and with low
probability under the negative label.

(b) These words will be uniquely associated with the positive label.
(c) Any word amongst these words can be used as a keyword to correctly predict a positive label.
(d) These words occur with high probability under the positive label.

A

D

23
Q

Which pair of rates is used to calculate (and visualize) the ROC curve?

(a) True Positive and False Negative.
(b) True Negative and False Negative.
(c) True Negative and False Positive.
(d) True Positive and False Positive.

A

D
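This pair of rates is exactly what scikit-learn’s `roc_curve` returns (a sketch assuming scikit-learn and NumPy are available; the labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# The curve plots the False Positive Rate on the x-axis against the
# True Positive Rate on the y-axis, one point per score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)
```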

24
Q

Why does PCA not work well on the Swiss Roll dataset?

(a) Because PCA cannot deal with nonlinear manifolds.

(b) Because PCA is suffering from the curse of dimensionality.

(c) Because PCA is an unsupervised learning method.
(d) Because PCA retains a limited number of components.

A

A

25
Q

You evaluated the k-nearest neighbour classifier on two values of k, i.e., k = 1, k = 3. The train, validation, and test performances for k = 1 are: 0.7, 0.6, and 0.8, respectively. For k = 3 the performances are: 0.7, 0.7, and 0.7, respectively. Which classifier should you
select and what is its expected prediction performance on unseen data?

(a) The k = 1 classifier, expected performance is 0.7.

(b) The k = 1 classifier, expected performance is 0.8.

(c) The k = 3 classifier, expected performance is 0.7.
(d) The k = 3 classifier, expected performance is 0.8.

A

C

26
Q

A Support Vector Machine with a Radial Basis Function kernel, which has two hyperparameters C and γ (gamma), is applied to a binary classification problem. In order to increase the complexity of the decision boundary you should:

(a) Increase C, increase γ
(b) Increase C, decrease γ
(c) Decrease C, increase γ
(d) Decrease C, decrease γ

A

A
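The effect can be sketched on a toy non-linear problem (assuming scikit-learn and NumPy are available; the data and hyperparameter values are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # circular class boundary

# Larger C penalizes training errors more, larger gamma makes the RBF kernel
# more local; increasing both yields a more complex decision boundary.
simple = SVC(kernel="rbf", C=0.1, gamma=0.1).fit(X, y)
complex_ = SVC(kernel="rbf", C=100.0, gamma=10.0).fit(X, y)

print(simple.score(X, y), complex_.score(X, y))
```

The more complex model fits the training data more closely, which is exactly what also makes it more prone to overfitting.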

27
Q

Which of the characteristics below does not fit Big Data?

Value

Volume

Variety

Velocity

A

Value

28
Q

Which of the following metrics can be used in k-NN to look for neighbours?
(multiple answers possible)

  • Information Gain
  • Euclidean
  • Cosine
  • Chebyshev
A
  • Euclidean
  • Cosine
  • Chebyshev

Lecture 3 (46-50) and Video Lecture 3 (16:15 onwards, 49:38 onwards): Information Gain has nothing to do with distance, or similarity in vector spaces. The others do.
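The three distance metrics can be computed side by side (a sketch assuming NumPy is available; the vectors are made up):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.linalg.norm(a - b)               # straight-line distance
chebyshev = np.max(np.abs(a - b))               # largest per-feature difference
cosine = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based

# Information Gain, by contrast, scores features for tree splits;
# it is not a distance between points in a vector space.
print(round(euclidean, 4), chebyshev, round(cosine, 4))
```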

29
Q

Which of the following concepts are relevant when dealing with
unbalanced binary classes?
(multiple answers possible)

  • F1-score
  • Stratification
  • L2 Normalization
  • Sigmoid function
A
  • F1 score
  • Stratification
Lectures 3 & 4 and Video Lecture 3 (48:03 onwards): F1 harmonizes precision and recall, and gives a fair score taking into account the prediction of your positive class. Stratification makes sure an equal amount of minority class examples is in any of your splits, which is essential for a good evaluation.
L2 normalization normalizes your vectors, and the sigmoid function can be used as a decision boundary; therefore these have nothing to do with your class distribution.
30
Q

What is a good reason to opt for automatically inducing rules
from data rather than manually?

  • Algorithms can discover patterns in data that humans cannot.
  • Algorithms facilitate avoiding the use of expensive expert humans.
  • Algorithms make tedious human analysis unnecessary.
  • Algorithms are better than humans at interpreting complex information.
A
  • Algorithms can discover patterns in data that humans cannot.

Video Lecture 1, 41:12 onwards. Humans deal with very complex information on a daily basis. As such, algorithms will very likely never replace human elements in analysis; they might however prove more creative in discovering
patterns in data that might be of interest, which we simply do not have the time and mental capacity for.

31
Q

Which of these descriptions fit with the JSON file format?

  • JSON uses (nested) key/values to structure entries, and is often used for NoSQL databases.
  • JSON uses (nested) tags to structure entries, and is often used for NoSQL databases.
  • JSON is hierarchically structured and is often used for SQL databases.
  • JSON has a flat structure, and is often used for SQL databases.
A
  • JSON uses (nested) key/values to structure entries, and is often used for NoSQL databases.

Video Lecture 2, 21:26 onwards. JSON is hierarchical (uses nested elements), uses key / value pairs, and is therefore commonly used for NoSQL databases
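A minimal sketch of such a nested key/value record, parsed with Python’s standard `json` module (the document contents are made up):

```python
import json

# A nested key/value record, the kind of document stored in NoSQL databases.
doc = json.loads("""
{
  "user": {
    "name": "Ada",
    "orders": [{"id": 1, "total": 9.99}]
  }
}
""")

print(doc["user"]["orders"][0]["total"])  # 9.99
```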

32
Q

Which of these statements is true regarding a binary representation of text?

  • It is a very compact encoding of data.
  • Longer documents will be further away from short documents.
  • It is the only representation on which we can fit Decision Trees.
  • We can use the Jaccard coefficient on them to measure distance between any two
    vectors.
A
  • It is a very compact encoding of data.

Video Lecture 3, 11:58, 12:59, and 31:45 onwards. Binary vectors are very compact, but they do not have a sense of distance. The Jaccard coefficient captures similarity (as it looks at overlap), rather than distance. Decision Trees can also fit any other representation of data.
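The Jaccard coefficient on binary vectors is just overlap over union (a sketch; the vectors are made up):

```python
# Binary word vectors for two short documents (1 = word present).
a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1]

# Jaccard coefficient: size of the overlap divided by the size of the union.
intersection = sum(x & y for x, y in zip(a, b))
union = sum(x | y for x, y in zip(a, b))

print(intersection / union)  # 2 / 4 = 0.5
```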

33
Q

What could be a motivation for weighting neighbours in k-NN?

  • It avoids ties in a majority voting scheme.
  • It takes into account the properties of the neighbours and therefore gives better performance.
  • Uniform weights do not scale well under a logarithmic function of distance.
  • The Euclidean distance between unweighted neighbours cannot be determined.
A
  • It avoids ties in a majority voting scheme.

Video Lecture 3, 53:51 onwards. Any weighting of the neighbours avoids having ties when an equal amount of labels is present amongst the neighbours. While
this does take into account the properties of the neighbours, this does not guarantee better performance.

34
Q

Which area typically does not benefit from the application of KDD?

  • Investment.
  • Fraud Detection.
  • Manufacturing.
  • All can benefit from KDD.
A
  • All can benefit from KDD.

All are named in From Data Mining to Knowledge Discovery in Databases - Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. Generally, it’s very hard to think of an area that doesn’t benefit from KDD.

35
Q

Which of the below descriptions does not fit Cross-Validation?

  • We can use cross-validation to test the generalizability of an algorithm.
  • We can use cross-validation to avoid having to split off a test set.
  • We can use cross-validation to choose the best amongst different models.
  • We can use cross-validation to choose the best hyperparameter for a particular algorithm.
A
  • We can use cross-validation to avoid having to split off a test set.

See Cross-Validation by Payam Refaeilzadeh, Lei Tang, and Huan Liu. We can use it to test robustness, either of algorithms or models (+ hyperparameters). We always leave a test set aside!

36
Q

We have a deadline and need our optimized model’s predictions fast, what evaluation scheme do we use?

  • K-fold Cross-Validation
  • Leave-One-Out
  • Hold-Out
  • Train / test split
A
  • Hold-Out

See Evaluating Data Mining Models: A Pattern Language by Jerffeson Souza, Stan Matwin, and Nathalie Japkowicz. Hold-out is fastest and will still reduce chances of overfitting.

37
Q

When two features have a non-zero correlation, this means that:

  • These two features are not causally related.
  • It says nothing about causation.
  • It is more likely that there exists a causal relationship between these.
  • It says that there is likely a third feature causing both features.
A
  • It says nothing about causation.

OR

  • It is more likely that there exists a causal relationship between these.

Multiple answers correct: See Using and Interpreting Linear Regression and Correlation
Analyses: Some Cautions and Considerations - Connie Tompkins. Correlation > 0 means
there is some correlation between the features. While causation ≠ correlation, it does give
some information.

38
Q

Which of the following statements is true regarding regression functions?

  • The intercept can be varied to deal with variation in the dataset.
  • The highest coefficient weight is allocated to the most informative feature.
  • The more interaction features are added, the better the model will perform.
  • The bigger the bias in predictions, the higher the error on new data.
A
  • The bigger the bias in predictions, the higher the error on new data.

The intercept can compensate somewhat for bias if there is a clear offset; however, it does not account for variance. The coefficient weights are determined by the feature values, so big coefficients say nothing about the informativeness of a feature. Interaction features can improve performance, but can also cause overfitting. Bias always increases error.

39
Q

You are in charge of classifying possible fails of a nuclear reactor. Which metric do you look at?

  • Accuracy
  • Recall
  • Precision
  • F1 score
A
  • Recall

False Positives are not important; if the reactor goes boom people die. We don’t want to miss anything remotely near a fail.

F1 in this case obscures the fact that we want perfect Recall.
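The trade-off can be made concrete with hypothetical confusion-matrix counts (the numbers are made up for illustration):

```python
# Hypothetical confusion-matrix counts for the reactor-fail detector.
tp, fp, fn = 8, 5, 2

recall = tp / (tp + fn)          # fraction of real fails we caught
precision = tp / (tp + fp)       # fraction of alarms that were real
f1 = 2 * precision * recall / (precision + recall)

# The two missed fails (fn) are what matters here, so recall is the metric
# to optimize, even at the cost of extra false alarms.
print(recall, round(precision, 4), round(f1, 4))  # 0.8 0.6154 0.6957
```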

40
Q

Consider the following case: We have a dataset containing films as instances. The prediction target is the average grade both users and critics will give after the movie has been released. Our goal is to build a model that can be used before the movie releases to see what would generate a higher score (adding more budget, for example). Which of the features below should we not use?

  • Names of actors.
  • Genre of the movie.
  • Amount of reviews.
  • We should use all features.
A
  • Amount of reviews.

We don’t have access to the amount of reviews before the movie releases.

41
Q

Consider the following case: You are hired as a data scientist at a local theme park (de Efteling, for example). Naturally, they are most interested in maximizing their profits as much as possible. They recognized the potential of applying data mining techniques in their business, and have now asked you to work out a way of predicting their ticket sales. They are most interested in how the weather affects this. Which of the following concepts will determine how you split the data:

  • The feature values.
  • The target values.
  • The weather attributes.
  • The element of time.
A
  • The element of time.

Given that weather is seasonal, and follow a historic trend, as do ticket sales, sampling a realistic (artificial) future is required for the task.

42
Q

Consider the following case: You have previously been tasked to train a model predicting housing prices for low-income areas. However, the broker’s office you are working for has decided to focus only on higher-segment houses from now on. Unfortunately, because they are new to this, they only have little data. Which of the following statements is true?

  • The previous model is worthless, and should be trashed.
  • By correcting for the bias we predict, we can fix the current model for the time being.
  • Some coefficients might still hold, we should save some of them.
  • We add the new data to the old, and correct our model this way.
A
  • The previous model is worthless, and should be trashed.

We expect different distributions of both our features and our target; there is no fix for this other than starting over. Any of the other options would introduce many unnecessary errors, and we have no need for the old data (we will not be predicting those houses anymore anyway). Bias can’t be fixed while we still have the same coefficients; they will have an incorrect impact on the final predictions. Coefficients are fitted given data, and should always be considered in combination with other coefficients. As such, we can’t reuse any of them. Merging the data will still bias the model towards predicting too-low prices.