Exam Flashcards
Feature Transformation is the process of obtaining new features (output) from existing features (input). Which of the following is not an example of that?
(a) Shift from Cartesian to polar coordinates.
(b) Scaling the data to a specific interval, as in normalizing.
(c) Computing the distance between two features with a given distance metric.
(d) Dropping some of the features in the original data set.
D
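A minimal NumPy sketch (illustrative values only) contrasting the transformations in (a)-(c) with the selection in (d):

```python
import numpy as np

# Two toy feature columns over two samples (illustrative values only).
x, y = np.array([3.0, 0.0]), np.array([4.0, 2.0])

# (a) Shift from Cartesian to polar coordinates: new features from old.
r = np.hypot(x, y)          # radius
theta = np.arctan2(y, x)    # angle

# (b) Scaling to a specific interval (min-max normalization to [0, 1]).
x_norm = (x - x.min()) / (x.max() - x.min())

# (c) Distance between two features under a given metric (Euclidean).
dist = np.linalg.norm(x - y)

# (d) Dropping features keeps a subset of existing values; nothing new is
# computed, so it is feature selection rather than transformation.
print(r, theta, x_norm, dist)
```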
Which of the following statements about Principal Component Analysis (PCA) is true?
(a) PCA is a high-dimensional clustering method.
(b) PCA is an unsupervised learning method.
(c) PCA reduces dimensionality.
(d) PCA enhances prediction performance.
B
Which of the following is/are true?
(I) Using too many features might result in learning exceptions that are specific to the training data, which might not generalize to the real world.
(II) Pearson’s coefficient being 0 implies that there is no dependency between the two input variables.
(III) The death rate among intensive care patients is high. However, we can not deduce from that being in intensive care causes the deaths, because correlation does not imply causation.
(a) Only I
(b) I and II
(c) I and III
(d) I, II, and III
C
(II) is not right: if the variables are independent, Pearson’s correlation coefficient is 0, but the converse does not hold, because the correlation coefficient detects only linear dependencies between two variables.
See the section “Correlation and independence” at https://en.wikipedia.org/wiki/Correlation_and_dependence
Additionally, the statement’s flaw may lie in the phrase “two input variables”: input variables are X variables, whereas correlation is normally discussed between an X and a Y variable.
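A quick sketch of that point: below, y is fully determined by x, yet Pearson’s r comes out near 0 because the dependency is quadratic rather than linear.

```python
import numpy as np

# y is a deterministic function of x, but the relationship is quadratic,
# so Pearson's r (which measures linear dependence) is ~0.
x = np.linspace(-1, 1, 1001)
y = x ** 2
print(round(np.corrcoef(x, y)[0, 1], 6))  # ~0.0 despite full dependence
```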
Which of the following are true about k-means clustering?
(I) k refers to the size of each cluster.
(II) For a given data set, the ideal number of the clusters is independent of the problem statement or relevant features.
(III) Imagine a data set that has weight (in kg), size (in m³) and value (in Euros) for each parcel of a cargo carrier. It is likely that the chosen number of clusters for insurance purposes (based on value only) will differ from the number of clusters based on size (weight and volume).
(IV) If the average silhouette coefficient in a cluster is close to 1, then the points in the cluster are tightly grouped together.
(a) II, IV
(b) I, II
(c) III, IV
(d) I, II, III, IV
C
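A small scikit-learn sketch of (IV), on synthetic blobs (assumed data): tight, well-separated clusters give an average silhouette coefficient near 1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two tight, well-separated blobs: points within each cluster are close
# together and far from the other cluster, so the silhouette is near 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, size=(50, 2)),
               rng.normal(5.0, 0.05, size=(50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(round(silhouette_score(X, labels), 3))  # close to 1.0
```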
Which of the following is required by k-means clustering?
(a) a defined distance metric
(b) number of clusters
(c) initial guess as to cluster centroids
(d) all of the above.
D
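A sketch of how the three requirements map onto scikit-learn’s KMeans (data and seed centroids made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.random((100, 2))

# (b) number of clusters -> n_clusters
# (c) initial guess of the centroids -> init (explicit seed points here)
# (a) the distance metric is fixed: standard k-means minimizes squared
#     Euclidean distance to the centroids.
init_centroids = np.array([[0.2, 0.2], [0.8, 0.8]])
km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)
print(km.cluster_centers_)
```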
Which of the following is true about feature vectors?
(a) Prediction performance improves with the number of features in the feature vector.
(b) Prediction performance worsens if the number of features included in the feature vector decreases.
(c) Prediction performance depends on the balance between too few and too many features.
(d) Prediction performance benefits from the curse of dimensionality.
C
What is the result of a Principal Component Analysis transformation?
(a) A reduced set of features by linearly ranking the most important features.
(b) A new set of correlated features extracted by weighting the feature distances.
(c) A new set of features extracted by linearly recombining the original features.
(d) A reduced set of features extracted by rotating the feature matrix.
C
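A sketch verifying (c) with scikit-learn: the PCA transform is exactly a linear recombination of the (mean-centered) original features, with the weights stored in components_.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(2).normal(size=(200, 3))
pca = PCA(n_components=3).fit(X)

# Each new feature is a weighted sum of the original (centered) features;
# the weights are the rows of components_.
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(pca.transform(X), manual))  # True
```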
What is the Curse of Dimensionality?
(a) The vector distances between your instances decrease with the number of features.
(b) The number of required features grows almost exponentially with the amount of data.
(c) The amount of required data grows almost exponentially with the number of features.
(d) In high-dimensional feature space, data tends to be normally distributed.
C
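A rough demonstration on random uniform data (illustrative only): with a fixed number of points, pairwise distances concentrate as the dimensionality grows, so near neighbours stop being meaningfully nearer unless the amount of data grows with the dimension.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Relative spread of pairwise distances shrinks as dimensionality d grows.
rng = np.random.default_rng(3)
for d in (2, 10, 100, 1000):
    dists = pdist(rng.random((200, d)))
    print(d, round((dists.max() - dists.min()) / dists.mean(), 3))
```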
When using a wrapper strategy, what is the best motivation for including feature selection as a hyperparameter of a predictive model?
(a) It helps different models select the same set of optimal features.
(b) A certain set of optimal features can be determined independently of the model.
(c) Reducing dimensions guarantees better generalization.
(d) A certain set of optimal features can be determined dependent on the model.
D
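A sketch of the wrapper idea with scikit-learn’s SequentialFeatureSelector (Iris data used purely for illustration): the “optimal” subset depends on the model wrapped inside, matching (d).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# The wrapper repeatedly refits the wrapped model on candidate feature
# subsets, so the subset it settles on can differ per model.
X, y = load_iris(return_X_y=True)
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
    sfs = SequentialFeatureSelector(model, n_features_to_select=2).fit(X, y)
    print(type(model).__name__, sfs.get_support())
```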
You are testing various predictive algorithms and find that they are very likely to overfit due to the curse of dimensionality. Which strategy below would most likely solve this problem?
(a) Dimensionality Reduction
(b) Feature selection
(c) Both (a) and (b) may solve this problem.
(d) Neither (a) nor (b) can be applied to solve this problem.
C
What is the difference between dimensionality reduction and clustering?
(a) Clustering groups objects together based on their similarity in the feature space, while dimensionality reduction rotates them in the feature space.
(b) Clustering is a type of classification, while dimensionality reduction is a particular way of carrying out regression.
(c) Clustering simplifies data by grouping data points together based on their similarity in the feature space, while dimensionality reduction simplifies the data by projecting the data points onto a smaller-dimensional space.
(d) Clustering is supervised, while dimensionality reduction is unsupervised.
C
Given the data points x1 = (1, 1) and x2 = (8, 8), what are the Euclidean and cosine distances between these points?
(a) Euclidean = 9.8995, cosine = 0
(b) Euclidean = 3.3416, cosine = 0
(c) Euclidean = 9.8995, cosine = 9
(d) Euclidean = 3.3416, cosine = 9
A
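A worked check of answer (a) in NumPy:

```python
import numpy as np

x1, x2 = np.array([1.0, 1.0]), np.array([8.0, 8.0])

# Euclidean: sqrt(7^2 + 7^2) = sqrt(98) ~ 9.8995.
euclid = np.linalg.norm(x1 - x2)

# Cosine distance = 1 - cosine similarity; the vectors point in the same
# direction, so the similarity is 1 and the distance is 0.
cosine = 1 - x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))
print(round(euclid, 4), round(cosine, 4))  # 9.8995 0.0
```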
For a classification task, you apply Principal Component Analysis (PCA) to a data set with 10-dimensional feature vectors. You notice that the performance on your validation set is lower for 9 than for 10 components. Which one of the following statements applies?
(a) The classifier trained on 9 components is overfitting.
(b) The 10-th component contains information relevant for prediction.
(c) The 10 components contain considerable amounts of noise.
(d) All of the above.
B
What is the main difference between the decision boundary generated by logistic regression and the decision boundary generated by a linear Support Vector Machine (SVM)?
(a) In contrast to the decision boundary of logistic regression, the decision boundary of the SVM can be nonlinear.
(b) In contrast to the decision boundary of logistic regression, the decision boundary of the SVM can be linear.
(c) In contrast to the decision boundary of logistic regression, the decision boundary of the SVM is high dimensional.
(d) In contrast to the decision boundary of logistic regression, the decision boundary of the SVM is optimal.
D
Consider a multiclass e-mail classification task that tries to predict calendar categories (e.g. meeting, festival, delivery) based on the content of an e-mail. We train a Naive Bayes classifier for this task. The category festival occurs infrequently compared to the other categories. Which statement is true regarding this category?
(a) If all words in an e-mail had equal probabilities between classes, the overall probability for this category would be lower.
(b) Regardless of the probabilities for the words in an e-mail, this category will generally always have low prediction probabilities compared to the others.
(c) For this category to get classified by the model, high-frequency words would need to occur in the e-mail.
(d) As the probabilities for the words under this class will all be low, this category will generally always have low prediction probabilities compared to the others.
A
The probability will always be lower because of the prior: the categories do not occur equally often, and festival is the infrequent one.
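A tiny numerical sketch of that reasoning (the priors and per-word likelihood below are made-up illustrative values): when the likelihoods are equal across classes, only the prior separates the scores.

```python
import numpy as np

# Made-up priors and a shared per-word likelihood, for illustration only.
priors = {"meeting": 0.5, "delivery": 0.4, "festival": 0.1}
likelihood_per_word = 0.01   # identical for every class by assumption
n_words = 20

for label, prior in priors.items():
    log_posterior = np.log(prior) + n_words * np.log(likelihood_per_word)
    print(label, round(log_posterior, 2))  # only the prior differs
```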
Consider the features and prediction descriptions below. Which one of these is an example of information leakage?
(a) Predicting tweet sentiment (positive or negative) and using words such as good and bad as features.
(b) Predicting survival rates for a sinking ship and using date of birth of the passenger as one of the features.
(c) Predicting the severity of an incoming hurricane and using the monetary worth of the damage it caused as one of the features.
(d) Predicting daily ticket sales for a theme park and using yesterday’s amount of visitors and ticket price as one of the features.
C
Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know and in turn invalidate the estimated performance of the model being constructed.
If you want to predict a hurricane and you use the damage the hurricane caused after it has passed, you are using information that cannot be known yet at prediction time.
When doing regression, which step can be deemed as incorrect practice?
(a) Scaling features using standardization or normalization.
(b) Using the average target value as a baseline for model performance.
(c) Comparing different regression algorithms on the test set.
(d) Evaluating using several error metrics.
C
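A sketch of the correct practice implied by (c), on synthetic data: choose between models on a validation split, and touch the test set only once, for the final estimate.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, random_state=0)

# Compare candidates on the validation split only...
candidates = [Ridge(), RandomForestRegressor(random_state=0)]
best = min(candidates, key=lambda m: mean_absolute_error(
    y_val, m.fit(X_train, y_train).predict(X_val)))

# ...and use the test set exactly once, for the final evaluation.
print(type(best).__name__, mean_absolute_error(y_test, best.predict(X_test)))
```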
Which description matches Random Decision Forests (RDFs)?
(a) RDFs consist of multiple identically trained decision trees.
(b) RDFs consist of multiple decision trees. After training the best tree is selected.
(c) RDFs consist of multiple decision trees each trained on subsets of features.
(d) RDFs consist of random decision trees whose predictions are pooled.
C
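A sketch of (c) using scikit-learn’s RandomForestClassifier (Iris data purely for illustration): max_features controls the random feature subset considered at each split, and each tree additionally sees a bootstrap sample of the data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# max_features limits the random subset of features tried at each split;
# bootstrap=True additionally trains each tree on a resampled data subset.
rdf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                             bootstrap=True, random_state=0).fit(X, y)
print(len(rdf.estimators_))  # 100 individual decision trees
```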
Which statement is true regarding Support Vector Machines (SVMs)?
(a) SVMs are less likely to tightly fit decision boundaries around class edges by using the kernel trick.
(b) SVMs can fit multiple decision boundaries by using the maximum margin between different classes.
(c) SVMs can fit linear decision boundaries for non-linearly separable problems by using the kernel trick.
(d) SVMs can fit non-linear decision boundaries by using the maximum margin between different classes.
C
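A sketch of (c): concentric circles are not linearly separable in the original 2-D space, yet an RBF-kernel SVM separates them with a boundary that is linear in the implicit kernel feature space.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# A linear SVM fails on this non-linearly separable problem...
print(SVC(kernel="linear").fit(X, y).score(X, y))  # around 0.5
# ...while the RBF kernel separates it: the boundary is linear in the
# implicit kernel feature space, hence nonlinear in the original 2-D space.
print(SVC(kernel="rbf").fit(X, y).score(X, y))     # close to 1.0
```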
A k Nearest Neighbour classifier is trained on a student performance prediction task. Each student is represented by a feature vector consisting of their grades for 8 courses (ranging between 1.0 and 10.0), their age (years), their weight (kg) and height (cm). What kind of preprocessing is least likely to help?
(a) Converting features to binary variables.
(b) Normalization.
(c) Feature selection.
(d) Removing outliers.
A
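A sketch of why normalization matters for k-NN here (synthetic grade/height data, assumed for illustration): unscaled, the height feature dominates the Euclidean distance even though the label depends only on the grade.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
grades = rng.uniform(1.0, 10.0, 200)
heights = rng.uniform(150.0, 200.0, 200)   # cm; irrelevant to the label
X = np.column_stack([grades, heights])
y = (grades > 5.5).astype(int)             # label depends on grades only

# Unscaled, height's ~50-unit range swamps the ~9-unit grade range in the
# Euclidean distance; standardization puts the features on equal footing.
print(KNeighborsClassifier().fit(X, y).score(X, y))
print(make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X, y).score(X, y))
```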
A company wants to apply data mining to its client database but requires that the induced model’s most important features are as transparent as possible. Which algorithm should be applied?
(a) k Nearest Neighbours.
(b) Decision tree.
(c) Clustering.
(d) Principal Component Analysis.
B
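A sketch of the transparency argument for (b): a fitted scikit-learn decision tree can be printed as human-readable rules and exposes feature importances directly.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# The fitted model prints as if-then rules, and the importances show
# which features drive the splits.
print(export_text(tree, feature_names=data.feature_names))
print(tree.feature_importances_)
```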
Consider a binary e-mail classification task for spam (i.e. where the goal is to detect spam e-mails). We train Naive Bayes for this task. When we take the top 200 words with the highest probability allocated by Naive Bayes under the ‘spam’ label, this means that:
(a) These words occur with high probability under the positive label, and with low probability under the negative label.
(b) These words will be uniquely associated with the positive label.
(c) Any word amongst these words can be used as a keyword to correctly predict a positive label.
(d) These words occur with high probability under the positive label.
D
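A sketch of where those top words come from in scikit-learn (toy corpus, made up): feature_log_prob_ holds P(word | class), and ranking by the spam row says nothing about the same words’ probability under ham, which is why (a) and (b) overreach.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win money now", "free prize win", "meeting at noon", "lunch at noon"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(docs)
nb = MultinomialNB().fit(X, labels)

# Rank words by P(word | spam); this ordering ignores P(word | ham).
top = np.argsort(nb.feature_log_prob_[1])[::-1][:3]
print(vec.get_feature_names_out()[top])
```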
Which pair of rates is used to calculate (and visualize) the ROC curve?
(a) True Positive and False Negative.
(b) True Negative and False Negative.
(c) True Negative and False Positive.
(d) True Positive and False Positive.
D
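A sketch confirming (d): scikit-learn’s roc_curve returns exactly the two rates plotted in a ROC curve, the False Positive Rate (x-axis) and the True Positive Rate (y-axis).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# fpr (False Positive Rate) and tpr (True Positive Rate) are the x- and
# y-coordinates of the ROC curve, one point per threshold.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print(fpr[:5], tpr[:5])
```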
Why does PCA not work well on the Swiss Roll dataset?
(a) Because PCA cannot deal with nonlinear manifolds.
(b) Because PCA is suffering from the curse of dimensionality.
(c) Because PCA is an unsupervised learning method.
(d) Because PCA retains a limited number of components.
A
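A sketch of (a): on the Swiss roll, PCA’s best linear projection collapses the rolled-up layers onto each other, while a nonlinear manifold learner such as Isomap can unroll the sheet.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, t = make_swiss_roll(n_samples=1000, random_state=0)

# PCA finds the best *linear* 2-D projection, which overlaps the rolled-up
# layers; Isomap follows distances along the manifold and can flatten it.
X_pca = PCA(n_components=2).fit_transform(X)
X_iso = Isomap(n_components=2).fit_transform(X)
print(X_pca.shape, X_iso.shape)
```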