Lecture 11 Flashcards

Question 1

Q

What are the three main datasets used in supervised machine learning?

Answer

A

Training set, validation set, and test set.
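
As an illustration, a three-way split might be sketched as below. The function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for the example, not something the lecture prescribes.

```python
import random

def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle items and split them into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is the test set
    return train, val, test

train, val, test = split_dataset(range(10))
```

Every item lands in exactly one of the three lists, so the test set never overlaps with the data used for training or tuning.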

Question 2

Q

What is the purpose of a validation set?

Answer

A

To tune hyperparameters and compare candidate models on data not used for training, so that the test set stays untouched.

Question 3

Q

Why is it important not to ‘peek’ at the test set?

Answer

A

To ensure the model is evaluated on completely unseen data. Tuning against the test set would overfit the model to it and make the reported performance overly optimistic.

Question 4

Q

What is a feature in machine learning?

Answer

A

An attribute-value pair describing one characteristic of a data point.

Question 5

Q

What is the goal of feature selection?

Answer

A

To choose the most relevant attributes that contribute to a model’s predictive power.

Question 6

Q

What is a feature vector?

Answer

A

A numerical representation of a data point using selected features.

Question 7

Q

What is binary classification?

Answer

A

A classification problem where there are only two possible classes.

Question 8

Q

What is a probabilistic classifier?

Answer

A

A classifier that assigns probabilities to each class and selects the one with the highest probability.

Question 9

Q

What is k-Nearest Neighbour (k-NN)?

Answer

A

A classification algorithm that assigns a new data point the majority class among its k closest training points.

Question 10

Q

How does k-NN measure similarity?

Answer

A

Using distance metrics such as Euclidean distance between feature vectors.
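
A minimal sketch of k-NN with Euclidean distance, assuming small in-memory lists of points. The name `knn_predict` and the toy data are hypothetical, and ties are resolved by whichever label `Counter` happens to list first.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5)]
labels = ["a", "a", "a", "b", "b"]
near_a = knn_predict(points, labels, (1.2, 1.1), k=3)
near_b = knn_predict(points, labels, (8.5, 8.0), k=3)
```

Note that `math.dist` requires Python 3.8 or later.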

Question 11

Q

What is the role of text features in machine learning?

Answer

A

They represent text numerically so that it can be processed by learning algorithms.

Question 12

Q

What is one way to represent text features in machine learning?

Answer

A

Using a binary vector where 1 indicates a word’s presence and 0 indicates absence.
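
A small sketch of the binary representation, assuming whitespace tokenization and a fixed, hand-picked vocabulary (both are simplifications for the example):

```python
def binary_vector(text, vocabulary):
    """Return 1 for each vocabulary word present in text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

vocabulary = ["cat", "dog", "sat", "mat"]
vector = binary_vector("The cat sat on the mat", vocabulary)
# vector == [1, 0, 1, 1]
```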

Question 13

Q

What is term frequency-inverse document frequency (tf-idf)?

Answer

A

A weighting method that measures how important a word is in a document relative to a corpus.
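
One common variant can be sketched as follows. It assumes tf is the term count divided by document length and idf is log(N / df); other variants add smoothing or use a different log base, and the lecture may use a different one. Documents are assumed to be pre-tokenized lists of words.

```python
import math

def tf_idf(term, document, corpus):
    """tf-idf of term in document, relative to corpus (a list of token lists)."""
    tf = document.count(term) / len(document)
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    if df == 0:
        return 0.0  # term never appears in the corpus
    return tf * math.log(len(corpus) / df)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
score_the = tf_idf("the", corpus[0], corpus)  # in every document, so idf is 0
score_cat = tf_idf("cat", corpus[0], corpus)  # in two of three documents
score_sat = tf_idf("sat", corpus[0], corpus)  # in one document only
```

A word that appears in every document gets a weight of zero, while a word unique to one document gets the highest weight.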

Question 14

Q

What is Hamming distance?

Answer

A

A distance measure that counts the number of positions at which two equal-length feature vectors differ.
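
For instance, a sketch for two equal-length vectors (the function name is chosen for illustration):

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))

distance = hamming_distance([1, 0, 1, 1], [1, 1, 0, 1])
# distance == 2 (the second and third positions differ)
```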

Question 15

Q

What is Euclidean distance?

Answer

A

A distance measure: the straight-line distance between two points in a multi-dimensional space. A smaller distance indicates greater similarity.
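
Written out, the calculation is the square root of the summed squared differences per dimension. A sketch, assuming both points have the same number of dimensions:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

distance = euclidean_distance((0, 0), (3, 4))
# distance == 5.0 (the 3-4-5 right triangle)
```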

Question 16

Q

What is cosine similarity?

Answer

A

A metric that uses the cosine of the angle between two vectors to determine their similarity, independent of their magnitudes.
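
A sketch of the calculation: the dot product divided by the product of the vector lengths. Raising an error for zero vectors is a choice made for this example, since the angle is undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2], [2, 4])  # close to 1
orthogonal = cosine_similarity([1, 0], [0, 1])      # exactly 0
```

Vectors pointing the same way score about 1 regardless of length, which is why this measure suits documents of different sizes.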

Question 17

Q

What is the purpose of stemming in text processing?

Answer

A

To reduce words to their stem (e.g., ‘running’ and ‘runs’ both become ‘run’), so that variants of a word map to the same feature.

Question 18

Q

Why is stop word removal useful in text classification?

Answer

A

It eliminates very common words (e.g., ‘the’, ‘and’) that carry little information for distinguishing between classes.
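
A sketch of stop word removal, assuming whitespace tokenization. The `STOP_WORDS` set here is a tiny hand-made list for illustration; real systems use much longer lists.

```python
# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "and", "a", "of", "on"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

tokens = remove_stop_words("The cat and the dog sat on a mat")
# tokens == ["cat", "dog", "sat", "mat"]
```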

Question 19

Q

What is cross-validation?

Answer

A

A technique where the data is repeatedly split into training and test sets, typically k folds with each fold held out once, and the results are averaged to assess model performance.
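
The k-fold form of cross-validation might be sketched like this. The generator name `k_fold_splits` is hypothetical, and the sketch splits indices in order without shuffling, which a real experiment would usually add.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

splits = list(k_fold_splits(6, 3))
# splits[0] == ([2, 3, 4, 5], [0, 1])
```

Each sample appears in a test fold exactly once; the model would be trained and scored once per fold and the scores averaged.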

Question 20

Q

What is a confusion matrix?

Answer

A

A table that summarizes a classification model’s performance by showing true and false positives and negatives.
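
For binary classification, the four cells of the table can be counted as in this sketch. The labels 1 (positive) and 0 (negative) and the toy data are assumptions for the example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
# counts == {"TP": 3, "FP": 1, "TN": 3, "FN": 1}
```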

Question 21

Q

What is the formula for accuracy in classification?

Answer

A

(TP + TN) / (TP + TN + FP + FN), i.e. correct predictions divided by total samples.

Question 22

Q

What is precision in classification?

Answer

A

The fraction of correctly predicted positive instances out of all predicted positive instances: TP / (TP + FP).

Question 23

Q

What is recall in classification?

Answer

A

The fraction of correctly predicted positive instances out of all actual positive instances: TP / (TP + FN).
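
Accuracy, precision, and recall can all be computed from the four confusion-matrix counts. A sketch, assuming none of the denominators is zero (the counts passed in are made up for the example):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,  # correct predictions / all samples
        "precision": tp / (tp + fp),    # correct positives / predicted positives
        "recall": tp / (tp + fn),       # correct positives / actual positives
    }

metrics = classification_metrics(tp=4, fp=1, tn=3, fn=2)
# accuracy = 7/10, precision = 4/5, recall = 4/6
```

The same classifier scores differently on each metric here, which is why they are reported separately.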

Question 24

Q

What is mean squared error (MSE) used for?

Answer

A

Evaluating the performance of regression models by measuring the average squared differences between predicted and actual values.
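
A sketch of the calculation, assuming two equal-length, non-empty lists of numbers (the values are invented for the example):

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
# squared errors are 1, 0, and 4, so mse = 5/3
```

Squaring means a single large error (here the 4) dominates the average.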

Question 25

Q

What is precision@k used for?

Answer

A

Evaluating ranking tasks: it measures the fraction of the top k recommendations that are relevant.
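
A sketch of precision@k, assuming the ranked list has at least k items and that relevance is given as a set of item identifiers (the letters below are placeholder items):

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top k ranked items that are relevant."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k

score = precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e"}, k=3)
# top 3 are a, b, c; two of them are relevant, so score = 2/3
```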