Lecture 11 Flashcards
What are the three main datasets used in supervised machine learning?
Training set, validation set, and test set.
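A minimal sketch of such a three-way split, using only the standard library. The function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are illustrative assumptions, not values from the lecture.

```python
import random

def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle items and cut them into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, val, test

train, val, test = split_dataset(range(10))
```

Every item lands in exactly one of the three lists, so the test list stays untouched while the other two are used for fitting and tuning.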
What is the purpose of a validation set?
To tune hyperparameters and improve model performance without using the test set.
Why is it important not to ‘peek’ at the test set?
To keep the final evaluation on completely unseen data, so the reported performance is an honest estimate of generalization rather than one inflated by tuning against the test set.
What is a feature in machine learning?
An attribute-value pair describing one characteristic of a data point.
What is the goal of feature selection?
To choose the most relevant attributes that contribute to a model’s predictive power.
What is a feature vector?
A numerical representation of a data point using selected features.
What is binary classification?
A classification problem where there are only two possible classes.
What is a probabilistic classifier?
A classifier that assigns probabilities to each class and selects the one with the highest probability.
What is k-Nearest Neighbour (k-NN)?
A classification algorithm that assigns a data point the majority class among its k closest training points.
How does k-NN measure similarity?
Using distance metrics such as Euclidean distance between feature vectors.
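The two k-NN cards can be sketched together as follows. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the names `euclidean` and `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Order training indices by their distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # Ties are broken by whichever label Counter encountered first.
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "a", "b", "b"]
prediction = knn_predict(points, labels, (1.1, 1.0), k=3)
```

Note that k-NN has no training step beyond storing the data; all the work happens at prediction time.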
What is the role of text features in machine learning?
They convert text into numerical representations for processing by algorithms.
What is one way to represent text features in machine learning?
Using a binary vector where 1 indicates a word’s presence and 0 indicates absence.
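A small sketch of this binary representation. The vocabulary, the whitespace tokenization, and the function name `binary_vector` are assumptions made for illustration.

```python
def binary_vector(text, vocabulary):
    """Return 1 for each vocabulary term present in text, else 0."""
    words = set(text.lower().split())  # naive whitespace tokenization
    return [1 if term in words else 0 for term in vocabulary]

vocabulary = ["cat", "dog", "sat", "mat"]
vec = binary_vector("The cat sat on the mat", vocabulary)
```

Here `vec` has one position per vocabulary term, so documents of any length map to vectors of the same size.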
What is term frequency-inverse document frequency (tf-idf)?
A weighting method that measures how important a word is in a document relative to a corpus.
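A minimal tf-idf sketch. Many variants exist; this one assumes tf is the term count divided by document length and idf is log(N / df), which may differ from the exact formula used in the lecture. The function name and the toy corpus are hypothetical.

```python
import math

def tf_idf(term, doc, corpus):
    """Score term in doc (a list of tokens) against corpus (a list of docs)."""
    tf = doc.count(term) / len(doc)           # how frequent the term is in this doc
    df = sum(1 for d in corpus if term in d)  # how many docs contain the term
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)          # rarer across the corpus means larger idf
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
score_the = tf_idf("the", corpus[0], corpus)
score_cat = tf_idf("cat", corpus[0], corpus)
score_sat = tf_idf("sat", corpus[0], corpus)
```

A word that appears in every document ("the") scores zero, while a word unique to one document ("sat") scores highest.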
What is Hamming distance?
A distance measure that counts the number of positions at which two equal-length vectors differ.
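Hamming distance can be sketched in a few lines. The function name `hamming_distance` is illustrative, and raising an error on unequal lengths is an assumption about how to handle that case.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))

distance = hamming_distance([1, 0, 1, 1], [1, 1, 0, 1])
```

A larger count means the vectors are less alike, which makes this a natural fit for binary feature vectors.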
What is Euclidean distance?
A distance measure given by the straight-line distance between two points in a multi-dimensional space.
What is cosine similarity?
A metric that measures the cosine of the angle between two vectors; values closer to 1 indicate more similar directions.
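A short sketch of cosine similarity as the dot product divided by the product of the vector lengths. Returning 0.0 for an all-zero vector is a convention assumed here, not something the lecture specifies.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Because the result depends only on direction, a long document and a short one with the same word proportions score as highly similar.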
What is the purpose of stemming in text processing?
To reduce words to a common stem (e.g., ‘running’ becomes ‘run’), so that variants of the same word map to a single feature.
Why is stop word removal useful in text classification?
It eliminates common words (e.g., ‘the’, ‘and’) that do not contribute meaningful information.
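A minimal stop-word filter. The tiny `STOP_WORDS` set is a hypothetical stand-in for the much longer lists used in practice, and the tokenization is a plain whitespace split.

```python
STOP_WORDS = {"the", "and", "a", "of", "on"}  # illustrative list only

def remove_stop_words(text):
    """Lowercase text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

tokens = remove_stop_words("The cat and the dog")
```

The surviving tokens are the ones that go on to become features.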
What is cross-validation?
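A technique where the data is split into k folds; each fold is held out in turn for evaluation while the model is trained on the remaining folds, and the results are averaged to assess model performance.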
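One common form, k-fold cross-validation, can be sketched as an index generator. The name `k_fold_indices` is hypothetical, and this version assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size

folds = list(k_fold_indices(10, 3))
```

Each sample is held out exactly once, so every data point contributes to both training and evaluation across the folds.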
What is a confusion matrix?
A table that summarizes a classification model’s performance by counting true positives, false positives, true negatives, and false negatives.
What is the formula for accuracy in classification?
(TP + TN) / (TP + TN + FP + FN), i.e. correct predictions divided by total samples.
What is precision in classification?
The fraction of correctly predicted positive instances out of all predicted positive instances.
What is recall in classification?
The fraction of correctly predicted positive instances out of all actual positive instances.
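The confusion matrix, accuracy, precision, and recall cards can be tied together with a small worked example. The function name `confusion_counts` and the label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification task."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
```

With these toy labels the counts are TP = 3, FP = 1, TN = 3, FN = 1, so all three metrics come out to 0.75. Real code would also need to guard against a zero denominator.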
What is mean squared error (MSE) used for?
Evaluating the performance of regression models by measuring the average squared differences between predicted and actual values.
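A minimal MSE sketch; the function name and the sample values are illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

The squared errors here are 1, 0, and 4, so the result is 5/3. Squaring means one large miss is penalized more than several small ones.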
What is precision@k used for?
Evaluating ranking tasks by measuring the fraction of relevant items among the top k recommendations.
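A short precision@k sketch. The function name and the item lists are hypothetical, and dividing by k even when fewer than k items are returned is an assumed convention.

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top-k ranked items that are relevant."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k

ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e"}
p_at_3 = precision_at_k(ranked, relevant, 3)
```

Two of the top three items ("a" and "c") are relevant, so precision@3 is 2/3. Items ranked below position k have no effect on the score.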