ML assignment 1: Theory part Flashcards
In the context of machine learning, what is classification?
The process of grouping data into different subsets
The process of predicting a continuous output
The process of assigning labels to data points
The process of reducing the dimensionality of data
The process of assigning labels to data points
What do we mean by the term feature?
A characteristic or a property of a machine learning model
An individual measurable property or characteristic of a phenomenon being observed
The set of predictions made by a machine learning model
The type of machine learning algorithm used for a project
An individual measurable property or characteristic of a phenomenon being observed
Match the task with the type of learning involved:
Self-driving car
Image classification
Customer or market segmentation
Types of learning:
Supervised learning
Unsupervised learning
Reinforcement learning
Self-driving car - Reinforcement learning
Image classification - Supervised learning
Customer or market segmentation - Unsupervised learning
Training a model is the process of
Finding relevant data points
Finding optimal model parameters
Estimating model performance
Deploying the model to users
Finding optimal model parameters
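For illustration (not part of the original question), a minimal sketch of what training looks like in scikit-learn; the choice of LogisticRegression on the iris data is just an example:
# Sketch: fit() is the training step where the optimal parameters are found
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
model.fit(X, y)        # training: the model parameters are estimated from the data
print(model.coef_)     # the learned parameters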
Select the properties of a dataset that can pose problems for a machine learning project:
The dataset contains very many features
The dataset has very few data points
The dataset contains only numerical values
The dataset contains private, personal information
The dataset was downloaded from the Internet
The dataset has very few data points
The dataset contains private, personal information
What are reasonable methods of handling missing or corrupted data in a dataset?
Remove data points where values are missing
Remove entire features where data are missing
Replace missing values with values from a neighboring feature
Replace missing values with the mean or the median of the feature (computed from the training set)
Replace missing values with the mean or the median of the feature (computed from the test set)
Remove data points where values are missing
Remove entire features where data are missing
Replace missing values with the mean or the median of the feature (computed from the training set)
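A sketch of the last strategy, assuming scikit-learn's SimpleImputer (the toy arrays are made up):
# Sketch: the mean is computed on the training set only and then reused on the test set
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 4.0]])

imputer = SimpleImputer(strategy='mean')
X_train_filled = imputer.fit_transform(X_train)   # means computed from the training set
X_test_filled = imputer.transform(X_test)         # the same training-set means are reused here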
A dataset contains a feature named “has_computer”, where the values can be “yes”, “no”, and “unknown”. What is the best strategy for processing this feature?
No processing needed, the ML algorithm will figure things out.
Text values can’t be input to an ML algorithm, so the feature must be removed.
The text strings “yes” and “no” should be converted to the binary values True and False, and data points with “unknown” should be removed.
The text should be converted to the categorical values 1, 2, and 3.
The text should be converted to the categorical values 1, 2, and 3.
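In practice this can be done with a simple mapping from text to integer categories (a sketch; the values below are made up):
# Sketch: encode the three text values as categorical integers
mapping = {'yes': 1, 'no': 2, 'unknown': 3}
has_computer = ['yes', 'no', 'unknown', 'yes']
has_computer_encoded = [mapping[v] for v in has_computer]
print(has_computer_encoded)    # [1, 2, 3, 1]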
We want to analyse the iris dataset, and have done the following to get the data as a numpy array named X:
>>> from sklearn.datasets import load_iris
>>> dataset = load_iris()
>>> X = dataset['data']
>>> type(X)
<class 'numpy.ndarray'>
How do we print out all the values of the second column (second feature) of this array?
print(X[1, :])
print(X[2, :])
print(X[:, 1])
print(X[:, 2])
print(X[:, 1])
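To see why this is the right slice, the shapes of the two kinds of indexing can be checked in the same session (the iris data has 150 data points and 4 features):
>>> X.shape            # 150 data points, 4 features
(150, 4)
>>> X[:, 1].shape      # the whole second column: one value per data point
(150,)
>>> X[1, :].shape      # by contrast, row indexing gives one data point with its 4 features
(4,)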
In a binary classification problem, what does the confusion matrix show?
The number of data points in the training and the test set
The correlation between the different features
The mean-squared error between predictions and true values of the classes
The number of true positives, false positives, true negatives, and false negatives
The number of true positives, false positives, true negatives, and false negatives
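A minimal sketch using scikit-learn's confusion_matrix (the labels below are made up):
# Sketch: rows are the true classes, columns are the predicted classes
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(confusion_matrix(y_true, y_pred))
# [[2 1]    -> 2 true negatives, 1 false positive
#  [1 2]]   -> 1 false negative, 2 true positives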
We train a model to classify pictures of road vehicles into the following types: Trucks, personal cars, and taxis. The training dataset contains 250 pictures of trucks, 1750 pictures of personal cars, and 7 pictures of taxis. Why may accuracy not be the best metric for evaluating the model’s performance?
Accuracy can be misleading when it comes to performance on the minority classes
Accuracy can only be computed for binary classifiers
The interpretation of accuracy in multiclass classification is unclear
Accuracy applies only to regression tasks
Accuracy can be misleading when it comes to performance on the minority classes
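A quick back-of-the-envelope check: a useless classifier that always predicts "personal car" already gets a high accuracy on this dataset:
# Sketch: accuracy of a classifier that always answers "personal car"
n_trucks, n_cars, n_taxis = 250, 1750, 7
accuracy = n_cars / (n_trucks + n_cars + n_taxis)
print(accuracy)    # about 0.87, even though it never detects a truck or a taxi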
Increasing the threshold of a binary classifier is likely to produce which of the following effects?
False positives increase
False positives decrease
False positives and false negatives both increase
False positives and false negatives both decrease
False positives decrease
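A sketch of how the threshold works (the probabilities and thresholds below are illustrative):
# Sketch: a higher threshold means fewer points are predicted positive,
# so false positives go down (while false negatives go up)
import numpy as np

probabilities = np.array([0.2, 0.55, 0.7, 0.9])   # made-up predicted probabilities
print(probabilities >= 0.5)    # default threshold: [False  True  True  True]
print(probabilities >= 0.8)    # higher threshold:  [False False False  True]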
We want to develop a classifier that detects if students are cheating on an exam. Since we don’t want to wrongfully accuse a student of cheating, the classifier should keep false positives to a minimum. Which metric is most important to pay attention to in this case?
Accuracy
Precision
Recall
Mean squared error
Precision
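A short sketch of the reasoning (made-up labels), using scikit-learn's precision_score: precision measures how often a "cheating" flag is actually correct:
# Sketch: precision = TP / (TP + FP), recall = TP / (TP + FN)
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 0]    # 1 = actually cheating
y_pred = [0, 1, 0, 1, 0, 0]    # 1 = flagged as cheating
print(precision_score(y_true, y_pred))   # 0.5 - half of the accusations were wrong
print(recall_score(y_true, y_pred))      # 0.5 - and half of the cheaters were missed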
What do we call a dataset where the majority of data points belong to one class, making classification difficult?
Noisy dataset
Asymmetric dataset
Imbalanced dataset
Sparse dataset
Imbalanced dataset
What does the term overfitting refer to?
When a dataset has high dimensionality
When a model shows equal performance on the training and the testing datasets
When a model has higher recall than precision
When a model performs well on training data but fails to generalise to new data
When a model performs well on training data but fails to generalise to new data
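Overfitting usually shows up as a gap between training and test performance; a sketch of how to check this (the model and split are just an example):
# Sketch: compare the score on the training data with the score on held-out data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)
print(model.score(X_train, y_train))   # typically 1.0: the tree fits the training data perfectly
print(model.score(X_test, y_test))     # lower on unseen data: the generalisation gap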
What is the purpose of using cross-validation in model evaluation?
To reduce the training time of the model
To ensure that data processing is applied uniformly to all data points
To remove the need for different training and test sets
To evaluate the model’s performance more reliably using multiple dataset splits
To evaluate the model’s performance more reliably using multiple dataset splits
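A minimal sketch with scikit-learn's cross_val_score (the model and data are just an example):
# Sketch: 5-fold cross-validation gives five performance estimates instead of one
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # a more reliable overall estimate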