Unsupervised learning and model evaluation Flashcards by Marcus Hellberg

What is the main goal of unsupervised learning?

a) Predicting future outcomes

b) Grouping or finding patterns in data without labels

c) Testing hypotheses with known outputs

d) Optimizing supervised algorithms

b) Grouping or finding patterns in data without labels

What is cluster analysis?

a) A statistical method for finding correlations between variables

b) The process of partitioning data into subsets based on similarity

c) A supervised learning technique for predicting outcomes

d) A method for data cleaning

b) The process of partitioning data into subsets based on similarity

In what applications is clustering commonly used?

a) Fraud detection, image recognition, and customer segmentation

b) Regression tasks and time-series analysis

c) Hyperparameter tuning for machine learning models

d) Feature engineering for supervised tasks

a) Fraud detection, image recognition, and customer segmentation

Clustering is an unsupervised learning technique that groups data points into clusters based on their similarity. It is commonly used in tasks where labels are not provided, and the goal is to identify inherent patterns or groupings in the data.

What is a partitioning clustering method?

a) A hierarchical decomposition of data into clusters

b) Dividing data into a predefined number of non-overlapping clusters

c) Using density measures to find clusters of arbitrary shapes

d) Grouping based on sequential patterns in time-series data

b) Dividing data into a predefined number of non-overlapping clusters

What type of clustering method is k-means?

a) Density-based clustering

b) Grid-based clustering

c) Centroid-based partitioning

d) Hierarchical clustering

c) Centroid-based partitioning

What is the first step in the k-means clustering algorithm?

a) Calculate the distances between all data points

b) Assign data points randomly to clusters

c) Select k initial centroids from the dataset

d) Measure the density of each cluster

c) Select k initial centroids from the dataset

How does k-means determine cluster membership for a data point?

a) By assigning it to the closest centroid

b) By checking its density within a neighborhood

c) Based on predefined labels

d) Using hierarchical splitting of the dataset

a) By assigning it to the closest centroid
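
The k-means steps from the last few cards (pick k initial centroids, assign each point to the closest centroid, update the centroids) can be sketched roughly as below. This is a minimal illustration in plain Python, not a reference implementation: the function name `kmeans`, the fixed iteration count, the seed, and the toy points are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, iterations=10, seed=0):
    """Minimal k-means sketch for a list of numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Step 1: select k initial centroids from the dataset itself.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Step 2: assign every point to its closest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random initial centroids; only the grouping itself is meaningful.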

What is a major limitation of k-means clustering?

a) It requires labeled data for training

b) It struggles with high-dimensional data and outliers

c) It only works for binary classification problems

d) It is computationally too slow for small datasets

b) It struggles with high-dimensional data and outliers

What is frequent pattern mining?

a) Discovering associations and correlations in a dataset

b) A method to predict the next event in a sequence

c) Grouping data points into clusters

d) A supervised learning approach for regression

a) Discovering associations and correlations in a dataset

What is an example of a frequent pattern?

a) Clustering customers by demographics

b) Milk and bread frequently bought together in transactions

c) Predicting house prices based on features

d) Finding the best split in a decision tree

b) Milk and bread frequently bought together in transactions

What does “support” indicate in market basket analysis?

a) The number of items in a cluster

b) The fraction of transactions containing a specific itemset

c) The probability of an itemset occurring given another itemset

d) The number of times an item appears in the dataset

b) The fraction of transactions containing a specific itemset

What does “lift” measure in association rules?

a) The total number of transactions in the dataset

b) The strength of an association relative to its random occurrence

c) The distance between clusters

d) The time complexity of the rule-mining algorithm

b) The strength of an association relative to its random occurrence

What is association rule mining?

a) Grouping items in a dataset into clusters

b) Predicting future sales trends

c) Finding relationships between items in transactional data

d) Labeling data points for supervised learning

c) Finding relationships between items in transactional data

Which metrics are commonly used to evaluate association rules?

a) Accuracy and precision

b) Support, confidence, and lift

c) Variance and standard deviation

d) Recall and specificity

b) Support, confidence, and lift
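
To make the three metrics concrete, here is a small sketch that computes them for the rule milk -> bread. The five transactions and the helper names `support`, `confidence`, and `lift` are made up for illustration; confidence is taken as support(A and B) / support(A), and lift as confidence / support(B).

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

s = support({"milk", "bread"}, transactions)       # 2 of 5 transactions
c = confidence({"milk"}, {"bread"}, transactions)  # 2 of the 3 milk transactions
l = lift({"milk"}, {"bread"}, transactions)
print(round(s, 3), round(c, 3), round(l, 3))
```

A lift above 1 suggests the two items occur together more often than they would if they were independent; a lift near 1 suggests no association.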

What is the Apriori algorithm designed for?

a) Clustering high-dimensional datasets

b) Mining frequent itemsets in a dataset

c) Optimizing the parameters of a regression model

d) Predicting sequential patterns in time-series data

b) Mining frequent itemsets in a dataset

What is a key concept of the Apriori algorithm?

a) All subsets of a frequent itemset must also be frequent

b) The centroids of clusters are updated iteratively

c) Patterns are evaluated using gradient descent

d) Clusters are assigned based on density

a) All subsets of a frequent itemset must also be frequent
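
The property in this answer is what lets Apriori prune candidates: a size-k itemset is only worth counting if all of its (k-1)-subsets were already found frequent. A simplified sketch follows, assuming transactions are Python sets; the function name `apriori`, the toy data, and the 0.4 threshold are illustrative assumptions, and real implementations are considerably more optimized.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for all frequent itemsets (simplified sketch)."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {s: support(s) for s in current}
    k = 2
    while current:
        # Join step: build size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: discard a candidate if any (k-1)-subset is not frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count support only for the surviving candidates.
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

# Hypothetical transactions and threshold.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]
frequent = apriori(transactions, min_support=0.4)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With this data the only frequent pair is milk and bread; the pairs involving eggs fall below the threshold, so no size-3 candidate is ever generated.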

In which type of machine learning are labels NOT provided?

a) Supervised learning

b) Unsupervised learning

c) Reinforcement learning

d) None of the above

b) Unsupervised learning

What is reinforcement learning?

a) Learning patterns without feedback

b) Learning by interacting with an environment and receiving rewards or penalties

c) Optimizing a decision tree

d) Training neural networks with labeled data

b) Learning by interacting with an environment and receiving rewards or penalties

What is overfitting in machine learning?

a) When a model performs well on test data but poorly on training data

b) When a model learns training data too well, including noise, and performs poorly on new data

c) When a model cannot learn patterns from the training data

d) A method to optimize models for higher accuracy

b) When a model learns training data too well, including noise, and performs poorly on new data

What is underfitting?

a) A model trained too long on training data

b) A model failing to capture patterns in training data, leading to poor predictions

c) A high-variance issue in machine learning models

d) A model performing well on all datasets

b) A model failing to capture patterns in training data, leading to poor predictions

What does “goodness of fit” measure?

a) How closely the model matches the ground truth or true values

b) The amount of noise in the data

c) The time taken to train a model

d) The complexity of the model architecture

a) How closely the model matches the ground truth or true values

What is training data?

a) Data used to evaluate model performance

b) Data used to build and train a machine learning model

c) A subset of test data

d) Data used only for cross-validation

b) Data used to build and train a machine learning model

What is testing data?

a) Data used to fine-tune the hyperparameters

b) A subset of training data

c) Data used to assess a trained model’s performance on unseen data

d) Data used for exploratory data analysis

c) Data used to assess a trained model’s performance on unseen data
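
The split between training and testing data from the last two cards can be sketched as a simple hold-out split. The helper name `train_test_split`, the 25% test fraction, and the seed are assumptions chosen for the example, not values from the deck.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-in for 20 data records
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fitted on `train` only, so that its score on `test` reflects performance on data it has not seen.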

What is a balanced dataset?

a) A dataset with missing values evenly distributed

b) A dataset where the number of samples in each class is roughly equal

c) A dataset containing only numeric features

d) A dataset with equal amounts of training and testing data

b) A dataset where the number of samples in each class is roughly equal

Why is an imbalanced dataset a challenge for machine learning?

a) It reduces the dataset size

b) It may bias the model toward the majority class, reducing predictive performance for the minority class

c) It makes preprocessing impossible

d) It invalidates the evaluation metrics

b) It may bias the model toward the majority class, reducing predictive performance for the minority class
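
A quick illustration of why this matters, using made-up numbers: with a 95/5 class split, a model that always predicts the majority class looks accurate while finding none of the minority samples.

```python
# Hypothetical labels: 95 majority-class (0) and 5 minority-class (1) samples.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_hits / y_true.count(1)
print(accuracy, minority_recall)
```

Accuracy comes out at 0.95 even though recall on the minority class is 0.0.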

What is model validation?

a) A process to assess how well a model performs on unseen data

b) A method to optimize dataset size

c) Training the model with random weights

d) Adjusting hyperparameters to improve accuracy

a) A process to assess how well a model performs on unseen data

What is cross-validation?

a) Dividing the data into multiple subsets to train and test the model iteratively

b) A technique to detect overfitting

c) A preprocessing step for balancing data

d) A method to identify noise in datasets

a) Dividing the data into multiple subsets to train and test the model iteratively
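
One common form of this is k-fold cross-validation, where each subset (fold) serves as the test set exactly once. The sketch below only generates the index splits; the function name `k_fold_indices` is hypothetical, and it uses contiguous folds without shuffling, which is a simplification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

A model would be trained and scored once per fold, and the per-fold scores averaged to give the cross-validated estimate.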

What is accuracy in machine learning?

a) The fraction of correct predictions out of all predictions made

b) The ability to identify only positive samples

c) The measure of a model's robustness to noise

d) The trade-off between recall and precision

a) The fraction of correct predictions out of all predictions made

What is precision?

a) The fraction of correctly identified positive instances among all predicted positive instances

b) The ability to correctly identify negative instances

c) The fraction of true negatives in the data

d) The ratio of false negatives to true negatives

a) The fraction of correctly identified positive instances among all predicted positive instances

What does the F1 score combine?

a) Accuracy and recall

b) Recall and precision

c) Precision and training error

d) Accuracy and noise reduction

b) Recall and precision
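
Accuracy, precision, recall, and F1 can all be computed from the counts of true positives, false positives, and false negatives. The sketch below assumes binary labels with 1 as the positive class and takes F1 as the harmonic mean of precision and recall; the function name and the example labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when a count is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical ground truth and predictions: 3 TP, 1 FN, 1 FP, 5 TN.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.8 while precision, recall, and F1 are each 0.75, which shows that accuracy and the positive-class metrics can differ on the same predictions.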

What is bias in machine learning?

a) A model’s ability to perform consistently across datasets

b) A systematic error introduced by oversimplified assumptions in a model

c) A variance in data points

d) The effect of noise in data

b) A systematic error introduced by oversimplified assumptions in a model

What is concept drift?

a) Changes in the model’s performance over different datasets

b) A shift in the relationship between features and labels over time

c) Variance in data labeling methods

d) Noise reduction during training

b) A shift in the relationship between features and labels over time