Unsupervised learning and model evaluation Flashcards
What is the main goal of unsupervised learning?
a) Predicting future outcomes
b) Grouping or finding patterns in data without labels
c) Testing hypotheses with known outputs
d) Optimizing supervised algorithms
b) Grouping or finding patterns in data without labels
What is cluster analysis?
a) A statistical method for finding correlations between variables
b) The process of partitioning data into subsets based on similarity
c) A supervised learning technique for predicting outcomes
d) A method for data cleaning
b) The process of partitioning data into subsets based on similarity
In what applications is clustering commonly used?
a) Fraud detection, image recognition, and customer segmentation
b) Regression tasks and time-series analysis
c) Hyperparameter tuning for machine learning models
d) Feature engineering for supervised tasks
a) Fraud detection, image recognition, and customer segmentation
Clustering is an unsupervised learning technique that groups data points into clusters based on their similarity. It is commonly used in tasks where labels are not provided, and the goal is to identify inherent patterns or groupings in the data.
What is a partitioning clustering method?
a) A hierarchical decomposition of data into clusters
b) Dividing data into a predefined number of non-overlapping clusters
c) Using density measures to find clusters of arbitrary shapes
d) Grouping based on sequential patterns in time-series data
b) Dividing data into a predefined number of non-overlapping clusters
What type of clustering method is k-means?
a) Density-based clustering
b) Grid-based clustering
c) Centroid-based partitioning
d) Hierarchical clustering
c) Centroid-based partitioning
What is the first step in the k-means clustering algorithm?
a) Calculate the distances between all data points
b) Assign data points randomly to clusters
c) Select k initial centroids from the dataset
d) Measure the density of each cluster
c) Select k initial centroids from the dataset
How does k-means determine cluster membership for a data point?
a) By assigning it to the closest centroid
b) By checking its density within a neighborhood
c) Based on predefined labels
d) Using hierarchical splitting of the dataset
a) By assigning it to the closest centroid
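The k-means steps covered by the cards above (pick k initial centroids, assign each point to the closest centroid, then update the centroids) can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name `kmeans`, the toy `points`, the fixed iteration count, and the choice of the first k points as initial centroids are all assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=10):
    """Toy k-means on a list of (x, y) tuples (illustrative sketch only)."""
    # Step 1: select k initial centroids from the dataset.
    # Assumption: take the first k points; real implementations usually pick them randomly.
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Step 2: assign every point to its closest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of the points assigned to it.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical toy data: two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Because the mean is sensitive to extreme values, a single far-away point can pull a centroid toward it, which is one way to see the outlier limitation mentioned in the next card.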
What is a major limitation of k-means clustering?
a) It requires labeled data for training
b) It struggles with high-dimensional data and outliers
c) It only works for binary classification problems
d) It is computationally too slow for small datasets
b) It struggles with high-dimensional data and outliers
What is frequent pattern mining?
a) Discovering associations and correlations in a dataset
b) A method to predict the next event in a sequence
c) Grouping data points into clusters
d) A supervised learning approach for regression
a) Discovering associations and correlations in a dataset
What is an example of a frequent pattern?
a) Clustering customers by demographics
b) Milk and bread frequently bought together in transactions
c) Predicting house prices based on features
d) Finding the best split in a decision tree
b) Milk and bread frequently bought together in transactions
What does “support” indicate in market basket analysis?
a) The number of items in a cluster
b) The fraction of transactions containing a specific itemset
c) The probability of an itemset occurring given another itemset
d) The number of times an item appears in the dataset
b) The fraction of transactions containing a specific itemset
What does “lift” measure in association rules?
a) The total number of transactions in the dataset
b) The strength of an association relative to its random occurrence
c) The distance between clusters
d) The time complexity of the rule-mining algorithm
b) The strength of an association relative to its random occurrence
What is association rule mining?
a) Grouping items in a dataset into clusters
b) Predicting future sales trends
c) Finding relationships between items in transactional data
d) Labeling data points for supervised learning
c) Finding relationships between items in transactional data
Which metrics are commonly used to evaluate association rules?
a) Accuracy and precision
b) Support, confidence, and lift
c) Variance and standard deviation
d) Recall and specificity
b) Support, confidence, and lift
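The three metrics can be computed directly from their usual definitions. The sketch below uses a made-up list of transactions; the helper names `support`, `confidence`, and `lift` are chosen for the example and do not come from any particular library.

```python
# Hypothetical toy transactions, each a set of purchased items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how often the consequent occurs on its own."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"milk", "bread"}))       # rule milk -> bread: joint support
print(confidence({"milk"}, {"bread"}))
print(lift({"milk"}, {"bread"}))
```

A lift above 1 suggests the items appear together more often than independence would predict; a lift near 1 suggests no real association.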
What is the Apriori algorithm designed for?
a) Clustering high-dimensional datasets
b) Mining frequent itemsets in a dataset
c) Optimizing the parameters of a regression model
d) Predicting sequential patterns in time-series data
b) Mining frequent itemsets in a dataset
What is a key concept of the Apriori algorithm?
a) All subsets of a frequent itemset must also be frequent
b) The centroids of clusters are updated iteratively
c) Patterns are evaluated using gradient descent
d) Clusters are assigned based on density
a) All subsets of a frequent itemset must also be frequent
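The subset property is what lets Apriori prune candidates: an itemset is only worth counting if all of its smaller subsets were already found frequent. The following is a simplified level-wise sketch under that idea, not an optimized implementation; the function name `apriori`, the `min_support` threshold, and the toy transactions are assumptions for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support is at least min_support (simplified sketch)."""
    n = len(transactions)

    def is_frequent(itemset):
        return sum(itemset <= t for t in transactions) / n >= min_support

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = list(current)
    size = 2
    while current:
        previous = set(current)
        # Build candidates of the next size by joining frequent itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune: drop any candidate with an infrequent (size - 1)-subset.
        candidates = [
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, size - 1))
        ]
        # Count support only for the candidates that survived pruning.
        current = [c for c in candidates if is_frequent(c)]
        frequent.extend(current)
        size += 1
    return frequent

# Hypothetical toy transactions.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]
frequent_itemsets = apriori(baskets, min_support=0.3)
print(frequent_itemsets)
```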
In which type of machine learning are labels NOT provided?
a) Supervised learning
b) Unsupervised learning
c) Reinforcement learning
d) None of the above
b) Unsupervised learning
What is reinforcement learning?
a) Learning patterns without feedback
b) Learning by interacting with an environment and receiving rewards or penalties
c) Optimizing a decision tree
d) Training neural networks with labeled data
b) Learning by interacting with an environment and receiving rewards or penalties
What is overfitting in machine learning?
a) When a model performs well on test data but poorly on training data
b) When a model learns training data too well, including noise, and performs poorly on new data
c) When a model cannot learn patterns from the training data
d) A method to optimize models for higher accuracy
b) When a model learns training data too well, including noise, and performs poorly on new data
What is underfitting?
a) A model trained too long on training data
b) A model failing to capture patterns in training data, leading to poor predictions
c) A high-variance issue in machine learning models
d) A model performing well on all datasets
b) A model failing to capture patterns in training data, leading to poor predictions
What does “goodness of fit” measure?
a) How closely the model matches the ground truth or true values
b) The amount of noise in the data
c) The time taken to train a model
d) The complexity of the model architecture
a) How closely the model matches the ground truth or true values
What is training data?
a) Data used to evaluate model performance
b) Data used to build and train a machine learning model
c) A subset of test data
d) Data used only for cross-validation
b) Data used to build and train a machine learning model
What is testing data?
a) Data used to fine-tune the hyperparameters
b) A subset of training data
c) Data used to assess a trained model’s performance on unseen data
d) Data used for exploratory data analysis
c) Data used to assess a trained model’s performance on unseen data
What is a balanced dataset?
a) A dataset with missing values evenly distributed
b) A dataset where the number of samples in each class is roughly equal
c) A dataset containing only numeric features
d) A dataset with equal amounts of training and testing data
b) A dataset where the number of samples in each class is roughly equal
Why is an imbalanced dataset a challenge for machine learning?
a) It reduces the dataset size
b) It may bias the model toward the majority class, reducing predictive performance for the minority class
c) It makes preprocessing impossible
d) It invalidates the evaluation metrics
b) It may bias the model toward the majority class, reducing predictive performance for the minority class
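A small numeric example shows why this matters. The labels below are invented for illustration: 95 samples of the majority class (0) and 5 of the minority class (1), scored against a hypothetical model that always predicts the majority class.

```python
# Hypothetical imbalanced labels: 95 majority (0) and 5 minority (1) samples.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

# Overall accuracy looks high because the majority class dominates.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of actual minority samples the model identified.
minority_total = sum(t == 1 for t in y_true)
minority_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_found / minority_total

print(accuracy, minority_recall)
```

The high accuracy hides the fact that not a single minority sample is detected, which is why accuracy alone is a weak metric on imbalanced data.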
What is model validation?
a) A process to assess how well a model performs on unseen data
b) A method to optimize dataset size
c) Training the model with random weights
d) Adjusting hyperparameters to improve accuracy
a) A process to assess how well a model performs on unseen data
What is cross-validation?
a) Dividing the data into multiple subsets to train and test the model iteratively
b) A technique to detect overfitting
c) A preprocessing step for balancing data
d) A method to identify noise in datasets
a) Dividing the data into multiple subsets to train and test the model iteratively
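One common form of this is k-fold cross-validation, where each subset (fold) serves as the test set exactly once. The sketch below only generates the index splits; `k_fold_indices` is a name chosen for this example, and it assumes the data has already been shuffled if ordering matters.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `remainder` folds get one extra sample so all samples are used.
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in splits:
    # In practice: fit the model on train_idx, evaluate on test_idx, then average the scores.
    print(train_idx, test_idx)
```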
What is accuracy in machine learning?
a) The fraction of correct predictions out of all predictions made
b) The ability to identify only positive samples
c) The measure of a model’s robustness to noise
d) The trade-off between recall and precision
a) The fraction of correct predictions out of all predictions made
What is precision?
a) The fraction of correctly identified positive instances among all predicted positive instances
b) The ability to correctly identify negative instances
c) The fraction of true negatives in the data
d) The ratio of false negatives to true negatives
a) The fraction of correctly identified positive instances among all predicted positive instances
What does the F1 score combine?
a) Accuracy and recall
b) Recall and precision
c) Precision and training error
d) Accuracy and noise reduction
b) Recall and precision
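Accuracy, precision, recall, and F1 can all be derived from the counts of true and false positives and negatives. The sketch below uses the standard formulas, with F1 as the harmonic mean of precision and recall; the function name `classification_metrics` and the sample labels are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```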
What is bias in machine learning?
a) A model’s ability to perform consistently across datasets
b) A systematic error introduced by oversimplified assumptions in a model
c) A variance in data points
d) The effect of noise in data
b) A systematic error introduced by oversimplified assumptions in a model
What is concept drift?
a) Changes in the model’s performance over different datasets
b) A shift in the relationship between features and labels over time
c) Variance in data labeling methods
d) Noise reduction during training
b) A shift in the relationship between features and labels over time