Data Mining and Supervised Learning Flashcards
What is CRISP-DM?
a) A software tool for data mining
b) A standard methodology for conducting data mining projects
c) A programming language for machine learning
d) A tool for data visualization
b) A standard methodology for conducting data mining projects
What are the six phases of the CRISP-DM process?
a) Data Cleaning, Analysis, Visualization, Modelling, Evaluation, Presentation
b) Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment
c) Problem Identification, Data Gathering, Feature Selection, Modelling, Testing, Reporting
d) Data Collection, Cleaning, Modelling, Testing, Evaluation, Delivery
b) Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment
Which phase of CRISP-DM focuses on understanding project objectives from a business perspective?
a) Data Preparation
b) Business Understanding
c) Modelling
d) Evaluation
b) Business Understanding
What is the main goal of the Deployment phase in CRISP-DM?
a) To build machine learning models
b) To ensure the results of data mining are used in decision-making
c) To clean and preprocess data
d) To explore data relationships
b) To ensure the results of data mining are used in decision-making
What is machine learning?
a) A process of explicitly programming computers to solve problems
b) A field of study where computers learn from data without being explicitly programmed
c) A visualization method for large datasets
d) A data cleaning tool
b) A field of study where computers learn from data without being explicitly programmed
What is an example of a machine learning application?
a) Predicting stock market trends
b) Cleaning unstructured data
c) Managing relational databases
d) Archiving data
a) Predicting stock market trends
What is supervised learning?
a) A technique to cluster unlabeled data
b) A learning process using labeled data to predict outcomes for unseen data
c) A statistical method for creating decision trees
d) A method for dimensionality reduction in datasets
b) A learning process using labeled data to predict outcomes for unseen data
What type of data does supervised learning require?
a) Only categorical data
b) Data without labels
c) Data with both features and labels
d) Data with missing values
c) Data with both features and labels
What is unsupervised learning?
a) A method that uses labeled data to make predictions
b) An approach that analyzes and clusters unlabeled data
c) A process for supervised classification tasks
d) A technique to preprocess data for supervised models
b) An approach that analyzes and clusters unlabeled data
Which of these is an example of unsupervised learning?
a) Predicting house prices based on past data
b) Clustering customers based on purchase behavior
c) Sentiment analysis using labeled data
d) Fraud detection with supervised models
b) Clustering customers based on purchase behavior
What are features in machine learning?
a) The rows in a dataset
b) Attributes or input variables describing each observation
c) The process of splitting data
d) The final predictions of a model
b) Attributes or input variables describing each observation
What is a label in machine learning?
a) The categorical or continuous value that a model is meant to predict for an observation
b) An algorithm used for training
c) A type of data preprocessing method
d) The summary statistic of a dataset
a) The categorical or continuous value that a model is meant to predict for an observation
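To make the feature/label distinction concrete, here is a minimal sketch that separates a tiny table into features and labels. The dataset, the column names (`area_m2`, `rooms`, `price`), and the values are all made up for illustration.

```python
# Hypothetical toy dataset: each row is one observation (a house).
rows = [
    {"area_m2": 50, "rooms": 2, "price": 150_000},
    {"area_m2": 80, "rooms": 3, "price": 240_000},
    {"area_m2": 120, "rooms": 4, "price": 330_000},
]

feature_names = ["area_m2", "rooms"]  # input variables describing each observation
label_name = "price"                  # the value a model would be asked to predict

# X holds the features of each observation, y holds the matching labels.
X = [[row[name] for name in feature_names] for row in rows]
y = [row[label_name] for row in rows]

print(X)  # [[50, 2], [80, 3], [120, 4]]
print(y)  # [150000, 240000, 330000]
```

Because the label here is continuous, predicting it would be a regression task; a categorical label such as "cheap" or "expensive" would make it a classification task.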
What is the purpose of cross-validation in machine learning?
a) To find the best features in a dataset
b) To estimate how well a model performs on unseen data
c) To split data into training and testing sets
d) To preprocess the raw data
b) To estimate how well a model performs on unseen data
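One common form of cross-validation is k-fold: the data is divided into k folds, and each fold is held out once for testing while the rest is used for training. The sketch below is an assumption-laden illustration, not a reference implementation. The helper names (`k_fold_indices`, `cross_validate`) are hypothetical, the "model" is only a majority-class baseline that ignores features, and the data is not shuffled so the result stays deterministic.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds

def cross_validate(labels, k):
    """Return the accuracy of a majority-class baseline on each held-out fold."""
    scores = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
        # "Training" here only means remembering the most common training label.
        majority = Counter(train_labels).most_common(1)[0][0]
        correct = sum(1 for i in test_idx if labels[i] == majority)
        scores.append(correct / len(test_idx))
    return scores

# Hypothetical labels for nine observations.
labels = ["ham", "spam", "ham", "ham", "ham", "spam", "ham", "ham", "spam"]
scores = cross_validate(labels, k=3)
print(scores)                     # one accuracy per held-out fold
print(sum(scores) / len(scores))  # average accuracy across folds
```

Every observation is used for testing exactly once, so the averaged score is a steadier estimate of performance on unseen data than a single train/test split.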
What is a model in machine learning?
a) A tool for visualizing data
b) A representation of learned patterns used to make predictions
c) A data cleaning technique
d) A preprocessing step for supervised learning
b) A representation of learned patterns used to make predictions
What are the two main types of supervised learning tasks?
a) Clustering and Regression
b) Classification and Regression
c) Clustering and Dimensionality Reduction
d) Regression and Association Rule Learning
b) Classification and Regression
Which supervised learning task predicts continuous values?
a) Clustering
b) Classification
c) Regression
d) Dimensionality Reduction
c) Regression
What is classification in supervised learning?
a) Grouping data without predefined categories
b) Predicting continuous numerical values
c) Predicting discrete class labels for data points
d) Optimizing model performance using gradient descent
c) Predicting discrete class labels for data points
Which of these is an example of binary classification?
a) Predicting house prices
b) Categorizing reviews as positive or negative
c) Grouping customer segments
d) Identifying weather patterns
b) Categorizing reviews as positive or negative
What is regression used for in supervised learning?
a) Grouping data into clusters
b) Predicting categorical labels
c) Predicting continuous numerical values
d) Summarizing datasets with descriptive statistics
c) Predicting continuous numerical values
Which of these tasks would use regression?
a) Identifying spam emails
b) Predicting future sales revenue
c) Grouping customer demographics
d) Categorizing news articles
b) Predicting future sales revenue
What is a decision tree?
a) A flowchart-like structure used for classification and regression
b) A clustering algorithm for grouping unlabeled data
c) A neural network architecture for image processing
d) A data cleaning method for missing values
a) A flowchart-like structure used for classification and regression
What does each leaf node in a decision tree represent?
a) A test on an attribute
b) A class label or continuous value
c) A statistical measure of purity
d) A split point for numerical data
b) A class label or continuous value
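As an illustration of the structure, the sketch below hand-builds a small tree as nested dictionaries: internal nodes test an attribute, and leaf nodes hold a class label. The attributes (`outlook`, `humidity`), their values, and the `predict` helper are hypothetical; a real tree would be learned from data rather than written by hand.

```python
# Hypothetical hand-built tree: internal nodes test an attribute,
# leaf nodes hold the predicted class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": {"leaf": "no"}, "normal": {"leaf": "yes"}},
        },
        "overcast": {"leaf": "yes"},
        "rainy": {"leaf": "no"},
    },
}

def predict(node, observation):
    """Follow attribute tests from the root until a leaf is reached."""
    while "leaf" not in node:
        value = observation[node["attribute"]]
        node = node["branches"][value]
    return node["leaf"]

print(predict(tree, {"outlook": "sunny", "humidity": "normal"}))   # yes
print(predict(tree, {"outlook": "overcast", "humidity": "high"}))  # yes
```

For a regression tree, each leaf would hold a continuous value instead of a class label.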
How does a decision tree decide where to split data?
a) By random partitioning
b) Using an attribute selection measure like information gain
c) By creating equal-sized partitions
d) Based on user-defined thresholds
b) Using an attribute selection measure like information gain
What is information gain in the context of decision trees?
a) A measure of statistical variance in the data
b) The reduction in entropy (uncertainty) after splitting data on an attribute
c) The difference between training and testing accuracy
d) The probability of a class given an attribute value
b) The reduction in entropy (uncertainty) after splitting data on an attribute
Information gain quantifies the reduction in entropy (or uncertainty) in the dataset after it is split on an attribute. At each node, the decision tree splits on the attribute that provides the highest information gain.
Entropy is a measure of uncertainty or disorder in the dataset.
- High entropy means the data is highly mixed and uncertain (e.g., a dataset in which every class has the same number of data points).
- Low entropy means the data is more pure or certain (e.g., most data points belong to the same class).
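Entropy is usually computed as the Shannon entropy, H = -sum(p_i * log2(p_i)), where p_i is the proportion of data points in class i. The flashcards do not give this formula, so treat the sketch below as one standard way to compute it; the `entropy` function name and the example labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["yes", "no", "yes", "no"]))    # evenly mixed: 1.0
print(entropy(["yes", "yes", "yes", "yes"]))  # pure: 0.0
print(entropy(["yes", "yes", "yes", "no"]))   # mostly one class: about 0.811
```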
Which attribute is selected for splitting using information gain?
a) The attribute with the highest gain
b) The attribute with the lowest gain
c) The attribute with the most categories
d) The attribute with continuous values
a) The attribute with the highest gain
- Information gain is used to measure the effectiveness of a feature in separating the dataset into pure subsets.
- The attribute with the highest information gain is selected for splitting because it reduces the uncertainty (entropy) the most.
- This makes the resulting nodes as “pure” as possible, meaning each contains mostly data points from a single class.
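The selection rule can be sketched as follows: compute the entropy before the split, subtract the size-weighted entropy of the subsets after the split, and pick the attribute with the largest difference. The function names, the attribute names (`outlook`, `windy`), and the four-row dataset are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, label="label"):
    """Parent entropy minus the size-weighted entropy of the subsets after the split."""
    parent = entropy([row[label] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[label])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted

# Hypothetical data: "outlook" separates the classes perfectly, "windy" not at all.
rows = [
    {"outlook": "sunny", "windy": "no", "label": "yes"},
    {"outlook": "sunny", "windy": "yes", "label": "yes"},
    {"outlook": "rainy", "windy": "no", "label": "no"},
    {"outlook": "rainy", "windy": "yes", "label": "no"},
]

gains = {a: information_gain(rows, a) for a in ["outlook", "windy"]}
best = max(gains, key=gains.get)
print(gains)  # outlook has gain 1.0, windy has gain 0.0
print(best)   # outlook
```

Splitting on `outlook` leaves two pure subsets, so all of the original uncertainty is removed; splitting on `windy` leaves both subsets as mixed as the parent, so nothing is gained.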
What is the k-Nearest Neighbors (KNN) algorithm?
a) A clustering technique
b) A lazy learning algorithm for classification and regression
c) A dimensionality reduction method
d) A statistical test for data correlation
b) A lazy learning algorithm for classification and regression
How does KNN classify a new data point?
a) By assigning it to the most frequent class among its k nearest neighbors
b) By computing the average distance to all data points
c) By using entropy to decide its class
d) By building a decision tree around the point
a) By assigning it to the most frequent class among its k nearest neighbors
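The majority vote can be sketched in a few lines. This example assumes Euclidean distance, which the flashcards do not specify, and uses made-up 2D points; the `knn_classify` name is hypothetical. Note that there is no training step: the algorithm keeps the labeled points and does all of its work at prediction time, which is why it is called a lazy learner.

```python
import math
from collections import Counter

def knn_classify(training_points, query, k):
    """Return the most frequent label among the k training points closest to query."""
    # Sort labeled points by Euclidean distance to the query (an assumed metric).
    by_distance = sorted(training_points, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled points: (coordinates, class label).
training_points = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 7.5), "B"), ((6.5, 5.0), "B"),
]

print(knn_classify(training_points, (2.0, 2.0), k=3))  # A
print(knn_classify(training_points, (6.0, 5.5), k=3))  # B
```

For regression, the same neighbors would be used, but the prediction would be the average of their values instead of a vote.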
What is one major challenge in supervised learning?
a) Finding unlabeled datasets
b) Handling large amounts of labeled data
c) Correctly labeling training data
d) Eliminating overfitting in unsupervised models
c) Correctly labeling training data
Why can supervised learning models be resource-intensive?
a) They require clustering data
b) They need extensive labeled data for training
c) They cannot predict new data points
d) They are only suitable for small datasets
b) They need extensive labeled data for training