Data mining and supervised learning Flashcards by Marcus Hellberg

What is CRISP-DM?

a) A software for data mining

b) A standard methodology for conducting data mining projects

c) A programming language for machine learning

d) A tool for data visualization

b) A standard methodology for conducting data mining projects

What are the six phases of the CRISP-DM process?

a) Data Cleaning, Analysis, Visualization, Modelling, Evaluation, Presentation

b) Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment

c) Problem Identification, Data Gathering, Feature Selection, Modelling, Testing, Reporting

d) Data Collection, Cleaning, Modelling, Testing, Evaluation, Delivery

b) Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment

Which phase of CRISP-DM focuses on understanding project objectives from a business perspective?

a) Data Preparation

b) Business Understanding

c) Modelling

d) Evaluation

b) Business Understanding

What is the main goal of the Deployment phase in CRISP-DM?

a) To build machine learning models

b) To ensure the results of data mining are used in decision-making

c) To clean and preprocess data

d) To explore data relationships

b) To ensure the results of data mining are used in decision-making

What is machine learning?

a) A process of explicitly programming computers to solve problems

b) A field of study where computers learn from data without being explicitly programmed

c) A visualization method for large datasets

d) A data cleaning tool

b) A field of study where computers learn from data without being explicitly programmed

What is an example of a machine learning application?

a) Predicting stock market trends

b) Cleaning unstructured data

c) Managing relational databases

d) Archiving data

a) Predicting stock market trends

What is supervised learning?

a) A technique to cluster unlabeled data

b) A learning process using labeled data to predict outcomes for unseen data

c) A statistical method for creating decision trees

d) A method for dimensionality reduction in datasets

b) A learning process using labeled data to predict outcomes for unseen data

What type of data does supervised learning require?

a) Only categorical data

b) Data without labels

c) Data with both features and labels

d) Data with missing values

c) Data with both features and labels

What is unsupervised learning?

a) A method that uses labeled data to make predictions

b) An approach that analyzes and clusters unlabeled data

c) A process for supervised classification tasks

d) A technique to preprocess data for supervised models

b) An approach that analyzes and clusters unlabeled data

Which of these is an example of unsupervised learning?

a) Predicting house prices based on past data

b) Clustering customers based on purchase behavior

c) Sentiment analysis using labeled data

d) Fraud detection with supervised models

b) Clustering customers based on purchase behavior

What are features in machine learning?

a) The rows in a dataset

b) Attributes or input variables describing each observation

c) The process of splitting data

d) The final predictions of a model

b) Attributes or input variables describing each observation

What is a label in machine learning?

a) The categorical or continuous value that a model is meant to predict for an observation

b) An algorithm used for training

c) A type of data preprocessing method

d) The summary statistic of a dataset

a) The categorical or continuous value that a model is meant to predict for an observation
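
A minimal sketch of how features and a label sit together in one dataset. The column names (`size_m2`, `rooms`, `price`), the values, and the helper `split_features_and_label` are all hypothetical, chosen only for illustration.

```python
# Toy dataset: each row is one observation (all values are made up).
rows = [
    {"size_m2": 50, "rooms": 2, "price": 150_000},
    {"size_m2": 80, "rooms": 3, "price": 240_000},
    {"size_m2": 120, "rooms": 5, "price": 390_000},
]


def split_features_and_label(rows, label_name):
    """Separate each row into its features (inputs) and its label (target)."""
    features = [{k: v for k, v in row.items() if k != label_name} for row in rows]
    labels = [row[label_name] for row in rows]
    return features, labels


X, y = split_features_and_label(rows, "price")
print(X[0])  # {'size_m2': 50, 'rooms': 2}
print(y)     # [150000, 240000, 390000]
```

Here `price` is a continuous label, so predicting it would be a regression task; a label such as "sold" / "not sold" would make it classification.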

What is the purpose of cross-validation in machine learning?

a) To find the best features in a dataset

b) To estimate how well a model will perform on unseen data

c) To split data into training and testing sets

d) To preprocess the raw data

b) To estimate how well a model will perform on unseen data
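
A rough sketch of k-fold cross-validation in plain Python, assuming accuracy as the score. Splitting the data is only the mechanism; the purpose is the averaged score on held-out folds. The helper names (`k_fold_indices`, `cross_validate`) and the toy majority-class model are assumptions made for illustration, not part of any library.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k (train, test) pairs."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


def cross_validate(xs, ys, k, fit, predict):
    """Average accuracy over k folds, each fold held out once as test data."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Toy model (hypothetical): always predicts the most common training label.
def fit_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    return model


xs = list(range(6))
ys = ["spam", "spam", "ham", "spam", "spam", "spam"]
print(cross_validate(xs, ys, k=3, fit=fit_majority, predict=predict_majority))
```

Every observation is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split than a single train/test split would.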

What is a model in machine learning?

a) A tool for visualizing data

b) A representation of learned patterns used to make predictions

c) A data cleaning technique

d) A preprocessing step for supervised learning

b) A representation of learned patterns used to make predictions

What are the two main types of supervised learning tasks?

a) Clustering and Regression

b) Classification and Regression

c) Clustering and Dimensionality Reduction

d) Regression and Association Rule Learning

b) Classification and Regression

Which supervised learning task predicts continuous values?

a) Clustering

b) Classification

c) Regression

d) Dimensionality Reduction

c) Regression

What is classification in supervised learning?

a) Grouping data without predefined categories

b) Predicting continuous numerical values

c) Predicting discrete class labels for data points

d) Optimizing model performance using gradient descent

c) Predicting discrete class labels for data points

Which of these is an example of binary classification?

a) Predicting house prices

b) Categorizing reviews as positive or negative

c) Grouping customer segments

d) Identifying weather patterns

b) Categorizing reviews as positive or negative

What is regression used for in supervised learning?

a) Grouping data into clusters

b) Predicting categorical labels

c) Predicting continuous numerical values

d) Summarizing datasets with descriptive statistics

c) Predicting continuous numerical values

Which of these tasks would use regression?

a) Identifying spam emails

b) Predicting future sales revenue

c) Grouping customer demographics

d) Categorizing news articles

b) Predicting future sales revenue
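
As one possible illustration of a regression task, the sketch below fits a straight line by ordinary least squares and uses it to predict the next value. Linear regression is just one regression technique picked for this example; the function name `fit_line` and the revenue figures are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: month number (feature) and revenue (continuous label).
months = [1, 2, 3, 4]
revenue = [10.0, 20.0, 30.0, 40.0]

slope, intercept = fit_line(months, revenue)
print(slope * 5 + intercept)  # 50.0, the predicted revenue for month 5
```

The output is a number on a continuous scale rather than a class, which is what separates regression from classification.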

What is a decision tree?

a) A flowchart-like structure used for classification and regression

b) A clustering algorithm for grouping unlabeled data

c) A neural network architecture for image processing

d) A data cleaning method for missing values

a) A flowchart-like structure used for classification and regression

What does each leaf node in a decision tree represent?

a) A test on an attribute

b) A class label or continuous value

c) A statistical measure of purity

d) A split point for numerical data

b) A class label or continuous value
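
A small hand-written tree can make the node roles concrete. In this sketch, internal nodes are dictionaries that test one attribute and leaves are plain class labels. The nested-dict layout, the attribute names (`outlook`, `windy`), and the function `predict` are assumptions for illustration only.

```python
# Internal node: {"attribute": name, "branches": {value: subtree_or_leaf}}
# Leaf node: a plain string holding the predicted class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {True: "stay in", False: "play"},
        },
        "rainy": "stay in",
    },
}


def predict(node, observation):
    """Follow the attribute tests from the root until a leaf is reached."""
    while isinstance(node, dict):
        value = observation[node["attribute"]]
        node = node["branches"][value]
    return node


print(predict(tree, {"outlook": "sunny", "windy": False}))  # play
print(predict(tree, {"outlook": "rainy", "windy": True}))   # stay in
```

For a regression tree, the leaves would hold continuous values instead of class labels.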

How does a decision tree decide where to split data?

a) By random partitioning

b) Using an attribute selection measure like information gain

c) By creating equal-sized partitions

d) Based on user-defined thresholds

b) Using an attribute selection measure like information gain

What is information gain in the context of decision trees?

a) A measure of statistical variance in the data

b) The reduction in entropy (uncertainty) after splitting data on an attribute

c) The difference between training and testing accuracy

d) The probability of a class given an attribute value

b) The reduction in entropy (uncertainty) after splitting data on an attribute

Information gain quantifies the reduction in entropy (or uncertainty) in the dataset after the data is split on an attribute. At each node in the decision tree, the split uses the attribute that provides the highest information gain.

Entropy is a measure of uncertainty or disorder in the dataset.

High entropy means the data is highly mixed and uncertain (e.g., a dataset with an equal number of examples from each class).
Low entropy means the data is more pure or certain (e.g., most data points belong to the same class).
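
The two definitions above can be worked through in a few lines. This sketch assumes Shannon entropy with base-2 logarithms and categorical attributes; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    after_split = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - after_split


# Toy data (made up): four observations, two candidate attributes.
labels = ["play", "play", "stay", "stay"]
outlook = ["sunny", "sunny", "rainy", "rainy"]           # separates the classes
day_type = ["weekday", "weekend", "weekday", "weekend"]  # tells us nothing

gains = {
    "outlook": information_gain(labels, outlook),
    "day_type": information_gain(labels, day_type),
}
print(max(gains, key=gains.get))  # outlook
```

The label list starts at 1 bit of entropy (a 50/50 mix). Splitting on `outlook` leaves two pure groups, so its gain is the full 1 bit, while `day_type` leaves both groups as mixed as before and gains nothing.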

Which attribute is selected for splitting using information gain?

a) The attribute with the highest gain

b) The attribute with the lowest gain

c) The attribute with the most categories

d) The attribute with continuous values

a) The attribute with the highest gain

* Information gain is used to measure the effectiveness of a feature in separating the dataset into pure subsets.
* The attribute with the highest information gain is selected for splitting because it reduces the uncertainty (entropy) the most.
* This ensures the resulting nodes are as "pure" as possible, meaning they contain more data points from a single class.

What is the k-Nearest Neighbors (KNN) algorithm?

a) A clustering technique

b) A lazy learning algorithm for classification and regression

c) A dimensionality reduction method

d) A statistical test for data correlation

b) A lazy learning algorithm for classification and regression

How does KNN classify a new data point?

a) By assigning it to the most frequent class among its k nearest neighbors

b) By computing the average distance to all data points

c) By using entropy to decide its class

d) By building a decision tree around the point

a) By assigning it to the most frequent class among its k nearest neighbors
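
A minimal sketch of that majority vote, assuming Euclidean distance and numeric features. The function name `knn_classify` and the sample points are hypothetical. Note that there is no training step: the stored data is only consulted at prediction time, which is why KNN is called a lazy learner.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_classify(train_points, train_labels, query, k=3):
    """Return the most frequent label among the k training points nearest to query."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    # On a tie, most_common keeps the label that was counted first,
    # which here is the one belonging to the closer neighbor.
    return votes.most_common(1)[0][0]


# Toy data (made up): two well-separated groups in 2D.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
labels = ["A", "A", "A", "B", "B"]

print(knn_classify(points, labels, (2, 2), k=3))  # A
print(knn_classify(points, labels, (8, 9), k=3))  # B
```

For regression, the same neighbors would be used, but their label values would be averaged instead of counted.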

What is one major challenge in supervised learning?

a) Finding unlabeled datasets

b) Handling large amounts of labeled data

c) Correctly labeling training data

d) Eliminating overfitting in unsupervised models

c) Correctly labeling training data

Why can supervised learning models be resource-intensive?

a) They require clustering data

b) They need extensive labeled data for training

c) They cannot predict new data points

d) They are only suitable for small datasets

b) They need extensive labeled data for training

(29 cards)