8.3 Predictive analytics - Classification Flashcards

1
Q

What is classification in data science?

A

A supervised learning method that determines what class a new data point belongs to based on past labeled observations.

2
Q

What are some examples of classification problems?

A
  • Is this a dog?
  • Is this a spam email?
  • What will a person buy in a store?
3
Q

What does “supervised” mean in classification?

A

It means the algorithm learns from past data where the correct labels are already known.

4
Q

What is an instance-based classification method?

A

A method where new data points are classified based on their similarity to previous observations.

5
Q

What is the K-Nearest Neighbors (KNN) algorithm?

A

A classification method that assigns a label to a new data point based on the majority class of its k nearest neighbors.
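
A minimal sketch of the majority-vote idea, assuming Euclidean distance and a small made-up data set. The function name `knn_predict` and the toy points are illustrative assumptions, not part of the original card.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per point, two classes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

prediction = knn_predict(points, labels, query=(1.1, 1.0), k=3)
print(prediction)  # cat
```

Note that every prediction scans the whole training set, so the cost grows with the amount of stored data.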

6
Q

What is a downside of KNN?

A

It requires storing all training data, making predictions slow and memory-intensive.

7
Q

What is a rule-based classification model?

A

A model that predicts labels based on learned decision rules, such as Decision Trees.

8
Q

How do Decision Trees work?

A

They split data into branches based on feature values, leading to a final decision at a leaf node.
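
As a rough illustration, a tree can be written as nested nodes and walked from the root to a leaf. The tree below is hand-built rather than learned, and its feature names (`contains_link`, `exclamation_marks`) and thresholds are hypothetical.

```python
# Hypothetical hand-built tree: internal nodes test one feature against a
# threshold, leaves hold the final class label.
tree = {
    "feature": "contains_link", "threshold": 0.5,
    "left": {"label": "not spam"},        # feature value <= threshold
    "right": {                            # feature value > threshold
        "feature": "exclamation_marks", "threshold": 3,
        "left": {"label": "not spam"},
        "right": {"label": "spam"},
    },
}

def predict(node, sample):
    """Follow branches from the root until a leaf is reached."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

email = {"contains_link": 1, "exclamation_marks": 5}
result = predict(tree, email)
print(result)  # spam
```

Each path from root to leaf reads as a plain rule, which is why such models are easy to explain.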

9
Q

What is a Random Forest?

A

An ensemble method that combines multiple decision trees to improve accuracy and reduce overfitting.

10
Q

Why are Random Forests effective?

A

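They train many decision trees on different random samples of the data and features, then combine their predictions by majority vote. Averaging over many trees mainly reduces variance (overfitting) compared with a single tree.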
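
A simplified sketch of the ensemble idea, assuming a two-class problem. It uses one-split "stumps" instead of full trees and picks one random feature per tree, which is a deliberate simplification of how real Random Forests grow their trees. All names (`fit_stump`, `fit_forest`, `forest_predict`) and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """One-split tree: threshold halfway between the two class means of one feature."""
    classes = sorted(set(labels))
    if len(classes) == 1:                 # bootstrap sample held a single class
        return lambda row: classes[0]
    means = {
        c: sum(r[feature] for r, l in zip(rows, labels) if l == c) / labels.count(c)
        for c in classes
    }
    low, high = sorted(classes, key=means.get)   # assumes exactly two classes
    threshold = (means[low] + means[high]) / 2
    return lambda row: low if row[feature] <= threshold else high

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample row indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))    # random feature for this tree
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest

def forest_predict(forest, row):
    """Combine the trees by majority vote."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

rows = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
forest = fit_forest(rows, labels)
print(forest_predict(forest, (1.1, 1.0)))
```

Because each tree sees a different sample, an unlucky tree is usually outvoted by the rest.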

11
Q

What are Artificial Neural Networks (ANNs)?

A

Complex models inspired by the human brain, using layers of neurons to learn patterns in data.

12
Q

What are the layers of a neural network?

A
  • Input Layer: Receives data
  • Hidden Layers: Process information
  • Output Layer: Produces predictions
13
Q

What is the role of weights in a neural network?

A

Weights determine the strength of connections between neurons, influencing how the model learns.
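
To make this concrete, here is a minimal forward pass through one hidden layer and one output layer. The sigmoid activation, the layer sizes, and the weight values are assumptions chosen for illustration; in practice the weights are learned during training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Each neuron: weighted sum of its inputs plus a bias, passed through sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hypothetical fixed weights, one inner list per neuron.
hidden_weights = [[0.5, -0.4], [0.3, 0.8]]   # 2 inputs -> 2 hidden neurons
hidden_biases = [0.0, -0.1]
output_weights = [[1.2, -0.7]]               # 2 hidden -> 1 output neuron
output_biases = [0.05]

x = [1.0, 2.0]                               # input layer: the raw features
hidden = layer(x, hidden_weights, hidden_biases)
output = layer(hidden, output_weights, output_biases)
print(output)
```

A larger weight makes the corresponding input contribute more to a neuron's output; a weight of zero removes that connection's influence entirely.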

14
Q

What is model explainability?

A

The ability to understand and interpret how a model makes decisions.

15
Q

What are white-box, grey-box, and black-box models?

A
  • White-box: Fully interpretable (e.g., Decision Trees)
  • Grey-box: Partially interpretable (e.g., KNN)
  • Black-box: Hard to interpret (e.g., Neural Networks)
16
Q

Why might someone use a black-box model like a Neural Network?

A

They can learn more complex patterns and often provide higher accuracy despite being difficult to interpret.

17
Q

How does model choice relate to the CRISP-DM framework?

A

The choice of model and evaluation metrics should align with business understanding, linking back to the first stage of CRISP-DM.

18
Q

What are the key steps in building a classification model?

A
  1. Load the data
  2. Split into training and testing sets
  3. Set aside some validation data (not used for training; reserved for tuning)
  4. Choose an appropriate model type (Regression, Classification, Clustering)
  5. Train the model
  6. Evaluate using training data
  7. Tune hyperparameters using the validation data
  8. Evaluate using test data
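
The splitting steps in this list can be sketched as follows. The 60/20/20 proportions, the function name `split_data`, and the stand-in data are assumptions for illustration only.

```python
import random

def split_data(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle once, then cut into training, validation and test portions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],                    # used to fit the model
        shuffled[n_train:n_train + n_val],     # used to tune hyperparameters
        shuffled[n_train + n_val:],            # used once, for the final check
    )

rows = list(range(100))                        # stand-in for 100 loaded records
train_set, validation_set, test_set = split_data(rows)
print(len(train_set), len(validation_set), len(test_set))  # 60 20 20
```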
19
Q

Why do we split data into training and testing sets?

A

To train the model on one portion and evaluate its performance on unseen data, so that overfitting can be detected.

20
Q

Why set aside validation data?

A

It helps tune hyperparameters without affecting the final test set evaluation.

21
Q

What are hyperparameters, and why are they important?

A

Parameters set before training (e.g., number of neighbors in KNN, neurons/layers in ANN) that impact model performance.
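
As a small illustration, the number of neighbors k in KNN can be compared on validation data. The one-feature toy data (including one deliberately noisy training label) and the helper names are hypothetical.

```python
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (value, label) pairs; majority vote among the k nearest values."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, data, k):
    hits = sum(knn_predict(train, value, k) == label for value, label in data)
    return hits / len(data)

# Hypothetical 1-D data; (1.8, "b") is a noisy label inside the "a" region.
train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.5, "a"), (1.8, "b"),
         (6.0, "b"), (6.5, "b"), (7.0, "b"), (7.5, "b")]
validation = [(1.7, "a"), (2.2, "a"), (6.2, "b"), (7.2, "b")]

# k is fixed before prediction and compared on validation data, not test data.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 9)}
best_k = max(scores, key=scores.get)
print(scores, best_k)  # {1: 0.75, 3: 1.0, 9: 0.5} 3
```

Here k=1 follows the noisy point too closely, while k=9 uses every training point and always predicts the larger class.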

22
Q

What is the purpose of evaluating a model on test data?

A

To check how well it generalizes to new, unseen data.

23
Q

Why is evaluating a model multiple times important?

A

To check that performance is stable and reliable across different data splits or datasets, rather than the result of one lucky split.
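
One way to picture this is to repeat a random split several times and look at the spread of the scores. The sketch below assumes a trivial "predict the majority label" model and made-up data purely to keep the example short; `evaluate_once` is a hypothetical helper.

```python
import random
import statistics

def evaluate_once(data, seed):
    """One random 75/25 split; the 'model' predicts the training set's majority label."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * 0.75)
    train, test = rows[:cut], rows[cut:]
    train_labels = [label for _, label in train]
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    return sum(label == majority for _, label in test) / len(test)

# Hypothetical data: (feature, label) pairs, roughly one third "spam".
data = [(i, "spam" if i % 3 == 0 else "ham") for i in range(20)]

scores = [evaluate_once(data, seed) for seed in range(5)]
print(scores)
print(statistics.mean(scores), statistics.stdev(scores))
```

A small spread across runs suggests the measured performance does not depend much on which rows happened to land in the test portion.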

24
Q

What is the next consideration after model evaluation?

A

Understanding whether all errors are equal (some mistakes may have greater consequences than others).
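
A rough sketch of weighting errors differently, assuming a spam filter where wrongly blocking a genuine email is treated as five times worse than letting spam through. The cost values, labels, and predictions are invented for illustration.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for one positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

actual    = ["spam", "spam", "ham", "ham",  "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "ham", "spam"]
counts = confusion_counts(actual, predicted)

# Hypothetical costs: a false positive (genuine email blocked) is assumed to
# cost 5, a false negative (spam delivered) is assumed to cost 1.
costs = {"fp": 5, "fn": 1}
total_cost = sum(costs[kind] * counts[kind] for kind in costs)
print(counts, total_cost)  # {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 2} 6
```

Two models with the same accuracy can have very different total costs once the kinds of mistakes are weighted this way.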