8.3 Predictive analytics - Classification Flashcards
What is classification in data science?
A supervised learning method that determines what class a new data point belongs to based on past labeled observations.
What are some examples of classification problems?
- Is this a dog?
- Is this a spam email?
- What will a person buy in a store?
What does “supervised” mean in classification?
It means the algorithm learns from past data where the correct labels are already known.
What is an instance-based classification method?
A method where new data points are classified based on their similarity to previous observations.
What is the K-Nearest Neighbors (KNN) algorithm?
A classification method that assigns a label to a new data point based on the majority class of its k nearest neighbors.
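A minimal sketch of KNN in practice, assuming scikit-learn and its built-in iris dataset; k=3 and the dataset choice are illustrative, not from the cards:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labeled dataset (past observations with known classes)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k is a hyperparameter: how many nearest neighbours vote on the label
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)        # "training" essentially just stores the data

print(knn.predict(X_test[:5]))   # majority class among the 3 nearest neighbours
```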
What is a downside of KNN?
It requires storing all training data, making predictions slow and memory-intensive.
What is a rule-based classification model?
A model that predicts labels based on learned decision rules, such as Decision Trees.
How do Decision Trees work?
They split data into branches based on feature values, leading to a final decision at a leaf node.
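A short decision-tree sketch, again assuming scikit-learn and the iris data; max_depth=3 is an illustrative hyperparameter:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Each internal node splits on one feature value; leaves hold the final class
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# The learned rules can be printed, which is why trees count as white-box models
print(export_text(tree, feature_names=load_iris().feature_names))
```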
What is a Random Forest?
An ensemble method that combines multiple decision trees to improve accuracy and reduce overfitting.
Why are Random Forests effective?
They combine many decision trees, each trained on a random subset of the data and features, and aggregate their predictions, which reduces variance and overfitting compared with a single tree.
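A sketch of the ensemble idea with scikit-learn's RandomForestClassifier; the 100 trees and the toy dataset are assumptions for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees sees a bootstrap sample of rows and a random subset
# of features; the forest's prediction is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
```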
What are Artificial Neural Networks (ANNs)?
Complex models inspired by the human brain, using layers of neurons to learn patterns in data.
What are the layers of a neural network?
- Input Layer: Receives data
- Hidden Layers: Process information
- Output Layer: Produces predictions
What is the role of weights in a neural network?
Weights determine the strength of connections between neurons, influencing how the model learns.
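A toy forward pass in plain NumPy to make the layer/weight idea concrete; the sizes and random values are made up for illustration and do not represent a trained network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])           # input layer: one data point, 3 features

W_hidden = np.random.randn(3, 4) * 0.1   # weights: input -> 4 hidden neurons
W_output = np.random.randn(4, 2) * 0.1   # weights: hidden -> 2 output classes

h = sigmoid(x @ W_hidden)                # hidden layer processes the information
scores = h @ W_output                    # output layer produces class scores

print("predicted class:", scores.argmax())  # training would adjust the weights
```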
What is model explainability?
The ability to understand and interpret how a model makes decisions.
What are white-box, grey-box, and black-box models?
- White-box: Fully interpretable (e.g., Decision Trees)
- Grey-box: Partially interpretable (e.g., KNN)
- Black-box: Hard to interpret (e.g., Neural Networks)
Why might someone use a black-box model like a Neural Network?
They can learn more complex patterns and often provide higher accuracy despite being difficult to interpret.
How does model choice relate to the CRISP-DM framework?
The choice of model and evaluation metrics should align with business understanding, linking back to the first stage of CRISP-DM.
What are the key steps in building a classification model?
- Load the data
- Split it into training and testing sets
- Set aside some validation data (not used for training; reserved for tuning)
- Choose an appropriate model for the task (Regression, Classification, Clustering)
- Train the model
- Evaluate it on the training and validation data
- Tune hyperparameters
- Evaluate on the test data (see the sketch below)
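A compact sketch of these steps, assuming scikit-learn, a toy dataset, and KNN as the chosen model; the split sizes and candidate k values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Load the data
X, y = load_iris(return_X_y=True)

# 2. Split off a final test set, then carve a validation set out of the rest
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# 3. Choose a model, train it, and tune its hyperparameter on the validation set
best_k, best_score = 1, 0.0
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# 4. Final evaluation on the untouched test set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("chosen k:", best_k, "test accuracy:", final_model.score(X_test, y_test))
```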
Why do we split data into training and testing sets?
To train the model on one portion and evaluate its performance on unseen data, which reveals whether the model has overfit the training data.
Why set aside validation data?
It helps tune hyperparameters without affecting the final test set evaluation.
What are hyperparameters, and why are they important?
Parameters set before training (e.g., number of neighbors in KNN, neurons/layers in ANN) that impact model performance.
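One common way to tune hyperparameters is a grid search with cross-validation; this sketch assumes scikit-learn and illustrative candidate values:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for the KNN hyperparameters; chosen only for illustration
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)   # internally splits into folds, so a final test set stays untouched

print("best hyperparameters:", search.best_params_)
```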
What is the purpose of evaluating a model on test data?
To check how well it generalizes to new, unseen data.
Why is evaluating a model multiple times important?
To ensure stability and reliability of performance across different datasets.
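Repeated evaluation is often done with k-fold cross-validation; a minimal sketch assuming scikit-learn and a decision tree:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train and evaluate 5 times, each time holding out a different fifth of the data
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("fold accuracies:", scores)
print("mean / std:", scores.mean(), scores.std())   # stability across splits
```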
What is the next consideration after model evaluation?
Understanding whether all errors are equal (some mistakes may have greater consequences than others).
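A confusion matrix separates the different kinds of mistakes, which matters when some errors are costlier than others (e.g., a missed spam email versus a legitimate email marked as spam); this sketch assumes scikit-learn and a toy dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes: false negatives and
# false positives can be weighed differently depending on their consequences.
print(confusion_matrix(y_test, pred))
```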