Core Concepts Flashcards
Supervised Learning
Involves training a model on labeled data (inputs paired with known outcomes) so that it can predict outcomes for new data. Types include:
Classification: Predicting a category (e.g., spam vs. not spam).
Regression: Predicting a continuous value (e.g., house prices).
Unsupervised Learning
Involves finding patterns or groupings in unlabeled data, without a specific target. Types include:
Clustering: Grouping similar data points (e.g., customer segmentation).
Association: Discovering relationships between variables (e.g., items bought together in a store)(RM06-Clustering in Tabl…)(RM07-Altair AI Studio P…).
CRISP-DM Phases
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Business Understanding:
Define objectives and requirements.
Purpose
Explore opportunities that can be leveraged through data mining
Understand the problem to be solved and its use scenario
Deliverables
Project objectives
Success criteria
Review of current situation
Resources
Requirements, assumptions, constraints
Risks & contingencies
Terminology
Costs/benefits
Project plan
Data Understanding:
Collect and explore data to identify issues or insights.
Purpose
Evaluate the raw material (data) you have for the project
Perform exploratory data analysis (EDA)
Deliverables
Initial data collection report
Data description report
Data exploration report
Data quality report
Evaluation:
Assess model performance with metrics (e.g., R² for regression).
Purpose
Assess model results rigorously to gain confidence in its validity and reliability
Ensure that the model can help achieve the business objective
Deliverables
Assessment of modeling results
Approved models
Process review
List of potential actions
Decision/recommendation
Data Preparation:
Clean and transform data for modeling (e.g., dealing with missing values, normalizing data).
Purpose
Manipulate/convert data into a form that will provide the best results
Iterative in nature
Deliverables
Data inclusion/exclusion report
Data cleaning report
Derived attributes list
Generated records list
Merged data report
Aggregations report
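The two cleaning steps named in this card (dealing with missing values and normalizing data) can be sketched in plain Python. This is a minimal illustration, assuming mean imputation and min-max scaling; both are choices made for the example rather than techniques the notes prescribe, and the function names and the sample column are hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, None, 40, 60]           # hypothetical column with one missing value
filled = impute_mean(ages)          # [20, 40.0, 40, 60]
scaled = min_max_normalize(filled)  # [0.0, 0.5, 0.5, 1.0]
```

Because the phase is iterative, steps like these are typically rerun whenever modeling reveals a new data issue.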
Deployment:
Implement the model in a real-world setting(RM08-Altair AI Studio F…)(RM05-Introduction to Mo…).
Purpose
Put the model into production in order to realize benefits
Deliverables
Deployment plan
Monitoring & maintenance plan
Final report & presentation
Documentation
Modeling:
Apply algorithms to the data, such as multiple linear regression (MLR) or clustering.
Purpose
Identify the model(s) that can capture patterns or regularities in the data
Calibrate models for optimum performance
Deliverables
Selected (best) modeling technique
List of assumptions
Design description
Parameter settings
Report of tested models
Model descriptions
Model assessment
Assumptions of Multiple Linear Regression (MLR)
Assumptions to Check:
Linearity: The relationship between predictors and outcome should be linear.
Normal Distribution of Errors: Errors (residuals) should follow a normal distribution.
Homoskedasticity: Variance of residuals should be constant across all levels of independent variables.
Independence of Errors: Residuals should not be autocorrelated.
No Multicollinearity: Predictors should not be highly correlated with each other(RM07-Altair AI Studio P…).
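One simple way to screen for the multicollinearity assumption is to compute the Pearson correlation between pairs of predictors. The sketch below is an illustration under assumptions: the data are made up, the variable names are hypothetical, and the 0.8 cutoff is an arbitrary rule of thumb chosen for the example, not a value from the notes.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical predictors: the same house size in two units, plus room count
sqft = [1000, 1500, 2000, 2500]
sqm = [93, 139, 186, 232]
rooms = [3, 2, 5, 3]

THRESHOLD = 0.8  # assumed cutoff for "highly correlated"
r_size = pearson(sqft, sqm)      # near 1: the two columns carry the same information
r_rooms = pearson(sqft, rooms)   # much weaker relationship
flag_size = abs(r_size) > THRESHOLD
flag_rooms = abs(r_rooms) > THRESHOLD
```

Here `sqft` and `sqm` would be flagged, suggesting that one of them should be dropped before fitting the regression.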
Regression Evaluation Metrics:
R²
RMSE
MAE
R² (Coefficient of Determination):
Proportion of the variance in the dependent variable explained by the model.
RMSE (Root Mean Squared Error):
Square root of the mean of squared prediction errors; expressed in the same units as the target and penalizes large errors more heavily than MAE.
MAE (Mean Absolute Error):
Average of absolute differences between predictions and actual values; less sensitive to large errors compared to RMSE(RM08-Altair AI Studio F…).
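The three metrics can be computed directly from their definitions. The following is a minimal sketch in plain Python; the function name and the sample values are hypothetical and chosen only to show how one larger error pulls RMSE above MAE.

```python
import math


def regression_metrics(actual, predicted):
    """Return (R², RMSE, MAE) for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                   # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    return r2, rmse, mae


actual = [10, 20, 30, 40]      # hypothetical observed values
predicted = [12, 18, 30, 44]   # hypothetical model outputs
r2, rmse, mae = regression_metrics(actual, predicted)
```

With these values the absolute errors are 2, 2, 0 and 4, so MAE is 2.0 while RMSE is the square root of 6 (about 2.45): the single error of 4 counts more heavily once squared.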
k-NN:
A classification method where an instance is classified based on the most common class among its k-nearest neighbors, a form of instance-based learning(RM06-Clustering in Tabl…).
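The majority-vote idea can be sketched in a few lines of plain Python. This is an illustrative implementation, not the one used by any particular tool; the function name `knn_predict`, the toy data, and the choice of k = 3 are assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # most_common(1) returns the label with the highest count
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D training data with two classes
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]

pred_near_a = knn_predict(points, labels, (2, 2), k=3)
pred_near_b = knn_predict(points, labels, (7, 9), k=3)
```

No model is fitted in advance: the training data itself is consulted at prediction time, which is what "instance-based" means. With two classes, an odd k avoids tied votes.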
Distance Calculation:
Distance-based methods such as k-NN commonly use Euclidean distance to measure closeness. For points with multiple features, it is the square root of the sum of squared differences between the corresponding feature values of the two points.
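As a worked example of the calculation, the sketch below writes the formula out explicitly; the function name and the sample points are hypothetical.

```python
import math


def euclidean_distance(p, q):
    """Square root of the sum of squared differences across features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d2 = euclidean_distance((0, 0), (3, 4))        # sqrt(9 + 16) = 5.0
d3 = euclidean_distance((1, 2, 3), (3, 4, 4))  # sqrt(4 + 4 + 1) = 3.0
```

Because each feature contributes its raw squared difference, features on larger scales dominate the result, which is one reason data is often normalized before distances are computed.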
Bayes' Theorem:
Calculates the probability of a hypothesis given observed evidence by combining prior knowledge with the likelihood of that evidence: P(A|B) = P(B|A) × P(A) / P(B). Used to update predictions as new evidence is added.
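The update can be illustrated with a small numeric sketch for a yes/no hypothesis. All numbers are invented for the example (a spam filter keyed on one word), and the function and parameter names are hypothetical.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(H | E) via Bayes' theorem for a binary hypothesis H."""
    # Total probability of the evidence across both cases (H true, H false)
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 20% of mail is spam; the word "free" appears
# in 60% of spam messages and 5% of non-spam messages.
p_spam_given_free = posterior(prior=0.2, likelihood=0.6, likelihood_given_not=0.05)

# Updating again: the previous posterior becomes the new prior when a second,
# independent piece of evidence arrives (same likelihoods assumed here).
p_after_second = posterior(p_spam_given_free, 0.6, 0.05)
```

The first update moves the probability of spam from 0.2 to 0.75 (0.12 / 0.16); the second raises it further, showing how each new piece of evidence refines the estimate.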