Core Concepts Flashcards
Supervised Learning
Involves training a model on labeled data so it can predict outcomes for new, unseen data. Types include:
Classification: Predicting a category (e.g., spam vs. not spam).
Regression: Predicting a continuous value (e.g., house prices).
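To make the two types concrete, here is a minimal sketch assuming scikit-learn is available; the feature names and values are invented for illustration:

```python
# Minimal sketch: classification vs. regression with scikit-learn.
# Feature values and labels are made up for illustration.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a category (spam = 1, not spam = 0)
X_emails = [[0.9, 12], [0.1, 3], [0.8, 9], [0.2, 1]]  # e.g., [link_ratio, num_exclamations]
y_labels = [1, 0, 1, 0]
clf = LogisticRegression().fit(X_emails, y_labels)
print(clf.predict([[0.7, 10]]))        # -> a class label

# Regression: predict a continuous value (house price)
X_houses = [[1200, 2], [2000, 3], [1500, 3], [900, 1]]  # e.g., [sqft, bedrooms]
y_prices = [150_000, 260_000, 200_000, 110_000]
reg = LinearRegression().fit(X_houses, y_prices)
print(reg.predict([[1600, 3]]))        # -> a continuous value
```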
Unsupervised Learning
Involves finding patterns or groupings in unlabeled data, without a specific target. Types include:
Clustering: Grouping similar data points (e.g., customer segmentation).
Association: Discovering relationships between variables (e.g., items bought together in a store).
CRISP-DM Phases
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Business Understanding:
Define objectives and requirements.
Purpose
Explore opportunities that can be leveraged through data mining
Understand the problem to be solved and its use scenario
Deliverables
Project objectives
Success criteria
Review of current situation
Resources
Requirements, assumptions, constraints
Risks & contingencies
Terminology
Costs/benefits
Project plan
Data Understanding:
Collect and explore data to identify issues or insights.
Purpose
Evaluate the raw material (data) you have for the project
Perform exploratory data analysis (EDA)
Deliverables
Initial data collection report
Data description report
Data exploration report
Data quality report
Evaluation:
Assess model performance with metrics (e.g., R² for regression).
Purpose
Assess model results rigorously to gain confidence in their validity and reliability
Ensure that the model can help achieve your business objectives
Deliverables
Assessment of modeling results
Approved models
Process review
List of potential actions
Decision/recommendation
Data Preparation:
Clean and transform data for modeling (e.g., dealing with missing values, normalizing data).
Purpose
Manipulate/convert data into a form that will provide the best results
Iterative in nature
Deliverables
Data inclusion/exclusion report
Data cleaning report
Derived attributes list
Generated records list
Merged data report
Aggregations report
Deployment:
Implement the model in a real-world setting.
Purpose
Put the model into production in order to realize benefits
Deliverables
Deployment plan
Monitoring & maintenance plan
Final report & presentation
Documentation
Modeling:
Apply algorithms such as MLR or clustering to the data.
Purpose
Identify the model(s) that can capture patterns or regularities in the data
Calibrate models for optimum performance
Deliverables
Selected (best) modeling technique
List of assumptions
Design description
Parameter settings
Report of tested models
Model descriptions
Model assessment
Assumptions of Multiple Linear Regression (MLR)
Assumptions to Check:
Linearity: The relationship between predictors and outcome should be linear.
Normal Distribution of Errors: Errors (residuals) should follow a normal distribution.
Homoskedasticity: Variance of residuals should be constant across all levels of independent variables.
Independence of Errors: Residuals should not be autocorrelated.
No Multicollinearity: Predictors should not be highly correlated with each other.
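A sketch of how several of these checks might be run in Python, assuming statsmodels and scipy are available; the data is synthetic and the thresholds are rules of thumb, not definitive cutoffs:

```python
# Sketch of common MLR assumption checks on synthetic data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # three predictors
y = X @ [1.5, -2.0, 0.5] + rng.normal(size=100)  # outcome with noise

model = sm.OLS(y, sm.add_constant(X)).fit()
resid = model.resid

# Normal distribution of errors: Shapiro-Wilk test on residuals
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)

# Independence of errors: Durbin-Watson (values near 2 suggest no autocorrelation)
print("Durbin-Watson:", durbin_watson(resid))

# No multicollinearity: VIF per predictor (values above ~5-10 are a warning sign)
exog = sm.add_constant(X)
for i in range(1, exog.shape[1]):
    print("VIF:", variance_inflation_factor(exog, i))
```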
Regression Evaluation Metrics:
R²
RMSE
MAE
R² (Coefficient of Determination):
Proportion of the variance in the dependent variable explained by the model.
RMSE (Root Mean Squared Error):
Measures the average magnitude of prediction errors, weighting large errors more heavily.
MAE (Mean Absolute Error):
Average of absolute differences between predictions and actual values; less sensitive to large errors compared to RMSE.
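All three metrics can be computed directly from their definitions; a short NumPy sketch with placeholder values:

```python
# R², RMSE, and MAE computed from their definitions; y_true and y_pred
# are placeholder arrays for illustration.
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.4, 6.5, 9.3])

ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                         # proportion of variance explained
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # penalizes large errors more
mae = np.mean(np.abs(y_true - y_pred))           # less sensitive to large errors
print(r2, rmse, mae)
```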
k-NN:
A classification method where an instance is classified based on the most common class among its k-nearest neighbors, a form of instance-based learning.
Distance Calculation:
Distance-based models such as k-NN typically use Euclidean distance to measure closeness. Across multiple dimensions, it is the square root of the sum of squared differences between each feature pair.
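A bare-bones illustration of the distance and a k-NN majority vote; the training points are invented, and a real project would typically use a library implementation:

```python
# Euclidean distance and a simple k-NN majority vote on made-up data.
import numpy as np
from collections import Counter

def euclidean(a, b):
    # square root of the sum of squared differences across each feature pair
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def knn_predict(X_train, y_train, query, k=3):
    dists = [euclidean(x, query) for x in X_train]
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)  # most common class wins
    return votes.most_common(1)[0][0]

X_train = [[1, 1], [2, 1], [8, 9], [9, 8]]
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, [1.5, 1.2]))  # -> "A"
```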
Bayes Theorem:
Calculates a posterior probability by combining prior knowledge with the likelihood of observed evidence. Used to update predictions as new evidence arrives.
Naïve Bayes:
Assumes independence among features. It’s highly efficient for text classification (e.g., spam detection), often with surprisingly high accuracy despite simplifying assumptions.
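A worked Bayes' theorem example with invented probabilities, showing how evidence updates the prior:

```python
# Worked Bayes' theorem example with made-up numbers:
# P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam = 0.2               # prior: 20% of mail is spam
p_word_given_spam = 0.6    # likelihood: "free" appears in 60% of spam
p_word_given_ham = 0.05    # "free" appears in 5% of non-spam

p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # -> 0.75: the evidence raised 0.2 to 0.75
```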
Entropy:
Measures the randomness or impurity in a dataset. A set is considered pure (entropy = 0) if it contains only one class.
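The formula is H = -Σ pᵢ · log₂(pᵢ); a short sketch with made-up label sets:

```python
# Entropy of a label set: H = -sum(p_i * log2(p_i)); labels are made up.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "yes"]))       # 0.0 -> pure set (may print as -0.0)
print(entropy(["yes", "no", "yes", "no"]))  # 1.0 -> maximally mixed (two classes)
```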
Data Preparation Basics:
Handling Missing Values
Standardizing/Normalizing
Correcting Skewness/Kurtosis
Information Gain:
Used to select features in decision trees. It calculates the reduction in entropy after splitting the data by a specific feature, aiming to increase purity in resulting subsets.
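A sketch of information gain for a binary split, reusing the entropy definition from the sketch above; the labels and split are illustrative:

```python
# Information gain = entropy(parent) - weighted entropy of the child subsets.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_splits):
    n = len(parent_labels)
    weighted = sum(len(c) / n * entropy(c) for c in child_splits)
    return entropy(parent_labels) - weighted  # reduction in entropy

parent = ["yes", "yes", "no", "no", "yes", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]  # a candidate split
print(information_gain(parent, [left, right]))  # 1.0 -> perfectly pure split
```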
Handling Missing Values:
Drop rows/columns with missing values, or impute them using methods like mean, median, mode, or prediction based on other attributes.
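A pandas sketch of both strategies on an invented DataFrame:

```python
# Common missing-value strategies with pandas; the DataFrame is made up.
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 40, 35], "city": ["NY", "LA", None, "NY"]})

dropped = df.dropna()                                 # drop rows with any missing value
df["age"] = df["age"].fillna(df["age"].median())      # impute numeric with median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # impute categorical with mode
print(df)
```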
Correcting Skewness/Kurtosis:
Apply log, square root, or cube root transformations for positive skew; square or exponential transformations for negative skew.
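A quick illustration on synthetic right-skewed data, using scipy's skew() to verify the effect of each transformation:

```python
# Transformations for positive skew; scipy's skew() quantifies the change.
import numpy as np
from scipy.stats import skew

x = np.random.default_rng(1).exponential(scale=2.0, size=1000)  # right-skewed data
print("before:", skew(x))
print("log:   ", skew(np.log1p(x)))   # log1p handles zeros safely
print("sqrt:  ", skew(np.sqrt(x)))
print("cbrt:  ", skew(np.cbrt(x)))
# For negative skew, squaring (x**2) or exponentiating pushes mass rightward.
```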
Standardizing/Normalizing:
Important for models like k-NN and k-means, where feature scales matter.
Options include:
Z-score standardization
Min-max normalization
Z-score standardization:
Centers data around zero with a standard deviation of one.
Min-max normalization:
Rescales data to a range, usually 0 to 1.
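Both rescalings in a few lines of NumPy, on an invented vector:

```python
# Z-score standardization and min-max normalization; x is illustrative.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

z = (x - x.mean()) / x.std()                  # mean 0, standard deviation 1
minmax = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]
print(z, minmax)
```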
Classification Evaluation Metrics & Confusion Matrix
Metrics:
Accuracy
Precision
Recall/Sensitivity
F1 Score
Confusion Matrix
Accuracy:
Correct predictions divided by total predictions.
Precision:
True positives divided by total predicted positives; useful for minimizing false positives.
Recall (Sensitivity):
True positives divided by actual positives; useful for reducing false negatives.
F1 Score:
Harmonic mean of precision and recall; balances the two metrics.
Confusion Matrix:
A table that summarizes classification results into true positives, false positives, true negatives, and false negatives, enabling detailed evaluation of model performance.
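Deriving the four metrics from raw confusion-matrix counts (the counts are made up):

```python
# Classification metrics from confusion-matrix counts; values are invented.
tp, fp, tn, fn = 40, 10, 45, 5

accuracy = (tp + tn) / (tp + fp + tn + fn)          # 0.85
precision = tp / (tp + fp)                          # 0.80, penalizes false positives
recall = tp / (tp + fn)                             # ~0.89, penalizes false negatives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(accuracy, precision, recall, f1)
```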
Clustering Methods:
k-means: Partitions data into k clusters around centroids by minimizing the variance within each cluster.
k-medoids: Similar but chooses actual data points (medoids) as centers, making it more robust to outliers.
Distance Measure:
k-means commonly uses Euclidean distance.
Calinski-Harabasz Criterion:
A metric used in Tableau to determine the optimal number of clusters by balancing within-cluster and between-cluster variance.
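The same criterion is also available in scikit-learn; a sketch scoring several values of k on synthetic blobs:

```python
# Choosing k by the Calinski-Harabasz criterion on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))  # higher is better
```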
Correcting for Skewness/Kurtosis:
Transformations:
For positive skew: Use square root, cube root, or log transformations.
For negative skew: Use square or exponential functions.
Skew and kurtosis corrections ensure the data distribution is closer to normal, which is especially useful in models that assume normality, like MLR.
Feature Selection:
Reduces data dimensionality to avoid overfitting, enhance interpretability, and improve model performance.
Techniques include removing irrelevant features, using correlation to exclude highly correlated variables, or applying algorithms to rank feature importance.
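One possible correlation-based sketch, assuming pandas; the DataFrame and the 0.9 cutoff are illustrative choices, not fixed rules:

```python
# Drop one of each pair of highly correlated features; data is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": a * 0.98 + rng.normal(0.0, 0.05, 200),  # near-copy of "a"
                   "c": rng.normal(size=200)})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("dropping:", to_drop)  # 'b' is nearly a copy of 'a'
reduced = df.drop(columns=to_drop)
```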
Inductive Logic:
Deriving general principles from specific observations (e.g., a model generalizing from training data).
Deductive Logic:
Starting from general principles to make predictions about specific cases (testing hypotheses, rule-based reasoning).
Missing Data Classification
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Missing Not at Random (MNAR)
Missing Completely at Random (MCAR)
Can check characteristics of records with missing values against those without; they should not differ
You can safely impute the value
Missing at Random (MAR)
Characteristics are significantly different
Better name is missing conditionally at random
Dangerous to impute values – may need to delete these observations
Missing Not at Random (MNAR)
Don’t delete, don’t predict
Replace using a strategy appropriate to your context
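A sketch of the MCAR-style check mentioned above: compare another attribute across records where a value is missing versus present; all data and column names are invented:

```python
# Compare characteristics of missing vs. non-missing records (MCAR check).
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
df = pd.DataFrame({"age": rng.normal(40, 10, 200),
                   "income": rng.normal(50_000, 8_000, 200)})
df.loc[rng.choice(200, 30, replace=False), "income"] = np.nan  # inject missingness

missing = df["income"].isna()
t, p = stats.ttest_ind(df.loc[missing, "age"], df.loc[~missing, "age"])
print("p =", p)  # large p: groups look alike, consistent with MCAR -> safer to impute
```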