Core Concepts Flashcards
Supervised Learning
Involves training a model on labeled data so it can predict outcomes for new, unseen data. Types include:
Classification: Predicting a category (e.g., spam vs. not spam).
Regression: Predicting a continuous value (e.g., house prices).
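To make the two types concrete, here is a minimal sketch assuming scikit-learn is available; the feature names and values are invented for illustration:

```python
# Minimal sketch: classification vs. regression with scikit-learn.
# Feature values and labels are made up for illustration.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a category (spam = 1, not spam = 0)
X_emails = [[0.9, 12], [0.1, 3], [0.8, 9], [0.2, 1]]  # e.g., [link_ratio, num_exclamations]
y_labels = [1, 0, 1, 0]
clf = LogisticRegression().fit(X_emails, y_labels)
print(clf.predict([[0.7, 10]]))        # -> a class label

# Regression: predict a continuous value (house price)
X_houses = [[1200, 2], [2000, 3], [1500, 3], [900, 1]]  # e.g., [sqft, bedrooms]
y_prices = [150_000, 260_000, 200_000, 110_000]
reg = LinearRegression().fit(X_houses, y_prices)
print(reg.predict([[1600, 3]]))        # -> a continuous value
```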
Unsupervised Learning
Involves finding patterns or groupings in unlabeled data, without a specific target. Types include:
Clustering: Grouping similar data points (e.g., customer segmentation).
Association: Discovering relationships between variables (e.g., items bought together in a store).
CRISP-DM Phases
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Business Understanding:
Define objectives and requirements.
Purpose
Explore opportunities that can be leveraged through data mining
Understand the problem to be solved and its use scenario
Deliverables
Project objectives
Success criteria
Review of current situation
Resources
Requirements, assumptions, constraints
Risks & contingencies
Terminology
Costs/benefits
Project plan
Data Understanding:
Collect and explore data to identify issues or insights.
Purpose
Evaluate the raw material (data) you have for the project
Perform exploratory data analysis (EDA)
Deliverables
Initial data collection report
Data description report
Data exploration report
Data quality report
Evaluation:
Assess model performance with metrics (e.g., R² for regression).
Purpose
Assess model results rigorously to gain confidence in their validity and reliability
Ensure that the model can help achieve your business objectives
Deliverables
Assessment of modeling results
Approved models
Process review
List of potential actions
Decision/recommendation
Data Preparation:
Clean and transform data for modeling (e.g., dealing with missing values, normalizing data).
Purpose
Manipulate/convert data into a form that will provide the best results
Iterative in nature
Deliverables
Data inclusion/exclusion report
Data cleaning report
Derived attributes list
Generated records list
Merged data report
Aggregations report
Deployment:
Implement the model in a real-world setting.
Purpose
Put the model into production in order to realize benefits
Deliverables
Deployment plan
Monitoring & maintenance plan
Final report & presentation
Documentation
Modeling:
Apply algorithms such as MLR or clustering to the data.
Purpose
Identify the model(s) that can capture patterns or regularities in the data
Calibrate models for optimum performance
Deliverables
Selected (best) modeling technique
List of assumptions
Design description
Parameter settings
Report of tested models
Model descriptions
Model assessment
Assumptions of Multiple Linear Regression (MLR)
Assumptions to Check:
Linearity: The relationship between predictors and outcome should be linear.
Normal Distribution of Errors: Errors (residuals) should follow a normal distribution.
Homoskedasticity: Variance of residuals should be constant across all levels of independent variables.
Independence of Errors: Residuals should not be autocorrelated.
No Multicollinearity: Predictors should not be highly correlated with each other.
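A sketch of how several of these checks might be run in Python, assuming statsmodels and scipy are available; the data is synthetic and the thresholds are rules of thumb, not definitive cutoffs:

```python
# Sketch of common MLR assumption checks on synthetic data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # three predictors
y = X @ [1.5, -2.0, 0.5] + rng.normal(size=100)  # outcome with noise

model = sm.OLS(y, sm.add_constant(X)).fit()
resid = model.resid

# Normal distribution of errors: Shapiro-Wilk test on residuals
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)

# Independence of errors: Durbin-Watson (values near 2 suggest no autocorrelation)
print("Durbin-Watson:", durbin_watson(resid))

# No multicollinearity: VIF per predictor (values above ~5-10 are a warning sign)
exog = sm.add_constant(X)
for i in range(1, exog.shape[1]):
    print("VIF:", variance_inflation_factor(exog, i))
```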
Regression Evaluation Metrics:
R²
RMSE
MAE
R² (Coefficient of Determination):
Proportion of the variance in the dependent variable explained by the model.
RMSE (Root Mean Squared Error):
Measures the average magnitude of prediction errors, weighting large errors more heavily.
MAE (Mean Absolute Error):
Average of absolute differences between predictions and actual values; less sensitive to large errors compared to RMSE.
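All three metrics can be computed directly from their definitions; a short NumPy sketch with placeholder values:

```python
# R², RMSE, and MAE computed from their definitions; y_true and y_pred
# are placeholder arrays for illustration.
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.4, 6.5, 9.3])

ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                         # proportion of variance explained
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # penalizes large errors more
mae = np.mean(np.abs(y_true - y_pred))           # less sensitive to large errors
print(r2, rmse, mae)
```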
k-NN:
A classification method where an instance is classified based on the most common class among its k-nearest neighbors, a form of instance-based learning.
Distance Calculation:
Distance-based models such as k-NN typically use Euclidean distance to measure closeness. Across multiple dimensions, it is the square root of the sum of squared differences between each feature pair.
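A bare-bones illustration of the distance and a k-NN majority vote; the training points are invented, and a real project would typically use a library implementation:

```python
# Euclidean distance and a simple k-NN majority vote on made-up data.
import numpy as np
from collections import Counter

def euclidean(a, b):
    # square root of the sum of squared differences across each feature pair
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def knn_predict(X_train, y_train, query, k=3):
    dists = [euclidean(x, query) for x in X_train]
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)  # most common class wins
    return votes.most_common(1)[0][0]

X_train = [[1, 1], [2, 1], [8, 9], [9, 8]]
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, [1.5, 1.2]))  # -> "A"
```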
Bayes Theorem:
Calculates a posterior probability by combining prior knowledge with the likelihood of observed evidence. Used to update predictions as new evidence arrives.
Naïve Bayes:
Assumes independence among features. It’s highly efficient for text classification (e.g., spam detection), often with surprisingly high accuracy despite simplifying assumptions.
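A worked Bayes' theorem example with invented probabilities, showing how evidence updates the prior:

```python
# Worked Bayes' theorem example with made-up numbers:
# P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam = 0.2               # prior: 20% of mail is spam
p_word_given_spam = 0.6    # likelihood: "free" appears in 60% of spam
p_word_given_ham = 0.05    # "free" appears in 5% of non-spam

p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # -> 0.75: the evidence raised 0.2 to 0.75
```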
Entropy:
Measures the randomness or impurity in a dataset. A set is considered pure (entropy = 0) if it contains only one class.
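The formula is H = -Σ pᵢ · log₂(pᵢ); a short sketch with made-up label sets:

```python
# Entropy of a label set: H = -sum(p_i * log2(p_i)); labels are made up.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "yes"]))       # 0.0 -> pure set (may print as -0.0)
print(entropy(["yes", "no", "yes", "no"]))  # 1.0 -> maximally mixed (two classes)
```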
Data Preparation Basics:
Handling Missing Values
Standardizing/Normalizing
Correcting Skewness/Kurtosis
Information Gain:
Used to select features in decision trees. It calculates the reduction in entropy after splitting the data by a specific feature, aiming to increase purity in resulting subsets.
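A sketch of information gain for a binary split, reusing the entropy definition from the sketch above; the labels and split are illustrative:

```python
# Information gain = entropy(parent) - weighted entropy of the child subsets.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_splits):
    n = len(parent_labels)
    weighted = sum(len(c) / n * entropy(c) for c in child_splits)
    return entropy(parent_labels) - weighted  # reduction in entropy

parent = ["yes", "yes", "no", "no", "yes", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]  # a candidate split
print(information_gain(parent, [left, right]))  # 1.0 -> perfectly pure split
```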
Handling Missing Values:
Drop rows/columns with missing values, or impute them using methods like mean, median, mode, or prediction based on other attributes.
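A pandas sketch of both strategies on an invented DataFrame:

```python
# Common missing-value strategies with pandas; the DataFrame is made up.
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 40, 35], "city": ["NY", "LA", None, "NY"]})

dropped = df.dropna()                                 # drop rows with any missing value
df["age"] = df["age"].fillna(df["age"].median())      # impute numeric with median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # impute categorical with mode
print(df)
```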
Correcting Skewness/Kurtosis:
Apply log, square root, or cube root transformations for positive skew; square or exponential transformations for negative skew.
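A quick illustration on synthetic right-skewed data, using scipy's skew() to verify the effect of each transformation:

```python
# Transformations for positive skew; scipy's skew() quantifies the change.
import numpy as np
from scipy.stats import skew

x = np.random.default_rng(1).exponential(scale=2.0, size=1000)  # right-skewed data
print("before:", skew(x))
print("log:   ", skew(np.log1p(x)))   # log1p handles zeros safely
print("sqrt:  ", skew(np.sqrt(x)))
print("cbrt:  ", skew(np.cbrt(x)))
# For negative skew, squaring (x**2) or exponentiating pushes mass rightward.
```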
Standardizing/Normalizing:
Important for models like k-NN and k-means, where feature scales matter.
Options include:
Z-score standardization
Min-max normalization
Z-score standardization:
Centers data around zero with a standard deviation of one.
Min-max normalization:
Rescales data to a range, usually 0 to 1.
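Both rescalings in a few lines of NumPy, on an invented vector:

```python
# Z-score standardization and min-max normalization; x is illustrative.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

z = (x - x.mean()) / x.std()                  # mean 0, standard deviation 1
minmax = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]
print(z, minmax)
```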
Classification Evaluation Metrics & Confusion Matrix
Metrics:
Accuracy
Precision
Recall/Sensitivity
F1 Score
Confusion Matrix
Accuracy:
Correct predictions divided by total predictions.
Precision:
True positives divided by total predicted positives; useful for minimizing false positives.
Recall (Sensitivity):
True positives divided by actual positives; useful for reducing false negatives.
F1 Score:
Harmonic mean of precision and recall; balances the two metrics.
Confusion Matrix:
A table that summarizes classification results into true positives, false positives, true negatives, and false negatives, enabling detailed evaluation of model performance.
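Deriving the four metrics from raw confusion-matrix counts (the counts are made up):

```python
# Classification metrics from confusion-matrix counts; values are invented.
tp, fp, tn, fn = 40, 10, 45, 5

accuracy = (tp + tn) / (tp + fp + tn + fn)          # 0.85
precision = tp / (tp + fp)                          # 0.80, penalizes false positives
recall = tp / (tp + fn)                             # ~0.89, penalizes false negatives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(accuracy, precision, recall, f1)
```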
Clustering Methods:
k-means: Partitions data into k clusters around centroids by minimizing the variance within each cluster.
k-medoids: Similar but chooses actual data points (medoids) as centers, making it more robust to outliers.
Distance Measure:
k-means commonly uses Euclidean distance.
Calinski-Harabasz Criterion:
A metric used in Tableau to determine the optimal number of clusters by balancing within-cluster and between-cluster variance.
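The same criterion is also available in scikit-learn; a sketch scoring several values of k on synthetic blobs:

```python
# Choosing k by the Calinski-Harabasz criterion on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))  # higher is better
```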
Correcting for Skewness/Kurtosis:
Transformations:
For positive skew: Use square root, cube root, or log transformations.
For negative skew: Use square or exponential functions.
Skew and kurtosis corrections ensure the data distribution is closer to normal, which is especially useful in models that assume normality, like MLR.
Feature Selection:
Reduces data dimensionality to avoid overfitting, enhance interpretability, and improve model performance.
Techniques include removing irrelevant features, using correlation to exclude highly correlated variables, or applying algorithms to rank feature importance.
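One possible correlation-based sketch, assuming pandas; the DataFrame and the 0.9 cutoff are illustrative choices, not fixed rules:

```python
# Drop one of each pair of highly correlated features; data is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": a * 0.98 + rng.normal(0.0, 0.05, 200),  # near-copy of "a"
                   "c": rng.normal(size=200)})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("dropping:", to_drop)  # 'b' is nearly a copy of 'a'
reduced = df.drop(columns=to_drop)
```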
Inductive Logic:
Deriving general principles from specific observations (e.g., a model generalizing from training data).
Deductive Logic:
Starting from general principles to make predictions about specific cases (testing hypotheses, rule-based reasoning).
Missing Data Classification
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Missing Not at Random (MNAR)
Missing Completely at Random (MCAR)
Can check characteristics of records with missing values against those without; they should not differ
You can safely impute the value
Missing at Random (MAR)
Characteristics are significantly different
Better name is missing conditionally at random
Dangerous to impute values – may need to delete these observations
Missing Not at Random (MNAR)
Don’t delete, don’t predict
Replace using a strategy appropriate to your context
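A sketch of the MCAR-style check mentioned above: compare another attribute across records where a value is missing versus present; all data and column names are invented:

```python
# Compare characteristics of missing vs. non-missing records (MCAR check).
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
df = pd.DataFrame({"age": rng.normal(40, 10, 200),
                   "income": rng.normal(50_000, 8_000, 200)})
df.loc[rng.choice(200, 30, replace=False), "income"] = np.nan  # inject missingness

missing = df["income"].isna()
t, p = stats.ttest_ind(df.loc[missing, "age"], df.loc[~missing, "age"])
print("p =", p)  # large p: groups look alike, consistent with MCAR -> safer to impute
```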