Core Concepts Flashcards

1
Q

Supervised Learning

A

Involves training a model on labeled data so it can predict outcomes for new, unseen cases. Types include:

Classification: Predicting a category (e.g., spam vs. not spam).

Regression: Predicting a continuous value (e.g., house prices).

2
Q

Unsupervised Learning

A

Involves finding patterns or groupings in unlabeled data, without a specific target. Types include:

Clustering: Grouping similar data points (e.g., customer segmentation).

Association: Discovering relationships between variables (e.g., items bought together in a store) (RM06-Clustering in Tabl…) (RM07-Altair AI Studio P…).

3
Q

CRISP-DM Phases

A

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

4
Q

Business Understanding:

A

Define objectives and requirements.

Purpose
Explore opportunities that can be leveraged through data mining
Understand the problem to be solved and its use scenario

Deliverables
Project objectives
Success criteria
Review of current situation
Resources
Requirements, assumptions, constraints
Risks & contingencies
Terminology
Costs/benefits
Project plan

5
Q

Data Understanding:

A

Collect and explore data to identify issues or insights.

Purpose
Evaluate the raw material (data) you have for the project
Perform exploratory data analysis (EDA)

Deliverables
Initial data collection report
Data description report
Data exploration report
Data quality report

6
Q

Evaluation:

A

Assess model performance with metrics (e.g., R² for regression).

Purpose
Assess model results rigorously to gain confidence in its validity and reliability
Assure that model can help achieve your business objective

Deliverables
Assessment of modeling results
Approved models
Process review
List of potential actions
Decision/recommendation

7
Q

Data Preparation:

A

Clean and transform data for modeling (e.g., dealing with missing values, normalizing data).

Purpose
Manipulate/convert data into a form that will provide the best results
Iterative in nature

Deliverables
Data inclusion/exclusion report
Data cleaning report
Derived attributes list
Generated records list
Merged data report
Aggregations report

8
Q

Deployment:

A

Implement the model in a real-world setting (RM08-Altair AI Studio F…) (RM05-Introduction to Mo…).

Purpose
Put the model into production in order to realize benefits

Deliverables
Deployment plan
Monitoring & maintenance plan
Final report & presentation
Documentation

9
Q

Modeling:

A

Apply algorithms to the data, such as MLR or clustering.

Purpose
Identify the model(s) that can capture patterns or regularities in the data
Calibrate models for optimum performance

Deliverables
Selected (best) modeling technique
List of assumptions
Design description
Parameter settings
Report of tested models
Model descriptions
Model assessment

10
Q

Assumptions of Multiple Linear Regression (MLR)
Assumptions to Check:

A

Linearity: The relationship between predictors and outcome should be linear.

Normality of Errors: Errors (residuals) should follow a normal distribution.

Homoskedasticity: Variance of residuals should be constant across all levels of independent variables.

Independence of Errors: Residuals should not be autocorrelated.

No Multicollinearity: Predictors should not be highly correlated with each other (RM07-Altair AI Studio P…).
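A quick screen for multicollinearity is the pairwise correlation between predictors. A minimal standalone sketch (illustrative only, with made-up data; not from the course materials):

```python
def pearson_r(x, y):
    """Pearson correlation between two predictors; |r| near 1 flags multicollinearity."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Two perfectly collinear predictors: one should be dropped before fitting MLR
r = pearson_r([1, 2, 3], [2, 4, 6])  # r ≈ 1.0
```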

11
Q

Regression Evaluation Metrics:

A

R²
RMSE
MAE

12
Q

R² (Coefficient of Determination):

A

Proportion of the variance in the dependent variable explained by the model.

13
Q

RMSE (Root Mean Squared Error):

A

Square root of the average squared prediction error; measures typical error magnitude and penalizes large errors more heavily than MAE.

14
Q

MAE (Mean Absolute Error):

A

Average of absolute differences between predictions and actual values; less sensitive to large errors compared to RMSE (RM08-Altair AI Studio F…).
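For concreteness, all three regression metrics can be computed from paired actual/predicted values. A standalone Python sketch (illustrative only, not from the source materials):

```python
import math

def regression_metrics(actual, predicted):
    """Compute R², RMSE, and MAE for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n                 # mean absolute error
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)     # root mean squared error
    mean = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
    ss_tot = sum((a - mean) ** 2 for a in actual)         # total sum of squares
    r2 = 1 - ss_res / ss_tot                              # variance explained
    return r2, rmse, mae
```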

15
Q

k-NN:

A

A method that classifies an instance by the most common class among its k nearest neighbors; a form of instance-based learning (RM06-Clustering in Tabl…).
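The majority-vote idea can be sketched in a few lines of Python (a simplified illustration with made-up training points, not a production implementation):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training instances.
    `train` is a list of (features, label) pairs."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Sort training instances by distance to the query, keep the k closest
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```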

16
Q

Distance Calculation:

A

Distance-based models such as k-NN use Euclidean distance to measure closeness. Across multiple dimensions, it is the square root of the sum of squared differences between each feature pair.
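The formula translates directly into code (standalone sketch):

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between each feature pair."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The classic 3-4-5 right triangle: distance from (0, 0) to (3, 4) is 5
```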

17
Q

Bayes Theorem:

A

Computes a posterior probability by combining prior knowledge with the likelihood of the observed evidence. Used to update predictions as new evidence is added.
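As a worked example, the classic diagnostic-test calculation (hypothetical numbers, standalone sketch):

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# Hypothetical screening test: 1% prevalence, 99% sensitivity, 5% false-positive rate
prior = 0.01
sensitivity = 0.99
false_positive = 0.05
evidence = sensitivity * prior + false_positive * (1 - prior)  # total P(positive test)
p = posterior(prior, sensitivity, evidence)  # ≈ 0.167: most positives are still false alarms
```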

18
Q

Naïve Bayes:

A

Assumes independence among features. It’s highly efficient for text classification (e.g., spam detection), often with surprisingly high accuracy despite simplifying assumptions (RM08-Altair AI Studio F…).

19
Q

Entropy:

A

Measures the randomness or impurity in a dataset. A set is considered pure (entropy = 0) if it contains only one class.
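The standard Shannon entropy calculation can be sketched as (standalone illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (log base 2) of a list of class labels; 0 means pure."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

# A pure set has entropy 0; a 50/50 split of two classes has entropy 1 bit
```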

20
Q

Data Preparation Basics:

A

Handling Missing Values
Standardizing/Normalization
Correcting Skewness/Kurtosis

21
Q

Information Gain:

A

Used to select features in decision trees. It calculates the reduction in entropy after splitting the data by a specific feature, aiming to increase purity in resulting subsets (RM05-Introduction to Mo…).
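The entropy-reduction calculation can be sketched as (standalone illustration, self-contained):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (log base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, splits):
    """Entropy of the parent minus the size-weighted entropy of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in splits)
    return entropy(parent) - weighted

# Splitting a 50/50 parent into two pure subsets yields the maximum gain of 1 bit
```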

22
Q

Handling Missing Values:

A

Drop rows/columns with missing values, or impute them using methods like mean, median, mode, or prediction based on other attributes.
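Mean imputation, the simplest of these methods, can be sketched as (standalone illustration):

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]
```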

23
Q

Correcting Skewness/Kurtosis:

A

Apply log, square root, or cube root transformations for positive skew; square or exponential transformations for negative skew (RM07-Altair AI Studio P…).

24
Q

Standardizing/Normalizing:

A

Important for models like k-NN and k-means, where feature scales matter.

Options include:
Z-score standardization
Min-max normalization

25
Q

Z-score standardization:

A

Centers data around zero with a standard deviation of one.

26
Q

Min-max normalization:

A

Rescales data to a range, usually 0 to 1.
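Both rescaling options can be sketched in a few lines (standalone illustration):

```python
import statistics

def z_score(values):
    """Center data on zero with a standard deviation of one (population std)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale data linearly to the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```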

27
Q

Classification Evaluation Metrics & Confusion Matrix
Metrics:

A

Accuracy
Precision
Recall/Sensitivity
F1 Score
Confusion Matrix

28
Q

Accuracy:

A

Correct predictions divided by total predictions.

29
Q

Precision:

A

True positives divided by total predicted positives; useful for minimizing false positives.

30
Q

Recall (Sensitivity):

A

True positives divided by actual positives; useful for reducing false negatives.

31
Q

F1 Score:

A

Harmonic mean of precision and recall; balances the two metrics.

32
Q

Confusion Matrix:

A

A matrix that summarizes classification results, showing true positives, false positives, true negatives, and false negatives, which helps evaluate model performance in detail (RM08-Altair AI Studio F…).
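All four classification metrics fall out of the confusion-matrix cell counts. A standalone sketch (illustrative, not from the source materials):

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)     # correct / all predictions
    precision = tp / (tp + fp)                     # of predicted positives, how many real
    recall = tp / (tp + fn)                        # of actual positives, how many found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1
```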

33
Q

Clustering Methods:

A

k-means: Partitions data into k clusters around centroids by minimizing the variance within each cluster.
k-medoids: Similar but chooses actual data points (medoids) as centers, making it more robust to outliers.
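Lloyd's algorithm for k-means can be sketched in plain Python (a simplified illustration with naive seeding; real tools add smarter initialization and convergence checks):

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    centroids = list(points[:k])  # naive seeding: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign to the centroid with the smallest squared distance
            i = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # Move each centroid to the mean of its cluster (keep old if cluster empty)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters
```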

34
Q

Distance Measure:

A

k-means commonly uses Euclidean distance.

35
Q

Calinski-Harabasz Criterion:

A

A metric used in Tableau to determine the optimal number of clusters by balancing within-cluster and between-cluster variance (RM06-Clustering in Tabl…).

36
Q

Correcting for Skewness/Kurtosis:
Transformations:

A

For positive skew: Use square root, cube root, or log transformations.

For negative skew: Use square or exponential functions.

Skew and kurtosis corrections ensure the data distribution is closer to normal, which is especially useful in models that assume normality like MLR (RM07-Altair AI Studio P…).
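A log transform, for example, turns the multiplicative spacing of a right-skewed variable into even, additive spacing (standalone sketch; positive values only):

```python
import math

def log_transform(values):
    """Log transform for positive skew: compresses the long right tail.
    All values must be strictly positive."""
    return [math.log(v) for v in values]

# Multiplicative gaps (1, 10, 100, 1000) become equal additive gaps after the transform
```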

37
Q

Feature Selection:

A

Reduces data dimensionality to avoid overfitting, enhance interpretability, and improve model performance.
Techniques include removing irrelevant features, using correlation to exclude highly correlated variables, or applying algorithms to rank feature importance (RM08-Altair AI Studio F…) (RM05-Introduction to Mo…).

38
Q

Inductive Logic:

A

Deriving general principles from specific observations (the basis of learning a model from training data).

39
Q

Deductive Logic:

A

Starting from general principles to make predictions about specific cases (testing hypotheses, rule-based reasoning) (RM05-Introduction to Mo…).

40
Q

Missing Data Classification

A

Missing Completely at Random (MCAR)

Missing at Random (MAR)

Missing Not at Random (MNAR)

41
Q

Missing Completely at Random (MCAR)

A

Missingness is unrelated to the data itself; characteristics of missing and non-missing observations do not differ significantly
Safe to impute the value

42
Q

Missing at Random (MAR)

A

Characteristics of missing observations differ significantly from non-missing ones
A better name is "missing conditionally at random"
Dangerous to impute values; you may need to delete these observations

43
Q

Missing Not at Random (MNAR)

A

Don’t delete and don’t predict these values
Replace them using a strategy appropriate to the concept being measured