lecture 5 Flashcards

1
Q

What is predictive analytics?

A

Predictive analytics is the process of extracting information from large data sets in order to determine trends and patterns that can be used to generate models and predict behaviors of interest.

2
Q

Prescriptive analytics

A

Aims at suggesting (prescribing) the best decision options in order to take advantage of the predicted future utilizing large amounts of data (Šikšnys & Pedersen, 2016).

Incorporates the predictive analytics output and utilizes artificial intelligence, optimization algorithms and expert systems in a probabilistic context in order to provide adaptive, automated, constrained, time-dependent and optimal decisions.

3
Q

Relation between Predictive and prescriptive (predictive-prescriptive split)

A

There is considerable overlap between the two areas.

Difference:
Prescriptive analytics depends on predictive analytics. In this course they are treated as two separate steps.

The Venn diagram on the slide shows that machine learning / data mining falls mainly under predictive analytics, but also partly into the prescriptive part. Probabilistic models sit halfway between the two.

Predictive analytics:
* statistical analysis

Prescriptive analytics:
* mathematical programming
* simulation
* logic-based models
* evolutionary computation

4
Q

What is AI?

A

No consensus on a single definition

Thinking Humanly:
Cognitive science/Cognitive modelling

Acting Humanly: Turing test

Thinking Rationally: Logic-based/Deductive Intelligence

Acting Rationally: Rational agents (trying to achieve the best solution)

Is it more about actual intelligence or perceived intelligence?

(slide 11)

5
Q

Chinese room argument

A

Is it more about actual intelligence or perceived intelligence?

Does an AI actually understand, or does it simply execute an algorithm/set of rules with (super)human capacities?

6
Q

Levels of AI

A
  • narrow AI
  • general AI
  • super AI
7
Q

What is narrow AI?

A

Dedicated to assist with or take over specific tasks

8
Q

General AI

A

Takes knowledge from one domain and transfers it to other domains

9
Q

Super AI

A

machines that are an order of magnitude smarter than humans

10
Q

differences between AI, machine learning, and deep learning

A

AI: computing systems which are capable of performing tasks that humans are very good at, for example recognising objects

ML: the field of AI that applies statistical methods to enable computer systems to learn from data towards an end goal.

Deep learning: neural networks with several hidden layers.

11
Q

Machine learning definition

A

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E

12
Q

When to use:
* classical ML
* Reinforcement learning
* ensembles
* neural networks and deep learning

A

Classical ML:
* simple data and clear features

Reinforcement learning:
* no data, but we have an environment to interact with

Ensembles:
* when quality is a real problem

Neural networks and deep learning:
* complicated data, unclear features, belief in a miracle

13
Q

Data requirements for Machine learning (taxonomy of machine learning)

A
  • Supervised
  • unsupervised
  • semi-supervised
  • reinforcement
14
Q

Supervised learning

A

With supervised learning, you feed the desired output into the system together with the input (for instance, pictures of cats and dogs labelled as "dog" or "cat") to train the model. This means that in supervised learning, the machine already knows the output of the algorithm before it starts working on it or learning it. A basic example of this concept would be a student learning a course from an instructor. The student knows what he/she is learning from the course.

With the output of the algorithm known, all that a system needs to do is to work out the steps or process needed to reach from the input to the output. The algorithm is being taught through a training data set that guides the machine.

The type of target variable is either:
* continuous, which results in regression analysis
* categorical, which results in classification.
Examples of categories formed through classification include demographic data such as marital status, sex, or age

Even more information if needed

Supervised learning uses a training set to teach models to yield the desired output. This training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.
Uses labeled data.
examples:
* Image- and object-recognition: Supervised learning algorithms can be used to locate, isolate, and categorize objects out of videos or images, making them useful when applied to various computer vision techniques and imagery analysis.
* Predictive analytics
* Spam detection: Spam detection is another example of a supervised learning model. Using supervised classification algorithms, organizations can train databases to recognize patterns or anomalies in new data to organize spam and non-spam-related correspondences effectively

challenges of supervised learning
* Supervised learning models can require certain levels of expertise to structure accurately.
* Training supervised learning models can be very time intensive.
* Datasets can have a higher likelihood of human error, resulting in algorithms learning incorrectly.
* Unlike unsupervised learning models, supervised learning cannot cluster or classify data on its own.

IBM
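
A minimal sketch (assumed, not from the lecture) of supervised learning with scikit-learn: the iris dataset stands in for any labelled dataset, and a decision tree learns the input-to-output mapping from the training examples.

```python
# Minimal supervised-learning sketch (assumes scikit-learn is installed).
# X holds the inputs, y the known labels the model learns from.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)                            # learn from labelled examples
print(accuracy_score(y_test, model.predict(X_test)))   # accuracy on held-out data
```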

15
Q

Difference between supervised vs. unsupervised learning vs. semi-supervised learning

A

Unlike supervised learning, unsupervised learning uses unlabeled data. From that data, it discovers patterns that help solve for clustering or association problems. This is particularly useful when subject matter experts are unsure of common properties within a data set. Common clustering algorithms are hierarchical, k-means, and Gaussian mixture models.

Semi-supervised learning occurs when only part of the given input data has been labeled. Unsupervised and semi-supervised learning can be more appealing alternatives as it can be time-consuming and costly to rely on domain expertise to label data appropriately for supervised learning

16
Q

Unsupervised learning

A
  • Does not use labels
  • output is unknown
  • far less used than supervised learning
  • seen as the future of ML and its possibilities
  • machines and computers developing the ability to “teach themselves” alludes to the process of unsupervised learning.
  • no access to concrete datasets
  • outcomes of problems are largely unknown
  • no reference data at all
17
Q

Is skippable

example to show difference between supervised and unsupervised learning

A

consider that we have a digital image that has a variety of colored geometric shapes on it. These geometric shapes needed to be matched into groups according to color and other classification features. For a system that follows supervised learning, this whole process is a bit too simple.

The procedure is extremely straightforward, as you just have to teach the computer all the details pertaining to the figures. You can let the system know that all shapes with four sides are known as squares, and others with eight sides are known as octagons, etc. We can also teach the system to interpret the colors and see how the light being given out is classified.

However, in unsupervised learning, the whole process becomes a little trickier. The algorithm for an unsupervised learning system has the same input data as the one for its supervised counterpart (in our case, digital images showing shapes in different colors).

Once it has the input data, the system learns all it can from the information at hand. In fact, the system works by itself to recognize the problem of classification and also the difference in shapes and colors. With information related to the problem at hand, the unsupervised learning system will then recognize all similar objects, and group them together. The labels that it will give to these objects will be designed by the machine itself. Technically, there are bound to be wrong answers, since there is a certain degree of probability. However, just like how we humans work, the strength of machine learning lies in its ability to recognize mistakes, learn from them, and to eventually make better estimations next time around.

18
Q

Reinforcement learning

A

Reinforcement Learning spurs off from the concept of Unsupervised Learning, and gives a high sphere of control to software agents and machines to determine what the ideal behavior within a context can be. This link is formed to maximize the performance of the machine in a way that helps it to grow. Simple feedback that informs the machine about its progress is required here to help the machine learn its behavior.

An agent decides the best action based on the current state of the results
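
A hypothetical, minimal Q-learning sketch (the corridor environment, the reward of 1 at the goal, and all hyperparameters are invented for illustration): there is no labelled dataset, only a feedback signal the agent receives while interacting with the environment.

```python
# Hypothetical Q-learning sketch: a 5-state corridor, reward 1 at the right end.
import random

N_STATES, ACTIONS = 5, (-1, +1)          # move left or right
alpha, gamma, eps = 0.5, 0.9, 0.3        # learning rate, discount, exploration rate
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: explore sometimes, otherwise act on the current Q estimates
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda a: Q[(s, a)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0       # simple feedback signal
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next

# learned greedy policy: should be +1 (move right) in every non-terminal state
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})
```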

19
Q

Reinforcement learning vs. supervised learning and unsupervised learning

A

Reinforcement vs supervised learning
In Supervised Learning we have an external supervisor who has sufficient knowledge of the environment and shares it with the agent to complete the task. But since there are problems where the agent can perform many different kinds of subtasks by itself to achieve the overall objective, the presence of a supervisor is unnecessary and impractical. In Reinforcement Learning there is instead a reward function, unlike in Supervised Learning, that lets the system know about its progress down the right path.

Reinforcement vs unsupervised learning
Reinforcement Learning basically has a mapping structure that guides the machine from input to output. Unsupervised Learning has no such feature: the machine focuses on the underlying task of locating patterns rather than on the mapping for progressing towards the end goal.

For example, if the task for the machine is to suggest a good news update to a user, a Reinforcement Learning algorithm will look to get regular feedback from the user in question, and would then through the feedback build a reputable knowledge graph of all news related articles that the person may like. On the contrary, an Unsupervised Learning algorithm will try looking at many other articles that the person has read, similar to this one, and suggest something that matches the user’s preferences.

https://crayondata.ai/machine-learning-explained-understanding-supervised-unsupervised-and-reinforcement-learning/

20
Q

Math representation (Taxonomy of Machine Learning)

A

divided in model-based and instance based

Instance-based: the technique simply compares new instances to the ones it was trained on.
So new data is compared to the training data and classified on that basis.

Model-based: tries to find a general representation of the relationships in the dataset.
The algorithm chooses a hypothesis, a mathematical representation. Then it determines the parameters of this hypothesis based on the available data. This is used to make estimations on new data.

https://hermit-notebook.site/en/notebook/computer-sciences/artificial-intelligence/machine-learning/taxonomy-of-machine-learning/

21
Q

Classification by Training behaviour (Taxonomy of Machine Learning)

A

ML techniques do not keep a memory of the entire dataset they were trained on; their iterative adjustments are based only on the data they are provided with. Many learning techniques are therefore unable to adjust an already trained representation to new data while keeping it consistent with their previous training (because there is no memory of the previous data).

Batch learning: learning techniques that require the entire data set for their training.
All the examples must be provided during the training phase. The “predictor” resulting from the training is then used in production and no more learning occurs. In this setting, if we obtain new examples, we need to train a new model from scratch on the complete, enriched data set.

Online learning: the learning algorithm can adjust an already trained representation to new data. Unlike batch learning, an online learning technique can be provided with new training examples progressively and changes its representation accordingly, even while being used in production. For many underlying representations, true online learning is not possible. However, depending on the formulation, we can often find a pseudo-online algorithm based on recursive algorithms. In this case, the new predictor depends on the current best predictor and all the previous examples (already learnt).

https://hermit-notebook.site/en/notebook/computer-sciences/artificial-intelligence/machine-learning/taxonomy-of-machine-learning/

22
Q

Classification by Task Type (Machine learning taxonomy by usage or goal)

A
  • Regression
  • Classification
  • Clustering
  • Association Rule learning
  • Decision making
  • Blind source separation
  • Dimensionality reduction
23
Q

Regression

A

𝑌 = 𝑓 (𝑋)

The values of 𝑌 are determined by a human
𝑌 ∈ ℝ is a continuous variable

𝑓 is learned from the data through ML

Regression tries to find the value of a property of a phenomenon depending on the values of other properties or instances of the same kind.

Regression typically falls under supervised learning.

For example, suppose an ice cream seller wants to predict its income based on temperature forecasts. We would be learning the (model and) parameters of a regression if we were to create a software package for this requirement.
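
A hypothetical sketch of the ice-cream example in scikit-learn (the temperature/income numbers are invented): 𝑓 is learned from the data and then used to predict income for a new forecast.

```python
# Hypothetical regression sketch: Y = f(X), X = forecast temperature, Y = daily income.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[14.0], [18.0], [22.0], [26.0], [30.0]])   # temperature in °C
y = np.array([180.0, 240.0, 310.0, 400.0, 470.0])        # income (invented numbers)

f = LinearRegression().fit(X, y)            # f is learned from the data
print(f.predict(np.array([[25.0]])))        # predicted income for a 25 °C forecast
```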

24
Q

Classification

A

𝑌 = 𝑓(𝑋)
The values of 𝑌 are determined by a human

𝑌 ∈ { 𝐶₁ , … , 𝐶ₖ } is a discrete variable, e.g.
𝐶₁ = Triangle
𝐶₂ = Circle

𝑓 is learned from the data through ML

Classification tries to find boundaries in the dataset so as to separate the elements into a number of classes known (or defined) before the training.

Classification typically falls under supervised learning. However, there is also unsupervised classification, such as anomaly or outlier detection.
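
A minimal classification sketch (the synthetic dataset is assumed for illustration): 𝑌 takes discrete class values and 𝑓 learns the boundary between them.

```python
# Minimal classification sketch: Y is a discrete variable, f learns a decision boundary.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)     # supervised: the labels y are given
print(clf.predict(X[:5]), y[:5])         # predicted vs. true classes
```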

25
Q

clustering

A

The real values of 𝑌 are unknown

The ML algorithm tries to identify existing patterns in the data (without prior supervision)

Clustering tries to group observations such that elements belonging to the same group (or cluster) are more similar - according to some similarity measure - and those belonging to different groups are more dissimilar. Clustering is typically an unsupervised learning task.

26
Q

Baseline vs. State-of-the-art-model

A

Baseline/Benchmark
* Simple model
* Easy/quick to fit
* Reference point for performance analysis

State-of-the-art model
* Usually very complex model
* Costly/optimized fit
* Best possible performances

27
Q

Supervised machine learning for regression

A

Linear Regression

Artificial Neural Networks

Deep Artificial Neural Networks

Support Vector Regression (SVR)

K-Nearest Neighbours (k-NN)

28
Q

Linear Regression

A

Dataset requirement :
Supervised

Data provisioning: Batch

Model representation:
Model-based : 𝑌 = 𝛽𝑋 + ε

Task: Regression
For Classification, the equivalent model is
Logistic Regression

29
Q

Artificial Neural Networks

A

Dataset requirement :
Supervised (ANN, RNN, CNN, GAN)
Unsupervised (Autoencoders)

Data provisioning: Batch/Online

Model representation: Model-based
Task: Regression/Classification Ensemble model

30
Q

Deep Artificial Neural Networks

A

Dataset requirement :
Supervised (ANN, RNN, CNN, GAN)
Unsupervised (Autoencoders)

Data provisioning: Batch/Online

Model representation: Model-based

Task: Regression/Classification

31
Q

Support Vector Regression (SVR)

A

Dataset requirement :
Supervised

Data provisioning: Batch

Model representation: Model-based : 𝑌 = 𝐾(𝛽𝑋) + ε

Task: Regression
For Classification, the equivalent model is Support Vector Machines (SVM)

32
Q

K-Nearest Neighbours (k-NN)

A

Dataset requirement:
Supervised

Data provisioning:
Batch/Online

Model representation:
Instance-based

Task:
Classification/Regression
Regression -> Mean
Classification -> Majority vote
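
An assumed k-NN sketch (dataset and k chosen for illustration): the classifier predicts by majority vote among the k nearest training instances; KNeighborsRegressor would instead average their target values.

```python
# Assumed k-NN sketch: classification by majority vote of the 5 nearest neighbours.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on unseen data
```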

33
Q

Supervised Machine Learning for classification

A
  • Naïve Bayes
  • Logistic Regression
  • Support Vector Machines (SVM)
  • Decision Tree
  • Random forest
  • Artificial Neural Networks
34
Q

Naïve Bayes

A

Dataset requirement :
Supervised

Data provisioning:
Batch

Model representation:
Model-based

Task: Classification

35
Q

Logistic Regression

A

Dataset requirement :
Supervised

Data provisioning: Batch

Model representation:
Model-based : 𝑌 = 𝛽𝑋 + ε

Task: Classification
For Regression, the equivalent model is Linear Regression

36
Q

Support Vector Machines (SVM)

A

Dataset requirement :
Supervised

Data provisioning: Batch

Model representation: Model-based : 𝑌 = 𝐾(𝛽𝑋) + ε

Task: Classification

For Regression, the equivalent model is Support Vector Regression

37
Q

Decision Tree

A

Dataset requirement :
Supervised

Data provisioning: Batch

Model representation: Instance-based

Task: Regression/Classification
(Slide figure: Regression vs. Classification decision tree)

38
Q

Random forest

A

Dataset requirement :
Supervised

Data provisioning: Batch

Model representation: Instance-based

Task: Regression/Classification
Ensemble model
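
An assumed Random Forest sketch (dataset and number of trees are illustrative): an ensemble of decision trees whose individual predictions are combined by majority vote.

```python
# Assumed Random Forest sketch: an ensemble of 100 decision trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())   # mean accuracy over 5 folds
```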

39
Q

Artificial Neural Networks

A

Dataset requirement :
Supervised (ANN, RNN, CNN, GAN)
Unsupervised (Autoencoders)

Data provisioning: Batch/Online

Model representation: Model-based

Task: Regression/Classification
Ensemble model

40
Q

Unsupervised Machine Learning

A
  • K-Means Clustering
  • Hierarchical clustering
  • And many more…

many more:

Dimensionality Reduction
* PCA
* t-SNE
* Autoencoders

Clustering
* DBSCAN
* Self-organizing maps

Reinforcement Learning
* Q-Learning
* Deep Q-Learning

41
Q

K-Means Clustering

A

Dataset requirement: Unsupervised

Data provisioning: Batch

Model representation: Instance-based

Task: Clustering/pattern recognition

N.B. : As clustering is unsupervised, multiple solutions can be found!
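
An assumed K-means sketch (the blob data and k = 3 are invented): no labels are used, and different initialisations can yield different cluster solutions.

```python
# Assumed K-means sketch: cluster unlabelled points into k = 3 groups.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)     # true labels are ignored
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])          # cluster assignment of the first 10 points
print(km.cluster_centers_)      # the 3 learned cluster centres
```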

42
Q

Hierarchical clustering

A

Dataset requirement: Unsupervised

Data provisioning: Batch

Model representation: Instance-based

Task: Clustering/pattern recognition

43
Q

Machine learning in practice - pipeline

A

raw data
* collection
* download
* scraping

Data preprocessing
* Data quality (cf. diagnostic)
* missing data
* categorical variables

Train-test split
* single validation
* cross validation

model fit
* fit on training data
* test on testing data

performance evaluation
* performance metric choice
* evaluation on validation data
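
An assumed end-to-end sketch of this pipeline in scikit-learn (the tiny DataFrame, the model choice, and the split ratio are all illustrative): preprocessing handles missing and categorical data, the data is split, and the model is fit on the training part and evaluated on the held-out part.

```python
# Assumed pipeline sketch: raw data -> preprocessing -> split -> fit -> evaluation.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Raw data (invented): one numeric feature with gaps, one categorical feature, one target.
df = pd.DataFrame({
    "age":    [25, 32, None, 51, 46, 22, 37, None],
    "colour": ["red", "blue", "red", "green", "blue", "red", "green", "blue"],
    "y":      [0, 1, 0, 1, 1, 0, 1, 0],
})
X, y = df[["age", "colour"]], df["y"]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),           # missing data
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["colour"]),  # categorical variables
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
model.fit(X_train, y_train)            # fit on training data
print(model.score(X_test, y_test))     # evaluate on held-out data
```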

44
Q

Splitting data

A

Data is split for three different uses:
* models (e.g. trees of different depths) are fit to the training data
* their performance is evaluated on the validation set (the lower the validation error, the better)
* a final estimate of model performance is computed on the test set (see the sketch below)
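
An assumed sketch of a train/validation/test split (the ratios are illustrative): two successive calls to train_test_split produce the three sets described above.

```python
# Assumed three-way split sketch: 60% train, 20% validation, 20% test.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
# fit candidate models on (X_train, y_train), pick the best on (X_val, y_val),
# and report the final performance estimate on (X_test, y_test).
```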

45
Q

Splitting data and vocabulary

A

Feature: With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute. “Color is blue” is a feature (blue is one of the values color can have).
target: target variable, also known as a dependent variable, is the outcome we aim to predict or explain using our model. It is the variable that we want to estimate or classify based on the available data.
sample: a row, one instance in a dataset, i.e. one value for each of the features (and thus variables).
Training Set: A set of observations used to generate machine learning models.
Test Set: A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

46
Q

Categorical data preprocessing

A

* Ordinal encoding
* One-hot encoding

Use One-Hot Encoding: When dealing with nominal categorical variables that lack any inherent order.

Use Ordinal Encoding: When you have categorical variables with a clear ordinal relationship and the order between categories holds valuable information.

47
Q

One-hot-encoding

A

transforms categorical variables into a binary matrix where each category is represented as a column, and each instance is marked with a ‘1’ in the corresponding column and ‘0’ in all other columns. (For instance, with three values red, green and yellow there are 3 columns; if the value is red, there is a 1 in the red column and a 0 in the others.)

advantages:
1. Preservation of Information: One-hot encoding preserves the uniqueness of each category. It ensures that the algorithm does not assume any ordinal relationship among the categories.
2. Lack of Bias: Since each category is represented independently, one-hot encoding prevents introducing unintended biases based on the order of categories.
3. Suitable for Most Algorithms: One-hot encoded data is widely accepted by various machine learning algorithms, such as decision trees, random forests, and neural networks

limitations:
1. Dimensionality: One-hot encoding can significantly increase the dimensionality of the dataset, especially when dealing with categorical variables with many unique categories. This can lead to the curse of dimensionality and negatively impact model performance.
2. Loss of Order Information: One-hot encoding discards any inherent order that might exist among categories, which can be crucial in some scenarios.
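
A short assumed sketch with pandas (the colour column is the invented example from above): each category becomes its own binary column.

```python
# Assumed one-hot encoding sketch: three colour values become three 0/1 columns.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "yellow", "red"]})
print(pd.get_dummies(df, columns=["colour"]))
# -> colour_green, colour_red, colour_yellow, with a single 1 per row
```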

48
Q

Ordinal Encoding

A

Ordinal encoding is a technique that assigns a unique integer value to each category based on their order or rank. It is suitable for categorical variables that exhibit a clear ordinal relationship, where one category is greater or lesser than another. (for instance: flight ticket, first, second or a third class)

advantages:
1. Efficiency in Dimensionality: Ordinal encoding does not inflate the dataset’s dimensionality like one-hot encoding does. It replaces categorical values with integers, saving space and computation time.
2. Retains Order Information: This technique preserves the ordinal information that exists among categories, allowing the algorithm to leverage this information if it is relevant to the problem.

limitations:
1. Assumption of Equal Steps: Ordinal encoding assumes equal intervals between categories, which might not always be the case in real-world scenarios.
2. Potential Misrepresentation: If the assigned integer values do not accurately reflect the ordinal relationships, the encoded data might mislead the algorithm.
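
An assumed ordinal encoding sketch (the ticket-class categories and their order are invented for illustration): each class is mapped to an integer that preserves the rank.

```python
# Assumed ordinal encoding sketch: ticket classes mapped to integers that keep their rank.
from sklearn.preprocessing import OrdinalEncoder

tickets = [["third"], ["first"], ["second"], ["first"]]
enc = OrdinalEncoder(categories=[["third", "second", "first"]])   # explicit order: third < second < first
print(enc.fit_transform(tickets))   # [[0.], [2.], [1.], [2.]]
```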

49
Q

Missing data preprocessing

A

Case deletion
Missing data imputation
Approaches that take the data distribution into account

50
Q

Missing data imputation

A

Generally replace the missing quantitative values using Mean/Median and when it comes to categorical or qualitative data, we use Mode to impute the missing data.
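
An assumed pandas sketch (the small DataFrame is invented): the quantitative column is filled with its median, the categorical column with its mode.

```python
# Assumed imputation sketch: median for the quantitative column, mode for the categorical one.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30_000, np.nan, 45_000, 52_000], "sex": ["F", "M", np.nan, "F"]})
df["income"] = df["income"].fillna(df["income"].median())   # median imputation
df["sex"] = df["sex"].fillna(df["sex"].mode()[0])           # mode (most frequent) imputation
print(df)
```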

51
Q

Case deletion

A

List Wise Deletion: If we have missing values in the row then, delete the entire row. So, here we get some data loss. But to avoid this, we can use the Pairwise deletion method.

Pair Wise Deletion: We find the correlation matrix here. If the feature is highly correlated with the target variable, then we use some different imputation methods to deal with missing values. But, if the feature is not highly correlated with the target variable, then we delete the entire column.
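
An assumed pandas sketch of list-wise deletion (the DataFrame is invented); for the pair-wise approach one would first inspect the correlations before deciding whether to impute or drop a column.

```python
# Assumed case-deletion sketch: dropna removes every row with at least one missing value.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 33], "income": [30_000, 45_000, np.nan, 52_000], "y": [0, 1, 1, 0]})
print(df.dropna())        # list-wise deletion: only complete rows remain
print(df.corr()["y"])     # inspect correlations with the target before dropping a column
```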

52
Q

Precision

A

exactness of model

True positive / (true positive + false positive)

53
Q

Accuracy

A

percentage correct predictions

(true positive + true negative) / (true positive + false negative + false positive + true negative)

54
Q

Recall

A

Completeness of model

TP / (TP+FN)

55
Q

F1 Score

A

Combines precision and recall

F1 = 2 * (precision * recall) / (precision + recall)

(so the fraction, then times 2)
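
An assumed sketch computing the metrics from these cards with scikit-learn (the label vectors are invented); the functions mirror the formulas above.

```python
# Assumed metrics sketch: accuracy, precision, recall and F1 from predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / (TP + FN + FP + TN)
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))          # 2 * P * R / (P + R)
```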