Lectures Flashcards
What are the drawbacks of increasing dimensionality?
- Data becomes sparse
- It becomes harder to generalize the model
- Increasing the number of features will not always improve classification accuracy
What is the correlation between the number of training examples and dimensionality?
The number of training examples required increases exponentially with dimensionality D
What is the Hughes phenomenon?
If the number of training samples is fixed and we keep increasing the number of dimensions, the predictive power of the machine learning model first increases, but after a certain point it tends to decrease
Why is dimensionality reduction required?
- The space required to store the dataset also gets reduced
- Less computation/training time is required
- It removes redundant and irrelevant features
- It helps in interpretation and visualization
How is dimensionality reduction achieved?
- Some features may contain negligible or irrelevant information
- Several features can be combined together without loss or gain of information
What is dimensionality reduction?
It is a data preparation technique performed on data prior to modeling. It might be performed after data cleaning and data scaling and before training a predictive model
What are the dimensionality reduction techniques?
- Feature selection:
Chooses a subset of the original features
- Feature extraction:
Computes a new set of features from the original features through some transformation f()
Explain the feature selection technique
- Selects the most relevant ones to build better, faster, and easier to understand learning models.
Filter approach:
These methods evaluate the relevance of features independently of the chosen learning algorithm based on statistical measurements
Wrapper approach:
These methods assess the performance of a specific machine learning algorithm by repeatedly training and evaluating models with different subsets of features and select the best ones
Embedding approach:
These methods integrate feature selection within the model building process itself
What are the common techniques of feature selection through filtering?
- This method is done as one of the pre-processing steps before passing the data to build a model
- Mutual information:
Calculate the MI (degree of dependence) of each feature with respect to the class variable. Next, rank the features by their MI and select the top ones
- Correlation coefficient:
A statistical measure of the strength of the linear association between two variables. It helps identify which variables closely resemble each other. If the coefficient value is higher than a chosen threshold, we can remove one of the two variables from the dataset. It ranges from -1 to 1, where a value closer to 1 shows that the variables are highly positively correlated and a value closer to -1 shows that they are negatively correlated
- Variance threshold:
Removes all features whose variance is lower than a given threshold
Set of features -> selecting best feature -> learning algorithm -> performance
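For illustration, a minimal sketch of these filter techniques with scikit-learn, assuming a feature matrix X and class labels y; the Iris dataset and the threshold values below are arbitrary example choices, not part of the lecture.

```python
# Filter-based feature selection sketch: variance threshold, mutual
# information ranking, and correlation-based removal.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Variance threshold: drop features whose variance is below 0.2
# (0.2 is an arbitrary choice for this example).
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Mutual information: score each feature against the class variable,
# rank them, and keep the top k=2.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_mi = selector.fit_transform(X, y)
print("MI scores per feature:", selector.scores_)

# Correlation coefficient: drop one of any pair of features whose
# absolute Pearson correlation exceeds 0.9.
corr = np.corrcoef(X, rowvar=False)
to_drop = {j for i in range(corr.shape[0])
             for j in range(i + 1, corr.shape[1])
             if abs(corr[i, j]) > 0.9}
X_corr = np.delete(X, list(to_drop), axis=1)
```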
How is feature selection done through wrapper methods?
- The selection of features is treated as a search problem, in which different combinations are made, evaluated, and compared with other combinations.
1- Split the data into subsets and train a model
2- Based on the output of the model, add or remove features and train the model again
3- Evaluate the accuracy of all the possible feature combinations
What are some common techniques for feature selection through wrapper methods?
1- Forward selection:
Start with an empty set of attributes S. At each step, add the one attribute that decreases the validation error the most; stop when the validation error becomes stable or shows no significant improvement
2- Backward elimination:
Start with the set of all attributes, then repeatedly drop the feature with the smallest impact on the error
set of features -> (generate subset -> algorithm) -> performance
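For illustration, a sketch of wrapper-style selection using scikit-learn's SequentialFeatureSelector; the estimator, cv folds, and number of features to keep are example choices only.

```python
# Wrapper feature selection sketch: greedy forward selection and
# backward elimination driven by cross-validated model accuracy.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Forward selection: start from an empty set and greedily add the feature
# that improves cross-validated accuracy the most.
forward = SequentialFeatureSelector(model, n_features_to_select=2,
                                    direction="forward", cv=5)
forward.fit(X, y)
print("Forward selection kept:", forward.get_support())

# Backward elimination: start from all features and drop the one whose
# removal hurts cross-validated accuracy the least.
backward = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction="backward", cv=5)
backward.fit(X, y)
print("Backward elimination kept:", backward.get_support())
```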
Filtering vs wrapper methods
- Filter methods evaluate features independently of the learning algorithm using statistical measures, so they are fast and model-agnostic but ignore how features interact with the chosen model
- Wrapper methods evaluate feature subsets by repeatedly training and testing the chosen model, so they are slower and more expensive but usually find subsets better suited to that model
Explain feature extraction
- It transforms the space containing too many dimensions into a space with fewer dimensions
- It aims to reduce the number of features in a dataset by creating new features from existing ones
- The primary goal is to compress the data with the goal of maintaining most of the relevant information
What are feature extraction techniques?
- Principal component analysis (PCA):
Seeks a projection that preserves as much information in the data as possible
- Linear discriminant analysis (LDA):
Seeks a projection that best discriminates the data
Explain the PCA technique
- It is an unsupervised linear dimensionality reduction method that increases interpretability and minimizes information loss
- PCA assumes linear relationships between variables
- It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation
- These new transformed features are called the principal components which capture the maximum variance in the data
- So they are straight lines that capture most of the variance of the data and they have a direction and magnitude
- These components are linear combinations of the original features and provide a new coordinate system for the data
What are the mathematical steps of the PCA algorithm?
1- Standardize the data:
PCA requires standardized data so the first step is to standardize the data to ensure that all variables have a mean of 0 and a standard deviation of 1
2- Calculate the covariance matrix:
The next step is to calculate the covariance matrix of the standardized data. This matrix shows how each variable is related to every other variable
3- Calculate the eigenvectors and eigenvalues
4- Choose the principal components:
Computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance
5- Create the new feature vector:
The final step is to transform the original data into the lower-dimensional space defined by the principal components
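A small NumPy sketch that mirrors these five steps; X is assumed to be an (n_samples, n_features) array, and the random data at the end is only for demonstration.

```python
# Step-by-step PCA sketch following the flashcard steps above.
import numpy as np

def pca(X, k):
    # 1) Standardize: zero mean, unit standard deviation per feature.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2) Covariance matrix of the standardized data (features x features).
    cov = np.cov(X_std, rowvar=False)

    # 3) Eigenvectors and eigenvalues of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: cov is symmetric

    # 4) Order components by eigenvalue, largest (most variance) first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 5) Project the data onto the top-k eigenvectors (principal components).
    components = eigvecs[:, :k]
    return X_std @ components, eigvals / eigvals.sum()

# Example: keep 2 components of random data and inspect explained variance.
X = np.random.RandomState(0).rand(100, 5)
X_reduced, explained_ratio = pca(X, k=2)
print(X_reduced.shape, explained_ratio[:2])
```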
What are some important properties of PCA?
- PCA assumes that the relationships between variables are linear
- PCA assumes that principal components with larger variances are more important and should be retained
- PCA assumes that the principal components are orthogonal to each other
- PCA works best when the data is approximately normally distributed
- The number of principal components is always less than or equal to the number of attributes
- The importance of the principal components decreases as their index increases
- In general, the first components explain the largest variance of the data
- PCA does not handle missing data directly; missing values are typically handled beforehand using techniques such as mean imputation
What are the advantages of PCA?
- Dimensionality reduction:
By determining the most crucial features or components, PCA reduces the dimensionality of the data, which is one of its primary benefits. This is helpful when the initial data contains many variables and is therefore challenging to visualize or analyze
- Feature extraction:
PCA can also be used to derive new features or components from the original data that might be more insightful or understandable than the original features. This is particularly helpful when the original features are correlated or noisy
- Data visualization:
By projecting the data onto the first few principal components, PCA can be used to visualize high-dimensional data in two or three dimensions. This can help locate patterns or clusters that may not be visible in the original high-dimensional space
- Noise reduction:
By locating the underlying signal or pattern in the data, PCA can also be used to lessen the impact of noise or measurement errors
What are the limitations of PCA?
- Interpretability:
The principal components may lack interpretability as they are linear combinations of the original features
- Scale dependence:
PCA is sensitive to the scaling of the features, so features should be standardized before applying PCA
- Linear assumption:
PCA is a linear technique and may not capture nonlinear relationships in the data
- Outlier sensitivity:
PCA is sensitive to outliers in the data, which can distort the principal components
- Computational complexity:
For big datasets, it may be costly to compute the eigenvectors and eigenvalues of the covariance matrix
How we should choose K in PCA?
- K is typically chosen based on how much information (variance) we want to preserve in the data:
- It is usually chosen to preserve 90% of the information in the data
- If K = D we preserve 100% of the information in the data
- Use cross validation to determine the number of PCs that maximizes the model’s performance on unseen data
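A sketch of this choice with scikit-learn's PCA, assuming illustrative random data; the 90% figure matches the rule of thumb above.

```python
# Choose K so that roughly 90% of the variance is preserved.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(200, 20)

pca = PCA().fit(X)                                   # fit all D components
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cumulative >= 0.90)) + 1           # smallest K reaching 90%
print("K preserving 90% of the variance:", k)

# Alternatively, scikit-learn picks K for you when given a fraction:
X_reduced = PCA(n_components=0.90).fit_transform(X)
print("Reduced shape:", X_reduced.shape)
```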
Explain feature extraction using LDA
- It is a supervised linear dimensionality reduction technique that aims to find a new set of variables that maximizes the separation between classes while minimizing the variation within each class
- The resulting components are ranked by their discriminative power and can be used to visualize and interpret the data, as well as for classification or regression tasks
- LDA assumes that the input data follows a Gaussian distribution, so applying LDA to non-Gaussian data can lead to poor classification results
- LDA assumes that the classes are linearly separable in the lower-dimensional space
- LDA seeks to find directions along which the classes are best separated
- It takes into consideration the scatter within-classes and between classes
How does LDA work?
1- Computing the within-class and between-class scatter matrices
2- Computing the eigenvectors and their corresponding eigenvalues for the scatter matrices
3- Sorting the eigenvalues and selecting the top k
4- Creating a new matrix that will contain the eigenvectors mapped to the k eigenvalues
5- The data is then projected onto the eigenvectors with the largest eigenvalues, which represent the most discriminative directions
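A NumPy sketch of these steps, assuming X is an (n_samples, n_features) array, y holds class labels, and k <= n_classes - 1; the Iris example at the end is only for demonstration.

```python
# LDA sketch: build scatter matrices, eigendecompose, and project.
import numpy as np

def lda(X, y, k):
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((n_features, n_features))   # within-class scatter
    S_B = np.zeros((n_features, n_features))   # between-class scatter

    # 1) Build the scatter matrices class by class.
    for c in np.unique(y):
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_W += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)

    # 2) Eigenvectors/eigenvalues of S_W^{-1} S_B.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)

    # 3-4) Sort by eigenvalue and keep the top-k eigenvectors.
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:k]].real

    # 5) Project onto the most discriminative directions.
    return X @ W

# Example with the Iris data (3 classes -> at most 2 components).
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
print(lda(X, y, k=2).shape)
```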
How to evaluate the performance of dimensionality reduction techniques?
- Explained variance ratio (PCA):
The amount of variance in the original data that is captured or explained by each principal component
- Classification accuracy (LDA):
Train a classifier on the lower-dimensional data and measure the classification accuracy on a test set
- Visualization (PCA/LDA):
Visualize the lower-dimensional data and assess whether the classes are well separated and the structure of the data is preserved
- Cross validation (PCA/LDA):
Use cross validation to estimate the generalization performance of the dimensionality reduction technique on unseen data
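A sketch of two of these checks with scikit-learn; the dataset, classifier, and component counts are illustrative assumptions.

```python
# Evaluate dimensionality reduction: explained variance of a PCA fit and
# cross-validated accuracy of a classifier trained on LDA-reduced data.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Explained variance ratio of a 2-component PCA.
pca = PCA(n_components=2).fit(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Cross-validated classification accuracy after LDA reduction.
pipe = make_pipeline(LinearDiscriminantAnalysis(n_components=2),
                     KNeighborsClassifier())
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy with LDA features:", scores.mean())
```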
Explain the difference between linear discriminant analysis and PCA
- PCA ignores class labels and focuses on finding the principal components that maximize the variance in the data. Thus it is an unsupervised algorithm
- LDA is a supervised algorithm that finds the linear discriminants representing the axes that maximize the separation between different classes
- LDA is typically chosen over PCA when the goal is classification or when the class structure in the data is known and important
- PCA and LDA can be used together in a pipeline, where PCA is applied first to reduce the dimensionality of the data, followed by LDA for class separation
- PCA can have at most n_features components while LDA can have at most n_classes -1 components
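A sketch of the PCA-then-LDA pipeline mentioned above, with an illustrative dataset and component counts.

```python
# PCA first reduces dimensionality; LDA then finds class-separating axes.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)           # 64 features, 10 classes

pipe = make_pipeline(PCA(n_components=20),                       # <= n_features
                     LinearDiscriminantAnalysis(n_components=9)) # <= n_classes - 1
X_2step = pipe.fit_transform(X, y)
print(X_2step.shape)                          # (n_samples, 9)
```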
When to choose LDA over PCA?
- Supervised learning:
Maximize the separation between classes for better classification or visualization
- Class separation:
Find a lower-dimensional representation that best separates the classes rather than simply capturing the maximum variance as PCA does
- Interpretability:
LDA components may be more interpretable than PCA components since they are directly related to class separation
Can LDA handle nonlinear relationships between features?
Not directly, but it can through extensions such as kernel LDA and quadratic discriminant analysis
What is unsupervised learning?
- An unsupervised model uncovers interesting structure in the data.
- It can identify clusters of related datapoints without relying on pre-existing labels or target variables
What is clustering?
- It is the classification of objects into different groups, or more precisely, the partitioning of a dataset into subsets, so that the data in each subset share some common traits, often according to some defined distance measure
- The information clustering uses is the similarity between examples
- The data within the same cluster are very similar, while the data in distinct clusters are different
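As an illustration, a minimal k-means run on two synthetic blobs; k-means is just one common clustering algorithm, chosen here as an example, and is not prescribed by the lecture card.

```python
# Cluster two well-separated groups of points: similar points (close under
# Euclidean distance) end up in the same cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))   # roughly 50 points per cluster
```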
What are the reasons for data clustering?
- Discover the nature and structure of the data
- Data classification
- Data coding and compression
- Cluster data whose characteristics change over time