Lectures Flashcards

1
Q

What are the drawbacks of increasing dimensionality?

A
  • Data becomes sparse
  • It becomes harder to generalize the model
  • Increasing the number of features will not always improve classification accuracy
2
Q

What is the relationship between the number of training examples and dimensionality?

A

The number of training examples required increases exponentially with dimensionality D

3
Q

What is the Hughes phenomenon?

A

If the number of training samples is fixed and we keep increasing the number of dimensions, the predictive power of the machine learning model first increases, but after a certain point it tends to decrease

4
Q

Why is dimensionality reduction required?

A
  • It reduces the space required to store the dataset
  • Less computation/training time is required
  • It removes redundant and irrelevant features
  • It helps in interpretation and visualization
5
Q

How is dimensionality reduction achieved?

A
  • Some features may contain negligible or irrelevant information
  • Several features can be combined together without significant loss of information
6
Q

What is dimensionality reduction?

A

It is a data preparation technique performed on data prior to modeling. It might be performed after data cleaning and data scaling and before training a predictive model

7
Q

What are the dimensionality reduction techniques?

A
  • Feature selection:
    Chooses a subset of the original features
  • Feature extraction:
    Computes a new set of features from the original features through some transformation f()
8
Q

Explain the feature selection technique

A
  • It selects the most relevant features to build better, faster, and easier-to-understand learning models.

Filter approach:
These methods evaluate the relevance of features independently of the chosen learning algorithm based on statistical measurements

Wrapper approach:
These methods assess the performance of a specific machine learning algorithm by repeatedly training and evaluating models with different subsets of features and selecting the best subset

Embedded approach:
These methods integrate feature selection within the model-building process itself

9
Q

What are the common techniques of feature selection through filtering?

A
  • This method is applied as one of the pre-processing steps before passing the data to build a model
  • Mutual information:
    Calculate the MI (a measure of mutual dependence) of each feature with respect to the class variable. Next, rank the features based on their MI and select the top ones
  • Correlation coefficient:
    A statistical measure of the strength of the linear association between two variables. It helps identify which variables closely resemble each other. If the coefficient value is higher than a threshold value, we can remove one of the variables from the dataset. It ranges between -1 and 1, where a value closer to 1 shows that the variables are highly positively correlated and a value closer to -1 shows that they are negatively correlated
  • Variance threshold:
    It removes all features whose variance is lower than a given threshold

Set of features -> selecting best feature -> learning algorithm -> performance
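
A minimal sketch of the filter approach with scikit-learn, assuming a feature matrix X and class labels y; the iris dataset, the 0.2 variance threshold, and k=2 are illustrative choices:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif

    X, y = load_iris(return_X_y=True)

    # Variance threshold: drop features whose variance is below the threshold
    vt = VarianceThreshold(threshold=0.2)
    X_vt = vt.fit_transform(X)

    # Mutual information: rank features by MI with the class and keep the top k
    mi = SelectKBest(score_func=mutual_info_classif, k=2)
    X_mi = mi.fit_transform(X, y)

    print(X.shape, X_vt.shape, X_mi.shape)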

10
Q

How is feature selection done through wrapper methods?

A
  • The selection of features is done by considering it as a search problem, in which different combinations are made, evaluated, and compared with other combinations.

1- Split data into subsets and train a model
2- Based on the output of the model, we add or subtract features and train the model again
3- It evaluates the accuracy of all the possible combinations of features

11
Q

What are some common techniques for feature selection through wrapper methods?

A

1- Forward selection:
Start with an empty set of attributes S. At each step, add the one attribute that decreases the validation error the most, then stop the selection when the validation error becomes stable or shows no significant improvement

2- Backward elimination:
Start with the set of all attributes, then repeatedly drop the feature with the smallest impact on the error

set of features -> (generate subset -> algorithm) -> performance
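
A minimal sketch of forward selection and backward elimination using scikit-learn's SequentialFeatureSelector; the iris dataset, the KNN estimator, and n_features_to_select=2 are illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    knn = KNeighborsClassifier(n_neighbors=3)

    # Forward selection: start with no features, greedily add the one that improves CV accuracy most
    forward = SequentialFeatureSelector(knn, n_features_to_select=2, direction="forward", cv=5)
    forward.fit(X, y)

    # Backward elimination: start with all features, greedily drop the least useful one
    backward = SequentialFeatureSelector(knn, n_features_to_select=2, direction="backward", cv=5)
    backward.fit(X, y)

    print(forward.get_support(), backward.get_support())   # boolean masks of the selected features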

12
Q

Filtering vs wrapper methods

A

Check slides. In short: filter methods rank features using statistical measures independently of the learning algorithm, so they are fast and scalable but ignore feature interactions; wrapper methods evaluate subsets of features by repeatedly training the learning algorithm, so they usually find better subsets but are much more computationally expensive

13
Q

Explain feature extraction

A
  • It transforms the space containing too many dimensions into a space with fewer dimensions
  • It aims to reduce the number of features in a dataset by creating new features from existing ones
  • The primary goal is to compress the data with the goal of maintaining most of the relevant information
14
Q

What are feature extraction techniques?

A
  • Principal components analysis (PCA):
    Seeks a projection that preserves as much information in the data as possible
  • Linear discriminant analysis (LDA):
    Seeks a projection that best discriminates the data
15
Q

Explain the PCA technique

A
  • It is an unsupervised linear dimensionality reduction method that increases interpretability and minimizes information loss
  • PCA assumes linear relationships between variables
  • It is a statistical process that converts the observations of the correlated features into a set of linearly uncorrelated features with the help of orthogonal transformation
  • These new transformed features are called the principal components which capture the maximum variance in the data
  • They are straight lines that capture most of the variance of the data, and each has a direction and a magnitude
  • These components are linear combinations of the original features and provide a new coordinate system for the data
16
Q

What are the mathematical steps for the PCA algorithm

A

1- Standardize the data:
PCA requires standardized data, so the first step is to standardize the data to ensure that all variables have a mean of 0 and a standard deviation of 1

2- Calculate the covariance matrix:
The next step is to calculate the covariance matrix of the standardized data. This matrix shows how each variable is related to every other variable

3- Calculate the eigenvectors and eigenvalues:
The eigenvectors of the covariance matrix give the directions of the principal components, and the corresponding eigenvalues give the amount of variance captured along each direction

4- Choose the principal components:
Ordering the eigenvectors by their eigenvalues in descending order allows us to find the principal components in order of significance

5- Create the new feature vector:
The final step is to transform the original data into the lower-dimensional space defined by the selected principal components
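
A numerical sketch of these five steps in NumPy; the random data and k=2 are illustrative, and in practice a library implementation such as sklearn.decomposition.PCA would normally be used:

    import numpy as np

    def pca(X, k):
        # 1- Standardize: zero mean and unit standard deviation per variable
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)
        # 2- Covariance matrix of the standardized data
        cov = np.cov(Xs, rowvar=False)
        # 3- Eigenvectors and eigenvalues of the covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)
        # 4- Order components by eigenvalue (descending) and keep the top k
        order = np.argsort(eigvals)[::-1]
        components = eigvecs[:, order[:k]]
        # 5- Project the data onto the selected principal components
        return Xs @ components

    X = np.random.rand(100, 5)
    print(pca(X, k=2).shape)   # (100, 2)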

17
Q

What are some important properties of PCA

A
  • PCA assumes that the relationships between variables are linear
  • PCA assumes that principal components with larger variances are more important and should be retained
  • PCA assumes that the principal components are orthogonal to each other
  • PCA works best when the data is approximately normally distributed
  • The number of principal components is always less than or equal to the number of attributes
  • The importance of the PCs decreases as their index increases
  • In general, the first components explain the largest variance of the data
  • PCA can handle missing data by using techniques such as mean imputation
18
Q

What are the advantages of PCA

A
  • Dimensionality reduction:
    By determining the most crucial features or components, PCA reduces the dimensionality of the data, which is one of its primary benefits. This can be helpful when the initial data contains a lot of variables and is therefore challenging to visualize or analyze
  • Feature extraction:
    PCA can also be used to derive new features or elements from the original data that might be more insightful or understandable than the original features. This is particularly helpful when the initial features are correlated or noisy
  • Data visualization:
    By projecting the data onto the first few principal components, PCA can be used to visualize high-dimensional data in two or three dimensions. This can aid in locating data patterns or clusters that may not have been visible in the initial high-dimensional space
  • Noise reduction:
    By locating the underlying signal or pattern in the data, PCA can also be used to lessen the impacts of noise or measurement errors in the data
19
Q

What are the limitations of PCA

A
  • Interpretability:
    The principal components may lack interpretability as they are linear combinations of the original features
  • Scale dependence:
    PCA is sensitive to the scaling of the features, so features should be standardized before applying PCA
  • Linear assumption:
    PCA is a linear technique and may not capture nonlinear relationships in the data
  • Outlier sensitivity:
    PCA is sensitive to outliers in the data, which can distort the principal components
  • Computing complexity:
    For big datasets, it may be costly to compute the eigenvectors and eigenvalues of the covariance matrix
20
Q

How we should choose K in PCA?

A
  • K is typically chosen based on how much information (variance) we want to preserve in the data:
  • It is usually chosen to preserve 90% of the information in the data
  • If K = D we preserve 100% of the information in the data

Use cross validation to determine the number of PCs that maximizes the model’s performance on unseen data
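
A small sketch of picking K from the cumulative explained variance ratio with scikit-learn; the digits dataset and the 90% target are illustrative:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_digits(return_X_y=True)
    X = StandardScaler().fit_transform(X)

    pca = PCA().fit(X)                                   # keep all components for now
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    k = int(np.argmax(cumulative >= 0.90)) + 1           # smallest K that preserves 90% of the variance
    print(k, cumulative[k - 1])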

21
Q

Explain feature extraction using LDA

A
  • It is a supervised linear dimensionality reduction technique that aims to find a new set of variables that maximizes the separation between classes while minimizing the variation within each class
  • The resulting components are ranked by their discriminative power and can be used to visualize and interpret the data, as well as for classification or regression tasks
  • LDA assumes that the input data follows a Gaussian distribution, therefore applying LDA to non-Gaussian data can lead to poor classification results
  • LDA assumes that the classes are linearly separable in the lower-dimensional space
  • LDA seeks to find directions along which the classes are best separated
  • It takes into consideration the scatter within-classes and between classes
22
Q

How does LDA work?

A

1- Computing the within-class and between-class scatter matrices

2- Computing the eigenvectors and their corresponding eigenvalues for the scatter matrices

3- Sorting the eigenvalues and selecting the top k

4- Creating a new matrix that will contain the eigenvectors mapped to the k eigenvalues

5- The data is then projected onto the eigenvectors with the largest eigenvalues, which represent the most discriminative directions
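
A minimal sketch using scikit-learn's LinearDiscriminantAnalysis, which performs these steps internally; the iris dataset and n_components=2 are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)

    # LDA is supervised: it uses the labels y to find the most discriminative directions
    lda = LinearDiscriminantAnalysis(n_components=2)     # at most n_classes - 1 = 2 components
    X_lda = lda.fit_transform(X, y)
    print(X_lda.shape)   # (150, 2)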

23
Q

How to evaluate the performance of dimensionality reduction techniques?

A
  • Explained variance ratio (PCA):
    It refers to the amount of variance in the original data that is captured or explained by each principal component
  • Classification accuracy (LDA):
    Train a classifier on the lower-dimensional data and measure the classification accuracy on a test set
  • Visualization (PCA/LDA):
    Visualize the lower-dimensional data and assess if the classes are well-separated or if the structure of the data is preserved
  • Cross validation (PCA/LDA):
    Use cross validation to estimate the generalization performance of the dimensionality reduction technique on unseen data
24
Q

Explain the difference between linear discriminant analysis and PCA

A
  • PCA ignores class labels and focuses on finding the principal components that maximize the variance in the data. Thus it is an unsupervised algorithm
  • LDA is a supervised algorithm that intends to find the linear discriminants that represent the axes which maximize the separation between different classes
  • LDA is typically chosen over PCA when the goal is classification or when the class structure in the data is known and important
  • PCA and LDA can be used together in a pipeline, where PCA is applied first to reduce the dimensionality of the data, followed by LDA for class separation (see the sketch after this list)
  • PCA can have at most n_features components while LDA can have at most n_classes -1 components
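
A sketch of such a PCA-then-LDA pipeline in scikit-learn; the digits dataset, the choice of 30 PCA components, and the logistic regression classifier are illustrative assumptions:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_digits(return_X_y=True)

    # PCA first reduces the 64 pixel features, then LDA projects to at most n_classes - 1 = 9 dimensions
    pipe = make_pipeline(StandardScaler(),
                         PCA(n_components=30),
                         LinearDiscriminantAnalysis(),
                         LogisticRegression(max_iter=1000))
    print(cross_val_score(pipe, X, y, cv=5).mean())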
25
Q

When to choose LDA over PCA?

A
  • Supervised learning:
    Maximize the separation between classes for better classification or visualization
  • Class separation:
    Finding a lower-dimensional representation that best separates classes rather than simply capturing the maximum variance in PCA
  • Interpretability:
    LDA components may be more interpretable than PCA components since they are directly related to class separation
26
Q

Can LDA handle nonlinear relationships between features?

A

Not directly, but nonlinear relationships can be handled through extensions of LDA such as Kernel LDA and quadratic discriminant analysis

27
Q

What is unsupervised learning?

A
  • An unsupervised model uncovers interesting structure in the data.
  • It can identify clusters of related datapoints without relying on pre-existing labels or target variables
28
Q

What is clustering?

A
  • It is the classification of objects into different groups, or more precisely, the partitioning of a dataset into subsets, so that the data in each subset share some common traits, often according to some defined distance measure
  • The information clustering uses is the similarity between examples
  • The data within the same cluster are very similar, while the data in distinct clusters are different
29
Q

What are the reasons for data clustering?

A
  • Discover the nature and structure of the data
  • Data classification
  • Data coding and compression
  • Cluster data whose characteristics change over time
30
Q

What are the properties of a good cluster?

A
  • High within cluster similarity and low inter-cluster similarity
31
Q

What is a distance measure and what are the types of distance measures?

A

A distance measure determines how the similarity of two elements is calculated and it will influence the shape of the clusters

  • Euclidean distance
  • Manhattan distance
32
Q

What are the types of clustering?

A
  • Flat or partitional clustering:
    Partitions are independent of each other such as K-means clustering
  • Hierarchical clustering:
    Partitions can be visualized using a tree structure (dendrogram), such as agglomerative clustering and divisive clustering
33
Q

What is K-means clustering?

A
  • Unsupervised learning algorithm that groups unlabeled datasets into k different clusters
  • The k means clustering algorithm mainly performs two tasks:
    1- Determines the best value for K center points
    2- Assigns each data point to its closest k-center based on the distance measure
34
Q

How does k-means clustering work?

A

1- Choose number of clusters k
2- Initialize centers: select k points at random to be our initial clusters
3- Measure the distance between each data point and each cluster center
4- Assign each data point to the cluster whose center is nearest to it
5- Re-compute the center of each newly formed cluster by taking the mean of all points in the cluster
6- Keep on repeating until there is no change in clusters or reaching maximum iterations
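
A minimal NumPy sketch of these steps; the random data and k=3 are illustrative, and in practice sklearn.cluster.KMeans would normally be used:

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]       # 2- pick k random points as centers
        for _ in range(max_iters):
            # 3/4- assign every point to its nearest center (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 5- recompute each center as the mean of the points assigned to it
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):                # 6- stop when the clusters no longer change
                break
            centers = new_centers
        return labels, centers

    X = np.random.rand(200, 2)
    labels, centers = kmeans(X, k=3)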

35
Q

How to choose K in k-means?

A
  • Using the elbow method
  • Using the silhouette method
36
Q

What are the steps for the elbow method?

A

1- Run the clustering algorithm for different values of k
2- For each k, calculate the within-cluster sum of squares (WCSS)
3- Plot the curve of WCSS against the number of clusters
4- Identify the elbow in the plot: the point where the curve bends and further increases in k yield only small decreases in WCSS

(WCSS is the sum, over all clusters, of the squared distances between each point and its cluster center; see slides for the formula)
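
A small sketch of the elbow method with scikit-learn, where the fitted model's inertia_ attribute is the WCSS; the synthetic blobs data and the range of k values are illustrative:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    ks = range(1, 11)
    wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(ks, wcss, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("WCSS")
    plt.show()   # look for the elbow, where the curve starts to flatten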

37
Q

What are the steps for the silhouette method?

A

The silhouette coefficient is used to determine the quality of the clusters by checking how similar a data point is to its own cluster compared to other clusters

  • The silhouette coefficient ranges between -1 and 1: a value near 1 indicates the point is in the correct cluster, a value near 0 indicates it lies close to the boundary between two clusters, and a value near -1 means it is in the wrong cluster

Steps:
1- Compute k-means clustering for a range of values of k
2- For each value of k, find the average silhouette score of the data points
3- Plot the collection of average silhouette scores for each value of k
4- Select the number of clusters for which the silhouette score is maximum
(for a single point, s = (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to points in the nearest other cluster; see slides)
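
A small sketch of the silhouette method with scikit-learn; the synthetic blobs data and the range of k values are illustrative:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    scores = {}
    for k in range(2, 11):                                    # the silhouette needs at least 2 clusters
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)               # average silhouette over all points

    best_k = max(scores, key=scores.get)                      # k with the maximum average silhouette score
    print(best_k, scores[best_k])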

38
Q

What are the advantages and disadvantages of k-means clustering?

A

Advantages:
- Easy to understand and implement
- Computationally efficient

Disadvantages:
- Difficult to guess the correct K
- Sensitive to the initial selection of centroids
- Sensitive to outliers
- K-means becomes ineffective as the number of dimensions increases

39
Q

What is hierarchical clustering?

A

It is another type of unsupervised machine learning technique that does not require a pre-specified choice of K and groups objects into a tree of clusters

It can be bottom-up (agglomerative): start with each item in its own cluster and repeatedly merge the best (closest) pair into a new cluster

It can be top-down (divisive): start with all data in a single cluster and consider every possible way to divide the cluster into two

40
Q

What are the methods to measure the distance between clusters (linkage methods) ?

A
  • Single linkage:
    Uses the minimum distance between points in the two clusters when deciding which clusters to merge
  • Complete linkage:
    Uses the maximum distance between points in the two clusters when deciding which clusters to merge
  • Average linkage:
    Uses the average distance between points in the two clusters when deciding which clusters to merge
41
Q

What are the steps for Single linkage?

A

1- Compute the distance matrix between the points
2- Join the closest points to each other, forming a single group whose height is the smallest distance between the two points
3- Keep on repeating
4- Draw the dendrogram
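
A minimal sketch of single-linkage clustering and its dendrogram using SciPy; the synthetic blobs data is illustrative, and changing method to "complete" or "average" gives the other linkage methods:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=20, centers=3, random_state=0)

    # single linkage: merge the two groups with the smallest distance between any pair of their points
    Z = linkage(X, method="single", metric="euclidean")
    dendrogram(Z)
    plt.ylabel("merge height (distance)")
    plt.show()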

42
Q

How to determine the number of clusters from a dendogram?

A

Find the largest vertical gap between merge heights in the dendrogram, cut the tree there with a horizontal line, and count how many vertical lines it crosses; that is the number of clusters

43
Q

What are the steps for complete linkage?

A

1- Compute the distance matrix of the points
2- Combine the 2 points that have the smallest distance
3- The new distance between a point and a merged group corresponds to the maximum distance to the group's members, not the shortest distance
4- Repeat
5- Draw the dendrogram

44
Q

What is a potential use of a dendrogram?

A

Detecting outliers

45
Q

What are the key components of ML experiments?

A
  • Domain understanding:
    Clearly defining the problem statement and objectives
  • Data collection
  • Data preprocessing/Features engineering:
    Includes cleaning, preparing the data for analysis and more
  • Model selection and training:
    Choosing appropriate models, tuning hyperparameters, and training techniques
  • Evaluation:
    Assessing model performance, interpreting results, and drawing insights
  • Deployment or production step
46
Q

What are the objectives of feature engineering?

A
  • Prepare the appropriate input dataset, compatible with the requirements of machine learning algorithm
  • Improve performance of machine learning models
47
Q

What do better features mean?

A
  • Flexibility
  • Simpler models
  • Better results
48
Q

What are features engineering techniques?

A
  • Imputation: Handling missing data
  • Outlier detection
  • Logarithmic transformation
  • One-hot encoding
  • Scaling
  • Features extraction
49
Q

What is bias?

A

It is the difference between the predicted and real values of the training data. It is measured by evaluating the performance of a model on the training dataset. Models with high bias lead to high error on the training data. High bias means underfitting the data

50
Q

What are some ways to reduce bias?

A
  • Incorporate additional features from data to improve the model’s accuracy
  • Increasing the number of training iterations to allow the model to learn more complex data
51
Q

What is variance?

A
  • It is the amount that the estimate of the target function will change if different training data was used
52
Q

What is a common approach to measure variance? What do low and high variance mean?

A
  • To measure variance, we perform cross validation experiments and look at how the model performs on different random splits of the training data (a sketch follows this list)
  • Low variance indicates a limited change in the target function in response to changes in the training data
  • High variance means a significant difference. It indicates overfitting of the data
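
A small sketch of this idea, assuming a decision tree classifier on the iris dataset (both illustrative choices); the spread of cross-validation scores across random splits gives a rough indication of variance:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # An unconstrained (deep) tree tends to have higher variance than a shallow one
    for depth in (2, None):
        scores = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0), X, y, cv=10)
        print(depth, scores.mean(), scores.std())   # the std across folds hints at the model's variance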
53
Q

How to reduce variance?

A
  • Reducing the number of features in the model
  • Replacing the current model with a simpler one
  • Increasing the training data diversity to balance out the complexity of the model and the data structure
  • Performing hyperparameter tuning to avoid overfitting
54
Q

What are some bias-variance scenarios?

A
  • Low bias, low variance (ideal):
    Ideal but not often the case in practice, so there are terms such as reasonable bias and reasonable variance
  • Low bias, high variance (overfitting):
    When a model has too many parameters and fits too closely to the training data
  • High bias, low variance (underfitting):
    When the model doesn’t learn well from the training data or has too few parameters
  • High bias, high variance (inaccurate predictions):
    The predictions are inconsistent and inaccurate on average
55
Q

What factors should affect our decision when training a model?

A
  • Accuracy
  • Training time and space complexity
  • Testing time and space complexity
  • Interpretability
56
Q

When to use linear regression?

A
  • Linear relationship
  • Simplicity
  • Large sample size
  • Baseline modeling
57
Q

What are the limitations of classification by least squares?

A
  • It is not robust to outliers
  • It works well only when the classes are separable
58
Q

Which problems does KNN solve?

A

Classification and Regression

59
Q

What is normalization?

A

Bringing all variables down to the same scale: we subtract the mean and divide by the standard deviation of each variable

60
Q

What is min-max scaling?

A

Another method to bring all variables down to the same scale; it transforms a feature such that all of its values fall in a range between 0 and 1
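
A minimal sketch of both scalings with scikit-learn; the tiny data matrix is illustrative:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

    # Normalization (standardization): subtract the mean, divide by the standard deviation of each variable
    X_std = StandardScaler().fit_transform(X)

    # Min-max scaling: map every value of each variable into the range [0, 1]
    X_mm = MinMaxScaler().fit_transform(X)

    print(X_std)
    print(X_mm)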

61
Q

What are the limitations of hard margins?

A
  • If the data is not linearly separable, we will not be able to find a hard margin
  • If the data has outliers this will affect the margin and make it more difficult to find a hard margin between classes
62
Q

What are slack variables in SVM?

A

They are introduced to handle misclassified points or points that fall within the margin

63
Q

What are kernel tricks?

A

They are functions that apply some complex mathematical operations on lower dimensional data points and convert them into higher dimensional space. This avoids the need to compute the transformed feature vectors which can be computationally expensive

64
Q

What are the hyperparameters that are typically tuned in SVM?

A
  • C:
    Influences the shape of the decision boundary and degree to which misclassifications are allowed in the determination of the soft margin hyperplane
  • Kernel functions:
    Specified when training an SVM. If the decision boundary is linear, the linear kernel is the best and most efficient choice; nonlinear kernels such as RBF require the gamma parameter to be provided and tuned
  • Gamma (RBF):
    The gamma parameter determines how much each training example influences the shape of the decision boundary. It decides how much curvature we want in the decision boundary
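
A small sketch of tuning these hyperparameters with a cross-validated grid search in scikit-learn; the iris dataset and the value grids are illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    param_grid = {
        "C": [0.1, 1, 10, 100],
        "kernel": ["linear", "rbf"],
        "gamma": ["scale", 0.01, 0.1, 1],   # only used by the RBF kernel
    }
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)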
65
Q

What is a decision tree and explain its structure

A
  • It is a simple type of classification algorithm for supervised learning in the form of a tree structure
  • Root node:
    Where the splitting starts
  • Decision/Child node:
    Where decisions or rules are made
  • Leaf/Terminal:
    It is associated with a value of the class. It cannot be split further
66
Q

What is entropy?

A
  • A formula to calculate the uncertainty (impurity) of a node: Entropy = -sum_i p_i log2(p_i), where p_i is the fraction of examples of class i at the node
  • A homogeneous sample has an entropy of 0
  • An equally divided (two-class) sample has an entropy of 1
  • Entropy is maximum if we have no knowledge of the system
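
A minimal sketch of computing the entropy of a node from its class labels; the example labels are illustrative:

    import numpy as np

    def entropy(labels):
        # H = -sum_i p_i * log2(p_i), where p_i is the fraction of class i at the node
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p)) if p.size > 1 else 0.0   # a pure node has entropy 0

    print(entropy(["yes"] * 10))              # homogeneous node -> 0.0
    print(entropy(["yes"] * 5 + ["no"] * 5))  # equally divided node -> 1.0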
67
Q

What are the drawbacks of ID3?

A
  • Data may be overfitted
  • Only one attribute at a time is tested for making decision
  • Does not handle numeric attributes and missing values
68
Q

What is a classification and regression tree?

A
  • It is used for generating classification and regression trees
  • It uses the Gini index as a metric/cost function to evaluate splits on features in classification trees
  • It uses least squares as a metric to select features in regression trees
69
Q

How does C4.5 handle missing attributes?

A
  • Discarding the example
  • Treating a missing value as a separate category
  • Imputing the missing value with the most common value for that attribute
70
Q

How to avoid overfitting a decision tree?

A

1- Pre-pruning:
Stop growing the decision tree when the gain in terms of precision is negligible

2- Post-pruning:
Build a complete tree and simplify it through pruning operations

71
Q

How does pre-pruning work?

A
  • Stops the algorithm before it becomes a full grown tree
  • Stops if all instances belong to the same class
  • Stops if all attribute values are the same
72
Q

How does post pruning work?

A
  • Split dataset into 2/3 training, 1/3 testing
  • Grow decision tree to its entirety
  • Trim nodes in a bottom-up fashion
  • If the error rate improves, replace the sub-tree with a leaf node
73
Q

What is a random forest?

A
  • Uses many decision trees trained on slightly different subsets of data
  • Each subset is a bootstrap sample
  • Models are trained individually, yielding different results, to build multiple decision trees
  • Each base classifier classifies a new vector of features from the original data
  • In the last step, the results are combined and the generated output is based on majority voting. This is known as Bagging and is done by using an Ensemble Classifier
74
Q

What are advantages and disadvantages of random forest?

A

Advantages:
- Handles categorical and real data
- Naturally multi class
- Highly effective classification
- Robust to outliers
- Better generalization

Disadvantages:
- Longer training time