Project Flashcards

1
Q

USP of our paper

A

o Identification of the vivax parasite: little research so far, yet P. vivax is the most geographically widespread malaria parasite and, unlike P. falciparum, has no vaccine on the market
o Zhao et al. (2020) tested a falciparum-trained model on vivax identification -> overfitting
o Binary classification task with this dataset (others only multi-class) -> because information about whether malaria is present at all is the most important and must be highly accurate (the stage can be identified in a next step) -> results show that our CNN1 model matched or outperformed multi-label classification on the same BBBC dataset with 98.3% (Li et al. 98.3% & Meng et al. 94.17%)

2
Q

Importance

A

rapid diagnosis & treatment are the best method to prevent severe outcomes & deaths
o No vaccine
o PCR is costly & requires equipment & trained health personnel
o Lack of health experts in low-resource countries for manual microscopy identification
-> ML = cost-efficient, less expert knowledge necessary (the machine detects) & less equipment (no PCR, just a microscope & the ML tool)

3
Q

Business Case

A

o Our business case focuses on leveraging machine learning to improve malaria detection, specifically targeting the P. vivax strain. With over 250 million cases and 600,000 deaths reported annually, malaria remains a significant global health challenge affecting 84 countries. By addressing the neglected P. vivax strain and utilizing advanced machine learning techniques, we aim to provide accurate and cost-effective malaria detection solutions
o Target businesses: clinics (in low-resource countries), pharmaceutical companies, public healthcare agencies or NGOs, researchers
o Advantages: cost- & expert-efficient, enhanced diagnostic accuracy & research advancement
o Market: 84 countries affected by malaria (developing countries)

4
Q

Why SVM?

A

o used in the literature for similar tasks, so results can be compared with other work
o classical machine learning comparison to deep learning
o ability to handle high-dimensional data using the kernel trick
(= random forest, <-> KNN, Naïve Bayes, linear models (logistic & linear regression))
o robust against overfitting (by tuning C -> regularization parameter deciding the size of the hyperplane margin; if high -> low training error but lower generalization)
o effective for binary classification (separation by a hyperplane)
o memory efficient (images with many pixels): only the support vectors defining the decision boundary have to be stored (<-> KNN)

5
Q

Why CNN? Why did you use a CNN and not a fully connected network?

A

o A fully connected network needs the image flattened to a 1D vector, so any spatial information is lost (position and layout of elements in two-dimensional space); we want to identify the infected cell part regardless of where exactly it is located
o Solution: convolutional layers -> only partially connected to the previous layer -> reduced number of connections needed
o Ability to learn & identify intricate patterns & structures
o Learns hierarchical representations
o Widely used for image classification
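A minimal sketch (not the project's code) of why convolution saves parameters, assuming the 128x128x3 input size mentioned in later cards; the layer sizes are illustrative:

import tensorflow as tf

# Fully connected: flatten 128x128x3 = 49,152 inputs, then one Dense layer with 512 units
fc = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 128, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),   # (49152 + 1) * 512 ≈ 25 million weights
])

# Convolutional: a single Conv2D layer with 32 filters of size 3x3
conv = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),  # only (3*3*3 + 1) * 32 = 896 weights
])

fc.summary()    # huge parameter count, spatial layout thrown away by Flatten
conv.summary()  # tiny parameter count, spatial layout preserved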

6
Q

Why different CNNs?

A

o One built from scratch
o One adapted from a similar classification task
o Comparing whether a CNN from a similar task is also good for another purpose (ours: vivax parasite, this dataset) -> is there a need for other, specifically trained models?

7
Q

Why scale the image pixels between 0 & 1 (normalization)?

A

o Equal importance of features: prevents features with large numeric ranges from dominating the learning process -> otherwise the model is biased towards features with larger magnitude
o Enhances convergence speed (convergence = reaching a stable & optimal state)
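Minimal sketch of the normalization step, assuming images are loaded as 8-bit arrays; the array here is a stand-in:

import numpy as np

image = np.random.randint(0, 256, size=(128, 128, 3), dtype=np.uint8)  # stand-in image
image_scaled = image.astype(np.float32) / 255.0                        # pixel values now in [0, 1]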

8
Q

How did you handle unbalanced dataset?

A

o Undersampling: randomly choosing 5,000 uninfected images
o Oversampling: ADASYN
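A hedged sketch of both balancing strategies; X and y are toy stand-ins for flattened cell-image features and labels, not the project's data:

import numpy as np
from imblearn.over_sampling import ADASYN

rng = np.random.default_rng(42)
# toy data standing in for flattened cell-image features (0 = uninfected, 1 = infected)
X = rng.normal(size=(6000, 100))
y = np.array([0] * 5500 + [1] * 500)

# Undersampling: randomly keep only 5,000 of the uninfected samples
uninfected_idx = np.where(y == 0)[0]
infected_idx = np.where(y == 1)[0]
keep = rng.choice(uninfected_idx, size=5000, replace=False)
idx = np.concatenate([keep, infected_idx])
X_under, y_under = X[idx], y[idx]

# Oversampling: ADASYN synthesizes new minority-class samples near existing ones
X_over, y_over = ADASYN(random_state=42).fit_resample(X, y)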

9
Q

What is Data Augmentation?

A

o Increasing the amount of training data by creating new samples through transformations of existing data
o We used geometric transformations: rotation & flipping (horizontal & vertical)
o Not scaling or cropping: would lose important details of the cell images -> boundary boxes lie close to each other
o Used: ImageDataGenerator from TensorFlow
o Only for the CNNs (the SVM serves as baseline: beneficial to keep that model as simple as possible to have a clear foundation for comparison with the more advanced approaches) -> could have tried data augmentation there as well
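A sketch of such an augmentation setup with TensorFlow's ImageDataGenerator; the concrete parameter values are illustrative, not necessarily the ones used in the project:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # normalization to [0, 1]
    rotation_range=20,        # random rotations (degrees)
    horizontal_flip=True,     # random horizontal flips
    vertical_flip=True,       # random vertical flips
    # no zoom/cropping: boundary boxes sit close together, details must be kept
)
# typical usage: feed augmented batches to the CNN during training, e.g.
# model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=...)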

10
Q

Why Data Augmentation?

A

o introduces more variation and diversity into the training data -> model generalizes better
o mitigates overfitting: larger & more diverse dataset -> the model learns the characteristics instead of memorizing training samples -> better generalization

11
Q

Why not use Data Augmentation for SVM?

A

o the ImageDataGenerator in TensorFlow cannot easily be integrated into the SVM pipeline
o baseline model -> should be trained on the "true" images

12
Q

What optimizer?

A

Adam: Adaptive Moment Estimation
- Extension of gradient descent
- Combines two algorithms: computes a first-order moment (moving average of the gradients) and a second-order moment (moving average of the squared gradients) of the loss function
-> adapts the learning rate for each parameter based on its historical gradients & momentum
- GD with momentum = takes the exponentially weighted average of the gradients into account -> averaging = faster convergence
- RMSProp = takes the exponential moving average of the squared gradients into account
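A minimal numpy sketch of one Adam step, showing the two moment estimates and bias correction (standard textbook defaults, not project-specific settings):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return theta, m, v

theta = np.zeros(3)
m = v = np.zeros(3)
for t in range(1, 4):                            # a few dummy steps with a fixed toy gradient
    grad = np.array([0.1, -0.2, 0.05])
    theta, m, v = adam_step(theta, grad, m, v, t)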

13
Q

ADAM - elements

A
  • Adaptive learning rate (dynamically adapted for each parameter in the network -> smooth & fast convergence)
  • Momentum (helps accelerate convergence by adding a fraction of the previous gradient's direction to the current gradient update -> helps overcome local minima)
  • Bias correction (addresses the fact that the moment estimates are initialized at zero and are therefore biased towards zero during the first training steps)
14
Q

Why did you use Adam?

A

o Used in a similar study
o Pros: efficiency & robustness, faster convergence, works well on noisy & sparse data
o Difference to plain stochastic GD:
 - Learning rate is adaptive (GD maintains a single learning rate during training)

15
Q

What is momentum?

A

o technique where a term is added to the parameter updates that accounts for the previous direction of movement
o e.g. with a momentum value of 0.9: 90% of the previous direction is retained, and only 10% is influenced by the current gradient
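Tiny numpy sketch of that 90/10 intuition (values are illustrative):

import numpy as np

beta, lr = 0.9, 0.01
velocity = np.zeros(3)
theta = np.ones(3)
grad = np.array([0.5, -0.3, 0.1])                # stand-in gradient

velocity = beta * velocity + (1 - beta) * grad   # exponentially weighted average of gradients
theta = theta - lr * velocity                    # update follows the smoothed direction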

16
Q

What are different ways to choose the learning rate in gradient descent?

A

o Fixed learning rate: a constant learning rate is set before training
o Learning rate schedules: adjust the learning rate over time
 - Step decay: the learning rate is reduced by a certain factor after a fixed number of epochs or iterations
 - Exponential decay: the learning rate decreases exponentially over time
 - Performance-based schedules: the learning rate is adjusted based on the performance on a validation set
o Adaptive learning rates: dynamically adjust the learning rate based on the behavior of the optimization process
 - Adam combines adaptive learning rates with momentum: it adapts the learning rate for each parameter based on both the first-order (gradient) and second-order (squared gradient) moments of the gradients
o Learning rate search: e.g. grid search
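A hedged sketch of two of these schedule types using Keras APIs; the numbers are placeholders, not the project's settings:

import tensorflow as tf

# Exponential decay: learning rate shrinks by a fixed factor every decay_steps
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=1000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

# Performance-based schedule: reduce the learning rate when the validation loss plateaus
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3)
# model.fit(..., callbacks=[reduce_lr])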

17
Q

What does a learning rate of 0.001 mean?

A

o represents the step size at which the learning algorithm updates the model parameters during training = the magnitude of the adjustments made to the model based on the calculated gradients
o it is a hyperparameter
o 0.001 is a relatively low learning rate
o small step sizes do not overshoot the minimum but take a long time to converge

18
Q

What does binary cross entropy loss mean?

A

o Cost function that measures the difference between the predicted probability distribution and the true probability distribution for a binary classification problem
o calculates the average of the logarithmic loss over all instances, where a higher loss is assigned to incorrect predictions and a lower loss to correct predictions
- loss: H_p(q) = -(1/N) * Σ_i [ y_i * log(p(y_i)) + (1 - y_i) * log(1 - p(y_i)) ]
- y_i = 1 or 0, the true outcome
- p(y_i) = predicted probability of that outcome
- log term: if p = 1, -log(p) = 0; the smaller p gets, the larger -log(p) becomes
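Quick numpy check of the formula on made-up predictions:

import numpy as np

y_true = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.3])   # predicted probability of class 1

bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
# last sample: true label 1, predicted 0.3 -> -log(0.3) ≈ 1.2 contributes most to the average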

19
Q

What is overfitting?

A

performance is high on the training data but low on the validation set (low bias, but high variance)
o observing the learning curves reveals overfitting when the training loss consistently decreases, but the validation loss begins to rise or stagnates
o Goal: small gap between training & validation curve -> good generalization

20
Q

What was done to prevent overfitting?

A
  • CNN: dropout layer, early stopping
  • SVM: PCA removes noise in the data and keeps only the most important features; fewer dimensions reduce the risk of overfitting
     Grid search & cross-validation
     Look at the result of each cross-validation fold -> if approximately the same, no overfitting
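A sketch of the two CNN-side measures (dropout layer, early stopping) in Keras; layer sizes and the patience value are placeholders:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                  # randomly drops 50% of units each training step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])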
21
Q

How do you choose the number of folds (cross-validation)?

A

o a higher number of folds -> more accurate estimate of performance, but requires more computational resources and time
o smaller datasets may benefit from a higher number of folds, while larger datasets may be adequately assessed with fewer folds
o we used 3 folds (cv=3) to reduce the CPU time needed (we could have used 5)

22
Q

What is GPU?

A

o GPU stands for Graphics Processing Unit
- a specialized electronic circuit designed to quickly process and render graphics
- widely used for accelerating computations in areas such as ML

23
Q

How did you tune hyperparameters in the CNNs? And which ones?

A

o With babysitting (manual tuning)
o number of hidden layers, units per layer, learning rate, epochs, batch size, early-stopping patience

24
Q

What are epochs?

A

o Number of times the entire training dataset is passed through the NN during training
o Each epoch consists of one forward and one backward pass (update of the weights) for all training examples
o Result: the NN can learn & refine its parameters gradually

25
Q

What is the activation function ReLU and why does it work well for images?

A

o Replaces all negative values (pixels) with 0, keeps all positive values unchanged
o ReLU is easy to compute, computationally efficient, its gradients are cleanly defined and constant (except for the piecewise non-linearity at 0)
o introduces non-linearity into the network, allowing it to learn complex patterns
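One-line numpy version of ReLU:

import numpy as np

def relu(x):
    return np.maximum(0, x)

relu(np.array([-2.0, -0.5, 0.0, 1.5]))   # -> array([0. , 0. , 0. , 1.5])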

26
Q

Why PCA only for SVM

A

o the CNN learns its own feature representations (deep learning) in the convolutional layers
o the SVM (classical machine learning) only learns weights, not features, so PCA is used beforehand to reduce the dimensionality of the input

27
Q

What are kernel functions? (SVM)

A

o used to classify non-linearly separable data
o transform the images into a high-dimensional space and find the optimal decision boundaries in this new high-dimensional space
o functions include: linear, polynomial, radial basis function
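Small numpy sketch of the RBF kernel itself (the gamma value here is illustrative): similarity decays with the squared distance between two feature vectors, scaled by gamma.

import numpy as np

def rbf_kernel(x, y, gamma=0.0001):
    return np.exp(-gamma * np.sum((x - y) ** 2))

rbf_kernel(np.array([1.0, 2.0]), np.array([1.5, 1.0]))   # close points -> value near 1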

28
Q

Which kernel function did we use & why?

A

o Radial basis function (RBF): can be seen as combining polynomial kernels of many different degrees, projecting the non-linearly separable data into a higher-dimensional space
o selected via grid search

29
Q

Why grid search to select hyperparameters

A
  • Random search & trial-and-error: take more time, are unstructured and thus expected to yield worse hyperparameters
  • PCA -> fewer dimensions -> grid search is computationally feasible
30
Q

What are relevant hyperparameters of SVM?

A

o C = controls the balance between maximizing the margin & minimizing the training error
* small -> wide "street" separating the 2 classes, but data points may lie within the margin -> larger training error but better generalization
* large -> narrow "street" -> lower training error but lower generalization
o Gamma = influences how the decision boundary adapts to individual data points = defines how many data points are considered for the hyperplane
* small -> considers many data points
* large -> considers few data points

31
Q

What were your optimal hyperparameters? What did they tell you? (SVM)

A

o C = 100
* relatively high -> narrow street, lower training error, but lower generalization
o gamma = 0.0001
* relatively low -> decision boundary is less adjusted to individual data points and considers a broader range of training instances -> more generalizable
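A hedged sketch of how the grid search over C and gamma could look in scikit-learn; the grid values are illustrative, while cv=3, the PCA features, and the optimum C=100 / gamma=0.0001 follow the earlier cards:

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [1, 10, 100],             # margin width vs. training error trade-off
    "gamma": [1e-4, 1e-3, 1e-2],   # how locally the decision boundary adapts
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
# search.fit(X_train_pca, y_train)   # PCA-reduced features, as described above
# search.best_params_                # the cards report C=100, gamma=0.0001 as the optimum found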

32
Q

Accuracy

A

o Accuracy: measures the overall correctness of the model, the ratio of correctly predicted instances to the total number of instances

33
Q

Precision

A

o Precision: measures the model's ability to predict positive instances correctly -> ratio of true positives to all predicted positives, so it reflects the number of false positives

34
Q

Recall

A

o Recall: ratio of true positives to the sum of true positives and false negatives -> reflects the number of false negatives
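Sketch of computing the three metrics with scikit-learn on dummy labels:

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

accuracy_score(y_true, y_pred)    # 5/6 of all instances classified correctly
precision_score(y_true, y_pred)   # 3/3 predicted positives are true positives
recall_score(y_true, y_pred)      # 3/4 actual positives were found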

35
Q

Which CNN models did you choose? And what are the differences?

A

CNN1: self-developed model <-> CNN2: adapted from similar paper

36
Q

Differences between CNN models

A

o CNN1:
 - padding (because of cropping)
 - more layers (5 conv., 5 max-pooling, 2 dropout, 2 dense)
 - dropout
 - smaller kernel size (important for malaria -> consider local features, don't miss information)
 - max pooling (<-> average) (better because max focuses on the most important / striking features in the neighbourhood)
 - 2 (<-> 3) dense layers
 - dense layers with 512 units (<-> 256): more capacity -> capture more complex patterns & nuances in the input (also due to the larger image size)
 - dense layers have an activation function (introduction of non-linearity)
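Illustrative Keras sketch of an architecture matching this description (5 conv + 5 max-pooling blocks, 3x3 kernels, 'same' padding, 512-unit dense layer, 2 dropouts, sigmoid output); the filter counts and the exact placement of the dropout/dense layers are assumptions, not the project's exact model:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.Input(shape=(128, 128, 3))])
for filters in [32, 32, 64, 64, 128]:          # 5 conv + 5 max-pooling blocks (filter counts assumed)
    model.add(tf.keras.layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
    model.add(tf.keras.layers.MaxPooling2D((2, 2)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dropout(0.5))                      # dropout against overfitting
model.add(tf.keras.layers.Dense(512, activation="relu"))     # dense layer with 512 units
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))    # binary output: infected vs. uninfected

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])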

37
Q

Potential explanations based on differences from the CNN2 paper:

A

o Other parasite
o different images (background etc.)
o no cropping
o smaller images (44x44, we have 128x128 -> we can apply more pooling) -> using their architecture as-is does not work well (the images remain large with only a few pooling layers)

38
Q

Which kernel size should you use?

A

a. In our case: used a small one (3,3) -> because the malaria-infected part can be small
b. Also don't want to risk missing information or a pattern that indicates malaria

39
Q

Why always ReLU?

A

o introduces non-linearity into the network, allowing it to learn complex patterns
o computationally efficient
o avoids vanishing gradients
o images typically consist of pixel intensities ranging from 0 to 255 -> black = 0, no negative values needed

o Non-linearity: for negative inputs (x < 0) the output is 0; this non-linear behaviour introduces a discontinuity in the slope, breaking the linearity of the network
o For positive inputs it is still linear -> can capture both non-linearity & linearity

40
Q

What is padding? Why used?

A

Zero borders added around the image pixels; preserves the information at the border. Our images had boundary boxes close to the edges -> wanted to classify those regions well

41
Q

Why did CNN2 use average and max pooling, and which one is better?

A
  • Max pooling is better for image classification
  • features tend to encode the spatial presence of some pattern or concept over the different tiles of the feature map (hence the term feature map)
  • more informative to look at the maximal presence of different features than at their average presence
  • keeps the full contrast
42
Q

Why not KNN?

A
  • Does not scale well: as the dataset grows, KNN becomes increasingly inefficient, compromising overall model performance (scaling problem)
  • prone to overfitting
  • curse of dimensionality: the volume of the space grows exponentially with the number of dimensions, so many more points are needed to 'fill' a high-dimensional volume
43
Q

Why not Random Forest?

A
  • RF is better suited for multi-class classification
  • does not handle high-dimensional data as well as SVM
  • research showed that RF performs better with 10 to 100 features, SVM with > 100 features (paper comparing which studies used RF or SVM & how they performed)
  • could be a good option as well (but the literature found that SVM outperformed it for malaria detection)
44
Q

Logistic regression

A
  • linear model, which is why it cannot really capture this task
  • risk of overfitting with many independent variables
  • limited in capturing complexity
45
Q

In the CNN, why use both dropout and early stopping?

A

they address different aspects of overfitting
* Early stopping: controls the capacity of the model by stopping training before it starts to overfit
* Dropout: introduces randomness during training, preventing the network from relying too heavily on any specific set of features or neurons