Project Flashcards
USP of our paper
o Identification of the P. vivax parasite: little research so far, although it is the most geographically widespread species & no vaccine exists on the market yet (in contrast to P. falciparum)
o Zhao et al. (2020) tested a falciparum-trained model to identify vivax -> overfitting
o Binary classification task with this dataset (others only multi-class) -> because information about the presence of malaria is most important to obtain with high accuracy (the stage can be identified in a next step) -> results show that the CNN1 model matched or outperformed multi-label classification on the same BBBC dataset with 98.3% (Li et al. 98.3% & Meng et al. 94.17%)
Importance
o Rapid diagnosis & treatment are the best method to prevent severe outcomes & deaths
o No vaccine
o PCR costly & lack of equipment & health personnel
o Lack of health experts in low resource countries for manual microscopy identification
-> ML = cost-efficient, less expert knowledge necessary (the machine detects) & less equipment (no PCR, just a microscope & the ML tool)
Business Case
o Our business case focuses on leveraging machine learning to improve malaria detection, specifically targeting the P. vivax strain. With over 250 million cases and 600,000 deaths reported annually, malaria remains a significant global health challenge affecting 84 countries. By addressing the neglected P. vivax strain and utilizing advanced machine learning techniques, we aim to provide accurate and cost-effective malaria detection solutions
o Target businesses: clinics (in low resource countries), pharmaceutical companies, public healthcare agencies or NGOs, researchers
o Advantages: cost- & expert-efficient, enhanced diagnosis accuracy & research advancement
o Market: 84 countries affected by malaria (developing countries)
Why SVM?
o Used in the literature for similar tasks -> allows comparison with other results
o Classical machine learning comparison to deep learning
o Ability to handle high-dimensional data using kernel trick
(like random forest; in contrast to KNN, Naïve Bayes, linear models (logistic & linear regression))
o Robust against overfitting (by tuning C -> regularization parameter deciding size of hyperplane margin, if high -> low training error but lower generalization)
o Effective for binary classification (due to separation by hyperplane)
o Memory efficient (image with many pixel): Only have to save support vectors for decision boundaries (<-> KNN)
Why CNN? Why did use CNN and not fully connected network?
o A fully connected network needs the image flattened into a 1D vector -> all spatial information is lost (position and layout of elements in two-dimensional space); we want to identify the infected cell part, and it should not matter where exactly it is located
o Solution: convolutional layers -> each unit connects only to a local patch of the previous layer -> reduced number of connections needed
o Ability to learn & identify intricate patterns & structures
o Learn hierarchical representations
o Widely used for image classification
Why different CNNs?
o One built from scratch
o One adapted from similar classification task
o Comparing whether a CNN from a similar classification task is also good for another purpose (ours: vivax parasite, this dataset) -> is there a need for other, specifically trained models?
Why scaling between 0 & 1 of the image pixels (normalizing)?
o Equal importance of features: prevents features with large values from dominating the learning process -> otherwise the model is biased towards features with larger magnitude
o Enhancing convergence speed (convergence = reaching a stable & optimal state)
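A minimal sketch of this scaling step (the array shape and names are assumptions, not the project code):

```python
import numpy as np

# Hypothetical batch: 10 RGB images of 128x128 with uint8 pixels in [0, 255]
images = np.random.randint(0, 256, size=(10, 128, 128, 3), dtype=np.uint8)

# Scale to [0, 1] so no pixel/feature dominates purely because of its magnitude
images_scaled = images.astype("float32") / 255.0
print(images_scaled.min(), images_scaled.max())  # ~0.0 and ~1.0
```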
How did you handle unbalanced dataset?
o Undersampling: randomly choosing 5,000 uninfected images
o Oversampling: ADASYN (Adaptive Synthetic Sampling) -> see sketch below
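A minimal sketch of the two resampling ideas with imbalanced-learn (the toy arrays, class counts, and random seeds are assumptions):

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import ADASYN

# Hypothetical flattened features & binary labels (0 = uninfected, 1 = infected)
X = np.random.rand(6000, 100)
y = np.array([0] * 5500 + [1] * 500)

# Undersampling: randomly keep only 5,000 uninfected samples
rus = RandomUnderSampler(sampling_strategy={0: 5000}, random_state=42)
X_under, y_under = rus.fit_resample(X, y)

# Oversampling: ADASYN synthesizes new minority-class samples
ada = ADASYN(random_state=42)
X_bal, y_bal = ada.fit_resample(X_under, y_under)
print(np.bincount(y_bal))  # roughly balanced classes
```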
What is Data Augmentation?
o Increasing the amount of training data by creating new samples through transformations of existing data
o We used: geometric transformations: rotation & flipping (horizontal & vertical)
o Not: scaling or cropping -> would lose important details of the cell images (bounding boxes are close to each other)
o Used: ImageDataGenerator from TensorFlow
o Only for CNN (SVM serves as baseline: beneficial to keep the model as simple as possible to have a clear foundation for comparison with other more advanced approaches) -> could have tried with data augmentation as well
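A sketch of such an augmentation setup with ImageDataGenerator (the rotation range, batch size, and toy arrays are assumptions):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Geometric transformations only: rotation + horizontal/vertical flips
# (no scaling or cropping, so details near the bounding boxes are kept)
datagen = ImageDataGenerator(
    rescale=1.0 / 255,     # normalization as described above
    rotation_range=20,     # assumed value
    horizontal_flip=True,
    vertical_flip=True,
)

# Hypothetical stand-ins for the cell images & labels
X_train = np.random.randint(0, 256, size=(32, 128, 128, 3), dtype=np.uint8)
y_train = np.random.randint(0, 2, size=(32,))
augmented_batches = datagen.flow(X_train, y_train, batch_size=16)
```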
Why Data Augmentation?
o Introduces more variation and diversity into the training data -> model generalizes better
o Mitigating overfitting: larger & more diverse dataset -> the model learns general characteristics instead of memorizing training samples -> better generalization
Why not use Data Augmentation for SVM?
o The ImageDataGenerator from TensorFlow does not fit easily into the SVM workflow
o Baseline model -> should have “true” images
What optimizer?
Adam: Adaptive Moment estimation
- Extension of gradient descent
- Combines two algorithms: computes a first-order moment (moving average of the gradients) and a second-order moment (moving average of the squared gradients) of the loss function's gradients
-> adapts the learning rate for each parameter based on its historical gradients & momentum
- GD with momentum = taking into consideration the 'exponentially weighted average' of the gradients -> averaging = converges faster
- RMSProp algorithm = taking into consideration the 'exponential moving average' of the squared gradients (average change over time)
ADAM - elements
- Adaptive Learning Rate (dynamically adapts it for each parameter in network -> smooth & fast convergence)
- Momentum (helps accelerate convergence by adding a fraction of the previous gradient’s direction to the current gradient update -> overcome local minima)
- Bias correction (to address the bias of the moment estimates towards zero at the start of training)
Why did we use Adam?
Used in a similar study
Pro: efficiency & robustness, converges faster, works well on noisy & sparse data
o Difference to normal stochastic GD
Learning rate is adaptive (GD: maintains a single learning rate during training)
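A minimal sketch of plugging Adam (learning rate 0.001, its default) into a Keras model for binary classification; the tiny model here is only a placeholder, not the project architecture:

```python
from tensorflow import keras

# Placeholder model: the real CNN is sketched further below
model = keras.Sequential([
    keras.layers.Input(shape=(128, 128, 3)),
    keras.layers.Flatten(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),  # adaptive per-parameter updates
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```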
What is momentum?
o Technique where a term is added to the parameter updates that accounts for the previous direction of movement
o E.g. a momentum value of 0.9: 90% of the previous direction is retained, and only 10% is influenced by the current gradient
What are different ways to choose the learning rate in gradient descent?
o Fixed learning rate: constant learning rate is set before training
o Learning rate schedules: adjust learning rate over time
Step decay: Learning rate is reduced by a certain factor after a fixed number of epochs or iterations
Exponential decay: The learning rate decreases exponentially over time.
Performance-Based Schedules: The learning rate is adjusted based on the performance on a validation set
o Adaptive learning rates: dynamically adjust the learning rate based on the behavior of the optimization process.
!!Adam combines adaptive learning rates with momentum. It adapts the learning rate for each parameter based on both the first-order (gradient) and second-order (gradient squared) moments of the gradients.
o Learning Rate Search: e.g. grid search
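Two hedged Keras examples of these options (the decay factors and patience value are assumptions): an exponential decay schedule and a performance-based schedule.

```python
from tensorflow import keras

# Exponential decay: learning rate shrinks by decay_rate every decay_steps steps
exp_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=1000, decay_rate=0.96
)
optimizer_with_decay = keras.optimizers.Adam(learning_rate=exp_schedule)

# Performance-based schedule: halve the learning rate when validation loss plateaus
reduce_on_plateau = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3
)
# (use one approach or the other, not both at once)
```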
What does a learning rate of 0.001 mean?
o Represents the step size at which a machine learning algorithm updates the model parameters during training = magnitude of the adjustments made to the model based on the calculated gradients
o hyperparameter
o 0.001 = relatively low learning rate
o Small step size: doesn't overshoot the minimum but takes a longer time to converge
What does binary cross entropy loss mean?
o Cost function that measures the difference between the predicted probability distribution and the true probability distribution for a binary classification problem
o calculates the average of the logarithmic loss for each instance, where a higher loss is assigned to incorrect predictions and a lower loss to correct predictions
- Loss: H_p(q) = -1/N * Σ_i [ y_i * log(p(y_i)) + (1 - y_i) * log(1 - p(y_i)) ]
- y_i = true outcome (1 or 0)
- p(y_i) = predicted probability of that outcome
- log: if p = 1, then -log(p) = 0; the smaller p, the larger -log(p) -> higher loss
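A hand-computed example of this loss on a few hypothetical predictions:

```python
import numpy as np

y_true = np.array([1, 0, 1, 0])          # true labels
p_pred = np.array([0.9, 0.1, 0.6, 0.4])  # predicted probability of class 1

# Binary cross-entropy: average logarithmic loss over all instances
bce = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(bce)  # ~0.308: confident correct predictions contribute little loss
```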
What is overfitting?
o Performance high on training data but low on the validation set (low bias, but high variance)
o observing the learning curves reveals a pattern of overfitting when the training loss consistently decreases, but the validation loss begins to rise or remains stagnant.
o Goal: small gap between training & validation curve -> generalization good
What was done to prevent overfitting?
- CNN: Dropout layer, early stopping
- SVM: PCA: removes noise in the data and keeps only the most important features. Fewer dimensions reduce the risk of overfitting.
- Grid search & cross-validation
- Look at the result of each fold of cross-validation -> if approximately the same, no overfitting
How do you choose the number of folds (cross-validation)?
o higher number of folds -> more accurate estimate of performance but require more computational resources and time
o Smaller datasets may benefit from a higher number of folds, while larger datasets may be adequately assessed with fewer folds
o We used 3 folds (cv=3) to reduce the CPU time needed (we could have used 5)
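A small sketch of 3-fold cross-validation on toy data (the dataset and model here are placeholders); comparing the per-fold scores is the overfitting check described above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data as a stand-in for the extracted image features
X, y = make_classification(n_samples=300, n_features=50, random_state=42)

scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=3)  # cv=3 keeps CPU time low
print(scores)         # one accuracy per fold; similar values -> no strong overfitting
print(scores.mean())
```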
What is GPU?
o GPU stands for Graphics Processing Unit
- is a specialized electronic circuit designed to quickly process and render graphics
- widely used for accelerating computations in areas such as ML
How did you tune hyperparameters in the CNNs? And which ones?
o With babysitting (manual trial-and-error tuning)
o number of hidden layers, units per layer, learning rate, epochs, batch size, early stop patience
What are epochs?
o Number of times entire training dataset is passed through NN during training process
o Each epoch consists of one forward & one backward pass (update of weights) for all training examples
o Result: NN can learn & refine parameters gradually
What is the activation function ReLu and why does it work well for images?
o Replaces all negative values (pixels) with 0, keep all positive values the same
o ReLU is easy to compute and computationally efficient; its gradient is cleanly defined and piecewise constant (0 for negative inputs, 1 for positive inputs)
o Introduces non-linearity into the network, allowing it to learn complex patterns (tiny sketch below)
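Tiny illustration of ReLU on a made-up feature map:

```python
import numpy as np

feature_map = np.array([[-2.0, 0.5],
                        [ 3.0, -0.1]])
relu_out = np.maximum(0, feature_map)  # negatives -> 0, positives unchanged
# [[0.  0.5]
#  [3.  0. ]]
```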
Why PCA only for SVM?
The CNN, as a deep learning model, learns its own feature representations (its weights act as feature extractors)
The SVM, as a classical machine learning model, works directly on the input features -> PCA reduces the dimensionality & keeps the most important features beforehand
What are kernel functions? (SVM)
To classify non-linearly separable data
Transform the data into a higher-dimensional space and then find the optimal decision boundary in this new high-dimensional space
Functions include: linear, polynomial, radial basis function
Which kernel function did we use & why?
Radial basis function (RBF): can be viewed as combining polynomial kernels of all degrees -> projects the non-linearly separable data into a (potentially infinite-dimensional) higher-dimensional space
Selected via grid search
Why grid search to select hyperparameters
- Random search & trial-and-error: need more time, are random and not a structured search, and thus expected to provide worse hyperparameters
- PCA -> fewer dimensions -> grid search stays computationally feasible
What are relevant hyperparameters of SVM?
C = controls balance between maximizing margin & minimizing training error
* Small -> wide "street" (margin) separating the 2 classes, but datapoints might fall within the margin -> larger training error but better generalization
* Large -> small "street" -> lower training error but lower generalization
Gamma = influences how the decision boundary adapts to individual datapoints = defines how many datapoints are considered for the hyperplane
* Small -> considers many datapoints
* Large -> considers few datapoints
What were your optimal hyperparameters? What did they tell you? (SVM)
C = 100
* Relatively high -> small street, lower training error, but lower generalization
gamma = 0.0001
* relatively low -> decision boundary less adjusted to individual datapoints, considers a broader range of training instances when determining the decision boundary -> more generalizable (see grid-search sketch below)
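A sketch of how such a PCA + RBF-SVM grid search can be set up in scikit-learn (toy data; the number of PCA components and most grid values are assumptions, with C=100 and gamma=0.0001 included in the grid):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in for the flattened cell-image features
X, y = make_classification(n_samples=500, n_features=200, random_state=42)

pipe = Pipeline([
    ("pca", PCA(n_components=50)),   # fewer dimensions -> grid search stays tractable
    ("svm", SVC(kernel="rbf")),
])
param_grid = {
    "svm__C": [1, 10, 100],                # margin vs. training-error trade-off
    "svm__gamma": [0.0001, 0.001, 0.01],   # how local the decision boundary is
}
search = GridSearchCV(pipe, param_grid, cv=3)  # 3 folds to limit CPU time
search.fit(X, y)
print(search.best_params_)
```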
Accuracy
o Accuracy: measures the overall correctness of the model, ratio of correctly predicted instances to the total number of instances
Precision
o Precision: measures the model's ability to predict positive instances correctly -> ratio of true positives to all predicted positives (TP / (TP + FP)), i.e. reflects how many false positives occur
Recall
o Recall: ratio of true positives to the sum of true positives and false negatives (TP / (TP + FN)) -> reflects how many false negatives occur (see metrics sketch below)
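A quick sketch of the three metrics with scikit-learn on made-up predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = infected, 0 = uninfected (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # correct predictions / all predictions
print(precision_score(y_true, y_pred))  # TP / (TP + FP): penalizes false positives
print(recall_score(y_true, y_pred))     # TP / (TP + FN): penalizes false negatives
```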
Which CNN models did you choose? And what are the differences?
CNN1: self-developed model <-> CNN2: adapted from similar paper
Differences between CNN models
o CNN1:
padding (because of cropping)
more layers (5 conv., 5 maxpooling, 2 dropout, 2 dense)
dropout
smaller kernel size (for malaria important -> consider local features, don’t miss info)
max pooling (<-> average) (better because max focuses on most important / striking features in surrounding)
2 (<-> 3) dense layers
Dense layers with 512 units (<-> 256), more information -> capture more complex patterns & nuances in the input (also due to larger image size)
dense layers have activation function (introduction of non-linearity)
Potential explanations based on differences from the CNN2 paper:
o Other parasite
o different images (background etc.)
o no cropping
o smaller images (44x44 vs. our 128x128) -> we can use more pooling layers -> using their architecture directly does not work well (feature maps stay large with only a few pooling layers); a rough architecture sketch follows below
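A rough Keras sketch of a CNN1-style architecture following the bullets above (5 conv + max-pooling blocks, 3x3 kernels, 'same' padding, dropout, 2 dense layers, 128x128 input); the filter counts and dropout rates are assumptions, not the exact project values:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    # 5 conv + max-pooling blocks with small 3x3 kernels and 'same' padding
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),   # 128 -> 4 after five poolings
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(512, activation="relu"),  # dense layer with 512 units + activation
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # binary output: infected vs. uninfected
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```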
Which kernel size should you use?
a. In our case: used small one (3,3) -> because malaria part can be small
b. Also don't want to risk missing information or patterns that indicate malaria
Why always ReLU?
o introduces non-linearity into the network, allowing it to learn complex patterns
o computationally efficient
o avoiding vanishing gradients
o Images typically consist of pixel intensities ranging from 0 to 255 -> black = 0, no negative values needed
o Non-linearity: for negative inputs (x < 0) -> output 0. This kink at 0 breaks the overall linearity of the network
o For positive inputs: still linear (identity) -> the network can capture both linear & non-linear relationships
What is padding? Why did we use it?
Zero borders added around the image pixels; information sits close to the border (we had bounding boxes near the edges) -> wanted to classify this well
Why did CNN 2 use average and max pooling and which one is better
- Max pooling better for image classification
- features tend to encode the spatial presence of some pattern or concept over the different tiles of the feature map (hence, the term feature map),
- more informative to look at maximal presence of different features than at their average presence
- Max pooling keeps the full contrast (strongest activation) instead of diluting it by averaging
Why not KNN?
- Does not scale well: as the dataset grows, KNN becomes increasingly inefficient, compromising overall model performance (scaling problem)
- prone to overfitting
- Curse of dimensionality: volume of space grows exponentially with dimensions, Need more points to ‘fill’ a high-dimensional volume
Why not Random Forest?
- RF better for multi-class classification
- does not handle high-dimensional data as well as SVM
- research showed that RF performs better with 10 to 100 features, SVM with > 100 features (based on a paper comparing studies that used RF or SVM & how they performed)
- could be a good option as well (but literature said that SVM outperformed for Malaria detection)
Logistic regression
- linear model -> cannot really capture the non-linear structure of the images
- risk of overfitting with many independent variables
- limited in capturing complexity
In the CNN, why use dropout and early stopping?
They address different aspects of overfitting:
* Early stopping: control the capacity of the model by stopping training before it starts overfitting
* Dropout: introduces randomness during training, preventing the network from relying too heavily on any specific set of features or neurons
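A minimal sketch of the early-stopping callback (the patience value is an assumption); dropout layers appear in the CNN1 sketch above:

```python
from tensorflow import keras

# Stop training once validation loss stops improving and keep the best weights
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
# Hypothetical training call (model & data as defined elsewhere):
# model.fit(X_train, y_train, validation_split=0.2, epochs=50, callbacks=[early_stop])
```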