Project Flashcards
USP of our paper
o Identification of the P. vivax parasite: little research, although it is the most geographically widespread species & no vaccine exists on the market yet <-> falciparum
o Zhao et al. (2020) tested a falciparum-trained model to identify vivax -> overfitting
o Binary classification task with this dataset (others only multi-label) -> because information about the existence of malaria is most important and needs high accuracy (the stage can be identified next) -> results show that our CNN1 model matched or outperformed multi-label classification on the same BBBC dataset with 98.3% (Li et al. 98.3% & Meng et al. 94.17%)
Importance
Rapid diagnosis & treatment are the best way to prevent severe outcomes & deaths
o No vaccine
o PCR costly & lack of equipment & health personnel
o Lack of health experts in low resource countries for manual microscopy identification
-> ML = cost-efficient, less expert knowledge necessary (as the machine detects) & less equipment (no PCR, just a microscope & an ML tool)
Business Case
o Our business case focuses on leveraging machine learning to improve malaria detection, specifically targeting the P. vivax strain. With over 250 million cases and 600,000 deaths reported annually, malaria remains a significant global health challenge affecting 84 countries. By addressing the neglected P. vivax strain and utilizing advanced machine learning techniques, we aim to provide accurate and cost-effective malaria detection solutions
o Target businesses: clinics (in low resource countries), pharmaceutical companies, public healthcare agencies or NGOs, researchers
o Advantages: cost- & expert-efficient, enhanced diagnosis accuracy & research advancement
o Market: 84 countries affected by malaria (developing countries)
Why SVM?
o Used in the literature for similar tasks -> allows comparison with other published results
o Provides a classical machine learning comparison to the deep learning models
o Ability to handle high-dimensional data using kernel trick
(= random forest, <-> KNN, Naïve Bayes, linear models (logistic & linear regression))
o Robust against overfitting (by tuning C -> regularization parameter that controls the size of the hyperplane margin; if C is high -> low training error but lower generalization)
o Effective for binary classification (due to separation by hyperplane)
o Memory efficient (images with many pixels): only the support vectors defining the decision boundary need to be stored (<-> KNN, which keeps all training data); see the sketch below
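A minimal sketch of such an SVM baseline with an RBF kernel in scikit-learn; the data below are random placeholders and the C/gamma values are illustrative, not our tuned settings:

    # SVM baseline on flattened cell images (placeholder data, illustrative parameters)
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # X: (n_samples, n_pixels) flattened, normalized images; y: 0 = uninfected, 1 = infected
    X = np.random.rand(200, 64 * 64)
    y = np.random.randint(0, 2, size=200)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # RBF kernel = kernel trick for high-dimensional pixel data;
    # C controls the margin/regularization trade-off (high C -> low training error, risk of overfitting)
    svm = SVC(kernel="rbf", C=1.0, gamma="scale")
    svm.fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, svm.predict(X_test)))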
Why CNN? Why did we use a CNN and not a fully connected network?
o A fully connected network needs the image flattened into a 1D vector -> all spatial information is lost (position & layout of elements in two-dimensional space); we want to identify the infected part of a cell regardless of where exactly it is located
o Solution: convolutional layers -> each unit is only partially connected to the previous layer -> reduced number of connections needed
o Ability to learn & identify intricate patterns & structures
o Learn hierarchical representations
o Widely used for image classification
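A minimal Keras sketch of a small CNN of this kind; layer sizes and the input shape are illustrative assumptions, not the exact architecture of our CNN1 model:

    # Small CNN for binary (infected vs. uninfected) cell classification
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),  # local filters keep spatial structure
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),  # deeper layer learns more abstract patterns
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # single output unit for the binary decision
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])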
Why different CNNs?
o One built from scratch
o One adapted from a similar classification task
o To compare whether a CNN from a similar task also performs well for another purpose (ours: the vivax parasite, this dataset) -> is there a need for other, specifically trained models?
Why scale the image pixels to between 0 & 1 (normalization)?
o Equal importance of features: prevents features with large values from dominating the learning process (appearing more important for learning) -> otherwise the model is biased towards features with larger magnitude
o Enhancing convergence speed (convergence = reaching a stable & optimal state)
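A minimal sketch of this scaling, assuming 8-bit images loaded as NumPy arrays:

    import numpy as np

    images = np.random.randint(0, 256, size=(10, 64, 64, 3), dtype=np.uint8)  # placeholder batch
    images_scaled = images.astype("float32") / 255.0  # all pixels now lie in [0, 1]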
How did you handle the unbalanced dataset?
o Undersampling: randomly choosing 5,000 uninfected images
o Oversampling: ADASYN (see the sketch below)
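A sketch of both balancing strategies with the imbalanced-learn package; the class counts and random_state below are placeholders, not our exact pipeline:

    import numpy as np
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.over_sampling import ADASYN

    X = np.random.rand(6000, 64 * 64)            # flattened features, placeholder
    y = np.array([0] * 5500 + [1] * 500)         # 0 = uninfected (majority), 1 = infected

    # Undersampling: randomly keep 5,000 uninfected samples
    rus = RandomUnderSampler(sampling_strategy={0: 5000}, random_state=42)
    X_under, y_under = rus.fit_resample(X, y)

    # Oversampling: ADASYN synthesizes new minority samples, focusing on harder regions
    ada = ADASYN(random_state=42)
    X_over, y_over = ada.fit_resample(X, y)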
What is Data Augmentation?
o Increasing the amount of training data by creating new samples through transformations applied to existing data
o We used geometric transformations: rotation & flipping (horizontal & vertical)
o Not used: scaling or cropping -> risk of losing important details of the cell images, as the bounding boxes lie close to each other
o Used: ImageDataGenerator from TensorFlow (sketched below)
o Only for the CNNs (the SVM serves as a baseline: beneficial to keep that model as simple as possible to have a clear foundation for comparison with the more advanced approaches) -> could have tried it with data augmentation as well
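A sketch of this augmentation with TensorFlow's ImageDataGenerator; the rotation_range value is an illustrative assumption:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        rescale=1.0 / 255,     # pixel normalization to [0, 1]
        rotation_range=20,     # random rotations
        horizontal_flip=True,
        vertical_flip=True,
    )
    # Typical use: model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=...)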
Why Data Augmentation?
o Introduces more variation and diversity into the training data -> the model generalizes better
o Mitigating overfitting: a larger & more diverse dataset -> the model learns the underlying characteristics instead of memorizing training samples -> better generalization
Why not use Data Augmentation for SVM?
o ImageDataGenerator from TensorFlow is not straightforward to integrate into the SVM workflow
o Baseline model -> should be trained on the "true" (unaugmented) images
What optimizer?
Adam: Adaptive Moment Estimation
- Extension of gradient descent
- Combines two algorithms: computes the first-order moment (moving average of gradients) and the second-order moment (moving average of squared gradients) of the loss function's gradients
-> adapts learning rate for each parameter based on its historical gradients & momentum
- GD with momentum = takes into consideration the 'exponentially weighted average' of the gradients -> averaging = converges faster
- RMSProp algorithm = takes into consideration the 'exponential moving average' of the squared gradients (average change over time)
ADAM - elements
- Adaptive Learning Rate (dynamically adapts it for each parameter in network -> smooth & fast convergence)
- Momentum (helps accelerate convergence by adding a fraction of the previous gradient’s direction to the current gradient update -> overcome local minima)
- Bias correction (to address the fact that the moment estimates start at zero and are therefore biased towards zero in the first iterations); a toy sketch follows below
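A toy NumPy sketch of one Adam training loop to make these three elements concrete; beta1, beta2, lr & eps follow the commonly used defaults and the gradients are random placeholders:

    import numpy as np

    lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
    theta = np.zeros(3)        # parameters
    m = np.zeros_like(theta)   # first moment (momentum: moving average of gradients)
    v = np.zeros_like(theta)   # second moment (moving average of squared gradients)

    for t in range(1, 101):
        g = np.random.randn(3)                  # placeholder gradient of the loss
        m = beta1 * m + (1 - beta1) * g         # momentum term
        v = beta2 * v + (1 - beta2) * g ** 2    # adaptive scaling term
        m_hat = m / (1 - beta1 ** t)            # bias correction (moments start at zero)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step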
Why use Adam?
Used in a similar study
Pros: efficiency & robustness, converges faster, works well on noisy & sparse data
o Difference to normal stochastic GD
Learning rate is adaptive (GD: maintains a single learning rate during training)
- What is momentum?
o Technique where a term is added to the parameter updates that accounts for the previous direction of movement
o E.g. with a momentum value of 0.9: 90% of the previous direction is retained, and only 10% is influenced by the current gradient (see the sketch below)
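A tiny sketch of gradient descent with momentum 0.9 as an exponentially weighted average of the gradients (learning rate and gradients are placeholders):

    import numpy as np

    lr, beta = 0.01, 0.9
    theta, velocity = np.zeros(3), np.zeros(3)
    for _ in range(100):
        g = np.random.randn(3)                       # placeholder gradient
        velocity = beta * velocity + (1 - beta) * g  # 90% previous direction, 10% current gradient
        theta -= lr * velocity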
What are different ways to choose the learning rate in gradient descent?
o Fixed learning rate: constant learning rate is set before training
o Learning rate schedules: adjust learning rate over time
Step decay: Learning rate is reduced by a certain factor after a fixed number of epochs or iterations
Exponential decay: The learning rate decreases exponentially over time.
Performance-Based Schedules: The learning rate is adjusted based on the performance on a validation set
o Adaptive learning rates: dynamically adjust the learning rate based on the behavior of the optimization process.
!!Adam combines adaptive learning rates with momentum. It adapts the learning rate for each parameter based on both the first-order (gradient) and second-order (gradient squared) moments of the gradients.
o Learning Rate Search: e.g. grid search
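A sketch of two of these options in Keras (an exponential decay schedule and a performance-based callback); the decay values and patience are illustrative assumptions:

    import tensorflow as tf

    # Exponential decay: learning rate shrinks by a fixed factor every decay_steps steps
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.001, decay_steps=1000, decay_rate=0.9)
    optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

    # Performance-based schedule: reduce the learning rate when validation loss stops improving
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3)
    # model.fit(..., callbacks=[reduce_lr])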
What does a learning rate of 0.001 mean?
o Represents the step size at which a machine learning algorithm updates the model parameters during training = the magnitude of the adjustments made to the model based on the calculated gradients
o hyperparameter
o 0.001 = relatively low learning rate
o Small step size: doesn't overshoot the minimum but takes a long time to converge
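A short sketch of setting this step size explicitly in Keras (assuming a model like the CNN sketched earlier):

    import tensorflow as tf

    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # small steps: stable but slower convergence
    # model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])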
What does binary cross entropy loss mean?
o Cost function that measures the difference between the predicted probability distribution and the true probability distribution for a binary classification problem
o calculates the average of the logarithmic loss for each instance, where a higher loss is assigned to incorrect predictions and a lower loss to correct predictions
- Loss: H_p(q) = -1/N * sum_{i=1}^{N} [ y_i * log(p(y_i)) + (1 - y_i) * log(1 - p(y_i)) ]
- y_i = true outcome (1 or 0)
- p(y_i) = predicted probability that the outcome is 1
- Log term: if p = 1, then -log(p) = 0; the smaller p, the larger -log(p) -> confident wrong predictions are penalized heavily
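A numerical sketch of this loss on a few made-up predictions, compared against the Keras implementation:

    import numpy as np
    import tensorflow as tf

    y_true = np.array([1, 0, 1, 0], dtype=np.float32)
    p = np.array([0.9, 0.2, 0.6, 0.4], dtype=np.float32)   # predicted probability of class 1

    manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    keras_bce = tf.keras.losses.BinaryCrossentropy()(y_true, p).numpy()
    print(manual, keras_bce)   # both approx. 0.338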