AI Flashcards
Network Medicine is based on…
network science, physics, applied mathematics and statistics, computer science, biology, and medicine
*
Patients are unique
*
Patients with the same clinical picture do not necessarily share the same disease pathophenotype
*
Networks of molecular interactions (the interactome) are used to identify unknown disease phenotypes and pathogenic events
*
Network of Networks
Network Medicine is …?
*
Different biological networks capture the complex interactions between genes, proteins, RNA molecules, metabolites and genetic variants in the cells of organisms
*
These networks, also interchangeably known as graphs, are representations in which the complex system components are simplified as nodes that are connected by links (edges)
*
Network medicine is largely discovery driven, rather than hypothesis driven, uncovering previously unknown relationships and leading to the identification of new biomarkers
Network-based studies have to primarily identify two things…?
*
what are the critical entities in the system under investigation (nodes)
*
what is the nature of the interactions between these entities (edges)
What kinds of graph are used in network medicine?
Binary vs Weighted
Directed vs undirected
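A minimal sketch of how these graph kinds could be stored. The node names A, B, C and the weights are made up for illustration and do not come from the notes.

```python
# Toy networks with hypothetical nodes A, B, C (illustrative only).

# Binary, undirected: an edge either exists or not, and A-B equals B-A.
binary_undirected = {frozenset(("A", "B")), frozenset(("A", "C"))}

# Weighted, undirected: each edge carries a strength (made-up scores).
weighted_undirected = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.4,
}

# Directed: (source, target) order matters, e.g. a regulator and its target.
directed = {("C", "A"), ("A", "B")}


def has_edge_undirected(edges, a, b):
    """True if a and b are linked, regardless of the order given."""
    return frozenset((a, b)) in edges


def has_edge_directed(edges, source, target):
    """True only if the edge points from source to target."""
    return (source, target) in edges


print(has_edge_undirected(binary_undirected, "B", "A"))
print(has_edge_directed(directed, "B", "A"))
print(weighted_undirected[frozenset(("B", "A"))])
```

The undirected lookups ignore the order of the two nodes, while the directed lookup does not.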
How is the identification of disease-associated network components within the interactome done?
*
Consider the topological properties of the nodes and assess the functional role of their hubness, i.e. the property of having a higher number of connections
*
Identification of new disease genes in the network by using “guilt-by-association”, a property based not on direct evidence but on association with other disease genes
*
Prioritization of candidate disease genes: molecular interaction networks assist in the identification of sub-networks mechanistically linked to disease phenotypes
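The ideas of hubness and guilt-by-association can be sketched on a toy network. The nodes A to F, the edges, and the set of known disease genes are all hypothetical. The score used here (fraction of neighbours that are known disease genes) is one simple assumption, not a specific published method.

```python
from collections import defaultdict

# Hypothetical interactome and known disease genes (illustrative only).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E"), ("E", "F")]
known_disease_genes = {"B", "C"}

# Build an undirected neighbour map.
neighbours = defaultdict(set)
for u, v in edges:
    neighbours[u].add(v)
    neighbours[v].add(u)

# Hubness: the degree is the number of connections of each node.
degree = {node: len(nb) for node, nb in neighbours.items()}
hub = max(degree, key=degree.get)


def guilt_score(node):
    """Fraction of a node's neighbours that are known disease genes."""
    nb = neighbours[node]
    return len(nb & known_disease_genes) / len(nb)


# Prioritise the genes that are not yet linked to the disease.
candidates = sorted(
    (n for n in neighbours if n not in known_disease_genes),
    key=guilt_score,
    reverse=True,
)

print("hub:", hub, "degree:", degree[hub])
print("ranked candidates:", candidates)
```

In this toy example the hub A is also the top candidate, because two of its three neighbours are known disease genes.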
How does co-expression-based network modeling to identify disease biomarkers work?
*
Patterns of transcript abundance are studied in the context of the disease after construction of Gene Co-expression Networks (GCNs)
*
Combination of important seed genes with an organic network of co-expression patterns derived from the gene expression data from the same system
*
GCNs identify the functionally coordinated participation of genes in response to an external stimulus or condition
*
GCNs can be signed or unsigned, weighted or unweighted, and may be constructed using either microarray or RNA-Seq data
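A minimal sketch of building a gene co-expression network from expression values. The four genes and their values are invented, and so are the Pearson correlation threshold of 0.8 and the function names. Here "signed" simply keeps the sign of the correlation; established GCN tools define signed networks in their own ways.

```python
import math

# Made-up expression values for four hypothetical genes across five samples.
expression = {
    "gene1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene2": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises with gene1
    "gene3": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as gene1 rises
    "gene4": [3.0, 1.0, 4.0, 1.0, 3.0],  # unrelated pattern
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_network(data, threshold=0.8, signed=True, weighted=True):
    """Link gene pairs whose absolute correlation reaches the threshold."""
    genes = sorted(data)
    network = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(data[g], data[h])
            if abs(r) < threshold:
                continue
            weight = r if signed else abs(r)
            if not weighted:
                weight = 1 if weight > 0 else -1
            network[(g, h)] = weight
    return network


signed_network = coexpression_network(expression)
unsigned_binary = coexpression_network(expression, signed=False, weighted=False)
print(signed_network)
print(unsigned_binary)
```

gene4 stays unconnected because its pattern does not follow the other genes.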
How do we infer (form an opinion about) phenotype-specific gene regulatory networks?
*
Separate networks can be built for each phenotype which may be case-control, disease-specific, tissue or cell-specific, sex-specific, or for different disease subtypes
*
The network comparison model stems from the axiom of “differential networking” over “differential expression”
*
The comparison of networks helps to uncover the specific rewiring of pathways, such as those induced by disease, pharmacological treatment, or environmental stimuli and more
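A minimal sketch of comparing two phenotype-specific networks. The control and disease edge lists are hypothetical, and comparing plain edge sets is the simplest possible assumption about what "rewiring" means.

```python
# Hypothetical edge lists for two phenotype-specific networks.
control_edges = [("A", "B"), ("B", "C"), ("C", "D")]
disease_edges = [("B", "A"), ("B", "D"), ("C", "D")]


def as_undirected(edges):
    """Treat (u, v) and (v, u) as the same edge."""
    return {frozenset(edge) for edge in edges}


def rewiring(network_a, network_b):
    """Edges lost, gained and shared when going from network_a to network_b."""
    a, b = as_undirected(network_a), as_undirected(network_b)
    return {"lost": a - b, "gained": b - a, "shared": a & b}


result = rewiring(control_edges, disease_edges)
for kind, edge_set in result.items():
    print(kind, sorted(sorted(edge) for edge in edge_set))
```

Here the B-C link is lost and a B-D link is gained in the disease network, while A-B and C-D are unchanged.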
What are the future needs in network medicine (NM)?
* Define as much as possible the biological heterogeneity to increase the precision of risk prediction and the personalization of prevention and intervention strategies
* Help researchers to better understand the human physiological and clinical relevance (to avoid reverse technological processes) and to focus on the relevance for the patients' needs
* Integrate data of different nature in a way that rapidly reduces the dimensionality, in order to distill implementable results in drug discovery/healthcare management
What is NM used for?
Disease Understanding: Network medicine enables researchers to characterize diseases as perturbations in complex biological networks rather than isolated anomalies. By mapping out the interactions among genes, proteins, and other molecular entities, network medicine provides insights into disease mechanisms, progression, and heterogeneity. This holistic approach aids in identifying novel biomarkers and therapeutic targets.
Personalized Medicine: By integrating patient-specific data, such as genomics, transcriptomics, and clinical information, with network-based models, personalized treatment strategies can be devised. Network analysis helps in identifying patient subgroups with similar molecular profiles and predicting individual responses to drugs, allowing for tailored therapeutic interventions.
Drug Discovery and Repurposing: Network medicine facilitates the identification of drug targets and the repurposing of existing drugs for new indications. By analyzing drug-protein interaction networks and their effects on disease-associated pathways, researchers can identify candidate compounds with therapeutic potential and optimize drug combinations for synergistic effects.
Systems Pharmacology: Network medicine provides a systems-level understanding of drug actions and their effects on biological pathways. By integrating pharmacological data with molecular networks, researchers can predict drug efficacy, side effects, and interactions, aiding in the design of safer and more effective treatments.
Biomarker Discovery: Network-based approaches help in the identification of molecular signatures and biomarkers associated with disease diagnosis, prognosis, and treatment response. By analyzing the connectivity and dynamics of biomolecular networks, researchers can uncover diagnostic markers for early disease detection and monitor disease progression.
Biological Network Visualization and Interpretation: Network visualization tools and software platforms allow researchers to visually explore and interpret complex biological networks. By representing molecular interactions as graphical networks, researchers can identify key nodes (e.g., hubs, bottlenecks) and pathways implicated in disease pathogenesis, facilitating hypothesis generation and experimental validation.
What are Artificial Intelligence (AI) & Machine Learning (ML)?
* AI: the theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision making, and translation between languages.
* ML: the use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data.
-> Artificial intelligence is simulated intellectual tasks. Machine learning is algorithms trained on data to learn patterns and make predictions.
Machine learning use cases in life science
Genomics
* Variant calling
* Genetic sequence of a cancer, e.g. druggable targets
* Functional predictions
OMICS & life science
* Risk factors (e.g., hypertension)
* Integration of multiomics
* Protein structure predictions
* DDI networks
* Drug discovery
Diagnostics
* Images of patients, e.g. eye, skin, hair
* CT pictures, e.g. of the head, cancer
* X-ray films
* Real-time video of a colonoscopy
Healthcare diagnostics
* Alerts & diagnostics from real-time EHR data
* Predictive health management
* Healthcare provider sentiment analysis
What is the big difference between deep learning and machine learning?
Feature extraction is done manually in (classical) machine learning, whereas in deep learning we don’t give the model the features; it learns how to classify by itself
Can we have both accuracy and interpretability in ML?
There is a trade-off between accuracy and interpretability for ML models
How does ChatGPT work?
ChatGPT splits the text into tokens
It predicts which token (e.g. word) comes after the previous ones
Possible token levels
*
Sentence
*
Words
*
Subword
*
Character
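The four token levels can be illustrated with a short sketch. The example sentence, the tiny subword vocabulary and the greedy splitting rule are assumptions for illustration; real language models learn their subword vocabulary from data and do not use this toy function.

```python
import re

text = "Network medicine is discovery driven. It finds new biomarkers."

# Sentence level: split after sentence-ending punctuation.
sentence_tokens = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Word level: runs of letters and digits.
word_tokens = re.findall(r"\w+", text)

# Character level: every character is a token.
character_tokens = list("biomarkers")


def toy_subwords(word, vocabulary):
    """Greedy longest-match split using a hypothetical vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocabulary:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece found: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Subword level: pieces smaller than a word but larger than a character.
subword_tokens = toy_subwords("biomarkers", {"bio", "marker", "s"})

print(sentence_tokens)
print(word_tokens)
print(subword_tokens)
print(character_tokens)
```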
How does supervised learning work?
In supervised learning we give the algorithm training data that is already categorized (labelled),
so it can then say whether a new example is good or bad, for example (binary)
What if we have more than one input?
With two inputs it can draw a line in two dimensions and categorise the elements
How does unsupervised learning work?
“The data comes only with inputs x but not output labels y, and the algorithm has to find some structure or some pattern or something interesting in the data.”
Questions: apply a supervised or an unsupervised learning algorithm?
* Given emails labelled as spam/not spam, learn a spam filter -> Supervised
* Given a set of published papers found on PubMed, group them into sets of articles about the same research topic -> Unsupervised
* Given a database of expression data of patients, automatically discover signals and group patients into different response groups -> Unsupervised
* Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not -> Supervised
What is the basic principle of supervised regression learning?
Training set -> learning algorithm -> model f; feature x -> model f -> prediction ŷ (estimated y)
What is f?
𝑓(𝑥)=𝑤𝑥+𝑏
Linear regression with one variable/feature = univariate linear regression
Needed:
Matrix of features
Matrix of coefficients
Principle of machine learning algorithms
3-step process:
1. Infer / Predict
2. Error / Loss
3. Train / Learn
- Predict: MOVE
- Error: BAD or GOOD
- Learn: Oh, this was a terrible idea
- Reinforcement: Well done, do it again
Model:
Decreasing or increasing the weights
What does the cost function do?
The squared error cost function calculates the distance (Mean Squared Error, MSE) between the predictions and the correct values for the model:
𝑓(𝑥)=𝑤𝑥+𝑏
Training optimizes w and b to get the lowest Mean Squared Error (for more complex, non-convex models the optimizer can end up in a local minimum, and that is a problem)
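A minimal sketch of the predict / error / learn loop for f(x) = wx + b using gradient descent on the mean squared error. The four data points, the learning rate and the number of steps are assumptions chosen so that the toy example converges. Gradient descent is one common optimizer; the notes do not say which one was used.

```python
# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def predict(x, w, b):
    """The model f(x) = w*x + b."""
    return w * x + b


def mse(w, b):
    """Mean squared error between predictions and correct values."""
    return sum((predict(x, w, b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(learning_rate=0.05, steps=5000):
    """Repeatedly predict, measure the error, and adjust the weights."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [predict(x, w, b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Decrease or increase the weights against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train()
print(round(w, 4), round(b, 4), mse(w, b))
```

For this linear model the cost surface is bowl-shaped, so the loop ends near w = 2 and b = 1 regardless of the starting point.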
What is overfitting?
Overfitting is an undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data. When data scientists use machine learning models for making predictions, they first train the model on a known data set. An overfitted model is fitted too closely to that training data, for example a high-order polynomial such as f(x) = w1·x + w2·x^2 + w3·x^3 + … + b
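Overfitting can be sketched with a polynomial that passes exactly through every training point. The data are invented: a straight line y = 2x + 1 plus made-up noise values. The Lagrange interpolation form is just a convenient way to build such a polynomial, not a method from the notes.

```python
# Training data: y = 2x + 1 plus made-up measurement noise.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
noise = [0.5, -0.4, 0.3, -0.6, 0.4, -0.3, 0.6, -0.5]
y_train = [2 * x + 1 + e for x, e in zip(x_train, noise)]

# New data from the same underlying line, without noise.
x_test = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
y_test = [2 * x + 1 for x in x_test]


def overfitted_model(x):
    """Degree-7 polynomial through all 8 training points (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x_train, y_train)):
        term = yi
        for j, xj in enumerate(x_train):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def mse(xs, ys):
    """Mean squared error of the overfitted model on a data set."""
    return sum((overfitted_model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_mse = mse(x_train, y_train)
test_mse = mse(x_test, y_test)
print("training MSE:", train_mse)
print("test MSE:", test_mse)
```

The polynomial reproduces the training data, including its noise, so the training error is essentially zero while the error on the new points is larger.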
Why is AlphaFold not perfect?
Functional predictions of variants:
“AlphaFold has not been validated for predicting the effect of mutations. In particular, AlphaFold is not expected to produce an unfolded protein structure given a sequence containing a destabilising point mutation.”
The best assessment of whether a variant has structural or functional impact also requires contextual knowledge,
but you can predict the effect of missense variants with AlphaMissense
Can we predict CYP2D6 phenotype with machine learning?
Yes, and we can do functional assessment of pharmacogenomic variants
When predicting CYP2D6 phenotype with machine learning, we can skip the annotation as star alleles and the allocation of numeric values (1, 0) and still get great results
There are also other ways to predict it that use star alleles
What is an ensemble or metalearner?
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained by any of the constituent algorithms.
You can use multiple machine learning algorithms and let the metalearner decide which ones are better, e.g. with the SuperLearner package in R
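A minimal sketch of the ensemble idea using simple majority voting. The three rule-based "learners", their thresholds and the patient record are hypothetical. This is plain voting, not the weighting scheme of the SuperLearner package.

```python
from collections import Counter


# Three hypothetical base learners, each looking at one feature.
def learner_age(patient):
    return "high risk" if patient["age"] > 60 else "low risk"


def learner_bmi(patient):
    return "high risk" if patient["bmi"] > 30 else "low risk"


def learner_smoking(patient):
    return "high risk" if patient["smoker"] else "low risk"


def ensemble_predict(patient, learners):
    """Return the label predicted by most of the base learners."""
    votes = Counter(learner(patient) for learner in learners)
    return votes.most_common(1)[0][0]


learners = [learner_age, learner_bmi, learner_smoking]
patient = {"age": 70, "bmi": 25, "smoker": True}
prediction = ensemble_predict(patient, learners)
print(prediction)
```

With three learners and two labels there is always a majority, which keeps the vote unambiguous.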
Applications?
* The hemoglobin levels and the amount of blood transfused can be estimated with less error than before because of ML
* Supervised machine learning methods trained using SNPs and total baseline depression scores predicted remission and response at 8 weeks with an area under the receiver operating characteristic curve (AUC) of 0.7 (70% prediction accuracy)
* Assessment of drug-drug interactions in polypharmacy using graph convolutional networks
* AI performs as well as doctors in university tests
What is the difference between unsupervised machine learning and deep learning?
Unsupervised Machine Learning:
In unsupervised learning, the algorithm is given a dataset without explicit instructions on what to do with it. The algorithm must find patterns, structure, or relationships within the data on its own.
Unsupervised learning techniques include clustering, dimensionality reduction, and association rule learning.
Clustering algorithms like K-means or hierarchical clustering group similar data points together based on their inherent patterns or similarities.
Dimensionality reduction techniques such as Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) aim to reduce the number of features in a dataset while preserving its important characteristics.
Unsupervised learning is often used for tasks such as anomaly detection, data compression, and exploratory data analysis.
Deep Learning:
Deep learning is a subset of machine learning that involves neural networks with multiple layers (deep neural networks). These networks can automatically learn hierarchical representations of data.
Deep learning models are typically trained using large amounts of labeled data, and they learn to extract features directly from the raw data without the need for manual feature engineering.
Deep learning has shown remarkable success in various tasks such as image recognition, natural language processing, speech recognition, and recommendation systems.
Common architectures in deep learning include Convolutional Neural Networks (CNNs) for image-related tasks, Recurrent Neural Networks (RNNs) for sequence data, and Transformers for natural language processing tasks.
While deep learning models can be used for unsupervised learning tasks (e.g., autoencoders for dimensionality reduction or generative adversarial networks for generating synthetic data), they are more commonly associated with supervised learning where they learn to map inputs to outputs.
key points of machine learning and AI in pharmacogenomics?
Personalized Medicine: Machine learning and AI techniques enable the development of personalized medicine approaches in pharmacogenomics. By analyzing an individual’s genetic information, along with other relevant clinical data, these techniques can predict a patient’s response to a particular drug or dosage regimen.
Predictive Modeling: Machine learning algorithms can build predictive models that identify genetic markers or signatures associated with drug response or adverse reactions. These models can be used to stratify patient populations and guide treatment decisions, ultimately leading to more effective and safer drug therapies.
Drug Discovery and Development: AI algorithms can accelerate drug discovery and development processes by analyzing vast amounts of genomic and chemical data. These techniques help identify potential drug targets, predict drug-drug interactions, optimize drug candidates, and design more effective clinical trials.
Genomic Data Analysis: Machine learning methods are instrumental in analyzing large-scale genomic datasets, including genome-wide association studies (GWAS) and next-generation sequencing data. These techniques can uncover genetic variants associated with drug metabolism, pharmacokinetics, and pharmacodynamics.
Drug Repurposing: AI-driven approaches facilitate drug repurposing efforts by identifying new therapeutic indications for existing drugs based on their genomic and pharmacological profiles. This approach can expedite the development of novel treatments for various diseases.
Adverse Drug Reaction Prediction: Machine learning models can predict the likelihood of adverse drug reactions based on genetic factors, enabling proactive measures to mitigate risks and improve patient safety.
Clinical Decision Support Systems: AI-powered clinical decision support systems integrate genomic data with electronic health records (EHRs) to provide healthcare professionals with personalized treatment recommendations and dosage adjustments tailored to individual patients.
Which Supervised Machine Learning (SML) methods did we learn?
Linear regression, K-nearest neighbours (KNN), Random Forest
- Unsupervised Machine Learning (UML)
K-means clustering, Hierarchical clustering, Principal component analysis (PCA)
AI, Machine Learning, Neural Network and Deep Learning. What’s the difference?
* Machine learning (ML) is a subfield of AI, or a path to AI
* Algorithms to learn insights and recognise patterns from data
* Deep Learning and Neural Networks are methods of ML
* Deep Learning structures algorithms in Neural Networks, with the aim of teaching them to take decisions
Supervised Machine Learning (SML), how does it work?
In SML, algorithms learn from labelled data
* Regression is used to understand the relationship between dependent and independent variables
* Classification assigns test data to categories based on specific variables
Simple linear (and logistic) regression, when can we apply it?
* Used to predict (forecast) the value of the dependent variable based on the independent variable
* Linear regression is applied to continuous dependent variables, whilst logistic regression is applied to discrete ones
Simple linear regression, how does it work?
- Residuals can be used to validate the model by making sure that they are independent and normally distributed
- As the number of independent variables increases, multiple linear regression is applied
𝑦 = 𝑎 + 𝑏𝑥 + ε
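A minimal sketch of fitting y = a + bx by ordinary least squares and inspecting the residuals. The five data points are invented, and the closed-form least-squares estimates are a standard choice that the notes do not spell out.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates of intercept a and slope b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Made-up observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 3.9, 6.1, 8.0, 9.8]

a, b = fit_simple_linear_regression(xs, ys)

# Residuals: observed value minus the value predicted by the line.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print("a =", round(a, 3), "b =", round(b, 3))
print("residuals:", [round(r, 3) for r in residuals])
```

The residuals of a least-squares line with an intercept sum to zero; checking their independence and distribution is the validation step mentioned above.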
Multiple linear regression, how does it work?
- Builds a model to describe Y in the best way using Xn
- Uses independent variables to predict the dependent variable. Example:
-> Total Cholesterol = a + b1*BMI + b2*Time exercising + b3*Shoe size + … + ε
- But is shoe size relevant?
𝑦 = 𝑎 + 𝑏₁𝑥₁ + 𝑏₂𝑥₂ + 𝑏₃𝑥₃ + … + ε
Multiple linear regression assumptions?
- Parametric test based on assumptions:
- Linear relationship between Y and X
- Xi are not highly correlated with each other
- The variance of the residuals is constant
- Independence of observations
- Residuals are normally distributed
How can we test a multiple linear regression model?
- The model can be tested with the Root Mean Square Error (RMSE), the standard deviation of the residuals: add up all the squared residuals, divide by the sample size, and take the square root of the result
How to use multiple linear regression for prediction?
- Create a random 80/20 split of the data, generating training data (80%) and test data (20%)
- Train a regression model on the training data
- Apply the model on the test data
- Calculate RMSE of the training data (in-sample RMSE) and test data (out-of-sample RMSE)
* Compare the RMSEs. This indicates how well the model performs on new data.
* More complex model -> decreasing in-sample RMSE -> risk of overfitting
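The 80/20 workflow can be sketched as follows. To keep it short, the sketch uses a single feature instead of several, and the data are simulated from y = 2x + 1 with random noise. The seed, the sample size and the function names are assumptions.

```python
import math
import random

# Simulated data: y = 2x + 1 plus Gaussian noise (illustrative only).
random.seed(0)
data = [(float(x), 2.0 * x + 1.0 + random.gauss(0, 1.0)) for x in range(50)]

# Random 80/20 split into training and test data.
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]


def fit(points):
    """Least-squares intercept a and slope b for one feature."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rmse(points, a, b):
    """Square root of the mean of the squared residuals."""
    squared = [(y - (a + b * x)) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(points))


# Train on the training data only, then evaluate on both sets.
a, b = fit(train)
in_sample_rmse = rmse(train, a, b)
out_of_sample_rmse = rmse(test, a, b)

print("in-sample RMSE:", in_sample_rmse)
print("out-of-sample RMSE:", out_of_sample_rmse)
```

An out-of-sample RMSE that is much larger than the in-sample RMSE would suggest that the model does not generalise to new data.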
Linear regression models pros and cons?
Pros:
* Can be used on continuous (linear) and discrete (logistic) data
* Determine influence of independent variables on the dependent variable
* Identifying outliers
Cons:
* No mixed data (continuous & discrete)
* Many assumptions
* Requires complete data with no missing values
K-nearest neighbors (KNN), how does it work?
- Non-parametric algorithm, i.e. no strong assumptions
- Often used for classification, predicting the group of a data point
- Applies majority voting based on:
  Distance metrics
  Number of K’s
- Calculate the distances, usually with Euclidean distance
- Find the nearest neighbours by ranking the distances
- Majority vote on the predicted class label based on the K nearest neighbours
K is the number of nearest neighbors taken into account
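The three KNN steps (distances, ranking, majority vote) can be sketched in a few lines. The six labelled training points and the choice of K = 3 are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical labelled training points: ((feature1, feature2), label).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.0), "B"),
]


def knn_predict(point, training, k=3):
    """Classify a point by majority vote among its k nearest neighbours."""
    # Steps 1 and 2: calculate Euclidean distances and rank by them.
    ranked = sorted(training, key=lambda item: math.dist(point, item[0]))
    # Step 3: majority vote over the labels of the k closest points.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


print(knn_predict((2.0, 2.0), training))
print(knn_predict((6.0, 5.0), training))
```

No model is fitted beforehand: all work happens at prediction time, which is why the features need to be on comparable scales.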
KNN pros and cons ?
Pros:
* It is easy to implement
* No need to train a model
* Versatile, distance algorithms can handle different types of data
Cons:
* Data should be of the same scale which can be difficult with large datasets
* Setting the K can be challenging
Tips:
* Test different K’s
* K should be an odd number to avoid ties (draws) in the vote
Decision tree and random forest, how does it work?
Random forest is based on decision trees
* Generates many decision trees to create the random forest, which classifies unlabeled data -> a single tree is not accurate
* Can use both categorical and continuous variables
Random forest
1. Create a bootstrapped dataset that is the same size as the original
-> Randomly selected data, where duplicates are allowed
2. Create a decision tree using the bootstrapped data, using a random subset of variables
3. Repeat 1 and 2 multiple times
4. Input your unlabeled data and let the random forest’s many classifiers label it
5. Majority vote classifies the unlabeled data
Random forest validation with Out-of-Bag
- The random forest model can be validated using the out-of-bag error
- The random forest is used to predict labels of data not selected for the bootstrapped data (test set)
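A minimal sketch of the bootstrapping step and the resulting out-of-bag samples. It only shows the sampling: no decision trees are trained, and the dataset size of 10 and the seed are assumptions.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement, so duplicates are allowed."""
    return [rng.randrange(n) for _ in range(n)]


def out_of_bag(n, in_bag):
    """Rows never drawn for this tree; they can act as its test set."""
    return sorted(set(range(n)) - set(in_bag))


rng = random.Random(42)
n = 10  # size of a hypothetical dataset

in_bag = bootstrap_indices(n, rng)
oob = out_of_bag(n, in_bag)

print("bootstrapped rows:", in_bag)
print("out-of-bag rows:", oob)
```

In a full random forest this sampling is repeated for every tree, and each row is predicted only by the trees that did not see it during training.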
Random forest pros and cons?
Pros:
* Can be used on many types and mixes of data
* Can be applied to both classification and regression problems
* Can be applied to data with missing values
* Less prone to overfitting and the curse of dimensionality
Cons:
* Very complex and you can’t follow the decisions of the trees
* Training the model takes time and computing power
Types or reasons to use unsupervised machine learning (UML), and how does it work?
- In UML, algorithms are used to analyze and cluster unlabelled data
-> Data grouping based on patterns
-> Similarities and differences of the data
- Clustering is applied on raw data and groups it based on similarities and differences between the structure and/or patterns of the data
- Dimensionality reduction can be applied to reduce the complexity of data whilst preserving the structure, to reduce ”noise” and overfitting of ML algorithms.
K-means clustering?
- Not to be confused with KNN
- Groups similar datapoints in clusters
- K is the number of clusters and means generated
Steps:
1. Set the number of K’s with an Elbow plot
2. Generate K random centroids
3. Create K clusters by assigning each data point to the closest centroid
4. Calculate new centroids for each cluster
5. Reassign points to the new centroids
   If there are new assignments, repeat from step 4
   If there are no new assignments, terminate the algorithm
Elbow plot determines number of K’s
* First step of K-means clustering is to set the K
* The Elbow method is common
* Distortion is the sum of squared distances of data points from their cluster centers
- Decreases as K increases.
- 0 when K = number of points
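A minimal sketch of K-means and of the distortion used in an elbow plot. The six 2D points are invented. As a simplifying assumption, the initial centroids are K randomly chosen data points, and the seed and function names are hypothetical.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment) after iterating until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K random starting centroids
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no new assignments: terminate
        assignment = new_assignment
        # Calculate the new centroid (mean) of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


def distortion(points, centroids, assignment):
    """Sum of squared distances of points from their cluster centers."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

# Distortion for each K: these are the values an elbow plot would show.
distortions = {}
for k in range(1, len(points) + 1):
    centroids, assignment = kmeans(points, k)
    distortions[k] = distortion(points, centroids, assignment)
    print(k, round(distortions[k], 3))
```

The distortion drops sharply from K = 1 to K = 2 for these two well-separated groups and reaches 0 when K equals the number of points.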
Hierarchical clustering, how does it work?
- Groups similar data points into clusters
- Defines clusters that are distinct from each other and whose datapoints are similar
- Creates clusters by ordering clusters:
- Bottom-up (Agglomerative)
- Top-down (Divisive)
- The length of the branch in the dendrogram shows how similar the data points are.
-> Long branch = dissimilar, short branch = similar
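A minimal sketch of the bottom-up (agglomerative) variant on one-dimensional data. The five values are invented. Single linkage (the distance between the closest members) is an assumed choice; other linkage rules exist.

```python
def agglomerative(values):
    """Merge the two closest clusters until one is left; return the merges."""
    clusters = [[v] for v in values]  # start: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges


values = [1.0, 1.5, 5.0, 5.2, 9.0]
merges = agglomerative(values)
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 2))
```

The merge heights correspond to the branch lengths of a dendrogram: the first merges happen at small heights (similar points) and the last at large heights (dissimilar groups).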
Hierarchical clustering pros and cons ?
Pros:
* Easy to use
* The dendrogram gives information
about the data structure
* Can be used to set number of
clusters
Cons:
* Sensitive to outliers
* Does not work well with missing
data or mixed data
* In complex data, difficult to
determine number of relevant
clusters
Principal component analysis (PCA)?
Common and versatile method used for:
* Analysing the structure of data features
* Pre-processing for other ML algorithms
* Visualisation
Summarises large multi-dimensional datasets to a smaller number of dimensions (ideally 2) that can be visualised
Steps:
- Plot the data, e.g. genes 1 & 2 are higher in samples 1 & 2…
- Calculate the average of gene 1 and gene 2 (up to gene n) to find the center of the data.
- Center the data at the origin (0,0)
- Find the line through the origin with the best fit. PCA defines the best fit by projecting the points onto the line and minimizing their distance to it.
- The line is called Principal Component 1 (PC1)
- The eigenvectors are calculated. A higher loading indicates more influence on the PC, i.e. Gene 1 (0.82) influences PC1 more than Gene 2 (0.57).
Multi-dimensions and PC n
* PC2 is perpendicular to PC1. PC3 is perpendicular to PC1 and PC2, etc.
* There are as many PCs as genes
* PC1 explains most of the variance in the data, PC2 the second most, etc.
* Projection in 2D, so two PCs are projected
- The datapoints are projected onto the PCs.
- Hopefully, we see some clustering…
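The PCA steps can be sketched for two genes, where the eigenvectors of the 2x2 covariance matrix have a closed form. The six samples are invented, so the loadings will not match the 0.82 and 0.57 quoted above. The sketch also assumes the two genes are correlated (non-zero covariance).

```python
import math

# Made-up expression values: (gene 1, gene 2) for six samples.
samples = [(10.0, 6.0), (11.0, 4.0), (8.0, 5.0), (3.0, 3.0), (2.0, 2.8), (1.0, 1.0)]
n = len(samples)

# Find the center of the data and move it to the origin (0, 0).
mean1 = sum(s[0] for s in samples) / n
mean2 = sum(s[1] for s in samples) / n
centered = [(g1 - mean1, g2 - mean2) for g1, g2 in samples]

# Sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix (variance along PC1 and PC2).
trace, det = sxx + syy, sxx * syy - sxy * sxy
root = math.sqrt(trace * trace / 4 - det)
eig1, eig2 = trace / 2 + root, trace / 2 - root

# Unit eigenvector for the largest eigenvalue: the loadings of PC1.
length = math.hypot(eig1 - syy, sxy)
pc1 = ((eig1 - syy) / length, sxy / length)
pc2 = (-pc1[1], pc1[0])  # perpendicular to PC1

# Project the centered samples onto PC1 and PC2.
scores = [(x * pc1[0] + y * pc1[1], x * pc2[0] + y * pc2[1]) for x, y in centered]
explained_pc1 = eig1 / (eig1 + eig2)

print("PC1 loadings (gene 1, gene 2):", pc1)
print("variance explained by PC1:", explained_pc1)
```

The gene with the larger absolute loading influences PC1 more, and the scores are the coordinates that would be plotted to look for clusters.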
PCA pros and cons?
Pros:
* Can remove noise (correlated features)
* Improve ML algorithms by removing noise
-> Reduces overfitting
* Visualisation
Cons:
* PCA turns independent variables into PCs, which can be hard to interpret
* Requires standardised data and therefore does not work well on mixed data
Karolinska Institutet 03/02/2023 35
t-SNE and UMAP are more recent, non-linear alternatives to PCA that often project the data better, making clustering easier
How is ML actually being used in medicine and pharmacology?
- ML algorithms are used together
- Nested in networks or parts of pipelines
- Used as tools, from a ML toolbox
- Important to know when and why to use it
GestaltMatcher and Face2Gene are examples of the use of which ML type?
Supervised classifiers are often used in image analysis, for example when diagnosing rare diseases.
Here, KNN is nested into a Deep Neural Network.
The datapoints in the KNN are other patients and their phenotypes
What does DESeq2 do?
- Most used method in analysing bulk RNA-sequencing data
- Other methods are limma and edgeR. The common aim is to find differentially expressed genes (or proteins, lipids etc.)