Lecture 3 Flashcards

- Understand the basic concepts of machine learning
- Know what a molecular descriptor is

1
Q
  • What is Machine Learning and what do we use it for in chemistry?
A
  • The practice of using algorithms to parse data, learn from it, and then make a prediction about something.
  • In chemistry it can be used to build ML-based interatomic potentials that compute free energy surfaces (FES) accurately and quickly.
  • It can also be used for drug design and for predicting the properties of molecules before synthesising them.
2
Q

(IMP)

  • ML methods are good at … data, but poor at … predictions.
  • In other words, ML can give accurate predictions … … … … (filling the gaps), but is poor at predicting data … of it.
  • The … /… of data accessible and how we … it are much more important than the algorithm we feed it into.
A

(IMP)

  • ML methods are good at interpolating data, but poor at extrapolating predictions.
  • In other words, ML can give accurate predictions within a data set (filling the gaps), but is poor at predicting data outside of it.
  • The amount/quality of data accessible and how we describe it are much more important than the algorithm we feed it into.
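
The interpolation/extrapolation point can be illustrated with a small sketch. Everything here is a made-up toy, not from the lecture: the "true" relationship is assumed to be sin(x), and a cubic polynomial fit stands in for any ML regression model. The fit is checked inside the training range and far outside it.

```python
import numpy as np

# Hypothetical toy data: the "true" relationship is sin(x), sampled on [0, pi].
x_train = np.linspace(0.0, np.pi, 20)
y_train = np.sin(x_train)

# A cubic polynomial fit stands in for any ML regression model.
model = np.poly1d(np.polyfit(x_train, y_train, deg=3))

def max_abs_error(x):
    """Largest deviation of the fitted model from the true function on x."""
    return float(np.max(np.abs(model(x) - np.sin(x))))

# Points inside the training range (interpolation) ...
err_inside = max_abs_error(np.linspace(0.1, 3.0, 50))
# ... and points well outside it (extrapolation).
err_outside = max_abs_error(np.linspace(2 * np.pi, 3 * np.pi, 50))

print(f"max error inside training range:  {err_inside:.3f}")
print(f"max error outside training range: {err_outside:.3f}")
```

The error inside the sampled range stays small, while outside it the polynomial diverges from the true function, which is the behaviour the card describes.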
3
Q

What is the pipeline of ML?

A
  • This describes supervised learning (learns according to given target values)
  • (1) Dataset formed from input data
  • (2) Clean, prepare and manipulate data by forming descriptors
  • (3) Train model using ML algorithm and parameter optimisation
  • (4) Test the model: are the predictions in line with the expected data?
  • (5) Improve by expanding data set.
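
The five steps above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the lecture's method: the dataset is synthetic, the descriptor `[1, x, x^2]` is chosen by hand, and ordinary least squares stands in for the ML algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Dataset: hypothetical raw inputs x with known target values y.
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x**2 - x + 0.5 + rng.normal(0.0, 0.05, size=200)

# (2) Descriptors: encode each raw input as a feature vector [1, x, x^2].
def describe(x):
    return np.column_stack([np.ones_like(x), x, x**2])

X = describe(x)

# Hold back part of the data for testing.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# (3) Train: optimise the model parameters (here by least squares).
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# (4) Test: compare predictions with the held-back target values.
y_pred = X_test @ weights
rmse = float(np.sqrt(np.mean((y_pred - y_test) ** 2)))
print(f"test RMSE = {rmse:.3f}")

# (5) Improve: if the test error is too large, expand the dataset and repeat.
```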
4
Q

Why is there a need for ML-based interatomic potentials in simulations?

A
  • Classical force fields lack the detail to reproduce complex systems accurately, as their functional forms can be very limiting.
  • The timescales over which these systems need to be simulated are also far too long for quantum calculations.
5
Q
  • How would we build an ML-based interatomic potential?
A
  • Build a dataset of small configurations of the system, for which the energy and forces are first calculated accurately (e.g. with DFT; this is the expensive part).
  • Represent these configurations in terms of local atomic environments (LAE).
  • Use an ML algorithm to reconstruct the PES of the system, then use the trained model to predict the energy and forces of large configurations.
6
Q
  • What are local atomic environments?
A
  • A description of an atom’s surroundings, taken as everything within a cutoff radius, rc.
  • A larger rc is more computationally expensive to process and results in a larger potential.
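
As a rough sketch of why a larger rc costs more, the snippet below counts how many neighbours fall inside the cutoff for one atom. The system (a 3x3x3 simple cubic lattice with unit spacing) and the function name `neighbours_within_cutoff` are assumptions made for illustration only.

```python
import numpy as np

def neighbours_within_cutoff(positions, centre, rc):
    """Indices of atoms within distance rc of atom `centre` (itself excluded)."""
    distances = np.linalg.norm(positions - positions[centre], axis=1)
    mask = distances <= rc
    mask[centre] = False
    return np.flatnonzero(mask)

# Hypothetical system: 3x3x3 simple cubic lattice with unit spacing.
grid = np.arange(3)
positions = np.array(
    [[x, y, z] for x in grid for y in grid for z in grid], dtype=float
)
centre = 13  # the atom at (1, 1, 1), in the middle of the lattice

for rc in (1.1, 1.5, 1.8):
    n = len(neighbours_within_cutoff(positions, centre, rc))
    print(f"rc = {rc}: {n} neighbours in the local environment")
```

Each increase in rc brings another shell of atoms into the environment, so the descriptor of every atom has more neighbours to process.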
7
Q
  • What are the pros and cons of ML-based interatomic potentials?
A
  • Pros: fast; allows long, large-scale simulations with near quantum-chemistry accuracy (if the data set is good).
  • Cons: the dataset is not easy to craft, and it can take years to improve iteratively.
8
Q
  • Why is there a need for ML in drug discovery?
A
  • Growing demand for improved/new drugs
  • Huge number of combinations available that can’t all be synthesised in the lab
  • ML aids the decision of which compounds are likely to be useful
  • Massively shortens the pipeline of chemical trials
9
Q

What is the process by which ML is used in drug discovery?

A
  • A dataset of small molecules is obtained.
  • Descriptors are used to encode the structures in a form that can be processed by an algorithm.
  • An ML algorithm is trained (optimised) on the encoded dataset.
  • The desired property (e.g. solubility) is predicted.
10
Q
  • What is the main problem with ML?
A
  • Understanding the interactions underpinning the PES of a system (ML potentials) or the structure-function relationship between a drug and its potency (ML drug discovery) is very difficult, as ML is a black box with many hidden layers between input and output.
12
Q
  • What is a descriptor?
A
  • A mathematical object (vector) that contains information about the system, encoded so that it can be readily fed into an ML algorithm. It must often satisfy specific properties such as permutation invariance (the descriptor is unchanged when identical atoms are exchanged).
  • Choosing the descriptor is the most crucial step in ML drug design.
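
Permutation invariance can be checked numerically. The sketch below uses one simple choice of descriptor, the sorted list of interatomic distances; this choice, the function name and the three-atom geometry are illustrative assumptions, and the descriptor treats all atoms as identical (element types are ignored).

```python
import numpy as np

def sorted_distance_descriptor(positions):
    """Sorted interatomic distances: unchanged when atoms are relabelled."""
    i, j = np.triu_indices(len(positions), k=1)
    distances = np.linalg.norm(positions[i] - positions[j], axis=1)
    return np.sort(distances)

# Hypothetical geometry: three identical atoms.
positions = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0]])

# The same structure with the atoms listed in a different order.
permuted = positions[[2, 0, 1]]

# Raw coordinates depend on the atom ordering ...
print(np.allclose(positions.ravel(), permuted.ravel()))
# ... but the descriptor does not.
print(np.allclose(sorted_distance_descriptor(positions),
                  sorted_distance_descriptor(permuted)))
```

Feeding raw coordinates to a model would make it treat the two orderings as different inputs, even though they describe the same structure.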
13
Q
  • Give an example of a simple and a complex descriptor
A
  • Simple: molecular weight
  • Complex: the largest eigenvalue of the adjacency matrix (entries are 1 where two atoms are connected and 0 where they are not), which condenses the matrix into a single characteristic number.
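
A minimal sketch of the complex descriptor, assuming carbon skeletons only (hydrogens omitted) and an arbitrary atom numbering; the function name `largest_adjacency_eigenvalue` and the two example molecules are not from the lecture.

```python
import numpy as np

def largest_adjacency_eigenvalue(n_atoms, bonds):
    """Build the adjacency matrix from a bond list and return its top eigenvalue."""
    adjacency = np.zeros((n_atoms, n_atoms))
    for i, j in bonds:
        adjacency[i, j] = adjacency[j, i] = 1.0  # 1 = connected, 0 = not
    # eigvalsh handles symmetric matrices and returns eigenvalues in ascending order.
    return float(np.linalg.eigvalsh(adjacency)[-1])

# Carbon skeletons only; the atom numbering is an arbitrary assumption.
n_butane = [(0, 1), (1, 2), (2, 3)]    # linear chain C-C-C-C
isobutane = [(0, 1), (0, 2), (0, 3)]   # central carbon bonded to three others

print(largest_adjacency_eigenvalue(4, n_butane))   # about 1.618
print(largest_adjacency_eigenvalue(4, isobutane))  # about 1.732
```

The two isomers have the same molecular weight, so the simple descriptor cannot tell them apart, while the eigenvalue reflects their different connectivity.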
14
Q

(IMP) Write out the largest matrix of the following molecule 1

A
15
Q

(IMP) Write out the largest matrix of the following molecule 2

A

???

16
Q
  • What are some descriptors we could use to describe molecular polarizability?
A
  • Total # of valence electrons (VEs) in the molecule
  • # of C’s in the molecule
  • Hybridization of C’s in molecule
  • # of atoms in molecule
  • # of H’s in molecule
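
Most of the descriptors above can be computed from the molecular formula alone. The sketch below is an illustration under stated assumptions: the function name, the small valence-electron table and the ethanol example are not from the lecture, and hybridization is left out because it needs connectivity information that a formula does not contain.

```python
# Valence electrons per atom for a few common main-group elements.
VALENCE_ELECTRONS = {"H": 1, "C": 4, "N": 5, "O": 6}

def count_descriptors(composition):
    """Count-based descriptors from a {element: number of atoms} dictionary."""
    return {
        "valence_electrons": sum(
            VALENCE_ELECTRONS[element] * count
            for element, count in composition.items()
        ),
        "n_carbon": composition.get("C", 0),
        "n_atoms": sum(composition.values()),
        "n_hydrogen": composition.get("H", 0),
    }

# Hypothetical example molecule: ethanol, C2H6O.
ethanol = {"C": 2, "H": 6, "O": 1}
print(count_descriptors(ethanol))
```

The resulting values can be collected into a vector and used as the input to an ML model that predicts polarizability.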