Lecture 3 Flashcards
- Possess basic concepts of machine learning
- Know what a molecular descriptor is
1
Q
- What is Machine Learning and what do we use it for in chemistry?
A
- The practice of using algorithms to parse data, learn from it, and then make a prediction about something.
- In chemistry, ML-based interatomic potentials can be used to compute the FES accurately and quickly.
- Can also be used for drug design and predicting the properties of molecules before synthesising them.
2
Q
(IMP)
- ML methods are good at … data, but poor at … predictions.
- In other words, ML can give accurate predictions … … … … (filling the gaps), but is poor at predicting data … of it.
- The … /… of data accessible and how we … it is much more important than the algorithm we feed it into.
A
(IMP)
- ML methods are good at interpolating data, but poor at extrapolating predictions.
- In other words, ML can give accurate predictions within a data set (filling the gaps), but is poor at predicting data outside of it.
- The amount/quality of data accessible and how we describe it is much more important than the algorithm we feed it into.
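The interpolation/extrapolation point can be illustrated with a minimal sketch. It assumes a toy setting that is not from the lecture: a degree-5 polynomial stands in for the ML model and sin(x) stands in for the data. The fit is accurate inside the training range and fails badly outside it.

```python
import numpy as np

# Training data: sin(x) sampled only on [0, 2*pi] (toy data, chosen for illustration).
x_train = np.linspace(0.0, 2.0 * np.pi, 50)
y_train = np.sin(x_train)

# "Model": a degree-5 polynomial fitted by least squares (stand-in for an ML model).
coeffs = np.polyfit(x_train, y_train, deg=5)

# Points inside the training range (interpolation) and beyond it (extrapolation).
x_inside = np.linspace(0.5, 5.5, 100)
x_outside = np.linspace(2.5 * np.pi, 3.0 * np.pi, 100)

interp_error = float(np.max(np.abs(np.polyval(coeffs, x_inside) - np.sin(x_inside))))
extrap_error = float(np.max(np.abs(np.polyval(coeffs, x_outside) - np.sin(x_outside))))

print(f"max error inside training range:  {interp_error:.3f}")
print(f"max error outside training range: {extrap_error:.3f}")
```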
3
Q
What is the pipeline of ML?
A
- This describes supervised learning (learns according to given target values)
- (1) Dataset formed from input data
- (2) Clean, prepare and manipulate data by forming descriptors
- (3) Train model using ML algorithm and parameter optimisation
- (4) Test data; are predictions in line with expected data?
- (5) Improve by expanding data set.
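Steps (1) to (4) can be sketched as follows. This is a minimal illustration, not the lecture's workflow: the data are synthetic, the "descriptors" are just the raw features plus a constant, and ordinary least squares stands in for the ML algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Dataset: synthetic inputs and target values (made up for illustration).
n_samples = 200
raw = rng.uniform(0.0, 1.0, size=(n_samples, 2))
noise = rng.normal(0.0, 0.01, n_samples)
targets = 3.0 * raw[:, 0] - 2.0 * raw[:, 1] + 0.5 + noise

# (2) Descriptors: here simply the raw features plus a constant column.
descriptors = np.column_stack([raw, np.ones(n_samples)])

# (3) Train on the first 160 samples; parameter optimisation is least squares.
n_train = 160
weights, *_ = np.linalg.lstsq(descriptors[:n_train], targets[:n_train], rcond=None)

# (4) Test on the held-out 40 samples: are predictions in line with expected data?
predictions = descriptors[n_train:] @ weights
rmse = float(np.sqrt(np.mean((predictions - targets[n_train:]) ** 2)))
print(f"test RMSE: {rmse:.4f}")
```

Step (5) would correspond to adding more samples and repeating the loop if the test error were too high.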
4
Q
Why is there a need for ML-based interatomic potentials in simulations?
A
- Classical forcefields lack the detail to accurately reproduce complex systems, as functional forms can be very limiting.
- The timescales these systems evolve over are also far too long for quantum calculations.
5
Q
- How would we build a ML-based interatomic potential?
A
- Build a dataset of small configurations of the system, where energy and forces are accurately calculated initially (e.g. DFT – expensive part)
- Represent these configurations in terms of local atomic environments (LAE)
- Use an ML algorithm to reconstruct the PES of the system, then use the trained model to predict the energy and forces of large configurations.
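A toy sketch of this workflow, under strong assumptions: a Lennard-Jones dimer in reduced units stands in for the expensive reference calculation (no DFT is run), the configuration is described by inverse powers of the bond length, and a linear least-squares fit stands in for the ML algorithm. The function names are hypothetical, and only energies are fitted here, not forces.

```python
import numpy as np

def reference_energy(r):
    """Stand-in for the expensive reference method (Lennard-Jones, reduced units)."""
    return 4.0 * (r ** -12 - r ** -6)

def descriptor(r):
    """Represent each dimer configuration by inverse powers of its bond length."""
    r = np.asarray(r, dtype=float)
    return np.column_stack([r ** -12, r ** -6])

# Build the dataset: small configurations with "accurate" reference energies.
r_train = np.linspace(0.95, 2.5, 40)
e_train = reference_energy(r_train)

# Fit the model: linear least squares on the descriptors.
weights, *_ = np.linalg.lstsq(descriptor(r_train), e_train, rcond=None)

# Predict energies for configurations that were not in the training set.
r_new = np.array([1.0, 1.12, 1.7])
e_pred = descriptor(r_new) @ weights
print(e_pred)
```

The fit is exact here only because the model has the same functional form as the reference; a real system would need far richer descriptors and a more flexible model.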
6
Q
- What are local atomic environments?
A
- A description of an atom's surroundings within a cutoff radius, rc.
- A larger rc will be more computationally expensive to process and will result in a larger potential.
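As a rough illustration of why a larger rc costs more, the sketch below counts the neighbours of one atom inside the cutoff. It assumes a simple cubic lattice with unit spacing, which is an arbitrary choice for the example and not taken from the lecture.

```python
import numpy as np

# 5 x 5 x 5 simple cubic lattice with unit spacing (illustrative system).
grid = np.arange(5)
positions = np.array(
    [[x, y, z] for x in grid for y in grid for z in grid], dtype=float
)
centre = np.array([2.0, 2.0, 2.0])  # the atom whose environment we describe

def neighbours_within(positions, centre, rc):
    """Count atoms within the cutoff radius rc of the central atom (excluding itself)."""
    distances = np.linalg.norm(positions - centre, axis=1)
    return int(np.sum((distances > 0.0) & (distances <= rc)))

for rc in (1.1, 1.5, 1.8):
    print(rc, neighbours_within(positions, centre, rc))
# 1.1 6
# 1.5 18
# 1.8 26
```

The number of neighbours, and hence the amount of information to process per atom, grows quickly with rc.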
7
Q
- What are the pros and cons of ML-based interatomic potentials?
A
- Pros: fast, long, large-scale simulations with quasi-quantum-chemistry accuracy (if the data set is good)
- Cons: Not easy to craft dataset, can take years to improve iteratively.
8
Q
- Why is there a need for ML in drug discovery?
A
- Growing demand for improved/new drugs
- Huge number of combinations available that can’t all be synthesised in the lab
- ML aids decision for which compounds are likely to be useful
- Massively reduces pipeline of chemical trials
9
Q
What is the process by which ML is used in drug discovery?
A
- Dataset of small molecules obtained
- Descriptors used to encode the structures in the dataset in a form that can be processed by an algorithm
- ML algorithm trained (optimised) on the dataset
- Desired property (e.g. solubility) predicted
10
Q
- What is the main problem with ML?
A
- Understanding the interactions underpinning the PES of a system (ML potentials) or the structure-function relationship between drug and potency (ML drug discovery) is very difficult, as ML is a black box with many hidden layers between input and output.
12
Q
- What is a descriptor?
A
- A mathematical object (vector) that contains information about the system, encoded so it can be readily fed into an ML algorithm. It must often satisfy specific properties such as permutation invariance (same descriptor when exchanging identical atoms).
- Building the descriptor is the most crucial step in ML drug design
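Permutation invariance can be checked with a small sketch. The descriptor used here, the sorted list of interatomic distances, is one simple choice made for illustration and is not necessarily the one used in the lecture. The coordinates are rough, made-up values for a three-atom molecule. Note that this descriptor ignores element types, so it is invariant to exchanging any two atoms, not only identical ones.

```python
import numpy as np

def sorted_distance_descriptor(coords):
    """Vector of all pairwise distances, sorted so that atom ordering does not matter."""
    coords = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)  # all unique atom pairs
    return np.sort(np.linalg.norm(coords[i] - coords[j], axis=1))

# Illustrative coordinates (arbitrary units): a central atom and two outer atoms.
molecule = np.array([[0.0, 0.0, 0.0],
                     [0.96, 0.0, 0.0],
                     [-0.24, 0.93, 0.0]])

# Same molecule with the two outer atoms listed in the opposite order.
swapped = molecule[[0, 2, 1]]

d_original = sorted_distance_descriptor(molecule)
d_swapped = sorted_distance_descriptor(swapped)
print(np.allclose(d_original, d_swapped))  # True
```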
13
Q
- Give an example of a simple and a complex descriptor
A
- Simple: molecular weight
- Complex: Largest eigenvalue of the adjacency matrix (entry 1 if two atoms are bonded, 0 if not), which condenses the connectivity of the molecule into a single number.
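A minimal sketch of the adjacency-matrix descriptor. The two molecules are chosen here for illustration only (they are not the molecules from the lecture cards): the heavy-atom skeletons of a three-carbon chain and a three-carbon ring, with hydrogens ignored.

```python
import numpy as np

def largest_adjacency_eigenvalue(adjacency):
    """Largest eigenvalue of a symmetric adjacency matrix (1 = bonded, 0 = not bonded)."""
    adjacency = np.asarray(adjacency, dtype=float)
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(adjacency)[-1])

# Three-atom chain C1-C2-C3: only neighbouring atoms are bonded.
chain = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]

# Three-atom ring: every atom is bonded to the other two.
ring = [[0, 1, 1],
        [1, 0, 1],
        [1, 1, 0]]

print(largest_adjacency_eigenvalue(chain))  # sqrt(2), about 1.414
print(largest_adjacency_eigenvalue(ring))   # 2.0 (up to rounding)
```

Relabelling the atoms permutes rows and columns of the matrix together, which leaves the eigenvalues unchanged, so this descriptor is permutation invariant.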
14
Q
(IMP) Write out the largest matrix of the following molecule 1
A
15
Q
(IMP) Write out the largest matrix of the following molecule 2
A
???