Lecture 4 Flashcards
- Have an understanding of the basics of:
- Neural networks
- Gaussian processes
- What are two common types of ML algorithms and how do they differ, briefly?
- Neural networks and Gaussian processes
- NNs are more suited to specific applications; however, the two can be equivalent under certain conditions
(IMP) Label this neural network


- How is data fed into a neural network?
- Dataset generated (e.g. N molecules and their solubility values)
- Descriptor values are assigned to the molecules (e.g. the largest eigenvalue λ_max of the adjacency matrix of the i-th molecule)
- Values represent input neurons/nodes forming the input layer
- What are weights in a neural network and how do they aid the formation of further layers?
- Weights are numbers assigned to the connections from the input nodes, used to form the first hidden layer.
- The value y of a node in the first hidden layer (e.g. the third node) is a linear combination of the descriptors and all of its connected weights, as written below
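In symbols (a standard-notation sketch, not necessarily the lecture's exact indexing):

y_j = \sum_i w_{ij} \, x_i

where x_i are the descriptor values in the input layer and w_{ij} is the weight on the connection from input node i to hidden node j.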

(IMP) What is the purpose of the hidden layers?
- They connect the results of our input nodes and weights, and combine them further through additional layers, so that all input values are fully exploited.
- The more hidden layers and nodes there are, the more flexible the functional form is, up to the point where overfitting begins
(PPQ) What is the role of the activation function in artificial NNs? Give an example of one to support your answer.
- Activation functions are non-linear functions (e.g. the sigmoidal function) that transform the linear combinations passed from one hidden layer to the next into non-linear objects.
- These linear combinations (e.g. the relation between descriptors and solubility) are smoothed out, giving the model non-linear capabilities (see the sigmoid below)
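For example, the sigmoid (one common choice of activation function):

f(z) = \frac{1}{1 + e^{-z}}

which maps any linear combination z smoothly onto the interval (0, 1).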

Give the general equation describing the value of a node in the second hidden layer of a NN
- The activation function is only present in the hidden layers (it is effectively absent, i.e. f = 1, for the input layer)
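The equation itself is not reproduced on this card; a plausible reconstruction in standard notation (the layer superscripts are an assumption) is:

y_k^{(2)} = f\left( \sum_j w_{jk}^{(2)} \, y_j^{(1)} \right)

where y_j^{(1)} are the node values of the first hidden layer, w_{jk}^{(2)} are the weights connecting it to the second, and f is the activation function.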

Our output values (e.g. molecular solubility) from our neural network are initially very poor. Why is this?
- The weights are chosen randomly, and all further propagation in our network depends on the linear combination of these weights with the input.
(IMP) How can we solve the issue of poor output values due to initial input weights?
- Use backpropagation, where a cost function, f_cost, is calculated.
- This is the sum of the errors between the output-layer values and the target values (e.g. experimental solubilities), written in terms of the weights.
- Its derivatives are used to improve the initial guess of the weights, so that f_cost is minimised on the next iteration (see the equations below).
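In symbols (a standard squared-error form; the lecture's exact definition may differ), with outputs y_k and targets t_k:

f_{cost} = \sum_k (y_k - t_k)^2, \qquad w \leftarrow w - \eta \, \frac{\partial f_{cost}}{\partial w}

where η is a learning-rate parameter (an assumed symbol here) controlling the size of each weight update.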

- An … represents each set of output values generated.
- At the end of each … we compute the … … and use its derivatives to optimise the …
- We stop this when our model is good enough that the … value of our molecule of choice generates an accurate enough output value.
- An epoch represents each set of output values generated.
- At the end of each epoch we compute the cost function and use its derivatives to optimise the weights.
- We stop this when our model is good enough that the descriptor value of our molecule of choice generates an accurate enough output value.
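Putting the last few cards together, a minimal NumPy sketch of this epoch/backpropagation loop (the dataset, layer sizes, learning rate, and stopping threshold are all made up for illustration):

import numpy as np

# Made-up dataset: descriptor values x and target solubilities t
x = np.array([[0.2], [0.5], [0.9]])
t = np.array([[0.3], [0.6], [0.8]])

rng = np.random.default_rng(0)
w1 = rng.normal(size=(1, 4))  # input -> hidden weights, chosen randomly
w2 = rng.normal(size=(4, 1))  # hidden -> output weights, chosen randomly

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 0.5  # learning rate (an assumed value)
for epoch in range(1000):
    h = sigmoid(x @ w1)            # hidden layer: linear combination + activation
    y = h @ w2                     # output layer: this epoch's set of output values
    f_cost = np.sum((y - t) ** 2)  # cost function computed at the end of the epoch
    if f_cost < 1e-4:              # stop once the model is good enough
        break
    # Backpropagation: derivatives of f_cost used to improve the weight guesses
    dy = 2 * (y - t)
    dw2 = h.T @ dy
    dh = (dy @ w2.T) * h * (1 - h)  # chain rule through the sigmoid
    dw1 = x.T @ dh
    w1 -= eta * dw1
    w2 -= eta * dw2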
What are Gaussian processes?
- Mathematical objects which can be used to fit data through regression, via the generalisation of a normal (Gaussian) distribution to infinite dimensions.
- Describe the features of a 2D (bivariate) normal distribution
- The covariance tells us how similar the two dimensions are with respect to one another
- The mean tells us the average point (the centre) of the distribution, as written below
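In matrix form (standard notation; the symbols are an assumption, not the lecture's):

\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & c \\ c & \sigma_2^2 \end{pmatrix}

where c is the covariance between the two dimensions and σ_1², σ_2² are their variances.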

- How can Gaussian processes be improved, as we did with the weights in NNs?
- Bayesian inference improves a prior GP distribution guess according to the information provided by the dataset (conditioning).
- This is similar to assigning weights in a NN, where our model also depends on some parameters, e.g. the elements of the covariance matrix
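For reference, the standard noise-free conditioning (posterior mean) result, which may be stated differently in the lecture:

\mu(x_*) = \mathbf{k}_*^{\top} K^{-1} \mathbf{y}

where K is the covariance (kernel) matrix over the training descriptors, k_* is the vector of covariances between a new point x_* and the training points, and y are the training targets.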

(IMP) What are Kernel functions (K)?
- The covariance matrix of our GP defines the shape of the ensemble of Gaussians in space.
- A functional form for it must be written in terms of hyper-parameters that can be optimised.
- This mathematical expression is required because otherwise the covariance is just an arbitrary set of numbers in a matrix.
- For each element (i, j) of the covariance matrix we can write an expression called a kernel, which is a function of the descriptor points x_i, x_j (belonging to the data points y_i, y_j) in our dataset.
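That is (standard notation; writing θ for the collected hyper-parameters is an assumption):

K_{ij} = k(x_i, x_j; \theta)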

(IMP) Choosing the Kernel functional form can be very challenging. Give an example of a common choice
- The radial basis function (RBF) kernel
- It is a measure of the similarity between two descriptors (i.e. between two molecules), as it is a function of the distance between them (see the form below)
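A common form of the RBF kernel, with length-scale hyper-parameter L (a variance prefactor is sometimes included):

k(x_i, x_j) = \exp\left( -\frac{(x_i - x_j)^2}{2L^2} \right)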

(IMP)
- The hyper-parameter L is the quantity … according to the … … … to obtain our ML model using … .
- It quantifies the … … by which two descriptors are close or not, giving the … of the resulting GP.
- Analogous to the weights in NNs
(IMP)
- The hyper-parameter L is the quantity optimised according to the log marginal likelihood to obtain our ML model using GP’s.
- It quantifies the length scale by which two descriptors are close or not, giving the smoothness of the resulting GP.
- Analogous to the weights in NNs
- Discuss the resulting differences in the hyper-parameter L used for these GPs

- L = 0.3
- The model goes through the training points perfectly; however, the use of too many narrow Gaussians produces large errors between the points, in an attempt to be too precise.
- L = 1
- Goes through all the points perfectly, and does not overfit.
- L = 3
- The curve is too smooth, leading to a poor average through the points as a result of underfitting.
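A minimal scikit-learn sketch of this comparison (the 1D data are made up; optimizer=None keeps each length scale fixed instead of optimising it):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Made-up 1D training data
X = np.array([[0.1], [0.9], [1.7], [2.8], [3.6]])
y = np.sin(X).ravel()
X_grid = np.linspace(0.0, 4.0, 50).reshape(-1, 1)

for L in (0.3, 1.0, 3.0):
    # Fix the length scale: no hyper-parameter optimisation for this demo
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=L), optimizer=None)
    gp.fit(X, y)
    y_grid = gp.predict(X_grid)  # compare how wiggly or smooth each fit is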

- (IMP) What is overfitting in ML?
- After many regressions, the error with respect to our cost function (or log marginal likelihood) is very small, giving numerically sound results relative to our input data.
- However, we can reach a point where so much fitting is done that a curve like the L = 0.3 case forms, which gives very poor predictions in practice.
(IMP) How can we solve the issue of overfitting?
- Split our data into training and test sets.
- The training set (~80%) is used to build a model and the test set (~20%) is used to evaluate its predictive capabilities.
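A minimal sketch of this split with scikit-learn (the data here are made up; the 80/20 ratio follows the card):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Made-up descriptor/target data
X = np.linspace(0.0, 4.0, 50).reshape(-1, 1)
y = np.sin(X).ravel()

# Training set (~80%) builds the model; test set (~20%) evaluates it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
model.fit(X_train, y_train)
train_error = np.mean((model.predict(X_train) - y_train) ** 2)
test_error = np.mean((model.predict(X_test) - y_test) ** 2)
# A very small train_error alongside a large test_error signals overfitting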
(IMP) What would overfitting indicate about training/test data
- If the error on the training set is very small while the error in its predictive capabilities (on the test set) is high, the data is overfit.
- The opposite would indicate underfit data.
(IMP) Why is it easier to spot overfitting in NNs than GPs?
- NN training is split into epochs, so when the error on the test data starts to diverge from the error on the training data, we can move back to the epoch just before this happens (see the sketch below).
- Overfitting in GPs is more difficult to detect and fix.
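A sketch of that roll-back idea (early stopping); train_one_epoch and test_error_of are hypothetical placeholders standing in for real training and evaluation steps:

import copy
import numpy as np

# Hypothetical stand-ins for a real training pipeline (illustration only)
rng = np.random.default_rng(0)
weights = rng.normal(size=4)                    # initial random weights
train_one_epoch = lambda w: w - 0.1 * w         # placeholder "one epoch of training"
test_error_of = lambda w: float(np.sum(w ** 2)) # placeholder test-set error

best_weights, best_test_error = None, float("inf")
for epoch in range(100):
    weights = train_one_epoch(weights)
    err = test_error_of(weights)
    if err < best_test_error:
        best_test_error = err
        best_weights = copy.deepcopy(weights)   # remember the best epoch so far
    else:
        break  # test error diverging from training error: keep best_weights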
- Good numerical … of an ML model with respect to the training set does NOT ensure the … of its … capabilities (IMP - sketch)
- Even if the model goes through all the points (… ≈ 0), the … in between these points may be large.
- Good numerical accuracy of an ML model with respect to the training set does NOT ensure the accuracy of its predictive capabilities (IMP - sketch)
- Even if the model goes through all the points (f_cost ≈ 0), the error in between these points may be large.
