Lecture 4 Flashcards

Have an understanding of the basics of:
  • Neural networks
  • Gaussian processes

1
Q
  • What are two common types of ML algorithms and how do they differ, briefly?
A
  • Neural networks and Gaussian processes
  • NN’s are more suited to specific applications; however, the two can be shown to be equivalent under certain conditions
2
Q

(IMP) Label this neural network

A
4
Q
  • How is data fed into a neural network?
A
  • A dataset is generated (e.g. N molecules and their solubility values)
  • Descriptor values are assigned to the molecules (e.g. the largest eigenvalue λ_i^max of the adjacency matrix of the i-th molecule)
  • These values form the input neurons/nodes of the input layer (see the sketch below)
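A minimal sketch of this setup in Python (shapes and values are illustrative placeholders, not lecture data):

```python
import numpy as np

# Hypothetical dataset: N molecules, each described by d descriptor values
# (e.g. the largest adjacency-matrix eigenvalue, molecular weight, ...).
N, d = 100, 3
X = np.random.rand(N, d)   # descriptor matrix: one row per molecule
y = np.random.rand(N)      # target property, e.g. solubility values

# Each row of X supplies the input-node values for one molecule.
x_input = X[0]             # input-layer values for the first molecule
```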
5
Q
  • What are weights in a neural network and how do they aid the formation of further layers?
A
  • Weights are numbers assigned to the connections from the input nodes; combining the inputs with these weights forms the first hidden layer.
  • The value y of the third node in the first hidden layer is a linear combination of the descriptors and all the weights connected to it (see the formula below)
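Written out in standard notation (labels assumed, since the lecture’s own are not reproduced here), the value of node j in the first hidden layer is

$$ y_j = \sum_i w_{ij}\, x_i $$

where the x_i are the descriptor (input-node) values and w_{ij} is the weight on the connection from input node i to hidden node j.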
6
Q

(IMP) What is the purpose of the hidden layers?

A
  • They connect the results of our input nodes and weights, and combine them through further additional layers to make full use of all the input values.
  • The more hidden layers and nodes there are, the more flexible the functional form is, up to the point where overfitting begins
7
Q

(PPQ) What is the role of the activation function in artificial NN’s? Give an example of one to support your answer.

A
  • Activation functions are non-linear functions (e.g. the sigmoidal function) that transform the linear combinations passed from one hidden layer to the next into non-linear objects.
  • These linear combinations (e.g. the relation between descriptors and solubility) are smoothed out, giving the network its non-linear capabilities (see the sketch below)
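A minimal sketch of a sigmoidal activation acting on a node’s linear combination (values are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    # Sigmoidal activation: maps any real number smoothly into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 1.2, -0.3])   # illustrative descriptor values
w = np.array([0.1, -0.4, 0.7])   # illustrative weights
y_node = sigmoid(np.dot(w, x))   # non-linear node value passed to the next layer
```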
8
Q

Give the general equation describing the value of a node in the second hidden layer of a NN

A
  • The activation function is only present in the hidden layers (it is taken as 1 for the input layer); the equation is written out below
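The equation itself appears to have been given as a figure; a standard form consistent with the answer (notation assumed) is

$$ y_j^{(2)} = f\!\left( \sum_k w_{jk}^{(2)}\, y_k^{(1)} \right) $$

where f is the activation function, the y_k^{(1)} are the node values of the first hidden layer, and the w_{jk}^{(2)} are the weights connecting the two layers; for the input layer f is replaced by 1.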
9
Q

Our output values (e.g. molecular solubility) from our neural network are very poor initially. Why is this?

A
  • The weights are chosen randomly, and all further propagation in our network depends on the linear combination of these weights with the input.
10
Q

(IMP) How can we solve the issue of poor output values due to initial input weights?

A
  • Use backpropagation, in which a cost function f_cost is calculated.
  • This is the sum of the errors between the output-layer values and the target values (e.g. experimental solubility), written in terms of the weights.
  • Its derivatives are used to improve the initial guess of the weights, so as to minimise f_cost on the next iteration (see the expressions below).
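In symbols (the squared-error form and the learning rate η are common choices assumed here, not quotes from the lecture):

$$ f_{\text{cost}} = \sum_n \left( y_n^{\text{out}} - y_n^{\text{target}} \right)^2, \qquad w \;\leftarrow\; w - \eta\, \frac{\partial f_{\text{cost}}}{\partial w} $$

Each weight is nudged against the gradient of the cost, which is what drives f_cost down from one iteration to the next.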
11
Q
  • An _____ represents each set of output values generated.
  • At the end of each _____ we compute the _____ and use its derivatives to optimise the _____.
  • We stop this when our model is good enough that the _____ value of our molecule of choice generates an accurate enough output value.
A
  • An epoch represents each set of output values generated.
  • At the end of each epoch we compute the cost function and use its derivatives to optimise the weights.
  • We stop this when our model is good enough that the descriptor value of our molecule of choice generates an accurate enough output value.
12
Q

What are Gaussian processes?

A
  • Mathematical objects that can be used to fit data through regression, via the generalisation of the normal Gaussian distribution to infinitely many dimensions.
13
Q
  • Describe the features of a 2D (bi-variate) normal distribution
A
  • The covariance tells us how similar the two dimensions are with respect to one another
  • The mean tells us the average point within the distribution; both are written out below
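In standard notation (assumed rather than quoted), a bi-variate normal is specified by a mean vector and a covariance matrix:

$$ \boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & c \\ c & \sigma_2^2 \end{pmatrix}, \qquad c = \operatorname{cov}(x_1, x_2) $$

The off-diagonal element c measures how the two dimensions vary with respect to one another, while μ locates the centre of the distribution.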
14
Q
  • How can Gaussian processes be improved, as we did with the weights in NN’s?
A
  • Bayesian inference improves a prior GP distribution guess according to the information provided in the dataset (conditioning); see the formulas below
  • This is similar to assigning weights in a NN, where our model is also dependent on some parameters, e.g. the elements of the covariance matrix
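For reference, the standard (noise-free) conditioning formulas for the posterior mean and variance at a test point x_* are (standard results assumed here, not quoted from the lecture):

$$ \mu_* = \mathbf{k}_*^{\top} K^{-1} \mathbf{y}, \qquad \sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^{\top} K^{-1} \mathbf{k}_* $$

where K is the covariance matrix over the training descriptors, k_* collects the covariances between x_* and the training points, and y holds the training targets.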
15
Q

(IMP) What are kernel functions (K)?

A
  • The covariance matrix of our GP defines the shape of the ensemble of Gaussians in space.
  • A functional form for it must be written in terms of hyper-parameters that can be optimised.
  • This mathematical expression is required because the covariance matrix is otherwise just an arbitrary set of numbers.
  • For each element (i, j) of the covariance matrix we can write an expression called a kernel, which is a function of the descriptor points x_i, x_j (of the data points y_i, y_j) in our dataset, as below
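In symbols, each matrix element is generated by the kernel function:

$$ K_{ij} = k(x_i, x_j) $$

so choosing the functional form of k (and its hyper-parameters) fixes the entire covariance matrix.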
16
Q

(IMP) Choosing the kernel functional form can be very challenging. Give an example of a common choice

A
  • The radial basis function (RBF) kernel
  • It is a measure of the similarity between two descriptors (i.e. between two molecules), as it is a function of the distance between them; its usual form is given below
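The usual form (standard, with L the length-scale hyper-parameter of the next card):

$$ k(x_i, x_j) = \exp\!\left( -\frac{(x_i - x_j)^2}{2L^2} \right) $$

Descriptors that are close on the scale of L give k ≈ 1 (very similar molecules); distant ones give k ≈ 0.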
17
Q

(IMP)

  • The hyper-parameter L is the quantity _____ according to the _____ to obtain our ML model using _____.
  • It quantifies the _____ by which two descriptors are close or not, giving the _____ of the resulting GP.
  • Like weights in NN’s
A

(IMP)

  • The hyper-parameter L is the quantity optimised according to the log marginal likelihood to obtain our ML model using GP’s.
  • It quantifies the length scale by which two descriptors are close or not, giving the smoothness of the resulting GP.
  • Like weights in NN’s
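A minimal sketch of this optimisation using scikit-learn (the library choice and data are assumptions for illustration): fitting the GP maximises the log marginal likelihood over the kernel hyper-parameters, including L.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0.0, 5.0, 8).reshape(-1, 1)  # illustrative descriptor values
y = np.sin(X).ravel()                        # illustrative target values

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X, y)  # optimises the length scale L via the log marginal likelihood

print(gp.kernel_)                            # fitted kernel with optimised L
print(gp.log_marginal_likelihood_value_)     # the maximised objective
```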
18
Q
  • Discuss the resulting differences in the hyper-parameters used for these GPs
A
  • L = 0.3: the model goes through every training point, but uses too many narrow Gaussians; in trying to be too precise it produces large errors between the points.
  • L = 1: goes through all the points and does not overfit.
  • L = 3: the curve is too smooth, averaging poorly over the points as a result of underfitting.
19
Q
  • (IMP) What is overfitting in ML?
A
  • After many regressions, the error with respect to our cost function (or log marginal likelihood) is very small, giving numerically sound results relative to our input data.
  • However, a point is reached where so much fitting has been done that a curve like the L = 0.3 case forms, which gives very poor predictions in practice.
20
Q

(IMP) How can we solve the issue of overfitting?

A
  • Split our data into training and test sets.
  • The training set (~80%) is used to build a model and the test set (~20%) is used to evaluate its predictive capabilities.
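A minimal sketch of the split with scikit-learn (assumed tooling; the data are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)   # illustrative descriptor matrix
y = np.random.rand(100)      # illustrative target values

# ~80% of the data builds the model; the held-out ~20% tests its predictions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```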
21
Q

(IMP) What would overfitting indicate about training/test data

A
  • If the error on the training set is very small but the error in the model’s predictive capabilities (on the test set) is high, the model is overfit.
  • The opposite would indicate an underfit model.
22
Q

(IMP) Why is it easier to spot overfitting in NNs than GPs?

A
  • NN training is split into epochs, so when the test error starts to diverge from the training error we can simply move back to the epoch before this occurred (sketched below).
  • Overfitting in GPs is more difficult to detect and fix.
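A schematic of that roll-back idea (pseudo-code: initialise_weights, train_one_epoch, and error_on_test_set are hypothetical helpers, not a real API):

```python
# Early stopping: keep the weights from the epoch just before the test error
# begins to rise while the training error is still falling.
max_epochs = 100
weights = initialise_weights()             # hypothetical helper
best_error, best_weights = float("inf"), None
for epoch in range(max_epochs):
    weights = train_one_epoch(weights)     # hypothetical: one training pass
    test_err = error_on_test_set(weights)  # hypothetical: predictive error
    if test_err < best_error:
        best_error, best_weights = test_err, weights
    else:
        break  # divergence detected: revert to best_weights
```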
23
Q
  • Good numerical _____ of ML model with respect to the training set does NOT ensure the _____ of its _____ capabilities (IMP - sketch)
  • Even if going through all points (_____ ≈ 0), _____ in between these points may be large
A
  • Good numerical accuracy of the ML model with respect to the training set does NOT ensure the accuracy of its predictive capabilities (IMP - sketch)
  • Even if the model goes through all the points (f_cost ≈ 0), the error in between these points may be large.