QSAR Flashcards

Question

Molecular Descriptors 2D and 3D Verloop substituent parameters

Answer 1

* Verloop proposed a set of multi-dimensional steric parameters to help explain the steric influence of substituents on the interaction of organic compounds with macromolecules or drug receptors
* The calculation of Verloop parameters assumes that all atoms have van der Waals radii and uses these to define the substituent's space requirements
* The 5 Verloop parameters define a box that can be used to characterize the shape and volume of the substituent

Answer 2

* L, the length parameter: the maximum length of the substituent along the axis of the bond between the first atom of the substituent and the parent molecule
* B1, the width parameter: the smallest width of the substituent in any direction perpendicular to L
* B2, B3 and B4 are determined by measuring the width of the substituent as follows:
  * In the direction opposite to the axis defined by B1
  * In the 2 directions perpendicular to this axis and to the original bond axis
* The 5 Verloop parameters define a box that can be used to characterize the shape and volume of the substituent
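The measurements above can be sketched in code. This is a minimal illustration, not a reference implementation: it assumes the substituent has already been oriented so that the bond to the parent molecule lies along the x axis with lengths measured from x = 0, it scans a fixed number of directions to find B1, and the function name, coordinates and radii are hypothetical.

```python
import math

def verloop_parameters(atoms, n_angles=360):
    """Return (L, [B1, B2, B3, B4]) for a substituent.

    atoms: list of (x, y, z, vdw_radius) tuples. Assumed frame: the bond to
    the parent molecule lies along the x axis, measured from x = 0.
    """
    # L: furthest extent of any van der Waals sphere along the bond axis.
    L = max(x + r for x, _, _, r in atoms)

    def width(theta):
        # Extent of the substituent in one direction perpendicular to the axis.
        uy, uz = math.cos(theta), math.sin(theta)
        return max(y * uy + z * uz + r for _, y, z, r in atoms)

    # B1: smallest width found by scanning directions around the bond axis.
    angles = [2 * math.pi * k / n_angles for k in range(n_angles)]
    theta1 = min(angles, key=width)
    b1 = width(theta1)
    # Remaining widths: the two directions perpendicular to B1 and the
    # direction opposite to it, reported here in ascending order.
    others = sorted(width(theta1 + k * math.pi / 2) for k in (1, 2, 3))
    return L, [b1] + others

# Hypothetical two-atom substituent (coordinates and radii in angstroms).
atoms = [(0.0, 0.0, 0.0, 1.7), (1.0, 1.0, 0.0, 1.2)]
L, B = verloop_parameters(atoms)
```

The angular scan is a simplification: a finer `n_angles` gives a closer estimate of the true minimum width.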

Answer 3

* Many of the descriptors which can be calculated from the 2D structure rely upon the molecular graph representation because of the need for rapid calculations
* Several of these descriptors have been developed which characterise some aspect of molecular shape, connectivity or atom distribution as a single number
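As an illustration of a single-number descriptor computed from the molecular graph, the sketch below calculates the Wiener index (the sum of bond-count distances between all pairs of heavy atoms). The card does not name a specific descriptor, so the Wiener index is chosen here only as a familiar example, and the adjacency-dictionary input format is an assumption.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all atom pairs in the graph."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives bond-count distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # each pair was counted once from each end

# Hydrogen-suppressed graphs: atom index -> bonded atom indices.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The more compact, branched isomer gets the smaller value, which is how a graph descriptor can encode shape as one number.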

Answer 4

* Use multiple regression (an extension of linear regression)
* This calculates an equation describing the relationship between a single dependent y variable and several explanatory x variables
* It is very important to choose variables (calculated properties) that are not correlated
* Multiple regression is the usual approach but there are others
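One way to check that candidate variables are not correlated is to compute the Pearson correlation for every pair of descriptors before fitting. The sketch below does this in plain Python. The descriptor names, the values and the 0.9 cut-off are all hypothetical and chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(descriptors, threshold=0.9):
    """List descriptor pairs whose |r| reaches the (assumed) threshold."""
    names = list(descriptors)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(descriptors[a], descriptors[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged

# Hypothetical calculated properties for five compounds.
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "MR": [2.1, 4.0, 6.2, 7.9, 10.1],
    "dipole": [3.0, 1.0, 4.0, 1.0, 5.0],
}
flagged = correlated_pairs(descriptors)
```

Here only the `logP`/`MR` pair is flagged, so one of the two would be dropped before building the regression.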

Answer 5

* **Multiple linear regression analysis (MLRA)**
* Free-Wilson analysis
* Cluster analysis
* Pattern recognition
* Factor analysis
* Discriminant analysis
* Principal component analysis (PCA)
* Partial least squares (PLS) analysis
* Comparative molecular field analysis (CoMFA)
* Artificial neural network (ANN)
* Evolutionary algorithms, such as genetic function approximation (GFA)

Answer 6

* **Simple multiple regression**: all the input x variables (calculated properties) are used in the equation to predict y (the bio-activity)
* **Stepwise multiple regression**: a selection algorithm is used to choose a subset of input x variables
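The card does not say which selection algorithm is used, so the sketch below assumes a simple forward selection: at each step it adds the variable that raises r² the most and stops when the gain falls below a cut-off. Full stepwise procedures may also remove variables. The function names, the `min_gain` value and the data are hypothetical.

```python
def fit_and_score(rows, y):
    """Fit y = a1*x1 + ... + c by least squares and return r squared."""
    design = [list(r) + [1.0] for r in rows]  # trailing 1.0 carries c
    k = len(design[0])
    # Augmented normal equations [X^T X | X^T y].
    aug = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           + [sum(d[i] * yi for d, yi in zip(design, y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    coeffs = [aug[i][k] / aug[i][i] for i in range(k)]
    pred = [sum(a * x for a, x in zip(coeffs, d)) for d in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def forward_selection(descriptors, y, min_gain=0.01):
    """Greedily add the variable giving the largest r squared improvement."""
    selected, best_r2 = [], 0.0
    remaining = list(descriptors)
    while remaining:
        scores = {}
        for name in remaining:
            cols = [descriptors[n] for n in selected + [name]]
            scores[name] = fit_and_score([list(r) for r in zip(*cols)], y)
        best = max(scores, key=scores.get)
        if scores[best] - best_r2 < min_gain:  # assumed stopping rule
            break
        selected.append(best)
        remaining.remove(best)
        best_r2 = scores[best]
    return selected, best_r2

# Hypothetical data: y depends on x1 and x2 only; x3 is unrelated.
descriptors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "x3": [5.0, 3.0, 6.0, 2.0, 4.0, 1.0],
}
y = [9.0, 8.0, 19.0, 18.0, 29.0, 28.0]
selected, best_r2 = forward_selection(descriptors, y)
```

With these values the procedure keeps `x2` and `x1` and leaves `x3` out, because adding it no longer improves the fit.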

Answer 7

* Multiple regression calculates an equation describing the relationship between a single dependent y variable and several explanatory x variables: y = a₁x₁ + a₂x₂ + ... + c
* a₁, a₂ etc. and c are constants chosen to give the smallest possible sum of squared differences between the true y values and the y' values predicted using this equation
* y = biological activity
* x₁, x₂ etc. = calculated molecular properties
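A minimal sketch of how the constants can be found, assuming an ordinary least-squares fit solved through the normal equations in plain Python. In practice a statistics package would be used. The function names and the data are hypothetical.

```python
def fit_multiple_regression(rows, y):
    """Return [a1, ..., ak, c] minimising the sum of squared differences.

    rows: one list of descriptor values [x1, ..., xk] per compound.
    """
    design = [list(r) + [1.0] for r in rows]  # trailing 1.0 carries c
    k = len(design[0])
    # Augmented normal equations [X^T X | X^T y].
    aug = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * yi for d, yi in zip(design, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

def predict(coeffs, row):
    """Evaluate y' = a1*x1 + ... + ak*xk + c for one compound."""
    return sum(a * x for a, x in zip(coeffs, row)) + coeffs[-1]

# Hypothetical data generated from y = 2*x1 + 3*x2 + 1.
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]]
y = [9.0, 8.0, 19.0, 18.0, 29.0, 28.0]
coeffs = fit_multiple_regression(rows, y)
```

Because the example data follow the equation exactly, the fit recovers a₁ = 2, a₂ = 3 and c = 1 to within rounding error.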

Answer 8

* Having derived an equation for predicting y from a series of independent variables, one needs to know how reliable predictions made with this equation are likely to be
* The multiple correlation coefficient r² describes how closely the equation fits the data
* If the regression equation describes the data perfectly then r² will be 1.0
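One common way to write r² for a least-squares fit that includes a constant term is shown below. The symbols are assumptions introduced here: y_i is the observed value for compound i, y'_i the value predicted by the equation, and ȳ the mean of the observed values.

```latex
r^2 = 1 - \frac{\sum_i \left(y_i - y'_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}
```

When every prediction equals the observed value, the numerator is zero and r² is 1.0.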

Answer 9

* The major drawback of regression analysis is the danger of overfitting
* This is the risk that an apparently good regression equation will be found, based on a chance numerical relationship between the y variable and one or more of the x variables, rather than a genuine predictive relationship
* The QSAR equation will fit the training data very well but be useless in predicting the activity of a compound not in the training set

Answer 10

* When an overfitted model is used predictively, the predicted values for untested compounds will not be an accurate prediction of the true values (when these are eventually determined)
* Thus the regression equation has NO predictive power
* Use a cross-validation technique to estimate the true predictive power of the regression model
* The best way to avoid an overfitted regression equation is to use just a few carefully selected (non-correlated) x variables, and use as many data points as possible (at least 5 per term in the equation)

Answer 11

* Cross validation provides a rigorous internal check on the models derived using regression, discriminant or partial least squares analysis
* It is used to give an estimate of the true predictive power of the model, i.e. how reliable predicted values for untested compounds are likely to be
* **Leave out one row**: each row is left out in turn, so that the value of each row is predicted from all the others
* **Leave out groups of rows**: groups of rows are left out, excluding a third of the data from each model in a fixed pattern
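The leave-out-one-row scheme can be sketched as follows. To keep the example short it assumes a regression on a single descriptor. The function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def leave_one_out_predictions(xs, ys):
    """Predict each row from a model fitted to all the other rows."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        preds.append(slope * xs[i] + intercept)
    return preds

# Hypothetical descriptor values and activities.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
cv_pred = leave_one_out_predictions(x, y)
```

Each entry of `cv_pred` comes from a model that never saw that row, which is what makes the comparison with the true values an internal test of predictive power.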

Answer 12

* By default, TSAR leaves out groups of rows in a fixed pattern, using three cross validation groups of rows
* A third of the data is deleted and the values of these rows predicted using the rest of the data
* This is repeated for the second and then the third groups
* The model is judged based on these predictions
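A sketch of a three-group split is given below. The card does not describe the pattern TSAR actually uses, so assigning every third row to the same group is purely an assumption, and the function name is hypothetical.

```python
def three_group_splits(n_rows, n_groups=3):
    """Yield (train_rows, test_rows) index lists, one pair per group."""
    for g in range(n_groups):
        # Assumed fixed pattern: row i belongs to group i % n_groups.
        test = [i for i in range(n_rows) if i % n_groups == g]
        train = [i for i in range(n_rows) if i % n_groups != g]
        yield train, test

splits = list(three_group_splits(9))
for train, test in splits:
    print("fit on rows", train, "then predict rows", test)
```

Across the three passes every row is predicted exactly once, always by a model built without it.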

Answer 13

* r²(CV) is derived from cross validation. It is the cross validated equivalent of r²
* This is a key measure of the predictive power of the model
* The closer the value is to 1.0 the better the predictive power
* For a good model r²(CV) should be only slightly lower than r²
* If r²(CV) is much lower than r² then there is probably overfitting
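A minimal sketch of the comparison, assuming r²(CV) is computed with the same formula as r² but using the cross-validated predictions in place of the fitted ones. The data are hypothetical, and the 0.2 gap used to flag overfitting is an arbitrary illustrative cut-off, not a value from the card.

```python
def r_squared(observed, predicted):
    """1 - (sum of squared prediction errors) / (total sum of squares)."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total

def looks_overfitted(r2, r2_cv, max_gap=0.2):
    """Flag a model whose cross-validated fit falls far below its fit."""
    return (r2 - r2_cv) > max_gap  # max_gap is an assumed threshold

# Hypothetical activities, fitted values and cross-validated predictions.
observed = [5.1, 5.9, 6.8, 7.4, 8.3]
fitted = [5.0, 6.0, 6.8, 7.5, 8.2]
cv_predicted = [4.2, 6.9, 6.0, 8.3, 7.4]

r2 = r_squared(observed, fitted)
r2_cv = r_squared(observed, cv_predicted)
overfitted = looks_overfitted(r2, r2_cv)
```

In this example the equation fits the training rows closely but predicts left-out rows poorly, so the gap between the two values is large and the model is flagged.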