Chemoinformatics (Químio-Informática) - Descriptors/MLR Tutorial Flashcards
The goal of this tutorial was…
In this tutorial you will learn to build models that learn from experimental data and can then predict a property from the molecular structure.
How do you calculate molecular descriptors using CDK Descriptor?
1.2.a. Choose as “Input File” the file with the SMILES saved in 1.1 (by clicking “Browse”), and as “Output file” a new file where the descriptors will be recorded (e.g. 1.2.xls).
1.2.b. In the “Options” menu → “Output Method”, choose “Tab delimited” (so that the descriptor columns are separated by tabs).
1.2.c. Choose (tick) all descriptors except:
in “Electronic” → Charged Partial Surface Areas (because 3D structures would be needed)
in “geometrical” → Gravitational Index, Length Over Breadth and Moments of Inertia (as they would need 3D structures)
in “constitutional” → ALogP and XLogP (because they were optimized for properties very similar to solubility)
in “hybrid” → WHIM (because they need 3D structures)
(all other descriptors remain selected; only the ones listed above are deselected)
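Because the output file is tab delimited, it can also be read outside a spreadsheet. The following is a minimal sketch using only the Python standard library; the column names and values in `sample` are made up for illustration and are not real CDK output, and the actual header of your file may differ.

```python
import csv
import io

def read_descriptor_table(text):
    """Parse tab-delimited descriptor text into one dict per molecule."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

# Hypothetical miniature of a descriptor file (names and values are illustrative).
sample = (
    "Title\tnAtom\tMW\n"
    "mol1\t9\t46.07\n"
    "mol2\t12\t60.10\n"
)

rows = read_descriptor_table(sample)
for row in rows:
    # Values are read as strings; convert the ones you need to numbers.
    print(row["Title"], float(row["MW"]))
```

To read a real file, the same function can be fed with `open(path, newline="").read()`, where `path` is wherever you saved the output in 1.2.a.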
Once you have the descriptor values from the Excel file generated in the previous step, which software do you use to build a linear regression?
Now open the Weka software to process the descriptors and build a linear regression. Select “Explorer”. Click “Open file …”, choose “CSV data files” under “Files of Type”, and select the file you recorded in 1.3. Tick the option “Invoke options dialog” at the top right of the dialog and click “Open”. In the new window, in “fieldSeparator”, type “,” or “;” depending on how the columns are separated in your .csv file (you can check by opening the .csv file with Notepad). Click “OK”.
If you click the “Visualize” tab at the top, you can see how each descriptor relates to the solubility.
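The separator check done by eye in Notepad can also be sketched in a few lines of Python. This is only an illustration: `guess_separator` is a hypothetical helper, and it assumes the first line of the file is a header of column names (so a decimal comma inside numbers does not confuse the count).

```python
def guess_separator(header_line, candidates=(",", ";")):
    """Return the candidate separator that occurs most often in the header line."""
    counts = {sep: header_line.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("none of the candidate separators was found")
    return best

# Hypothetical header lines, as they might appear at the top of a .csv file.
print(guess_separator("Title;nAtom;MW;Solubility"))
print(guess_separator("Title,nAtom,MW,Solubility"))
```

Whatever character this reports is the one to type into “fieldSeparator”.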
How do you select, in Weka, the subset of descriptors that will be used to build a QSPR model?
Click the “Select attributes” tab at the top. Click the “Start” button. With the default option for evaluating descriptor subsets (CfsSubsetEval), the program scores each subset with a formula that takes into account the correlation of each descriptor with the solubility and the cross-correlation between descriptors within the subset. The selected descriptors are reported in the right pane.
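To make the idea behind this scoring concrete, here is a minimal sketch of the merit heuristic from correlation-based feature selection (CFS), which is usually written as merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the number of descriptors, r_cf the mean descriptor-property correlation and r_ff the mean descriptor-descriptor correlation. Using absolute Pearson correlations is an assumption made here for illustration; how Weka estimates the correlations and searches the subsets may differ. The data are toy values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def cfs_merit(columns, target):
    """CFS-style merit of a descriptor subset (higher is better)."""
    k = len(columns)
    r_cf = sum(abs(pearson(c, target)) for c in columns) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Toy data: d1 tracks the property, d2 duplicates d1, d3 is unrelated to it.
solubility = [1.0, 2.0, 3.0, 4.0]
d1 = [1.0, 2.0, 3.0, 4.0]
d2 = [2.0, 4.0, 6.0, 8.0]
d3 = [1.0, -1.0, -1.0, 1.0]

print(cfs_merit([d1], solubility))
print(cfs_merit([d1, d2], solubility))  # redundant descriptor: no gain
print(cfs_merit([d1, d3], solubility))  # irrelevant descriptor: lower merit
```

The sketch shows why a subset is rewarded for descriptors that correlate with the property and penalized for descriptors that correlate with each other.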
How do you build a linear regression model with the training set in Weka?
Select the “Classify” tab. Under “Classifier” click the “Choose” button and choose “classifiers → functions → LinearRegression”. Choose “Use training set” under “Test options”. Click the “Start” button. In the right pane you can see the result. If you right-click the last entry of the “Result list” in the left pane and choose “Visualize classifier errors”, you can see a plot of the predicted vs. experimental values.
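For readers who want to see what a multiple linear regression does with the descriptors, the sketch below fits ordinary least squares by solving the normal equations in plain Python. It is not Weka's implementation: Weka's LinearRegression has additional options (such as attribute selection and a ridge parameter), so its coefficients may differ. The function names and the toy data are illustrative.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares; returns [intercept, coef_1, ..., coef_p]."""
    rows = [[1.0] + [float(v) for v in x] for x in X]  # prepend intercept column
    p = len(rows[0])
    # Normal equations (A^T A) w = A^T y, stored as an augmented matrix.
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("descriptors are collinear; cannot solve")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

def predict(weights, x):
    """Predicted property for one molecule with descriptor values x."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))

# Toy training set generated from y = 1 + 2*x1 - 0.5*x2 (not real solubility data).
X_train = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]]
y_train = [1.0, 3.0, 0.5, 2.5, 3.5]

weights = fit_linear_regression(X_train, y_train)
print(weights)
print(predict(weights, [2, 2]))
```

Predicting the training molecules with `predict` and plotting the values against `y_train` gives the same kind of predicted vs. experimental plot that “Visualize classifier errors” shows.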
Now that you have built a model to predict the solubility from molecular descriptors, you will evaluate how well the model predicts compounds it has never seen before.
Back in the Weka window, under “Test options”, choose “Supplied test set” and click on “Set …”. Choose “Open file …” and select the file you saved with the test set.
Right-click on the last row of the “Result list” panel and select “Re-evaluate model on current test set”. If a message appears, click “OK”. You can see the result in the right panel.
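The result for the test set is summarized by error statistics of the kind Weka reports for a numeric property, such as the correlation coefficient, the mean absolute error and the root mean squared error. A minimal sketch of how these three are computed follows; the function name and the toy values are illustrative, not taken from the tutorial data.

```python
import math

def regression_metrics(experimental, predicted):
    """Correlation coefficient, MAE and RMSE between predictions and experiment."""
    n = len(experimental)
    errors = [p - e for p, e in zip(predicted, experimental)]
    mae = sum(abs(d) for d in errors) / n
    rmse = math.sqrt(sum(d * d for d in errors) / n)
    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    cov = sum((e - mean_e) * (p - mean_p) for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_e * var_p)
    return {"correlation": r, "mae": mae, "rmse": rmse}

# Toy test set: every prediction is 0.5 too high.
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]

metrics = regression_metrics(experimental, predicted)
print(metrics)
```

In this toy case the correlation is perfect even though every prediction is off by 0.5, which is why it helps to read the error values together with the correlation coefficient when comparing training-set and test-set results.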