Blog post #2 by Kristi Xing

Diagram of a chemical with the structure written in SMILES-string format below it
An example of a chemical from a drug I am working with, shown both as a picture and as a SMILES string.

After getting up to speed on the foundations of machine learning and its applications in biology, I have been writing Python scripts that use scikit-learn and RDKit to fit and train various machine learning models. The ones we are using include a basic linear model, Random Forest, and Support Vector Machines. I have been splitting various datasets into training and testing sets, and am currently writing a script that fits each of these models to different training sets. Because the properties we predict have known labels, this is supervised learning: after the models are trained on the various labeled training sets, we test them all on the same test set so that their performance can be compared fairly. Currently, we are testing how well these models perform at tasks such as predicting blood-brain barrier penetration or solubility of drugs. If we can show that our method performs better than existing methods, we could possibly turn this into a paper, which is very exciting.
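To make the workflow above a bit more concrete, here is a minimal sketch of the idea in scikit-learn, not the actual script from the project. Everything specific in it is an assumption for illustration: the data is synthetic (from `make_classification`) and stands in for molecular features that would really be computed from SMILES strings with RDKit, the task is treated as a yes/no classification (as blood-brain barrier penetration often is), logistic regression plays the role of the "basic linear model", and the names `subset_a` and `subset_b` are hypothetical training sets.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a featurized, labeled dataset (assumption:
# real features would come from RDKit descriptors or fingerprints).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)

# Hold out one test set that every model is evaluated on.
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Two hypothetical training sets drawn from the remaining pool.
training_sets = {
    "subset_a": (X_pool[:150], y_pool[:150]),
    "subset_b": (X_pool[150:], y_pool[150:]),
}

# The three model families mentioned in the post. The linear model and
# the SVM are scale-sensitive, so they get a StandardScaler in front.
models = {
    "linear": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

# Fit a fresh copy of each model on each training set, then score it
# (accuracy) on the shared test set.
results = {}
for train_name, (X_train, y_train) in training_sets.items():
    for model_name, model in models.items():
        fitted = clone(model).fit(X_train, y_train)
        results[(train_name, model_name)] = fitted.score(X_test, y_test)

for (train_name, model_name), accuracy in results.items():
    print(f"{train_name} / {model_name}: test accuracy = {accuracy:.3f}")
```

Using `clone` gives each training set its own untrained copy of the model, so nothing learned from one training set leaks into the next run. For a regression task such as solubility, the same loop would work with regressors (for example `LinearRegression`, `RandomForestRegressor`, and `SVR`) and a regression metric in place of accuracy.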