‘Book Club’ meeting for reading a textbook on natural language processing. I am the girl in the black tank top in the back.

This summer, I am working in Dr. Andrew Ellington’s lab at the University of Texas at Austin, in the field of computational biology. My research is mainly computational, using AI models such as neural networks to make predictions about protein folding, drug-protein interactions, and the functions of proteins or chemicals. There are many ways to ‘tokenize’ macromolecules for machine learning, that is, to convert them into a computer-readable format, including inputting sequences of amino acids or nucleotides, but in this project we are using a string representation of chemicals called SMILES. From these strings, we can ‘featurize’ the molecule, converting its SMILES string into a numeric vector that indicates the presence of various structures, such as functional groups.
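To make that concrete, here is a minimal sketch of one way to featurize a SMILES string, assuming the RDKit library and its MACCS structural keys; the specific featurizer we end up using in the lab may well be different.

```python
# A minimal featurization sketch (assumes RDKit; the lab's actual
# featurizer may differ). MACCS keys form a 167-bit vector where each
# bit flags the presence of a predefined substructure.
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys

def featurize(smiles: str) -> np.ndarray:
    """Convert a SMILES string into a binary substructure-presence vector."""
    mol = Chem.MolFromSmiles(smiles)   # parse the string into a molecule
    if mol is None:                    # invalid SMILES parses to None
        raise ValueError(f"could not parse SMILES: {smiles}")
    return np.array(MACCSkeys.GenMACCSKeys(mol))

print(featurize("CCO").shape)  # ethanol -> (167,)
```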

Our goal is to develop a dataset of chemicals predicted to be significantly different, functionally, from the chemicals in whatever task a model is being fine-tuned on, in order to make fine-tuning these models more accurate. For example, we will test on tasks such as blood-brain barrier permeability, solubility, and clinical toxicity. This has immense potential for improving AI models’ ability to predict a chemical’s propensity for harm, helping to screen out toxic molecules before they ever reach animal or human trials. If we can quickly predict the function of a chemical, drug testing will become safer and more efficient.
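To give a feel for how such a dataset might be assembled, here is a rough sketch that uses structural dissimilarity (Tanimoto similarity between Morgan fingerprints) as a stand-in for functional difference; both that proxy and the cutoff value are my own illustrative assumptions, not the lab’s actual method.

```python
# Rough sketch: keep candidate chemicals that are structurally unlike a
# task's dataset. Tanimoto similarity on Morgan fingerprints stands in
# as a crude proxy for functional difference; the 0.2 threshold is an
# arbitrary illustrative choice.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

def select_dissimilar(candidates, task_smiles, threshold=0.2):
    """Keep candidates whose highest similarity to the task set stays below threshold."""
    task_fps = [fingerprint(s) for s in task_smiles]
    keep = []
    for smi in candidates:
        fp = fingerprint(smi)
        if max(DataStructs.TanimotoSimilarity(fp, t) for t in task_fps) < threshold:
            keep.append(smi)
    return keep

# e.g. screen two candidates against a tiny made-up task dataset
print(select_dissimilar(["c1ccccc1O", "CC(=O)O"], ["CCO", "CCN"]))
```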

Some goals I have for the upcoming weeks include understanding the foundations of the AI models we will be working with, using Python to write code that converts SMILES strings to feature vectors and runs models on them, and identifying the tasks and datasets we will use for our finished experiment. To build that foundation, I have been reading a textbook on natural language processing, which is the kind of model we are using. I will also work on several functions that walk through the steps of converting our dataset into a computer-readable format and feeding it into our model, and continue looking for datasets to use.
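As a rough picture of those steps, the sketch below goes from SMILES strings to feature vectors to predictions from a simple baseline classifier; the labels and the random-forest model are placeholders I chose for illustration, since our actual tasks, datasets, and models are still to be decided.

```python
# End-to-end pipeline sketch: SMILES -> feature vectors -> model predictions.
# The labels and the random-forest baseline are illustrative placeholders.
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from sklearn.ensemble import RandomForestClassifier

def featurize(smiles):
    """SMILES -> 167-bit MACCS substructure vector (as in the earlier sketch)."""
    return np.array(MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(smiles)))

smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CCN(CC)CC"]   # toy molecules
labels = [0, 0, 1, 1]                                  # hypothetical binary labels

X = np.stack([featurize(s) for s in smiles])           # build the feature matrix
model = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(model.predict(featurize("c1ccccc1O").reshape(1, -1)))  # predict for phenol
```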