Students from the University of California, Merced worked with mentors at Lawrence Livermore National Laboratory (LLNL) to identify drug compounds that could be used to treat COVID-19 during a two-week Data Science Challenge (DSC) that concluded on June 6.
For the first time in the DSC series since the COVID-19 pandemic began in 2020, Lab mentors visited the college campus to provide in-person guidance for five teams of UC Merced students. The students numbered 23 in all, including recent graduates, undergrads and Ph.D. candidates in computer science, engineering, mathematics and biology, with doctoral students serving as team leads. The teams worked together and with Lab computer scientists, both on campus and online, on real-world drug discovery problems, using machine learning (ML) and other advanced tools to find small-molecule inhibitors of SARS-CoV-2, the virus that causes COVID-19.
Kicking off the challenge on May 23, LLNL computer scientist Hyojin Kim gave students an overview of computationally driven drug design at LLNL. She explained how Lab scientists use high-performance computing (HPC) and ML to identify promising drug candidates from millions of molecular compounds, accelerating the long process of new drug discovery.
Kim discussed ML-based approaches to scoring binding potential between ligands and protein targets in the virus and shared tips on designing neural networks to virtually screen drug compounds. Kim also introduced the students to the challenge’s two tasks: to find inhibitors targeting the main protease receptors of SARS-CoV-2 and to predict binding affinity using both molecular descriptors and 3D structures of protein-ligand data.
The students then went to work daily for the next two weeks on the binding problems with help from their Lab mentors, both online and on campus. LLNL computer scientists Brian Gallagher, Kim, Cindy Gonzales, Amar Saini, Mary Silva and Omar DeGuchy took turns visiting the UC Merced campus.
As the 2022 DSC director, Gallagher was participating in his fourth challenge overall and his first as an in-person mentor. He continued to be impressed with the caliber of UC Merced students and said that the ability to meet with the students face-to-face added an entirely new dimension to this year’s event.
“There’s a level of connection with the students that is much harder to generate in a virtual setting,” Gallagher said. “I found that the students were a lot more talkative and asked a lot more questions.”
For the first task, the students screened nearly 2,000 small-molecule compounds, each with more than 200 features, using ML classifiers they built with approaches such as random forests, decision trees and support vector machines. They scored the protein-ligand bindings for effectiveness and evaluated the predictive accuracy of their classifier models.
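As a rough illustration of the first task's workflow (fit a classifier on compound features, then check its accuracy on held-out compounds), here is a minimal Python sketch. It is not the students' code. The data is synthetic, five made-up features stand in for the 200-plus real ones, and a depth-one decision tree (a "stump") substitutes for the random forest, decision tree and support vector machine models the teams actually used. All function names are hypothetical.

```python
import random


def make_synthetic_compounds(n, n_features, seed=0):
    """Generate hypothetical (features, label) pairs; label 1 means 'binds'."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        # Assumption for illustration only: binding depends on feature 0 plus noise.
        label = 1 if x[0] + 0.3 * rng.gauss(0.0, 1.0) > 0 else 0
        data.append((x, label))
    return data


def predict(stump, x):
    """Classify one compound with a (feature, threshold, polarity) stump."""
    feature, threshold, polarity = stump
    return 1 if polarity * (x[feature] - threshold) > 0 else 0


def accuracy(stump, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(stump, x) == y for x, y in rows) / len(rows)


def fit_stump(train):
    """Exhaustively pick the feature, threshold and polarity with the best training accuracy."""
    best_acc, best_stump = -1.0, None
    n_features = len(train[0][0])
    for feature in range(n_features):
        for x, _ in train:  # candidate thresholds are the observed feature values
            for polarity in (1, -1):
                candidate = (feature, x[feature], polarity)
                acc = accuracy(candidate, train)
                if acc > best_acc:
                    best_acc, best_stump = acc, candidate
    return best_stump


data = make_synthetic_compounds(n=200, n_features=5)
train, test = data[:150], data[150:]  # hold out 50 compounds for evaluation
stump = fit_stump(train)
test_accuracy = accuracy(stump, test)
print(f"chosen feature: {stump[0]}, held-out accuracy: {test_accuracy:.2f}")
```

Scoring the model on compounds it never saw during fitting is what "evaluating predictive accuracy" amounts to here. With real descriptor data, the same split-fit-score loop would apply to stronger models.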
Their second, more difficult task required the students to predict binding using 3D atomic representations of molecules, a more complicated set of data. While the mentors recommended convolutional neural networks, the students tried several different methods and compared the performance results.
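The article does not say how the 3D data was encoded. One common way to feed atomic coordinates to a 3D convolutional network is to "voxelize" them: divide a box around the molecule into small cubes and count the atoms of each element type in each cube. The Python sketch below shows that idea under stated assumptions. The function name, box size, grid resolution and toy coordinates are all hypothetical and are not taken from the challenge.

```python
def voxelize(atoms, channels, grid_size=8, box=16.0):
    """Map atoms to a [channel][x][y][z] count grid centered on the origin.

    atoms: list of (element, x, y, z), coordinates in angstroms.
    channels: element symbols; each element gets its own grid channel.
    """
    n = grid_size
    grid = [[[[0.0] * n for _ in range(n)] for _ in range(n)] for _ in channels]
    voxel = box / n   # edge length of one voxel
    half = box / 2.0  # shift so the box spans [-half, +half) on each axis
    for element, x, y, z in atoms:
        if element not in channels:
            continue  # element types without a channel are ignored
        idx = [int((coord + half) // voxel) for coord in (x, y, z)]
        if all(0 <= i < n for i in idx):  # atoms outside the box are dropped
            c = channels.index(element)
            grid[c][idx[0]][idx[1]][idx[2]] += 1.0
    return grid


# Made-up coordinates for illustration; the last atom lies outside the box.
atoms = [
    ("C", 0.5, 0.5, 0.5),
    ("C", 1.0, 0.2, 0.9),
    ("N", -3.1, 2.0, 0.0),
    ("O", 20.0, 0.0, 0.0),
]
channels = ["C", "N", "O"]
grid = voxelize(atoms, channels)
total = sum(v for ch in grid for plane in ch for row in plane for v in row)
print(f"grid shape: {len(grid)} x 8 x 8 x 8, atoms placed: {total:.0f}")
```

A grid like this plays the same role for a 3D convolutional network that a pixel array plays for an image classifier, which is one reason such networks are a natural fit for structural data.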
Lab computer scientist Saini, a UC Merced alumnus, said he wanted to give back to his alma mater. He rotated among the teams to advise them and suggest deep learning approaches for solving the problem.
“I was excited to see the different approaches used by the students, especially from the undergrads,” Saini said. “We were able to show them new and modern approaches to problems that sometimes aren't taught in university courses. The students’ work was amazing given the two-week time frame. From some teams, I learned about libraries I don't really use day-to-day.”
Mentor and Lab computer scientist Silva spent two days on campus with the students, where she discussed how to land a Lab internship, what it’s like to be an LLNL data scientist and the research areas the Lab offers in data science.
“I really enjoyed getting to work with the students, they had very diverse backgrounds, and for some, it’s their first experience working with teammates outside their department,” Silva said. “The fact that the challenge had different levels to tackle was good for the students — if they wanted to explore more advanced machine learning methods, they could move on to Task 2. If they wanted to spend more time visualizing the data, transforming it, or developing a library of initial models, the student could stay on Task 1. The groups did really well in dividing the tasks over the two-week period and had something to present on both.”
For the final briefing session on June 6, the teams presented their work to the mentors and fellow students and explained what they had learned about neural networks, ML tools and algorithm development, as well as the successes and hurdles they encountered along the way. For many of them, the challenge was their first exposure to creating, optimizing and evaluating machine learning models, as well as to coding in Python with libraries such as PyTorch, pandas and scikit-learn.
“I really didn’t know any machine learning coming into this, so I would go line-by-line, code-by-code and try to understand what the code was doing,” said UC Merced student Phoebe Adamyan, who has a degree in bioengineering and biomedical engineering.
Team lead Pradyuma Lanka, a UC Merced Ph.D. student in physiological science, said he enjoyed the opportunity to work on a “challenging and pressing” problem with real-world implications, and also gained management skills through the experience.
“The sense I got was the tasks were hard, especially Task 2, but the underlying problem we are trying to solve, drug screening using machine learning, is much, much harder,” Lanka said. “I’m glad I was able to work with a team from diverse backgrounds; each of us had a varying level of experience with machine learning algorithms and Python. I got to meet wonderful people both at UC Merced and LLNL, and I got to hear about the research experiences as well as the life experiences, which was really valuable for me.”
During the event’s first week, LLNL computer scientist Jonathan Allen presented a seminar on how the Lab performs computationally driven drug discovery, using ML and other tools to virtually screen drug molecules against main protease targets in SARS-CoV-2. Developing new antivirals has been on “life support” due to the time it takes (6 to 10 years) to bring a drug to market, Allen explained, but machine learning models, combined with high-performance computing, could speed up the process and help the nation prepare for the next pandemic.
“With this growth in makeable chemistry, combined with these models, there is an opportunity to improve therapeutics in a way that hasn’t been possible in the previous 30 years,” Allen said. “We’re just scratching the surface right now ... You want to have as many tools in the toolbox as you can have available.”
The DSC also featured a seminar by LLNL electrical engineer Laura Kegelmeyer on using machine learning to manage damage on optics at the National Ignition Facility (NIF). The following week, the students were treated to a virtual tour of NIF. DSC organizers are discussing plans for an in-person visit by the students to LLNL later this year.
A total of seven UC Merced students from the DSC have received full-time summer internships at the Lab, and several have had internships for multiple summers. UC Merced professor Suzanne Sindi has participated in the annual DSC since its inception in 2019.
“After two years of having the students work remotely, it was very exciting to see everyone engaged,” Sindi said. “The students accomplished so much in such a short time — the graduate student mentors were excellent partners in helping to coordinate daily activities. It was wonderful to see students learn new mathematics and explore new careers; mainly, I am just so happy the students learned so much, both about what data science is and then about some incredible opportunities at LLNL.”
In addition to her mentorship, LLNL’s Gonzales played a key role in reviewing student applications, testing challenge problems and coordinating mentors. LLNL’s Jennifer Bellig administered and co-organized the event, scheduling meetings, engaging in pre-Challenge communication with students, working with UC Merced to publicize the program and conducting a virtual information session.
At press time, a second LLNL Data Science Challenge was under way at the University of California, Riverside, and was scheduled to run until July 1. The UC Riverside students are working on the same COVID-19 tasks. That challenge was organized by LLNL and UC Riverside professor Vagelis Papalexakis.
LLNL’s Data Science Institute and the Center for Applied Scientific Computing sponsored the Data Science Challenges. To learn more about DSC or how to become a mentor for future challenges, visit the DSC website.
thomas244 [at] llnl.gov