Scientists have developed a new open-source software tool called OpenFold that uses artificial intelligence and harnesses the power of supercomputers to predict protein structures. The image overlays OpenFold and AlphaFold2 predictions onto the experimental structure of the Streptomyces tokunonensis TokK protein, showing that OpenFold matches the accuracy of AlphaFold2. Credit: Nature Methods (2024). DOI: 10.1038/s41592-024-02272-z
Form follows function. This is especially true for proteins, the building blocks of life. The way a protein folds into its shape reveals its life-supporting function.
Scientists have developed a new open-source software tool called OpenFold that uses artificial intelligence (AI) and harnesses the power of supercomputers to predict protein structures.
The research could aid in the development of new drugs and a better understanding of abnormal proteins linked to neurodegenerative diseases such as Parkinson's and Alzheimer's.
OpenFold builds on the success of AlphaFold2, which was developed by Google DeepMind and has been used by more than 2 million researchers since 2021 to predict protein structures for vaccine development, cancer treatments, and more.
“AlphaFold2 was a milestone for science,” said Nazim Bouatta, a senior research scientist working at the intersection of AI and biology at Harvard Medical School. “We developed a fully open-source version, OpenFold, that is now helping academia and industry move the field forward.”
Bouatta co-authored a research paper in Nature Methods announcing OpenFold, a fast, memory-efficient and trainable implementation of AlphaFold2.
He started the project with his colleague Mohammed AlQuraishi, formerly of Harvard and now at Columbia University. The project has since grown into the OpenFold Consortium, a coalition of startups working in partnership with academia.
“Very bright students from Harvard and Columbia also contributed to this work. Gustaf Ahdritz did a fantastic job. They all did a fantastic job in implementing the code,” Bouatta said.
A core part of today's AI is the large language model (LLM), which is trained on large amounts of text and generates new, meaningful text from it. ChatGPT, for example, can answer queries in a human-like way because it was trained on large amounts of text data.
“To train a system like OpenFold, you would need roughly 100 graphics processing units (GPUs). To put that in perspective, to train state-of-the-art ChatGPT, you would need thousands of GPUs,” Bouatta said.
One of the earliest applications of OpenFold comes from Meta AI, formerly Facebook, which recently released an atlas of more than 600 million previously uncharacterized proteins from bacteria, viruses and other microbes.
“They used OpenFold to integrate a 'protein language model,' which is very similar to ChatGPT, but the language is the amino acids that make up proteins,” Bouatta said.
“In a sense, the information in life is organized in a language,” Bouatta explained, citing the letters ACGT, which stand for the four bases in DNA: adenine, cytosine, guanine and thymine. “This is the language that nature has chosen to build these advanced organisms.”
In addition, there is a second layer of language for proteins: letters that represent the 20 amino acids that make up all proteins in the human body and characterize the function of the protein.
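The idea of treating a protein as text can be made concrete with a minimal Python sketch. It is an illustration only, not OpenFold's actual input pipeline: the names AMINO_ACIDS, TOKEN_ID and tokenize are hypothetical, and the example sequence is made up. The sketch maps each of the 20 standard one-letter amino acid codes to an integer, the same kind of step a language model applies to words before processing them.

```python
# Illustrative only: a toy "vocabulary" for the protein language.
# These names are hypothetical and are not part of OpenFold.

# One-letter codes of the 20 standard amino acids, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Map each letter to an integer id, as a language model does for words.
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def tokenize(sequence: str) -> list[int]:
    """Convert a protein sequence into a list of integer token ids."""
    ids = []
    for aa in sequence.upper():
        if aa not in TOKEN_ID:
            raise ValueError(f"not a standard amino acid: {aa!r}")
        ids.append(TOKEN_ID[aa])
    return ids


if __name__ == "__main__":
    # A made-up five-residue sequence, used only to show the mapping.
    print(tokenize("MKTAY"))
```

A real structure predictor feeds far richer inputs than these ids, but the starting point is the same: a string of letters drawn from a 20-symbol alphabet.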
Genome sequencing has generated a huge amount of data about the letters of life, but until now there has been no 'dictionary' that can extract those letters and use them to represent protein shapes in 3D and model the sites that bind small molecules.
“Machine learning allows us to take strings of amino acids that represent any kind of protein possible, run advanced algorithms to return elaborate three-dimensional structures that are close to what you get in experiments. The OpenFold algorithm is very sophisticated and uses new developments that we're familiar with from ChatGPT and others,” Bouatta said, referring to the transformer architecture developed at Google, a key element of the algorithms behind ChatGPT.
A key advantage of OpenFold is that it allows scientists to train models using their own data, something that's not possible with the publicly available version of AlphaFold2. “Being able to train systems with OpenFold opens up huge avenues for research in both academia and industry,” Bouatta said.
In the coming months, Bouatta plans to release an extension of OpenFold that can characterize protein-ligand complexes, in which small molecules are bound to proteins.
“This is how a drug's mechanism of action is achieved, so it's particularly important to understand this,” he explained.
The Texas Advanced Computing Center (TACC) awarded the OpenFold team an allocation on its Frontera and Lonestar6 supercomputers, specifically on the GPU nodes that power AI applications.
“TACC has been a very good collaborator,” Bouatta said, “and we're grateful to them for giving us access to these resources, which have allowed us to deploy machine learning and AI at the scale we needed.”
“The combination of supercomputers and AI is fundamentally changing how we approach biology. The power of supercomputers is that they can predict 100 million structures in just a few months. Once you train the system, you can get a structure in a matter of seconds. But they don't replace experimentation, because you still need to go back to the lab to test your ideas.”
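A rough back-of-the-envelope calculation shows why both figures in that quote point to massively parallel hardware. The sketch below is not from the researchers: it assumes "a few months" means 90 days and "a matter of seconds" means 5 seconds per structure, and the function names are hypothetical.

```python
import math

# Assumed values for illustration; the article gives only rough figures.
STRUCTURES = 100_000_000      # "100 million structures"
DAYS = 90                     # assumption: "a few months" = 90 days
SECONDS_PER_STRUCTURE = 5.0   # assumption: "a matter of seconds" = 5 s


def required_rate(structures: int, days: float) -> float:
    """Structures per second needed to finish within the given days."""
    return structures / (days * 24 * 60 * 60)


def workers_needed(structures: int, days: float, seconds_each: float) -> int:
    """Minimum number of predictions that must run in parallel."""
    return math.ceil(required_rate(structures, days) * seconds_each)


if __name__ == "__main__":
    rate = required_rate(STRUCTURES, DAYS)
    workers = workers_needed(STRUCTURES, DAYS, SECONDS_PER_STRUCTURE)
    print(f"required throughput: {rate:.2f} structures per second")
    print(f"parallel predictions needed: {workers}")
```

Under these assumptions the job needs roughly 13 structures per second sustained for three months, which a single processor producing one structure every few seconds cannot deliver; dozens of predictions have to run side by side.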
Integrating AI systems like OpenFold with traditional physics-based systems can help scientists understand life at its most fundamental level, paving the way for treatments for neurodegenerative diseases.
“Supercomputers are our modern microscopes for biology and drug discovery,” Bouatta concluded. “If we continue to put more resources into using AI and computational approaches with supercomputers, we will improve our ability to understand life and treat disease.”
Further information: Gustaf Ahdritz et al., “OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization.” Nature Methods (2024). DOI: 10.1038/s41592-024-02272-z
Courtesy of The University of Texas at Austin
Source: AI, Computation, and the Folds of Life: Supercomputers Help Train Software Tools for the Protein Modeling Community (August 13, 2024) Retrieved August 13, 2024 from https://phys.org/news/2024-08-ai-life-supercomputers-software-tool.html