Running AlphaFold at scale
By Joran Martijn, March 2026
A brief history of AlphaFold
Predicting protein structures from amino acid sequences has long been a central challenge in biochemistry, and many groups have taken a crack at it over the past several decades.
In 2016 Google's DeepMind began exploring protein folding as part of its research into using AI for biological problems. AlphaFold made its first notable appearance in the 13th Critical Assessment of Protein Structure Prediction (CASP13) in December 2018. It showcased its ability to predict protein structures with remarkable accuracy, outperforming other methods.
In 2020, DeepMind released AlphaFold2, which significantly improved upon the original model. It utilized new neural network architectures and training techniques, demonstrating astonishing accuracy in predictions. During CASP14, AlphaFold2 solved protein structures at a level comparable to experimental techniques.
In July 2021, DeepMind published the AlphaFold Protein Structure Database, providing over 350,000 predicted protein structures for the scientific community. This database has since been a vital resource for researchers worldwide.
AlphaFold3 was released in 2024. Its main development relative to its predecessor is that it can predict structures of protein complexes with non-protein molecules (DNA, RNA, ions, lipids, Fe-S clusters, small molecule ligands, post-translational modifications, chemical modifications of nucleic acids). It can also predict structures without proteins, such as ss- and dsDNA and ssRNA chains.
AlphaFold theory
The main tenet of AlphaFold is to use evolutionary signals in protein sequences to inform the reconstruction of the three-dimensional structure of proteins. The idea is that amino acid residues of the same protein that co-evolve are very likely to be close to each other in 3D space. Conversely, pairs that do not co-evolve are thought to be further away from each other.
The exact details are a bit murky but I was able to understand the following:
- AF is a multicomponent AI system that uses deep learning to train its neural network models
- It was trained on experimentally verified structures from the PDB
AlphaFold consists of two phases:
- 1A. Compare the input protein sequence with those in publicly available databases
- 1B. Construct a multiple sequence alignment (MSA)
- 1C. Construct a pair representation, in which the degree of co-evolution is quantified for every residue pair in an all-vs-all comparison
The neural network (called Evoformer in AF2, Pairformer in AF3) allows information to flow continuously between the MSA and the pair representation, interpreting and updating both.
- 2. A structure module in AF2, or 'diffusion module' in AF3, that uses the information extracted from phase 1 to infer the 3D protein structure
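The co-evolution idea behind step 1C can be illustrated with a much simpler classical statistic. The sketch below scores every pair of alignment columns by their mutual information: columns whose residues change together score high, conserved or independently varying columns score zero. This is not what AlphaFold computes (its pair representation is learned by the neural network); it is only a toy stand-in, and the four-sequence alignment is made up for illustration.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def pairwise_scores(msa):
    """All-vs-all column scores for an MSA given as equal-length strings."""
    length = len(msa[0])
    cols = [[seq[i] for seq in msa] for i in range(length)]
    return [
        [mutual_information(cols[i], cols[j]) for j in range(length)]
        for i in range(length)
    ]


# Made-up toy alignment: columns 0 and 2 change together,
# column 1 is fully conserved, column 3 varies independently of column 0.
toy_msa = ["AKDL", "AKDV", "GKEL", "GKEV"]
scores = pairwise_scores(toy_msa)

for row in scores:
    print(" ".join(f"{value:.2f}" for value in row))
```

In this toy alignment, columns 0 and 2 share one full bit of information, while the conserved column 1 carries no co-evolution signal with any other column. Real methods also have to correct for phylogenetic bias and sparse sampling, which this sketch ignores.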
Running AlphaFold on Perun
Currently we have AlphaFold3 installed on Perun.
Get permission to use AlphaFold3
Before you can use it, you need to get a ~1GB file that contains the “model parameters”. Each new user needs to get this file separately. You can get the file by submitting a form, essentially declaring that you are not going to share those parameters with the public. You must use your institutional e-mail address (i.e. your @dal.ca address) and also provide your Gmail address if you have one. Each model parameter file contains a unique identifier linked to a user, so each file is unique. You can fill in and submit the form here.
Once you get approved, normally you would download the file and place it in your filesystem at the appropriate place so you can start using AlphaFold. On Perun, however, you don't need to download that file. You only need to let Balagopal know that you have been approved and he will add you to a list, so that you can start using AlphaFold on Perun.
Searching the protein sequence databases
As explained above, AlphaFold does two things. First it compares the query sequence to sequence databases, and then it actually predicts the 3D structure. The sequence database search is the computationally limiting step and takes a lot more time to run. The structure prediction is relatively quick.
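Because the two stages have such different costs, it can be handy to run them separately. The sketch below assumes the upstream AlphaFold 3 `run_alphafold.py` entry point, its JSON input format, and its `--norun_inference` / `--norun_data_pipeline` switches. How AlphaFold3 is actually launched on Perun (wrapper script, container, scheduler) may differ, and the job name, sequence, output directory and model directory are hypothetical placeholders. It only builds the input and prints the commands; it does not run anything.

```python
import json
import shlex

DB_DIR = "/db1/alphafold3/public_databases"  # database location given on this page
MODEL_DIR = "/path/to/model_parameters"      # hypothetical placeholder
OUTPUT_DIR = "af_output"                     # hypothetical placeholder


def build_fold_input(name, protein_sequence, seeds=(1,)):
    """Minimal AlphaFold 3 input for a single protein chain."""
    return {
        "name": name,
        "modelSeeds": list(seeds),
        "sequences": [{"protein": {"id": "A", "sequence": protein_sequence}}],
        "dialect": "alphafold3",
        "version": 1,
    }


def search_command(json_path):
    """Stage 1: database search only (slow part), no structure prediction."""
    return [
        "python", "run_alphafold.py",
        f"--json_path={json_path}",
        f"--db_dir={DB_DIR}",
        f"--output_dir={OUTPUT_DIR}",
        "--norun_inference",
    ]


def inference_command(data_json_path):
    """Stage 2: structure prediction only, reusing the stage 1 output JSON."""
    return [
        "python", "run_alphafold.py",
        f"--json_path={data_json_path}",
        f"--model_dir={MODEL_DIR}",
        f"--output_dir={OUTPUT_DIR}",
        "--norun_data_pipeline",
    ]


# Made-up example sequence, for illustration only.
fold_input = build_fold_input("example_job", "MKTAYIAKQRQISFVKSHFSRQ")
fold_json = json.dumps(fold_input, indent=2)
search_cmd = search_command("fold_input.json")
inference_cmd = inference_command("path/to/data_pipeline_output.json")

print(fold_json)
print(shlex.join(search_cmd))
print(shlex.join(inference_cmd))
```

Splitting the run this way means the slow search does not have to be repeated when you only want to rerun the prediction, for example with different seeds.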
All the sequence databases (“Big Fantastic Database” (BFD)*, MGnify protein database (MGY), UniProt, SwissProt, PDB, Rfam, RNACentral, NT non-coding RNA) together take up a lot of space (currently ~432 GB on Perun). It is therefore recommended to have at least 1TB of free disk space. On Perun, the databases are stored at /db1/alphafold3/public_databases.
*BFD was created by clustering 2.5 billion protein sequences from Uniprot/TrEMBL+Swissprot, Metaclust, and the Soil Reference Catalog and Marine Eukaryotic Reference Catalog assembled by Plass.
Since there is quite a bit of reading and writing going on in the sequence search phase, it is highly recommended to keep the databases on an SSD. ``/db1/`` is indeed an SSD.
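If you ever set up the databases on another machine, a quick check of the free space against the 1 TB recommendation above could look like the sketch below. The helper name and the path being checked (the current directory) are placeholders chosen for illustration; point it at the filesystem where the databases would live.

```python
import shutil

ONE_TB = 10**12  # recommended free space, in bytes


def free_space_report(path, required_bytes=ONE_TB):
    """Report free space on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return {
        "free_gb": usage.free / 1e9,
        "enough": usage.free >= required_bytes,
    }


# "." is a placeholder; replace it with the intended database location.
report = free_space_report(".")
print(f"free: {report['free_gb']:.0f} GB, enough for the databases: {report['enough']}")
```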
