
Running AlphaFold at scale

By Joran Martijn, March 2026

A brief history of AlphaFold

The challenge of predicting protein structures from amino acid sequences has long been one of the central problems in biochemistry, and many groups have taken a crack at it over the past several decades.

In 2016 Google's DeepMind began exploring protein folding as part of its research into using AI for biological problems. AlphaFold made its first notable appearance in the 13th Critical Assessment of Protein Structure Prediction (CASP13) in December 2018. It showcased its ability to predict protein structures with remarkable accuracy, outperforming other methods.

In 2020, DeepMind released AlphaFold2, which significantly improved upon the original model. It utilized new neural network architectures and training techniques, demonstrating astonishing accuracy in predictions. During CASP14, AlphaFold2 solved protein structures at a level comparable to experimental techniques.

In July 2021, DeepMind published the AlphaFold Protein Structure Database, providing over 350,000 predicted protein structures for the scientific community. This database has since been a vital resource for researchers worldwide.

AlphaFold3 was released in 2024. Its main advance over its predecessor is that it can predict structures of protein complexes with non-protein molecules (DNA, RNA, ions, lipids, Fe-S clusters, small-molecule ligands, post-translational modifications, chemical modifications of nucleic acids). It can also predict structures without any protein at all, such as ss- and dsDNA and ssRNA chains.

AlphaFold theory

The main tenet of AlphaFold is to use evolutionary signals in protein sequences to inform the reconstruction of the three-dimensional structure of proteins. The idea is that amino acid residues of the same protein that co-evolve are very likely to be close to each other in 3D space. Conversely, pairs of residues that do not co-evolve are expected to be further apart.

The exact details are a bit murky but I was able to understand the following:

  • AF is a multi-component AI system that uses deep learning to train its neural network models
  • It was trained on experimentally determined structures from the PDB

AlphaFold consists of two phases:

  • 1A. Compare the input protein sequence with those in publicly available databases
  • 1B. Construct a multiple sequence alignment (MSA)
  • 1C. Construct a pair representation, in which the degree of co-evolution is quantified for each pair of residues in an all-vs-all comparison

The neural network (called Evoformer in AF2, Pairformer in AF3) allows information to flow continuously between the MSA and the pair representation; it interprets and updates both representations.

  • 2. A structure module in AF2, or 'diffusion module' in AF3, that uses the information extracted from phase 1 to infer the 3D protein structure
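To make the co-evolution idea from phase 1 concrete, here is a toy sketch that scores a pair of MSA columns with mutual information. This is only an illustration of the principle; AlphaFold does not use this formula, it learns residue-pair couplings inside the network.

```python
from collections import Counter
from math import log2

def column_mi(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.
    High MI = the columns vary in lockstep across sequences, a
    classic (toy) proxy for co-evolving residue pairs."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in pab.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((pa[a] / n) * (pb[b] / n)))
    return mi

# Four aligned sequences; columns 0 and 1 vary in lockstep
# ("co-evolving"), while columns 0 and 2 are independent.
msa = ["AKL", "AKV", "GRL", "GRV"]
cols = list(zip(*msa))
print(column_mi(cols[0], cols[1]))  # 1.0 bit: perfectly correlated
print(column_mi(cols[0], cols[2]))  # 0.0 bits: independent
```

In a real MSA, high-MI column pairs would be candidates for spatial contacts in the folded structure, which is exactly the signal the pair representation is built to capture.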

Running AlphaFold on Perun

Currently we have AlphaFold3 installed on Perun.

Get permission to use AlphaFold3

Before you can use it, you need to obtain a ~1 GB file that contains the “model parameters”. Each new user needs to obtain this file separately, by submitting a form in which you essentially declare that you will not share the parameters with the public. You must use your institutional e-mail address (i.e. your @dal.ca address) and also provide your Gmail address if you have one. Each model parameter file contains a unique identifier linked to its user, so every copy is unique. You can fill in and submit the form here.

Once you get approved, you would normally download the file and place it in the appropriate spot on your filesystem so you can start using AlphaFold. On Perun, however, you don't need to download the file; just let Balagopal know that you have been approved and he will add you to a list, after which you can start using AlphaFold on Perun.

Searching the protein sequence databases

As explained above, AlphaFold does two things: first it compares the query sequence to sequence databases, and then it predicts the 3D structure. The sequence database search is the rate-limiting step and takes much more time to run; the structure module is relatively quick.

The sequence databases

All the sequence databases (“Big Fantastic Database” (BFD)*, MGnify protein database (MGY), UniProt, SwissProt, PDB, Rfam, RNACentral, NT non-coding RNA) together take up a lot of space (currently ~432 GB on Perun). It is therefore recommended to have at least 1TB of free disk space. On Perun, the databases are stored at /db1/alphafold3/public_databases.

*BFD was created by clustering 2.5 billion protein sequences from UniProt (TrEMBL + Swiss-Prot), Metaclust, and the Soil Reference Catalog and Marine Eukaryotic Reference Catalog assembled with Plass.

Since the sequence search phase involves quite a bit of reading and writing, it is highly recommended to use an SSD. /db1/ is indeed an SSD.

It is most efficient to run the sequence search phase (run_alphafold.py --norun_inference) using CPUs, and the structure prediction (run_alphafold.py --norun_data_pipeline) using a GPU.

Preparing AlphaFold3 input query files

AlphaFold3 is a little awkward about how it wants its input protein sequences formatted. Instead of a simple FASTA file, it requires a JSON file. Thankfully, I found a relatively simple way of converting the FASTA into a JSON file by slightly editing a Python script I found online. It is called fasta2alphafoldjson.py and it is available on the Gospel of Andrew GitHub page. Run it simply as follows:

./fasta2alphafoldjson.py [input.fasta] > [output.json]

NOTE: Currently the script is hardcoded to make JSON entries of the 'alphafold3' dialect. This dialect is appropriate if you want to feed AlphaFold a single query protein. In my setup this works, because I break a multi-query job up into many single-query jobs. If, however, you want to run a multi-query job as is, you'll need the JSON entries to be of the 'alphafoldserver' dialect, so you'll have to edit the code slightly by replacing "dialect": "alphafold3" with "dialect": "alphafoldserver".
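For reference, a minimal sketch of such a FASTA-to-JSON conversion might look like the following. The field names mirror the AF3 input format as I understand it; the actual fasta2alphafoldjson.py may differ in details, and the "version" and "modelSeeds" values below are assumptions to check against your AF3 installation.

```python
import json

def fasta_to_af3_json(fasta_text, dialect="alphafold3"):
    """Minimal sketch: turn each FASTA record into one AF3 input job.
    'version' and 'modelSeeds' below are assumptions, not gospel."""
    records, header, seq = [], None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].split()[0], []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return [
        {
            "dialect": dialect,   # 'alphafold3' or 'alphafoldserver'
            "version": 1,         # assumption; check your AF3 version
            "name": name,
            "modelSeeds": [1],    # assumption
            "sequences": [{"protein": {"id": "A", "sequence": s}}],
        }
        for name, s in records
    ]

# a two-record FASTA becomes a list of two single-query jobs
jobs = fasta_to_af3_json(">prot1\nMAEVHRKL\n>prot2\nMKTAYIAK\n")
print(json.dumps(jobs[0], indent=2))
```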

If your [input.fasta] was a FASTA file with many protein sequences, your [output.json] will also contain all those protein sequences. Because we want to split our multi-query job into many individual-query jobs, we'll have to split up that JSON into many individual JSON files as well. We can do so as follows:

mkdir -p batch_split_jsons
jq -c '.[]' [output.json] | awk '{print $0 > "batch_split_jsons/input_" NR ".json"}'

jq splits up the JSON file into one object per line, and awk then writes each line to its own numbered file. Each new JSON file now contains a single protein sequence.
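If jq is not available, the same split can be done with a few lines of Python. This is a sketch that mirrors the jq/awk one-liner above, including its input_1.json, input_2.json, … naming scheme:

```python
import json
from pathlib import Path

def split_json_array(json_path, out_dir="batch_split_jsons"):
    """Write each element of a JSON array to its own numbered file,
    mirroring the jq/awk one-liner (input_1.json, input_2.json, ...)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    entries = json.loads(Path(json_path).read_text())
    for i, entry in enumerate(entries, start=1):
        (out / f"input_{i}.json").write_text(json.dumps(entry))
    return len(entries)
```

Numbering starts at 1 on purpose, so the filenames line up with the SGE task IDs used in the array job below (qsub -t starts counting at 1).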

Submitting AlphaFold as an array job

If you want to run AlphaFold on many proteins, it may be pragmatic to submit these sequence database searches as an array job on Perun. An array job is a single job that in turn schedules many individual tasks. The idea here is to submit a single general “AlphaFold sequence search job” that then schedules and submits an individual search job for each query protein.

#!/bin/bash

#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 10
#$ -q 768G-batch,2T-batch


# $SGE_TASK_ID returns a number, which is its current task number
# for example if you do
# qsub -t 1-1000
# then each job in the array will have a unique SGE_TASK_ID, varying from 1 to 1000
# here, we are using that ID to get the corresponding file
SINGLE_JSON_FILE="batch_split_jsons/input_${SGE_TASK_ID}.json"


# setting up environment
# first ensure that we are using the right python installation for run_alphafold.py
# this is needed to make the 'source activate' below work
export PATH=/scratch5/software/miniconda3/bin:$PATH

# then append path to include the run_alphafold.py script
# NOTE: run_alphafold.py was edited slightly to include the #!/usr/bin/env python shebang at the top
export PATH=/scratch5/software/alphafold3:$PATH

# activate environment
source activate alphafold3

# dir with all the sequence databases
DB_DIR='/db1/alphafold3/public_databases'
# where to find the model parameter file
MODEL_DIR='/scratch5/db/alphafold3-model'

# input protein or proteins
#
# if you want to submit a single protein, use a JSON of the 'alphafold3' dialect
# if you want to submit multiple protein queries, you can use a JSON of the 'alphafoldserver' dialect
# or you can specify an INPUT_DIR where each query is a single JSON file of the 'alphafold3' dialect

# sequence search output directory
SEQ_OUTPUT_DIR='batch_sequence_search_out'
# threads to use
THREADS=10

mkdir -p $SEQ_OUTPUT_DIR

# do sequence search with fast CPUs
run_alphafold.py \
    --norun_inference \
    --json_path $SINGLE_JSON_FILE \
    --output_dir $SEQ_OUTPUT_DIR \
    --db_dir $DB_DIR \
    --nhmmer_n_cpu $THREADS \
    --jackhmmer_n_cpu $THREADS

What is this magic $SGE_TASK_ID variable??

SINGLE_JSON_FILE="batch_split_jsons/input_${SGE_TASK_ID}.json"

$SGE_TASK_ID is a special variable that is useful when you submit an array job. Array jobs are submitted by passing the -t flag to qsub. For example, qsub -t 1-1000 tells qsub to submit an array job of 1000 individual tasks. When, say, the 52nd task is started, its $SGE_TASK_ID variable gets assigned the number 52, so in the example above the variable SINGLE_JSON_FILE becomes batch_split_jsons/input_52.json. That file is then passed to run_alphafold.py --norun_inference at the bottom of the script.

So, to submit this part of AlphaFold as an array job, do:

qsub -t 1-1000 -tc 10 -p -128 run_alphafold_cpu.arrayjob.sh
  • Adjust the 1000 to the number of proteins you wish to analyze.
  • The -tc 10 means that at most 10 individual tasks run at any one time, so as not to overwhelm Perun
  • The -p -128 gives the individual tasks a lower priority. This ensures that other people's jobs, which won't take as long, have a higher priority and don't have to wait for your whole array job to finish

Once AlphaFold starts running, it should generate for each query protein an output directory named after the protein, under batch_sequence_search_out. For example, batch_sequence_search_out/pyv62_000500. In that directory you will find an output JSON file, for example pyv62_000500_data.json. This file looks something like:

{
  "dialect": "alphafold3",
  "version": 2,
  "name": "PYV62_000500",
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "MAEVHRKLVSQVFPQTSIAMPLAGCGVQNVLRSQSAAVASVTKAAARTFSDMVTIDGNTAASHVAFHMSDLSCLFPITPATPMGENVDAWAAQGKKNIYGVVPGVYQLNSEAGASGSLHGALLAGALAASYTSSQGLLLMPPNMLKC>
        "modifications": [],
        "unpairedMsa": ">query\nMAEVHRKLVSQVFPQTSIAMPLAGCGVQNVLRSQSAAVASVTKAAARTFSDMVTIDGNTAASHVAFHMSDLSCLFPITPATPMGENVDAWAAQGKKNIYGVVPGVYQLNSEAGASGSLHGALLAGALAASYTSSQG>
        "pairedMsa": ">query\nMAEVHRKLVSQVFPQTSIAMPLAGCGVQNVLRSQSAAVASVTKAAARTFSDMVTIDGNTAASHVAFHMSDLSCLFPITPATPMGENVDAWAAQGKKNIYGVVPGVYQLNSEAGASGSLHGALLAGALAASYTSSQGLL>
        "templates": [
          {
            "mmcif": "data_6CIN\n#\n_entry.id 6CIN\n#\nloop_\n_chem_comp.formula\n_chem_comp.formula_weight\n_chem_comp.id\n_chem_comp.mon_nstd_flag\n_chem_comp.name\n_>
            "queryIndices": [52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86,>
            "templateIndices": [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,>
          },
          {
            "mmcif": "data_1B0P\n#\n_entry.id 1B0P\n#\nloop_\n_chem_comp.formula\n_chem_comp.formula_weight\n_chem_comp.id\n_chem_comp.mon_nstd_flag\n_chem_comp.name\n_>
            "queryIndices": [49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,>
            "templateIndices": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 3>
          },
          {
            "mmcif": "data_5C4I\n#\n_entry.id 5C4I\n#\nloop_\n_chem_comp.formula\n_chem_comp.formula_weight\n_chem_comp.id\n_chem_comp.mon_nstd_flag\n_chem_comp.name\n_>
            "queryIndices": [51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,>
            "templateIndices": [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,>
          },
          {
            "mmcif": "data_5C4I\n#\n_entry.id 5C4I\n#\nloop_\n_chem_comp.formula\n_chem_comp.formula_weight\n_chem_comp.id\n_chem_comp.mon_nstd_flag\n_chem_comp.name\n_>
            "queryIndices": [1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1>
            "templateIndices": [90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 1>
          }
        ]

NOTE that this view of the file is truncated with >. The lines are much longer in the actual file; it would be impossible to show the entire file here in the wiki. These files can get quite big: the one pasted above was 75 MB, and I've seen up to 327 MB.

As you can see, it contains the MSA (although I don't yet understand what is meant by paired and unpaired MSA) and a bunch of templates, which I'm also not sure about. In any case, it is this JSON file that serves as the input for the structure module, i.e. the second phase of AlphaFold.
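A quick way to sanity-check these search results before launching the GPU phase is to summarize each _data.json. This sketch assumes the single-protein layout and field names shown in the example above; adjust it for multimer jobs or other AF3 versions.

```python
import json

def summarize_af3_data_json(path):
    """Summarize an AlphaFold3 *_data.json sequence-search result.
    Assumes the single-protein layout shown above."""
    with open(path) as fh:
        job = json.load(fh)
    protein = job["sequences"][0]["protein"]
    # unpairedMsa holds FASTA-formatted text; each '>' starts one aligned sequence
    msa_depth = protein.get("unpairedMsa", "").count(">")
    return {
        "name": job["name"],
        "length": len(protein["sequence"]),
        "msa_depth": msa_depth,
        "n_templates": len(protein.get("templates", [])),
    }
```

Queries with a very shallow MSA (depth of only a handful of sequences) are worth flagging, since the co-evolution signal AlphaFold relies on will be weak and the resulting structures should be treated with extra caution.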

AlphaFold uses jackhmmer under the hood

run_alphafold.py --norun_inference invokes jackhmmer from the HMMER suite of programs to search all the different sequence databases. jackhmmer is akin to PSI-BLAST: it uses its search results to build an HMM profile, which is then used to search the databases again, which in turn updates the HMM profile, and so on. However, if I understand correctly, jackhmmer is run here with -N 1, meaning it performs only a single iteration and never actually iterates on an HMM profile. It is therefore equivalent to phmmer, which simply compares a protein sequence to a protein sequence database and is thus akin to BLASTP.

Some technical details useful to know if you run into any errors:

  • Per protein query, multiple independent jackhmmer processes are launched, one for each distinct protein sequence database
  • The protein query sequence gets copied to the local disk of the node executing the jackhmmer process. These local disks are mounted on /tmp1/ and are not massive space-wise
  • The multiple sequence alignment that AlphaFold needs to infer co-evolving sites is also stored on these local disks (in Stockholm format). These MSAs can sometimes grow extremely large, and the local disk can run out of space.
  • The sequence databases are located on the /db1 SSD, so there is a lot of back-and-forth traffic over the network between /db1 and the local disk
  • jackhmmer does NOT load the entire sequence database into RAM; instead, the database is treated as a stream. The database is not indexed, nor is it indexed on the fly. I/O is therefore often the bottleneck: the CPU deals very quickly with fetched data, but then sits idle waiting for new data to arrive.
  • I sometimes get out-of-memory problems; it's still unclear to me exactly why

Running the second phase of AlphaFold, actually predicting the 3D structure

We ran the sequence search phase as an array job using CPUs, since sequence searching is optimized for CPUs. Structure prediction, however, is optimized for GPUs. We recently acquired an NVIDIA GPU, which is now part of Perun as perun24.

This part thankfully doesn't take as much time, but depending on the number of proteins it can still take a couple of days. Currently the setup is pretty naive: it just does the structure prediction for each query protein one by one in a for loop. There may be a way to parallelize the workload and make it more time-efficient, but I'm not sure yet how to do that.

run_alphafold.py --norun_data_pipeline takes as input the directory that contains the sequence search result JSON.

#!/bin/bash

#$ -S /bin/bash
#$ -cwd
#$ -q GPU@perun24

# setting up environment
# first ensure that we are using the right python installation for run_alphafold.py
# this is needed to make the 'source activate' below work
export PATH=/scratch5/software/miniconda3/bin:$PATH
# then append path to include the run_alphafold.py script
# NOTE: run_alphafold.py was edited slightly to include the #!/usr/bin/env python shebang at the top
export PATH=/scratch5/software/alphafold3:$PATH

# activate environment
source activate alphafold3

# dir with all the sequence databases
DB_DIR='/scratch5/software/alphafold3/public_databases'
# where to find the model parameter file
MODEL_DIR='/scratch5/db/alphafold3-model'

SEQ_OUTPUT_DIR='batch_sequence_search_out/'
# structure inference output directory
STRUC_OUTPUT_DIR='structure_module_out'

mkdir -p $STRUC_OUTPUT_DIR

# do structure inference with GPU
for QUERY in $SEQ_OUTPUT_DIR/*; do
    run_alphafold.py \
        --db_dir $DB_DIR \
        --model_dir $MODEL_DIR \
        --input_dir=$QUERY \
        --output_dir=$STRUC_OUTPUT_DIR \
        --norun_data_pipeline
done

Simply submit the script with qsub run_alphafold_gpu.sh. It will generate under $STRUC_OUTPUT_DIR an output directory for each query protein. That directory will contain many files and subdirectories, but the main file we're after is called pyv62_000500_model.cif. This is your final predicted structure!

Evaluating the final output

[ID]_model.cif

CIF stands for Crystallographic Information File

You can view the structures using software like PyMOL and ChimeraX

[ID]_summary_confidences.json

Contains information regarding the expected overall accuracy of the predicted structure:

The ptm or predicted Template Modeling score

  • Between 0 and 1, with 1 being the perfect score.
  • This is a measure of accuracy of the entire structure

The iptm or interface pTM score.

  • Also between 0 and 1. Null for monomers.
  • This is a measure of confidence in all predicted interfaces between subunits in the multimer, i.e. a measure of the accuracy of the relative positions of the subunits with respect to one another

fraction disordered

  • Also between 0 and 1.
  • What fraction of the structure is disordered?

has_clash

  • True or False
  • True if >50% of atoms of a chain “clash”

ranking_score

  • Ranges from -100 to 1.5
  • Calculated as follows: 0.8 * ipTM + 0.2 * pTM + 0.5 * fraction_disordered - 100 * has_clash. The weights sum to 1.5, which explains the maximum; the clash penalty explains the minimum.
  • This score is used to rank the multiple structure predictions
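The ranking score can be recomputed by hand from the fields in a _summary_confidences.json. A sketch, assuming weights of 0.8 / 0.2 / 0.5 (these sum to 1.5, matching the maximum of the stated range) and treating a null ipTM as 0, which may not match AlphaFold's internal handling of monomers:

```python
def ranking_score(ptm, iptm, fraction_disordered, has_clash):
    """Recompute the AF3 ranking score from summary-confidence fields.
    iptm is null (None) in the JSON for monomers; falling back to 0
    here is an assumption."""
    iptm = iptm or 0.0
    return 0.8 * iptm + 0.2 * ptm + 0.5 * fraction_disordered - 100.0 * has_clash

# a perfect, clash-free prediction hits the maximum of 1.5
print(ranking_score(ptm=1.0, iptm=1.0, fraction_disordered=1.0, has_clash=False))
# a clashing structure is pushed far below any clash-free one
print(ranking_score(ptm=1.0, iptm=1.0, fraction_disordered=1.0, has_clash=True))
```

The -100 clash penalty guarantees that any clash-free prediction outranks every clashing one, no matter how confident the latter looks otherwise.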

There are more metrics to discuss, but I don't have the time right now to cover them.
