====== Running AlphaFold at scale ======

=== Preparing AlphaFold3 input query files ===
  
AlphaFold3 is a little awkward with how it wants its input protein sequences formatted. Instead of a simple FASTA file, it requires a JSON file. Thankfully, I found a relatively simple way of converting the FASTA into a JSON file by slightly editing a Python script I found online. It is called ''fasta2alphafoldjson.py'' and it is available on the [[https://github.com/RogerLab/gospel_of_andrew|Gospel of Andrew]] GitHub page. Run it simply as follows:
  
<code>
...
</code>
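For reference, the kind of conversion such a script performs can be sketched roughly as below. This is a hypothetical, minimal version, not the actual ''fasta2alphafoldjson.py''; the field names (''dialect'', ''version'', ''name'', ''sequences'' -> ''protein'') follow the output JSON shown further down this page, and ''modelSeeds'' is my assumption.

<code python>
#!/usr/bin/env python
"""Hypothetical sketch of a FASTA -> AlphaFold3 input JSON conversion.

NOT the actual fasta2alphafoldjson.py script referenced above.
"""
import json
import os

def fasta_records(path):
    """Yield (name, sequence) tuples from a FASTA file."""
    name, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # take the first whitespace-separated token as the name
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def write_alphafold_jsons(fasta_path, out_dir):
    """Write one numbered AlphaFold3-style input JSON per FASTA record."""
    os.makedirs(out_dir, exist_ok=True)
    for i, (name, sequence) in enumerate(fasta_records(fasta_path), start=1):
        query = {
            "dialect": "alphafold3",
            "version": 1,
            "name": name,
            "modelSeeds": [1],  # assumption: a single default seed
            "sequences": [{"protein": {"id": "A", "sequence": sequence}}],
        }
        with open(os.path.join(out_dir, "input_%d.json" % i), "w") as out:
            json.dump(query, out, indent=2)
</code>

The numbered ''input_N.json'' naming matters for the array-job setup described below, where each task picks its own input file by number.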
=== Submitting AlphaFold as an array job ===
  
If you want to run AlphaFold on many proteins, it may be pragmatic to submit these sequence database searches as an **array job** on Perun. An array job is a single job that in turn schedules the submission of many other jobs. Here the idea is to submit a single general "AlphaFold sequence search job", which then schedules and submits all the individual search jobs for each query protein separately.
  
<code>
...
</code>
  
''$SGE_TASK_ID'' is a special variable that is useful when you submit an array job. Array jobs are submitted by passing the ''-t'' flag to ''qsub''. For example, ''qsub -t 1-1000'' tells qsub to submit an array job of 1000 individual jobs. When, say, the 52nd individual job is submitted by the array job, the ''$SGE_TASK_ID'' special variable gets assigned the number 52. So, in the example above, we are assigning the variable ''SINGLE_JSON_FILE'' to ''batch_split_jsons/input_52.json''. That file is then finally passed on to ''run_alphafold.py --norun_inference'' at the bottom of the script.
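
As a tiny self-contained illustration (task number 52 and the file layout are just the example from above; outside an array job you have to set ''SGE_TASK_ID'' yourself):

<code bash>
# the scheduler normally sets $SGE_TASK_ID; we fake task 52 here
SGE_TASK_ID=52
SINGLE_JSON_FILE="batch_split_jsons/input_${SGE_TASK_ID}.json"
echo "$SINGLE_JSON_FILE"   # prints: batch_split_jsons/input_52.json
</code>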

So, to submit this part of AlphaFold as an array job, do:

<code>
qsub -t 1-1000 -tc 10 -p -128 run_alphafold_cpu.arrayjob.sh
</code>

  * Adjust the 1000 to the number of proteins that you wish to analyze.
  * The ''-tc 10'' means a maximum of 10 individual jobs will run at any one time, so as not to overwhelm Perun.
  * The ''-p -128'' gives the individual jobs a lower priority. This ensures that other people's jobs, which won't take as long, have a higher priority and thus don't have to wait for your whole array job to finish.

Once AlphaFold starts running, it should generate, for each query protein, an output directory named after the protein under ''batch_sequence_search_out''; for example, ''batch_sequence_search_out/pyv62_000500''. In that directory you will find an output JSON file, for example ''pyv62_000500_data.json''. This file looks something like:

<code>
{
  "dialect": "alphafold3",
  "version": 2,
  "name": "PYV62_000500",
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "MAEVHRKLVSQVFPQTSIAMPLAGCGVQNVLRSQSAAVASVTKAAARTFSDMVTIDGNTAASHVAFHMSDLSCLFPITPATPMGENVDAWAAQGKKNIYGVVPGVYQLNSEAGASGSLHGALLAGALAASYTSSQGLLLMPPNMLKC>
        "modifications": [],
        "unpairedMsa": ">query\nMAEVHRKLVSQVFPQTSIAMPLAGCGVQNVLRSQSAAVASVTKAAARTFSDMVTIDGNTAASHVAFHMSDLSCLFPITPATPMGENVDAWAAQGKKNIYGVVPGVYQLNSEAGASGSLHGALLAGALAASYTSSQG>
        "pairedMsa": ">query\nMAEVHRKLVSQVFPQTSIAMPLAGCGVQNVLRSQSAAVASVTKAAARTFSDMVTIDGNTAASHVAFHMSDLSCLFPITPATPMGENVDAWAAQGKKNIYGVVPGVYQLNSEAGASGSLHGALLAGALAASYTSSQGLL>
        "templates": [
          {
            "mmcif": "data_6CIN\n#\n_entry.id 6CIN\n#\nloop_\n_chem_comp.formula\n_chem_comp.formula_weight\n_chem_comp.id\n_chem_comp.mon_nstd_flag\n_chem_comp.name\n_>
            "queryIndices": [52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86,>
            "templateIndices": [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,>
          },
          {
            "mmcif": "data_1B0P\n#\n_entry.id 1B0P\n#\nloop_\n_chem_comp.formula\n_chem_comp.formula_weight\n_chem_comp.id\n_chem_comp.mon_nstd_flag\n_chem_comp.name\n_>
            "queryIndices": [49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,>
            "templateIndices": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 3>
          },
          {
            "mmcif": "data_5C4I\n#\n_entry.id 5C4I\n#\nloop_\n_chem_comp.formula\n_chem_comp.formula_weight\n_chem_comp.id\n_chem_comp.mon_nstd_flag\n_chem_comp.name\n_>
            "queryIndices": [51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,>
            "templateIndices": [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,>
          },
          {
            "mmcif": "data_5C4I\n#\n_entry.id 5C4I\n#\nloop_\n_chem_comp.formula\n_chem_comp.formula_weight\n_chem_comp.id\n_chem_comp.mon_nstd_flag\n_chem_comp.name\n_>
            "queryIndices": [1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1>
            "templateIndices": [90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 1>
          }
        ]
</code>

NOTE: this view of the file is truncated with ''>'' characters; the lines are much longer in the actual file, and it would be impossible to show the entire file here in the wiki. These files can get quite big: the one pasted above was 75 MB, and I've seen up to 327 MB.

As you can see, it contains the MSA (although I don't yet understand what is meant by paired and unpaired MSA) and a number of templates, which I'm also not sure about. In any case, it is this JSON file that will be the input for the Structure Module, i.e. the second phase of AlphaFold.
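
Since these files can grow so large, it can be handy to get an overview of their sizes before starting the structure phase. Below is a hypothetical little helper (not part of AlphaFold) that lists the size of every ''*_data.json'' under the search output directory:

<code python>
# Hypothetical convenience script: print the size of every *_data.json
# produced by the sequence search phase, to spot runs whose MSAs grew huge.
import glob
import os

def report_data_json_sizes(search_dir="batch_sequence_search_out"):
    """Print the size in MB of each per-protein *_data.json file."""
    for path in sorted(glob.glob(os.path.join(search_dir, "*", "*_data.json"))):
        size_mb = os.path.getsize(path) / 1e6
        print("%8.1f MB  %s" % (size_mb, path))

if __name__ == "__main__":
    report_data_json_sizes()
</code>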

=== AlphaFold uses jackhmmer under the hood ===
  
''run_alphafold.py --norun_inference'' invokes **jackhmmer** of the HMMER suite of programs to search all the different sequence databases. **jackhmmer** is akin to PSI-BLAST, meaning it uses its search results to build an HMM profile, which is then used to search the databases again, which then updates the HMM profile, and so on. However, if I understand correctly, jackhmmer is run with ''-N1'', meaning it actually does not build an HMM profile. It is therefore equivalent to **phmmer**, which simply compares a protein sequence to a protein sequence database and is thus akin to BLASTP.
  * The multiple sequence alignment that AlphaFold needs to infer its co-evolving sites is also stored on these local disks (in Stockholm format). Sometimes these MSAs can grow extremely large and the local disk can run out of space.
  * The sequence databases are located on the /db1 SSD. This means there is a lot of back-and-forth traffic over the network between /db1 and the local disk.
  * jackhmmer does NOT load the entire sequence database into RAM. Instead, the database is treated as a stream. It is not indexed, nor is it indexed on the fly. Therefore, **I/O is often the bottleneck:** the CPU deals very quickly with fetched data, but then sits idle waiting for new data to arrive.
  * I sometimes get out-of-memory problems; it's still unclear to me exactly why.

=== Running the second phase of AlphaFold, actually predicting the 3D structure ===

We ran the sequence search phase as an array job on CPUs, since the task of sequence searching has been optimized for CPUs. Structure prediction, however, is optimized for GPUs. We have recently acquired an NVIDIA GPU, and it is now part of Perun as perun24.

This part thankfully doesn't take as much time, but depending on the number of proteins it can still take a couple of days. The current approach is pretty naive: it just does the structure prediction for each query protein one by one in a for loop. There may be a way to parallelize the workload and thereby make it more time-efficient, but I'm not sure yet how to do that.

''run_alphafold.py --norun_data_pipeline'' takes as input the directory that contains the sequence search result JSON.

<code>
#!/bin/bash

#$ -S /bin/bash
#$ -cwd
#$ -q GPU@perun24

# setting up environment
# first ensure that we are using the right python installation for run_alphafold.py
# this is needed to make the 'source activate' below work
export PATH=/scratch5/software/miniconda3/bin:$PATH
# then append path to include the run_alphafold.py script
# NOTE: run_alphafold.py was edited slightly to include the #!/usr/bin/env python shebang at the top
export PATH=/scratch5/software/alphafold3:$PATH

# activate environment
source activate alphafold3

# dir with all the sequence databases
DB_DIR='/scratch5/software/alphafold3/public_databases'
# where to find the model parameter file
MODEL_DIR='/scratch5/db/alphafold3-model'

# sequence search output directory (the input for this phase)
SEQ_OUTPUT_DIR='batch_sequence_search_out'
# structure inference output directory
STRUC_OUTPUT_DIR='structure_module_out'

# do structure inference with GPU, one query directory at a time
for QUERY in "$SEQ_OUTPUT_DIR"/*; do
    run_alphafold.py \
        --db_dir "$DB_DIR" \
        --model_dir "$MODEL_DIR" \
        --input_dir="$QUERY" \
        --output_dir="$STRUC_OUTPUT_DIR" \
        --norun_data_pipeline
done
</code>

Simply submit the script with ''qsub run_alphafold_gpu.sh''. It will generate under ''$STRUC_OUTPUT_DIR'' an output directory for each query protein. That directory will contain many files and subdirectories, but the main file we're after is the model CIF, for example ''pyv62_000500_model.cif''. This is your final predicted structure!

=== Evaluating the final output ===

== [ID]_model.cif ==
CIF stands for Crystallographic Information File.

You can view the structures using software such as PyMOL and ChimeraX.

== [ID]_summary_confidences.json ==

Contains information regarding the expected overall accuracy of the predicted structure:
  
The **ptm**, or predicted Template Modeling score
  * Between 0 and 1, with 1 being the perfect score.
  * This is a measure of the accuracy of the entire structure.
  
The **iptm**, or interface pTM score
  * Also between 0 and 1; null if the prediction is a monomer.
  * This is a measure of confidence in all predicted interfaces between subunits in the multimer, i.e. a measure of the accuracy of the relative positions of the subunits to one another.

**fraction_disordered**
  * Also between 0 and 1.
  * The fraction of the structure predicted to be disordered.

**has_clash**
  * True or False.
  * True if >50% of the atoms of a chain "clash".
  
**ranking_score**
  * Ranges from -100 to 1.5.
  * Calculated as follows: 0.8 * ipTM + 0.2 * pTM + 0.5 * fraction_disordered - 100 * has_clash.
  * This score is used to rank the multiple structure predictions.
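
This calculation can be sketched in Python. Note that I take the pTM coefficient to be 0.2 (my reading, since that is what makes the maximum come out at 0.8 + 0.2 + 0.5 = 1.5), and the numbers plugged in are purely hypothetical:

<code python>
# Sketch of the ranking_score formula (pTM coefficient of 0.2 assumed,
# consistent with the stated -100 to 1.5 range)
def ranking_score(iptm, ptm, fraction_disordered, has_clash):
    return 0.8 * iptm + 0.2 * ptm + 0.5 * fraction_disordered - 100 * int(has_clash)

# hypothetical confidence values for a clash-free multimer prediction
print(ranking_score(iptm=0.85, ptm=0.90, fraction_disordered=0.05, has_clash=False))
</code>

The -100 penalty means any prediction with a clash is ranked below every clash-free one.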
  
There are more metrics to discuss, but I don't have the time right now to cover them.
running_alphafold_at_scale.1773158309.txt.gz · Last modified: by 172.20.54.200