running_alphafold_at_scale
===== Running AlphaFold on Perun =====

Currently we have AlphaFold3 installed on Perun.

==== Get permission to use AlphaFold3 ====
Before you can use it, you need to get a ~1GB file that contains the "model parameters". Access to these parameters has to be requested from Google DeepMind, who approve requests individually.
Once you get approved, normally you would download the file and place it at the appropriate location in your filesystem so you can start using AlphaFold. On Perun, however, you don't need to download that file; you only need to let Balagopal know that you have been approved and he will add you to a list, so that you can start using AlphaFold on Perun.
==== Searching the protein sequence databases ====
As explained above, AlphaFold does two things: first it compares the query sequence to sequence databases, and then it actually predicts the 3D structure. The sequence database search is the computationally limiting step and takes much more time to run. The structure module is relatively quick.
| + | |||
| + | === The sequence databases === | ||
All the sequence databases (the "Big Fantastic Database" (BFD) and the others) are stored on the ''/db1'' SSD.
  * BFD was created by clustering 2.5 billion protein sequences from Uniprot/TrEMBL+SwissProt together with metagenomic sequence sets.
Since there is quite a bit of reading and writing going on in the sequence search phase, it is highly recommended to use an SSD drive.

It is most efficient to run the sequence search phase (''--norun_inference'') and the structure prediction phase (''--norun_data_pipeline'') as two separate steps: the search phase on CPUs, and the prediction phase on the GPU (see below).
| + | |||
| + | === Preparing AlphaFold3 input query files === | ||
| + | |||
| + | AlphaFold3 is a little awkward with how it wants its input protein sequences formatted. Instead of a simple FASTA file, it requires a JSON file. Thankfully I found a relatively simple way of converting the FASTA into a JSON file using and editing slightly a python script I found online. It is called '' | ||
| + | |||
| + | < | ||
| + | ./ | ||
| + | </ | ||
| + | |||
| + | NOTE: Currently the script is hardcoded to make JSON entries of the ' | ||
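
The conversion itself is simple enough to sketch. Below is a minimal Python version of the idea (my own sketch, not the actual script used on Perun). It also hardcodes the ''protein'' type, and the JSON keys follow the published AlphaFold3 fold-input format:

```python
import json

def _fold_input(name: str, sequence: str) -> dict:
    """One AlphaFold3 fold-input dict; hardcodes the 'protein' entity type."""
    return {
        "name": name,
        "modelSeeds": [1],
        "sequences": [{"protein": {"id": "A", "sequence": sequence}}],
        "dialect": "alphafold3",
        "version": 1,
    }

def fasta_to_fold_inputs(fasta_text: str) -> list:
    """Convert FASTA records into a list of AlphaFold3 fold-input dicts."""
    inputs, name, seq = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                inputs.append(_fold_input(name, "".join(seq)))
            # use the first word of the header as the query name
            name, seq = line[1:].split()[0], []
        elif line:
            seq.append(line)
    if name is not None:
        inputs.append(_fold_input(name, "".join(seq)))
    return inputs

def convert(fasta_path: str, json_path: str) -> None:
    """Read a FASTA file and write the combined JSON next to it."""
    with open(fasta_path) as fh:
        fold_inputs = fasta_to_fold_inputs(fh.read())
    with open(json_path, "w") as out:
        json.dump(fold_inputs, out, indent=1)
```

For anything beyond plain protein monomers (DNA, RNA, ligands, multiple chains) the script would need to be extended accordingly.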
| + | |||
| + | If your '' | ||
| + | |||
| + | < | ||
| + | mkdir -p batch_split_jsons | ||
| + | jq -c ' | ||
| + | </ | ||
| + | |||
| + | '' | ||
| + | |||
| + | === Submitting AlphaFold as an array job === | ||
| + | |||
| + | If you want to run AlphaFold on many proteins, it may be pragmatic to submit these sequence database searches as an **array job** on Perun. An array-job is a single job, that in turn schedules the submission of many other jobs. Here the idea is to submit a single general " | ||
| + | |||
| + | < | ||
| + | # | ||
| + | |||
| + | #$ -S /bin/bash | ||
| + | #$ -cwd | ||
| + | #$ -pe threaded 10 | ||
| + | #$ -q 768G-batch, | ||
| + | |||
| + | |||
| + | # $SGE_TASK_ID returns a number, which is its current task number | ||
| + | # for example if you do | ||
| + | # qsub -t 1-1000 | ||
| + | # then each job in the array will have a unique SGE_TASK_ID, | ||
| + | # here, we are using that ID to get the corresponding file | ||
| + | SINGLE_JSON_FILE=" | ||
| + | |||
| + | |||
| + | # setting up environment | ||
| + | # first ensure that we are using the right python installation for run_alphafold.py | ||
| + | # this is needed to make the ' | ||
| + | export PATH=/ | ||
| + | |||
| + | # then append path to include the run_alphafold.py script | ||
| + | # NOTE: run_alphafold.py was edited slightly to include the # | ||
| + | export PATH=/ | ||
| + | |||
| + | # activate environment | ||
| + | source activate alphafold3 | ||
| + | |||
| + | # dir with all the sequence databases | ||
| + | DB_DIR='/ | ||
| + | # where to find the model parameter file | ||
| + | MODEL_DIR='/ | ||
| + | |||
| + | # input protein or proteins | ||
| + | # | ||
| + | # if you want to submit a single protein, use a JSON of the ' | ||
| + | # if you want to submit multiple protein queries, you can use a JSON of the ' | ||
| + | # or you can specify an INPUT_DIR where each query is a single JSON file of the ' | ||
| + | |||
| + | # sequence search output directory | ||
| + | SEQ_OUTPUT_DIR=' | ||
| + | # threads to use | ||
| + | THREADS=10 | ||
| + | |||
| + | mkdir -p $SEQ_OUTPUT_DIR | ||
| + | |||
| + | # do sequence search with fast CPUs | ||
| + | run_alphafold.py \ | ||
| + | --norun_inference \ | ||
| + | --json_path $SINGLE_JSON_FILE \ | ||
| + | --output_dir $SEQ_OUTPUT_DIR \ | ||
| + | --db_dir $DB_DIR \ | ||
| + | --nhmmer_n_cpu $THREADS \ | ||
| + | --jackhmmer_n_cpu $THREADS | ||
| + | </ | ||
| + | |||
| + | What is this magic '' | ||
| + | |||
| + | < | ||
| + | SINGLE_JSON_FILE=" | ||
| + | </ | ||
| + | |||
| + | '' | ||
| + | |||
| + | So, to submit this part of AlphaFold as an array job, do: | ||
| + | |||
| + | < | ||
| + | qsub -t 1-1000 -tc 10 -p -128 run_alphafold_cpu.arrayjob.sh | ||
| + | </ | ||
| + | |||
| + | * Adjust the 1000 to the number of proteins that you wished to be analyzed. | ||
| + | * The '' | ||
| + | * The '' | ||
| + | |||
| + | Once AlphaFold starts running, it should generate for each query protein a output directory with the name of the protein, under '' | ||
| + | |||
| + | < | ||
| + | { | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | { | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | { | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | }, | ||
| + | { | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | }, | ||
| + | { | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | }, | ||
| + | { | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | } | ||
| + | ] | ||
| + | </ | ||
| + | |||
| + | NOTE that this view of the file is truncated with >. The lines are much longer in the actual file. It would be impossible to show the entire file here in the wiki. These files can get quite big. The one I pasted above was 75 MB. I've seen up to 327 MB. | ||
| + | |||
| + | As you can see it contains the MSA (although I don't yet understand what is meant with paired and unpaired MSA) and a bunch of Templates, which I'm not sure either on what those mean. In any case, it is this JSON file that will be the input for the Structure Module, i.e. the second phase of AlphaFold. | ||
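
To get a feel for what is inside one of these big files without opening it in an editor, a small inspection script helps. A sketch (the key names ''unpairedMsa'', ''pairedMsa'' and ''templates'' follow the AlphaFold3 input format; adjust if your files differ):

```python
import json

def summarize_data_json(path: str) -> dict:
    """Report per-chain sequence, MSA and template sizes in a [name]_data.json."""
    with open(path) as fh:
        doc = json.load(fh)
    summary = {"name": doc.get("name"), "chains": []}
    for entry in doc.get("sequences", []):
        for entity_type, entity in entry.items():  # e.g. "protein"
            summary["chains"].append({
                "type": entity_type,
                "id": entity.get("id"),
                "length": len(entity.get("sequence", "")),
                # the MSAs are stored inline as (possibly huge) alignment strings
                "unpaired_msa_chars": len(entity.get("unpairedMsa") or ""),
                "paired_msa_chars": len(entity.get("pairedMsa") or ""),
                "n_templates": len(entity.get("templates") or []),
            })
    return summary
```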
| + | |||
| + | === AlphaFold uses jackhmmer under the hood === | ||
| + | |||
| + | '' | ||
| + | |||
| + | Some technical details useful to know if you run into any errors: | ||
| + | * Per protein query, multiple independent jackhmmer processes are launched. One for each distinct protein sequence database | ||
| + | * The protein query sequence gets copied to the local disk of the node that is executing the jackhmmer process. These local disks are mounted on ''/ | ||
| + | * The multiple sequence alignment that AlphaFold needs to infer its co-evolving sites is also stored on these local disks (in Stockholm format). Sometimes these MSAs can grow extremely large and the local disk can run out of space. | ||
| + | * The sequence databases are located on the /db1 SSD. This means there is a lot of back-and-forth traffic over the network between the /db1 and the local disk | ||
| + | * jackhmmer does NOT load the entire sequence database into RAM. Instead, it is treated as a stream. The database is also not indexed, and it is also not indexed on the fly. Therefore, **I/O is often the bottleneck.** The CPU deals very quickly with fetched data, but then sits idle waiting for new data to arrive. | ||
| + | * I sometimes get out of memory problems, its still unclear to me why exactly | ||
| + | |||
| + | |||
| + | === Running the second phase of AlphaFold, actually predicting the 3D structure === | ||
| + | |||
| + | We ran the sequence search phase as an array job, and it was using CPUs, since the task of sequence searching has been optimized for CPUs. However, structure prediction is optimized for using GPUs. We have since recently acquired an NVIDIA GPU and it is now part of Perun as perun24. | ||
| + | |||
| + | This part thankfully doesn' | ||
| + | |||
| + | '' | ||
| + | |||
| + | < | ||
| + | # | ||
| + | |||
| + | #$ -S /bin/bash | ||
| + | #$ -cwd | ||
| + | #$ -q GPU@perun24 | ||
| + | |||
| + | # setting up environment | ||
| + | # first ensure that we are using the right python installation for run_alphafold.py | ||
| + | # this is needed to make the ' | ||
| + | export PATH=/ | ||
| + | # then append path to include the run_alphafold.py script | ||
| + | # NOTE: run_alphafold.py was edited slightly to include the # | ||
| + | export PATH=/ | ||
| + | |||
| + | # activate environment | ||
| + | source activate alphafold3 | ||
| + | |||
| + | # dir with all the sequence databases | ||
| + | DB_DIR='/ | ||
| + | # where to find the model parameter file | ||
| + | MODEL_DIR='/ | ||
| + | |||
| + | SEQ_OUTPUT_DIR=' | ||
| + | # structure inference output directory | ||
| + | STRUC_OUTPUT_DIR=' | ||
| + | |||
| + | # do structure inference with GPU | ||
| + | for QUERY in $SEQ_OUTPUT_DIR/ | ||
| + | run_alphafold.py \ | ||
| + | --db_dir $DB_DIR \ | ||
| + | --model_dir $MODEL_DIR \ | ||
| + | --input_dir=$QUERY \ | ||
| + | --output_dir=$STRUC_OUTPUT_DIR \ | ||
| + | --norun_data_pipeline | ||
| + | done | ||
| + | </ | ||
| + | |||
| + | Simply submit the script with '' | ||
| + | |||
| + | === Evaluating the final output === | ||
| + | |||
| + | == [ID]_model.cif == | ||
| + | CIF stands for Crystallographic Information File | ||
| + | |||
| + | You can view the structures using softwares like PyMOL and ChimeraX | ||
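
If you just want a quick look at a ''[ID]_model.cif'' without firing up PyMOL, the atom records can be read with a few lines of Python. A deliberately minimal sketch: it assumes the ''_atom_site'' loop lists ''group_PDB'' first (so atom rows start with ''ATOM''/''HETATM'') and that ''auth_asym_id'' is among the columns; for real work, use a proper mmCIF library such as gemmi or Biopython:

```python
def cif_atom_stats(cif_text: str):
    """Count atoms and chains in the _atom_site loop of a model.cif.

    Assumes atom rows start with ATOM/HETATM (i.e. _atom_site.group_PDB
    is the first column) and the chain ID sits in the auth_asym_id column.
    """
    headers = []   # the _atom_site.* column names, in order of appearance
    n_atoms = 0
    chains = set()
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site."):
            headers.append(line.split()[0])
        elif headers and line.startswith(("ATOM", "HETATM")):
            fields = line.split()
            n_atoms += 1
            if len(fields) == len(headers) and "_atom_site.auth_asym_id" in headers:
                chains.add(fields[headers.index("_atom_site.auth_asym_id")])
    return n_atoms, sorted(chains)
```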
| + | |||
| + | == [ID]_summary_confidences.json == | ||
| + | |||
| + | Contains information regarding the expected overall accuracy of the predicted structure: | ||
| + | |||
| + | The **ptm** or predicted Template Modeling score | ||
| + | * Between 0 and 1, with 1 being the perfect score. | ||
| + | * This is a measure of accuracy of the entire structure | ||
| + | |||
| + | The **iptm** or interface pTM score. | ||
| + | * Also between 0 and 1. Null, if a monomer. | ||
| + | * This is a measure of confidence in all predicted interfaces between subunits in the multimer, or measure of accuracy of relative positions of subunits to one another | ||
| + | |||
| + | **fraction disordered** | ||
| + | * Also between 0 and 1. | ||
| + | * What fraction of the structure is disordered? | ||
| + | |||
| + | **has_clash** | ||
| + | * True or False | ||
| + | * True if >50% of atoms of a chain " | ||
| + | |||
| + | **ranking_score** | ||
| + | * Ranges from -100 to 1.5 ? | ||
| + | * Calculated as follows: 0.8 * ipTM + 2 * pTM + 0.5 * disorder - 100 * has_clash | ||
| + | * This calculation is then used to rank the multiple structure predictions | ||
| + | There are more metrics to discuss, but I don't have the time right now to continue on them | ||
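
The ranking formula above can be recomputed directly from a ''[ID]_summary_confidences.json'', which makes a handy sanity check and a starting point for your own filtering scripts. A sketch; the null-ipTM fallback for monomers is my own assumption, not necessarily what AlphaFold does internally:

```python
import json

def ranking_score(summary: dict) -> float:
    """0.8*ipTM + 0.2*pTM + 0.5*fraction_disordered - 100*has_clash."""
    ptm = summary.get("ptm") or 0.0
    iptm = summary.get("iptm")
    if iptm is None:      # monomers: iptm is null in the JSON
        iptm = ptm        # fall back to ptm (an assumption)
    disorder = summary.get("fraction_disordered") or 0.0
    clash_penalty = 100.0 if summary.get("has_clash") else 0.0
    return 0.8 * iptm + 0.2 * ptm + 0.5 * disorder - clash_penalty

def load_score(path: str) -> float:
    """Convenience wrapper: score a summary_confidences.json on disk."""
    with open(path) as fh:
        return ranking_score(json.load(fh))
```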
