running_alphafold_at_scale
Differences
This shows you the differences between two versions of the page.
| running_alphafold_at_scale [2026/03/11 10:33] – [Searching the protein sequence databases] 172.20.54.200 | running_alphafold_at_scale [2026/03/11 11:51] (current) – 172.20.80.5 | ||
|---|---|---|---|
| Line 60: | Line 60: | ||
| === Preparing AlphaFold3 input query files === | === Preparing AlphaFold3 input query files === | ||
| - | AlphaFold3 is a little awkward with how it wants it input protein sequences formatted. Instead of a simple FASTA file, it requires a JSON file. Thankfully I found a relatively simple way of converting the FASTA into a JSON file using and editing slightly a python script I found online. It is called '' | + | AlphaFold3 is a little awkward about how it wants its input protein sequences formatted. Instead of a simple FASTA file, it requires a JSON file. Thankfully, I found a relatively simple way of converting the FASTA into a JSON file, using a slightly edited Python script I found online. It is called '' |
| < | < | ||
| Line 79: | Line 79: | ||
| === Submitting AlphaFold as an array job === | === Submitting AlphaFold as an array job === | ||
| - | If you want to run AlphaFold on many proteins, it may be pragmatic to submit these sequence | + | If you want to run AlphaFold on many proteins, it may be pragmatic to submit these sequence |
| < | < | ||
| Line 195: | Line 195: | ||
| </ | </ | ||
| - | NOTE that this view of the file is truncated with >. The lines are much longer in the actual file. It would be impossible to show the entire file here in the wiki. | + | NOTE that this view of the file is truncated with >. The lines are much longer in the actual file. It would be impossible to show the entire file here in the wiki. These files can get quite big. The one I pasted above was 75 MB. I've seen up to 327 MB. |
| + | |||
| + | As you can see, it contains the MSA (although I don't yet understand what is meant by paired and unpaired MSA) and a number of templates, whose role I also don't yet understand. In any case, it is this JSON file that will be the input for the Structure Module, i.e. the second phase of AlphaFold. | ||
| === AlphaFold uses jackhmmer under the hood === | === AlphaFold uses jackhmmer under the hood === | ||
| Line 210: | Line 212: | ||
| + | === Running the second phase of AlphaFold, actually predicting the 3D structure === | ||
| + | We ran the sequence search phase as an array job on CPUs, since sequence searching has been optimized for CPUs. Structure prediction, however, is optimized for GPUs. We recently acquired an NVIDIA GPU, which is now part of Perun as perun24. | ||
| + | |||
| + | This part thankfully doesn' | ||
| + | |||
| + | '' | ||
| + | |||
| + | < | ||
| + | #!/bin/bash | ||
| + | |||
| + | #$ -S /bin/bash | ||
| + | #$ -cwd | ||
| + | #$ -q GPU@perun24 | ||
| + | |||
| + | # setting up environment | ||
| + | # first ensure that we are using the right python installation for run_alphafold.py | ||
| + | # this is needed to make the ' | ||
| + | export PATH=/ | ||
| + | # then append path to include the run_alphafold.py script | ||
| + | # NOTE: run_alphafold.py was edited slightly to include the # | ||
| + | export PATH=/ | ||
| + | |||
| + | # activate environment | ||
| + | source activate alphafold3 | ||
| + | |||
| + | # dir with all the sequence databases | ||
| + | DB_DIR='/ | ||
| + | # where to find the model parameter file | ||
| + | MODEL_DIR='/ | ||
| + | |||
| + | SEQ_OUTPUT_DIR=' | ||
| + | # structure inference output directory | ||
| + | STRUC_OUTPUT_DIR=' | ||
| + | |||
| + | # do structure inference with GPU | ||
| + | for QUERY in $SEQ_OUTPUT_DIR/ | ||
| + | run_alphafold.py \ | ||
| + | --db_dir="$DB_DIR" \ | ||
| + | --model_dir="$MODEL_DIR" \ | ||
| + | --input_dir="$QUERY" \ | ||
| + | --output_dir="$STRUC_OUTPUT_DIR" \ | ||
| + | --norun_data_pipeline | ||
| + | done | ||
| + | </ | ||
| + | |||
| + | Simply submit the script with '' | ||
| + | |||
| + | === Evaluating the final output === | ||
| + | |||
| + | == [ID]_model.cif == | ||
| + | CIF stands for Crystallographic Information File | ||
| + | |||
| + | You can view the structures using software such as PyMOL and ChimeraX. | ||
| + | |||
| + | == [ID]_summary_confidences.json == | ||
| + | |||
| + | Contains information regarding the expected overall accuracy of the predicted structure: | ||
| + | |||
| + | The **ptm** or predicted Template Modeling score | ||
| + | * Between 0 and 1, with 1 being the perfect score. | ||
| + | * This is a measure of accuracy of the entire structure | ||
| + | |||
| + | The **iptm** or interface pTM score. | ||
| + | * Also between 0 and 1. Null, if a monomer. | ||
| + | * This is a measure of confidence in all predicted interfaces between subunits in the multimer, or measure of accuracy of relative positions of subunits to one another | ||
| + | | ||
| + | **fraction_disordered** | ||
| + | * Also between 0 and 1. | ||
| + | * What fraction of the structure is disordered? | ||
| + | |||
| + | **has_clash** | ||
| + | * True or False | ||
| + | * True if >50% of atoms of a chain " | ||
| + | **ranking_score** | ||
| + | * Ranges from -100 to 1.5 | ||
| + | * Calculated as follows: 0.8 * ipTM + 0.2 * pTM + 0.5 * disorder - 100 * has_clash (where disorder is the fraction disordered and has_clash counts as 1 when True) | ||
| + | * This calculation is then used to rank the multiple structure predictions | ||
| + | There are more metrics to discuss, but I don't have the time right now to continue on them | ||
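
To make the ranking formula above concrete, here is a minimal Python sketch that reads one ''[ID]_summary_confidences.json'' file and recomputes the ranking score from its parts. It assumes the JSON keys are named ''ptm'', ''iptm'', ''fraction_disordered'', ''has_clash'' and ''ranking_score''. The function names ''ranking_score'' and ''summarize'' are made up for this illustration and are not part of AlphaFold. For monomers ''iptm'' is null, and I don't know how that case is handled internally, so the sketch simply returns ''None'' there instead of guessing.

```python
import json


def ranking_score(iptm, ptm, fraction_disordered, has_clash):
    """Recompute the ranking score from the formula on this page.

    Returns None when iptm is missing (monomers); how AlphaFold handles
    that case is NOT covered by this sketch.
    """
    if iptm is None:
        return None
    # has_clash may be stored as True/False or as 0/1; bool() covers both.
    clash_penalty = 100.0 if bool(has_clash) else 0.0
    return 0.8 * iptm + 0.2 * ptm + 0.5 * fraction_disordered - clash_penalty


def summarize(path):
    """Load one summary confidences file and return the headline metrics.

    The key names below are assumptions; missing keys come back as None.
    """
    with open(path) as handle:
        data = json.load(handle)
    keys = ("ptm", "iptm", "fraction_disordered", "has_clash", "ranking_score")
    return {key: data.get(key) for key in keys}


if __name__ == "__main__":
    # Illustrative values only, not taken from a real prediction.
    print(round(ranking_score(0.5, 0.6, 0.1, False), 2))  # 0.57
```

With a real output file you could call ''summarize("[ID]_summary_confidences.json")'' and compare the stored ''ranking_score'' with the recomputed value. Note that the best possible score (ipTM = pTM = disorder = 1, no clash) is 0.8 + 0.2 + 0.5 = 1.5, and a clash with everything else at 0 gives -100, which matches the range stated above.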
