

Running AlphaFold at scale

By Joran Martijn, March 2026

A brief history of AlphaFold

Predicting a protein's three-dimensional structure from its amino acid sequence has been a central open problem in biochemistry for decades, and many groups have attempted to solve it.

In 2016, Google's DeepMind began exploring protein folding as part of its research into applying AI to biological problems. AlphaFold made its first notable appearance at the 13th Critical Assessment of Protein Structure Prediction (CASP13) in December 2018, where it predicted protein structures with remarkable accuracy, outperforming other methods.

In 2020, DeepMind unveiled AlphaFold2, which significantly improved on the original model. It introduced a new neural network architecture and new training techniques, demonstrating astonishing accuracy in its predictions. At CASP14, AlphaFold2 predicted protein structures at a level comparable to experimental techniques.

In July 2021, DeepMind published the AlphaFold Protein Structure Database, providing over 350,000 predicted protein structures for the scientific community. This database has since been a vital resource for researchers worldwide.

AlphaFold3 was released in 2024. Its main advance over its predecessor is that it can predict structures of complexes between proteins and non-protein molecules (DNA, RNA, ions, lipids, Fe-S clusters, small-molecule ligands, post-translational modifications, and chemical modifications of nucleic acids). It can also predict structures that contain no protein at all, such as single- and double-stranded DNA and single-stranded RNA chains.

AlphaFold theory

The main tenet of AlphaFold is to use evolutionary signals in protein sequences to inform the reconstruction of a protein's three-dimensional structure. The idea is that pairs of amino acid residues in the same protein that co-evolve are very likely to be close to each other in 3D space, while pairs that do not co-evolve are thought to lie further apart.

The exact details are a bit murky, but I was able to understand the following:

  • AF is a multicomponent AI system that uses deep learning to train its neural network models
  • It was trained on experimentally verified structures from the Protein Data Bank (PDB)

AlphaFold consists of two phases:

  • 1A. Compare the input protein sequence with those in publicly available databases
  • 1B. Construct a multiple sequence alignment (MSA)
  • 1C. Construct a pair representation, in which the degree of co-evolution is quantified for every residue pair in an all-vs-all comparison

The neural network (called the Evoformer in AF2 and the Pairformer in AF3) allows information to flow continuously between the MSA and the pair representation; it repeatedly interprets and updates both.

  • 2. A structure module in AF2, or a 'diffusion module' in AF3, uses the information extracted in phase 1 to infer the 3D protein structure
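The co-evolution signal quantified in the pair representation (step 1C) can be illustrated with a toy calculation. The sketch below scores pairs of MSA columns by mutual information: columns that mutate in a correlated way score high, hinting at a 3D contact. The alignment here is invented for illustration; real pipelines use far larger MSAs and learned or statistically corrected scores, not raw mutual information.

```python
# Toy illustration of co-evolution scoring on an MSA.
# The MSA below is hypothetical; real inputs come from database searches.
from collections import Counter
from itertools import combinations
from math import log2

# Rows = homologous sequences, columns = alignment positions.
# Columns 0 and 3 co-vary perfectly (A goes with D, G goes with E).
msa = [
    "ARNDC",
    "ARNDC",
    "GRNEC",
    "GRNEC",
]

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    pi = Counter(col_i)
    pj = Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in pij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

cols = list(zip(*msa))  # transpose: column-wise view of the MSA
for i, j in combinations(range(len(cols)), 2):
    print(i, j, round(mutual_information(cols[i], cols[j]), 3))
# Pair (0, 3) scores 1.0 bit; all other pairs score 0.0.
```

In AlphaFold this kind of pairwise signal is not computed with a fixed formula but learned and refined by the network itself, which is one reason it outperforms classical co-evolution methods.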

Running AlphaFold on Perun
