The Arcane is Yielding to the Computational
Why Biology is Becoming an Engineering Discipline—and What Comes Next
Construction, manufacturing, and chemistry are no longer disciplines of science—they are disciplines of engineering.
We have developed precision, physics-based simulations and computational techniques that allow us to predict, with remarkable accuracy, the performance, stress factors, and long-term behavior of the materials and structures we design.
The same cannot yet be said for biology.
The Stupidity of Small Molecule Drug Discovery
Historically, we developed small-molecule drugs through brute-force trial and error, a process known simply as small-molecule drug discovery.
Step one: Pick a biological target—an enzyme, receptor, or protein involved in a disease.
Step two: Use automation to screen hundreds of thousands (sometimes millions) of chemical compounds to see if any of them interact favorably with the target.
Most pharmaceutical companies maintain massive chemical libraries for this purpose. Pfizer has 4 million compounds in its database. AstraZeneca has 1.7 million.
While this process is magnificent in scale, it is also fundamentally non-systematic.
We close our eyes, pick a lead candidate, cross our fingers, and take it through clinical trials.
The result? A small arsenal of blunt weapons against various diseases—arrived at through almost pure trial and error.
Today, the process is slightly less random and marginally more systematic, but it is still far from engineered precision.
Rational Drug Design
Modern small-molecule drug development follows a structured approach known as rational drug design. Instead of relying on brute-force screening, scientists now use molecular insights, computational modeling, and iterative refinement to develop more precise therapies.
The process begins by identifying a biological target—typically a protein, enzyme, or receptor involved in a disease. The objective is to either activate or inhibit this target in a way that meaningfully improves patient outcomes. Once a target is selected, its molecular structure must be mapped to facilitate precise drug design. This is accomplished using X-ray crystallography, which determines atomic positions by analyzing how X-rays scatter through crystallized proteins; Nuclear Magnetic Resonance (NMR) spectroscopy, which uses magnetic fields to infer molecular structure in solution; or cryo-electron microscopy (cryo-EM), which captures high-resolution 3D images of large and complex biomolecules without requiring crystallization. Understanding the structure of a target is critical for designing molecules that fit precisely into its active site.
With this structural blueprint in hand, researchers begin the process of designing small molecules that interact with the target in a therapeutically beneficial way. Several approaches exist. High-throughput screening (HTS) allows scientists to test hundreds of thousands to millions of compounds, searching for those that bind effectively. A more selective method, focused screening, involves running assays on a hand-selected subset of molecules with known biological activity. Alternatively, computational modeling techniques such as molecular docking, molecular dynamics, and AI-driven chemistry enable researchers to predict and refine drug-target interactions in silico before conducting physical experiments. The effectiveness of each approach depends on the chemical properties of the target and the availability of data on similar molecules within the therapeutic class.
Once a promising compound is identified, it advances as a lead drug candidate. However, binding to a target in vitro does not guarantee that the molecule will be an effective or safe drug. Before entering clinical trials, the candidate undergoes optimization to refine its efficacy, specificity, and pharmacokinetics. Scientists must determine whether the drug is potent and selective, whether it can be administered orally without being destroyed in first-pass metabolism by the liver, and whether its therapeutic dose is distinct from its toxic dose. Additional factors include its solubility in blood, its propensity to accumulate in fat tissue, and its ability to cross the blood-brain barrier, which is crucial for neurological treatments. Long-term safety assessments seek to rule out risks of toxicity, carcinogenicity, or other adverse effects.
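To make one slice of this optimization concrete, here is a minimal sketch of the crudest of these filters, Lipinski's Rule of Five, written against the open-source RDKit library. The example molecules and hard-coded thresholds are purely illustrative; real lead optimization weighs dozens of interacting properties, in assays as well as in silico.

```python
# Minimal sketch of rule-based lead filtering (Lipinski's Rule of Five),
# using the open-source RDKit library. SMILES strings and thresholds are
# illustrative; real lead optimization weighs far more properties.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

candidates = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
}

def passes_rule_of_five(mol):
    """Crude oral-bioavailability heuristic: molecular weight, lipophilicity,
    and hydrogen-bond donors/acceptors within Lipinski's limits."""
    return (
        Descriptors.MolWt(mol) <= 500
        and Descriptors.MolLogP(mol) <= 5
        and Lipinski.NumHDonors(mol) <= 5
        and Lipinski.NumHAcceptors(mol) <= 10
    )

for name, smiles in candidates.items():
    mol = Chem.MolFromSmiles(smiles)
    print(name, "passes" if passes_rule_of_five(mol) else "fails")
```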
Each of these variables is tested through iterative cycles of design and refinement, progressively enhancing the drug’s properties before it ever reaches human trials. Yet, despite all these advancements, biology remains a science of trial and error.
Why Biology is Still Trial and Error
While chemistry has become an engineering discipline, medicine remains constrained by biological unpredictability.
Biology remains a science of discovery, governed by the scientific method—a slow, step-by-step process of hypothesis, experimentation, and refinement. Each experiment tests a single assumption, generates data, and informs the next iteration, gradually uncovering the fundamental truths of biological systems.
But this process is inherently blind: a one-step-at-a-time, trial-and-error search through an infinitely expansive tree of knowledge.
Progress feels fast because each breakthrough has a profound impact on quality of life and global living standards. Yet relative to the vastness of undiscovered biological knowledge, it remains excruciatingly slow. The true bottleneck is not our ingenuity, but the sheer volume of truths yet to be uncovered—and the transformative technologies that remain locked behind them.
Why Computation Wasn’t Enough
For decades, scientists hoped that computational power alone would allow us to escape the slow, methodical process of experimentation. If we could simulate biology at a fundamental level, perhaps we could bypass the scientific method altogether—allowing us to test hypotheses in silico rather than in a lab.
But that vision never materialized.
Computers revolutionized engineering disciplines like construction, manufacturing, and materials science because these fields are governed by well-defined physical laws that can be modeled with high precision using classical physics. We can simulate stress factors on a bridge with finite element analysis, predict fluid dynamics with computational models, and even design new materials at the molecular level using quantum simulations.
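To appreciate how tightly those laws pin down the answer, consider a toy calculation: not full finite element analysis, just the closed-form tip deflection of a cantilever beam under a point load. The load, dimensions, and material below are invented for illustration.

```python
# Back-of-the-envelope example of why engineering is predictable: the tip
# deflection of an end-loaded cantilever follows a closed-form law,
# delta = F * L^3 / (3 * E * I). Values are illustrative
# (a 2 m steel beam with a 10 cm square cross-section).
F = 1_000.0          # applied end load, newtons
L = 2.0              # beam length, meters
E = 200e9            # Young's modulus of steel, pascals
b = h = 0.10         # square cross-section, meters
I = b * h**3 / 12    # second moment of area, m^4

deflection = F * L**3 / (3 * E * I)
print(f"Predicted tip deflection: {deflection * 1000:.2f} mm")  # ~1.6 mm
```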
Biology, however, is different. Simulating even the simplest biological systems has proven intractable. Modeling individual molecules—proteins, DNA, RNA, or lipids—demands quantum-level accuracy. Classical mechanics is insufficient to fully capture the intricacies of molecular interactions, making precise simulation of biological processes an extraordinary computational challenge.
The fundamental issue isn’t that biology is intrinsically more complex than physics or chemistry—it’s that biological complexity scales exponentially as the number of interacting atoms increases linearly.
Moreover, biology isn’t just about simulating individual molecules. It requires modeling entire networks of molecular interactions, spanning multiple levels of organization—from atoms to molecules, from molecules to cells, and from cells to tissues, organs, and complete organisms. Each layer introduces new emergent properties and feedback loops, further amplifying the computational burden.
This quickly becomes an intractable problem. As mentioned, the compute required to simulate increasingly complex systems also scales exponentially. The computational capacity of all of Earth's semiconductors combined is estimated in the range of 20 to 120 exaFLOPS (1 exaFLOPS = 10¹⁸ floating-point operations per second).
Extrapolating from the compute power required for molecular dynamics and quantum simulations of smaller systems, perfectly simulating a single eukaryotic cell would demand zettascale computing—on the order of 10²¹ floating point operations per second. This surpasses the total compute capacity of the entire planet. Even Earth’s most powerful supercomputers fall several orders of magnitude short.
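The arithmetic behind that claim, using the figures above plus an approximate exascale figure for today's largest supercomputers, looks roughly like this:

```python
# Rough arithmetic behind the claim above. The cell-simulation estimate and
# planetary-compute range come from the text; the ~1.5 exaFLOPS figure for a
# single top supercomputer is an approximate public benchmark number.
ZETTA = 1e21                             # FLOPS for a whole-cell simulation (text's estimate)
planet_low, planet_high = 20e18, 120e18  # total planetary compute, FLOPS
top_supercomputer = 1.5e18               # a single exascale machine, FLOPS (approximate)

print(f"Gap vs. all of Earth's silicon: {ZETTA / planet_high:.0f}x to {ZETTA / planet_low:.0f}x")
print(f"Gap vs. one top supercomputer: {ZETTA / top_supercomputer:.0f}x (~3 orders of magnitude)")
```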
But perhaps brute-force simulation isn’t the answer.
Instead of reconstructing reality from first principles, can we distill the fundamental laws of biology into an efficient computational model—one that serves our purposes, rather than merely replicating nature?
Just as we have likely cracked the problem of human-like intelligence using omnimodal large language models, we are on the verge of another paradigm shift. We are beginning to reverse-engineer the intelligence embedded in biological systems—the structure, function, and design principles refined over billions of years of chaotic, meandering evolutionary exploration.
Computational Biology and Enabler Technologies
The key to turning biology into an engineering discipline is simple: data, compute, and deep learning.
Once we accumulate enough real-world biological data, we can develop deep learning algorithms that uncover the underlying rules governing life itself.
Take the case of protein folding.
Through enormous experimental effort and collaboration, scientists resolved and cataloged the three-dimensional structures of over 100,000 unique proteins. Though this represents only a fraction of the billions of known protein sequences, it was enough for DeepMind to train AlphaFold—a model capable of accurately predicting protein structure from amino acid sequence alone.
What biology needs is data, compute, and deep learning. That is all.
And the last two—compute and data—are on exponential growth curves with declining costs. Despite decades of warnings from armchair experts that “Winter is Coming” for Moore’s Law, there is no sign of slowdown—only acceleration.
Improvements in deep learning algorithms tend to be incremental, but every so often, we witness a true breakthrough.
We had convolutional neural networks (CNNs) that revolutionized computer vision, transformer models that redefined natural language processing, and diffusion models that ushered in a new era of generative AI.
Biology is next.
Global investment and human capital allocated for the sole purpose of keeping these growth curves going are themselves on an exponential growth curve.
A large focus of my writing over the coming months will be to explore how these forces—data, compute, and deep learning—are converging to transform biology.
Cracking the Protein Folding Code
The challenge is simple to define yet historically difficult to solve:
Can we predict a protein’s three-dimensional structure using only its amino acid sequence?
The evolution of protein structure prediction mirrors the early days of NLP. Just as NLP once relied on handcrafted heuristics and statistical rules before deep learning came along, protein structure determination followed a similar trajectory—moving from manual experimentation to computational inference.
Before deep learning, scientists resolved and stored the structures of proteins using experimental techniques like X-ray crystallography, NMR spectroscopy, and cryo-EM. The growing database of solved structures in the Protein Data Bank became an indispensable resource for molecular biology, but it also laid the groundwork for a fundamentally new approach: predicting protein structures directly from sequence data.
Without large-scale computational models, structural biologists relied on statistical rules derived from known protein structures, identifying which amino acids tended to form helices, sheets, or turns. Early approaches classified amino acids by polarity and side-chain bulkiness, noting that alternating non-polar (Ala/Leu) and charged (Glu/Lys) residues often formed alpha helices, while glycine and proline frequently appeared in beta turns. These empirical rules enabled rough secondary structure predictions, but they were inherently limited.
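To get a feel for how crude these heuristics were, here is a toy Chou-Fasman-style predictor: it averages per-residue propensities over a sliding window and votes helix, sheet, or coil. The propensity values are rough illustrative approximations, not the published tables.

```python
# Toy Chou-Fasman-style secondary structure prediction: per-residue propensity
# scores are averaged over a sliding window and the winning class is assigned.
# Propensity values are rough illustrative approximations of the published tables.
HELIX = {"A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4, "K": 1.2, "G": 0.6, "P": 0.6}
SHEET = {"V": 1.7, "I": 1.6, "Y": 1.5, "F": 1.4, "W": 1.4, "G": 0.8, "P": 0.6}

def predict_secondary_structure(seq, window=5):
    """Assign H (helix), E (sheet), or C (coil) to each residue by comparing
    windowed average propensities; residues missing from a table default to 1.0."""
    labels = []
    half = window // 2
    for i in range(len(seq)):
        chunk = seq[max(0, i - half): i + half + 1]
        h = sum(HELIX.get(aa, 1.0) for aa in chunk) / len(chunk)
        e = sum(SHEET.get(aa, 1.0) for aa in chunk) / len(chunk)
        if h > 1.05 and h >= e:
            labels.append("H")
        elif e > 1.05:
            labels.append("E")
        else:
            labels.append("C")
    return "".join(labels)

print(predict_secondary_structure("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```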
These rule-based approaches failed to account for the long-range interactions and complex folding dynamics that determine a protein's final shape. Structure is not dictated by individual residues in isolation but by how they interact across the entire sequence, and heuristic models lacked the capacity to capture those dependencies, making them fundamentally inadequate for large-scale protein structure prediction.
Despite incremental improvements, progress stalled—not due to a lack of data, but because biological complexity outpaced classical modeling techniques. Only with deep learning and large-scale protein databases did the field experience a true paradigm shift.
How AlphaFold Works
In writing this newsletter, I finally had an excuse to read through all the AlphaFold papers and study the architectures of attention-based neural networks, convolutional neural networks (CNNs), and RFdiffusion.
But this post would become far too dense if I also attempted to explain deep learning from scratch, so if you’re unfamiliar with concepts like Neural Networks, Back-propagation, and Transformer Models, I highly recommend 3Blue1Brown’s crash course as a starting point. That being said, here’s a quick overview.
Just as NLP represents text as linear sequences of letters, protein sequences can be written as strings of amino acids: there are 20 standard amino acids, each represented by a single letter of the alphabet. This linear sequence, known as the primary structure, encodes all the information needed for a protein to fold into its functional three-dimensional shape.
The linear sequence of amino acids folds into secondary structures, such as the alpha helix (think of a coiled slinky), the beta sheet (picture a pleated ribbon), or turns and random coils (irregular loops connecting structured regions).
These secondary structures then fold together to form more complex arrangements, which we call the protein’s tertiary structure. Often, multiple folded proteins assemble into a larger complex, creating a quaternary structure.
A protein’s three-dimensional structure determines its function—how it interacts with other molecules, catalyzes reactions, or forms structural components of cells.
Instead of relying on handcrafted rules, DeepMind and others recognized that protein structures could be learned directly from data. They encoded amino acid sequences as matrices and 3D spatial coordinates as corresponding matrices, allowing deep learning models to uncover structural patterns autonomously.
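As a minimal sketch of that encoding step, the snippet below turns a sequence into a one-hot matrix with one row per residue; production pipelines layer on MSA features, positional encodings, and much more.

```python
import numpy as np

# One-hot encode an amino acid sequence into an (L x 20) matrix: one row per
# residue, one column per standard amino acid.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str) -> np.ndarray:
    matrix = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for row, aa in enumerate(sequence):
        matrix[row, AA_INDEX[aa]] = 1.0
    return matrix

encoded = one_hot_encode("MKTAYIAKQR")
print(encoded.shape)   # (10, 20)
```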
But predicting how an amino acid sequence folds requires more than just raw sequence data—it demands insight into spatial relationships. How does the model learn which residues are close in 3D space?
One key approach is Multiple Sequence Alignment (MSA). By searching large protein databases for homologous sequences across different species, researchers can align these sequences to identify evolutionarily conserved positions and co-evolving residues. If two amino acids mutate together across evolution, it suggests they are likely structurally or functionally linked. This co-evolutionary signal allows the model to infer long-range interactions, helping it predict which residues will be adjacent in the final folded structure. In AlphaFold 2, DeepMind leveraged these MSAs as a critical input, guiding the model toward an accurate contact map or distance matrix.
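To make "co-evolving residues" concrete, here is a toy calculation of mutual information between columns of a tiny, invented alignment; column pairs that change together score high and are plausible contacts. Real pipelines use alignments with thousands of sequences and correct for phylogenetic bias.

```python
import numpy as np
from collections import Counter

# Toy co-evolution signal: mutual information between two columns of an MSA.
# The alignment below is invented for illustration.
msa = [
    "MKLVE",
    "MRLIE",
    "MKLVD",
    "MRLID",
]

def mutual_information(col_i, col_j, alignment):
    n = len(alignment)
    pairs = Counter((row[col_i], row[col_j]) for row in alignment)
    pi = Counter(row[col_i] for row in alignment)
    pj = Counter(row[col_j] for row in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * np.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Columns 1 and 3 co-vary (K pairs with V, R pairs with I) -> high MI.
print(mutual_information(1, 3, msa))  # ~1.0 bit
print(mutual_information(0, 3, msa))  # 0.0 bits (column 0 is invariant)
```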
Another breakthrough was the attention mechanism, introduced by eight researchers at Google in the landmark 2017 paper Attention Is All You Need. This innovation transformed the field of NLP and later proved to be just as revolutionary for protein structure prediction.
Rather than treating each amino acid independently, self-attention enables the model to contextualize each residue based on its interactions with all others. As a result, the model refines its understanding of the protein structure layer by layer, progressively encoding richer structural relationships.
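Here is a stripped-down sketch of a single self-attention head over residue embeddings in NumPy. AlphaFold's Evoformer stacks many such heads and adds pair representations and triangle updates, but the core operation of mixing every residue with every other is the same.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over residue embeddings.
    x: (L, d) matrix, one embedding per residue. Each output row is a weighted
    mixture of all residues, so every position is contextualized by every other."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (L, L) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over residues
    return weights @ v                                # (L, d) contextualized embeddings

rng = np.random.default_rng(0)
L, d = 10, 16                                         # 10 residues, 16-dim embeddings
x = rng.normal(size=(L, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)         # (10, 16)
```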
More recent approaches are loosening the dependence on MSAs. Meta's ESMFold drops them entirely, replacing homology-based constraints with a large-scale protein language model trained on vast databases of sequences, while AlphaFold 3 slims down its MSA processing and extends prediction to complexes with DNA, RNA, ligands, and ions. By learning the fundamental statistical and geometric properties of proteins, these models can generalize more effectively, even for proteins without well-defined evolutionary homologs.
At a high level, AlphaFold’s process can be broken down into four key stages:
Input: Sequence and Evolutionary Data
The model takes an amino acid sequence as input. To infer structural constraints, it searches for evolutionarily related sequences from different species, aligning them in a Multiple Sequence Alignment (MSA). This helps the model detect which residues are conserved or co-evolve, providing clues about their spatial relationships.
Prediction: Inferring 3D Structure
Using a deep neural network, the model predicts the 3D coordinates of each residue. Instead of relying purely on local sequence information, it leverages an attention mechanism that helps it determine which residues are likely to interact, even if they are far apart in the sequence.
Initial Guess & Error Calculation
At first, the predictions are random and highly inaccurate. The correct experimental structure (from the Protein Data Bank) is used as a reference. A loss function computes the error between predicted and actual atomic positions.
Learning & Optimization
During training, the model adjusts its internal parameters (weights and biases) using gradient descent, gradually improving its predictions over many iterations. With each cycle, it refines its understanding of how proteins fold, using both evolutionary constraints (MSA-based insights) and contextual attention-based learning.
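As a deliberately simplified stand-in for that loop, the sketch below trains a small network to map encoded sequence features to per-residue 3D coordinates with a mean-squared-error loss and gradient descent. AlphaFold's actual architecture (Evoformer plus structure module) and loss (FAPE) are far richer; this only illustrates the error-and-update cycle.

```python
import torch
import torch.nn as nn

# Simplified stand-in for the training loop described above: a small network
# maps one-hot sequence features to per-residue 3D coordinates and is trained
# by gradient descent against reference coordinates.
L, n_aa = 64, 20
model = nn.Sequential(nn.Linear(n_aa, 128), nn.ReLU(), nn.Linear(128, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

sequence_features = torch.randn(L, n_aa)      # placeholder for encoded sequence + MSA features
true_coords = torch.randn(L, 3)               # placeholder for PDB reference coordinates

for step in range(1000):
    predicted_coords = model(sequence_features)
    loss = ((predicted_coords - true_coords) ** 2).mean()   # error vs. experimental structure
    optimizer.zero_grad()
    loss.backward()                           # back-propagate the error
    optimizer.step()                          # adjust weights and biases

print(f"final loss: {loss.item():.4f}")
```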
Through this process, AlphaFold learns to infer structure from sequence with ever-increasing accuracy, refining its understanding of protein folding in a generalizable manner.
But this is just the beginning.
In 2018, AlphaFold 1 laid the foundation. Two years later, AlphaFold 2 won CASP14, and by 2022 DeepMind had released predicted structures for over 200 million proteins.
By 2021, RoseTTAFold demonstrated a new approach to protein structure prediction, and in late 2022 RFdiffusion introduced generative design capabilities, shifting the field from prediction to de novo protein engineering. Then came AlphaFold 3, Chai-1, and AlphaProteo in 2024, marking an unprecedented acceleration in the field.
Biology will no longer be governed by intuition and trial-and-error.
For the first time in history, life itself is becoming programmable. The arcane is yielding to the computational. What was once the domain of nature alone is now a design space.
Great tools empower people to create things even their makers could never have imagined.
So what will we create? How will we use these tools to rewrite our biology—and engineer immortality?
That’s what I’ll be exploring in the next newsletter.