The Phase Problem
Hamiltonian Biology
There is a measurement question that sits underneath almost every failure mode in computational biology, and it is almost never asked directly. It is not whether we have enough data, and it is not whether our models are expressive enough. It is whether the measurements we collect are, in principle, sufficient to recover what we claim to be inferring: whether the inverse problem is well-posed.
In most of modern molecular biology, it is (likely) not.
What an expression measurement actually is
Start with a single number: gene X, sample Y, 5.3 TPM. This is what a transcriptomic measurement gives you. The number is real, reproducible, and informative. It is also, in a precise technical sense, a collapsed quantity. It is the squared magnitude of a superposition.
To see why, consider what produced it. Gene X is not driven by a single regulatory input. Its expression level at any moment is the resultant of many overlapping regulatory programs operating simultaneously: transcription factor binding cascades, chromatin accessibility states, signaling feedback loops, competition between activating and repressing complexes. Each of these programs contributes a component to the final expression level. The measurement records their sum.
In physics, this situation has a name. When multiple waves superpose at a point in space, what a detector records is the intensity, which is the square of the total amplitude at that point. The intensity is real and positive. It contains no sign information. It contains, crucially, no phase information. Two completely different configurations of source waves can produce identical intensity at the detector if their amplitudes sum to the same resultant.
The transcriptomic measurement is an intensity measurement. The “waves” are regulatory programs, propagating through the gene regulatory network via protein binding and transcription factor occupancy rather than through space via field amplitudes, but the mathematical structure is the same. What you measure is the scalar resultant. What produced it, the relative contributions and phase relationships of the underlying regulatory programs, is not directly observable.
Figure 1: The Intensity Collapse
The diagram above makes this concrete. Three regulatory programs, each with its own amplitude and phase, superpose to produce a resultant. The intensity detector (sequencer) records a single scalar: 5.3 TPM. The phase information, which encodes the relative timing, directionality, and causal relationships between the regulatory programs, is not in that number.
This is not a technology limitation. It is not solved by deeper sequencing, by single-cell resolution, or by larger cohorts. The information is structurally absent from the measurement type. More intensity measurements of the same kind give you more constraints on the sum, but they do not give you access to the phases.
The inverse problem
The question computational biology has been trying to answer, for at least two decades, is the inverse problem: given a set of expression measurements, recover the underlying regulatory structure that produced them. Every pathway analysis, every causal inference method, every foundation model trained on transcriptomic data is, in some form, an attempt at this inversion.
The inverse problem of recovering wave sources from intensity measurements alone is a classical problem in optics, known as the phase retrieval problem. It was studied extensively in the context of X-ray crystallography, electron microscopy, and astronomical imaging. The finding from that literature is unambiguous: the problem is, in general, ill-posed. Given intensity measurements alone, there exist multiple distinct source configurations consistent with the data. Without additional constraints or information, the inversion is not unique.
More precisely, if we denote the expression vector as e ∈ ℝⁿ and the underlying regulatory state as a complex-valued superposition ψ in a high-dimensional regulatory space, then the measurement operation is:
eᵢ = |〈φᵢ | ψ〉|²
where φᵢ is the measurement basis for gene i (its regulatory context, effectively), and the squared modulus collapses phase. The forward map from ψ to e is many-to-one. The inverse has no unique solution.
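The degeneracy is easy to demonstrate numerically. The sketch below is illustrative only; the dimensions and coefficients are arbitrary stand-ins for the projections 〈φᵢ | ψ〉. It constructs two distinct regulatory states whose intensity measurements are identical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
coeffs = rng.normal(size=n) + 1j * rng.normal(size=n)   # <phi_i|psi> for 5 genes

# Rotate each coefficient by an arbitrary, independent phase:
rotated = coeffs * np.exp(1j * rng.uniform(0, 2 * np.pi, size=n))

e = np.abs(coeffs) ** 2        # observed expression vector
e_rot = np.abs(rotated) ** 2   # expression from a distinct regulatory state

assert np.allclose(e, e_rot)   # the detector cannot tell them apart
```

Any per-gene phase rotation is invisible to the detector; the assertion passes for every random seed.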
This means that any method attempting to infer causal regulatory structure from expression data alone is solving an underdetermined problem. It may find a solution, but it cannot, in principle, verify that the solution is unique, or that it corresponds to the true regulatory state rather than to one of the many other states that produce the same intensity pattern.
The models do not fail because they are small. They fail because the data they are trained on does not contain the information required to solve the problem they are attempting to solve.
The spectral structure of the interference pattern
Before arriving at the resolution, it is worth dwelling on what is actually structured about the interference pattern, because this is what makes the problem tractable at all.
Regulatory programs do not interact arbitrarily. They propagate through a network, the gene regulatory network, with a topology that is, to first approximation, fixed by the genome and the proteome. This network is a graph, and signals propagating through it have a natural spectral decomposition.
The Graph Fourier Transform (GFT) provides this decomposition. Given a graph G with adjacency matrix A and degree matrix D, the graph Laplacian is L = D - A. Its eigendecomposition L = UΛUᵀ defines a set of orthogonal eigenvectors (graph Fourier modes) ordered by their eigenvalues. Low-eigenvalue modes are smooth, globally coordinated signals that span the whole network: developmental programs, homeostatic regulation, broad inflammatory responses. High-eigenvalue modes are sharp, localized perturbations that activate small clusters of nodes: pathway-specific responses, drug targets, disease-specific alterations.
Figure 2: GFT Spectral Decomposition
The observed expression vector is a superposition of these Fourier modes, each with a complex coefficient encoding amplitude and phase. The GFT decomposes the interference pattern back into its spectral components. A drug perturbation, for example, projects onto a specific region of the eigenspectrum: it lights up a characteristic set of Fourier modes, and the projection coefficients are the causal coordinates of that perturbation in the regulatory space.
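A minimal sketch of the decomposition, on a toy six-gene network (the adjacency matrix and expression values here are invented for illustration):

```python
import numpy as np

# Toy gene regulatory network: symmetric adjacency on 6 genes.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

D = np.diag(A.sum(axis=1))
L = D - A                      # graph Laplacian
lam, U = np.linalg.eigh(L)     # eigenvalues (frequencies) and Fourier modes

e = np.array([5.3, 4.8, 5.1, 1.2, 0.9, 1.1])  # expression vector (e.g. TPM)
e_hat = U.T @ e                # Graph Fourier Transform: project onto modes

# Low-lambda coefficients capture smooth, network-wide programs;
# high-lambda coefficients capture sharp, localized perturbations.
for freq, coef in zip(lam, e_hat):
    print(f"lambda = {freq:5.2f}   coefficient = {coef:6.2f}")
```

On a real gene regulatory network the same three steps, Laplacian, eigendecomposition, projection, apply unchanged; only the scale differs.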
This spectral structure is the reason the problem is not completely hopeless. The regulatory space, while high-dimensional, is not unstructured. It has a basis. Perturbations project onto this basis in characteristic ways. The interference pattern, if you can decompose it correctly, carries fingerprints of its sources.
The problem remains that decomposing an interference pattern into its sources from intensity measurements alone is the phase retrieval problem. The spectral basis exists and is real, but recovering the complex coefficients (including their phases) from the real-valued expression vector requires additional information.
The holographic solution
In optics, there is exactly one general solution to the phase retrieval problem. It is holography.
The principle is as follows. Take the unknown wave (the object wave, whose phase you cannot measure directly) and introduce a second wave of known amplitude and phase: the reference beam. Allow the two waves to interfere. The interference pattern, recorded on a detector, now contains phase information, because the cross-term between the reference beam and the object wave encodes the phase difference between them. Since the reference beam’s phase is known, the object wave’s phase can be recovered.
The critical requirement is that the reference beam must be known, controlled, and coherent with the measurement.
Figure 3: Holographic Measurement
The translation to biology is precise. The unknown regulatory state (ψ) is the object wave. A controlled perturbation, an siRNA knockdown of a specific gene, a small molecule at a defined concentration applied to a patient-derived organoid system, is the reference beam. The measured differential expression (Δe = e_perturbed - e_baseline) is the interference pattern. Because the perturbation is known and controlled, the cross-term between the perturbation and the regulatory state encodes the phase relationship between them.
Concretely, if the baseline state is ψ and the perturbation introduces a known change δ in the regulatory space, the perturbed state is ψ + δ, and the differential expression is:
Δeᵢ = |〈φᵢ | ψ + δ〉|² - |〈φᵢ | ψ〉|²
= 2 Re(〈φᵢ | ψ〉 · 〈φᵢ | δ〉*) + |〈φᵢ | δ〉|²
The first term is a cross-term between the regulatory state and the perturbation. It is linear in ψ rather than quadratic, which means it is sensitive to the sign and phase of the regulatory state, not just its magnitude. The cross-term breaks the symmetry that makes the inverse problem ill-posed. With a known reference perturbation δ and a measured Δe, the cross-term can be inverted to recover information about the phase structure of ψ.
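To see the cross-term doing its work, here is a numerical sketch. It treats each gene's projection 〈φᵢ | ψ〉 as an unknown complex amplitude aᵢ and uses two known, non-collinear reference beams; everything here is synthetic and stands in for what a real perturbation panel would provide:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # number of genes (measurement directions)

# Unknown regulatory state: one complex amplitude a_i = <phi_i|psi> per gene.
a = rng.normal(size=n) + 1j * rng.normal(size=n)

def delta_e(a, b):
    """Differential intensity for a known reference with b_i = <phi_i|delta>."""
    return 2 * np.real(a * np.conj(b)) + np.abs(b) ** 2

# Two known, generically non-collinear reference beams (perturbations).
b1 = rng.normal(size=n) + 1j * rng.normal(size=n)
b2 = rng.normal(size=n) + 1j * rng.normal(size=n)

d1, d2 = delta_e(a, b1), delta_e(a, b2)

# Each beam gives one real linear equation per gene in (Re a_i, Im a_i):
#   d - |b|^2 = 2*Re(b)*Re(a) + 2*Im(b)*Im(a)
a_hat = np.empty(n, dtype=complex)
for i in range(n):
    M = 2 * np.array([[b1[i].real, b1[i].imag],
                      [b2[i].real, b2[i].imag]])
    rhs = np.array([d1[i] - abs(b1[i]) ** 2,
                    d2[i] - abs(b2[i]) ** 2])
    re_a, im_a = np.linalg.solve(M, rhs)
    a_hat[i] = re_a + 1j * im_a

assert np.allclose(a_hat, a)  # phase recovered, not just magnitude
```

Each beam contributes one real equation per gene; two independent beams make the system square. Intensity measurements alone could never support that final assertion.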
This is the formal content of the claim that perturbational data is qualitatively different from observational data. The difference is not one of quantity or resolution. It is an information-theoretic difference in the type of measurement being made. Observational data gives you intensities. Perturbational data, when the perturbation is known and controlled, gives you cross-terms. Cross-terms carry phase.
Why transformers without perturbational data hit a wall
The implication for large foundation models is direct. A transformer trained on observational transcriptomic data, regardless of its size or the size of its training corpus, is fitting a function from intensity vectors to intensity vectors. It is learning the statistical regularities of a set of squared-magnitude measurements. It will learn the correlational structure of the interference pattern, and it may learn it extremely well. What it cannot learn is the phase structure, because the phase structure is not in the data.
The wall is not a capacity wall. It is an information wall. The model will interpolate beautifully within its training distribution and fail systematically on novel perturbations, new drug exposures, rare mutations, and context shifts, precisely because those failures correspond to changes in the phase structure of the regulatory state. Changes in phase that produce identical or similar intensity patterns in one context produce very different intensity patterns in another. The model, having seen only intensities, has no basis for distinguishing these cases.
Figure 4: The Information Regime Map
Most of the field lives in the upper-left quadrant of this map: observational data analyzed correlatively. The upper-right quadrant, perturbational data analyzed correlatively (classical CRISPR screens, for example), introduces the reference beam but discards the phase information after the perturbation, analyzing the result with differential expression statistics rather than with a framework that preserves the cross-term structure. The lower-left quadrant, causal inference from observational data (Mendelian randomization, instrumental variables), attempts to recover causal structure from intensities using additional assumptions. It is structurally limited because those assumptions are proxies for what only a reference beam can actually provide.
The lower-right quadrant, perturbational data analyzed in a framework that preserves the cross-term structure, is the only regime where the inverse problem is well-posed. Holographic biology, the subject of the next essay, lives there.
GFT as proof, not architecture
One clarification is worth making explicit: the GFT is not being proposed here as an architectural component. It is being proposed as a proof.
If the data structure is correct, if you have genuine perturbation triples (pre-perturbation state, known perturbation, post-perturbation state), then a sufficiently expressive model trained on that data will learn the interference structure. It will learn phase. It will learn the spectral decomposition. It will learn the causal coordinates of each perturbation in the regulatory space. Not because you told it to, but because there is no other way to minimize the prediction loss. The cross-terms are in the data; the model’s objective function demands that it use them.
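The data structure in question is simple enough to write down. A hypothetical schema (the field names are mine, not a standard):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PerturbationTriple:
    baseline: np.ndarray      # pre-perturbation expression, e (n genes)
    perturbation: np.ndarray  # known reference delta in regulatory space
    perturbed: np.ndarray     # post-perturbation expression

    @property
    def delta_e(self) -> np.ndarray:
        """The interference pattern the model trains on."""
        return self.perturbed - self.baseline
```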
The GFT provides the mathematical framework that demonstrates why this is guaranteed. It tells you that the regulatory space has a basis, that perturbations project onto this basis in a principled way, and that the cross-terms in the perturbational data are sufficient to recover the projection coefficients (the causal coordinates) up to the resolution permitted by the graph topology. This is not a lucky statistical artifact. It follows from the information content of the measurement.
What the GFT proof buys you is the ability to characterize the limits of what is learnable, to know ahead of time what perturbation density is required to cover the eigenspectrum, which modes are recoverable from which reference beams, and where the remaining ambiguity lies. The model learns emergently what the theory guarantees is learnable. The theory tells you what is not learnable, and why, and what additional measurements would resolve it.
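As a sketch of what such a characterization might look like: given the Laplacian eigenbasis U and a panel of known perturbation vectors, one can ask directly which modes the panel excites and which it leaves dark. The helper below is hypothetical, written against the toy conventions used earlier:

```python
import numpy as np

def mode_coverage(U, deltas, tol=1e-6):
    """Fraction of energy each perturbation deposits in each Fourier mode,
    plus the modes that no perturbation in the panel excites above tol.

    U      : (n, n) Laplacian eigenvector matrix (columns = modes)
    deltas : (k, n) known perturbation vectors in gene space
    """
    proj = deltas @ U                          # (k, n) projection coefficients
    energy = proj ** 2
    energy /= energy.sum(axis=1, keepdims=True)
    uncovered = np.where(energy.max(axis=0) < tol)[0]
    return energy, uncovered
```

Modes in `uncovered` are exactly the directions in regulatory space about which the panel, however large the resulting dataset, carries no phase information.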
The honesty about what breaks
Physical waves oscillate. Regulatory programs do not oscillate in the same sense; they are better described as signals propagating through a network under kinetic constraints, with timescales set by protein synthesis, degradation, and binding rates rather than by wave frequency. The “phase” of a regulatory program is a metaphor for its relative timing and directionality, not a literal oscillatory phase angle.
The consequence is that the spectral decomposition is not as clean as in physical optics. The graph Laplacian is a real symmetric matrix; its eigenvectors are real, not complex. The “phases” recoverable from perturbational data are real-valued phase differences, not complex exponentials. The information gain from holographic measurement is real but not as complete as in the optical case.
Additionally, the regulatory network is not static. It reconfigures with cell state, disease context, and drug exposure. The GFT eigenbasis computed from one network topology may not be the correct basis for a different cell state. Transformers, with their input-dependent attention mechanism, handle this automatically: they compute an effective graph for each sample. The GFT proof holds approximately, with the quality of the approximation depending on how much the network topology varies across the conditions in the training data.
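The claim about attention can be made concrete. A single attention head computes a row-stochastic matrix over tokens that functions as an input-dependent adjacency. The sketch below is deliberately simplified (one head, no value projection, no learned output) and is meant only to show the correspondence:

```python
import numpy as np

def effective_graph(X, Wq, Wk):
    """Attention weights as an input-dependent effective adjacency matrix.
    X: (n_genes, d) token embeddings for one sample."""
    Q, K = X @ Wq, X @ Wk
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A_eff = np.exp(scores - scores.max(axis=1, keepdims=True))
    A_eff /= A_eff.sum(axis=1, keepdims=True)   # row-normalized "adjacency"
    return A_eff
```

Each sample gets its own A_eff, which is what lets a transformer track a regulatory topology that reconfigures with cell state.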
These are not reasons to abandon the framework. They are the terms under which the framework’s predictions can be tested. A theory that specifies its own failure modes is more useful than one that does not.
What this changes
Understanding the biology means recovering the regulatory state, including its phase structure, not just characterizing the interference pattern it produces. This distinction matters for everything downstream. Predicting how a patient will respond to a drug is a question about how a controlled perturbation, the drug, will interact with the patient’s specific regulatory state, including the causal structure encoded in its phase. A model trained on observational data can tell you what the average patient with this intensity pattern has experienced. It cannot tell you what will happen when you introduce a new reference beam into this patient’s specific interference structure.
The phase problem is not a detail. It is the reason medicine has struggled to move from population-level statistics to individual-level causal prediction. The framework described here does not solve that problem by being clever about architecture. It solves it by insisting on the right kind of measurement.
The question that remains open, and that I think is genuinely unresolved, is how many reference beams are required. How many distinct controlled perturbations, across how many patient-derived systems, are needed to cover the eigenspectrum sufficiently to make the regulatory state reconstruction reliable? The answer depends on the effective dimensionality of the biological phase space, which is itself a quantity that has not been measured. That is not a reason to wait before building. It is a reason to build in a way that generates the evidence needed to answer it.
Hamiltonian Biology covers the mathematical and epistemic foundations of biological measurement. Prior essays in this series: “What the Loss Function Cannot See,” “Why Biology Breaks Foundation Models,” and the inaugural post on the Hamiltonian framing.