Forward Gradients, Exact Accumulation, and the Memory Geometry of Training
The Activation Tape Problem
Reverse-mode automatic differentiation (backpropagation) has a well-known memory cost. To compute gradients during the backward pass, the system must retain intermediate activations from the forward pass. For a network with L layers and batch size B, this imposes an auxiliary memory requirement proportional to L · B (times the per-layer activation width). The activations must persist from their creation during the forward pass until their consumption during the backward pass; their lifetime spans the entire training step.
This is a structural property of reverse-mode AD, not an implementation detail. The backward pass requires the intermediate values as a contextual resource, which in the Fidelity framework’s terminology makes it a coeffect. The activation tape is a memory obligation that the computation imposes on its environment.
Gradient checkpointing and other memory-reduction techniques trade compute for memory, recomputing activations during the backward pass to avoid storing them. These techniques reduce the memory footprint (checkpointing can bring the activation cost down to O(√L) at the price of recomputation) but do not change the fundamental structure: reverse-mode AD requires either storing or recomputing intermediate values.
The Forward Gradient
Baydin, Pearlmutter, Syme, Wood, and Torr [1] demonstrated a different approach. The forward gradient is an unbiased estimate of the gradient computed via forward-mode automatic differentiation. For a random perturbation vector v (e.g., v ~ N(0, I)), the forward gradient scales v by the directional derivative of f at θ along v:

g(θ) = (∇f(θ) · v) v
This is evaluated in a single forward pass. There is no backward pass. There is no activation tape.
The gradient estimate is unbiased: its expectation over random perturbation vectors equals the true gradient. The variance depends on the perturbation distribution and can be controlled through standard variance reduction techniques. The tradeoff is statistical: exact gradients via reverse-mode vs. unbiased estimates via forward-mode, with the forward approach eliminating the memory obligation entirely.
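As a minimal sketch of the estimator (not Fidelity code; the `Dual` class and function names are illustrative), dual-number forward mode computes the directional derivative in one pass, and averaging the forward gradient over all Rademacher sign vectors recovers the exact gradient, illustrating unbiasedness:

```python
import itertools

# Forward-mode AD with dual numbers: each value carries (primal, tangent).
class Dual:
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.primal + o.primal, self.tangent + o.tangent)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.primal * o.primal,
                    self.primal * o.tangent + self.tangent * o.primal)
    __rmul__ = __mul__

def f(x, y, z):
    # A toy smooth objective; its analytic gradient is (2x + y, x + 3, 2z).
    return x * x + x * y + 3 * y + z * z

def forward_gradient(theta, v):
    # One forward pass with tangent direction v yields the JVP ∇f(θ)·v;
    # scaling v by it gives the forward gradient. No backward pass, no tape.
    duals = [Dual(t, vi) for t, vi in zip(theta, v)]
    jvp = f(*duals).tangent
    return [jvp * vi for vi in v]

theta = (1.0, 2.0, 3.0)
# Rademacher perturbations: averaging over all sign patterns recovers the
# exact gradient, because E[v vᵀ] = I.
n = len(theta)
signs = list(itertools.product((-1.0, 1.0), repeat=n))
avg = [0.0] * n
for v in signs:
    g = forward_gradient(theta, v)
    avg = [a + gi / len(signs) for a, gi in zip(avg, g)]
print(avg)  # → [4.0, 4.0, 6.0], the analytic gradient (2x+y, x+3, 2z)
```

Any single sample `forward_gradient(theta, v)` is a noisy rank-one estimate; only its expectation matches the true gradient, which is the statistical tradeoff described above.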
The Coeffect Signature
In the Fidelity framework’s coeffect system, these two approaches have distinct signatures:
| Property | Reverse-Mode | Forward-Mode [1] |
|---|---|---|
| Auxiliary memory | O(n) per layer (stored activations) | O(1) |
| Gradient quality | Exact (full Jacobian transpose) | Unbiased estimate (directional derivative) |
| Activation tape | Required; lifetime spans full backward pass | Not required |
| Escape analysis | Intermediate values escape layer scope | No intermediate values escape layer scope |
The forward-mode signature is significant for the escape analysis described in Section 3.2 of the DTS/DMM paper. When no intermediate values escape their creating scope, every allocation is stack-eligible. The escape classification for every intermediate value is StackScoped; the allocation strategy is memref.alloca; the lifetime bound is the lexical scope of the layer computation.
This is a verifiable compile-time property. Given a computation graph annotated with AD mode, the Fidelity framework’s lifetime analysis can confirm that forward-mode imposes no lifetime obligations beyond the current layer’s scope. The coeffect system does not need heuristics or runtime checks; it follows from the structure of forward-mode evaluation.
The Quire Connection
Forward-mode computes directional derivatives via inner products. The inner product is an accumulation of products, exactly the operation the posit quire makes exact.
The quire holds intermediate multiply-add results without rounding. For posit32, the quire occupies 16 × 32 = 512 bits = 64 bytes, exactly one cache line. Rounding occurs once, when the final accumulated result is converted back to a posit value [2]. This single-rounding property is significant for gradient accumulation: the directional derivative computation accumulates across all parameters in a layer, and any per-step rounding error compounds across millions of parameters during training.
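A rough illustration of the single-rounding property, using Python rationals to stand in for the 512-bit fixed-point quire (a simulation of the accumulation discipline, not posit arithmetic):

```python
import struct
from fractions import Fraction

def to_f32(x):
    # Round a Python float to the nearest binary32 value.
    return struct.unpack('f', struct.pack('f', x))[0]

def dot_rounded(a, b):
    # Conventional accumulation: every multiply-add rounds to binary32.
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc

def dot_quire_style(a, b):
    # Quire-style accumulation: keep products exact (here via rationals),
    # round once when the final sum is converted back.
    acc = Fraction(0)
    for x, y in zip(a, b):
        acc += Fraction(x) * Fraction(y)
    return to_f32(float(acc))

# One large term followed by many tiny ones: per-step rounding absorbs the
# tiny contributions; single rounding preserves their accumulated effect.
a = [1.0] + [1e-10] * 1000
b = [1.0] * 1001
print(dot_rounded(a, b))      # 1.0 — the 1e-10 terms are lost step by step
print(dot_quire_style(a, b))  # slightly above 1.0: accumulated exactly, rounded once
```

The same effect at float32 scale is what the document means by per-step rounding error compounding across millions of parameters.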
The quire’s coeffect profile in this context:
- Allocation: 64 bytes on stack (CPU) or 512-bit fabric pipeline (FPGA)
- Lifetime: bounded by the forward pass through one layer; no escape
- Capability: available on CPU (software emulation, ~50 cycles/FMA) and FPGA (hardware pipeline, 1 cycle/FMA); unavailable on neuromorphic targets without exact accumulation support
The quire’s lifetime aligns with the forward gradient’s memory profile. Both are scoped to a single layer’s computation. Neither requires persistence beyond the layer boundary. The coeffect system tracks both through the same inference pipeline, and the design-time diagnostic shows them together:
gradient_estimate: float<loss · param⁻¹>
AD mode: forward (Baydin et al. [1])
Accumulation: Quire (exact, single rounding)
├─ x86_64: stack, 64 bytes, ~50 cycles/fma, O(1) auxiliary memory
├─ xilinx: 512-bit fabric pipeline, 1 cycle/fma, O(1) auxiliary
└─ Memory: no activation tape, no escape, fully stack-eligible

The Quire as a Tier 2 Loop Invariant
The quire’s exactness guarantee has a precise reading in Hoare logic, and stating it that way clarifies what verification can establish at compile time and what it cannot.
The accumulation loop that computes Σᵢ aᵢ · bᵢ maintains an invariant at every iteration: the accumulated value occupies at most the quire width Q bits (512 bits for posit32). The invariant is the loop’s precondition, the loop body preserves it, and the loop’s exit delivers a postcondition that the final value is the exact sum of the products with no intermediate rounding. In Hoare’s sequential composition rule, the invariant is the assertion P for which {P} body {P} is established locally, and the conclusion is the proof that the loop achieves the postcondition the next computation depends on.
The invariant is decidable in QF_LIA / QF_BV from the input range and the loop iteration bound. For a layer with n parameters and a per-product bit-width w derived from the chosen posit configuration, the maximum accumulated bit-width is w + ⌈log₂ n⌉, and the invariant w + ⌈log₂ n⌉ ≤ Q is a single linear inequality that Z3 discharges at design time. The engineer never declares this property; the framework derives it from the layer’s parameter count, the posit width, and the quire width, which are all known at compile time. The Tier 2 obligation is therefore satisfied automatically for any layer whose dimensions are statically bounded.
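The inequality is simple enough to sketch directly; the per-product bit-width below is an assumed illustrative value, not a figure taken from the posit standard:

```python
import math

QUIRE_BITS = 512  # quire width for posit32 (16 × posit width, per [2])

def max_accumulated_bits(n_params, product_bits):
    # Upper bound on the bit-width of a sum of n fixed-point products:
    # each product needs product_bits, and summing n of them can grow the
    # magnitude by at most ceil(log2 n) carry bits.
    return product_bits + math.ceil(math.log2(n_params))

def quire_invariant_holds(n_params, product_bits, quire_bits=QUIRE_BITS):
    # The single linear inequality the framework would hand to an SMT solver:
    #   product_bits + ceil(log2 n) <= quire_bits
    return max_accumulated_bits(n_params, product_bits) <= quire_bits

# Illustrative numbers (product_bits = 128 is an assumption for the sketch):
print(quire_invariant_holds(n_params=4096, product_bits=128))    # True
print(quire_invariant_holds(n_params=2**400, product_bits=128))  # False
```

Because n, w, and Q are all compile-time constants for a statically shaped layer, evaluating this inequality is the entire proof obligation.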
The Fwd ⊣ Bwd adjunction described in the categorical deep learning entry is the structural reason the same invariant can be checked at every lowering pass: each lowering is a functor that preserves the adjunction, and the quire-width invariant is a property of the adjoint pair that survives the functor’s action. In sheaf-theoretic language, the quire invariant is a section of the accumulation sheaf over the compilation poset, and the dual-pass discharge at each lowering is the local check that the section restricts correctly to the next stage. The compilation sheaf design document treats the broader compositional story; here it is enough to note that the quire’s exactness is a verifiable property at every stage from PSG to native binary, not only at the high-level mathematical specification.
Dimensional Consistency Under Differentiation
The third property is dimensional. The DTS framework’s dimensional algebra is closed under differentiation. If a function f maps values with dimension d₁ to values with dimension d₂, then the derivative df/dx carries dimension d₂ · d₁⁻¹. This follows from the abelian group structure: differentiation is division in the dimensional algebra, and division is closed in an abelian group.
For gradient computation, this means:
- The gradient of a loss function with dimension d_loss with respect to a parameter with dimension d_param carries dimension d_loss · d_param⁻¹
- Gradient accumulation across parameters with different dimensions is dimensionally constrained: a gradient with dimension d_loss · d₁⁻¹ cannot be accumulated with a gradient of dimension d_loss · d₂⁻¹ (for d₁ ≠ d₂) without a dimensional error
- The chain rule preserves dimensional consistency through the computation graph; each gradient node inherits a dimension from the chain rule, verified by the same Gaussian elimination that verifies the forward pass
This verification is decidable, requires no annotation beyond the physical dimensions already present in the forward computation, and has zero runtime cost. The inference algorithm from Section 2.2 of the DTS/DMM paper extends to auto-differentiation graphs without modification.
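A toy model of the closure property, with dimensions as exponent vectors over three base units (a sketch of the algebra, not the DTS implementation):

```python
from fractions import Fraction

# A dimension as an exponent vector over (mass, length, time).
# Multiplication adds exponents and division subtracts them, so the set of
# dimensions forms an abelian group under componentwise addition.
def dim(M=0, L=0, T=0):
    return (Fraction(M), Fraction(L), Fraction(T))

def dmul(a, b):
    return tuple(x + y for x, y in zip(a, b))

def ddiv(a, b):
    return tuple(x - y for x, y in zip(a, b))

DIMENSIONLESS = dim()
MASS = dim(M=1)
LENGTH = dim(L=1)

# Differentiation divides dimensions: d(loss)/d(param) has dimension
# loss · param⁻¹, which is again a valid dimension (closure).
grad_wrt_length = ddiv(DIMENSIONLESS, LENGTH)  # dimensionless loss, length param
assert grad_wrt_length == dim(L=-1)

# Gradients of parameters with different dimensions cannot be summed:
# their exponent vectors differ.
grad_wrt_mass = ddiv(DIMENSIONLESS, MASS)
assert grad_wrt_mass != grad_wrt_length
print("dimensional checks passed")
```

The exponent-vector encoding is also why the verification reduces to linear algebra: each chain-rule edge contributes a linear equation over exponents, and Gaussian elimination decides the system.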
In the sheaf-theoretic framing developed in the compilation sheaf design document, the chain rule’s preservation of dimensional consistency is the global section condition on the differentiation sheaf. The computation graph is a finite directed acyclic poset; the stalks at each node are dimensional annotation bundles in the dimension group; the structure maps at each edge are the chain rule applications that derive the gradient annotation at the downstream node from the gradient annotation at the upstream node. A consistent dimensional assignment across the entire gradient graph is exactly a global section of this sheaf. The Tier 1 verification described above is the witnessing mechanism: Gaussian elimination over the structure-map equations decides whether the global section exists, and if so produces the principal one.
The O(1) auxiliary memory profile of forward-mode AD has a parallel sheaf reading on the escape classification sheaf. The escape classification at each node carries a lifetime ordering (stack < arena < heap < static), and the structure maps are the lifetime-promotion rules that propagate constraints from use sites back to creation sites. When every intermediate value in a forward-mode pass is StackScoped (no escape, no activation tape), the escape classification sheaf has trivial first cohomology: there is no obstruction to extending each layer’s local section to a global section over the entire forward pass, because every value’s lifetime is contained within its creating scope and no structure map needs to widen any classification. Reverse-mode AD’s activation tape produces the opposite situation: intermediate values must outlive their creating scopes to be available during the backward pass, which forces lifetime promotions and creates non-trivial obstructions that the escape classification sheaf must resolve through arena allocation. The “memory profile” difference between forward-mode and reverse-mode is, in this reading, a difference in the cohomology of the escape classification sheaf.
Physics-Informed Loss Terms
The dimensional verification has a concrete application in physics-informed neural networks [3]. A loss term that penalizes violations of Newton’s second law computes the residual F − m·a and minimizes its square. DTS can verify at compile time that F, m, and a carry dimensions kg · m · s⁻², kg, and m · s⁻² respectively, and that the subtraction is dimensionally consistent.
This is a structural check on the loss function’s definition, not a runtime constraint on the model’s outputs. It ensures that the physics constraints imposed during training are dimensionally well-formed. Existing ML frameworks cannot provide this verification because dimensional information is never encoded in the type system and is not available at any stage of the compilation pipeline.
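A sketch of this structural check, again with dimensions as exponent tuples over (mass, length, time); the helper names are hypothetical, not the DTS API:

```python
# Dimensions as (mass, length, time) exponent tuples.
FORCE = (1, 1, -2)   # kg · m · s⁻²
MASS  = (1, 0, 0)    # kg
ACCEL = (0, 1, -2)   # m · s⁻²

def dmul(a, b):
    # Multiplying quantities adds dimension exponents.
    return tuple(x + y for x, y in zip(a, b))

def check_sub(a, b):
    # Subtraction is only defined between identical dimensions.
    if a != b:
        raise TypeError(f"dimension mismatch: {a} vs {b}")
    return a

# Residual of Newton's second law, F - m·a: the check passes...
residual_dim = check_sub(FORCE, dmul(MASS, ACCEL))
print("F - m·a is dimensionally consistent:", residual_dim)

# ...while a mistyped loss term (F - m·v, velocity in place of
# acceleration) is rejected when the loss definition is checked.
VELOCITY = (0, 1, -1)
try:
    check_sub(FORCE, dmul(MASS, VELOCITY))
except TypeError as e:
    print("rejected:", e)
```

The check runs over the loss term's definition, before any training data exists, which is the sense in which it is structural rather than a runtime constraint.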
The Composition
Three independently established properties compose in the PSG:
graph TD
A[DTS: Dimensional Consistency] -->|"Chain rule closed<br>under abelian group"| D[Unified Gradient<br>Computation]
B[Forward-Mode AD: Memory Minimal] -->|"O(1) auxiliary,<br>no activation tape"| D
C[Quire: Exact Accumulation] -->|"Single rounding,<br>512-bit for posit32"| D
D -->|"Dimensionally verified<br>Memory-minimal<br>Numerically exact"| E[Design-Time<br>Diagnostic]
- DTS verifies dimensional consistency of the gradient graph, including through the chain rule, at compile time
- Forward-mode AD eliminates the activation tape coeffect, making gradient computation stack-eligible
- The quire provides exact accumulation for the inner products that forward-mode computes
Each property is established independently in the literature ([1] for forward gradients, [2] for quire accumulation, Section 2 of DTS/DMM for dimensional type systems). Their composition within the PSG is the contribution: a system where gradient computation is simultaneously dimensionally verified, memory-minimal, and numerically exact, with all three properties visible and verifiable at design time.
Representation Selection for Training
The representation selection framework from the companion entry on posit arithmetic applies directly to training workloads. Neural network activations and gradients have well-characterized value distributions, typically concentrated near zero with heavy tails. The dimensional range of activations in a specific layer is inferrable from training statistics or from dimensional constraints on the input domain.
Given this range, the compiler’s representation selection function can choose posit widths that concentrate precision where the values cluster. The quire provides exact gradient accumulation regardless of the chosen posit width, eliminating the rounding errors that compound across millions of parameters.
This connection between DTS (which provides the dimensional range), posit arithmetic (which provides domain-matched precision), and DMM (which tracks the quire’s allocation and lifetime) is an instance of the representation selection framework applied to a specific computational domain. The analysis is the same; the domain is different.
Current Status and Limitations
The forward gradient method has known limitations. The variance of the gradient estimate increases with parameter count, and variance reduction techniques add computational overhead. For very large models, the statistical cost may exceed the memory savings. The optimal tradeoff between reverse-mode and forward-mode is an empirical question that depends on model architecture, hardware memory constraints, and acceptable training time.
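The variance growth can be made concrete for Rademacher perturbations: enumerating all sign vectors gives the exact per-component mean and variance of the estimate, and the variance grows linearly with parameter count (a toy enumeration, not a training-scale measurement):

```python
import itertools

def forward_grad_component_stats(g, j):
    # Enumerate all Rademacher perturbations v to get exact statistics of
    # the j-th component of the forward gradient (g·v)v. Its mean is g_j
    # and its variance is Σ_{i≠j} g_i², which grows with dimension.
    n = len(g)
    samples = []
    for v in itertools.product((-1.0, 1.0), repeat=n):
        jvp = sum(gi * vi for gi, vi in zip(g, v))
        samples.append(jvp * v[j])
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var

# Same per-component gradient magnitude, growing parameter count:
for n in (2, 4, 8):
    mean, var = forward_grad_component_stats([1.0] * n, 0)
    print(n, mean, var)  # mean stays 1.0; variance is n - 1
```

The estimate stays unbiased at every n, but each component's noise scales with the number of parameters, which is the statistical cost weighed against the memory savings.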
The quire’s exact accumulation eliminates one source of numerical error but does not address the statistical noise inherent in the forward gradient estimate. The combination reduces numerical error to zero (via exact accumulation) while accepting statistical error (from the directional derivative estimate). Whether this tradeoff is favorable depends on the specific workload.
The dimensional verification for physics-informed loss terms is sound but narrow: it catches dimensional inconsistencies in the loss function’s definition, not errors in the model’s learned representations. A dimensionally consistent loss term can still produce a model that makes incorrect predictions; dimensional correctness is necessary but not sufficient for physical fidelity.
These limitations are real and should inform expectations. The contribution here is that three independently valuable properties (forward gradients, quires, and DTS) compose naturally within the PSG, and the composition can be verified at compile time. The framework does not claim to solve machine learning; it claims that this specific composition is sound and useful for the workloads where the composition’s properties matter.
References
[1] A. G. Baydin, B. A. Pearlmutter, D. Syme, F. Wood, and P. Torr, “Gradients without Backpropagation,” arXiv:2202.08587, 2022.
[2] Posit Working Group, “Standard for Posit Arithmetic (2022),” posithub.org, 2022.
[3] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” Journal of Computational Physics, vol. 378, pp. 686-707, 2019.