Forward Gradients, Exact Accumulation, and the Memory Geometry of Training
The Activation Tape Problem
Reverse-mode automatic differentiation (backpropagation) has a well-known memory cost. To compute gradients during the backward pass, the system must retain intermediate activations from the forward pass. For a network with L layers and batch size B, this imposes an auxiliary memory requirement proportional to L · B (times the per-layer activation width). The activations must persist from their creation during the forward pass until their consumption during the backward pass; their lifetime spans the entire training step.
This is a structural property of reverse-mode AD, not an implementation detail. The backward pass requires the intermediate values as a contextual resource, which in the Fidelity framework’s terminology makes it a coeffect. The activation tape is a memory obligation that the computation imposes on its environment.
Gradient checkpointing and other memory-reduction techniques trade compute for memory, recomputing activations during the backward pass to avoid storing them. These techniques reduce the memory footprint (checkpointing can bring the activation cost down to O(√L) at the price of recomputation) but do not change the fundamental structure: reverse-mode AD requires either storing or recomputing intermediate values.
The Forward Gradient
Baydin, Pearlmutter, Syme, Wood, and Torr [1] demonstrated a different approach. The forward gradient is an unbiased estimate of the gradient computed via forward-mode automatic differentiation. For a random perturbation vector v (e.g., v ~ N(0, I)), the forward gradient scales v by the directional derivative of f at θ along v:

g(θ) = (∇f(θ) · v) v
This is evaluated in a single forward pass. There is no backward pass. There is no activation tape.
The gradient estimate is unbiased: its expectation over random perturbation vectors equals the true gradient. The variance depends on the perturbation distribution and can be controlled through standard variance reduction techniques. The tradeoff is statistical: exact gradients via reverse-mode vs. unbiased estimates via forward-mode, with the forward approach eliminating the memory obligation entirely.
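As a minimal sketch of the estimator (not Fidelity code; the `Dual` class and function names are illustrative), dual-number forward mode computes the directional derivative in one pass, and averaging the forward gradient over all Rademacher sign vectors recovers the exact gradient, illustrating unbiasedness:

```python
import itertools

# Forward-mode AD with dual numbers: each value carries (primal, tangent).
class Dual:
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.primal + o.primal, self.tangent + o.tangent)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.primal * o.primal,
                    self.primal * o.tangent + self.tangent * o.primal)
    __rmul__ = __mul__

def f(x, y, z):
    # A toy smooth objective; its analytic gradient is (2x + y, x + 3, 2z).
    return x * x + x * y + 3 * y + z * z

def forward_gradient(theta, v):
    # One forward pass with tangent direction v yields the JVP ∇f(θ)·v;
    # scaling v by it gives the forward gradient. No backward pass, no tape.
    duals = [Dual(t, vi) for t, vi in zip(theta, v)]
    jvp = f(*duals).tangent
    return [jvp * vi for vi in v]

theta = (1.0, 2.0, 3.0)
# Rademacher perturbations: averaging over all sign patterns recovers the
# exact gradient, because E[v vᵀ] = I.
n = len(theta)
signs = list(itertools.product((-1.0, 1.0), repeat=n))
avg = [0.0] * n
for v in signs:
    g = forward_gradient(theta, v)
    avg = [a + gi / len(signs) for a, gi in zip(avg, g)]
print(avg)  # → [4.0, 4.0, 6.0], the analytic gradient (2x+y, x+3, 2z)
```

Any single sample `forward_gradient(theta, v)` is a noisy rank-one estimate; only its expectation matches the true gradient, which is the statistical tradeoff described above.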
The Coeffect Signature
In the Fidelity framework’s coeffect system, these two approaches have distinct signatures:
| Property | Reverse-Mode | Forward-Mode [1] |
|---|---|---|
| Auxiliary memory | O(n) per layer (stored activations) | O(1) |
| Gradient quality | Exact (full Jacobian transpose) | Unbiased estimate (directional derivative) |
| Activation tape | Required; lifetime spans full backward pass | Not required |
| Escape analysis | Intermediate values escape layer scope | No intermediate values escape layer scope |
The forward-mode signature is significant for the escape analysis described in Section 3.2 of the DTS/DMM paper. When no intermediate values escape their creating scope, every allocation is stack-eligible. The escape classification for every intermediate value is StackScoped; the allocation strategy is memref.alloca; the lifetime bound is the lexical scope of the layer computation.
This is a verifiable compile-time property. Given a computation graph annotated with AD mode, the Fidelity framework’s lifetime analysis can confirm that forward-mode imposes no lifetime obligations beyond the current layer’s scope. The coeffect system does not need heuristics or runtime checks; it follows from the structure of forward-mode evaluation.
The Quire Connection
Forward-mode computes directional derivatives via inner products. The inner product is an accumulation of products, exactly the operation the posit quire makes exact.
The quire holds intermediate multiply-add results without rounding. For posit32, the quire occupies 16 × 32 = 512 bits = 64 bytes, exactly one cache line. Rounding occurs once, when the final accumulated result is converted back to a posit value [2]. This single-rounding property is significant for gradient accumulation: the directional derivative computation accumulates across all parameters in a layer, and any per-step rounding error compounds across millions of parameters during training.
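A rough illustration of the single-rounding property, using Python rationals to stand in for the 512-bit fixed-point quire (a simulation of the accumulation discipline, not posit arithmetic):

```python
import struct
from fractions import Fraction

def to_f32(x):
    # Round a Python float to the nearest binary32 value.
    return struct.unpack('f', struct.pack('f', x))[0]

def dot_rounded(a, b):
    # Conventional accumulation: every multiply-add rounds to binary32.
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc

def dot_quire_style(a, b):
    # Quire-style accumulation: keep products exact (here via rationals),
    # round once when the final sum is converted back.
    acc = Fraction(0)
    for x, y in zip(a, b):
        acc += Fraction(x) * Fraction(y)
    return to_f32(float(acc))

# One large term followed by many tiny ones: per-step rounding absorbs the
# tiny contributions; single rounding preserves their accumulated effect.
a = [1.0] + [1e-10] * 1000
b = [1.0] * 1001
print(dot_rounded(a, b))      # 1.0 — the 1e-10 terms are lost step by step
print(dot_quire_style(a, b))  # slightly above 1.0: accumulated exactly, rounded once
```

The same effect at float32 scale is what the document means by per-step rounding error compounding across millions of parameters.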
The quire’s coeffect profile in this context:
- Allocation: 64 bytes on stack (CPU) or 512-bit fabric pipeline (FPGA)
- Lifetime: bounded by the forward pass through one layer; no escape
- Capability: available on CPU (software emulation, ~50 cycles/FMA) and FPGA (hardware pipeline, 1 cycle/FMA); unavailable on neuromorphic targets without exact accumulation support
The quire’s lifetime aligns with the forward gradient’s memory profile. Both are scoped to a single layer’s computation. Neither requires persistence beyond the layer boundary. The coeffect system tracks both through the same inference pipeline, and the design-time diagnostic shows them together:
gradient_estimate: float<loss · param⁻¹>
AD mode: forward (Baydin et al. [1])
Accumulation: Quire (exact, single rounding)
├─ x86_64: stack, 64 bytes, ~50 cycles/fma, O(1) auxiliary memory
├─ xilinx: 512-bit fabric pipeline, 1 cycle/fma, O(1) auxiliary
└─ Memory: no activation tape, no escape, fully stack-eligible

The Quire as a Tier 2 Loop Invariant
The quire’s exactness guarantee has a precise reading in Hoare logic, and stating it that way clarifies what verification can establish at compile time and what it cannot.
The accumulation loop that computes Σᵢ aᵢ · bᵢ maintains an invariant at every iteration: the accumulated value occupies at most the quire width Q bits (512 bits for posit32). The invariant is the loop’s precondition, the loop body preserves it, and the loop’s exit delivers a postcondition that the final value is the exact sum of the products with no intermediate rounding. In Hoare’s sequential composition rule, the invariant is the assertion P for which {P} body {P} is established locally, and the conclusion is the proof that the loop achieves the postcondition the next computation depends on.
The invariant is decidable in QF_LIA / QF_BV from the input range and the loop iteration bound. For a layer with n parameters and a per-product bit-width w derived from the chosen posit configuration, the maximum accumulated bit-width is w + ⌈log₂ n⌉, and the invariant w + ⌈log₂ n⌉ ≤ Q is a single linear inequality that Z3 discharges at design time. The engineer never declares this property; the framework derives it from the layer’s parameter count, the posit width, and the quire width, which are all known at compile time. The Tier 2 obligation is therefore satisfied automatically for any layer whose dimensions are statically bounded.
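The inequality is simple enough to sketch directly; the per-product bit-width below is an assumed illustrative value, not a figure taken from the posit standard:

```python
import math

QUIRE_BITS = 512  # quire width for posit32 (16 × posit width, per [2])

def max_accumulated_bits(n_params, product_bits):
    # Upper bound on the bit-width of a sum of n fixed-point products:
    # each product needs product_bits, and summing n of them can grow the
    # magnitude by at most ceil(log2 n) carry bits.
    return product_bits + math.ceil(math.log2(n_params))

def quire_invariant_holds(n_params, product_bits, quire_bits=QUIRE_BITS):
    # The single linear inequality the framework would hand to an SMT solver:
    #   product_bits + ceil(log2 n) <= quire_bits
    return max_accumulated_bits(n_params, product_bits) <= quire_bits

# Illustrative numbers (product_bits = 128 is an assumption for the sketch):
print(quire_invariant_holds(n_params=4096, product_bits=128))    # True
print(quire_invariant_holds(n_params=2**400, product_bits=128))  # False
```

Because n, w, and Q are all compile-time constants for a statically shaped layer, evaluating this inequality is the entire proof obligation.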
The Fwd ⊣ Bwd adjunction described in the categorical deep learning entry is the structural reason the same invariant can be checked at every lowering pass: each lowering is a functor that preserves the adjunction, and the quire-width invariant is a property of the adjoint pair that survives the functor’s action. In sheaf-theoretic language, the quire invariant is a section of the accumulation sheaf over the compilation poset, and the dual-pass discharge at each lowering is the local check that the section restricts correctly to the next stage. The compilation sheaf design document treats the broader compositional story; here it is enough to note that the quire’s exactness is a verifiable property at every stage from PSG to native binary, not only at the high-level mathematical specification.
Dimensional Consistency Under Differentiation
The third property is dimensional. The DTS framework’s dimensional algebra is closed under differentiation. If a function f maps values with dimension d₁ to values with dimension d₂, then the derivative df/dx carries dimension d₂ · d₁⁻¹. This follows from the abelian group structure: differentiation is division in the dimensional algebra, and division is closed in an abelian group.
For gradient computation, this means:
- The gradient of a loss function with dimension d_loss with respect to a parameter with dimension d_param carries dimension d_loss · d_param⁻¹
- Gradient accumulation across parameters with different dimensions is dimensionally constrained: a gradient with dimension d_loss · d₁⁻¹ cannot be accumulated with a gradient of dimension d_loss · d₂⁻¹ (for d₁ ≠ d₂) without a dimensional error
- The chain rule preserves dimensional consistency through the computation graph; each gradient node inherits a dimension from the chain rule, verified by the same Gaussian elimination that verifies the forward pass
This verification is decidable, requires no annotation beyond the physical dimensions already present in the forward computation, and has zero runtime cost. The inference algorithm from Section 2.2 of the DTS/DMM paper extends to auto-differentiation graphs without modification.
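A toy model of the closure property, with dimensions as exponent vectors over three base units (a sketch of the algebra, not the DTS implementation):

```python
from fractions import Fraction

# A dimension as an exponent vector over (mass, length, time).
# Multiplication adds exponents and division subtracts them, so the set of
# dimensions forms an abelian group under componentwise addition.
def dim(M=0, L=0, T=0):
    return (Fraction(M), Fraction(L), Fraction(T))

def dmul(a, b):
    return tuple(x + y for x, y in zip(a, b))

def ddiv(a, b):
    return tuple(x - y for x, y in zip(a, b))

DIMENSIONLESS = dim()
MASS = dim(M=1)
LENGTH = dim(L=1)

# Differentiation divides dimensions: d(loss)/d(param) has dimension
# loss · param⁻¹, which is again a valid dimension (closure).
grad_wrt_length = ddiv(DIMENSIONLESS, LENGTH)  # dimensionless loss, length param
assert grad_wrt_length == dim(L=-1)

# Gradients of parameters with different dimensions cannot be summed:
# their exponent vectors differ.
grad_wrt_mass = ddiv(DIMENSIONLESS, MASS)
assert grad_wrt_mass != grad_wrt_length
print("dimensional checks passed")
```

The exponent-vector encoding is also why the verification reduces to linear algebra: each chain-rule edge contributes a linear equation over exponents, and Gaussian elimination decides the system.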
In the sheaf-theoretic framing developed in the compilation sheaf design document, the chain rule’s preservation of dimensional consistency is the global section condition on the differentiation sheaf. The computation graph is a finite directed acyclic poset; the stalks at each node are dimensional annotation bundles in the dimension group; the structure maps at each edge are the chain rule applications that derive the gradient annotation at the downstream node from the gradient annotation at the upstream node. A consistent dimensional assignment across the entire gradient graph is exactly a global section of this sheaf. The Tier 1 verification described above is the witnessing mechanism: Gaussian elimination over the structure-map equations decides whether the global section exists, and if so produces the principal one.
The O(1) auxiliary memory profile of forward-mode AD has a parallel sheaf reading on the escape classification sheaf. The escape classification at each node carries a lifetime ordering (stack < arena < heap < static), and the structure maps are the lifetime-promotion rules that propagate constraints from use sites back to creation sites. When every intermediate value in a forward-mode pass is StackScoped (no escape, no activation tape), the escape classification sheaf has trivial first cohomology: there is no obstruction to extending each layer’s local section to a global section over the entire forward pass, because every value’s lifetime is contained within its creating scope and no structure map needs to widen any classification. Reverse-mode AD’s activation tape produces the opposite situation: intermediate values must outlive their creating scopes to be available during the backward pass, which forces lifetime promotions and creates non-trivial obstructions that the escape classification sheaf must resolve through arena allocation. The “memory profile” difference between forward-mode and reverse-mode is, in this reading, a difference in the cohomology of the escape classification sheaf.
Physics-Informed Loss Terms
The dimensional verification has a concrete application in physics-informed neural networks [3]. A loss term that penalizes violations of Newton’s second law computes the residual F − m·a and minimizes its square. DTS can verify at compile time that F, m, and a carry dimensions kg · m · s⁻², kg, and m · s⁻² respectively, and that the subtraction is dimensionally consistent.
This is a structural check on the loss function’s definition, not a runtime constraint on the model’s outputs. It ensures that the physics constraints imposed during training are dimensionally well-formed. Existing ML frameworks cannot provide this verification because dimensional information is never encoded in the type system and is not available at any stage of the compilation pipeline.
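A sketch of this structural check, again with dimensions as exponent tuples over (mass, length, time); the helper names are hypothetical, not the DTS API:

```python
# Dimensions as (mass, length, time) exponent tuples.
FORCE = (1, 1, -2)   # kg · m · s⁻²
MASS  = (1, 0, 0)    # kg
ACCEL = (0, 1, -2)   # m · s⁻²

def dmul(a, b):
    # Multiplying quantities adds dimension exponents.
    return tuple(x + y for x, y in zip(a, b))

def check_sub(a, b):
    # Subtraction is only defined between identical dimensions.
    if a != b:
        raise TypeError(f"dimension mismatch: {a} vs {b}")
    return a

# Residual of Newton's second law, F - m·a: the check passes...
residual_dim = check_sub(FORCE, dmul(MASS, ACCEL))
print("F - m·a is dimensionally consistent:", residual_dim)

# ...while a mistyped loss term (F - m·v, velocity in place of
# acceleration) is rejected when the loss definition is checked.
VELOCITY = (0, 1, -1)
try:
    check_sub(FORCE, dmul(MASS, VELOCITY))
except TypeError as e:
    print("rejected:", e)
```

The check runs over the loss term's definition, before any training data exists, which is the sense in which it is structural rather than a runtime constraint.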
The Composition
Three independently established properties compose in the PSG:
graph TD
A[DTS: Dimensional Consistency] -->|"Chain rule closed<br>under abelian group"| D[Unified Gradient<br>Computation]
B[Forward-Mode AD: Memory Minimal] -->|"O(1) auxiliary,<br>no activation tape"| D
C[Quire: Exact Accumulation] -->|"Single rounding,<br>512-bit for posit32"| D
D -->|"Dimensionally verified<br>Memory-minimal<br>Numerically exact"| E[Design-Time<br>Diagnostic]
- DTS verifies dimensional consistency of the gradient graph, including through the chain rule, at compile time
- Forward-mode AD eliminates the activation tape coeffect, making gradient computation stack-eligible
- The quire provides exact accumulation for the inner products that forward-mode computes
Each property is established independently in the literature ([1] for forward gradients, [2] for quire accumulation, Section 2 of DTS/DMM for dimensional type systems). Their composition within the PSG is the contribution: a system where gradient computation is simultaneously dimensionally verified, memory-minimal, and numerically exact, with all three properties visible and verifiable at design time.
Representation Selection for Training
The representation selection framework from the companion entry on posit arithmetic applies directly to training workloads. Neural network activations and gradients have well-characterized value distributions, typically concentrated near zero with heavy tails. The dimensional range of activations in a specific layer is inferrable from training statistics or from dimensional constraints on the input domain.
Given this range, the compiler’s representation selection function can choose posit widths that concentrate precision where the values cluster. The quire provides exact gradient accumulation regardless of the chosen posit width, eliminating the rounding errors that compound across millions of parameters.
This connection between DTS (which provides the dimensional range), posit arithmetic (which provides domain-matched precision), and DMM (which tracks the quire’s allocation and lifetime) is an instance of the representation selection framework applied to a specific computational domain. The analysis is the same; the domain is different.
Current Status and Limitations
The forward gradient method has known limitations. The variance of the gradient estimate increases with parameter count, and variance reduction techniques add computational overhead. For very large models, the statistical cost may exceed the memory savings. The optimal tradeoff between reverse-mode and forward-mode is an empirical question that depends on model architecture, hardware memory constraints, and acceptable training time.
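The variance growth can be made concrete for Rademacher perturbations: enumerating all sign vectors gives the exact per-component mean and variance of the estimate, and the variance grows linearly with parameter count (a toy enumeration, not a training-scale measurement):

```python
import itertools

def forward_grad_component_stats(g, j):
    # Enumerate all Rademacher perturbations v to get exact statistics of
    # the j-th component of the forward gradient (g·v)v. Its mean is g_j
    # and its variance is Σ_{i≠j} g_i², which grows with dimension.
    n = len(g)
    samples = []
    for v in itertools.product((-1.0, 1.0), repeat=n):
        jvp = sum(gi * vi for gi, vi in zip(g, v))
        samples.append(jvp * v[j])
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var

# Same per-component gradient magnitude, growing parameter count:
for n in (2, 4, 8):
    mean, var = forward_grad_component_stats([1.0] * n, 0)
    print(n, mean, var)  # mean stays 1.0; variance is n - 1
```

The estimate stays unbiased at every n, but each component's noise scales with the number of parameters, which is the statistical cost weighed against the memory savings.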
The quire’s exact accumulation eliminates one source of numerical error but does not address the statistical noise inherent in the forward gradient estimate. The combination reduces numerical error to zero (via exact accumulation) while accepting statistical error (from the directional derivative estimate). Whether this tradeoff is favorable depends on the specific workload.
The dimensional verification for physics-informed loss terms is sound but narrow: it catches dimensional inconsistencies in the loss function’s definition, not errors in the model’s learned representations. A dimensionally consistent loss term can still produce a model that makes incorrect predictions; dimensional correctness is necessary but not sufficient for physical fidelity.
These limitations are real and should inform expectations. The contribution here is that three independently valuable properties (forward gradients, quires, and DTS) compose naturally within the PSG, and the composition can be verified at compile time. The framework does not claim to solve machine learning; it claims that this specific composition is sound and useful for the workloads where the composition’s properties matter.
References
[1] A. G. Baydin, B. A. Pearlmutter, D. Syme, F. Wood, and P. Torr, “Gradients without Backpropagation,” arXiv:2202.08587, 2022.
[2] Posit Working Group, “Standard for Posit Arithmetic (2022),” posithub.org, 2022.
[3] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” Journal of Computational Physics, vol. 378, pp. 686-707, 2019.