Performance¶
Patterns we follow when writing performance-sensitive code. For measuring performance, see Benchmarking; for debugging a specific regression, see Profiling.
The four hot spots¶
- Sampler steps: iterative, run 100–1000×, dominate wall time.
- Score / energy gradients: `autograd.grad` calls are frequent and stack.
- Loss forward + backward: called every training batch.
- Host ↔ device traffic: `.item()`, `.cpu()`, and repeated `.to()` stall the GPU.
Optimise in that order. Everything else is noise.
Vectorise, don't loop¶
Work in batch dimensions; avoid Python-level iteration over samples.
Sample many chains in parallel by putting the chain index in the leading dim:
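A minimal sketch of the pattern (the `step_chains` helper and the Langevin-style update rule are illustrative, not library API):

```python
import torch

def step_chains(x, score_fn, step_size=1e-2):
    """One update for all chains at once.

    x: (n_chains, dim) -- chain index in the leading dim, so every
    chain advances in a single batched kernel launch instead of a
    Python loop over chains.
    """
    noise = torch.randn_like(x)
    return x + step_size * score_fn(x) + (2 * step_size) ** 0.5 * noise

# 512 chains in 2-D, stepped together -- no per-chain Python iteration
x = torch.zeros(512, 2)
score = lambda x: -x  # standard-normal score, for illustration only
for _ in range(10):
    x = step_chains(x, score)
```

Every operation broadcasts over the leading chain dimension, so the per-step cost is one kernel launch regardless of the number of chains.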
Stay on device¶
Keep tensors on the same device and dtype for the whole pipeline. Use DeviceMixin's self.device / self.dtype inside the library; never hard-code cuda.
Do not sync unnecessarily
`.item()`, `.cpu()`, `.tolist()`, and a Python-level `if tensor > 0:` check all trigger a full GPU sync. Defer them until after the hot loop, ideally until logging at the epoch boundary.
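A sketch of the deferral pattern (the loop body is a stand-in for a real training step):

```python
import torch

losses = []  # accumulate 0-d tensors: no host round-trip inside the loop
x = torch.randn(1024, 16)
for step in range(100):
    loss = (x ** 2).mean()        # stays on device
    losses.append(loss.detach())  # calling .item() here would sync every step
    x = x * 0.99

# one sync at the epoch boundary, when we actually log
mean_loss = torch.stack(losses).mean().item()
```

With CUDA's asynchronous execution, the loop above enqueues work without waiting on it; the single `.item()` at the end is the only point where the host blocks on the GPU.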
Reuse memory¶
Pre-allocate buffers once, reuse in the loop:
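A minimal sketch of the buffer-reuse pattern (shapes and the update rule are illustrative):

```python
import torch

n, dim = 4096, 32
noise = torch.empty(n, dim)   # allocated once, outside the loop
x = torch.zeros(n, dim)
for _ in range(1000):
    noise.normal_()           # refill in place: no new allocation per step
    x = x + 0.01 * noise
```

`torch.randn(n, dim)` inside the loop would allocate a fresh tensor every iteration; `noise.normal_()` reuses the same memory.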
For trajectories, write into a pre-allocated tensor instead of appending to a list and stacking at the end:
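A sketch of the trajectory-buffer pattern (the random-walk step is a stand-in for a real sampler step):

```python
import torch

n_steps, n, dim = 100, 256, 8
traj = torch.empty(n_steps, n, dim)  # one allocation up front
x = torch.zeros(n, dim)
for t in range(n_steps):
    x = x + 0.01 * torch.randn(n, dim)
    traj[t] = x                      # write into the pre-allocated slot
# ...instead of appending to a list and torch.stack-ing at the end
```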
In-place ops (x.add_, x.mul_) are safe outside of autograd-tracked paths.
Mixed precision and compilation¶
Both are opt-in at the benchmark / application layer via the same entry point the profiler uses:
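A plain-PyTorch sketch of what opting in looks like at the application layer; the library's actual entry point is not shown here, and the model is a stand-in:

```python
import torch

model = torch.nn.Linear(64, 64)
x = torch.randn(128, 64)

# Compilation: applied at the call site, not hard-coded inside the library.
compiled = torch.compile(model)

# Mixed precision: autocast only around the matmul-heavy region.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    y = compiled(x)
```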
Inside the library, wrap large matmul blocks with `self.autocast_context()` (provided by `DeviceMixin`) rather than calling `torch.autocast` directly. This honours the user's configured dtype.
Sampler-specific tips¶
- Langevin rough scaling: step size \( \sim d^{-1/3} \), noise scale \( \sigma = \sqrt{2\eta} \).
- HMC rough scaling: leapfrog steps \( \sim \sqrt{d} \), step size \( \sim d^{-1/4} \). Target acceptance 0.6–0.8.
- ODE samplers: prefer adaptive integrators (DOPRI5, Heun) for generation; fixed-step for training. Keep the ODE function allocation-free; see the pre-allocation pattern above.
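One way to keep an ODE right-hand side allocation-free is to write into a held output buffer; this `Drift` class is an illustrative pattern, not the library's integrator API (and note that some adaptive solvers retain references to returned tensors, in which case the buffer must not be reused):

```python
import torch

class Drift:
    """ODE right-hand side that reuses one output buffer per call."""

    def __init__(self, n, dim):
        self.out = torch.empty(n, dim)  # allocated once

    def __call__(self, t, x):
        torch.mul(x, -1.0, out=self.out)  # dx/dt = -x, written in place
        return self.out

f = Drift(64, 4)
x = torch.randn(64, 4)
dx = f(0.0, x)
```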
Common pitfalls¶
- Implicit host ↔ device copies: `torch.tensor(x_numpy, device=…)` inside a loop.
- Redundant `.to()` calls: `BaseLoss.__call__` already moves inputs; subclass `forward()` should not move them again.
- Missing `torch.no_grad()`: interpolation targets, momentum init, and random projection generation don't need grad tracking.
- Tiny batches on GPU: under-utilises SMs; prefer one big step over many small ones.
- Python-level `isinstance` inside the inner loop: resolve once before the loop.
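For the `torch.no_grad()` pitfall, a minimal sketch (the interpolation itself is illustrative):

```python
import torch

x0 = torch.randn(32, 8, requires_grad=True)
x1 = torch.randn(32, 8)

# Interpolation targets are data, not parameters: build them under
# no_grad so autograd records nothing and no graph is kept alive.
with torch.no_grad():
    t = torch.rand(32, 1)
    target = (1 - t) * x0 + t * x1
```

Without the context manager, `target` would carry `requires_grad=True` through `x0` and silently extend the autograd graph into the loss computation.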
Next steps¶
- Evidence first: Benchmarking tells you how fast; Profiling tells you where the time goes.
- CUDA kernels: planned. See cuRBLAS for background.
- Multi-GPU scaling: planned.