Benchmarking System¶
TorchEBM uses pytest-benchmark to measure end-to-end wall time, throughput, FLOPS, and memory for every registered component across releases. Results are saved as JSON, compared against a baseline, and published to the TorchEBM Benchmarks Dashboard.
See the Profiling & Benchmarks page for the complementary per-op drill-down tool.
How it works¶
- **Zero test boilerplate.** `test_bench_auto.py` picks up every component exported from `torchebm.*.__init__` and times its standard workload (forward + backward for losses, `step` for samplers/integrators, `.interpolate` for interpolants, etc.).
- **Scales.** Each benchmark runs at `small`, `medium`, and `large` (batch size / dim / steps, defined in `benchmarks/conftest.py::SCALES`).
- **Modes.** `eager` by default; `--compile` and `--amp` add `torch.compile` and `float16` autocast variants via the same `registry.apply_mode()` the profiler uses, so results stay directly comparable.
- **CUDA-aware.** `benchmark.pedantic()` calls `torch.cuda.synchronize()` and tracks peak/reserved memory per run.
Quick start¶
Run `benchmarks/run.sh` and outputs land in `benchmarks/results/`. The wrapper also keeps a versioned copy, `v{torchebm_version}_{device}_{timestamp}.json`, and updates `baseline_{device}.json` on the first run.
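A typical session, using only the wrapper flags documented under "Files in this system" below (a sketch; flag semantics are inferred from their names, so check `benchmarks/run.sh` itself):

```shell
# Quick pass:
bash benchmarks/run.sh --quick

# Record a baseline, then compare a later run against it:
bash benchmarks/run.sh --baseline
bash benchmarks/run.sh --compare

# Render a local dashboard.html from saved result JSONs:
bash benchmarks/run.sh --dashboard
```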
Comparing runs¶
For CI-style regression checks, compare the current run against the saved baseline (the wrapper's `--compare` / `--ci` flags drive this). The comparison exits non-zero if the geometric-mean speedup versus the baseline is below 0.95×.
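The 0.95× gate is easy to reproduce by hand. The sketch below shows only the aggregation rule, with made-up timings; it is not the repo's actual comparison code:

```python
from math import prod

def geometric_mean_speedup(baseline_times, current_times):
    """Per-benchmark speedup = baseline / current; aggregate with the geometric mean."""
    ratios = [b / c for b, c in zip(baseline_times, current_times)]
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical per-benchmark mean times in seconds (baseline vs. new run):
baseline = [0.010, 0.200, 1.500]
current  = [0.011, 0.190, 1.600]

speedup = geometric_mean_speedup(baseline, current)
passed = speedup >= 0.95   # the CI gate: fail if below 0.95x
```

The geometric mean is the usual choice here because it treats a 2× slowdown on one benchmark and a 2× speedup on another as a wash, regardless of their absolute durations.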
Benchmarks vs. Profiler
Use benchmarks to detect a regression; use `profiler.py diff` on the same component to explain it op by op.
Enabling in CI / editor¶
Benchmarks are disabled by default (`--benchmark-disable` in `pyproject.toml`), so a regular `pytest tests/` run never executes them; they have to be enabled explicitly.
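A minimal local opt-in, assuming pytest-benchmark's standard `--benchmark-enable` switch (which overrides the `pyproject.toml` default on the command line):

```shell
# Override the --benchmark-disable default from pyproject.toml for one run:
pytest benchmarks/ --benchmark-enable
```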
GitHub Actions triggers the GPU runner on `workflow_dispatch`, merges to `main`, or any commit containing `#bench` in the message.
Adding a benchmark¶
You usually do nothing. Any class exported from a `torchebm.*` subpackage that inherits a known base (`BaseLoss`, `BaseSampler`, `BaseIntegrator`, `BaseInterpolant`) is auto-discovered.
Add an entry to `benchmarks/registry.py::COMPONENT_OVERRIDES` only if your component needs non-default construction. The dataclass keys are short and self-documenting, and existing entries are the best reference. Variants (e.g. exact vs. approx modes) use the `variants` key.
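For orientation only, here is a mocked-up override entry. `BenchSpec`'s real fields live in `benchmarks/registry.py`; every field name and the component below are hypothetical, so copy an existing entry rather than this sketch:

```python
from dataclasses import dataclass, field

@dataclass
class BenchSpec:
    # Mock of the override schema -- consult registry.py for the actual keys.
    init_kwargs: dict = field(default_factory=dict)  # non-default constructor args
    variants: dict = field(default_factory=dict)     # variant name -> kwargs delta

COMPONENT_OVERRIDES = {
    # Hypothetical component with exact vs. approx variants:
    "SlicedScoreMatching": BenchSpec(
        init_kwargs={"n_projections": 1},
        variants={
            "exact": {"use_projection": False},
            "approx": {"use_projection": True},
        },
    ),
}
```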
To temporarily exclude a benchmark, edit `benchmarks/benchmark.toml`.
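A sketch of the shape such an exclusion might take; the table and key names here are guesses, so mirror whatever `benchmarks/benchmark.toml` already contains:

```toml
# Hypothetical keys -- copy the structure of existing entries instead.
[modules]
losses = true
samplers = false          # disable a whole module

[exclude]
benchmarks = ["LangevinDynamics_large_compile"]   # skip individual benchmarks by name
```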
Publishing results¶
Benchmark JSONs and the dashboard live in a separate repo so this one stays lean. After a run:
- Copy the autosaved JSON from `benchmarks/results/Linux-CPython-*/` to the `torchebm-benchmarks` repo.
- In that repo, run `bash scripts/publish.sh <path-to-json>`; it updates `manifest.json`, commits, and pushes.
- GitHub Pages auto-deploys the dashboard.
Files in this system¶
| File | Role |
|---|---|
| `benchmarks/registry.py` | Auto-discovers components, defines `BenchSpec`, builds `(fn, info)` pairs, exposes `apply_mode()` shared with the profiler. |
| `benchmarks/test_bench_auto.py` | Parametrizes pytest over (component × scale × mode) and drives `benchmark.pedantic`. |
| `benchmarks/conftest.py` | CLI options (`--bench-*`), `SCALES`, CUDA sync hooks, pedantic setup. |
| `benchmarks/benchmark.toml` | Enable / disable modules and exclude benchmarks by name. |
| `benchmarks/run.sh` | Shell wrapper with `--quick`, `--baseline`, `--compare`, `--dashboard`, `--ci`. |
| `benchmarks/dashboard.py` | Generates a local `dashboard.html` from the result JSONs (the published version lives in the separate benchmarks repo). |
| `benchmarks/torchebm_benchmarking.md` | Full architecture reference (for repo maintainers). |
Note
For internal details (CI workflow, `extra_info` metadata schema, Plotly dashboard structure), see `benchmarks/torchebm_benchmarking.md` in the repo.