Profiling¶
TorchEBM ships with two complementary performance tools under benchmarks/:
| Tool | Answers | Tracks change over time? |
|---|---|---|
Benchmarking (benchmarks/run.sh) | How fast is this component?. wall time, throughput, FLOPS | yes, via baselines and the dashboard |
benchmarks/profiler.py (this page) | Where does time and memory go inside it?. per-op CPU/CUDA time, kernels, memory events | yes, via profiler.py diff |
Use benchmarks to detect a regression; use the profiler to explain it.
How it works¶
profiler.py is a thin wrapper around torch.profiler that reuses the benchmark component registry, so the callable you profile is identical to the one pytest-benchmark times.
There is zero instrumentation inside torchebm/*: all profiling lives in benchmarks/. Nothing in the library is aware that it is being profiled.
Quick start¶
Profile one component at a scale defined by the benchmark suite:
Profile a full training step (any importable zero-arg callable or factory):
Compare two runs to find which op regressed:
Outputs¶
Each run creates one directory under benchmarks/profiles/:
The directory is gitignored; profiles are local by design.
Pick one lens at a time
Start with --top 20 to see the hot ops. Add --trace when you need a timeline, --memory when you suspect allocations, --nvtx when you are going deeper with Nsight Systems.
The run command¶
| Flag | Purpose |
|---|---|
--component NAME | Profile a registered component (e.g. LangevinDynamics, ScoreMatching). |
--callable mod:attr | Profile any zero-arg callable. If attr has required args it is invoked once as a factory; the returned callable is profiled. |
--scale {small,medium,large} | Applies to --component. Matches benchmarks/conftest.py::SCALES. |
--device {cpu,cuda} | Auto-falls back to CPU when CUDA is unavailable. |
--dtype {float32,float64,float16,bfloat16} | Target dtype for the component. |
--warmup N / --steps N | Warmup iterations (untraced) and profiled iterations. |
--compile / --amp | Wrap the callable with torch.compile or torch.autocast(float16). same transform pytest-benchmark uses. |
--trace | Export Chrome-format trace. |
--top N | Write top_ops.md + top_ops.json with the top N ops. |
--memory | CUDA only. Record and dump a memory history snapshot. |
--nvtx | CUDA only. Wrap each step in NVTX ranges; prints the matching nsys profile command. |
--line module:fn | Optional. Line-level profile of a specific function (requires pip install line_profiler). |
--all | Shortcut for --trace --top 20 --memory --nvtx. |
--label TAG | Free-form tag appended to the output directory name. |
The diff command¶
diff joins two top_ops.json rowsets by op name, sorts by \( |\Delta t| \), and prints a markdown table:
Columns: op, time in run A, time in run B, absolute Δ, percent Δ, call counts, and a status of changed / new / dropped. diff also accepts paths directly to top_ops.json files.
Available metrics: self_cuda_time_total_us, cuda_time_total_us, self_cpu_time_total_us, cpu_time_total_us, self_cuda_mem_b, self_cpu_mem_b.
Typical workflows¶
benchmarks/profiles/*/top_ops.md and look at the top rows. Adding a new profiling target¶
Components exported from torchebm.* are picked up automatically by benchmarks/registry.py, so nothing to do. new components just appear as valid --component names.
Custom callables only need to be importable and either take no required arguments or act as a factory returning a zero-arg callable:
Then:
See examples/eqm/train_2d.py for a working reference.
Relation to the benchmark suite¶
- The benchmark suite produces versioned JSONs and the dashboard for tracking wall time across releases. your primary regression tracker.
- The profiler produces per-op evidence for a specific run. Reach for it whenever the dashboard shows a regression or you want to know where to optimize next.
Both tools share the same component builders via benchmarks/registry.py::apply_mode, so eager vs. --compile vs. --amp results stay directly comparable between them.
Note
For full system details (conftest fixtures, scale configs, CI deploy), see benchmarks/torchebm_benchmarking.md in the repo.