Performance Optimization¶
This document provides guidance on optimizing the performance of TorchEBM for both development and usage.
Performance Considerations¶
Key Performance Areas
When working with TorchEBM, pay special attention to these performance-critical areas:
- Sampling algorithms: These are iterative and typically the most compute-intensive
- Gradient calculations: Computing energy gradients is fundamental to many algorithms
- Batch processing: Effective vectorization for parallel processing
- GPU utilization: Proper device management and memory usage
Vectorization Techniques¶
Batched Operations¶
TorchEBM makes extensive use of batching to improve performance.
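As a standalone illustration (using a toy quadratic energy rather than TorchEBM's API), evaluating a whole batch of points in one vectorized call is far faster than looping over them, and produces identical results:

```python
import torch

# Toy energy E(x) = 0.5 * ||x||^2, evaluated over a whole batch at once.
def energy(x):
    # x: (n_samples, dim) -> (n_samples,) energies in one vectorized call
    return 0.5 * (x ** 2).sum(dim=-1)

x = torch.randn(1_000, 10)

# Batched: a single vectorized operation over all 1,000 points
e_batched = energy(x)

# Equivalent Python loop: one call per point, much slower
e_loop = torch.stack([energy(xi.unsqueeze(0)).squeeze(0) for xi in x])

assert torch.allclose(e_batched, e_loop)
```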
Parallel Sampling¶
Sample multiple chains in parallel by using batch dimensions.
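A minimal sketch of the idea (not TorchEBM's sampler): each row of the state tensor is an independent Langevin chain targeting a standard Gaussian, and a single update advances every chain at once:

```python
import torch

# Gradient of the toy quadratic energy E(x) = 0.5 * ||x||^2
def grad_energy(x):
    return x

n_chains, dim, n_steps, step_size = 1000, 2, 500, 0.05

# Each row of x is one chain; all chains advance in a single batched update
x = torch.randn(n_chains, dim)
for _ in range(n_steps):
    noise = torch.randn_like(x)
    x = x - step_size * grad_energy(x) + (2 * step_size) ** 0.5 * noise

# All chains target N(0, I), so the pooled sample mean should be near zero
print(x.mean(dim=0))
```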
GPU Acceleration¶
TorchEBM is designed to work efficiently on GPUs:
Device Management¶
# Create energy function and move it to the appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
energy_fn = GaussianEnergy(mean, cov).to(device)

# Create the sampler on the same device
sampler = LangevinDynamics(energy_fn, device=device)

# Generate samples (automatically on the correct device)
samples, _ = sampler.sample_chain(dim=2, n_steps=1000, n_samples=10000)
Memory Management¶
Memory management is critical for performance, especially on GPUs:
# Bad: allocates new intermediate tensors on every iteration
for step in range(n_steps):
    x = x - step_size * energy_fn.gradient(x) + noise_scale * torch.randn_like(x)

# Good: update in place to avoid per-step allocations
for step in range(n_steps):
    grad = energy_fn.gradient(x)
    x.add_(grad, alpha=-step_size)
    x.add_(torch.randn_like(x), alpha=noise_scale)
Custom CUDA Kernels¶
For the most performance-critical operations, TorchEBM provides custom CUDA kernels:
from torchebm.cuda import langevin_step_cuda

# Standard PyTorch implementation
def langevin_step_pytorch(x, energy_fn, step_size, noise_scale):
    grad = energy_fn.gradient(x)
    noise = torch.randn_like(x) * noise_scale
    return x - step_size * grad + noise

# Dispatch to the custom CUDA kernel when the input lives on a GPU
def langevin_step(x, energy_fn, step_size, noise_scale):
    if x.is_cuda:
        return langevin_step_cuda(x, energy_fn, step_size, noise_scale)
    return langevin_step_pytorch(x, energy_fn, step_size, noise_scale)
Sampling Efficiency¶
Sampling efficiency can be improved using several techniques:
- Step size adaptation: Automatically adjust step sizes based on acceptance rates or other metrics.
- Burn-in period: Discard initial samples to reduce the impact of initialization.
- Thinning: Reduce correlation between samples by keeping only every Nth sample.
- Warm starting: Initialize sampling from a distribution close to the target.
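Burn-in and thinning are simple array post-processing steps. An illustrative sketch in plain NumPy (not a TorchEBM API):

```python
import numpy as np

# Stand-in for an (n_steps, dim) array of recorded chain states
chain = np.random.randn(10_000, 3)

burn_in = 1_000   # discard early, initialization-dependent samples
thin = 10         # keep every 10th sample afterwards

kept = chain[burn_in::thin]
print(kept.shape)  # (900, 3)
```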
Profiling and Benchmarking¶
To identify performance bottlenecks, TorchEBM includes profiling utilities:
from torchebm.utils.profiling import profile_sampling

# Profile a sampling run
profiling_results = profile_sampling(
    sampler,
    dim=10,
    n_steps=1000,
    n_samples=100,
)

# Print results
print(f"Total time: {profiling_results['total_time']:.2f} seconds")
print(f"Time per step: {profiling_results['time_per_step']:.5f} seconds")
print("Component breakdown:")
for component, time_pct in profiling_results['component_times'].items():
    print(f"  {component}: {time_pct:.1f}%")
Performance Benchmarks¶
Here are some performance benchmarks for common operations:
| Operation | CPU Time (ms) | GPU Time (ms) | Speedup |
|---|---|---|---|
| Langevin step (1,000 samples, dim=10) | 8.2 | 0.41 | 20.0x |
| HMC step (1,000 samples, dim=10) | 15.4 | 0.76 | 20.3x |
| Energy gradient (10,000 samples, dim=100) | 42.1 | 1.8 | 23.4x |
| Full sampling (10,000 samples, 100 steps) | 820 | 38 | 21.6x |
Performance Tips and Best Practices¶
General Tips¶
- Use the right device: Always move computation to GPU when available
- Batch processing: Process data in batches rather than individually
- Reuse tensors: Avoid creating new tensors in inner loops
- Monitor memory: Use torch.cuda.memory_summary() to track memory usage
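The "reuse tensors" tip can be sketched as follows: preallocate a noise buffer once and refill it in place, rather than allocating a fresh tensor on every iteration (an illustrative pattern, not a TorchEBM API):

```python
import torch

x = torch.zeros(1000, 10)
noise = torch.empty_like(x)   # allocated once, outside the loop

for _ in range(100):
    noise.normal_()            # refill the existing buffer in place
    x.add_(noise, alpha=0.01)  # in-place update, no new tensor created
```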
Sampling Tips¶
- Tune step sizes: Optimal step sizes balance exploration and stability
- Parallel chains: Use multiple chains to improve sample diversity
- Adaptive methods: Use adaptive samplers for complex distributions
- Mixed precision: Consider using mixed precision for larger models
Common Pitfalls
Avoid these common performance issues:
- Unnecessary CPU-GPU transfers: Keep data on the same device
- Small batch sizes: Too small batches underutilize hardware
- Unneeded gradient tracking: Disable gradients when not training
- Excessive logging: Logging every step can significantly slow down sampling
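The gradient-tracking pitfall can be sketched with a toy chain that uses a closed-form gradient; wrapping the loop in torch.no_grad() prevents autograd from building a graph through every step. (Samplers that obtain the energy gradient via autograd instead enable gradients only for that computation and detach the state between steps.)

```python
import torch

x = torch.randn(1000, 10)

# No autograd graph is built across the chain, saving time and memory
with torch.no_grad():
    for _ in range(100):
        x = x - 0.01 * x + 0.1 * torch.randn_like(x)

assert not x.requires_grad
```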
Algorithm-Specific Optimizations¶
Langevin Dynamics¶
# Optimize step size for Langevin dynamics
# Rule of thumb: step_size ≈ O(d^(-1/3)) where d is the dimension
step_size = min(0.01, 0.1 * dim ** (-1 / 3))

# Noise scale should be sqrt(2 * step_size) for standard Langevin
noise_scale = np.sqrt(2 * step_size)
Hamiltonian Monte Carlo¶
# Optimize HMC parameters
# The number of leapfrog steps should scale with dimension
n_leapfrog_steps = max(5, int(np.sqrt(dim)))

# Step size should decrease with dimension
step_size = min(0.01, 0.05 * dim ** (-1 / 4))
Multi-GPU Scaling¶
For extremely large sampling tasks, TorchEBM supports multi-GPU execution:
# Distribute work across GPUs using DataParallel
import torch.nn as nn

class ParallelSampler(nn.DataParallel):
    def __init__(self, sampler, device_ids=None):
        # DataParallel stores the wrapped sampler as self.module
        super().__init__(sampler, device_ids=device_ids)

    def sample_chain(self, dim, n_steps, n_samples):
        # Distribute samples across GPUs
        return self.forward(dim, n_steps, n_samples)

# Create the parallel sampler over all visible GPUs
devices = list(range(torch.cuda.device_count()))
parallel_sampler = ParallelSampler(sampler, device_ids=devices)

# Generate samples using all available GPUs
samples = parallel_sampler.sample_chain(dim=100, n_steps=1000, n_samples=100000)
Conclusion¶
Performance optimization in TorchEBM involves careful attention to vectorization, GPU acceleration, memory management, and algorithm-specific tuning. By following these guidelines, you can achieve significant speedups in your energy-based modeling workflows.