Performance Optimization¶
This document provides guidance on optimizing the performance of TorchEBM for both development and usage.
Performance Considerations¶
Key Performance Areas
When working with TorchEBM, pay special attention to these performance-critical areas:
- Sampling algorithms: These are iterative and typically the most compute-intensive
- Gradient calculations: Computing energy gradients is fundamental to many algorithms
- Batch processing: Effective vectorization for parallel processing
- GPU utilization: Proper device management and memory usage
Vectorization Techniques¶
Batched Operations¶
TorchEBM makes extensive use of batching to improve performance.
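As a standalone illustration (using a toy quadratic energy rather than TorchEBM's API), evaluating a whole batch of points in one vectorized call is far faster than looping over them, and produces identical results:

```python
import torch

# Toy energy E(x) = 0.5 * ||x||^2, evaluated over a whole batch at once.
def energy(x):
    # x: (n_samples, dim) -> (n_samples,) energies in one vectorized call
    return 0.5 * (x ** 2).sum(dim=-1)

x = torch.randn(1_000, 10)

# Batched: a single vectorized operation over all 1,000 points
e_batched = energy(x)

# Equivalent Python loop: one call per point, much slower
e_loop = torch.stack([energy(xi.unsqueeze(0)).squeeze(0) for xi in x])

assert torch.allclose(e_batched, e_loop)
```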
Parallel Sampling¶
Sample multiple chains in parallel by using batch dimensions.
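A minimal sketch of the idea (not TorchEBM's sampler): each row of the state tensor is an independent Langevin chain targeting a standard Gaussian, and a single update advances every chain at once:

```python
import torch

# Gradient of the toy quadratic energy E(x) = 0.5 * ||x||^2
def grad_energy(x):
    return x

n_chains, dim, n_steps, step_size = 1000, 2, 500, 0.05

# Each row of x is one chain; all chains advance in a single batched update
x = torch.randn(n_chains, dim)
for _ in range(n_steps):
    noise = torch.randn_like(x)
    x = x - step_size * grad_energy(x) + (2 * step_size) ** 0.5 * noise

# All chains target N(0, I), so the pooled sample mean should be near zero
print(x.mean(dim=0))
```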
GPU Acceleration¶
TorchEBM is designed to work efficiently on GPUs:
Device Management¶
# Create energy function and move it to the appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
energy_fn = GaussianEnergy(mean, cov).to(device)

# Create the sampler on the same device
sampler = LangevinDynamics(energy_fn, device=device)

# Generate samples (automatically on the correct device)
samples, _ = sampler.sample_chain(dim=2, n_steps=1000, n_samples=10000)
Memory Management¶
Memory management is critical for performance, especially on GPUs:
# Bad: allocates new intermediate tensors on every iteration
for step in range(n_steps):
    x = x - step_size * energy_fn.gradient(x) + noise_scale * torch.randn_like(x)

# Good: update in place to avoid per-step allocations
for step in range(n_steps):
    grad = energy_fn.gradient(x)
    x.add_(grad, alpha=-step_size)
    x.add_(torch.randn_like(x), alpha=noise_scale)
Custom CUDA Kernels¶
For the most performance-critical operations, TorchEBM provides custom CUDA kernels:
from torchebm.cuda import langevin_step_cuda

# Standard PyTorch implementation
def langevin_step_pytorch(x, energy_fn, step_size, noise_scale):
    grad = energy_fn.gradient(x)
    noise = torch.randn_like(x) * noise_scale
    return x - step_size * grad + noise

# Dispatch to the custom CUDA kernel when the input lives on a GPU
def langevin_step(x, energy_fn, step_size, noise_scale):
    if x.is_cuda:
        return langevin_step_cuda(x, energy_fn, step_size, noise_scale)
    return langevin_step_pytorch(x, energy_fn, step_size, noise_scale)
Sampling Efficiency¶
Sampling efficiency can be improved using several techniques:
- Step size adaptation: Automatically adjust step sizes based on acceptance rates or other metrics.
- Burn-in period: Discard initial samples to reduce the impact of initialization.
- Thinning: Reduce correlation between samples by keeping only every Nth sample.
- Warm starting: Initialize sampling from a distribution close to the target.
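Burn-in and thinning are simple array post-processing steps. An illustrative sketch in plain NumPy (not a TorchEBM API):

```python
import numpy as np

# Stand-in for an (n_steps, dim) array of recorded chain states
chain = np.random.randn(10_000, 3)

burn_in = 1_000   # discard early, initialization-dependent samples
thin = 10         # keep every 10th sample afterwards

kept = chain[burn_in::thin]
print(kept.shape)  # (900, 3)
```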
Profiling and Benchmarking¶
To identify performance bottlenecks, TorchEBM includes profiling utilities:
from torchebm.utils.profiling import profile_sampling

# Profile a sampling run
profiling_results = profile_sampling(
    sampler,
    dim=10,
    n_steps=1000,
    n_samples=100,
)

# Print results
print(f"Total time: {profiling_results['total_time']:.2f} seconds")
print(f"Time per step: {profiling_results['time_per_step']:.5f} seconds")
print("Component breakdown:")
for component, time_pct in profiling_results['component_times'].items():
    print(f"  {component}: {time_pct:.1f}%")
Performance Benchmarks¶
Here are some performance benchmarks for common operations:
| Operation | CPU Time (ms) | GPU Time (ms) | Speedup |
|---|---|---|---|
| Langevin step (1,000 samples, dim=10) | 8.2 | 0.41 | 20.0x |
| HMC step (1,000 samples, dim=10) | 15.4 | 0.76 | 20.3x |
| Energy gradient (10,000 samples, dim=100) | 42.1 | 1.8 | 23.4x |
| Full sampling (10,000 samples, 100 steps) | 820 | 38 | 21.6x |
Performance Tips and Best Practices¶
General Tips¶
- Use the right device: Always move computation to GPU when available
- Batch processing: Process data in batches rather than individually
- Reuse tensors: Avoid creating new tensors in inner loops
- Monitor memory: Use torch.cuda.memory_summary() to track memory usage
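The "reuse tensors" tip can be sketched as follows: preallocate a noise buffer once and refill it in place, rather than allocating a fresh tensor on every iteration (an illustrative pattern, not a TorchEBM API):

```python
import torch

x = torch.zeros(1000, 10)
noise = torch.empty_like(x)   # allocated once, outside the loop

for _ in range(100):
    noise.normal_()            # refill the existing buffer in place
    x.add_(noise, alpha=0.01)  # in-place update, no new tensor created
```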
Sampling Tips¶
- Tune step sizes: Optimal step sizes balance exploration and stability
- Parallel chains: Use multiple chains to improve sample diversity
- Adaptive methods: Use adaptive samplers for complex distributions
- Mixed precision: Consider using mixed precision for larger models
Common Pitfalls
Avoid these common performance issues:
- Unnecessary CPU-GPU transfers: Keep data on the same device
- Small batch sizes: Too small batches underutilize hardware
- Unneeded gradient tracking: Disable gradients when not training
- Excessive logging: Logging every step can significantly slow down sampling
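The gradient-tracking pitfall can be sketched with a toy chain that uses a closed-form gradient; wrapping the loop in torch.no_grad() prevents autograd from building a graph through every step. (Samplers that obtain the energy gradient via autograd instead enable gradients only for that computation and detach the state between steps.)

```python
import torch

x = torch.randn(1000, 10)

# No autograd graph is built across the chain, saving time and memory
with torch.no_grad():
    for _ in range(100):
        x = x - 0.01 * x + 0.1 * torch.randn_like(x)

assert not x.requires_grad
```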
Algorithm-Specific Optimizations¶
Langevin Dynamics¶
# Optimize step size for Langevin dynamics
# Rule of thumb: step_size ≈ O(d^(-1/3)) where d is the dimension
step_size = min(0.01, 0.1 * dim ** (-1 / 3))

# Noise scale should be sqrt(2 * step_size) for standard Langevin
noise_scale = np.sqrt(2 * step_size)
Hamiltonian Monte Carlo¶
# Optimize HMC parameters
# The number of leapfrog steps should scale with dimension
n_leapfrog_steps = max(5, int(np.sqrt(dim)))

# Step size should decrease with dimension
step_size = min(0.01, 0.05 * dim ** (-1 / 4))
Multi-GPU Scaling¶
For extremely large sampling tasks, TorchEBM supports multi-GPU execution:
# Distribute work across GPUs using DataParallel
import torch.nn as nn

class ParallelSampler(nn.DataParallel):
    def __init__(self, sampler, device_ids=None):
        # DataParallel stores the wrapped sampler as self.module
        super().__init__(sampler, device_ids=device_ids)

    def sample_chain(self, dim, n_steps, n_samples):
        # Distribute samples across GPUs
        return self.forward(dim, n_steps, n_samples)

# Create the parallel sampler over all visible GPUs
devices = list(range(torch.cuda.device_count()))
parallel_sampler = ParallelSampler(sampler, device_ids=devices)

# Generate samples using all available GPUs
samples = parallel_sampler.sample_chain(dim=100, n_steps=1000, n_samples=100000)
Conclusion¶
Performance optimization in TorchEBM involves careful attention to vectorization, GPU acceleration, memory management, and algorithm-specific tuning. By following these guidelines, you can achieve significant speedups in your energy-based modeling workflows.