```python
# Create energy function and move to appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
energy_fn = GaussianEnergy(mean, cov).to(device)

# Create sampler with the same device
sampler = LangevinDynamics(energy_fn, device=device)

# Generate samples (automatically on the correct device)
samples, _ = sampler.sample(dim=2, n_steps=1000, n_samples=10000)
```
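To check how much device placement actually buys you, a quick wall-clock comparison is often enough. The sketch below reuses the `GaussianEnergy` / `LangevinDynamics` setup from the snippet above; the `time_sampling` helper is illustrative, and `torch.cuda.synchronize()` is called before reading the clock because CUDA kernels launch asynchronously.

```python
import time

import torch

def time_sampling(device_str, dim=2, n_steps=1000, n_samples=10000):
    """Illustrative helper: time one full sampling run on the given device."""
    device = torch.device(device_str)
    energy_fn = GaussianEnergy(mean, cov).to(device)
    sampler = LangevinDynamics(energy_fn, device=device)

    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure setup work has finished
    start = time.perf_counter()
    samples, _ = sampler.sample(dim=dim, n_steps=n_steps, n_samples=n_samples)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return time.perf_counter() - start

print(f"CPU: {time_sampling('cpu'):.3f}s")
if torch.cuda.is_available():
    print(f"GPU: {time_sampling('cuda'):.3f}s")
```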
```python
# Avoid creating new tensors in loops

# Bad: creates new tensors on every iteration
for step in range(n_steps):
    x = x - step_size * energy_fn.gradient(x) + noise_scale * torch.randn_like(x)

# Good: in-place operations reuse the existing buffer
for step in range(n_steps):
    grad = energy_fn.gradient(x)
    x.sub_(step_size * grad)
    x.add_(noise_scale * torch.randn_like(x))
```
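The same idea can be pushed a bit further by reusing a preallocated noise buffer instead of calling `torch.randn_like` inside the loop. This is a sketch only: it assumes `energy_fn.gradient` still returns a fresh tensor each step, so that one allocation per iteration remains.

```python
# Sketch: reuse a preallocated noise buffer so the loop itself allocates as little as possible
noise = torch.empty_like(x)
for step in range(n_steps):
    grad = energy_fn.gradient(x)      # still allocates one tensor per step
    noise.normal_()                   # refill the existing buffer in place with N(0, 1) samples
    x.sub_(grad, alpha=step_size)     # x -= step_size * grad, without a temporary
    x.add_(noise, alpha=noise_scale)  # x += noise_scale * noise, without a temporary
```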
```python
# Standard PyTorch implementation
def langevin_step_pytorch(x, energy_fn, step_size, noise_scale):
    grad = energy_fn.gradient(x)
    noise = torch.randn_like(x) * noise_scale
    return x - step_size * grad + noise

# Using custom CUDA kernel when available
from torchebm.cuda import langevin_step_cuda

def langevin_step(x, energy_fn, step_size, noise_scale):
    if x.is_cuda and torch.cuda.is_available():
        return langevin_step_cuda(x, energy_fn, step_size, noise_scale)
    else:
        return langevin_step_pytorch(x, energy_fn, step_size, noise_scale)
```
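Whether the custom kernel is worth dispatching to is ultimately an empirical question. The sketch below times both paths with CUDA events, which measure device time rather than host wall-clock; the `time_step` helper, batch size, and iteration counts are illustrative, and `energy_fn`, `step_size`, and `noise_scale` are assumed to come from the surrounding context.

```python
def time_step(step_fn, x, iters=100, warmup=10):
    """Illustrative helper: average GPU time per call, in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):  # warm-up so one-off setup costs are not measured
        step_fn(x, energy_fn, step_size, noise_scale)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        step_fn(x, energy_fn, step_size, noise_scale)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(10_000, 2, device="cuda")
print(f"PyTorch path: {time_step(langevin_step_pytorch, x):.3f} ms/step")
print(f"Dispatching path: {time_step(langevin_step, x):.3f} ms/step")
```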
```python
# Optimize step size for Langevin dynamics
# Rule of thumb: step_size ≈ O(d^(-1/3)) where d is dimension
step_size = min(0.01, 0.1 * dim ** (-1 / 3))

# Noise scale should be sqrt(2 * step_size) for standard Langevin
noise_scale = np.sqrt(2 * step_size)
```
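The `sqrt(2 * step_size)` factor is not arbitrary: it comes from discretizing the overdamped Langevin SDE. With step size $\epsilon$, the standard unadjusted Langevin update is

$$
x_{t+1} = x_t - \epsilon \, \nabla E(x_t) + \sqrt{2\epsilon}\, \xi_t, \qquad \xi_t \sim \mathcal{N}(0, I),
$$

so the per-step noise standard deviation must be exactly $\sqrt{2\epsilon}$ for the chain to (approximately) target $p(x) \propto e^{-E(x)}$.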
```python
# Optimize HMC parameters
# Leapfrog steps should scale with dimension
n_leapfrog_steps = max(5, int(np.sqrt(dim)))

# Step size should decrease with dimension
step_size = min(0.01, 0.05 * dim ** (-1 / 4))
```
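For a concrete feel of how these rules of thumb scale, the loop below simply evaluates the two formulas above for a few dimensions. The numbers are starting points, not tuned values; real problems usually need adjustment around them.

```python
import numpy as np

for dim in (2, 100, 10_000):
    n_leapfrog_steps = max(5, int(np.sqrt(dim)))
    step_size = min(0.01, 0.05 * dim ** (-1 / 4))
    print(f"dim={dim:>6}  leapfrog_steps={n_leapfrog_steps:>4}  step_size={step_size:.4f}")
```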
```python
# Distribution across GPUs using DataParallel
import torch.nn as nn

class ParallelSampler(nn.DataParallel):
    def __init__(self, sampler, device_ids=None):
        super().__init__(sampler, device_ids=device_ids)
        self.module = sampler

    def sample_chain(self, dim, n_steps, n_samples):
        # Distribute samples across GPUs
        return self.forward(dim, n_steps, n_samples)

# Create parallel sampler
devices = list(range(torch.cuda.device_count()))
parallel_sampler = ParallelSampler(sampler, device_ids=devices)

# Generate samples using all available GPUs
samples = parallel_sampler.sample_chain(dim=100, n_steps=1000, n_samples=100000)
```
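Because MCMC chains are independent, an alternative to `DataParallel` is to split the requested chains across GPUs by hand, with one worker thread per device, and concatenate the results on the CPU. This is a sketch only: `make_sampler` is a hypothetical factory that builds a sampler bound to a given device, and the `sample` signature is assumed to match the device-placement example at the top of this section.

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def _run_shard(device_index, make_sampler, dim, n_steps, n_samples):
    device = torch.device(f"cuda:{device_index}")
    sampler = make_sampler(device)  # hypothetical factory, e.g. a LangevinDynamics bound to `device`
    samples, _ = sampler.sample(dim=dim, n_steps=n_steps, n_samples=n_samples)
    return samples.cpu()

def sample_multi_gpu(make_sampler, dim, n_steps, n_samples):
    n_gpus = torch.cuda.device_count()
    per_gpu = n_samples // n_gpus  # assumes n_samples is divisible by the GPU count
    with ThreadPoolExecutor(max_workers=n_gpus) as pool:
        futures = [
            pool.submit(_run_shard, i, make_sampler, dim, n_steps, per_gpu)
            for i in range(n_gpus)
        ]
        shards = [f.result() for f in futures]
    return torch.cat(shards, dim=0)
```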
Performance optimization in TorchEBM involves careful attention to vectorization, GPU acceleration, memory management, and algorithm-specific tuning. By following these guidelines, you can achieve significant speedups in your energy-based modeling workflows.