CUDA Optimizations¶

Performance Engineering

TorchEBM leverages CUDA to accelerate performance-critical operations. This guide explains the CUDA optimization strategies and how to implement new CUDA kernels.

Overview¶

CUDA optimizations in TorchEBM focus on accelerating three main performance bottlenecks:

:material-gradient:{ .lg .middle } Score Function Computation

Computing gradients of energy functions can be computationally intensive, especially for large batches or complex energy functions.
Sampling Operations

Sampling algorithms like Langevin dynamics require many iterations of score computation and updates.
Energy Evaluation

Evaluating energy functions on large batches of samples during training or inference.

CUDA Architecture¶

TorchEBM's CUDA implementation follows a layered architecture:

torchebm/
└── cuda/
    ├── __init__.py             # Package exports
    ├── ops.py                  # Python interface to CUDA operations
    ├── utils.py                # CUDA utilities
    ├── bindings.cpp            # PyTorch C++ bindings
    └── kernels/                # CUDA kernel implementations
        ├── score_function.cu   # Score function kernel
        ├── langevin_step.cu    # Langevin dynamics step kernel
        ├── energy_kernels.cu   # Energy function kernels
        └── include/            # Header files
            ├── common.cuh      # Common utilities
            └── ...

PyTorch C++ Extension¶

TorchEBM's CUDA functionality is built on PyTorch's C++ extension mechanism:

# In setup.py
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

setup(
    name="torchebm",
    ext_modules=[
        CUDAExtension(
            "torchebm.cuda.kernels",
            sources=[
                "torchebm/cuda/bindings.cpp",
                "torchebm/cuda/kernels/score_function.cu",
                "torchebm/cuda/kernels/langevin_step.cu",
                "torchebm/cuda/kernels/energy_kernels.cu",
            ],
            include_dirs=["torchebm/cuda/kernels/include"],
            extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]}
        )
    ],
    cmdclass={"build_ext": BuildExtension}
)

Score Function Optimization¶

The score function (gradient of energy) computation is optimized with CUDA:

Python Interface¶

def cuda_score(energy_fn, x, create_graph=False):
    """CUDA-optimized score function computation.

    Args:
        energy_fn: Energy function
        x: Input tensor of batch_shape (batch_size, dim)
        create_graph: Whether to create gradient graph

    Returns:
        Score tensor of batch_shape (batch_size, dim)
    """
    # Check if energy function has custom CUDA implementation
    if hasattr(energy_fn, "cuda_score_impl") and torch.cuda.is_available():
        return energy_fn.cuda_score_impl(x, create_graph)

    # Fall back to standard implementation for common energy functions
    if isinstance(energy_fn, GaussianEnergy) and torch.cuda.is_available():
        return _gaussian_score_cuda(energy_fn, x)

    # Fall back to autograd
    return score_function(energy_fn, x, create_graph)

CUDA Kernel¶

// In score_function.cu
__global__ void gaussian_score_kernel(
    const float* x,        // Input samples (batch_size * dim)
    const float* mean,     // Mean vector (dim)
    const float* precision,// Precision matrix (dim * dim)
    float* score,          // Output score (batch_size * dim)
    int batch_size,        // Batch size
    int dim                // Dimensionality
) {
    // Get sample index from CUDA thread
    int sample_idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Check if this thread processes a valid sample
    if (sample_idx < batch_size) {
        // Compute centered sample (x - mean)
        float centered[MAX_DIM];  // Use shared memory for better performance
        for (int d = 0; d < dim; ++d) {
            centered[d] = x[sample_idx * dim + d] - mean[d];
        }

        // Compute Precision * (x - mean)
        for (int d = 0; d < dim; ++d) {
            float sum = 0.0f;
            for (int j = 0; j < dim; ++j) {
                sum += precision[d * dim + j] * centered[j];
            }
            // Score is -Precision * (x - mean)
            score[sample_idx * dim + d] = -sum;
        }
    }
}

// C++ binding function
torch::Tensor gaussian_score_cuda(
    torch::Tensor x,
    torch::Tensor mean,
    torch::Tensor precision
) {
    // Get dimensions
    int batch_size = x.size(0);
    int dim = x.size(1);

    // Create output tensor
    auto score = torch::empty_like(x);

    // Configure CUDA kernel
    const int threads_per_block = 256;
    const int blocks = (batch_size + threads_per_block - 1) / threads_per_block;

    // Launch kernel
    gaussian_score_kernel<<<blocks, threads_per_block>>>(
        x.data_ptr<float>(),
        mean.data_ptr<float>(),
        precision.data_ptr<float>(),
        score.data_ptr<float>(),
        batch_size,
        dim
    );

    return score;
}

Langevin Dynamics Optimization¶

Langevin dynamics sampling is accelerated using CUDA kernels:

Python Interface¶

class CUDALangevinDynamics(LangevinDynamics):
    """CUDA-optimized Langevin dynamics sampler."""

    def __init__(self, energy_function, step_size=0.01, noise_scale=1.0):
        super().__init__(energy_function, step_size, noise_scale)

    def sample_step(self, x):
        """Perform one step of Langevin dynamics with CUDA optimization."""
        if not torch.cuda.is_available() or not x.is_cuda:
            # Fall back to CPU implementation
            return super().sample_step(x)

        # Use optimized CUDA implementation
        return langevin_step_cuda(
            x,
            self.energy_function,
            self.step_size,
            self.noise_scale
        )

CUDA Kernel¶

// In langevin_step.cu
__global__ void langevin_step_kernel(
    const float* x,        // Input samples (batch_size * dim)
    const float* score,    // Score function values (batch_size * dim)
    float* x_new,          // Updated samples (batch_size * dim)
    float step_size,       // Step size parameter
    float noise_scale,     // Noise scale parameter
    float* noise,          // Random noise (batch_size * dim)
    int batch_size,        // Batch size
    int dim                // Dimensionality
) {
    // Get global thread ID
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Check bounds
    if (idx < batch_size * dim) {
        // Compute Langevin update
        // x_new = x - step_size * score + sqrt(2 * step_size * noise_scale) * noise
        float noise_factor = sqrt(2.0f * step_size * noise_scale);
        x_new[idx] = x[idx] - step_size * score[idx] + noise_factor * noise[idx];
    }
}

// C++ binding function
torch::Tensor langevin_step_cuda(
    torch::Tensor x,
    torch::Tensor score,
    float step_size,
    float noise_scale
) {
    // Get dimensions
    int batch_size = x.size(0);
    int dim = x.size(1);

    // Generate random noise
    auto noise = torch::randn_like(x);

    // Create output tensor
    auto x_new = torch::empty_like(x);

    // Configure CUDA kernel
    const int threads_per_block = 256;
    const int total_elements = batch_size * dim;
    const int blocks = (total_elements + threads_per_block - 1) / threads_per_block;

    // Launch kernel
    langevin_step_kernel<<<blocks, threads_per_block>>>(
        x.data_ptr<float>(),
        score.data_ptr<float>(),
        x_new.data_ptr<float>(),
        step_size,
        noise_scale,
        noise.data_ptr<float>(),
        batch_size,
        dim
    );

    return x_new;
}

Energy Function Optimization¶

Energy function evaluation is optimized for specific analytical energy functions:

Gaussian Energy¶

// In energy_kernels.cu
__global__ void gaussian_energy_kernel(
    const float* x,        // Input samples (batch_size * dim)
    const float* mean,     // Mean vector (dim)
    const float* precision,// Precision matrix (dim * dim)
    float* energy,         // Output energy (batch_size)
    int batch_size,        // Batch size
    int dim                // Dimensionality
) {
    // Get sample index
    int sample_idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Check bounds
    if (sample_idx < batch_size) {
        // Compute centered values
        float centered[MAX_DIM];
        for (int d = 0; d < dim; ++d) {
            centered[d] = x[sample_idx * dim + d] - mean[d];
        }

        // Compute quadratic form: centered^T * precision * centered
        float quadratic_sum = 0.0f;
        for (int i = 0; i < dim; ++i) {
            float row_sum = 0.0f;
            for (int j = 0; j < dim; ++j) {
                row_sum += precision[i * dim + j] * centered[j];
            }
            quadratic_sum += centered[i] * row_sum;
        }

        // Energy is 0.5 * quadratic_sum
        energy[sample_idx] = 0.5f * quadratic_sum;
    }
}

Memory Optimization Techniques¶

TorchEBM uses several memory optimization techniques:

Shared Memory Usage¶

__global__ void optimized_kernel(...) {
    // Declare shared memory for frequently accessed data
    __shared__ float shared_data[BLOCK_SIZE];

    // Load data into shared memory
    shared_data[threadIdx.x] = global_data[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();

    // Use shared memory for computation
    // ...
}

Memory Coalescing¶

// Good: Coalesced memory access
__global__ void coalesced_kernel(float* data, float* result, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        result[idx] = data[idx] * 2.0f;
    }
}

// Avoid: Non-coalesced memory access
__global__ void noncoalesced_kernel(float* data, float* result, int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < height) {
        for (int col = 0; col < width; ++col) {
            // Non-coalesced access pattern
            result[row * width + col] = data[row * width + col] * 2.0f;
        }
    }
}

Reducing Register Pressure¶

__global__ void optimized_kernel(...) {
    // Use local variables instead of arrays where possible
    float x1, x2, x3, x4;

    // Process in chunks to reduce register usage
    // ...
}

Thread Block Organization¶

CUDA kernels in TorchEBM are organized to maximize performance:

// Compute optimal block size based on problem dimensions
int compute_block_size(int dim) {
    // Power of 2 for better performance
    if (dim <= 32) return 32;
    if (dim <= 64) return 64;
    if (dim <= 128) return 128;
    return 256;
}

// Launch kernel with optimal configuration
void launch_kernel(int batch_size, int dim) {
    int block_size = compute_block_size(dim);
    int grid_size = (batch_size + block_size - 1) / block_size;

    my_kernel<<<grid_size, block_size>>>(/* args */);
}

Custom CUDA Kernels for Special Energy Functions¶

TorchEBM includes specialized CUDA kernels for common energy functions:

// Specialized kernel for Rosenbrock function
__global__ void rosenbrock_energy_kernel(
    const float* x,
    float* energy,
    float a,
    float b,
    int batch_size,
    int dim
) {
    int sample_idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (sample_idx < batch_size) {
        float sum = 0.0f;

        for (int i = 0; i < dim - 1; ++i) {
            float x_i = x[sample_idx * dim + i];
            float x_i_plus_1 = x[sample_idx * dim + i + 1];

            float term1 = b * (x_i_plus_1 - x_i * x_i) * (x_i_plus_1 - x_i * x_i);
            float term2 = (x_i - a) * (x_i - a);

            sum += term1 + term2;
        }

        energy[sample_idx] = sum;
    }
}

Performance Benchmarks¶

The following benchmarks demonstrate the performance gains from CUDA optimization:

Score Function Computation
- CPU Implementation: 100 ms
- CUDA Implementation: 5 ms
- Speedup: 20x
Langevin Dynamics Sampling
- CPU Implementation: 2000 ms
- CUDA Implementation: 200 ms
- Speedup: 10x
Energy Evaluation
- CPU Implementation: 80 ms
- CUDA Implementation: 6 ms
- Speedup: 13x

Mixed Precision Training¶

TorchEBM supports mixed precision training:

def mixed_precision_score(energy_fn, x):
    """Compute score with mixed precision."""
    # Cast to half precision for computation
    x_half = x.half()
    x_half.requires_grad_(True)

    # Compute energy in half precision
    with torch.cuda.amp.autocast():
        energy = energy_fn(x_half)

    # Compute gradient in full precision
    score = torch.autograd.grad(energy.sum(), x_half)[0].float()

    return score

Multi-GPU Support¶

TorchEBM provides utilities for multi-GPU operation:

def distribute_sampling(energy_fn, n_samples, n_steps, device_ids):
    """Distribute sampling across multiple GPUs."""
    # Distribute samples across devices
    samples_per_device = n_samples // len(device_ids)

    results = []
    for i, device_id in enumerate(device_ids):
        device = torch.device(f"cuda:{device_id}")

        # Create sampler on device
        sampler = LangevinDynamics(energy_fn).to(device)

        # Compute samples for this device
        samples = sampler.sample(
            dim=energy_fn.dim,
            n_steps=n_steps,
            n_samples=samples_per_device
        )

        results.append(samples)

    # Gather results from all devices
    return torch.cat(results, dim=0)

CUDA Stream Management¶

TorchEBM uses CUDA streams for concurrent execution:

def parallel_score_computation(energy_fn, samples_list):
    """Compute scores for multiple sample batches in parallel."""
    # Create streams for parallel execution
    streams = [torch.cuda.Stream() for _ in range(len(samples_list))]

    # Start computation in separate streams
    results = []
    for i, samples in enumerate(samples_list):
        with torch.cuda.stream(streams[i]):
            score = energy_fn.score(samples)
            results.append(score)

    # Synchronize streams
    for stream in streams:
        stream.synchronize()

    return results

Implementing Custom CUDA Kernels¶

To add a new CUDA kernel to TorchEBM:

Create a new .cu file in the torchebm/cuda/kernels/ directory
Implement the CUDA kernel and C++ binding function
Add the source file to the CUDAExtension in setup.py
Create a Python interface in torchebm/cuda/ops.py

Example of a custom kernel implementation:

// In custom_kernel.cu
#include <torch/extension.h>
#include "common.cuh"

// CUDA kernel
__global__ void custom_kernel(...) {
    // Kernel implementation
}

// C++ binding function
torch::Tensor custom_kernel_cuda(...) {
    // Binding implementation
    // ...
    return result;
}

// Register function for Python binding
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("custom_kernel", &custom_kernel_cuda, "Custom kernel implementation");
}

Troubleshooting CUDA Issues¶

Common CUDA issues and solutions:

Memory Errors¶

Check for memory leaks
Reduce batch size
Use torch.cuda.empty_cache()
Monitor memory usage with torch.cuda.memory_summary()

Performance Issues¶

Use CUDA profiling tools
Check for serialized operations
Optimize memory access patterns
Reduce kernel launch overhead

Common Pitfalls

Check for proper error handling in CUDA code
Beware of race conditions in kernel execution
Ensure correct synchronization between CPU and GPU
Verify tensor memory layouts match expectations

Resources¶

Core Components

Understand the core components of TorchEBM.

Core Components
Energy Functions

Learn about energy function implementation details.

Energy Functions
CUDA Programming

NVIDIA's CUDA programming guide.

CUDA Guide