
torchebm.losses.contrastive_divergence

Contrastive Divergence Loss Module.

This module provides implementations of Contrastive Divergence (CD) and its variants for training energy-based models (EBMs). Contrastive Divergence is a computationally efficient approximation to maximum likelihood estimation that avoids running MCMC chains to convergence when sampling from the model distribution.

Key Features

  • Standard Contrastive Divergence (CD-k)
  • Persistent Contrastive Divergence (PCD)
  • Parallel Tempering Contrastive Divergence (PTCD)
  • Support for different MCMC samplers

Module Components

Classes:

Name                              Description
ContrastiveDivergence             Standard CD-k implementation.
PersistentContrastiveDivergence   Implementation with persistent Markov chains.
ParallelTemperingCD               Implementation with parallel chains at different temperatures.


Usage Example

Basic ContrastiveDivergence Usage

from torchebm.losses import ContrastiveDivergence
from torchebm.samplers import LangevinDynamics
from torchebm.energy_functions import MLPEnergyFunction
import torch

# Define the energy function
energy_fn = MLPEnergyFunction(input_dim=2, hidden_dim=64)

# Set up the sampler
sampler = LangevinDynamics(
    energy_function=energy_fn,
    step_size=0.1,
    noise_scale=0.01
)

# Create the CD loss
cd_loss = ContrastiveDivergence(
    energy_function=energy_fn,
    sampler=sampler,
    n_steps=10,
    persistent=False
)

# In the training loop:
data_batch = torch.randn(32, 2)  # Real data samples
loss, negative_samples = cd_loss(data_batch)
loss.backward()
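
The snippet above can be extended into a minimal training loop. This sketch assumes the energy function is a standard torch.nn.Module whose parameters are optimized directly; replace the random data_batch with batches from a real data loader:

# Minimal training loop around the CD loss from the snippet above.
optimizer = torch.optim.Adam(energy_fn.parameters(), lr=1e-3)

for epoch in range(100):
    data_batch = torch.randn(32, 2)  # placeholder; use real data loader batches
    optimizer.zero_grad()
    loss, negative_samples = cd_loss(data_batch)  # CD loss and MCMC negatives
    loss.backward()
    optimizer.step()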

Mathematical Foundations

Contrastive Divergence Principles

Contrastive Divergence approximates the gradient of the log-likelihood:

$$\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\left[\nabla_\theta E_\theta(x')\right]$$

by replacing the expectation under the model distribution with samples obtained after k steps of MCMC starting from the data:

$$\nabla_\theta \log p_\theta(x) \approx -\nabla_\theta E_\theta(x) + \nabla_\theta E_\theta(x^k)$$

where $x^k$ is obtained after running $k$ steps of MCMC starting from $x$.
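
To make the link to a training objective concrete, the following is a minimal sketch in plain PyTorch (not torchebm's internal code) of the surrogate loss whose gradient matches this approximation; the tiny MLP energy and the one-line stand-in for the $k$ MCMC steps are illustrative placeholders:

import torch
import torch.nn as nn

# Illustrative only: a small MLP energy and the CD surrogate loss whose
# gradient matches the approximation above.
energy_fn = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))

def cd_surrogate_loss(x_data: torch.Tensor, x_k: torch.Tensor) -> torch.Tensor:
    # E_theta(x) term: pushes the energy of real data down.
    pos_energy = energy_fn(x_data).mean()
    # E_theta(x^k) term: pushes the energy of the MCMC samples up.
    # x_k is treated as a constant (no gradient through the sampler).
    neg_energy = energy_fn(x_k.detach()).mean()
    # The gradient of (pos_energy - neg_energy) is grad E_theta(x) - grad E_theta(x^k),
    # the negative of the log-likelihood estimate above, so minimizing this
    # loss performs approximate likelihood ascent.
    return pos_energy - neg_energy

x_data = torch.randn(32, 2)                     # real data batch
x_k = x_data + 0.1 * torch.randn_like(x_data)   # stand-in for k MCMC steps
loss = cd_surrogate_loss(x_data, x_k)
loss.backward()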

Why Contrastive Divergence Works

  • Computational Efficiency: Requires only a few MCMC steps rather than running chains to convergence.
  • Stability: Starting chains from data points ensures the negative samples are in high-density regions.
  • Effective Learning: Despite theoretical limitations, works well in practice for many energy-based models.

Variants

Persistent Contrastive Divergence (PCD)

PCD improves upon standard CD by maintaining a persistent set of Markov chains between parameter updates. Instead of restarting the chains from the data, it continues them from the previous iteration:

  1. Initialize a set of persistent chains (often with random noise)
  2. For each training batch:
     a. Update the persistent chains with k steps of MCMC
     b. Use these updated chains for the negative samples
     c. Keep the updated state for the next batch

PCD can explore the energy landscape more thoroughly, especially for complex distributions.
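
The bookkeeping behind this procedure can be sketched in a few lines of plain PyTorch (an illustration, not torchebm's PersistentContrastiveDivergence implementation); the small MLP energy and the unadjusted Langevin step are stand-ins:

import torch
import torch.nn as nn

energy_fn = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))
n_chains, k_steps, step_size, noise_scale = 128, 10, 0.1, 0.01

# 1. Initialize persistent chains with random noise.
persistent_chains = torch.randn(n_chains, 2)

def langevin_step(x: torch.Tensor) -> torch.Tensor:
    # One unadjusted Langevin update: x <- x - step_size * grad E(x) + noise.
    x = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
    return (x - step_size * grad + noise_scale * torch.randn_like(x)).detach()

for batch in range(3):  # stand-in for the training loop
    # 2a. Advance the persistent chains by k MCMC steps.
    for _ in range(k_steps):
        persistent_chains = langevin_step(persistent_chains)
    # 2b. Use the updated chains as this batch's negative samples.
    negative_samples = persistent_chains
    # (The CD parameter update using these negatives would go here.)
    # 2c. The chain state is kept, not reset, for the next batch.

Detaching the chain state after each step keeps the sampler's computation graph out of the parameter update, which is the usual convention for CD-style training.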

Parallel Tempering CD

Parallel Tempering CD uses multiple chains at different temperatures to improve exploration:

  1. Maintain chains at different temperatures $T_1 < T_2 < \dots < T_n$
  2. For each chain, perform MCMC steps using the energy function $E(x)/T_i$
  3. Occasionally swap states between adjacent temperature chains
  4. Use samples from the chain with $T_1 = 1$ as negative samples

This helps overcome energy barriers and explore multimodal distributions.
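
A minimal sketch of this scheme in plain PyTorch, assuming unadjusted Langevin updates as the per-chain MCMC step and Metropolis swaps between neighbouring temperatures (illustrative only, not the ParallelTemperingCD class):

import torch
import torch.nn as nn

energy_fn = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))
temperatures = torch.tensor([1.0, 2.0, 4.0])   # T_1 < T_2 < T_3, with T_1 = 1
step_size, noise_scale, n_chains = 0.1, 0.01, 64

# 1. One state tensor per temperature level.
chains = [torch.randn(n_chains, 2) for _ in temperatures]

def tempered_langevin_step(x: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    # 2. Langevin update targeting exp(-E(x)/T).
    x = x.clone().requires_grad_(True)
    grad = torch.autograd.grad((energy_fn(x) / T).sum(), x)[0]
    return (x - step_size * grad + noise_scale * torch.randn_like(x)).detach()

for step in range(10):
    chains = [tempered_langevin_step(x, T) for x, T in zip(chains, temperatures)]
    # 3. Occasionally propose swaps between adjacent temperature chains.
    if step % 5 == 0:
        with torch.no_grad():
            for i in range(len(chains) - 1):
                e_lo = energy_fn(chains[i]).squeeze(-1)
                e_hi = energy_fn(chains[i + 1]).squeeze(-1)
                # Metropolis acceptance for exchanging states between T_i and T_{i+1}.
                log_alpha = (1.0 / temperatures[i] - 1.0 / temperatures[i + 1]) * (e_lo - e_hi)
                accept = torch.rand_like(log_alpha).log() < log_alpha
                swapped_lo = torch.where(accept.unsqueeze(-1), chains[i + 1], chains[i])
                swapped_hi = torch.where(accept.unsqueeze(-1), chains[i], chains[i + 1])
                chains[i], chains[i + 1] = swapped_lo, swapped_hi

# 4. Negative samples come from the T_1 = 1 chain.
negative_samples = chains[0]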


Practical Considerations

Tuning Parameters

  • n_steps: More MCMC steps improve the quality of the negative samples but increase the computational cost.
  • persistent: Setting this to True enables PCD, which often improves learning for complex distributions (see the configuration sketch after this list).
  • sampler parameters: The quality of CD depends heavily on the parameters of the underlying MCMC sampler.
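
For example, building on the Usage Example above, enabling PCD and increasing the sampling budget might look like the following; the parameter names mirror that earlier snippet, so check the signatures in your installed torchebm version:

# Hypothetical tuning of the earlier setup: PCD with a longer, gentler sampler.
pcd_loss = ContrastiveDivergence(
    energy_function=energy_fn,
    sampler=LangevinDynamics(
        energy_function=energy_fn,
        step_size=0.05,      # smaller steps: more stable, slower mixing
        noise_scale=0.01,
    ),
    n_steps=25,              # more MCMC steps per update, at higher cost
    persistent=True,         # keep chains alive between batches (PCD)
)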

How to Diagnose Issues?

Watch for these signs of problematic training (the monitoring sketch after this list shows one simple way to track them):

  • Exploding or vanishing gradients
  • Increasing loss values over time
  • Negative samples that don't resemble the data distribution
  • Energy function collapse (assigning the same energy to all points)
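
One simple diagnostic is to log the mean energies of the positive and negative batches and the gradient norm after each update. The sketch below is plain PyTorch and assumes the energy function from the Usage Example is a torch.nn.Module that returns per-sample energies:

# Mean energy of real data and of the negative samples returned by cd_loss.
pos_energy = energy_fn(data_batch).mean().item()
neg_energy = energy_fn(negative_samples).mean().item()
# Total gradient norm after loss.backward(); exploding values are a red flag.
grad_norm = sum(
    p.grad.norm().item() for p in energy_fn.parameters() if p.grad is not None
)
# A gap stuck near zero can signal energy collapse; a strongly negative gap
# (negatives far above the data in energy) matches the unbalanced-energy pitfall below.
print(f"E(data)={pos_energy:.3f}  E(samples)={neg_energy:.3f}  "
      f"gap={pos_energy - neg_energy:.3f}  grad_norm={grad_norm:.3f}")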

Common Pitfalls

  • Too Few MCMC Steps: Can lead to biased gradients and poor convergence
  • Improper Initialization: For PCD, poor initial chain states may hinder learning
  • Unbalanced Energy: If negative samples have much higher energy than positive samples, learning may be ineffective

Advanced Insights

Why CD May Outperform MLE

In some cases, CD might actually lead to better models than exact maximum likelihood:

  • Prevents overfitting to noise in the data
  • Focuses the model capacity on distinguishing data from nearby non-data regions
  • May result in more useful representations for downstream tasks

Further Reading

  • Hinton, G. E. (2002). "Training products of experts by minimizing contrastive divergence."
  • Tieleman, T. (2008). "Training restricted Boltzmann machines using approximations to the likelihood gradient."
  • Desjardins, G., et al. (2010). "Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines."