
torchebm.losses.contrastive_divergence

Contrastive Divergence Loss Module.

This module provides implementations of Contrastive Divergence (CD) and its variants for training energy-based models (EBMs). Contrastive Divergence is a computationally efficient approximation to maximum likelihood estimation that avoids running MCMC chains to convergence when sampling from the model distribution.

Key Features

  • Standard Contrastive Divergence (CD-k)
  • Persistent Contrastive Divergence (PCD)
  • Parallel Tempering Contrastive Divergence (PTCD)
  • Support for different MCMC samplers

Module Components

Classes:

Name                              Description
ContrastiveDivergence             Standard CD-k implementation.
PersistentContrastiveDivergence   Implementation with persistent Markov chains.
ParallelTemperingCD               Implementation with parallel chains at different temperatures.


Usage Example

Basic ContrastiveDivergence Usage

from torchebm.losses import ContrastiveDivergence
from torchebm.samplers import LangevinDynamics
from torchebm.energy_functions import MLPEnergyFunction
import torch

# Define the energy function
energy_fn = MLPEnergyFunction(input_dim=2, hidden_dim=64)

# Set up the sampler
sampler = LangevinDynamics(
    energy_function=energy_fn,
    step_size=0.1,
    noise_scale=0.01
)

# Create the CD loss
cd_loss = ContrastiveDivergence(
    energy_function=energy_fn,
    sampler=sampler,
    n_steps=10,
    persistent=False
)

# In the training loop:
data_batch = torch.randn(32, 2)  # Real data samples
loss, negative_samples = cd_loss(data_batch)
loss.backward()
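
The snippet above can be extended into a minimal training loop. This sketch assumes the energy function is a standard torch.nn.Module whose parameters are optimized directly; replace the random data_batch with batches from a real data loader:

# Minimal training loop around the CD loss from the snippet above.
optimizer = torch.optim.Adam(energy_fn.parameters(), lr=1e-3)

for epoch in range(100):
    data_batch = torch.randn(32, 2)  # placeholder; use real data loader batches
    optimizer.zero_grad()
    loss, negative_samples = cd_loss(data_batch)  # CD loss and MCMC negatives
    loss.backward()
    optimizer.step()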

Mathematical Foundations

Contrastive Divergence Principles

Contrastive Divergence approximates the gradient of the log-likelihood:

$$\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\left[\nabla_\theta E_\theta(x')\right]$$

by replacing the expectation under the model distribution with samples obtained after k steps of MCMC starting from the data:

$$\nabla_\theta \log p_\theta(x) \approx -\nabla_\theta E_\theta(x) + \nabla_\theta E_\theta(x^k)$$

where $x^k$ is obtained after running $k$ steps of MCMC starting from $x$.
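
To make the link to a training objective concrete, the following is a minimal sketch in plain PyTorch (not torchebm's internal code) of the surrogate loss whose gradient matches this approximation; the tiny MLP energy and the one-line stand-in for the $k$ MCMC steps are illustrative placeholders:

import torch
import torch.nn as nn

# Illustrative only: a small MLP energy and the CD surrogate loss whose
# gradient matches the approximation above.
energy_fn = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))

def cd_surrogate_loss(x_data: torch.Tensor, x_k: torch.Tensor) -> torch.Tensor:
    # E_theta(x) term: pushes the energy of real data down.
    pos_energy = energy_fn(x_data).mean()
    # E_theta(x^k) term: pushes the energy of the MCMC samples up.
    # x_k is treated as a constant (no gradient through the sampler).
    neg_energy = energy_fn(x_k.detach()).mean()
    # The gradient of (pos_energy - neg_energy) is grad E_theta(x) - grad E_theta(x^k),
    # the negative of the log-likelihood estimate above, so minimizing this
    # loss performs approximate likelihood ascent.
    return pos_energy - neg_energy

x_data = torch.randn(32, 2)                     # real data batch
x_k = x_data + 0.1 * torch.randn_like(x_data)   # stand-in for k MCMC steps
loss = cd_surrogate_loss(x_data, x_k)
loss.backward()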

Why Contrastive Divergence Works

  • Computational Efficiency: Requires only a few MCMC steps rather than running chains to convergence.
  • Stability: Starting chains from data points ensures the negative samples are in high-density regions.
  • Effective Learning: Despite theoretical limitations, works well in practice for many energy-based models.

Variants

Persistent Contrastive Divergence (PCD)

PCD improves upon standard CD by maintaining a persistent set of Markov chains between parameter updates. Instead of restarting the chains from the data, it continues them from the previous iteration:

  1. Initialize a set of persistent chains (often with random noise)
  2. For each training batch:
     a. Update the persistent chains with k steps of MCMC
     b. Use these updated chains for the negative samples
     c. Keep the updated state for the next batch

PCD can explore the energy landscape more thoroughly, especially for complex distributions.
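
The bookkeeping behind this procedure can be sketched in a few lines of plain PyTorch (an illustration, not torchebm's PersistentContrastiveDivergence implementation); the small MLP energy and the unadjusted Langevin step are stand-ins:

import torch
import torch.nn as nn

energy_fn = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))
n_chains, k_steps, step_size, noise_scale = 128, 10, 0.1, 0.01

# 1. Initialize persistent chains with random noise.
persistent_chains = torch.randn(n_chains, 2)

def langevin_step(x: torch.Tensor) -> torch.Tensor:
    # One unadjusted Langevin update: x <- x - step_size * grad E(x) + noise.
    x = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
    return (x - step_size * grad + noise_scale * torch.randn_like(x)).detach()

for batch in range(3):  # stand-in for the training loop
    # 2a. Advance the persistent chains by k MCMC steps.
    for _ in range(k_steps):
        persistent_chains = langevin_step(persistent_chains)
    # 2b. Use the updated chains as this batch's negative samples.
    negative_samples = persistent_chains
    # (The CD parameter update using these negatives would go here.)
    # 2c. The chain state is kept, not reset, for the next batch.

Detaching the chain state after each step keeps the sampler's computation graph out of the parameter update, which is the usual convention for CD-style training.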

Parallel Tempering CD

Parallel Tempering CD uses multiple chains at different temperatures to improve exploration:

  1. Maintain chains at different temperatures $T_1 < T_2 < \dots < T_n$
  2. For each chain, perform MCMC steps using the energy function $E(x)/T_i$
  3. Occasionally swap states between adjacent temperature chains
  4. Use samples from the chain with $T_1 = 1$ as negative samples

This helps overcome energy barriers and explore multimodal distributions.
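
A minimal sketch of this scheme in plain PyTorch, assuming unadjusted Langevin updates as the per-chain MCMC step and Metropolis swaps between neighbouring temperatures (illustrative only, not the ParallelTemperingCD class):

import torch
import torch.nn as nn

energy_fn = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))
temperatures = torch.tensor([1.0, 2.0, 4.0])   # T_1 < T_2 < T_3, with T_1 = 1
step_size, noise_scale, n_chains = 0.1, 0.01, 64

# 1. One state tensor per temperature level.
chains = [torch.randn(n_chains, 2) for _ in temperatures]

def tempered_langevin_step(x: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    # 2. Langevin update targeting exp(-E(x)/T).
    x = x.clone().requires_grad_(True)
    grad = torch.autograd.grad((energy_fn(x) / T).sum(), x)[0]
    return (x - step_size * grad + noise_scale * torch.randn_like(x)).detach()

for step in range(10):
    chains = [tempered_langevin_step(x, T) for x, T in zip(chains, temperatures)]
    # 3. Occasionally propose swaps between adjacent temperature chains.
    if step % 5 == 0:
        with torch.no_grad():
            for i in range(len(chains) - 1):
                e_lo = energy_fn(chains[i]).squeeze(-1)
                e_hi = energy_fn(chains[i + 1]).squeeze(-1)
                # Metropolis acceptance for exchanging states between T_i and T_{i+1}.
                log_alpha = (1.0 / temperatures[i] - 1.0 / temperatures[i + 1]) * (e_lo - e_hi)
                accept = torch.rand_like(log_alpha).log() < log_alpha
                swapped_lo = torch.where(accept.unsqueeze(-1), chains[i + 1], chains[i])
                swapped_hi = torch.where(accept.unsqueeze(-1), chains[i], chains[i + 1])
                chains[i], chains[i + 1] = swapped_lo, swapped_hi

# 4. Negative samples come from the T_1 = 1 chain.
negative_samples = chains[0]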


Practical Considerations

Tuning Parameters

  • n_steps: More MCMC steps improve the quality of the negative samples but increase the computational cost.
  • persistent: Setting this to True enables PCD, which often improves learning for complex distributions (see the configuration sketch after this list).
  • sampler parameters: The quality of CD depends heavily on the parameters of the underlying MCMC sampler.
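
For example, building on the Usage Example above, enabling PCD and increasing the sampling budget might look like the following; the parameter names mirror that earlier snippet, so check the signatures in your installed torchebm version:

# Hypothetical tuning of the earlier setup: PCD with a longer, gentler sampler.
pcd_loss = ContrastiveDivergence(
    energy_function=energy_fn,
    sampler=LangevinDynamics(
        energy_function=energy_fn,
        step_size=0.05,      # smaller steps: more stable, slower mixing
        noise_scale=0.01,
    ),
    n_steps=25,              # more MCMC steps per update, at higher cost
    persistent=True,         # keep chains alive between batches (PCD)
)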

How to Diagnose Issues?

Watch for these signs of problematic training (the monitoring sketch after this list shows one simple way to track them):

  • Exploding or vanishing gradients
  • Increasing loss values over time
  • Negative samples that don't resemble the data distribution
  • Energy function collapse (assigning the same energy to all points)
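
One simple diagnostic is to log the mean energies of the positive and negative batches and the gradient norm after each update. The sketch below is plain PyTorch and assumes the energy function from the Usage Example is a torch.nn.Module that returns per-sample energies:

# Mean energy of real data and of the negative samples returned by cd_loss.
pos_energy = energy_fn(data_batch).mean().item()
neg_energy = energy_fn(negative_samples).mean().item()
# Total gradient norm after loss.backward(); exploding values are a red flag.
grad_norm = sum(
    p.grad.norm().item() for p in energy_fn.parameters() if p.grad is not None
)
# A gap stuck near zero can signal energy collapse; a strongly negative gap
# (negatives far above the data in energy) matches the unbalanced-energy pitfall below.
print(f"E(data)={pos_energy:.3f}  E(samples)={neg_energy:.3f}  "
      f"gap={pos_energy - neg_energy:.3f}  grad_norm={grad_norm:.3f}")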

Common Pitfalls

  • Too Few MCMC Steps: Can lead to biased gradients and poor convergence
  • Improper Initialization: For PCD, poor initial chain states may hinder learning
  • Unbalanced Energy: If negative samples have much higher energy than positive samples, learning may be ineffective

Advanced Insights

Why CD May Outperform MLE

In some cases, CD might actually lead to better models than exact maximum likelihood:

  • Prevents overfitting to noise in the data
  • Focuses the model capacity on distinguishing data from nearby non-data regions
  • May result in more useful representations for downstream tasks

Further Reading

  • Hinton, G. E. (2002). "Training products of experts by minimizing contrastive divergence."
  • Tieleman, T. (2008). "Training restricted Boltzmann machines using approximations to the likelihood gradient."
  • Desjardins, G., et al. (2010). "Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines."