API Reference¶
torchebm.losses.contrastive_divergence¶
Contrastive Divergence Loss Module.
This module provides implementations of Contrastive Divergence (CD) and its variants for training energy-based models (EBMs). Contrastive Divergence is a computationally efficient approximation to maximum likelihood estimation that avoids running MCMC chains to convergence under the model distribution.
Key Features
- Standard Contrastive Divergence (CD-k)
- Persistent Contrastive Divergence (PCD)
- Parallel Tempering Contrastive Divergence (PTCD)
- Support for different MCMC samplers
Module Components¶
Classes:
| Name | Description |
|---|---|
| ContrastiveDivergence | Standard CD-k implementation. |
| PersistentContrastiveDivergence | Implementation with persistent Markov chains. |
| ParallelTemperingCD | Implementation with parallel chains at different temperatures. |
Usage Example¶
Basic ContrastiveDivergence Usage
```python
from torchebm.losses import ContrastiveDivergence
from torchebm.samplers import LangevinDynamics
from torchebm.energy_functions import MLPEnergyFunction
import torch

# Define the energy function
energy_fn = MLPEnergyFunction(input_dim=2, hidden_dim=64)

# Set up the sampler
sampler = LangevinDynamics(
    energy_function=energy_fn,
    step_size=0.1,
    noise_scale=0.01,
)

# Create the CD loss
cd_loss = ContrastiveDivergence(
    energy_function=energy_fn,
    sampler=sampler,
    n_steps=10,
    persistent=False,
)

# In the training loop:
data_batch = torch.randn(32, 2)  # Real data samples
loss, negative_samples = cd_loss(data_batch)
loss.backward()
```
Mathematical Foundations¶
Contrastive Divergence Principles
Contrastive Divergence approximates the gradient of the log-likelihood:

$$
\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\left[\nabla_\theta E_\theta(x')\right]
$$

by replacing the expectation under the model distribution with samples obtained after $k$ steps of MCMC initialized at the data points, where $E_\theta$ is the energy function and $p_\theta(x) \propto \exp(-E_\theta(x))$ is the model distribution.
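In code, this estimator amounts to a difference of mean energies: the surrogate loss whose gradient matches the expression above is the mean energy of the data minus the mean energy of the $k$-step negative samples. A minimal sketch (illustrative only, not the library's internal implementation; `energy_fn`, `data_batch`, and `negative_samples` are placeholder names):

```python
def cd_surrogate_loss(energy_fn, data_batch, negative_samples):
    # Positive phase: mean energy of the real data.
    positive_energy = energy_fn(data_batch).mean()
    # Negative phase: mean energy of the k-step MCMC samples.
    # detach() stops gradients from flowing back through the sampler,
    # so autograd on this scalar reproduces the approximate gradient above.
    negative_energy = energy_fn(negative_samples.detach()).mean()
    # Gradient descent on this pushes data energy down and sample energy up.
    return positive_energy - negative_energy
```

In torchebm, `ContrastiveDivergence` computes this loss and returns it together with the negative samples, as shown in the usage example above.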
Why Contrastive Divergence Works
- Computational Efficiency: Requires only a few MCMC steps rather than running chains to convergence.
- Stability: Starting chains from data points keeps the negative samples near high-density regions of the data distribution.
- Effective Learning: Despite its theoretical limitations, CD works well in practice for many energy-based models.
Variants¶
Persistent Contrastive Divergence (PCD)
PCD improves upon standard CD by maintaining a persistent set of Markov chains between parameter updates. Instead of restarting the chains from the data, it continues them from the previous iteration:
- Initialize a set of persistent chains (often with random noise)
- For each training batch:
    a. Update the persistent chains with k steps of MCMC.
    b. Use these updated chains for the negative samples.
    c. Keep the updated chain state for the next batch.
PCD can explore the energy landscape more thoroughly, especially for complex distributions.
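A minimal sketch of how these steps might look with the `ContrastiveDivergence` class from the usage example, reusing `energy_fn` and `sampler` from that example (the `dataloader` and `optimizer` objects are assumed to exist):

```python
from torchebm.losses import ContrastiveDivergence

# persistent=True keeps the negative chains alive across parameter updates
# instead of re-initializing them from each data batch.
pcd_loss = ContrastiveDivergence(
    energy_function=energy_fn,
    sampler=sampler,
    n_steps=10,
    persistent=True,
)

for data_batch in dataloader:
    loss, negative_samples = pcd_loss(data_batch)  # chains advance k steps
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```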
Parallel Tempering CD
Parallel Tempering CD uses multiple chains at different temperatures to improve exploration:
- Maintain chains at different temperatures
- For each chain, perform MCMC steps using the energy function
- Occasionally swap states between adjacent temperature chains
- Use samples from the chain at the lowest temperature (the target distribution) as negative samples
This helps overcome energy barriers and explore multimodal distributions.
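The swap step above follows the standard Metropolis exchange criterion. A minimal, library-agnostic sketch (the helper below is illustrative and not part of the torchebm API; `energy_fn` is assumed to return one energy value per sample):

```python
import torch

def swap_adjacent(x_i, x_j, temp_i, temp_j, energy_fn):
    # Metropolis exchange between chains at temperatures temp_i and temp_j:
    # accept with probability min(1, exp((1/T_i - 1/T_j) * (E(x_i) - E(x_j)))).
    e_i, e_j = energy_fn(x_i), energy_fn(x_j)  # one energy per sample
    log_alpha = (1.0 / temp_i - 1.0 / temp_j) * (e_i - e_j)
    accept = torch.rand_like(log_alpha).log() < log_alpha
    accept = accept.unsqueeze(-1)  # broadcast over feature dimensions
    return torch.where(accept, x_j, x_i), torch.where(accept, x_i, x_j)
```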
Practical Considerations¶
Tuning Parameters
- n_steps: More steps improve the quality of the negative samples but increase the computational cost.
- persistent: Setting to True enables PCD, which often improves learning for complex distributions.
- sampler parameters: The quality of CD depends heavily on the underlying MCMC sampler parameters.
How to Diagnose Issues?
Watch for these signs of problematic training (a simple monitoring sketch follows this list):
- Exploding or vanishing gradients
- Increasing loss values over time
- Negative samples that don't resemble the data distribution
- Energy function collapse (assigning the same energy to all points)
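One simple diagnostic is to log the mean energy of the positive and negative samples each step; values that collapse onto each other or drift far apart both signal trouble. A minimal sketch, assuming the `energy_fn`, `data_batch`, `loss`, and `negative_samples` names from the usage example above:

```python
import torch

# Track the energy gap between real data and negative samples each step.
with torch.no_grad():
    pos_energy = energy_fn(data_batch).mean().item()
    neg_energy = energy_fn(negative_samples).mean().item()

# Warning signs:
#   - pos_energy and neg_energy nearly identical every step -> possible collapse
#   - neg_energy consistently far above pos_energy -> uninformative negatives
print(f"loss={loss.item():.4f}  E(data)={pos_energy:.4f}  E(neg)={neg_energy:.4f}")
```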
Common Pitfalls
- Too Few MCMC Steps: Can lead to biased gradients and poor convergence
- Improper Initialization: For PCD, poor initial chain states may hinder learning
- Unbalanced Energy: If negative samples have much higher energy than positive samples, learning may be ineffective
Advanced Insights¶
Why CD May Outperform MLE
In some cases, CD might actually lead to better models than exact maximum likelihood:
- Prevents overfitting to noise in the data
- Focuses the model capacity on distinguishing data from nearby non-data regions
- May result in more useful representations for downstream tasks
Further Reading
- Hinton, G. E. (2002). "Training products of experts by minimizing contrastive divergence."
- Tieleman, T. (2008). "Training restricted Boltzmann machines using approximations to the likelihood gradient."
- Desjardins, G., et al. (2010). "Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines."