Yuanzhi Zhu

The Distillation in On-Policy Distillation

2026-03-17T00:00:00+01:00

TL;DR

This is a short post sharing some thoughts on the distillation in on-policy distillation, which came out of a discussion with Yuchen Zhu. The point is fairly obvious, but I want to write it down anyway.

In one sentence, I found that the famous diffusion distillation framework (DMD, Diff-Instruct, etc.) and the recently proposed forward-process based diffusion RL fine-tuning methods are actually doing the same thing.

The Distillation in On-Policy Distillation

The Illustration of the On-Policy Generator and the Objective

As shown in the figure below, the on-policy generator is a model with parameters $\theta$ that takes noise $x_0$ and context $c$ as inputs and generates outputs $x_1$. For a diffusion model, this generation can take multiple ($T$) steps, using a corresponding sampler such as DDIM, DPM-Solver, or even a CM sampler (for a generator trained with the consistency model objective or the DMD objective).

Let’s denote the generator’s output distribution as $p_\theta(x_1\mid c)$, then we can write the objective of the on-policy distillation as minimizing some divergence between the generator’s output distribution and a target distribution $p_{\mathrm{target}}(x_1\mid c, c_\mathrm{ext})$, where $c_\mathrm{ext}$ is the external signals (e.g., reward model, privileged information, expert demonstrations, etc.).

The objective can be written as:

$$ \begin{align*} \mathcal{L}(\theta) &= \mathbb{E}_{x_0, c, c_\mathrm{ext}} \left[ D(p_{\mathrm{target}}(x_1|c, c_\mathrm{ext}) \| p_\theta(x_1|c)) \right] \end{align*} $$

where $D$ is some divergence measure, e.g., reverse KL divergence, JS divergence, etc.

Note that during implementation, we sample $x_1$ from the generator and compute the loss at the sample level. We can think of this objective as training the generator to produce outputs that are close to the target distribution, thus we omit the input of the generator in the following figures for simplicity.

Diffusion (Step) Distillation

For diffusion distillation, the target distribution is usually the teacher model distribution. According to my previous blog, the distillation can be seen as minimizing the KL divergence between the student marginal distribution $q_{t}(x_t)$ and the teacher marginal distribution $p_t(x_t)$ at each time step $t$, which is illustrated in the figure below.

As shown in the above figure, the gradient of the divergence can be estimated in two ways: (a) the score-based approach (e.g., DMD) and (b) the discriminator-based approach (e.g., Diffusion GAN).

Diffusion Reinforcement Learning

For diffusion RL fine-tuning, the target distribution is usually defined with a reference model and a reward model, which can be seen as a Boltzmann distribution $p_{\mathrm{target}}(x_1\mid c, c_\mathrm{ext}) \propto p_{\mathrm{ref}}(x_1\mid c) \exp(r(x_1))$, where $r(x_1)$ is the reward function defined by the reward model.

In this case, the objective can be written as:

$$ \begin{align*} \mathcal{L}(\theta) &= \mathbb{E}_{x_0, c} \left[ D(p_{\mathrm{ref}}(x_1\mid c) \exp(r(x_1)) \| p_\theta(x_1\mid c)) \right] \end{align*} $$

Expanding using reverse KL divergence and splitting the log:

$$ \begin{align*} \mathbb{E}_{x_0, c} \left[ \mathbb{E}_{x_1 \sim p_\theta} \left[ \log \frac{p_\theta(x_1 \mid c)}{p_{\mathrm{ref}}(x_1 \mid c)} - r(x_1) \right] \right] \end{align*} $$

Splitting the expectation and rearranging terms, we have two terms: the reward term $\mathbb{E} _ {x_1 \sim p_\theta}[-r(x_1)]$ and the KL regularization term $\mathbb{E} _ {x_1 \sim p_\theta} \left[ \log \frac{p_\theta(x_1 \mid c)}{p_{\mathrm{ref}}(x_1 \mid c)} \right]$. With importance sampling and the REINFORCE log-derivative trick $\nabla_\theta \mathbb{E} _ {p_\theta}[f] = \mathbb{E} _ {p_\theta}[f \cdot \nabla_\theta \log p_\theta] $, we can rewrite the reward term as $\mathbb{E} _ {x_1 \sim p_{\mathrm{ref}}} \left[ - \frac{p_\theta(x_1 \mid c)}{p_{\mathrm{ref}}(x_1 \mid c)} r(x_1) \right]$.

As a result, the objective can be rewritten as a standard RL objective with KL regularization:

$$ \begin{align*} \boxed{\mathcal{L}(\theta) \propto -\mathbb{E}_{x_1 \sim p_{\mathrm{ref}}} \left[ \frac{p_\theta(x_1 \mid c)}{p_{\mathrm{ref}}(x_1 \mid c)} r(x_1) \right] + \mathbb{E}_{x_1 \sim p_\theta} \left[ \log \frac{p_\theta(x_1 \mid c)}{p_{\mathrm{ref}}(x_1 \mid c)} \right]} \end{align*} $$
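The decomposition above can be sanity-checked on a small discrete example. The sketch below is only a toy: the probabilities and rewards are made-up values, and `kl` is a helper I define for the illustration. It checks that $\mathrm{KL}(p_\theta\|p_{\mathrm{target}})$ with $p_{\mathrm{target}}\propto p_{\mathrm{ref}}\exp(r)$ equals the negative expected reward plus the KL to the reference, up to the constant $\log Z$, and that the importance-sampled reward term matches the on-policy one.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy values over four outcomes (illustration only).
p_ref = [0.4, 0.3, 0.2, 0.1]
reward = [0.0, 1.0, 0.5, 2.0]
p_theta = [0.25, 0.25, 0.25, 0.25]

# Boltzmann target: p_target is proportional to p_ref * exp(r).
unnorm = [p * math.exp(r) for p, r in zip(p_ref, reward)]
Z = sum(unnorm)
p_target = [u / Z for u in unnorm]

# Reverse KL to the target versus "reward term + KL regularization + log Z".
lhs = kl(p_theta, p_target)
on_policy_reward = sum(p * r for p, r in zip(p_theta, reward))
rhs = -on_policy_reward + kl(p_theta, p_ref) + math.log(Z)

# Importance-sampled reward: E_{p_ref}[(p_theta / p_ref) * r].
is_reward = sum(pr * (pt / pr) * r for pr, pt, r in zip(p_ref, p_theta, reward))

print(abs(lhs - rhs), abs(is_reward - on_policy_reward))
```

Since $\log Z$ does not depend on $\theta$, dropping it changes neither the gradient nor the minimizer, which is why the boxed objective is stated only up to proportionality.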

For diffusion models, the density ratio $p_\theta / p_{\mathrm{ref}}$ can be estimated by the diffusion ELBO estimator, which is the same as the weighted denoising score matching (DSM) loss. To be specific, in the ELBO estimator, we calculate the density ratio at a randomly sampled time step $t$ as $r_t(x_t) = \frac{p_{\theta,t}(x_t\mid c)}{p_{\mathrm{ref},t}(x_t\mid c)}$, where $p_{\theta,t}$ is the generator marginal and $p_{\mathrm{ref},t}$ is the reference model marginal at time step $t$. The diffusion RL training is summarized in the figure below.

Summary

In this blog, we show that diffusion distillation and diffusion RL fine-tuning are both instances of on-policy distillation, minimizing a divergence $D(p_{\mathrm{target}} \| p_\theta)$, differing mainly in the choice of target distribution and in the approach used to estimate the divergence.

Acknowledgements

The author thanks Yuchen Zhu for his insightful discussion.

Thoughts on RL and ICL

2026-02-04T00:00:00+01:00

A Short Discussion on RL and ICL.

Recently I had a discussion with my friend Jiwen Yu, and I was trying to explain the benefits of RL, which, in my opinion, simply tilts the output distribution of the model towards high-reward regions: $p^* \propto p_{\mathrm{ref}}\exp(\beta r)$.

“So basically, RL can increase the pass@1, hence the user will get a better answer with higher probability”

“Yeah, but in practice I found that for my specific question, the LLM will give me much better answers with more details I provided in the prompt.”

Self-Distillation Fine-Tuning (SDFT) and Self-Distillation Policy Optimization (SDPO)

Later I read the paper “Self-Distillation Enables Continual Learning” and realized that SDFT is trying to distill the expert demonstrations in context into the model itself.

Concurrent work, “Reinforcement Learning via Self-Distillation”, also proposed a similar idea of using self-distillation to improve the model’s performance. To be specific, this time it does not rely on the expert demonstrations but the expert feedback on initial model outputs, then the expert feedback is fed back to the model in the form of context to generate better outputs for self-distillation. This work also demonstrated an elegant approach to utilize text rewards for RL fine-tuning.

Everything is about Distillation

I originally meant to make this the title of the whole blog, but perhaps it is too clickbaity. Nevertheless, I would like to emphasize the core idea: for every method that is scalable at test time (with the help of external reward models or expert signals $c_\mathrm{ext}$), we can always find a way to distill the knowledge back into the model itself, so that at test time the model performs better without relying on external signals.

From another perspective, to quote the first principle stated by Rishabh Agarwal on Twitter: “Teacher doesn’t have to be a bigger neural net, just something better than the student”.

If we formulate the generation process as $p_\theta(x\mid c)$, where $c$ is the context, then we can always find a better teacher distribution of the same model $p(x\mid c, c_\mathrm{ext})$ that leverages the external signals $c_\mathrm{ext}$ to provide better outputs. Then we can distill the knowledge from the teacher back to the student model by minimizing some divergence between the two distributions, e.g. KL divergence.

Some examples include:

RL: distill the reward model and the policy into the model itself.
SDFT: distill the expert demonstrations in context into the model itself.
SDPO: distill the expert feedback as text reward in context into the model itself.

To end, everything is about distillation!

Acknowledgements

The author thanks Jiwen Yu and Yao Teng for their valuable feedback (for me to distill my thoughts).

A KL-Regularized Reward-Tilting View of DiffusionNFT

2026-01-24T00:00:00+01:00

Offline DiffusionNFT[1] as KL-Regularized Reward Tilting

In my personal view, DiffusionNFT is a significant work because it is the first native algorithm for diffusion RL without relying on differentiable reward models.

The paper derives an optimal velocity field update and interprets it as a positive/negative split, but the corresponding density-level optimum can be written explicitly as a reward-tilted distribution. This section provides a detailed derivation of this closed-form solution.

Setup (From DiffusionNFT)

Introduce a binary optimality variable $o\in\{0,1\}$ and a prompt/context $c$. DiffusionNFT defines the reward as an optimality probability

$$ \begin{align*} r(x_0,c)\;:=\;p(o=1\mid x_0,c)\in[0,1]. \tag{1} \end{align*} $$

Given a reference (old) model $\pi_{\mathrm{old}}(x_0\mid c)$, define the positive (optimality-conditioned) distribution

$$ \begin{align*} \pi_{+}(x_0\mid c) :=\pi_{\mathrm{old}}(x_0\mid o=1,c) =\frac{p(o=1\mid x_0,c)}{p_{\mathrm{old}}(o=1\mid c)}\,\pi_{\mathrm{old}}(x_0\mid c)= \frac{r(x_0,c)}{\mathbb{E}_{x_0\sim \pi_{\mathrm{old}}(\cdot\mid c)}[r(x_0,c)]}\; \pi_{\mathrm{old}}(x_0\mid c), \label{pi_plus}\tag{2} \end{align*} $$

and the corresponding forward-noised marginals at diffusion time $t$ under a fixed kernel $\pi_{t\mid 0}(x_t\mid x_0)$:

$$ \begin{align*} p^{\mathrm{old}}_t(x_t\mid c)=\int \pi_{t\mid 0}(x_t\mid x_0)\,\pi_{\mathrm{old}}(x_0\mid c)\,dx_0, \qquad p^{+}_t(x_t\mid c)=\int \pi_{t\mid 0}(x_t\mid x_0)\,\pi_{+}(x_0\mid c)\,dx_0. \tag{3} \end{align*} $$

DiffusionNFT’s training objective¹ yields an optimal velocity field of the form

$$ \begin{align*} v^*(x_t,c,t)=v^{\mathrm{old}}(x_t,c,t)+\frac{2}{\beta}\,\Delta(x_t,c,t), \label{optimal_v} \tag{4} \end{align*} $$

with guidance direction $\Delta(x_t,c,t)=\alpha(x_t,c)\bigl(v^{+}(x_t,c,t)-v^{\mathrm{old}}(x_t,c,t)\bigr)$.

And the mixture coefficient is defined as

$$ \begin{align*} \alpha(x_t,c):=\frac{p^{+}_t(x_t\mid c)}{p^{\mathrm{old}}_t(x_t\mid c)}\;\mathbb{E}_{x_0\sim \pi_{\mathrm{old}}(\cdot\mid c)}[r(x_0,c)]. \tag{5} \end{align*} $$

Derivation of $\alpha(x_t,c)=p(o=1\mid x_t,c)$

Apply the forward diffusion to the positive distribution:

$$ \begin{align*} \frac{p^{+}_t(x_t\mid c)}{p^{\mathrm{old}}_t(x_t\mid c)} = \frac{p_{\mathrm{old}}(o=1\mid x_t,c)}{p_{\mathrm{old}}(o=1\mid c)}. \tag{6} \end{align*} $$

Substituting this ratio into the definition of $\alpha$ in eq. (5) and using the identity

$$ \begin{align*} p_{\mathrm{old}}(o=1\mid c)=\mathbb{E}_{x_0\sim \pi_{\mathrm{old}}(\cdot\mid c)}[r(x_0,c)], \tag{7} \end{align*} $$

we obtain

$$ \begin{align*} \boxed{ \alpha(x_t,c)=p_{\mathrm{old}}(o=1\mid x_t,c). } \tag{8} \end{align*} $$

Note that $\alpha(x_t,c)=\mathbb{E}_{x_0\sim \pi _{\mathrm{old}}(\cdot \mid x_t,c)}[r(x_0,c)]$. This shows that the mixture coefficient equals the posterior probability of optimality at the noisy state $x_t$ under the old model.
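The boxed identity can also be checked numerically on a tiny discrete state space. In the sketch below, the three-state prior, the rewards, and the forward kernel are hypothetical values chosen only for illustration. It builds $\pi_+$ from eq. (2), pushes both priors through the same kernel as in eq. (3), forms $\alpha$ from eq. (5), and compares it with the posterior mean reward $\mathbb{E}[r(x_0,c)\mid x_t]$.

```python
# Toy check of alpha(x_t, c) = E[r(x_0, c) | x_t] on a discrete state space.
# All numbers below are hypothetical and chosen only for illustration.
pi_old = [0.5, 0.3, 0.2]   # pi_old(x_0 | c)
reward = [0.1, 0.6, 0.9]   # r(x_0, c) = p(o=1 | x_0, c)
kernel = [                 # kernel[x0][xt] = pi_{t|0}(x_t | x_0), rows sum to 1
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
n = len(pi_old)

# Eq. (7): p_old(o=1 | c) is the mean reward under the old model.
mean_reward = sum(p * r for p, r in zip(pi_old, reward))
# Eq. (2): the positive (optimality-conditioned) distribution.
pi_plus = [r * p / mean_reward for p, r in zip(pi_old, reward)]

def marginal(prior):
    """Forward-noised marginal of eq. (3) for a given prior over x_0."""
    return [sum(kernel[x0][xt] * prior[x0] for x0 in range(n)) for xt in range(n)]

p_old_t = marginal(pi_old)
p_plus_t = marginal(pi_plus)

# Eq. (5): the mixture coefficient.
alpha = [p_plus_t[xt] / p_old_t[xt] * mean_reward for xt in range(n)]

# Posterior mean reward E[r(x_0) | x_t] under the old model (Bayes' rule).
posterior_reward = [
    sum(kernel[x0][xt] * pi_old[x0] * reward[x0] for x0 in range(n)) / p_old_t[xt]
    for xt in range(n)
]

max_gap = max(abs(a - b) for a, b in zip(alpha, posterior_reward))
print(max_gap)
```

The two quantities agree up to floating-point error for any kernel with positive marginals, because the factor $\mathbb{E}[r]$ in eq. (5) cancels the normalizer of $\pi_+$.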

Remark

This equality is a direct consequence of the definition of $\pi_{+}$ as an optimality-conditioned distribution and Bayes’ rule under the fixed forward noising kernel. It may be omitted or only implicit in the original DiffusionNFT paper. With this finding, Lemma A.2 (Posterior Split) in the paper becomes obvious.

DiffusionNFT optimal distribution at each step

In order to compute the optimal distribution, we need to simplify the residual term $\Delta(x_t,c,t)=\alpha(x_t,c)\bigl(v^{+}(x_t,c,t)-v^{\mathrm{old}}(x_t,c,t)\bigr)$.

Using the Bayes relation for the positive marginal

$$ \begin{align*} p^{+}_t(x_t\mid c)=p^{\mathrm{old}}_t(x_t\mid c)\,\frac{p(o=1\mid x_t,c)}{p(o=1\mid c)}, \tag{9} \label{bayes_positive} \end{align*} $$

and i) utilizing the relation between velocity fields and score functions under fixed Gaussian noising², and ii) applying log and gradient, one has

$$ \begin{align*} v^{+}(x_t,c,t)-v^{\mathrm{old}}(x_t,c,t) &= \kappa(t)\Big(\nabla_{x_t}\log p^{+}_t(x_t\mid c)-\nabla_{x_t}\log p^{\mathrm{old}}_t(x_t\mid c)\Big) && \color{gray}{\text{// v to score}} \\ &= \kappa(t)\,\nabla_{x_t}\log\frac{p^{+}_t(x_t\mid c)}{p^{\mathrm{old}}_t(x_t\mid c)} \\ &= \kappa(t)\,\nabla_{x_t}\log\frac{p(o=1\mid x_t,c)}{p(o=1\mid c)} && \color{gray}{\text{// substitute eq(\ref{bayes_positive})}} \\ &= \kappa(t)\,\nabla_{x_t}\log p(o=1\mid x_t,c) && \color{gray}{\text{// } \nabla_{x_t}\log p(o=1\mid c)=0}. \tag{10} \end{align*} $$

Thus we can rewrite the guidance direction as:

$$ \begin{align*} \Delta(x_t,c,t) = \kappa(t)\,\alpha(x_t,c)\,\nabla_{x_t}\log p(o=1\mid x_t,c). \tag{11} \end{align*} $$

Closed-form optimal distribution (density-level)

Since $\alpha(x_t,c)=p(o=1\mid x_t,c)$ and $\alpha(x_t,c)\nabla\log\alpha(x_t,c)=\nabla\alpha(x_t,c)$, we can rewrite eq(\ref{optimal_v}) in the form of score and have:

$$ \require{cancel} \begin{align*} &v^*(x_t,c,t)-v^{\mathrm{old}}(x_t,c,t) =\frac{2}{\beta}\,\Delta(x_t,c,t)=\kappa(t)\frac{2}{\beta}\,\nabla_{x_t}\alpha(x_t,c)\\ &\implies \cancel{\kappa(t)}\Big(\nabla_{x_t}\log p^{*}_t(x_t\mid c)-\nabla_{x_t}\log p^{\mathrm{old}}_t(x_t\mid c)\Big) =\cancel{\kappa(t)}\frac{2}{\beta}\,\nabla_{x_t}\alpha(x_t,c) && \color{gray}{\text{// v to score}} \\ &\implies \nabla_{x_t}\log\frac{p^{*}_t(x_t\mid c)}{p^{\mathrm{old}}_t(x_t\mid c)} =\frac{2}{\beta}\,\nabla_{x_t}\alpha(x_t,c). \tag{12} \end{align*} $$

This implies the explicit density:

$$ \begin{align*} p^{*}_t(x_t\mid c) = \frac{1}{Z_t(c)}\;p^{\mathrm{old}}_t(x_t\mid c)\; \exp\!\Big(\frac{2}{\beta}\,p(o=1\mid x_t,c)\Big), \qquad Z_t(c)=\int p^{\mathrm{old}}_t(x_t\mid c)\exp\!\Big(\frac{2}{\beta}p(o=1\mid x_t,c)\Big)\,dx_t. \tag{13} \end{align*} $$

At $t=0$ this reduces to $p^{*}(x_0\mid c)\propto p^{\mathrm{old}}(x_0\mid c)\exp(\frac{2}{\beta}r(x_0,c))$.

Thus, DiffusionNFT with fixed $p^{\mathrm{old}}=p^{\mathrm{ref}}$ can be viewed as learning a KL-regularized exponential tilt of the reference model, where the “reward” is the optimality posterior at the noisy state.
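To make the "KL-regularized" reading concrete, here is a minimal sketch on a discrete toy problem. It assumes the standard objective $\mathbb{E}_p[r]-\frac{\beta}{2}\mathrm{KL}(p\|p^{\mathrm{old}})$, whose maximizer is the tilt above. This objective is my reading of the closed form, not the DiffusionNFT training loss itself, and all numbers are made up. The code checks that the tilted distribution beats random candidate distributions and that its objective value equals $\frac{\beta}{2}\log Z$.

```python
import math
import random

def objective(p, p_old, reward, beta):
    """E_p[r] - (beta / 2) * KL(p || p_old) for discrete distributions."""
    expected_reward = sum(pi * ri for pi, ri in zip(p, reward))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_old) if pi > 0)
    return expected_reward - 0.5 * beta * kl

def tilt(p_old, reward, beta):
    """Return p* proportional to p_old * exp(2 r / beta) and its normalizer Z."""
    weights = [q * math.exp(2.0 * r / beta) for q, r in zip(p_old, reward)]
    z = sum(weights)
    return [w / z for w in weights], z

# Hypothetical toy values (illustration only).
p_old = [0.4, 0.3, 0.2, 0.1]
reward = [0.1, 0.8, 0.5, 0.9]
beta = 0.5

p_star, z = tilt(p_old, reward, beta)
best = objective(p_star, p_old, reward, beta)

rng = random.Random(0)

def random_distribution(k):
    """A random strictly positive distribution over k outcomes."""
    raw = [rng.random() + 1e-6 for _ in range(k)]
    total = sum(raw)
    return [x / total for x in raw]

# The gap equals (beta / 2) * KL(q || p*), so it should never be negative.
gaps = [
    best - objective(random_distribution(4), p_old, reward, beta)
    for _ in range(1000)
]
print(min(gaps), best, 0.5 * beta * math.log(z))
```

A smaller $\beta$ weakens the KL penalty and sharpens the tilt, which matches the role of $\frac{2}{\beta}$ as the tilt coefficient.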

Online DiffusionNFT Leads to Reward Hacking via Accumulated Tilting

In practice, DiffusionNFT can be run online: after fitting $v_\theta$ for one epoch, the old model is updated (e.g., by copying weights (hard) or by EMA (soft)) and the next epoch is trained against this updated old model. This induces a recursion over the initial reference distributions.

Idealized recursion with exact per-epoch optima

Let $p^{(k)}_t(x_t\mid c)$ denote the old marginal at epoch $k$ (the distribution induced by the current old velocity field). Treat the reward/optimality posterior $\alpha_t(x_t,c)=p(o=1\mid x_t,c)$ as fixed across epochs.

From the closed-form optimum, the population-optimal update at epoch $k$ is

$$ \begin{align*} p^{(k+1)}_t(x_t\mid c) = \frac{1}{Z^{(k)}_t(c)}\;p^{(k)}_t(x_t\mid c)\; \exp\!\Big(\lambda\,\alpha_t(x_t,c)\Big), \qquad \lambda=\frac{2}{\beta}. \tag{14} \end{align*} $$

Unrolling the recursion yields

$$ \begin{align*} p^{(K)}_t(x_t\mid c) \propto p^{(0)}_t(x_t\mid c)\; \exp\!\Big(\lambda K\,\alpha_t(x_t,c)\Big). \tag{15} \end{align*} $$

As $K\to\infty$, the previous expression concentrates mass on the set of maximizers of the reward.
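The accumulation is easy to see on a discrete toy example. The sketch below uses hypothetical values for $p^{(0)}$, $\alpha$, and $\lambda$, and treats $\alpha$ as fixed across epochs as assumed above. It applies the per-epoch tilt $K$ times, compares the result with the unrolled closed form, and shows the mass collapsing onto the highest-reward state.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def tilt_once(p, alpha, lam):
    """One idealized epoch: p_next is proportional to p * exp(lam * alpha)."""
    return normalize([pi * math.exp(lam * a) for pi, a in zip(p, alpha)])

# Hypothetical toy values (illustration only).
p0 = [0.4, 0.3, 0.1, 0.2]
alpha = [0.2, 0.5, 0.9, 0.6]   # fixed optimality posterior per state
lam = 1.0                      # lambda = 2 / beta
K = 200

p = list(p0)
for _ in range(K):
    p = tilt_once(p, alpha, lam)

# Unrolled form: p_K is proportional to p_0 * exp(lam * K * alpha).
closed_form = normalize([pi * math.exp(lam * K * a) for pi, a in zip(p0, alpha)])

max_gap = max(abs(a - b) for a, b in zip(p, closed_form))
best_state = max(range(len(alpha)), key=lambda i: alpha[i])
print(max_gap, p[best_state])
```

Even though the best state starts with the smallest prior mass in this toy, it ends up with essentially all of it, which is the collapse the unrolled expression predicts.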

Remarks on EMA references

If the reference is updated via EMA in parameter space, the induced distribution recursion is not exactly the simple multiplicative update above. Nevertheless, EMA typically acts as a trust-region mechanism that interpolates between keeping $p^{(k)}_t$ fixed and fully replacing it by the latest student, effectively reducing the rate at which the tilt coefficient grows with $k$. The qualitative conclusion remains: in the absence of an anchoring term to the initial model, repeated online improvement tends to accumulate reward tilt and can become increasingly peaked around high-reward regions and thus prone to reward hacking.

Online DiffusionNFT with an Additional KL to the Initial Reference

To prevent unbounded drift from the original model, one can augment the per-epoch objective with an additional regularizer that penalizes deviation from the initial reference distribution $p^{(0)}_t(\cdot\mid c)=p^{\mathrm{ref}}_t(\cdot\mid c)$.

At the distribution level, consider the KL-regularized problem at epoch $k$ [2,3]:

$$ \begin{align*} p^{(k+1)}_t = \arg\max_{p_t}\; \mathbb{E}_{x_t\sim p_t}\!\big[\alpha_t(x_t,c)\big] -\frac{1}{\eta_1}\mathrm{KL}\!\big(p_t\|p^{(k)}_t\big) -\frac{1}{\eta_0}\mathrm{KL}\!\big(p_t\|p^{(0)}_t\big), \qquad \eta_0,\eta_1>0. \tag{16} \end{align*} $$

Here $\eta_1$ controls the trust region to the current reference and $\eta_0$ controls anchoring to the initial reference.

The unique optimizer has the closed form

$$ \begin{align*} p^{(k+1)}_t(x_t\mid c) = \frac{1}{\widetilde Z^{(k)}_t(c)}\; \Big(p^{(k)}_t(x_t\mid c)\Big)^{w}\, \Big(p^{(0)}_t(x_t\mid c)\Big)^{1-w}\, \exp\!\Big(\lambda_{\mathrm{eff}}\,\alpha_t(x_t,c)\Big), \tag{17} \end{align*} $$

with weights and effective tilt coefficient

$$ \begin{align*} w=\frac{\eta_0}{\eta_0+\eta_1}\in(0,1), \qquad \lambda_{\mathrm{eff}}=\frac{\eta_0\eta_1}{\eta_0+\eta_1}. \tag{18} \end{align*} $$

Thus, the per-epoch solution is an exponential tilt of a geometric mixture of the current and initial references.

It is straightforward to see that the iterates converge to a limiting distribution:

$$ \begin{align*} \boxed{ p^{(\infty)}_t(x_t\mid c) = \frac{1}{Z^{(\infty)}_t(c)}\;p^{(0)}_t(x_t\mid c)\exp\!\big(\eta_0\,\alpha_t(x_t,c)\big). } \tag{19} \end{align*} $$

Notably, the limiting distribution depends only on the anchoring strength $\eta_0$; the trust-region parameter $\eta_1$ affects only the dynamics (convergence rate and stability) through $w$.
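The fixed point can be checked by iterating the per-epoch closed form on a discrete toy problem. In the sketch below, the distributions and the values of $\eta_0$, $\eta_1$ are hypothetical. Two runs with different trust-region parameters $\eta_1$ are compared against the claimed limit $p^{(0)}\exp(\eta_0\alpha)/Z$.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def anchored_step(p_k, p_0, alpha, eta0, eta1):
    """Per-epoch optimum: geometric mixture of p_k and p_0, then an exponential tilt."""
    w = eta0 / (eta0 + eta1)
    lam_eff = eta0 * eta1 / (eta0 + eta1)
    return normalize([
        (pk ** w) * (pz ** (1.0 - w)) * math.exp(lam_eff * a)
        for pk, pz, a in zip(p_k, p_0, alpha)
    ])

def run(p_0, alpha, eta0, eta1, n_epochs):
    p = list(p_0)
    for _ in range(n_epochs):
        p = anchored_step(p, p_0, alpha, eta0, eta1)
    return p

# Hypothetical toy values (illustration only).
p0 = [0.4, 0.3, 0.1, 0.2]
alpha = [0.2, 0.5, 0.9, 0.6]
eta0 = 2.0

# Claimed limit: p_inf is proportional to p_0 * exp(eta0 * alpha).
limit = normalize([pz * math.exp(eta0 * a) for pz, a in zip(p0, alpha)])

p_a = run(p0, alpha, eta0, eta1=0.5, n_epochs=300)   # w = 0.8, slower contraction
p_b = run(p0, alpha, eta0, eta1=5.0, n_epochs=300)   # w = 2/7, faster contraction

gap_a = max(abs(x - y) for x, y in zip(p_a, limit))
gap_b = max(abs(x - y) for x, y in zip(p_b, limit))
print(gap_a, gap_b)
```

In log space the deviation from the limit shrinks by a factor $w$ per epoch, so a larger $\eta_1$ (a looser trust region) converges faster while leaving the limit unchanged.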

In practice, the finetuned model in DiffusionNFT is initialized from a pre-trained diffusion model without CFG. Moreover, the initial reference model is also the pre-trained diffusion model without CFG. Thus, with a large initial KL strength, the online DiffusionNFT procedure effectively regularizes the finetuned model toward the pre-trained diffusion model without CFG, and the learned model can only generate blurry samples; with the small initial KL strength used in the experiments in the paper, the finetuned model generates samples with high reward but low diversity and pure-color backgrounds, which suggests severe reward hacking. Based on the analysis above, we suggest adding a medium-strength initial KL regularization and using the pre-trained model with CFG as the initial reference to mitigate reward hacking in online DiffusionNFT.

Acknowledgements

The author thanks Huayu Chen for his insightful work, helpful discussions, and valuable feedback. The author also thanks Ruiqing Wang for proofreading.

References

[1] DiffusionNFT: Online Diffusion Reinforcement with Forward Process.
Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. arXiv preprint arXiv:2509.16117 (2025).

[2] Test-time Alignment of Diffusion Models without Reward Over-optimization.
Sunwoo Kim, Minkyu Kim, and Dongmin Park. The Thirteenth International Conference on Learning Representations (ICLR 2025).

[3] Feedback Efficient Online Fine-Tuning of Diffusion Models.
Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M. Tseng, Sergey Levine, and Tommaso Biancalani. Proceedings of the 41st International Conference on Machine Learning (ICML 2024).

New Year Resolution 2026

2026-01-07T00:00:00+01:00

It feels like yesterday when I wrote my New Year Resolution for 2025. While many (mostly good) things have happened in 2025, at this moment, I have the resolutions firmly set for 2026 and wish to avoid the mistakes made in the previous year.

Time is the fairest resource people have. Our sense of time is tied to the information we take in, so we need to experience more. Because the memory capacity of the human brain is finite, it matters more to collect precious memories, ideally with the people we treasure.

On the Connection between DMD and GAN for Diffusion Distillation

2025-11-06T00:00:00+01:00

In this blog, I want to show a simple but intuitive connection between Variational Score Distillation (VSD/DMD) [1,2,3] and Diffusion-GAN [4] for diffusion distillation. I noticed this connection while trying to extend our recent workshop paper Di-Bregman [5].

The final conclusion is not tied to Di-Bregman; I just want to use Di-Bregman as an example to illustrate the connection.

Preliminaries

In Di-Bregman, we derived a new distillation loss whose gradient extends the DMD loss gradient with an extra coefficient $h^{\prime\prime}(r_t(x_t))\,r_t(x_t)$:

$$ \begin{align*} \nabla_\theta \mathbb{D}_h(r_t\|1) = -\mathbb{E}_{\epsilon,t}\Big[\,w(t)h^{\prime\prime}(r_t(x_t))\,r_t(x_t)\,\big(\nabla_{x_t}\log p_t({x_t})-\nabla_{x_t}\log q_{\theta,t}({x_t})\big)\, \nabla_\theta G_\theta(\epsilon)\Big], \tag{1} \end{align*} $$

where

$r_t(x_t) = \frac{q_{\theta,t}(x_t)}{p_t(x_t)}$ is the density ratio between the student marginal $q_{\theta,t}$ and the teacher marginal $p_t$ at time $t$;
$G_\theta(\epsilon)$ is the student generative model that maps noise $\epsilon$ to clean sample;
$h(r)$ is a convex function defining the Bregman divergence $\mathbb{D}_h$;
$w(t)$ is a time-dependent weight function;
$\epsilon \sim \mathcal{N}(0,I)$, $t \sim \mathcal{U}(0,1)$, and $x_t = \alpha_t G_\theta(\epsilon) + \sigma_t z_t$ with $z_t \sim \mathcal{N}(0,I)$.

The DMD loss gradient derived from reverse KL divergence can be seen as a special case of the Di-Bregman when choosing $h(r)=r\log r$.

From Scores to a Discriminator

One difference between Di-Bregman and DMD is that we need to estimate the density ratio $r_t(x_t)$ in the extra coefficient $h^{\prime\prime}(r_t(x_t))\,r_t(x_t)$, which is usually achieved by training a discriminator $D(x_t,t)$ to distinguish samples from $p_t$ and $q_{\theta,t}$.

With the usual logistic convention (where $D_t(x_t)$ outputs the probability that $x_t$ comes from $p_t$), we have the following optimal discriminator $D_t^*(x_t)=\frac{p_t(x_t)}{p_t(x_t)+q_{\theta,t}(x_t)}$.

A natural question is: can we rewrite the Di-Bregman loss gradient entirely in terms of the discriminator $D_t$, eliminating explicit score functions?

Step 1: Substitute the density-ratio gradient identity

We have the identity

$$ \begin{align*} \nabla_{x_t} r_t({x_t}) = r_t({x_t})\,\big(\nabla_{x_t}\log q_{\theta,t}({x_t}) - \nabla_{x_t}\log p_t({x_t})\big) \tag{2} \end{align*} $$

Substituting into the Di-Bregman gradient gives

$$ \begin{align*} \nabla_\theta \mathbb{D}_h(r_t\|1) &= -\mathbb{E}_{\epsilon,t}\Big[w(t)\,h^{\prime\prime}(r_t)\,r_t\Big(-\frac{\nabla_{x_t} r_t}{r_t}\Big)\,\nabla_\theta G_\theta(\epsilon)\Big] \\ &= \mathbb{E}_{\epsilon,t}\Big[w(t)\,h^{\prime\prime}(r_t)\,\nabla_{x_t} r_t(x_t)\,\nabla_\theta G_\theta(\epsilon)\Big]. \tag{3} \end{align*} $$

Step 2: Express $\nabla_{x_t} r_t$ and $r_{t}$ in terms of $D_t$

Since $r_t = \frac{q_{\theta,t}}{p_t} = \frac{1-D_t}{D_t}$, we have $\frac{dr_t}{dD_t} = -\frac{1}{D_t^2}$, hence

$$ \begin{align*} \nabla_{x_t} r_t(x_t) = -\frac{1}{D_t(x_t)^2}\,\nabla_{x_t} D_t(x_t). \tag{4} \end{align*} $$

Substitute this back:

$$ \begin{align*} \nabla_\theta \mathbb{D}_h(r_t\|1) = -\mathbb{E}_{\epsilon,t}\Big[w(t)\,\frac{h^{\prime\prime}(\frac{1-D_t}{D_t})}{D_t^2}\,\nabla_{x_t} D_t(x_t)\,\nabla_\theta G_\theta(\epsilon)\Big]. \tag{5} \end{align*} $$
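The two identities used so far, eq. (2) and eq. (4), can be checked numerically when both marginals are known in closed form. The sketch below assumes a one-dimensional toy setting with a teacher marginal $p=\mathcal{N}(0,1)$ and a hypothetical student marginal $q=\mathcal{N}(0.5,1.2^2)$. It uses the optimal discriminator $D^*=p/(p+q)$ and estimates gradients with central finite differences.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical student marginal q; the teacher marginal p is N(0, 1).
MU_Q, STD_Q = 0.5, 1.2

def p(x):
    return gaussian_pdf(x, 0.0, 1.0)

def q(x):
    return gaussian_pdf(x, MU_Q, STD_Q)

def disc(x):
    """Optimal discriminator D*(x) = p / (p + q)."""
    return p(x) / (p(x) + q(x))

def ratio(x):
    """Density ratio r(x) = q / p."""
    return q(x) / p(x)

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.7
D = disc(x)
r = ratio(x)
grad_r = central_diff(ratio, x)

# Eq. (2): grad r = r * (score_q - score_p), using the analytic Gaussian scores.
score_p = -x
score_q = -(x - MU_Q) / STD_Q ** 2
eq2_rhs = r * (score_q - score_p)

# Eq. (4): grad r = -(1 / D^2) * grad D.
eq4_rhs = -central_diff(disc, x) / D ** 2

print(r - (1.0 - D) / D, grad_r - eq2_rhs, grad_r - eq4_rhs)
```

In practice the discriminator is learned and only approximates $D^*$, so these identities hold only as well as the discriminator is trained.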

Step 3: Convert $\nabla_{x_t} D_t$ to $\nabla_\theta D_t$

Using the chain rule, we have:

$$ \begin{align*} \boxed{ \nabla_\theta \mathbb{D}_h(r_t\|1) = -\mathbb{E}_{\epsilon,t}\Big[ w'(t)\,\frac{h^{\prime\prime}(\frac{1-D_t}{D_t})}{D_t^2}\, \nabla_\theta D_t(x_t,t) \Big] } \tag{6} \end{align*} $$

The conversion from $\nabla_{x_t}D_t$ to $\nabla_\theta D_t$ introduces the factor $\alpha_t$, which is absorbed into $w'(t)$.

The boxed equation shows that the Di-Bregman loss gradient can be computed by backpropagating through the discriminator $D_t$ only, without explicitly estimating the score functions. This suggests that Di-Bregman / VSD and Diffusion-GAN are closely connected.

Verify with Special Case: reverse KL Divergence (DMD)

When $h(r)=r\log r$, we have $h^{\prime\prime}(r)=\frac{1}{r}$, hence the coefficient in the boxed equation becomes $\frac{h^{\prime\prime}(\frac{1-D_t}{D_t})}{D_t^2} = \frac{1}{D_t(1-D_t)}$, and the loss gradient becomes

$$ \begin{align*} \nabla_\theta \mathbb{D}_{\mathrm{KL}}(q_{\theta,t}\|p_t) = -\mathbb{E}_{\epsilon,t}\Big[ w'(t)\,\frac{1}{D_t(1-D_t)}\,\nabla_\theta D_t(x_t,t) \Big]. \tag{7} \end{align*} $$

This matches the GAN generator loss for the KL divergence $\mathrm{KL}(q_\theta\|p)$:

$$ \begin{align*} \mathcal{L}_G = -\mathbb{E}_{\epsilon}\Big[\log \frac{D(G_\theta(\epsilon))}{1-D(G_\theta(\epsilon))}\Big]. \tag{8} \end{align*} $$
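The match between eq. (7) and eq. (8) comes down to a one-line derivative: differentiating $-\log\frac{D}{1-D}$ with respect to $D$ gives $-\frac{1}{D(1-D)}$, which is the coefficient in eq. (7). A small numerical sketch, with arbitrary test values of $D$ and helper names of my own:

```python
import math

def h_second(r):
    """Second derivative of h(r) = r log r, which is 1 / r."""
    return 1.0 / r

def generator_loss(d):
    """Per-sample GAN generator loss of eq. (8): -log(D / (1 - D))."""
    return -math.log(d / (1.0 - d))

def central_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2.0 * h)

coeff_gaps = []
grad_gaps = []
for d in (0.1, 0.35, 0.5, 0.8):   # arbitrary discriminator outputs in (0, 1)
    # Coefficient of the boxed equation, specialized to h(r) = r log r.
    coeff = h_second((1.0 - d) / d) / d ** 2
    coeff_gaps.append(abs(coeff - 1.0 / (d * (1.0 - d))))
    # d/dD of the generator loss should equal -coeff, as in eq. (7).
    grad_gaps.append(abs(central_diff(generator_loss, d) + coeff))

print(max(coeff_gaps), max(grad_gaps))
```

By the chain rule, backpropagating the loss in eq. (8) through $D$ therefore yields the integrand of eq. (7) up to the time weighting $w'(t)$.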

This special case was also noticed in Diff-instruct [3] Corollary 3.5 (From KL perspective to derive both losses).

Given that both DMD and Di-Bregman are equivalent to diffusion-GAN training, and with the belief that scalable pre-training requires algorithms that directly optimize the data likelihood or an associated ELBO, I am skeptical that these approaches can serve as general-purpose pre-training methods. For example, it seems unlikely that we could pre-train a next-frame prediction video generative model using methods such as self-forcing.

References

[1] Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation.
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Advances in neural information processing systems 36 (2023).

[2] One-step diffusion with distribution matching distillation.
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6613-6623. 2024.

[3] Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models.
Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Advances in Neural Information Processing Systems 36 (2023).

[4] Diffusion-gan: Training gans with diffusion.
Zhendong Wang, Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. arXiv preprint arXiv:2206.02262 (2022).

[5] One-step Diffusion Models with Bregman Density Ratio Matching.
Yuanzhi Zhu, Eleftherios Tsonis, Lucas Degeorge, Vicky Kalogeiton. arXiv preprint, arXiv:2510.16983, 2025.

A Second-order MeanFlow

2025-10-25T00:00:00+02:00

Second-order MeanFlow: Combining Forward and Backward Distillation

In this blog, we show how the MeanFlow (backward distillation) identity and the tri-consistency constraint can be combined to produce a second-order identity for the student velocity in diffusion/flow distillation. Similar to MeanFlow, this second-order MeanFlow has a remarkably simple and elegant target: the average of the two endpoint velocities and a second-order correction term.

1. Setup: Notation and Goals

$x_t$ denotes the state at time $t$ on the diffusion/flow trajectory.
$v_\phi(x_t,t)$ is the pre-trained teacher velocity field (teacher’s instantaneous velocity at $(x_t,t)$).
$v_\theta(x_t,t,s)$ is the student average velocity intended to take a sample at time $t$ to the time $s$ in a single step (a short-cut/one-step generator). We treat $s > t$.
We will often use small increments $ds$ with $s_2 = s_1 + ds$.

Two consistency principles underlie the derivation:

MeanFlow (backward) identity (backward distillation): for $s > t$,

$$ v(x_t,t,s) = v_\phi(x_t,t) + (s-t)\frac{d}{dt}\bigl[v(x_t,t,s)\bigr], $$

which expresses the student average velocity from timestep $t$ to $s$ in terms of the (teacher’s) instantaneous velocity at $t$ and the time derivative of the student velocity with respect to the local timestep $t$.
Tri-consistency (additivity of short segments): for $s_2 > s_1 > t$,

$$ (s_1-t)\,v(x_t,t,s_1) + (s_2-s_1)\,v(x_{s_1},s_1,s_2) \;=\; (s_2-t)\,v(x_t,t,s_2). $$

This says that two short forwards ($t$ to $s_1$ to $s_2$) should equal the single forward (from $t$ to $s_2$).

Goal: derive a practical target for $v_\theta(x_t,t,s)$ that accounts for both the forward and backward constraints.

2. An Intuitive Derivation

Start from tri-consistency and replace the second segment ($s_1$ to $s_2$) with teacher PF-ODE simulation:

$$ (s_1-t)\,v(x_t,t,s_1) + \mathrm{ODE}\bigl[v_\phi,x_{s_1},s_1,s_2\bigr] = (s_2-t)\,v(x_t,t,s_2). $$

For small step $ds = s_2 - s_1$ the one-step Euler approximation to the teacher ODE gives

$$ \mathrm{ODE}[v_\phi,x_{s_1},s_1,s_2] \approx \,v_\phi(x_{s_1},s_1) ds. $$

Now set $s_1 = s$ and $s_2 = s + ds$. Substitute the MeanFlow identity (backward) for the two student velocities $v(x_t,t,s)$ and $v(x_t,t,s_2)$. After algebra and cancellation of terms proportional to $ds$, we arrive at the following mixed relation:

$$ v_\phi(x_s,s) - v_\phi(x_t,t) \;=\; 2(s-t)\,\frac{d}{dt}v(x_t,t,s) + (s-t)^2\;\frac{d^2}{dt\,ds}v(x_t,t,s). $$

Substituting the MeanFlow identity to eliminate the first time derivative and rearranging to isolate $v(x_t,t,s)$ yields the second-order MeanFlow identity:

$$ \boxed{\;v(x_t,t,s) = \frac{1}{2}\bigl(v_\phi(x_t,t) + v_\phi(x_s,s)\bigr) - \frac{1}{2}(s-t)^2\,\frac{d^2}{dt\,ds}v(x_t,t,s)\;} \tag{★} $$

Remarks

Equation (★) is the second-order MeanFlow expression: the student velocity equals the average of the two teacher velocities minus a second-order correction involving the mixed partial $\dfrac{d^2}{dt\,ds}v$.
(Quick validation) Similar to MeanFlow, when $t=s$, Equation (★) reduces to the flow matching loss.
Unlike the forward and backward distillation losses, Equation (★) cannot be interpreted simply as a special case of tri-consistency.
In practice, we expect to use this loss along with the first-order MeanFlow loss to improve few-step performance (at the cost of more computation).
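Equation (★) can be sanity-checked on a toy trajectory where everything is known in closed form. The sketch below assumes a single one-dimensional path $x_\tau=\sin\tau$, so the teacher velocity along the path is $\cos\tau$ and the exact average velocity is $(x_s-x_t)/(s-t)$. Because the state is fully determined by time on this path, the total derivatives reduce to ordinary partial derivatives in $(t,s)$, which are estimated by central finite differences. The path and the evaluation point are hypothetical choices.

```python
import math

def x_traj(tau):
    """Toy trajectory; the state at time tau."""
    return math.sin(tau)

def teacher_velocity(tau):
    """Instantaneous velocity along the toy trajectory."""
    return math.cos(tau)

def avg_velocity(t, s):
    """Exact average velocity from time t to time s."""
    return (x_traj(s) - x_traj(t)) / (s - t)

def d_dt(t, s, h=1e-4):
    """Central difference for d/dt of the average velocity."""
    return (avg_velocity(t + h, s) - avg_velocity(t - h, s)) / (2.0 * h)

def mixed_partial(t, s, h=1e-3):
    """Central difference for the mixed derivative d^2 / (dt ds)."""
    return (
        avg_velocity(t + h, s + h) - avg_velocity(t + h, s - h)
        - avg_velocity(t - h, s + h) + avg_velocity(t - h, s - h)
    ) / (4.0 * h * h)

t, s = 0.3, 1.1
u = avg_velocity(t, s)

# First-order (backward) MeanFlow identity.
first_order = teacher_velocity(t) + (s - t) * d_dt(t, s)

# Second-order MeanFlow identity, Equation (★).
second_order = (
    0.5 * (teacher_velocity(t) + teacher_velocity(s))
    - 0.5 * (s - t) ** 2 * mixed_partial(t, s)
)

print(abs(u - first_order), abs(u - second_order))
```

Both residuals are at the level of the finite-difference error. In training, the derivatives would instead come from automatic differentiation of the student network rather than from an exact average velocity.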

3. Generalized Second-order Loss

Given the backward and forward distillation losses:

$$ v(x_t,t,s) = v_\phi(x_t,t) + (s-t)\frac{d}{dt}\bigl[v(x_t,t,s)\bigr],\\ v(x_t,t,s) = v_\phi(x_s,s) - (s-t)\frac{d}{ds}\bigl[v(x_t,t,s)\bigr], $$

we can combine them with our second-order MeanFlow loss using weights $\alpha$ and $\beta$ to get a generalized loss:

$$ \begin{aligned} v(x_t,t,s) \;=\;& \; {\alpha v_\phi(x_t,t) + (1-\alpha) v_\phi(x_s,s)} \\ &+ (s-t)\left[(\alpha - \frac{1}{2}\beta)\frac{d}{dt}v(x_t,t,s) - (1 - \alpha - \frac{1}{2}\beta)\frac{d}{ds}v(x_t,t,s)\right] \\ &- \frac{\beta}{2}(s-t)^2\frac{d^2}{dt\,ds}v(x_t,t,s). \end{aligned} $$

For $\beta = 0$, $\alpha = 1$ or $0$, we recover backward or forward distillation loss; for $\beta = 1$, $\alpha = \frac{1}{2}$, we recover the second-order MeanFlow loss (★).
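The generalized identity can be checked on a toy one-dimensional path $x_\tau=\sin\tau$, for which the exact average velocity is known. This is a hypothetical setup for illustration, with finite differences standing in for the derivatives. In the code, `a` and `b` play the roles of $\alpha$ and $\beta$. The identity should hold for any weights, including the three special cases listed above.

```python
import math

def teacher_velocity(tau):
    """Instantaneous velocity along the toy path x(tau) = sin(tau)."""
    return math.cos(tau)

def avg_velocity(t, s):
    """Exact average velocity from t to s on the toy path."""
    return (math.sin(s) - math.sin(t)) / (s - t)

def partials(t, s, h=1e-3):
    """Central-difference estimates of d/dt, d/ds and d^2/(dt ds)."""
    du_dt = (avg_velocity(t + h, s) - avg_velocity(t - h, s)) / (2.0 * h)
    du_ds = (avg_velocity(t, s + h) - avg_velocity(t, s - h)) / (2.0 * h)
    d2u = (
        avg_velocity(t + h, s + h) - avg_velocity(t + h, s - h)
        - avg_velocity(t - h, s + h) + avg_velocity(t - h, s - h)
    ) / (4.0 * h * h)
    return du_dt, du_ds, d2u

def generalized_target(t, s, a, b):
    """Right-hand side of the generalized identity with weights a and b."""
    du_dt, du_ds, d2u = partials(t, s)
    dt = s - t
    return (
        a * teacher_velocity(t) + (1.0 - a) * teacher_velocity(s)
        + dt * ((a - 0.5 * b) * du_dt - (1.0 - a - 0.5 * b) * du_ds)
        - 0.5 * b * dt ** 2 * d2u
    )

t, s = 0.3, 1.1
u = avg_velocity(t, s)

# Backward, forward, second-order, and one arbitrary mixed setting.
settings = [(1.0, 0.0), (0.0, 0.0), (0.5, 1.0), (0.3, 0.7)]
gaps = {(a, b): abs(u - generalized_target(t, s, a, b)) for a, b in settings}
print(gaps)
```

The reason every setting works is the relation $\frac{d}{ds}v-\frac{d}{dt}v=(s-t)\frac{d^2}{dt\,ds}v$, obtained by differentiating the backward identity with respect to $s$: the $\beta$-dependent terms cancel exactly for the true average velocity, so $\beta$ only changes which derivatives the regression target relies on.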

References

[1] Mean Flows for One-Step Generative Modeling.
Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, and Kaiming He. arXiv preprint, arXiv:2505.13447, 2025.

[2] ICML Tutorial on the Blessing of Flow.
Qiang Liu. International Conference on Machine Learning (ICML), 2025.

New Year Resolution 2025

2024-12-09T00:00:00+01:00

It’s a time to reflect on the past year and set new goals for the next year. While I tried to summarize my research journey in 2024, I finally commented out the whole content because it was a bit lengthy and not well-organized, which might not be interesting to read.

But anyway, here comes my new year resolution for 2025.

Stay close to my family and friends.
Keep healthy and exercise regularly.
Do some interesting research (overcome necessary engineering difficulties).
Have more discussions with my colleagues and other researchers, inspire more sparks.

Bayesian Posterior Sampling (3) Langevin Dynamics and Diffusion Models

2023-07-05T00:00:00+02:00

Introduction

In the previous post, I presented some of the most classical MCMC methods, Metropolis-Hastings (MH) and its variants. In this post I’d like to focus on another important MCMC method: the Unadjusted Langevin Algorithm (ULA) in Bayesian statistics, also known as Langevin Monte Carlo (LMC) in machine learning. LMC is an MCMC method that uses Langevin dynamics to obtain random samples from a probability distribution $\pi$ for which direct sampling is difficult (e.g., in high dimensions or with a large amount of data).

Langevin dynamics provides an MCMC procedure to sample from a distribution $p(x)$ using only its score function $\nabla_x \log p(x)$. Specifically, it initializes the chain from an arbitrary prior distribution $x_0 \sim \pi(x)$, and then iterates the following:

$$ \begin{align*} x_{i+1} \leftarrow x_i + \epsilon \nabla_x \log p(x) + \sqrt{2 \epsilon} z_i \label{Des_LD}\tag{1} \end{align*}$$

—— Generative Modeling by Estimating Gradients of the Data Distribution by Yang Song. Personal Blog, 2021.

Langevin Dynamics Sampling

Langevin Diffusion

The overdamped¹ Langevin Itô diffusion can be written as the following Stochastic Differential Equation (SDE):

$$ \begin{align*} \mathrm{d}{X_t}=\underbrace{-\nabla U(X_t)\mathrm{d}t}_{\text{drift term}} + \underbrace{\sqrt {2}\,\mathrm{d}{B_t}}_{\text{diffusion term}}, \quad X_0\sim p_0 \label{LD}\tag{2} \end{align*} $$

where $U(X)$ is the (time-independent) potential energy function on $\mathbb{R}^d$ and $B_t$ is a $d$-dimensional standard Brownian motion (also called the Wiener process). We assume that $U$ is $L$-smooth (equivalently, $\nabla U$ is $L$-Lipschitz): it is continuously differentiable and $\|\nabla U(x)-\nabla U(y)\| \leq L\|x-y\|$ for all $x, y$ and some $L > 0$.

In order to sample the diffusion paths, we can discretize eq(\ref{LD}) using the Euler-Maruyama (EM) scheme as follows² (similar to eq(\ref{Des_LD})):

$$ \begin{align*} x_{i+1} = x_i - \gamma_{i+1} \nabla_x U(x_i) + \sqrt{2 \gamma_{i+1}} z_i \label{Des_LD2}\tag{3} \end{align*} $$

where $z_i$ is i.i.d $\mathcal{N}(0,I_d)$ and $\gamma_{i}$ is the stepsize, either constant or decreasing to 0.
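The update in eq(\ref{Des_LD2}) is short enough to try directly. Below is a minimal sketch with a constant step size, assuming a one-dimensional standard Gaussian target, i.e. $U(x)=x^2/2$; the names `ula_sample` and `grad_u_gaussian`, the step size and the burn-in length are all illustrative choices, not part of any reference implementation.

```python
import math
import random


def ula_sample(grad_u, x0, step, n_steps, rng):
    """Iterate x <- x - step * grad_u(x) + sqrt(2 * step) * z with z ~ N(0, 1)."""
    x = x0
    chain = []
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x - step * grad_u(x) + math.sqrt(2.0 * step) * z
        chain.append(x)
    return chain


def grad_u_gaussian(x):
    # Assumed potential U(x) = x^2 / 2, so the target is a standard Gaussian.
    return x


rng = random.Random(0)
chain = ula_sample(grad_u_gaussian, x0=5.0, step=0.05, n_steps=50_000, rng=rng)
samples = chain[5_000:]  # drop an (arbitrary) burn-in period
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f}, var={var:.3f}")
```

Only the gradient of the potential is needed, which is what makes the scheme attractive when the normalizing constant of the target is unknown.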

The subject of mixing time in MCMC algorithms is quite complex and necessitates substantial analysis skills. Hence, we won’t delve into it in this blog post.

Fokker-Planck Equation

The Fokker-Planck (FP) equation, a form of partial differential equation (PDE), outlines how a probability distribution changes over time due to the influences of deterministic drift forces and stochastic fluctuations.

Let’s denote the law of $X_t$ as $p_t$. The FP equation for $p_t$ in eq(\ref{LD}) can then be written as³^,⁴:

$$ \begin{align*} \partial_t p_t = \nabla\cdot(p_t\nabla U) + \Delta p_t \label{FP}\tag{4} \end{align*} $$

To verify that $\pi \propto \exp(-U)$ is the unique stationary distribution of this FP equation (\ref{FP}), and hence of the corresponding Langevin diffusion (which motivates us to use Langevin dynamics for sampling from $\pi$), we can do either of the following⁵:

substitute $p_t$ with $\pi$ to verify that it’s a stationary distribution ($\partial_t \pi = 0$)
assume we already have a stationary distribution $p_\infty(x)$ such that $\partial_t p_\infty=0$, and verify that $p_\infty(x) \propto \exp(-U(x))$.
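
As a quick check of the first option (assuming $\pi \propto \exp(-U)$ and enough smoothness to differentiate), note that $\nabla \pi = -\pi \nabla U$, so the right-hand side of the FP equation vanishes:

$$ \begin{align*} \nabla\cdot(\pi\nabla U) + \Delta \pi = \nabla\cdot(\pi\nabla U + \nabla \pi) = \nabla\cdot(\pi\nabla U - \pi\nabla U) = 0. \end{align*} $$

This only shows stationarity; uniqueness needs additional arguments that are not covered here.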

Unadjusted Langevin Algorithm

When we use eq(\ref{Des_LD2}) for sampling, we are implementing the ULA. Even though the stationary distribution of the Langevin diffusion is $\pi$, the stationary distribution of ULA is not, due to the discretization⁶! Note that when $\gamma_i$ is constant, its value controls the trade-off between convergence speed and accuracy⁷. ULA is easily biased with a larger step size $\gamma_i$; one could use a decreasing step size or introduce the MH scheme to eliminate the bias.
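
A small worked example (an illustration added here, not taken from the references) makes the bias concrete. Take the one-dimensional standard Gaussian target with $U(x)=x^2/2$ and a constant step size $\gamma \in (0, 2)$. The ULA update becomes a linear recursion, and solving for its stationary variance $v$ gives:

$$ \begin{align*} x_{i+1} = (1-\gamma)x_i + \sqrt{2\gamma}\, z_i \quad\Longrightarrow\quad v = (1-\gamma)^2 v + 2\gamma \quad\Longrightarrow\quad v = \frac{1}{1-\gamma/2}. \end{align*} $$

So the chain converges to $\mathcal{N}\left(0, \frac{1}{1-\gamma/2}\right)$ rather than $\mathcal{N}(0,1)$, and the bias only vanishes as $\gamma \to 0$.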

Metropolis-Adjusted Langevin Algorithm

The well-known MALA can be considered as MH with improved proposals or, in the context of this post, as ULA with an extra rejection step:

$$ \begin{align*} g(x'|x) &= \mathcal{N}(x';x+\tau \nabla \log\tilde{\pi}(x),2\tau I) \\ p_{\mathrm{accept}}(x') &= \min (1,\alpha) =\min \left(1, \frac{\tilde\pi\left(x^{\prime}\right) g\left(x | x^{\prime}\right)}{\tilde\pi(x) g\left(x^{\prime} | x\right)}\right) \label{MALA}\tag{5} \end{align*} $$

where the additional term $\tau \nabla \log\tilde{\pi}(x)$ drives the samples toward areas with high probability density, which makes the proposed samples more likely to be accepted. One significant advantage of MALA over Random Walk Metropolis (RWM) lies in the optimal step-size selection. In MALA, the asymptotically optimal value of this parameter is larger than its RWM equivalent, which reduces the dependence between subsequent samples.
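
To make the propose-then-correct structure concrete, here is a minimal one-dimensional sketch. It assumes a standard Gaussian target and uses the standard form of the Langevin proposal, with mean $x + \tau \nabla \log \tilde\pi(x)$ and variance $2\tau$; the function names and the value of `tau` are hypothetical choices for illustration.

```python
import math
import random


def log_pi(x):
    # Unnormalized log-density of the assumed target, a standard Gaussian.
    return -0.5 * x * x


def grad_log_pi(x):
    return -x


def log_g(x_to, x_from, tau):
    # Log-density (up to a constant) of N(x_to; x_from + tau * grad_log_pi(x_from), 2 * tau).
    mean = x_from + tau * grad_log_pi(x_from)
    return -((x_to - mean) ** 2) / (4.0 * tau)


def mala(x0, tau, n_steps, rng):
    x, chain, accepted = x0, [], 0
    for _ in range(n_steps):
        # Langevin proposal: one ULA step from the current state.
        prop = x + tau * grad_log_pi(x) + math.sqrt(2.0 * tau) * rng.gauss(0.0, 1.0)
        # Metropolis-Hastings ratio with the asymmetric proposal correction.
        log_alpha = (log_pi(prop) + log_g(x, prop, tau)) - (log_pi(x) + log_g(prop, x, tau))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = prop
            accepted += 1
        chain.append(x)  # a rejected proposal repeats the current state
    return chain, accepted / n_steps


rng = random.Random(0)
chain, acc_rate = mala(x0=3.0, tau=0.5, n_steps=20_000, rng=rng)
kept = chain[2_000:]
mean = sum(kept) / len(kept)
var = sum((v - mean) ** 2 for v in kept) / len(kept)
print(f"mean={mean:.3f}, var={var:.3f}, acceptance={acc_rate:.2f}")
```

Unlike plain ULA, the rejection step removes the discretization bias, so the step size only affects efficiency and not the stationary distribution.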

Stochastic Gradient Langevin Dynamics

Stochastic Gradient Langevin Dynamics (SGLD) is a method proposed by Max Welling and Yee Whye Teh where the target density $\pi$ is the density of the posterior distribution $p(\theta|x)\propto {p(\theta)}\prod_{i=1}^{N}{p(x_i|\theta)}$. The authors first observe the parallels between stochastic gradient algorithms (which use a batch of samples in each iteration) and Langevin dynamics (which uses all available samples in each iteration; one can consider ULA as gradient descent with some Gaussian noise added at each iteration), and then propose a novel update rule that merges these two concepts with unbiased estimates of the gradient:

$$ \begin{align*} \Delta\theta_{t}=\frac{\epsilon_{t}}{2}\left(\nabla\log p(\theta_{t})+\frac{N}{n}\sum_{i=1}^{n}\nabla\log p(x_{t i}|\theta_{t})\right)+\eta_{t} \label{SGLD}\tag{6} \end{align*} $$

where the step sizes $\epsilon_{t}$ decrease towards zero, $n$ is the batch size, and $\eta_{t} \sim \mathcal{N}(0,\epsilon_{t})$.

Yes, SGLD is proposed to train a model given the dataset⁸, and the idea is quite simple: we train the model using regular SGD, but add some Gaussian noise to each step. There are two sources of randomness: the mini-batch estimate of the gradient and the injected Gaussian noise.
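
The update rule in eq(\ref{SGLD}) can be sketched on a toy problem where the exact posterior is known. The example below assumes a conjugate Gaussian model (prior $\theta \sim \mathcal{N}(0, 10^2)$, likelihood $x_i|\theta \sim \mathcal{N}(\theta, 1)$) and a polynomially decaying step size; the data, the schedule constants and all names are illustrative assumptions.

```python
import math
import random

rng = random.Random(0)

# Synthetic data x_i ~ N(1.5, 1); model: theta ~ N(0, 10^2), x_i | theta ~ N(theta, 1).
N, n, prior_std = 100, 10, 10.0
data = [rng.gauss(1.5, 1.0) for _ in range(N)]


def grad_log_prior(theta):
    return -theta / prior_std**2


def grad_log_lik(x, theta):
    return x - theta


theta, trace = 0.0, []
for t in range(6_000):
    eps = 1e-3 * (1.0 + t / 1_000.0) ** -0.55  # step size decaying towards zero
    batch = rng.sample(data, n)  # mini-batch of size n
    grad = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(x, theta) for x in batch)
    theta += 0.5 * eps * grad + rng.gauss(0.0, math.sqrt(eps))  # eta_t ~ N(0, eps_t)
    trace.append(theta)

kept = trace[1_000:]
sgld_mean = sum(kept) / len(kept)
exact_mean = sum(data) / (N + 1.0 / prior_std**2)  # conjugate posterior mean
print(f"SGLD mean={sgld_mean:.3f}, exact posterior mean={exact_mean:.3f}")
```

The factor $N/n$ rescales the mini-batch gradient so that it is an unbiased estimate of the full-data gradient.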

From Langevin Dynamics to Diffusion Models

Pitfalls with Unadjusted Langevin Dynamics

⁹A Connection between Score Matching and Denoising Autoencoders P. Vincent. Neural computation, Vol 23(7), pp. 1661--1674. MIT Press. 2011.

¹⁰Generative Modeling by Estimating Gradients of the Data Distribution Y. Song, S. Ermon. NeurIPS 2019.

¹¹In this blog by Jianlin Su you can find some discussion (in Chinese) about the relationship between denoising score matching and score matching.

¹²In this blog the authors introduced a very similar technique called Simulated Tempering Langevin Monte Carlo.

Assume now we have access to an oracle $s_\theta \approx \nabla \log p_\infty = -\nabla U$ through score matching⁹:

$$ \begin{align*} \theta^{\star} = & \arg \min \frac{1}{2}\mathbb{E}_{p_{\mathrm{data}}(x)} \left[||s_\theta (x)-\nabla_{x} \log p_{\mathrm{data}}(x)||_{2}^{2} \right] \\ = & \arg \min \mathbb{E}_{p_{\mathrm{data}}(x)}\left[\mathrm{tr}(\nabla_{x}{s}_\theta(x))+\frac{1}{2}||{s}_\theta(x)||_{2}^{2}\right], \label{sm}\tag{7} \end{align*} $$

is it then safe to use the Langevin algorithm (\ref{Des_LD2}) for image generation in high dimensions? No, there are pitfalls¹⁰!

The manifold hypothesis: the score function (which $s_\theta$ approximates) is ill-defined when $x$ is confined to a low-dimensional manifold in a high-dimensional space, and the score matching objective in eq(\ref{sm}) provides an inconsistent score estimator when the data reside on a low-dimensional manifold.
The scarcity of data in low-density regions causes difficulties for both score estimation with score matching and MCMC sampling with Langevin dynamics (which mixes slowly when regions of the data manifold are separated).

Annealed Langevin Dynamics

To overcome these issues, Yang Song et al. proposed the Noise Conditional Score Network (NCSN) $s_\theta(x,\sigma)$, which learns the score of the Gaussian-perturbed data distribution $p_{\sigma_i}(x) = \mathbb{E}_{x' \sim p(x')}\left[p_{\sigma_i}(x|x')\right] = \int p(x') \mathcal{N}(x; x', \sigma_i^2 I) \mathrm{d} x'$ at various noise levels $\sigma_i$. $\{\sigma_i\}^T_{i=1}$ is a positive geometric sequence that satisfies $\frac{\sigma_1}{\sigma_2}=\cdots=\frac{\sigma_i}{\sigma_{i+1}}=\cdots=\frac{\sigma_{T-1}}{\sigma_T}>1$. Since the distributions $\{p_{\sigma_i}\}^T_{i=1}$ are all perturbed by Gaussian noise, their supports span the whole space and their scores are well-defined, avoiding difficulties from the manifold hypothesis. To generate samples with annealed Langevin dynamics, we start with the largest noise level $\sigma_1$ and anneal the noise level down to $\sigma_T\approx 0$:

$$ \begin{align*} x_{i+1} = x_i + {\alpha_{i+1}} s_\theta(x_i,\sigma_{i+1}) + \sqrt{2\alpha_{i+1}} z_i \label{ALD}\tag{8} \end{align*} $$

where $\alpha_{i}=\frac{\gamma}{2} \frac{\sigma_i^2}{\sigma_{T}^2}$ is the step size, scaled relative to the smallest noise level $\sigma_T$ as in NCSN¹⁰. The only difference between eq(\ref{ALD}) and eq(\ref{Des_LD2}) is that the drift term $s_\theta(x_i,\sigma_{i+1}) \approx -\nabla U_i(x_i)$ is no longer fixed and changes with the noise level.

The sampling algorithm of NCSN, which consists of a double loop, can be found in ¹⁰. The outer loop iterates over the noise levels, while the inner loop runs a fixed number of Langevin steps at each level so that the samples approximately follow $p_{\sigma_i}$.
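
The double loop can be sketched on a toy problem where the perturbed score is available in closed form, so no network is needed. The example assumes one-dimensional data drawn from an equal mixture of $\mathcal{N}(-2, 0.1^2)$ and $\mathcal{N}(2, 0.1^2)$, uses the exact score of $p_{\sigma}$ in place of $s_\theta$, and takes the step size proportional to $\sigma^2$ (the constant `c`, the noise schedule and all names are assumptions made for illustration).

```python
import math
import random

MODES, DATA_STD = (-2.0, 2.0), 0.1  # toy data: equal mixture of two narrow Gaussians


def score(x, sigma):
    """Exact score of the sigma-perturbed mixture (stands in for s_theta)."""
    var = DATA_STD**2 + sigma**2
    logits = [-((x - m) ** 2) / (2.0 * var) for m in MODES]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]  # numerically stable responsibilities
    total = sum(weights)
    return sum(w / total * (-(x - m) / var) for w, m in zip(weights, MODES))


def annealed_langevin(sigmas, inner_steps, c, rng):
    x = rng.gauss(0.0, sigmas[0])  # start from a broad distribution
    for sigma in sigmas:  # outer loop: noise levels, from large to small
        alpha = c * sigma**2  # step size shrinks with the noise level
        for _ in range(inner_steps):  # inner loop: Langevin steps at this level
            x = x + alpha * score(x, sigma) + math.sqrt(2.0 * alpha) * rng.gauss(0.0, 1.0)
    return x


levels = 10
sigmas = [3.0 * (0.01 / 3.0) ** (i / (levels - 1)) for i in range(levels)]  # geometric
rng = random.Random(0)
samples = [annealed_langevin(sigmas, inner_steps=100, c=0.1, rng=rng) for _ in range(500)]
frac_right = sum(s > 0 for s in samples) / len(samples)
print(f"fraction of samples near +2: {frac_right:.2f}")
```

At the large noise levels the two modes merge into one broad bump, which lets the chain move freely before the smaller noise levels pull it into one of the modes.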

To learn the score function of the perturbed data distribution, we can use the objective of denoising score matching⁹:

$$ \begin{align*} \arg\min_\theta \mathbb{E}_{x_i\sim p_{\sigma_i}(x_i)} \left[||s_\theta (x_i)-\nabla_{x_i} \log p_{\sigma_i}(x_i)||_{2}^{2} \right] = \arg\min_\theta \mathbb{E}_{x,{x}_i \sim p_{\mathrm{data}}(x)p_{\sigma_i}({x}_i|{x})}\left[||s_\theta (x_i)-\nabla_{x_i} \log p_{\sigma_i}(x_i|x)||_{2}^{2}\right], \label{dsm}\tag{9} \end{align*} $$

From score matching to denoising score matching, the difference is a constant term independent of $\theta$¹¹:

$$ \mathbb{E}_{x_i\sim p_{\sigma_i}(x_i)}\left[{\mathbb{E}_{x\sim p_{\sigma_i}(x|x_i)}\left[\left||\nabla_{x_i}\log p_{\sigma_i}(x_i|x)\right||^2\right] - \left||\nabla_{x_i}\log p_{\sigma_i}(x_i)\right||^2}\right] $$
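
For the Gaussian perturbation kernel $p_{\sigma_i}(x_i|x) = \mathcal{N}(x_i; x, \sigma_i^2 I)$ used here, the regression target in eq(\ref{dsm}) has a simple closed form. Writing $x_i = x + \sigma_i z$ with $z \sim \mathcal{N}(0, I)$:

$$ \begin{align*} \nabla_{x_i} \log p_{\sigma_i}(x_i|x) = -\frac{x_i - x}{\sigma_i^2} = -\frac{z}{\sigma_i}, \end{align*} $$

so the network is effectively trained to predict the (scaled and negated) noise that was added to the clean sample.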

By matching the score, we are actually minimizing the KL divergence between the target (data) distribution and the parameterized distribution: $D_\mathrm{KL}(p_{\mathrm{data}}(x)||p_{\theta}(x))$.

With the learnt scores $s_\theta(x,\sigma_i) \approx \nabla_x \log p_{\sigma_i}(x)$, we can generate samples according to eq(\ref{ALD}). When $\sigma_i$ is large, the modes of $p(x)$ are smoothed out by the Gaussian kernel and $s_\theta(x,\sigma_i)$ points to the mean of the modes; as $\sigma_i$ is annealed down, the dynamics are attracted to the actual modes of the target distribution¹².

In the next section, we will explore another family of sampling algorithms: reverse SDEs.

Score-based Stochastic Differential Equations

The continuous SDE form of eq(\ref{ALD}) can be written as:

$$ \begin{align*} \mathrm{d}{X_t}=\underbrace{-\nabla U_t(X_t)\mathrm{d}t}_{\text{drift term}} + \underbrace{\sqrt {2}\,\mathrm{d}{B_t}}_{\text{diffusion term}}, \quad X_0\sim p_0 \label{ALD2}\tag{10} \end{align*} $$

where $U_t\propto -\log p_t$ now also depends on time $t$, compared to $U\propto -\log p_\infty$ in eq(\ref{LD})¹³.

In another wonderful work by Yang Song et al.¹⁴, the authors construct a forward diffusion process, which maps the target distribution to a (usually) simple distribution, in a more general form of SDE:

$$ \begin{align*} \mathrm{d}{x_t}={f_t(x_t)\mathrm{d}t} + {g_t\,\mathrm{d}{B_t}}, \label{SDE}\tag{11} \end{align*} $$

Each pair of $f_t$ and $g_t$ defines a unique forward process and the corresponding marginals $p_t$ given the initial distribution.

From now on, we will swap the notation of time $t$, such that for $t=0$ we have target distribution $p_0 = \pi$ and for $t=T$ we have simple distribution $p_T$.

Furthermore, we can initiate with samples from the simple distribution $p_T$, and generate samples from target distribution $p_0$ following the dynamics of the corresponding reverse SDE:

$$ \begin{align*} \mathrm{d}{x_t} &= \left[f_t(x_t) - g_t^2 \nabla_{x_t} \log p_t({x_t})\right]\mathrm{d}t + g_t\, \mathrm{d}{\bar{B}_t}, \quad x_T\sim p_T \label{reverseSDE}\tag{12} \end{align*} $$

This SDE runs backward in time from $T$ to 0, so $\mathrm{d}t$ is an infinitesimal negative timestep and $\bar{B}_t$ is a Brownian motion in reverse time. Now we can sample from the target distribution using this reverse SDE as long as we have access to $s_\theta \approx \nabla \log p_t$.
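
A minimal Euler-Maruyama sketch of the reverse SDE is given below. To keep it self-contained it assumes a specific forward process, $f_t(x) = -x/2$ and $g_t = 1$, and a Gaussian target $p_0 = \mathcal{N}(2, 0.5^2)$, for which the marginals $p_t$ and their scores are known exactly; in practice `score` would be replaced by a learned $s_\theta$. All names and constants are hypothetical.

```python
import math
import random

M0, S0, T = 2.0, 0.5, 5.0  # assumed target p_0 = N(M0, S0^2) and time horizon T
G = 1.0  # assumed constant diffusion coefficient g_t


def f(x, t):
    return -0.5 * x  # assumed forward drift f_t(x)


def marginal(t):
    """Mean and variance of p_t under the assumed forward process."""
    mean = M0 * math.exp(-0.5 * t)
    var = S0**2 * math.exp(-t) + 1.0 - math.exp(-t)
    return mean, var


def score(x, t):
    mean, var = marginal(t)
    return -(x - mean) / var  # exact score of the Gaussian marginal p_t


def reverse_sde_sample(n_steps, rng):
    h = T / n_steps
    mean_T, var_T = marginal(T)
    x = rng.gauss(mean_T, math.sqrt(var_T))  # start from p_T
    for k in range(n_steps):
        t = T - k * h
        drift = f(x, t) - G**2 * score(x, t)
        # Time runs backward (dt = -h), so the drift enters with a minus sign.
        x = x - drift * h + G * math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
samples = [reverse_sde_sample(400, rng) for _ in range(1_000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"mean={mean:.3f}, std={std:.3f}")
```

The samples should roughly recover the assumed target $\mathcal{N}(2, 0.5^2)$, up to discretization and Monte Carlo error.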

Note that eq(\ref{ALD2}) and eq(\ref{reverseSDE}) are two different sampling strategies, and we can apply both for sampling¹⁴ (aka. corrector and predictor): we use the corrector to ensure $x_t \sim p_t$ and use the predictor to jump to $p_{t-1}$.

Still, for $f_t=\nabla_{x} \log p_t({x})$ and $g_t=\sqrt{2}$, we get eq(\ref{ALD2}) as a special case of eq(\ref{reverseSDE}) representing backward sampling (note that eq(\ref{reverseSDE}) is reversed in time). But now the forward diffusion does not lead to the simple Gaussian distribution that we can sample from easily. In other words, we can’t enjoy the fast sampling trajectory from Gaussian to target distribution anymore in this scenario.

For score-based models (or diffusion models) with forward SDE in the form of eq(\ref{SDE}), we can write the corresponding FP equation in the form of the continuity equation¹⁵:

$$ \begin{align*} \partial_t p_t = -\nabla\cdot(f_t p_t) + \frac{g_t^2}{2}\Delta p_t = -\nabla\cdot((f_t - \frac{g_t^2}{2} \nabla\log p_t) p_t) \label{FP2}\tag{13} \end{align*} $$

where we define the vector field as $v_t = f_t - \frac{g_t^2}{2} \nabla\log p_t$, $g_t$ is somehow related to the temperature of the stationary distribution, and $p_\infty$ is the initial distribution (not necessarily Gaussian).

The corresponding Ordinary Differential Equation (ODE) of a particle moving along this vector field is the so-called probability flow ODE (PFODE)¹⁴:

$$ \begin{align*} \mathrm{d} x = \bigg[f_t(x) - \frac{1}{2}g_t^2 \nabla_{x} \log p_t({x})\bigg] \mathrm{d}t \quad \Longleftrightarrow \quad \frac{\mathrm{d}x}{\mathrm{d}t} = v_t \label{prob_ode} \tag{14} \end{align*} $$

And we can use this probability flow ODE to travel either forward or backward in time.

Given $f_t$ and $g_t$ which define the forward diffusion process, we can travel backward in time with learned score function $s_\theta \approx \nabla \log p_t$ or the learned vector field $v_\theta \approx f_t - \frac{g_t^2}{2} \nabla\log p_t$.
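
The deterministic counterpart can be sketched in the same way. The example below integrates the probability flow ODE backward with plain Euler steps, again assuming $f_t(x) = -x/2$, $g_t = 1$ and a Gaussian target $\mathcal{N}(2, 0.5^2)$ so that the exact vector field is available; with a learned model one would plug in $v_\theta$ instead. Names and constants are illustrative.

```python
import math

M0, S0, T = 2.0, 0.5, 5.0  # assumed target p_0 = N(M0, S0^2) and time horizon T
G = 1.0  # assumed constant diffusion coefficient g_t


def marginal(t):
    """Mean and variance of p_t for the assumed forward drift f_t(x) = -x / 2."""
    mean = M0 * math.exp(-0.5 * t)
    var = S0**2 * math.exp(-t) + 1.0 - math.exp(-t)
    return mean, var


def velocity(x, t):
    mean, var = marginal(t)
    score = -(x - mean) / var
    return -0.5 * x - 0.5 * G**2 * score  # v_t = f_t - (g_t^2 / 2) * score


def pf_ode_backward(x_T, n_steps):
    h = T / n_steps
    x = x_T
    for k in range(n_steps):
        t = T - k * h
        x = x - h * velocity(x, t)  # Euler step backward in time
    return x


mean_T, var_T = marginal(T)
x_T = mean_T + math.sqrt(var_T)  # a point one standard deviation above the mean of p_T
x_0 = pf_ode_backward(x_T, n_steps=2_000)
print(f"x_T={x_T:.3f} -> x_0={x_0:.3f}")
```

Because the flow is deterministic and all marginals are Gaussian here, a point one standard deviation above the mean of $p_T$ is carried to (approximately) one standard deviation above the mean of $p_0$, i.e. close to $2.5$ under these assumptions.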

It’s worth noting that, in order to sample from a target distribution according to the reverse SDE (\ref{reverseSDE}), the score functions at all times $t$ are explicitly required (no matter the form of $f_t$ and $g_t$), rather than only the score function of the target distribution. Moreover, my understanding of the success of diffusion models is that they explicitly define paths in both the data space and the probability measure space: given the initial state $x_0$, for each time step $t$ on the paths, we know where we are ($p_t$) and where we are aiming for ($v_t$).

Bayesian Posterior Sampling (2) MCMC Basics

2023-06-26T00:00:00+02:00

Introduction

In the previous post, I showed that we can use MCMC methods to draw samples from a target distribution $\pi$. In this post, I will present the most classical MCMC method, Metropolis-Hastings (MH), and its variants. Before turning to MCMC methods, I want to first get the readers familiar with another Monte Carlo approach named the acceptance-rejection method, which is also able to generate samples from an (unnormalized) distribution $\tilde\pi$ and whose shortcomings motivate us to use Markov chains.

Acceptance-Rejection Method

Sampling directly from the target distribution $\pi(x)$ is often infeasible. The technique named acceptance-rejection sampling, also known as rejection sampling, provides a solution to this issue. The fundamental concept here is to propose a different probability distribution, $G$, with a density function $g(x)$, which not only has an efficient sampling algorithm¹ readily available, but also spans (at least) the same domain as $\tilde\pi(x)$ (otherwise, there would be parts of the region under $\pi(x)$ that could never be reached). Consequently, it seems natural to construct a decision rule² that determines whether to accept samples from the proposal $g$, based of course on our knowledge of $\tilde\pi(x)$. This decision rule should ensure that the long-term fraction of time spent in each state is proportional to $\tilde\pi(x)$.

Let’s first have a look at the algorithm, which is quite simple: repeat the following two steps until enough samples are accepted:

sample $x$ from proposal $g$.

accept with probability $p_{\mathrm{accept}}(x) = \frac{\tilde\pi(x)}{Mg(x)}$.

where $M$ is a sufficiently large constant that ensures $\frac{\tilde\pi(x)}{Mg(x)}$ can be interpreted as a probability, that is, $0 \leq p_{\mathrm{accept}}(x) \leq 1$ for all values of $x$. This also requires $g(x)>0$ whenever $\tilde\pi(x)>0$, as mentioned above.

Let’s now establish that the accepted samples indeed come from the distribution $\pi$ (asymptotically correct). Suppose that the act of accepting a sample is denoted as $A$. Therefore, for any given sample $s$, we can state that:

$$ \begin{align*} p(s|A) &= \frac{p(A|s)p(s)}{p(A)} = \frac{p_{\mathrm{accept}}(s)g(s)}{\int p_{\mathrm{accept}}(x)g(x) dx} = \frac{\tilde\pi(s)/M}{\int \tilde\pi(x)/M dx} = \frac{\tilde\pi(s)}{\int \tilde\pi(x) dx} = \pi(s) \label{AR}\tag{1} \end{align*} $$

The denominator $p(A)$ is the unconditional acceptance probability and is equal to $\frac{Z}{M}$, where $Z = \int \tilde\pi(x)\,\mathrm{d}x$ is the normalization constant of $\tilde\pi$ (for a more thorough explanation, please refer to the detailed derivation available at this wiki):

$$ \begin{align*} p(A) = \int p(A|x)p(x)dx = \int \frac{\tilde\pi(x)}{M} dx = \frac{Z}{M} \label{denominator}\tag{2} \end{align*} $$

The unconditional acceptance probability $p(A)$ tells us that we need on average $\frac{M}{Z}$ draws to get an accepted sample, which can be very inefficient when $M$ is very large. Especially when the random variable $x$ is high-dimensional, rejection sampling suffers from the “curse of dimensionality”³.

Moreover, in the acceptance-rejection method, each draw is independent and information from the last draw is discarded completely (e.g., we may want to explore the neighborhood of the last draw if it was accepted). It’s natural to consider adding some correlation between consecutive samples, and this is where Markov chains (or the Markov transition kernel $p(x'|x)$) come in.
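
The two-step algorithm and the acceptance rate $Z/M$ can be checked on a small example. The sketch below assumes the unnormalized target $\tilde\pi(x) = x(1-x)$ on $[0, 1]$ (so $Z = 1/6$), a uniform proposal $g(x) = 1$ and the bound $M = 1/4$; these choices and the function names are illustrative.

```python
import random


def pi_tilde(x):
    # Assumed unnormalized target on [0, 1]; its normalizer is Z = 1/6.
    return x * (1.0 - x)


M = 0.25  # upper bound on pi_tilde(x) / g(x), with g the Uniform(0, 1) density


def rejection_sample(n_samples, rng):
    samples, draws = [], 0
    while len(samples) < n_samples:
        x = rng.random()  # step 1: sample x from the proposal g
        draws += 1
        if rng.random() < pi_tilde(x) / (M * 1.0):  # step 2: accept w.p. pi_tilde / (M g)
            samples.append(x)
    return samples, n_samples / draws


rng = random.Random(0)
samples, accept_rate = rejection_sample(20_000, rng)
mean = sum(samples) / len(samples)
print(f"mean={mean:.3f}, acceptance rate={accept_rate:.3f}, Z/M={(1 / 6) / M:.3f}")
```

The empirical acceptance rate should be close to $Z/M = 2/3$, which means about $M/Z = 1.5$ proposals are needed per accepted sample in this easy one-dimensional case.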

Recap of Markov Chain

A (discrete-time) Markov chain is a sequence of random variables $X_1, X_2, X_3, …$ with the Markov property:

$$ \begin{align*} p(X_{t}=x_{t}| X_{t-1}=x_{t-1},\dots ,X_{0}=x_{0})=p(X_{t}=x_{t}| X_{t-1}=x_{t-1}) \label{Markov_property}\tag{3} \end{align*} $$

and a Markov process is uniquely defined by its transition probabilities $p(x'| x)$.

In our case, we always want the Markov chain to have a unique stationary distribution that is indeed $\pi$.

Markov Chain Properties

Prior $p(X_1)$ and transition probabilities $p(X_{t+1} | X_{t} )=p(x'| x)$ independent of $t$.
Ergodic Markov Chains: there exists a finite $t$ such that every state can be reached from every state in exactly $t$ steps.
Stationary Distributions: $\lim\limits_{T \rightarrow \infty} p\left(X_{T}=x\right)=\pi(x)$, which is independent of $p(X_1)$.

Detailed Balance Equation

A chain satisfies detailed balance⁴ if we have:

$$ \begin{align*} \pi(\mathbf{x}) p\left(\mathbf{x}^{\prime} | \mathbf{x}\right)=\pi\left(\mathbf{x}^{\prime}\right) p\left(\mathbf{x} | \mathbf{x}^{\prime}\right) \label{Detailed_Balance}\tag{4} \end{align*} $$

This means the in-flow to state $x'$ from $x$ (the probability of being in state $x$ times the transition probability from $x$ to $x'$) is equal to the out-flow from state $x'$ back to $x$, and vice versa (in other words, each elementary process is in equilibrium with its reverse process at stationarity). If a chain with transition probability $p\left(\mathbf{x}^{\prime} | \mathbf{x}\right)$ satisfies the above detailed balance, then $\pi(\mathbf{x})$ is its stationary distribution. It is worth mentioning that detailed balance is in general a sufficient but not necessary condition for stationarity.
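
To see why detailed balance implies stationarity, integrate both sides of eq(\ref{Detailed_Balance}) over $\mathbf{x}$ and use the fact that the transition kernel is normalized, $\int p(\mathbf{x} | \mathbf{x}^{\prime})\,\mathrm{d}\mathbf{x} = 1$:

$$ \begin{align*} \int \pi(\mathbf{x})\, p\left(\mathbf{x}^{\prime} | \mathbf{x}\right) \mathrm{d}\mathbf{x} = \int \pi\left(\mathbf{x}^{\prime}\right) p\left(\mathbf{x} | \mathbf{x}^{\prime}\right) \mathrm{d}\mathbf{x} = \pi\left(\mathbf{x}^{\prime}\right). \end{align*} $$

The left-hand side is the distribution after one step of the chain started from $\pi$, so $\pi$ is left unchanged by the transition.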

Metropolis-Hastings Methods

In the previous section, we learnt that the key ingredients of rejection sampling are the proposal and the decision rule. These principles remain applicable in MH techniques. In MH methods, instead of generating samples independently from the proposal $g$, the proposal (with Markov property) $g(x_{t+1}|x_t)$ generates a new candidate based on the current sample value (so the samples are correlated). We still need the decision rule to ensure that $\pi$ is the stationary distribution.

Again, let’s have a look at the algorithm first, which is similar to rejection sampling. For each current state $x_t$:

sample $x'$ from proposal $g(x'|x=x_t)$.

accept $x_{t+1}=x'$ with probability $p_{\mathrm{accept}}(x') := p_{\mathrm{accept}}(x'|x)$, set $x_{t+1}=x_{t}$ with probability $1-p_{\mathrm{accept}}(x')$.

where the probability $p_{\mathrm{accept}}(x')$ is defined as:

$$ \begin{align*} p_{\mathrm{accept}}(x') = \min (1,\alpha) =\min \left(1, \frac{\tilde\pi\left(x^{\prime}\right) g\left(x | x^{\prime}\right)}{\tilde\pi(x) g\left(x^{\prime} | x\right)}\right) \label{accept_probability}\tag{5} \end{align*} $$

The transition probability of the Markov chain defined in the MH algorithm can be written as:

$$ \begin{align*} p(x'|x)=\begin{cases} g(x'|x)p_{\mathrm{accept}}(x'|x), & \text{if } x'\neq x \\ g(x|x)+\sum_{x'\neq x}g(x'|x)(1-p_{\mathrm{accept}}(x'|x)), & \text{otherwise} \end{cases} \label{transition_kernel}\tag{6} \end{align*} $$

We need to ascertain that our proposal along with the decision rule satisfies the detailed balance equation (\ref{Detailed_Balance}). To accomplish this, we can verify (left as an exercise) that the left and right sides of the equation are equal using the target distribution $\pi$ and the transition probability $p(x'|x)$, which is defined in eq(\ref{transition_kernel}).
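
For readers who want a hint, here is a sketch of the argument for $x' \neq x$ (the case $x' = x$ is trivial). The normalization constant cancels in the ratio $\alpha$, so $\tilde\pi$ can be replaced by $\pi$, and then:

$$ \begin{align*} \pi(x)\, g(x'|x)\, \min\left(1, \frac{\pi(x')\, g(x|x')}{\pi(x)\, g(x'|x)}\right) = \min\bigl(\pi(x)\, g(x'|x),\; \pi(x')\, g(x|x')\bigr). \end{align*} $$

The right-hand side is symmetric under swapping $x$ and $x'$, hence $\pi(x)\,p(x'|x) = \pi(x')\,p(x|x')$.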

Rejection Sampling as Special Case

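When the proposal $g(x'|x)$ is independent of the previous state $x_t$ (in other words, $g(x'|x) = g(x')$ is an independence sampler), we obtain a method that shares the propose-then-accept structure of the rejection sampling introduced at the beginning, although a rejected proposal here repeats the current state instead of being discarded.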

⁵This should be indeed called Random Walk Metropolis, as the Hastings correction will be introduced when the $g(x'|x)$ is asymmetric.

⁶You can test your intuition of high dimensions in lecture note 1, and check the argument why MH is inefficient in high dimensions in note 4 from this course. Also have a look at this blog by Stanislav Fort to get some intuition about cubes and balls.

⁷This blog by Richard McElreath presents a demo on why MH fails in high dimensions and motivates the usage of Hamiltonian Monte Carlo, where a vector field aligned with the typical set counteracts the attraction of the mode.

⁸This blog and this paper by Michael Betancourt study counterintuitive behaviors of high-dimensional spaces which can explain why Metropolis does not scale well enough to high dimensions.

⁹You can find a 3D version illustration in this blog by Alex Rogozhnikov.

Random Walk Metropolis-Hastings⁵

When the proposal $g(x'|x)$ is Gaussian, we get random walk Metropolis-Hastings (RWM), where the new state is a random walk based on the previous state. This proposal can be written as follows:

$$ \begin{align*} g(x'|x) = \mathcal{N}(x';x,\tau^2\mathbf{I}) \label{Random_Walk}\tag{7} \end{align*} $$

where $\tau$ is a scale factor chosen to facilitate rapid mixing, often referred to as the random walk step size. This step size plays a crucial role in balancing the trade-off between exploration, which is essential for covering the distribution, and maintaining an acceptable acceptance rate.

RR01b prove that, if the (target) posterior is Gaussian, the asymptotically optimal value is to use $\tau^2 = 2.38^2/D$, where $D$ is the dimensionality of $x$; this results in an acceptance rate of 0.234, which (in this case) is the optimal tradeoff …

—— Probabilistic Machine Learning: Advanced Topics by Kevin Patrick Murphy. MIT Press, 2023.

Since the proposal $g(x'|x)$ is now symmetric with $g(x'|x)=g(x|x')$, the acceptance probability simplifies to $p_{\mathrm{accept}}(x') = \min \left(1, \frac{\tilde\pi\left(x^{\prime}\right)}{\tilde\pi(x)}\right)$. This simplified acceptance probability is more easily understood: if $x'$ is more probable than $x$, we unquestionably transition there; otherwise, we may still move there by chance, depending on the relative probabilities.

However, in high-dimensional space⁶, this classical MH algorithm is not applicable, as the chain will end up not moving at all for a long time⁷^,⁸. This is due to the concentration of measure: most of the probability mass concentrates around the typical set, a narrow shell-shaped region away from the (high-density) mode. Almost all proposed jumps land outside the typical set (outside the shell and away from the mode, because the ratio between the volumes of the n-ball and the n-cube tends to 0⁶, so there is almost zero probability of falling inside the ball from the sphere) and are thus rejected by mode-seeking algorithms such as MH.
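
The simplified rule is easy to try out. The following sketch runs random walk Metropolis in one dimension, assuming an unnormalized Gaussian target centered at 3 and the step size $\tau = 2.38$ suggested by the scaling rule quoted above for $D = 1$; the target and the function names are illustrative.

```python
import math
import random


def log_pi_tilde(x):
    # Assumed unnormalized log target: a Gaussian centered at 3 with unit variance.
    return -0.5 * (x - 3.0) ** 2


def rwm(x0, tau, n_steps, rng):
    x, chain, accepted = x0, [], 0
    for _ in range(n_steps):
        prop = rng.gauss(x, tau)  # symmetric proposal N(x'; x, tau^2)
        log_ratio = log_pi_tilde(prop) - log_pi_tilde(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):  # accept w.p. min(1, ratio)
            x = prop
            accepted += 1
        chain.append(x)  # a rejected proposal repeats the current state
    return chain, accepted / n_steps


rng = random.Random(0)
chain, acc_rate = rwm(x0=0.0, tau=2.38, n_steps=50_000, rng=rng)
kept = chain[5_000:]
mean = sum(kept) / len(kept)
var = sum((v - mean) ** 2 for v in kept) / len(kept)
print(f"mean={mean:.3f}, var={var:.3f}, acceptance={acc_rate:.2f}")
```

Working with log-densities avoids overflow in the ratio, and only the unnormalized density is ever evaluated.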

Figure 1: Visual illustration of random walk Metropolis-Hastings in 2D space⁹.

Metropolis-Adjusted Langevin Algorithm (MALA)

The well-known Metropolis-Adjusted Langevin Algorithm, which we will revisit in the next post, can be considered as MH with improved proposals:

$$ \begin{align*} g(x'|x) = \mathcal{N}(x';x+\tau \nabla \log\tilde{\pi}(x),2\tau I) \label{MALA}\tag{8} \end{align*} $$

where the additional term $\tau \nabla \log\tilde{\pi}(x)$ drives the samples toward areas with high probability density, which makes the proposed samples more likely to be accepted.

Remarks

In this post we mainly discussed the basic idea of MCMC methods, especially the propose-accept scheme in the Metropolis-Hastings algorithm. In the next posts, we will explore a more intuitive family of methods, the dynamics-based MCMC.

Bayesian Posterior Sampling (1) Introduction

2023-05-25T00:00:00+02:00

The last post was written around one year ago, when I decided to switch my semester project topic from style transfer with normalizing flow to image restoration with diffusion models.

In my current engineering-oriented master’s thesis, I find myself longing for the elegance of the theory of diffusion models. As a solution, I have made the decision to dedicate my spare time to learning Bayesian sampling (sampling methods for Bayesian inference).

Hence I will write a series of posts to record this learning process and to improve my understanding of this topic by reorganizing my knowledge. In this post, I will start with some basics and a brief introduction to this topic.

Introduction

Bayesian Inference Problem & Challenges

Figure 1: Visual Representation of Bayes' Theorem.

In the probabilistic approach to machine learning, all unknown quantities — be they predictions about the future, hidden states of a system, or parameters of a model — are treated as random variables, and endowed with probability distributions. The process of inference corresponds to computing the posterior distribution over these quantities, conditioning on whatever data is available.

—— Probabilistic Machine Learning: Advanced Topics by Kevin Patrick Murphy. MIT Press, 2023.

To be specific, we assume the dependence between the unknown latent random variable $z$ and the available data $x$ is probabilistic, and what we want is to estimate $z$ given $x$ through $p(z| x)$, which we call the posterior. Most of the time we only have some prior knowledge about $z$ as $p(z)$ and the likelihood model $p(x|z)$, or equivalently the deep latent variable model $p_\theta(x, z)=p_\theta(z)p_\theta(x|z)$ through Bayesian modeling, where $\theta$ can be estimated by maximizing the likelihood (maximizing the expected log-likelihood is equivalent to minimizing the Kullback-Leibler (KL) divergence between $p_{\mathrm{data}}(\mathbf{x})$ and $p_{\theta}(\mathbf{x})$):

$$ \begin{align} -\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}(\mathbf{x})}\left[\log p_\theta(\mathbf{x})\right] &= D_{K L}(p_{\mathrm{data}}(\mathbf{x})\lVert p_{\theta}(\mathbf{x}))\underbrace{-\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}(\mathbf{x})} \left[ \log p_{\mathrm{data}}(\mathbf{x})\right]}_{\text{constant}} \\ &\approx -\frac{1}{N}\sum_{n=1}^{N}\log p_\theta(x_{n}) = -\frac{1}{N}\sum_{n=1}^{N}\log \left[\int p_\theta(z_{n})p_\theta(x_{n}|z_{n})\mathrm{d}z_{n}\right] \label{likelihood}\tag{1} \end{align} $$

where the constant term is the entropy of the data distribution. While the entropy of the empirical data distribution goes to infinity, it is independent of $\theta$ and can be ignored in optimization.

We can compute the posterior $p_{\theta}(z| x)$ using Bayes’s rule (see Figure 1 for visual illustration):

$$ \begin{align} p_{\theta}(z|x)&=\frac{p_{\theta}(x|z)p_{\theta}(z)}{p_{\theta}(x)}\label{bayes}\tag{2} \\ \end{align} $$

where the evidence term (also called the marginal likelihood), serving as a normalization constant in the denominator, can be formulated as:

$$ \begin{align} p_{\theta}(x)=\int p_{\theta}(x,z)dz=\int p_{\theta}(x|z)p_{\theta}(z)dz\label{evidence}\tag{3} \\ \end{align} $$

Beyond point estimation (MLE, MAP), we can use the posterior distribution to get posterior expectations of any function $f(z)$, such as the mean and marginals. For instance, we can predict a new output in Bayesian linear regression, where $w$ represents the coefficients whose posterior distribution we aim to estimate given the data: $p(y^\star|x^\star, X, y) =\int p(y^\star|x^\star, w)p(w | X,y)dw$.

However, this integral is usually analytically intractable¹ to calculate or evaluate, which leads to intractable posterior, and most Bayesian inference requires numerical approximation of such intractable integrals.

Two Approaches: Variational Inference & (MCMC) Sampling

In this post, I go through the two primary methodologies utilized to address the problem of Bayesian inference: Variational Inference (VI) and Markov Chain Monte Carlo (MCMC). I will strive to cover the most significant concepts associated with VI, and also provide a brief introduction to MCMC.

Variational Inference

Variational inference (VI) is a method in machine learning that approximates complex probability distributions by finding the most similar, simpler and hence tractable distribution $q(z)$ from a specified family $\mathcal{Q}$, thereby enabling efficient computation and handling of uncertainty.

Most of the time when referring to Variational Inference (VI), we are discussing parametric VI. In parametric VI, we use a parameter $\phi$ to represent the variational distribution $q_\phi(z|{x})$. There is another type of VI called particle-based VI, which utilizes a set of particles $\{z^{(i)}\}_{i=1}^{N}$ to represent the variational distribution $q(z|{x})$.

Figure 2: Illustration of Variational Inference.

The main idea of variational methods is to cast inference as an optimization problem. The goal of VI is to approximate an intractable probability distribution, so as to find $q_{\phi} \in \mathcal{Q}$ that minimize some discrepancy $D$ (here we use the KL divergence) between $q_{\phi}({z}|{x})$ and $p_{\theta}({z}|{x})$:

$$ \begin{align} q^{\star}_{\phi} = \underset{q_{\phi}\in \mathcal{Q}}{\operatorname{\arg\min }} D_{K L}(q_{\phi}({z}|{x})\lVert p_{\theta}({z}|{x}))\label{KL}\tag{4} \\ \end{align} $$

The challenge here is that we still don’t know the true posterior $p_{\theta}({z}|{x})$, and the KL divergence is intractable to compute. Luckily, we can rewrite the KL divergence in a way that makes it easier to optimize²:

$$ \begin{align*} D_{K L}(q_{\phi}({z}|{x})\lVert p_{\theta}({z}|{x}))&=\mathbb{E}_{q_{\phi}({z}|{x})}\left[\log\left[\frac{q_{\phi}({z}|{x})}{p_{\theta}({z}|{x})}\right]\right] \\ &=\mathbb{E}_{q_{\phi}({z}|{x})}\left[\log\left[\frac{q_{\phi}({z}|{x})p_{\theta}({x})}{p_{\theta}({x},{z})}\right]\right] \\ &=\mathbb{E}_{q_{\phi}({z}|{x})}\left[\log\left[\frac{q_{\phi}({z}|{x})}{p_{\theta}({x},{z})}\right]\right] + \mathbb{E}_{q_{\phi}({z}|{x})}\left[\log p_{\theta}({x})\right]\\ &=\mathbb{E}_{q_{\phi}({z}|{x})}\left[\log\left[\frac{q_{\phi}({z}|{x})}{p_{\theta}({x},{z})}\right]\right] + \log p_{\theta}({x})\label{KL_derivation}\tag{5}\\ \end{align*} $$

where the log evidence $\log p_\theta({x})$ does not change with the choice of $\phi$ during variational inference.

By convention, we rearrange this as:

$$ \begin{align*} \log p_\theta({x})=\mathbb{E}_{q_{\phi}({z}|{x})}\left[\log\left[\frac{p_{\theta}({x},{z})}{q_{\phi}({z}|{x})}\right]\right]+\underbrace{D_{K L}(q_{\phi}({z}|{x})\lVert p_{\theta}({z}|{x}))}_{\geq 0} \label{evidence2}\tag{6} \end{align*} $$

Since the KL divergence between $q_{\phi}({z}|{x})$ and $p_{\theta}({z}|{x})$ is non-negative, the first term on the RHS of eq(\ref{evidence2}) is a lower bound on the log evidence $\log p_\theta({x})$. It is called the variational lower bound, also known as the Evidence Lower BOund (ELBO):

$$ \begin{align*} \mathcal{L}_{\theta,\phi}({x})=\mathbb{E}_{q_{\phi}({z}|{x})}\left[\log p_{\theta}({x},{z})-\log q_{\phi}({z}|{x})\right]\label{ELBO}\tag{7} \end{align*} $$

Therefore, the common mission of finding the optimal $q_{\phi}({z}|{x})$ that minimizes the KL divergence (i.e., best approximates $p_{\theta}({z}|{x})$) is equivalent to maximizing the ELBO, without having to worry about the evidence term hidden in $p_{\theta}({z}|{x})$. We can optimize it w.r.t. both ${\phi}$ and ${\theta}$ (when $\theta$ is unknown) in algorithms such as variational EM.

Since $\theta$ denotes the model parameters, optimizing $\theta$ is essential for learning the underlying model that generates the data. When $\theta$ is tunable, we can jointly optimize $\phi$ and $\theta$ to maximize the ELBO. This joint optimization can help in finding a more accurate posterior approximation ($\phi$) and model parameters that better explain the data ($\theta$ that gives a tighter ELBO on the likelihood).
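To make the decomposition in eq(\ref{evidence2}) tangible, here is a minimal numerical sketch on a toy conjugate model where everything is available in closed form. The model ($z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z,1)$), the Gaussian variational family $q(z)=\mathcal{N}(m,s^2)$, and all function names are my own assumptions for illustration; they do not come from a specific library.

```python
import math

# Toy model (an assumption for illustration): z ~ N(0, 1), x | z ~ N(z, 1).
# Then p(x) = N(0, 2) and the true posterior is p(z | x) = N(x / 2, 1 / 2).

def log_evidence(x):
    """log p(x) for the toy model, i.e. log N(x; 0, 2)."""
    return -0.5 * math.log(4.0 * math.pi) - x * x / 4.0

def elbo(x, m, s):
    """ELBO for q(z) = N(m, s^2): expected log joint plus entropy of q."""
    expected_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m * m + s * s)
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s * s)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s * s)
    return expected_log_prior + expected_log_lik + entropy

def kl_to_posterior(x, m, s):
    """KL(q || p(z | x)) between two univariate Gaussians."""
    post_mean, post_var = x / 2.0, 0.5
    return (0.5 * math.log(post_var / (s * s))
            + (s * s + (m - post_mean) ** 2) / (2.0 * post_var) - 0.5)

x, m, s = 1.3, 0.2, 0.9
gap = log_evidence(x) - (elbo(x, m, s) + kl_to_posterior(x, m, s))
print("log p(x)      :", log_evidence(x))
print("ELBO          :", elbo(x, m, s))
print("KL(q || post) :", kl_to_posterior(x, m, s))
print("residual      :", gap)

# When q equals the true posterior, the KL vanishes and the bound is tight.
tight_elbo = elbo(x, x / 2.0, math.sqrt(0.5))
print("ELBO at q = posterior:", tight_elbo)
```

The residual should be zero up to floating-point error for any choice of $(m, s)$, and the ELBO never exceeds $\log p(x)$; it only touches it when $q$ is the true posterior.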

²We can also get the same ELBO starting from the KL divergence between joint distributions $D_{KL}(q_{\phi}({x},{z})\lVert p_{\theta}({x},{z}))$, see this blog by Jianlin Su and this blog by Alex Alemi.

³It's recommended to read this blog by Casey Chu and this famous paper ELBO surgery by Matthew D. Hoffman and Matthew J. Johnson for more perspectives on the ELBO.

⁴For VAE, there is a great introduction by D.P. Kingma and Max Welling.

⁵For a comprehensive understanding of advanced concepts in VAE such as the $\beta$-VAE, I highly recommend reading Lilian Weng's blog.

⁶This great blog by Alex Alemi also derives the diffusion loss through variational perspective for those who are interested in diffusion models.

We can rewrite the ELBO as follows³:

$$ \begin{align*} \mathcal{L}_{\theta,\phi}({x}) &= \underbrace{\mathbb{E}_{q_{\phi}({z}|{x})}[\log p_{\theta}({x},{z})]}_{\text{expected log joint}}+\underbrace{\mathbb{H}(q_{\phi}({z}|{x}))}_{\text{entropy}}\label{ELBO2}\tag{8} \end{align*} $$

Additionally, the ELBO can be reorganized and interpreted as follows in the Variational AutoEncoder (VAE)⁴,⁵, where $\phi$ and $\theta$ represent the encoder and decoder, respectively:

$$ \begin{align*} \mathcal{L}_{\theta,\phi}({x}) &= -[\underbrace{\mathbb{E}_{q_{\phi}({z}|{x})}[-\log p_{\theta}({x}|{z})]}_{\text{expected negative log likelihood}}+\underbrace{D_{K L}(q_{\phi}({z}|{x})\lVert p_{\theta}({z}))}_{\text{KL from posterior to prior}}] \\ &=-\mathbb{E}_{q_{\phi}({z}|{x})}[\underbrace{-\log p_{\theta}({x}|{z})}_{\text{reconstruction error}}+\underbrace{\log q_{\phi}({z}|{x})-\log p_{\theta}({z})}_{\text{regularization (align) terms}}] \label{ELBO3}\tag{9}\\ \end{align*} $$

Suppose $p_{\theta}({x}|{z})=\mathcal{N}(\mu_\theta(z),\sigma^2)$, and $q_{\phi}({z}|{x})$ is a deterministic mapping $\psi_{\phi}(x)$. The first term is then a reconstruction error, proportional to $\lVert x-\mu_\theta(\psi_{\phi}(x))\rVert^2$. The training objective on a given dataset is hence⁶:

$$ \begin{align*} \mathbb{E}_{{x}\sim p_{\mathrm{data}}({x})}[\mathcal{L}_{\theta,\phi}({x})] = -\mathbb{E}_{{x}\sim p_{\mathrm{data}}({x})}\mathbb{E}_{{z}\sim q_{\phi}({z}|{x})}[-\log p_{\theta}({x}|{z})+\log q_{\phi}({z}|{x})-\log p_{\theta}({z})] \label{objective}\tag{10} \end{align*} $$
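To make the reconstruction term concrete, here is a short worked expansion under the Gaussian decoder assumed above. The isotropic covariance $\sigma^2 I$ and the data dimension $d$ are my assumptions for this sketch; the text does not spell them out.

$$ \begin{align*} -\log p_{\theta}({x}|{z}) &= \frac{1}{2\sigma^2}\lVert x-\mu_\theta(z)\rVert^2+\frac{d}{2}\log(2\pi\sigma^2) \end{align*} $$

Plugging in $z=\psi_\phi(x)$, the only part that depends on $(\theta,\phi)$ is the squared error $\lVert x-\mu_\theta(\psi_{\phi}(x))\rVert^2$, so for a fixed $\sigma$, maximizing the expected log likelihood amounts to minimizing the usual squared reconstruction error.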

The optimization of the above equation (\ref{objective}) usually involves taking the gradient w.r.t. $\phi$, which is more difficult because the expectation itself is taken over $q_\phi$, so we cannot simply swap the gradient and the expectation as we do when taking the gradient w.r.t. $\theta$. To resolve this issue, we can use methods such as the score function estimator and the reparametrization trick.
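The two estimators mentioned above can be compared on a toy problem. The sketch below estimates $\nabla_\mu \mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[z^2]$, whose exact value is $2\mu$; here $\mu$ plays the role of the variational parameter $\phi$. The test function $f(z)=z^2$, the sample size, and the helper names are assumptions chosen only for illustration.

```python
import random
import statistics

def f(z):
    """Toy integrand (an assumption): E[f(z)] = mu^2 + sigma^2 under N(mu, sigma^2)."""
    return z * z

def reparam_grad_samples(mu, sigma, n, rng):
    """Reparametrization trick: write z = mu + sigma * eps with eps ~ N(0, 1),
    so the gradient moves inside the expectation: d f(z) / d mu = 2 * z."""
    samples = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        samples.append(2.0 * z)
    return samples

def score_grad_samples(mu, sigma, n, rng):
    """Score function estimator: f(z) * d/dmu log N(z; mu, sigma^2),
    where the score w.r.t. mu is (z - mu) / sigma^2."""
    samples = []
    for _ in range(n):
        z = rng.gauss(mu, sigma)
        samples.append(f(z) * (z - mu) / sigma ** 2)
    return samples

rng = random.Random(0)
mu, sigma, n = 1.0, 1.0, 20000
rp = reparam_grad_samples(mu, sigma, n, rng)
sf = score_grad_samples(mu, sigma, n, rng)

print("exact gradient        :", 2.0 * mu)
print("reparametrization     :", statistics.fmean(rp), "variance", statistics.pvariance(rp))
print("score function        :", statistics.fmean(sf), "variance", statistics.pvariance(sf))
```

Both estimators are unbiased, but in this toy setting the per-sample variance of the score function estimator is noticeably larger, which is one common reason the reparametrization trick is preferred when it is applicable.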

Markov Chain Monte Carlo

Unlike VI, which solves inference via optimization, MCMC tackles it via sampling. More specifically, Monte Carlo methods draw a sufficient number of samples from the posterior distribution to estimate it accurately. However, it is almost always impossible to sample from the posterior directly. As a solution, MCMC simulates a Markov chain whose stationary distribution is $p_{\theta}({z}|{x})$, in the hope of fast convergence. And guess what, we only need an unnormalized probability density (e.g., $p(x,z)$) to simulate the chain!

Optimization: find the minimum $\min_{x\in \mathbb{R}^d} U(x)$

Sampling: draw samples from the density $\pi(x)\propto e^{-U(x)}$

A more general problem setting is: sampling (=generating new examples) from a target distribution $\pi$ over $\mathbb{R}^d$ whose density is known up to an intractable normalization constant $Z$⁷:

$$ \begin{align*} \pi(x) &= \frac{1}{Z}\tilde{\pi}(x)= \frac{\exp(-\beta U(x))}{Z} \label{target_dist}\tag{11}\\ \end{align*} $$

where $\tilde{\pi}$ is the known unnormalized distribution, $\beta$ is an arbitrary positive constant akin to an inverse temperature, and $U(\cdot)$ can be treated as an energy function. To make the notation consistent, we now rewrite the problem (\ref{KL}) as:

$$ \begin{align*} \pi^\star = \underset{\mu\in \mathcal{P}_2(\mathbb{R}^d)}{\operatorname{\arg\min }} D(\mu \lVert \pi) := \mathcal{F}_\pi(\mu)\label{KL_MCMC}\tag{12} \\ \end{align*} $$

where $D$ is a dissimilarity functional such as the KL divergence, and $\mathcal{F}_{\pi}(\mu)$ is shorthand for $D(\mu \lVert \pi)$. We can approximate the integral $\int f(\cdot) d\pi$ of any function $f(\cdot)$ with samples from the Markov chain as $\frac{1}{n}\sum_{i=b}^{b+n-1}f(x_i)$, where $b$ and $n$ are sufficiently large integers, and $b$ is called the mixing time or burn-in time. Note that the first $b$ samples of the chain are discarded because they do not come from the stationary distribution; reducing the burn-in time is one of the most important factors in securing fast convergence.
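The following is a minimal sketch of the recipe above: simulate a chain using only the unnormalized density $\exp(-U(x))$, discard the first $b$ samples, and average $f$ over the rest. I use plain random-walk Metropolis here because it is one of the simplest MCMC schemes. The energy function (a quadratic, so the target is $\mathcal{N}(2,1)$ with $\beta=1$), the step size, and the burn-in length are all assumptions picked for illustration.

```python
import math
import random

def energy(x):
    """Hypothetical energy U(x); exp(-U(x)) is an unnormalized N(2, 1) density."""
    return 0.5 * (x - 2.0) ** 2

def random_walk_metropolis(energy_fn, x_init, n_steps, step_size, rng):
    """Random-walk Metropolis: only energy differences are needed,
    so the normalization constant Z never appears."""
    x = x_init
    u = energy_fn(x)
    chain = []
    for _ in range(n_steps):
        proposal = x + step_size * rng.gauss(0.0, 1.0)
        u_prop = energy_fn(proposal)
        # Accept with probability min(1, exp(-(U(proposal) - U(x)))).
        if rng.random() < math.exp(min(0.0, u - u_prop)):
            x, u = proposal, u_prop
        chain.append(x)
    return chain

rng = random.Random(0)
burn_in, n = 1000, 20000          # b and n in the text's notation
chain = random_walk_metropolis(energy, x_init=-10.0,
                               n_steps=burn_in + n, step_size=1.0, rng=rng)
kept = chain[burn_in:]            # drop the first b samples

# Monte Carlo estimates of E[f(x)] for f(x) = x and f(x) = (x - mean)^2.
est_mean = sum(kept) / len(kept)
est_var = sum((x - est_mean) ** 2 for x in kept) / len(kept)
print("estimated mean    :", est_mean)
print("estimated variance:", est_var)
```

The chain is deliberately started far from the mode at $x=-10$, so the early samples are clearly not from the stationary distribution; this is exactly what the burn-in is meant to remove.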

The ultimate goal of this series of posts is exactly to learn various MCMC techniques to sample from $\pi^\star$!
It’s also highly recommended to try out this great website for fantastic MCMC animations first.

The future topics should include:
