diff --git a/_posts/2025-08-18-diff-distill.md b/_posts/2025-08-18-diff-distill.md
new file mode 100644
index 0000000..9b46cb6
--- /dev/null
+++ b/_posts/2025-08-18-diff-distill.md
@@ -0,0 +1,386 @@
+---
+layout: distill
+title: A Unified Framework for Diffusion Distillation
+description: The explosive growth in one-step and few-step diffusion models has taken the field deep into the weeds of complex notations. In this blog, we cut through the confusion by proposing a coherent set of notations that reveal the connections among these methods.
+tags: generative-models diffusion flows
+giscus_comments: true
+date: 2025-08-21
+featured: true
+
+authors:
+ - name: Yuxiang Fu
+ url: "https://felix-yuxiang.github.io/"
+ affiliations:
+ name: UBC
+
+bibliography: 2025-08-18-diff-distill.bib
+
+# Optionally, you can add a table of contents to your post.
+# NOTES:
+# - make sure that TOC names match the actual section names
+# for hyperlinks within the post to work correctly.
+# - we may want to automate TOC generation in the future using
+# jekyll-toc plugin (https://github.com/toshimaru/jekyll-toc).
+toc:
+  - name: Introduction
+  - name: Notation at a Glance
+  - name: ODE Distillation methods
+    subsections:
+      - name: MeanFlow
+      - name: Consistency Models
+      - name: Flow Anchor Consistency Model
+      - name: Align Your Flow
+  - name: Connections
+    subsections:
+      - name: Shortcut Models
+      - name: ReFlow
+      - name: Inductive Moment Matching
+  - name: Closing Thoughts
+
+# Below is an example of injecting additional post-specific styles.
+# If you use this post as a template, delete this _styles block.
+# _styles: >
+# .fake-img {
+# background: #bbb;
+# border: 1px solid rgba(0, 0, 0, 0.1);
+# box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
+# margin-bottom: 12px;
+# }
+# .fake-img p {
+# font-family: monospace;
+# color: white;
+# text-align: left;
+# margin: 12px 0;
+# text-align: center;
+# font-size: 16px;
+# }
+---
+
+## Introduction
+
+Diffusion and flow-based models have taken over the generative AI space, enabling unprecedented capabilities in video, audio, and text generation. Nonetheless, there is a caveat⚠️ --- they are painfully **slow** at inference time. Generating a single high-quality sample requires running hundreds of denoising steps, which translates to high costs and long wait times.
+
+At their core, diffusion models (equivalently, flow matching models) operate by iteratively refining noisy data into high-quality outputs through a series of denoising steps. Similar to divide-and-conquer algorithms (common examples include Mergesort, median finding, and the Fast Fourier Transform), diffusion models first *divide* the difficult denoising task into subtasks and *conquer* one of them at a time during training. To obtain a sample, however, we must make a sequence of recursive predictions, which means *conquering* the entire task end-to-end.
+
+This challenge has spurred research into acceleration strategies at multiple levels of granularity, including hardware optimization, mixed precision training, [quantization](https://github.com/bitsandbytes-foundation/bitsandbytes), and parameter-efficient fine-tuning. In this blog, we focus on an orthogonal approach, **ODE distillation**, which minimizes the number of function evaluations (NFEs) so that we can generate high-quality samples with as few denoising steps as possible.
+
+Distillation, in general, is a technique that transfers knowledge from a complex, high-performance model (the *teacher*) to a more efficient, customized model (the *student*). Recent distillation methods have achieved remarkable reductions in sampling steps, from hundreds to just a few and even a **single** step, while preserving sample quality. This advancement paves the way for real-time applications and deployment in resource-constrained environments.
+
+
+ {% include video.liquid path="blog/2025/diff-distill/diff-distill.mp4" class="img-fluid rounded z-depth-1" controls=true autoplay=true loop=true%}
+
+
+
+
+## Notation at a Glance
+
+
+ {% include figure.liquid loading="eager" path="blog/2025/diff-distill/teaser_probpath_velocity_field.png" class="img-fluid rounded z-depth-1" %}
+
+
+
+From left to right: conditional and marginal probability paths, and conditional and marginal velocity fields. The velocity field induces a flow by dictating the instantaneous movement at every point in space.
+
+
+Modern approaches to generative modelling draw samples from a base distribution $$\mathbf{x}_1\sim p_{\text{noise}}$$, typically an isotropic Gaussian, and learn a map that transports them to $$\mathbf{x}_0\sim p_{\text{data}}$$. The connection between these two distributions can be expressed as an initial value problem governed by the **velocity field** $$v(\mathbf{x}_t, t)$$,
+
+$$
+\require{physics}
+\begin{equation}
+    \dv{\psi_t(\mathbf{x}_0)}{t}=v(\psi_t(\mathbf{x}_0), t),\quad\psi_0(\mathbf{x}_0)=\mathbf{x}_0,\quad \mathbf{x}_0\sim p_{\text{data}} \label{eq:1}
+\end{equation}
+$$
+
+where the **flow** $$\psi:\mathbb{R}^d\times[0,1]\to \mathbb{R}^d$$ is a time-indexed diffeomorphic map, with $$\psi_t(\mathbf{x}_0)$$ defined as the solution to the ODE (\ref{eq:1}). If the flow satisfies the push-forward equation $$p_t=[\psi_t]_\#p_0$$ (also known as the change-of-variables formula, $$[\psi_t]_\# p_0(\mathbf{x}) = p_0(\psi_t^{-1}(\mathbf{x})) \det \left[ \frac{\partial \psi_t^{-1}}{\partial \mathbf{x}}(\mathbf{x}) \right]$$), we say the **probability path** $$(p_t)_{t\in[0,1]}$$ is generated by the velocity field. The goal of flow matching is to find a velocity field $$v_\theta(\mathbf{x}_t, t)$$ that transports $$\mathbf{x}_1\sim p_{\text{noise}}$$ to $$\mathbf{x}_0\sim p_{\text{data}}$$ when integrated. To receive supervision at each time step, one must predefine a conditional probability path $$p_t(\cdot \vert \mathbf{x}_0)$$ together with its associated velocity field; in practice, the most common choice is the Gaussian conditional probability path, which arises from a Gaussian conditional vector field whose analytical form can be derived from the continuity equation $$\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v) = 0$$ (see the table below). For each datapoint $$\mathbf{x}_0\in \mathbb{R}^d$$, let $$v(\mathbf{x}_t, t\vert\mathbf{x}_0)=\mathbb{E}_{p_t(v_t \vert \mathbf{x}_0)}[v_t]$$ denote the conditional velocity field, so that the corresponding ODE (\ref{eq:1}) yields the conditional flow.
+
+Most conditional probability paths are designed, for simplicity, as **differentiable** interpolations between noise and data, so a sample from the marginal path can be written as
+$$\mathbf{x}_t = \alpha(t)\mathbf{x}_0 + \beta(t)\mathbf{x}_1,$$ where $$\alpha(t), \beta(t)$$ are predefined schedules. The stochastic interpolant framework defines a probability path that subsumes all diffusion models under several assumptions; here, we use a simpler interpolant for clean illustration.
+
+
+
+We provide some popular instances of these schedules in the table below (we deliberately leave out diffusion models with SDE formulations, such as DDPM or Score SDE, since we concentrate on ODE distillation in this blog).
+
+| Method | Probability Path $$p_t$$ | Vector Field $$v(\mathbf{x}_t, t\vert\mathbf{x}_0)$$ |
+|--------|---------------------------|------------------------------|
+| Gaussian |$$\mathcal{N}(\alpha(t)\mathbf{x}_0,\beta^2(t)I_d)$$ | $$\left(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\alpha_t\right) \mathbf{x}_0 + \frac{\dot{\beta}_t}{\beta_t}\mathbf{x}_t$$|
+| FM | $$\mathcal{N}(t\mathbf{x}_1, (1-t+\sigma t)^2I_d)$$ | $$\frac{\mathbf{x}_1 - (1-\sigma)\mathbf{x}_t}{1-(1-\sigma)t}$$ |
+| iCFM | $$\mathcal{N}( t\mathbf{x}_1 + (1-t)\mathbf{x}_0, \sigma^2I_d)$$ | $$\mathbf{x}_1 - \mathbf{x}_0$$ |
+| OT-CFM | Same prob. path as iCFM with $$q(z) = \pi(\mathbf{x}_0, \mathbf{x}_1)$$ | $$\mathbf{x}_1 - \mathbf{x}_0$$ |
+| VP-SI | $$\mathcal{N}( \cos(\pi t/2)\mathbf{x}_0 + \sin(\pi t/2)\mathbf{x}_1, \sigma^2I_d)$$ | $$\frac{\pi}{2}(\cos(\pi t/2)\mathbf{x}_1 - \sin(\pi t/2)\mathbf{x}_0)$$ |
+
+The simplest conditional probability path is the linear interpolation $$\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$$, whose conditional velocity field is the OT target $$v(\mathbf{x}_t, t \vert \mathbf{x}_0)=\mathbb{E}[\dot{\mathbf{x}}_t\vert \mathbf{x}_0]=\mathbf{x}_1- \mathbf{x}_0.$$
+
+Following this [slide](https://rectifiedflow.github.io/assets/slides/icml_07_distillation.pdf) from ICML 2025, the objectives of ODE distillation can be categorized into three cases: (a) **forward loss**, (b) **backward loss**, and (c) **self-consistency loss**.
+
+
+
+Training: Since minimizing the conditional Flow Matching (FM) loss is equivalent to minimizing the marginal FM loss, the optimization problem becomes
+
+$$
+\arg\min_\theta\mathbb{E}_{\mathbf{x}_0, \mathbf{x}_1, t}
+\left[ w(t) \left\| v_\theta(\mathbf{x}_t, t) - v(\mathbf{x}_t, t | \mathbf{x}_0) \right\|_2^2 \right]
+$$
+
+where $$w(t)$$ is a reweighting function that modulates the contribution of the loss at each time step. This is necessary because the nature of the task differs fundamentally between high and low noise levels, which calls for a balanced treatment of the loss across these regimes; some common choices are covered in [this blog](https://diffusionflow.github.io/).
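+
+To make the objective concrete, here is a minimal JAX sketch of the conditional FM loss under the linear interpolant $$\mathbf{x}_t=(1-t)\mathbf{x}_0+t\mathbf{x}_1$$ with uniform weighting $$w(t)=1$$; the toy `velocity_net`, the batch shapes, and the zero-initialized parameters are illustrative assumptions rather than a reference implementation.
+
+```python
+import jax
+import jax.numpy as jnp
+
+
+def velocity_net(params, x_t, t):
+    # Toy stand-in for v_theta(x_t, t): a single linear layer on [x_t, t].
+    inp = jnp.concatenate([x_t, t[:, None]], axis=-1)
+    return inp @ params["W"] + params["b"]
+
+
+def cfm_loss(params, key, x0, x1):
+    t = jax.random.uniform(key, (x0.shape[0],))               # t ~ U[0, 1]
+    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1           # linear conditional path
+    target = x1 - x0                                          # conditional OT velocity
+    pred = velocity_net(params, x_t, t)
+    return jnp.mean(jnp.sum((pred - target) ** 2, axis=-1))   # w(t) = 1
+
+
+key = jax.random.PRNGKey(0)
+dim = 4
+params = {"W": jnp.zeros((dim + 1, dim)), "b": jnp.zeros(dim)}
+x0 = jax.random.normal(jax.random.fold_in(key, 1), (8, dim))  # stand-in "data" batch
+x1 = jax.random.normal(jax.random.fold_in(key, 2), (8, dim))  # Gaussian noise batch
+loss, grads = jax.value_and_grad(cfm_loss)(params, key, x0, x1)
+```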
+
+Sampling: Solve the ODE $$\require{physics} \dv{\mathbf{x}_t}{t}=v_\theta(\mathbf{x}_t, t)$$ from the initial condition $$\mathbf{x}_1\sim p_{\text{noise}}.$$ Typically, an Euler solver or a higher-order ODE solver is employed, taking a few hundred discrete steps of iterative refinement.
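+
+A bare-bones sampler might then look like the sketch below (the step count and the `velocity_net` interface from the previous snippet are assumptions):
+
+```python
+import jax.numpy as jnp
+
+
+def euler_sample(velocity_net, params, x1, num_steps=100):
+    # Integrate dx/dt = v_theta(x, t) from t = 1 (noise) down to t = 0 (data).
+    dt = 1.0 / num_steps
+    x = x1
+    for i in range(num_steps):
+        t = jnp.full((x.shape[0],), 1.0 - i * dt)
+        x = x - dt * velocity_net(params, x, t)   # Euler step toward smaller t
+    return x
+```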
+
+
+## ODE Distillation methods
+Before introducing ODE distillation methods, it is imperative to define a general continuous-time flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$ that maps any noisy input $$\mathbf{x}_t$$, $$t\in[0,1]$$, to any point $$\mathbf{x}_s$$, $$s\in[0,1]$$, on the trajectory of the aforementioned probability-flow ODE (\ref{eq:1}). This generalizes flow-based distillation and consistency models within a single unified framework. The flow map is well-defined only if its **boundary conditions** satisfy $$f_{t\to t}(\mathbf{x}_t, t, t) = \mathbf{x}_t$$ for all time steps. One popular way to meet this condition is to parameterize the model as $$ f_{t\to s}(\mathbf{x}_t, t, s)= c_{\text{skip}}(t, s)\mathbf{x}_t + c_{\text{out}}(t,s)F_{t\to s}(\mathbf{x}_t, t, s)$$ where $$c_{\text{skip}}(t, t) = 1$$ and $$c_{\text{out}}(t, t) = 0$$ for all $$t$$.
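+
+As a small illustration, the parameterization above can be written as follows; the backbone `F_net` is an assumed callable, and the choice $$c_{\text{skip}}=1$$, $$c_{\text{out}}(t,s)=s-t$$ is just one option that satisfies the boundary condition.
+
+```python
+import jax.numpy as jnp
+
+
+def flow_map(F_net, params, x_t, t, s):
+    # f_{t->s}(x_t) = c_skip(t, s) * x_t + c_out(t, s) * F_{t->s}(x_t, t, s)
+    c_skip = 1.0
+    c_out = (s - t)[:, None]   # vanishes at s = t, so f_{t->t}(x_t) = x_t
+    return c_skip * x_t + c_out * F_net(params, x_t, t, s)
+```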
+
+At its core, ODE distillation boils down to strategically constructing the training objective of the flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$ so that the map can be evaluated efficiently during sampling. In addition, we need to orchestrate the schedule of $$(t,s)$$ pairs for good training dynamics.
+
+In the context of distillation, the flow map covers both the forward direction ($$s < t$$, moving toward data) and the backward direction ($$s > t$$, moving toward noise). During multi-step sampling, the conditional probability path is traversed twice per step: once toward data and once back toward noise. In our flow map formulation, this corresponds to the flow maps $$f_{\tau_i\to 0}(\mathbf{x}_{\tau_i}, \tau_i, 0)$$ and $$f_{0\to \tau_{i-1}}(\mathbf{x}_0, 0, \tau_{i-1})$$, where $$0<\tau_{i-1}<\tau_i<1$$. Intuitively, the flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$ represents a direct mapping given by some **displacement field**, while $$F_{t\to s}(\mathbf{x}_t, t, s)$$ measures the increment and corresponds to a **velocity field**.
+
+### MeanFlow
+MeanFlow can be trained from scratch or distilled from a pretrained FM model. The conditional probability path is defined as the linear interpolation between noise and data, $$\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$$, whose conditional velocity field is the OT target $$v(\mathbf{x}_t, t \vert \mathbf{x}_0)=\mathbf{x}_1- \mathbf{x}_0.$$ The main contribution consists of identifying and defining an **average velocity field**, which coincides with our flow map as
+
+$$
+F_{t\to s}(\mathbf{x}_t, t, s)=u(\mathbf{x}_t, t, s) \triangleq \frac{1}{t - s} \int_s^t v(\mathbf{x}_\tau, \tau) d\tau=\dfrac{f_{t\to s}(\mathbf{x}_t, t, s)-f_{t\to t}(\mathbf{x}_t, t, t)}{s-t}
+$$
+
+where $$c_{\text{out}}(t,s)=s-t$$. This is appealing because it gives our flow map a concrete physical meaning. In particular, $$f_{t\to s}(\mathbf{x}_t, t, s)$$ represents the "displacement" from $$\mathbf{x}_t$$ to $$\mathbf{x}_s$$, while $$F_{t\to s}(\mathbf{x}_t, t, s)$$ is the average velocity field pointing from $$\mathbf{x}_t$$ to $$\mathbf{x}_s$$.
+
+We rearrange the equation above as
+
+$$
+\begin{equation}
+ (t-s)F_{t\to s}(\mathbf{x}_t, t, s)=\int_s^t v(\mathbf{x}_\tau, \tau) d\tau \label{eq:2}
+\end{equation}
+$$
+
+Differentiating both sides of (\ref{eq:2}) w.r.t. $t$, under the assumption that $s$ is independent of $t$, we obtain the MeanFlow identity
+
+$$
+\require{physics}
+v(\mathbf{x}_t, t)=F_{t\to s}(\mathbf{x}_t, t, s) +(t-s)\dv{F_{t\to s}(\mathbf{x}_t, t, s)}{t}
+$$
+
+where we further compute the total derivative and derive the target $$F_{t\to s}^{\text{tgt}}(\mathbf{x}_t, t, s)$$.
+
+Training: In our flow map notation, the training objective becomes
+
+$$
+\mathbb{E}_{\mathbf{x}_0, \mathbf{x}_1, t, s}
+\left[ w(t) \left\| F^\theta_{t\to s}(\mathbf{x}_t, t, s) - F_{t\to s}^{\text{tgt}}(\mathbf{x}_t, t, s | \mathbf{x}_0) \right\|_2^2 \right]
+$$
+
+
+where $$F_{t\to s}^{\text{tgt}}(\mathbf{x}_t, t, s\vert\mathbf{x}_0)=v - (t-s)(v\partial_{\mathbf{x}_t}F^{\theta^-}_{t\to s}(\mathbf{x}_t, t, s) + \partial_t F^{\theta^-}_{t\to s}(\mathbf{x}_t, t, s))$$ and $$\theta^-$$ denotes `stopgrad()`. Note that `stopgrad` avoids higher-order gradient computation. There are a couple of choices for $$v$$: we can substitute it with $$F_{t\to t}(\mathbf{x}_t, t, t)$$ or with $$v(\mathbf{x}_t, t \vert \mathbf{x}_0)=\mathbf{x}_1- \mathbf{x}_0$$; MeanFlow adopts the latter to reduce computation.
+
+
+**Full derivation of the target.**
+Based on the MeanFlow identity, we can compute the target as follows:
+$$
+\require{physics}
+\require{cancel}
+\begin{align*}
+F_{t\to s}^{\text{tgt}}(\mathbf{x}_t, t, s\vert\mathbf{x}_0) &= \dv{\mathbf{x}_t}{t} - (t-s)\dv{F_{t\to s}(\mathbf{x}_t, t, s)}{t} \\
+& = \dv{\mathbf{x}_t}{t} - (t-s)\left(\nabla_{\mathbf{x}_t} F_{t\to s}(\mathbf{x}_t, t, s) \dv{\mathbf{x}_t}{t} + \partial_t F_{t\to s}(\mathbf{x}_t, t, s) + \cancel{\partial_s F_{t\to s}(\mathbf{x}_t, t, s) \dv{s}{t}}\right) \\
+& = v - (t-s)\left(v \nabla_{\mathbf{x}_t} F_{t\to s}(\mathbf{x}_t, t, s) + \partial_t F_{t\to s}(\mathbf{x}_t, t, s)\right). \\
+\end{align*}
+$$
+Note that in MeanFlow $$\dv{\mathbf{x}_t}{t} = v(\mathbf{x}_t, t\vert \mathbf{x}_0)$$ and $$\dv{s}{t}=0$$ since $s$ is independent of $t$.
+
+
+In practice, the evaluation of $$F_{t\to s}(\mathbf{x}_t, t, s)$$ and its total derivative can be obtained in a single function call: `f, dfdt = jvp(f_theta, (xt, s, t), (v, 0, 1))`. Although the `jvp` operation only introduces one extra backward pass, it still incurs instability and slows down training. Moreover, the `jvp` operation is currently incompatible with the latest optimized attention implementations. SplitMeanFlow circumvents this issue by enforcing an interval-splitting consistency identity $$(t-s)F_{t\to s} = (t-r)F_{t\to r}+(r-s)F_{r\to s}$$ where $$s < r < t$$, avoiding the `jvp` computation altogether.
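+
+Below is a hedged JAX sketch of this target computation with a single `jax.jvp` call, mirroring the snippet above; the toy `F_net`, the batch layout, and the uniform weighting are assumptions.
+
+```python
+import jax
+import jax.numpy as jnp
+
+
+def F_net(params, x_t, t, s):
+    # Toy stand-in for F_theta(x_t, t, s): a single linear layer on [x_t, t, s].
+    inp = jnp.concatenate([x_t, t[:, None], s[:, None]], axis=-1)
+    return inp @ params["W"] + params["b"]
+
+
+def meanflow_loss(params, x0, x1, t, s):
+    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1
+    v = x1 - x0                                        # conditional OT velocity
+    # Total derivative dF/dt along the ODE: tangents (dx/dt, dt/dt, ds/dt) = (v, 1, 0).
+    F, dFdt = jax.jvp(
+        lambda x, tt, ss: F_net(params, x, tt, ss),
+        (x_t, t, s),
+        (v, jnp.ones_like(t), jnp.zeros_like(s)),
+    )
+    target = v - (t - s)[:, None] * dFdt               # MeanFlow target
+    target = jax.lax.stop_gradient(target)             # theta^-: no gradient through the target
+    return jnp.mean(jnp.sum((F - target) ** 2, axis=-1))
+```
+
+Gradients then flow only through the network prediction $$F$$, matching the `stopgrad` trick described above.
+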
+**Loss type:** type (b) backward loss.
+
+Sampling:
+Either one-step or multi-step sampling can be performed. From the definition of the average velocity field, we directly obtain
+
+$$
+\mathbf{x}_s = \mathbf{x}_t - (t-s)F^\theta_{t\to s}(\mathbf{x}_t, t, s).
+$$
+
+In particular, we achieve one-step inference by setting $t=1, s=0$ and sampling from $$\mathbf{x}_1\sim p_{\text{noise}}$$.
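+
+A minimal sketch of this sampler, reusing the hypothetical `F_net`/`params` interface from the previous snippet (the time grid is an assumption):
+
+```python
+import jax.numpy as jnp
+
+
+def flow_map_sample(F_net, params, x1, times=(1.0, 0.0)):
+    # Repeatedly apply x_s = x_t - (t - s) * F_theta(x_t, t, s) along the time grid.
+    x = x1
+    for t, s in zip(times[:-1], times[1:]):
+        tt = jnp.full((x.shape[0],), t)
+        ss = jnp.full((x.shape[0],), s)
+        x = x - (t - s) * F_net(params, x, tt, ss)
+    return x
+```
+
+One-step generation corresponds to `times=(1.0, 0.0)`, while `times=(1.0, 0.5, 0.0)` gives a two-step variant.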
+
+
+### Consistency Models
+
+Essentially, consistency models (CMs) are our flow map when $$s=0$$, i.e., $$f_{t\to 0}(\mathbf{x}_t, t, 0).$$
+
+**Discretized CM**
+
+CMs are trained to have consistent outputs between adjacent timesteps along the trajectory of the ODE (\ref{eq:1}). They can be trained from scratch via consistency training or distilled from a given diffusion or flow model via consistency distillation, similar to MeanFlow.
+
+- Training: When expressed in our flow map notation, the objective becomes
+
+$$
+\mathbb{E}_{\mathbf{x}_t, t} \left[ w(t) d\left(f_{t \to 0}^\theta(\mathbf{x}_t, t,0), f_{t \to 0}^{\theta^-}(\mathbf{x}_{t-\Delta t}, t - \Delta t,0)\right) \right],
+$$
+
+where $$\theta^-$$ denotes $$\text{stopgrad}(\theta)$$, $$w(t)$$ is a weighting function, $$\Delta t > 0$$ is the distance between adjacent time steps, and $$d(\cdot, \cdot)$$ is a distance metric; common choices include the $$\ell_2$$ loss $$d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2^2$$, the pseudo-Huber loss $$d(\mathbf{x}, \mathbf{y}) = \sqrt{\|\mathbf{x} - \mathbf{y}\|_2^2 + c^2} - c$$, and the Learned Perceptual Image Patch Similarity (LPIPS) loss.
+
+- Sampling:
+It is natural to conduct one-step sampling with CM
+
+$$
+\hat{\mathbf{x}}_0 = f^{\theta}_{1\to 0}(\mathbf{x}_1, 1,0),
+$$
+
+while multi-step sampling is also possible since we can compute the next noisy input $$\mathbf{x}_{t-\Delta t}\sim p_{t-\Delta t}(\cdot\vert \mathbf{x}_0)$$ using the prescribed conditional probability path at our discretion. Discrete-time CMs depend heavily on the choice of $$\Delta t$$ and often require carefully designed annealing schedules. To obtain the noisy sample $$\mathbf{x}_{t-\Delta t}$$ at the previous step, one typically evolves $$\mathbf{x}_t$$ backward by numerically solving the ODE (\ref{eq:1}), which can introduce additional discretization errors.
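+
+For illustration, here is a hedged sketch of this multi-step procedure under the linear conditional path; the time grid and the `consistency_fn(params, x_t, t)` interface (playing the role of $$f_{t\to 0}$$) are assumptions.
+
+```python
+import jax
+import jax.numpy as jnp
+
+
+def cm_multistep_sample(consistency_fn, params, key, x1, taus=(1.0, 0.6, 0.3)):
+    # Alternate between jumping to a data estimate with f_{tau -> 0} and
+    # re-noising it to the next (smaller) time via x_t = (1 - t) * x0 + t * noise.
+    x = x1
+    for i, tau in enumerate(taus):
+        t = jnp.full((x.shape[0],), tau)
+        x0_hat = consistency_fn(params, x, t)        # f_{tau_i -> 0}
+        if i + 1 == len(taus):
+            break
+        key, sub = jax.random.split(key)
+        noise = jax.random.normal(sub, x0_hat.shape)
+        tau_next = taus[i + 1]
+        x = (1.0 - tau_next) * x0_hat + tau_next * noise
+    return x0_hat
+```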
+
+**Continuous CM**
+
+When using $$d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2^2$$ and taking the limit $$\Delta t \to 0$$, Song et al. show that the gradient of the discretized CM loss with respect to $$\theta$$ converges to the gradient of a new objective in which $$\Delta t$$ no longer appears.
+- Training: In our notation, the objective is
+
+$$
+\require{physics}
+\mathbb{E}_{\mathbf{x}_t, t} \left[ w(t) (f^\theta_{t\to 0})^{\top}(\mathbf{x}_t, t,0) \dv{f^{\theta^-}_{t\to 0}(\mathbf{x}_t, t,0)}{t} \right]
+$$
+
+where $$ \require{physics} \dv{f^{\theta^-}_{t\to 0}(\mathbf{x}_t, t,0)}{t} = \nabla_{\mathbf{x}_t} f^{\theta^-}_{t\to 0}(\mathbf{x}_t, t,0) \dv{\mathbf{x}_t}{t} + \partial_t f^{\theta^-}_{t\to 0}(\mathbf{x}_t, t,0)$$ is the tangent of $$f^{\theta^-}_{t\to 0}$$ at $$(\mathbf{x}_t, t)$$ along the trajectory of the ODE (\ref{eq:1}); a minimal sketch of this tangent-based objective is given after this list. Consistency Trajectory Models (CTMs) extend this objective so that the forward loss (type (a)) is optimized globally. In this context, their intuition is that $$f^\theta_{t \to s}(\mathbf{x}_t, t, s)\approx f^\theta_{r \to s}(\texttt{Solver}_{t\to r}(\mathbf{x}_t, t, r), r, s),$$ where the composition order on the right-hand side depends on the solver assumed for the teacher model.
+
+- Sampling:
+Same as the discretized version. CTMs introduce a new sampling method, $$\gamma$$-sampling, which controls the noise level used to diffuse the intermediate noisy sample (according to the conditional probability path) during multi-step sampling.
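+
+As promised above, here is a hedged sketch of the tangent-based continuous-time objective: taking the gradient of the inner product between $$f^\theta$$ and the stop-gradient tangent reproduces the stated expression. The flow-map interface `f_map(params, x_t, t)`, the conditional velocity `v`, and the constant weighting are assumptions.
+
+```python
+import jax
+import jax.numpy as jnp
+
+
+def continuous_cm_loss(f_map, params, x_t, t, v, w=1.0):
+    # Tangent of f_{t->0} along the ODE, computed with theta^- (no gradient flows through it).
+    params_ng = jax.lax.stop_gradient(params)
+    _, dfdt = jax.jvp(
+        lambda x, tt: f_map(params_ng, x, tt),
+        (x_t, t),
+        (v, jnp.ones_like(t)),         # dx/dt = v, dt/dt = 1
+    )
+    f = f_map(params, x_t, t)          # carries gradients w.r.t. theta
+    return jnp.mean(w * jnp.sum(f * dfdt, axis=-1))
+```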
+
+
+**Loss type:** type (b) backward loss, while CTMs optimize the type (a) forward loss, both locally and globally.
+
+
+### Flow Anchor Consistency Model
+
+Like MeanFlow, the Flow Anchor Consistency Model (FACM) adopts the linear conditional probability path $$\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$$ whose conditional velocity field is the OT target $$v(\mathbf{x}_t, t \vert \mathbf{x}_0)=\mathbf{x}_1- \mathbf{x}_0.$$ In our flow map notation, FACM parameterizes the model as $$ f^\theta_{t\to 0}(\mathbf{x}_t, t, 0)= \mathbf{x}_t - tF^\theta_{t\to 0}(\mathbf{x}_t, t, 0) $$ where $$c_{\text{skip}}(t,0)=1$$ and $$c_{\text{out}}(t,0)=-t$$.
+
+FACM imposes a **consistency property**, which requires the total derivative of the consistency function to be zero:
+$$
+\require{physics}
+\dv{t}f^\theta_{t \to 0}(\mathbf{x}_t, t, 0) = 0.
+$$
+
+By substituting the parameterization of FACM, we have
+
+$$\require{physics}
+F^\theta_{t\to 0}(\mathbf{x}_t, t, 0)=v(\mathbf{x}_t, t)-t\dv{F^\theta_{t\to 0}(\mathbf{x}_t, t, 0)}{t}.
+$$
+
+Notice that this is equivalent to [MeanFlow](#meanflow) with $$s=0$$. It indicates that the CM objective directly forces the network $$F^\theta_{t\to 0}(\mathbf{x}_t, t, 0)$$ to learn the properties of an average velocity field heading towards the data distribution, thus enabling the one-step generation shortcut.
+
+
+Training: The FACM training algorithm, expressed in our flow map notation, is shown below. Notice that $$d_1$$ and $$d_2$$ are, respectively, an $$\ell_2$$ loss combined with a cosine loss, $$L_{\cos}(\mathbf{x}, \mathbf{y}) = 1 - \dfrac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_{2} \, \|\mathbf{y}\|_{2}}$$, and a norm-$$\ell_2$$ loss, $$L_{\text{norm}}(\mathbf{x}, \mathbf{y}) =\dfrac{\|\mathbf{x}-\mathbf{y}\|^2}{\sqrt{\|\mathbf{x}-\mathbf{y}\|^2+c}}$$ with a small constant $$c$$ (a special case of the adaptive $$\ell_2$$ loss proposed in MeanFlow), plus reweighting. Interestingly, FACM separates the training of the FM and CM objectives onto disentangled time intervals: when training with the CM target, we let $$s=0, t\in[0,1]$$; when training with the FM anchor, we set $$t'=2-t$$ with $$t'\in[1,2]$$.
+
+
+ {% include figure.liquid loading="eager" path="blog/2025/diff-distill/FACM_training.png" class="img-fluid rounded z-depth-1" %}
+
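+For reference, a sketch of the two distance components $$L_{\cos}$$ and $$L_{\text{norm}}$$ used above; the per-batch mean reduction and the value of the constant $$c$$ are assumptions.
+
+```python
+import jax.numpy as jnp
+
+
+def cosine_loss(x, y, eps=1e-8):
+    # L_cos(x, y) = 1 - <x, y> / (||x|| * ||y||), averaged over the batch.
+    num = jnp.sum(x * y, axis=-1)
+    den = jnp.linalg.norm(x, axis=-1) * jnp.linalg.norm(y, axis=-1) + eps
+    return jnp.mean(1.0 - num / den)
+
+
+def norm_l2_loss(x, y, c=1e-3):
+    # L_norm(x, y) = ||x - y||^2 / sqrt(||x - y||^2 + c), a special case of
+    # MeanFlow's adaptive l2 loss.
+    sq = jnp.sum((x - y) ** 2, axis=-1)
+    return jnp.mean(sq / jnp.sqrt(sq + c))
+```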
+
+
+Sampling: Same as CM.
+
+**Loss type:** type (b) backward loss.
+
+
+### Align Your Flow
+
+Our notation incorporates a small modification of the flow map introduced by Align Your Flow (AYF), in which we indicate the direction of distillation; that is, the AYF continuous-time flow map reads $$f^{\text{AYF}}(\mathbf{x}_t, t, s)=f_{t\to s}(\mathbf{x}_t, t, s).$$ Specifically, AYF selects a tighter set of boundary conditions: $$c_{\text{skip}}(t,s)=1$$ and $$c_{\text{out}}(t,s)=s-t$$.
+
+Training:
+The first variant of the objective, called AYF-**Eulerian Map Distillation** (AYF-EMD), is compatible with both distillation and training from scratch.
+
+$$
+\nabla_\theta \mathbb{E}_{\mathbf{x}_t, t, s}\left[w(t, s)\text{sign}(t - s) \cdot (f^\theta_{t \to s})^\top(\mathbf{x}_t, t, s) \cdot \frac{\text{d}f^{\theta^-}_{t\to s}(\mathbf{x}_t, t, s)}{\text{d}t}\right]
+$$
+
+It is intriguing that this objective reduces to the [continuous CM](#consistency-models) objective when $$s=0$$, while recovering the original FM objective as $$s\to t$$ (the gradient of AYF-EMD matches the gradient of the FM objective up to a constant in this limit). In contrast to the AYF-EMD objective, CTMs use a discrete consistency loss with a fixed, discretized time schedule.
+The second variant, AYF-**Lagrangian Map Distillation** (AYF-LMD), is only applicable to distillation from a pretrained flow model $$F^\delta_{t \to t}(\mathbf{x}_t,t,t)$$.
+
+$$
+\nabla_\theta \mathbb{E}_{\mathbf{x}_t, t, s}\left[w(t, s)\text{sign}(s - t) \cdot (f^\theta_{t \to s})^\top(\mathbf{x}_t, t, s) \cdot \left(\frac{\text{d}f^{\theta^-}_{t\to s}(\mathbf{x}_t, t, s)}{\text{d}s} - F^\delta_{s \to s}\left(f^{\theta^-}_{t\to s}(\mathbf{x}_t, t, s), s, s\right)\right)\right].
+$$
+
+Sampling: Same as CM, using a combination of $$\gamma$$-sampling and classifier-free guidance.
+
+The formulation of these objectives largely builds on Flow Map Matching. Similar to the trick used in training [MeanFlow](#meanflow) and [CMs](#consistency-models), they add a `stopgrad` operator to the loss to stabilize training and make the objective practical. In their appendix, they provide a detailed proof of why these objectives are equivalent to the objectives in Flow Map Matching.
+
+
+**Loss type:** type (b) backward loss for AYF-EMD; type (a) forward loss for AYF-LMD.
+
+
+## Connections
+Now it is time to connect the dots with some existing methods. Let's frame their objectives in our flow map notation and identify their loss types where possible.
+
+### Shortcut Models
+
+
+ {% include figure.liquid loading="eager" path="blog/2025/diff-distill/shortcut_model.png" class="img-fluid rounded z-depth-1" %}
+
+
+
+The diagram of Shortcut Models
+
+In essence, Shortcut Models augment the standard flow matching objective with a self-consistency regularization term. This additional loss component ensures that the learned vector field satisfies a midpoint consistency property: the result of a single large integration step should match the composition of two smaller steps traversing the same portion of the ODE (\ref{eq:1}) trajectory.
+
+Training: In the training objective below, we suppress the input arguments and focus on the core transitions between time steps, expressed in our flow map notation.
+
+$$
+\mathbb{E}_{\mathbf{x}_t, t, s}\left[\left\|F^\theta_{t\to t} - \dfrac{\text{d}\mathbf{x}_t}{\text{d}t}\right\|_2^2 + \left\|f^\theta_{t\to s} - f^{\theta^-}_{\frac{t+s}{2}\to s}\circ f^{\theta^-}_{t \to \frac{t+s}{2}}\right\|_2^2\right]
+$$
+
+where we adopt the same flow map boundary conditions as [AYF](#align-your-flow).
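+
+A hedged sketch of the self-consistency term (the second summand), using the displacement form $$f_{t\to s}(\mathbf{x}_t)=\mathbf{x}_t-(t-s)F^\theta_{t\to s}(\mathbf{x}_t,t,s)$$; the backbone `F_net` and the exact stop-gradient placement are assumptions, and the flow matching term is omitted for brevity.
+
+```python
+import jax
+import jax.numpy as jnp
+
+
+def shortcut_consistency_loss(F_net, params, x_t, t, s):
+    def f(p, x, a, b):
+        # Displacement form of the flow map: x - (a - b) * F(x, a, b).
+        return x - (a - b)[:, None] * F_net(p, x, a, b)
+
+    m = 0.5 * (t + s)                              # midpoint time
+    one_jump = f(params, x_t, t, s)                # single jump t -> s
+    p_ng = jax.lax.stop_gradient(params)           # target uses theta^-
+    two_jumps = f(p_ng, f(p_ng, x_t, t, m), m, s)  # two chained half-jumps
+    return jnp.mean(jnp.sum((one_jump - two_jumps) ** 2, axis=-1))
+```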
+
+
+Sampling: Same as MeanFlow, but with specific shortcut lengths.
+
+**Loss type:** type (c) self-consistency loss.
+
+
+### ReFlow
+
+
+ {% include figure.liquid loading="eager" path="blog/2025/diff-distill/rectifiedflow.png" class="img-fluid rounded z-depth-1" %}
+
+
+
+The diagram of rectified flow and ReFlow process
+
+Unlike most ODE distillation methods, which learn to jump from $$t\to s$$ according to our defined flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$, ReFlow takes a different approach: it establishes new noise-data couplings so that the retrained model generates straighter trajectories (in the rectified flow paper, the straightness of any continuously differentiable process $$Z=\{Z_t\}$$ is measured by $$S(Z)=\int_0^1\mathbb{E}\|(Z_1-Z_0)-\dot{Z}_t\|^2 dt$$, where $$S(Z)=0$$ implies perfectly straight trajectories). Straighter trajectories allow the ODE (\ref{eq:1}) to be solved with fewer steps and larger step sizes. To some extent, this resembles the preconditioning in OT-CFM, which intentionally samples noise-data pairs jointly from an optimal transport plan $$\pi(\mathbf{x}_0, \mathbf{x}_1)$$ instead of from independent marginals.
+
+Training: Pair data synthesized by the pretrained model with the noise that generated it, and use this new coupling to train a student model with the standard FM objective.
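+
+A rough sketch of the pairing step, assuming a pretrained `teacher_velocity(params, x, t)` and simple Euler integration; the student is then trained with the usual conditional FM loss on these pairs.
+
+```python
+import jax
+import jax.numpy as jnp
+
+
+def generate_reflow_pairs(teacher_velocity, params, key, batch, dim, num_steps=100):
+    # Draw noise endpoints and push them through the teacher ODE to obtain a
+    # new (synthetic data, noise) coupling for retraining.
+    x1 = jax.random.normal(key, (batch, dim))
+    x = x1
+    dt = 1.0 / num_steps
+    for i in range(num_steps):
+        t = jnp.full((batch,), 1.0 - i * dt)
+        x = x - dt * teacher_velocity(params, x, t)
+    return x, x1   # (synthetic x0, paired x1)
+```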
+
+Sampling: Same as FMs.
+
+### Inductive Moment Matching
+
+
+ {% include figure.liquid loading="eager" path="blog/2025/diff-distill/IMM.png" class="img-fluid rounded z-depth-1" %}
+
+
+
+The diagram of IMM
+
+This recent method trains our flow map from scratch by matching the distributions of $$f^{\theta}_{t\to s}(\mathbf{x}_t, t, s)$$ and $$f^{\theta}_{r\to s}(\mathbf{x}_r, r, s)$$, where $$s < r < t$$.
+
+Training: In our flow map notation, the training objective becomes
+
+$$
+\mathbb{E}_{\mathbf{x}_t, t, s} \left[ w(t,s) \text{MMD}^2\left(f_{t \to s}(\mathbf{x}_t, t,s), f_{r \to s}(\mathbf{x}_{r}, r,s)\right) \right]
+$$
+
+where $$w(t,s)$$ is a weighting function.
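+
+As a rough illustration, here is a biased V-statistic estimate of the squared MMD with an RBF kernel between two batches of flow-map outputs; IMM's actual kernel choice and estimator details may differ.
+
+```python
+import jax.numpy as jnp
+
+
+def rbf_kernel(a, b, bandwidth=1.0):
+    # k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)), computed pairwise.
+    d2 = jnp.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
+    return jnp.exp(-d2 / (2.0 * bandwidth ** 2))
+
+
+def mmd_squared(samples_p, samples_q, bandwidth=1.0):
+    # Squared MMD between the two output batches, e.g. f_{t->s}(x_t) and f_{r->s}(x_r).
+    k_pp = rbf_kernel(samples_p, samples_p, bandwidth)
+    k_qq = rbf_kernel(samples_q, samples_q, bandwidth)
+    k_pq = rbf_kernel(samples_p, samples_q, bandwidth)
+    return k_pp.mean() + k_qq.mean() - 2.0 * k_pq.mean()
+```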
+
+Sampling: Same spirit as [AYF](#align-your-flow).
+
+
+## Closing Thoughts
+
+The concept of a flow map offers a powerful and unifying notation for summarizing the diverse landscape of diffusion distillation methods. Beyond these ODE distillation methods, an intriguing family of approaches pursues a more direct goal: training a one-step generator from the ground up by directly matching the data distribution from the teacher model.
+
+The core question is: how can we best leverage a pre-trained teacher model to train a student that approximates the data distribution $$p_{\text{data}}$$ in a single shot? With access to the teacher's flow, several compelling strategies emerge. It becomes possible to directly match the velocity fields, minimize the $$f$$-divergence between the student and teacher output distributions, or align their respective score functions.
+
+This leads to distinct techniques in practice. For example, adversarial distillation employs a min-max objective to align the two distributions, while other methods, like [IMM](#inductive-moment-matching), rely on statistical divergences such as the Maximum Mean Discrepancy (MMD).
+
+In our own work on human motion prediction, we explored this direction by using Implicit Maximum Likelihood Estimation (IMLE). IMLE is a potent, if less common, technique that aligns distributions based purely on their samples, offering a direct and elegant way to distill the teacher's knowledge without requiring an explicit density function or a discriminator.
+
+Diffusion distillation is a dynamic field brimming with potential. The journey from a hundred steps to a single step is not just a technical challenge but a gateway to real-time, efficient generative AI applications.
+
+
diff --git a/assets/bibliography/2025-08-18-diff-distill.bib b/assets/bibliography/2025-08-18-diff-distill.bib
new file mode 100644
index 0000000..37af97f
--- /dev/null
+++ b/assets/bibliography/2025-08-18-diff-distill.bib
@@ -0,0 +1,183 @@
+@misc{lipman_flow_2023,
+ title = {Flow Matching for Generative Modeling},
+ url = {http://arxiv.org/abs/2210.02747},
+ doi = {10.48550/arXiv.2210.02747},
+ abstract = {We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows ({CNFs}), allowing us to train {CNFs} at unprecedented scale. Specifically, we present the notion of Flow Matching ({FM}), a simulation-free approach for training {CNFs} based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples -- which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing {FM} with diffusion paths results in a more robust and stable alternative for training diffusion models. Furthermore, Flow Matching opens the door to training {CNFs} with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport ({OT}) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization. Training {CNFs} using Flow Matching on {ImageNet} leads to consistently better performance than alternative diffusion-based methods in terms of both likelihood and sample quality, and allows fast and reliable sample generation using off-the-shelf numerical {ODE} solvers.},
+ number = {{arXiv}:2210.02747},
+ publisher = {{arXiv}},
+ author = {Lipman, Yaron and Chen, Ricky T. Q. and Ben-Hamu, Heli and Nickel, Maximilian and Le, Matt},
+ urldate = {2024-07-05},
+ date = {2023-02-08},
+ eprinttype = {arxiv},
+ eprint = {2210.02747 [cs, stat]}
+}
+
+@article{albergo2023stochastic,
+ title={Stochastic interpolants: A unifying framework for flows and diffusions},
+ author={Albergo, Michael S and Boffi, Nicholas M and Vanden-Eijnden, Eric},
+ journal={arXiv preprint arXiv:2303.08797},
+ year={2023}
+}
+
+@article{tong2023improving,
+ title={Improving and generalizing flow-based generative models with minibatch optimal transport},
+ author={Tong, Alexander and Fatras, Kilian and Malkin, Nikolay and Huguet, Guillaume and Zhang, Yanlei and Rector-Brooks, Jarrid and Wolf, Guy and Bengio, Yoshua},
+ journal={arXiv preprint arXiv:2302.00482},
+ year={2023}
+}
+
+@article{liu2022flow,
+ title={Flow straight and fast: Learning to generate and transfer data with rectified flow},
+ author={Liu, Xingchao and Gong, Chengyue and Liu, Qiang},
+ journal={arXiv preprint arXiv:2209.03003},
+ year={2022}
+}
+
+@article{hu2021lora,
+ title={LoRA: Low-rank adaptation of large language models},
+ author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
+ journal={arXiv preprint arXiv:2106.09685},
+ volume={10},
+ year={2021}
+}
+
+@article{micikevicius2017mixed,
+ title={Mixed precision training},
+ author={Micikevicius, Paulius and Narang, Sharan and Alben, Jonah and Diamos, Gregory and Elsen, Erich and Garcia, David and Ginsburg, Boris and Houston, Michael and Kuchaiev, Oleksii and Venkatesh, Ganesh and others},
+ journal={arXiv preprint arXiv:1710.03740},
+ year={2017}
+}
+
+@inproceedings{fu2025moflowonestep,
+ author = {Fu, Yuxiang and Yan, Qi and Wang, Lele and Li, Ke and Liao, Renjie},
+ title = {MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation},
+ journal = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ year = {2025},
+}
+
+@misc{lipman2024flowmatchingguidecode,
+ title={Flow Matching Guide and Code},
+ author={Yaron Lipman and Marton Havasi and Peter Holderrieth and Neta Shaul and Matt Le and Brian Karrer and Ricky T. Q. Chen and David Lopez-Paz and Heli Ben-Hamu and Itai Gat},
+ year={2024},
+ eprint={2412.06264},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG},
+ url={https://arxiv.org/abs/2412.06264},
+}
+
+@article{boffi2025build,
+ title={How to build a consistency model: Learning flow maps via self-distillation},
+ author={Boffi, Nicholas M and Albergo, Michael S and Vanden-Eijnden, Eric},
+ journal={arXiv preprint arXiv:2505.18825},
+ year={2025}
+}
+
+@article{geng2025mean,
+ title={Mean flows for one-step generative modeling},
+ author={Geng, Zhengyang and Deng, Mingyang and Bai, Xingjian and Kolter, J Zico and He, Kaiming},
+ journal={arXiv preprint arXiv:2505.13447},
+ year={2025}
+}
+
+@article{peng2025flow,
+ title={Flow-Anchored Consistency Models},
+ author={Peng, Yansong and Zhu, Kai and Liu, Yu and Wu, Pingyu and Li, Hebei and Sun, Xiaoyan and Wu, Feng},
+ journal={arXiv preprint arXiv:2507.03738},
+ year={2025}
+}
+
+@article{guo2025splitmeanflow,
+ title={SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling},
+ author={Guo, Yi and Wang, Wei and Yuan, Zhihang and Cao, Rong and Chen, Kuan and Chen, Zhengyang and Huo, Yuanyuan and Zhang, Yang and Wang, Yuping and Liu, Shouda and others},
+ journal={arXiv preprint arXiv:2507.16884},
+ year={2025}
+}
+
+@article{ho2020denoising,
+ title={Denoising diffusion probabilistic models},
+ author={Ho, Jonathan and Jain, Ajay and Abbeel, Pieter},
+ journal={Advances in neural information processing systems},
+ volume={33},
+ pages={6840--6851},
+ year={2020}
+}
+
+@article{song2020score,
+ title={Score-based generative modeling through stochastic differential equations},
+ author={Song, Yang and Sohl-Dickstein, Jascha and Kingma, Diederik P and Kumar, Abhishek and Ermon, Stefano and Poole, Ben},
+ journal={arXiv preprint arXiv:2011.13456},
+ year={2020}
+}
+
+@article{lu2024simplifying,
+ title={Simplifying, stabilizing and scaling continuous-time consistency models},
+ author={Lu, Cheng and Song, Yang},
+ journal={arXiv preprint arXiv:2410.11081},
+ year={2024}
+}
+
+@article{kim2023consistency,
+ title={Consistency trajectory models: Learning probability flow ode trajectory of diffusion},
+ author={Kim, Dongjun and Lai, Chieh-Hsin and Liao, Wei-Hsiang and Murata, Naoki and Takida, Yuhta and Uesaka, Toshimitsu and He, Yutong and Mitsufuji, Yuki and Ermon, Stefano},
+ journal={arXiv preprint arXiv:2310.02279},
+ year={2023}
+}
+
+@article{sabour2025align,
+ title={Align Your Flow: Scaling Continuous-Time Flow Map Distillation},
+ author={Sabour, Amirmojtaba and Fidler, Sanja and Kreis, Karsten},
+ journal={arXiv preprint arXiv:2506.14603},
+ year={2025}
+}
+
+@article{frans2024one,
+ title={One step diffusion via shortcut models},
+ author={Frans, Kevin and Hafner, Danijar and Levine, Sergey and Abbeel, Pieter},
+ journal={arXiv preprint arXiv:2410.12557},
+ year={2024}
+}
+
+@article{zhou2025inductive,
+ title={Inductive moment matching},
+ author={Zhou, Linqi and Ermon, Stefano and Song, Jiaming},
+ journal={arXiv preprint arXiv:2503.07565},
+ year={2025}
+}
+
+@article{yin2024improved,
+ title={Improved distribution matching distillation for fast image synthesis},
+ author={Yin, Tianwei and Gharbi, Micha{\"e}l and Park, Taesung and Zhang, Richard and Shechtman, Eli and Durand, Fredo and Freeman, Bill},
+ journal={Advances in neural information processing systems},
+ volume={37},
+ pages={47455--47487},
+ year={2024}
+}
+
+@article{song2020denoising,
+ title={Denoising diffusion implicit models},
+ author={Song, Jiaming and Meng, Chenlin and Ermon, Stefano},
+ journal={arXiv preprint arXiv:2010.02502},
+ year={2020}
+}
+
+@article{wang2025uni,
+ title={Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction},
+ author={Wang, Yifei and Bai, Weimin and Zhang, Colin and Zhang, Debing and Luo, Weijian and Sun, He},
+ journal={arXiv preprint arXiv:2505.20755},
+ year={2025}
+}
+
+@inproceedings{zhou2024score,
+ title={Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation},
+ author={Mingyuan Zhou and Huangjie Zheng and Zhendong Wang and Mingzhang Yin and Hai Huang},
+ booktitle={International Conference on Machine Learning},
+ url={https://arxiv.org/abs/2404.04057},
+ year={2024}
+}
+
+@article{xu2025one,
+ title={One-step Diffusion Models with $ f $-Divergence Distribution Matching},
+ author={Xu, Yilun and Nie, Weili and Vahdat, Arash},
+ journal={arXiv preprint arXiv:2502.15681},
+ year={2025}
+}
\ No newline at end of file
diff --git a/blog/2025/diff-distill/FACM_training.png b/blog/2025/diff-distill/FACM_training.png
new file mode 100644
index 0000000..6c599c2
Binary files /dev/null and b/blog/2025/diff-distill/FACM_training.png differ
diff --git a/blog/2025/diff-distill/IMM.png b/blog/2025/diff-distill/IMM.png
new file mode 100644
index 0000000..1eba5ce
Binary files /dev/null and b/blog/2025/diff-distill/IMM.png differ
diff --git a/blog/2025/diff-distill/diff-distill.mp4 b/blog/2025/diff-distill/diff-distill.mp4
new file mode 100644
index 0000000..3cce266
Binary files /dev/null and b/blog/2025/diff-distill/diff-distill.mp4 differ
diff --git a/blog/2025/diff-distill/rectifiedflow.png b/blog/2025/diff-distill/rectifiedflow.png
new file mode 100644
index 0000000..d502dcf
Binary files /dev/null and b/blog/2025/diff-distill/rectifiedflow.png differ
diff --git a/blog/2025/diff-distill/shortcut_model.png b/blog/2025/diff-distill/shortcut_model.png
new file mode 100644
index 0000000..47bb958
Binary files /dev/null and b/blog/2025/diff-distill/shortcut_model.png differ
diff --git a/blog/2025/diff-distill/teaser_probpath_velocity_field.png b/blog/2025/diff-distill/teaser_probpath_velocity_field.png
new file mode 100644
index 0000000..9c5a79c
Binary files /dev/null and b/blog/2025/diff-distill/teaser_probpath_velocity_field.png differ