One can estimate the mean of a Gaussian distribution, given a random variable
In diffusion models, Tweedie's formula is used to estimate the posterior mean of the clean data from a noisy sample and the score.
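For reference, a minimal statement of the formula and its usual diffusion-model form (standard DDPM notation assumed):
```latex
% Tweedie: for z ~ N(\mu, \sigma^2 I),  E[\mu | z] = z + \sigma^2 \nabla_z \log p(z).
% With x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon this gives the x_0 estimate:
\hat{x}_0 \;=\; \frac{x_t + (1-\bar{\alpha}_t)\,\nabla_{x_t}\log p(x_t)}{\sqrt{\bar{\alpha}_t}}
\;=\; \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}.
```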
Cascaded Diffusion Models for High Fidelity Image Generation
Matryoshka Diffusion Models
All resolutions are trained jointly; during training, the same noising timestep is used for every resolution of a given sample to avoid information leakage. The noise schedule shift proposed in SimpleDiffusion is also used.
Similar to ProgressiveGAN, training starts at low resolution, then the UNet is gradually widened and more loss terms are added to train higher resolutions; when training at high resolution, the low-resolution network is still trained jointly.
Diffusion Models Beat GANs on Image Synthesis
AdaGN: the class is mapped to a fixed dimension and added to the time embedding; an MLP then predicts the per-channel scale and shift applied after GroupNorm.
super resolution model: the low-resolution image is upsampled and concatenated channel-wise with the noisy input as conditioning.
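A minimal PyTorch sketch of the AdaGN idea described above (module and argument names are my own, not taken from the ADM code):
```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """GroupNorm whose scale/shift are predicted from a (time + class) embedding."""
    def __init__(self, channels, emb_dim, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        # MLP predicting a per-channel scale and shift from the embedding
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(emb_dim, 2 * channels))

    def forward(self, h, emb):
        # emb = time embedding + class embedding (summed upstream)
        scale, shift = self.to_scale_shift(emb).chunk(2, dim=1)
        h = self.norm(h)
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```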
AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation
AsCAN is a hybrid architecture, combining both convolutional and transformer blocks.
Structured Denoising Diffusion Models in Discrete State-Spaces
Discrete Diffusion Model
Glauber Generative Model: Discrete Diffusion Models via Binary Classification
In D3PM's forward process, every step randomly corrupts all tokens; GGM corrupts only one token per step, so the reverse process only needs to predict a single token at each step.
Randomly sample a sequence of length
ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis
VQGAN + multinomial diffusion
Transformer Encoder:
Vector Quantized Diffusion Model for Text-to-Image Synthesis
VQVAE + multinomial diffusion
Transformer Blocks: input
Mitigating Embedding Collapse in Diffusion Models for Categorical Data
While jointly learning the embedding (via reconstruction loss) and the latent diffusion model (via score matching loss) could enhance performance, our analysis shows that end-to-end training risks embedding collapse, degrading generation quality. To address this issue, we introduce CATDM, a continuous diffusion framework within the embedding space that stabilizes training.
High-Resolution Image Synthesis with Latent Diffusion Models
AutoEncoder:
The AutoEncoder is frozen, and a UNet-based DDPM is trained to model the latent space of the dimension-reduced data. Benefits: less computation; the image structure is preserved, so the UNet's inductive bias still helps; and a reusable latent space is learned.
slight regularization: KL or VQ, to avoid high-variance latent spaces.
Cross-attention layers are injected into the UNet, placed after self-attention. Q comes from the UNet feature map, while K and V come from the condition encoder's output (e.g., text embeddings).
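A schematic cross-attention block in the spirit of the LDM conditioning described above (a sketch with illustrative names, not the original implementation):
```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Q from UNet feature tokens, K/V from the condition (e.g., text) embeddings."""
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(cond_dim, dim, bias=False)
        self.to_v = nn.Linear(cond_dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, cond):
        # x: (B, N, dim) flattened UNet features, cond: (B, M, cond_dim)
        B, N, D = x.shape
        q = self.to_q(x).view(B, N, self.heads, D // self.heads).transpose(1, 2)
        k = self.to_k(cond).view(B, -1, self.heads, D // self.heads).transpose(1, 2)
        v = self.to_v(cond).view(B, -1, self.heads, D // self.heads).transpose(1, 2)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))
```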
Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models
three stages to reduce computational demands
Stage A: train a VQGAN with 4x downsampling,
Semantic Compressor: the image is resized from 1024 to 768, and a network is trained to compress it to
Stage B: a diffusion model is trained on the pre-quantization embeddings of the image from stage A, conditioned on the semantic compressor's output for that image (Wuerstchen additionally conditions on text), which amounts to self-conditioning.
Stage C: a diffusion model is trained on the semantic compressor's output of the image, conditioned on text.
At generation time, stage C first generates the compressed semantic latent from text, stage B then generates the stage-A latent conditioned on it, and the stage-A decoder reconstructs the image.
Binary Latent Diffusion
Same idea as LDM, except the latent is binarized.
Following the idea of VQ, Bernoulli sampling replaces VQ's nearest-neighbor lookup, and a binary AutoEncoder with binarized latents is trained:
Derive a Bernoulli diffusion process and model it with a DPM
Simple Diffusion: End-to-end diffusion for high resolution images
Existing high-resolution diffusion models come in two flavors: StableDiffusion's latent-space (dimensionality-reduction) approach and the coarse-to-fine cascaded super-resolution approach. SimpleDiffusion uses the following techniques to train high-resolution diffusion models directly in pixel space.
Adjust the noise schedule: the observation is that at the same noise level, a high-resolution image still retains much of its global structure, so the schedule is shifted to add more noise at high resolution (noise schedule shift).
Multi-scale training: one difficulty of training high-resolution diffusion models directly in pixel space is that high-frequency information (object edges, etc.) is hard to model and dominates the training loss, so the paper proposes a multi-scale training loss:
To address memory and compute issues, network depth (the number of blocks) is added at low-resolution feature maps (the paper chooses 16x16); a downsampling layer is added at the very front of the model and an upsampling layer at the very end to avoid computation at the highest resolution.
Dropout is only applied on low-resolution feature maps.
Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion
Loss Weight: sigmoid shift
Flop Heavy Scaling: grow the token sequence length instead of the model size
Residual U-ViT
On the Scalability of Diffusion-based Text-to-Image Generation
For model scaling, the location and amount of cross-attention distinguish the performance of existing UNet designs, and increasing the number of transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel count.
On the data scaling side, the quality and diversity of the training set matter more than simple dataset size. Increasing caption density and diversity improves text-image alignment and learning efficiency.
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
When operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results.
Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction
We recommend diffusion for applications targeting image quality and low latency; and next-token prediction when prompt following or throughput is more important.
Scaling Laws For Diffusion Transformers
Relay Diffusion: Unifying diffusion process across resolutions for image synthesis
Cascaded models perform better than end-to-end models under a fair setting. RDM also uses a cascaded scheme, but unlike traditional cascaded models, RDM cascades over timesteps, which reduces the number of training and sampling steps.
Both the low- and high-resolution stages use the EDM formulation, i.e.,
In order to
The low-resolution generated
Note that the blurring diffusion is trained with Block Noise for corruption (rather than directly sampling
All are Worth Words: A ViT Backbone for Diffusion Models
ViT in Pixel Space
Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models
RWKV improves on the standard RNN architecture: it is computed in parallel during training while running like an RNN at inference. It enhances the linear attention mechanism and designs the receptance-weighted key-value (RWKV) mechanism.
Scalable Diffusion Models with Transformers
ViT in Latent Space
adaLN: instead of learning the scale and shift inside LayerNorm, an extra MLP (one per block) predicts a scale and shift from the timestep and condition.
adaLN-Zero: an additional predicted scale is multiplied in before the skip-connection, and the MLP is initialized so that this scale outputs zero, making each residual block an identity mapping at initialization.
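A rough sketch of adaLN-Zero as described above (simplified to a single sub-block; the real DiT block modulates both the attention and MLP branches):
```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """One residual sub-block with adaLN-Zero modulation; the gate is zero-initialized."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)  # no learned scale/shift
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.modulation = nn.Linear(cond_dim, 3 * dim)
        nn.init.zeros_(self.modulation.weight)   # adaLN-Zero: block starts as identity
        nn.init.zeros_(self.modulation.bias)

    def forward(self, x, cond):
        # cond = timestep embedding (+ class embedding)
        shift, scale, gate = self.modulation(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift
        return x + gate * self.mlp(h)            # gate == 0 at init => identity mapping
```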
Scaling Diffusion Transformers to 16 Billion Parameters
In DiT, every
Increasing
Besides the diffusion loss, an additional balance loss is added to avoid imbalanced experts.
EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing
MoE
Dynamic Diffusion Transformer
slimmable neural network; a finer-grained form of MoE.
PoM: Efficient Image and Video Generation with the Polynomial Mixer
We propose a drop-in replacement for MHA called the Polynomial Mixer (PoM) that has the benefit of encoding the entire sequence into an explicit state. PoM has a linear complexity with respect to the number of tokens.
U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers
The encoder processes the input image with stage-wise downsampling, and the decoder scales the encoded representation back up from the most compressed stage to the input size. At each encoder stage transition, spatial downsampling by a factor of
Similar to ToDo, token downsampling is used to reduce computation, but without losing information: the
Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts
MoE is introduced: each block uses a timestep-based gating network to predict a probability distribution and takes the TopK experts, achieving parameter isolation and mitigating conflicts between different timesteps.
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
design space + DiT
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
Every block of a vanilla Transformer has the same computational cost; here the UNet idea is applied to the Transformer: the middle layers operate on fewer tokens, reducing computation.
Following SimpleDiffusion, less computation is done at high resolution, and the network is made deeper and wider at low resolution.
As a result, the pixel-level computational complexity grows only linearly with image resolution.
EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching
A combination of HDiT and U-ViT.
Inspired by how humans sketch: first draw the whole (global), then a local region, then check whether the whole is coherent, then pick another local region to refine, and so on. AMM is a mask computed from distances between tokens that turns global attention into local attention.
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
U-ViT
Transformers trade off computation against quality: a small patch size means a long token sequence and high compute but better results, while a large patch size means a short token sequence and less compute but worse results.
feature cascade: divided into
A U-ViT architecture is used at low resolution, and ConvNeXt is used at high resolution to save computation.
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
Flag-DiT substitutes all LayerNorm with RMSNorm to improve training stability. Moreover, it incorporates key-query normalization (KQ-Norm) before key-query dot product attention computation. The introduction of KQ-Norm aims to prevent loss divergence by eliminating extremely large values within attention logits.
We introduce learnable special tokens including the [nextline] and [nextframe] tokens to transform training samples with different scales and durations into a unified one-dimensional sequence. We add [PAD] tokens to transform 1-D sequences into the same length for better parallelism.
Since data of different modalities are all flattened into a single 1D sequence and modeled uniformly with 1D RoPE, [nextline] and [nextframe] tokens are needed. If images are the only modality, 2D RoPE can be used and the [nextline] token becomes unnecessary. Essentially, the [nextline] and [nextframe] tokens compensate for the positional information lost when flattening high-dimensional modalities into 1D.
When text is present, self-attention and cross-attention are placed in parallel.
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations.
NTK
Token Merge
FiT: Flexible Vision Transformer for Diffusion Model
ViT in Latent Space
No cropping; the aspect ratio is preserved, and the image is resized so that
2D RoPE positional encoding is used; its extrapolation ability allows generating images at arbitrary resolutions and aspect ratios.
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
ViT in Latent Space
a vision transformer architecture similar to LLaMA to reduce the architectural differences between language and vision.
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention
ViT in Latent Space
DiT models have faced challenges with scalability and quadratic complexity efficiency. We leverage the long sequence modeling capability of Gated Linear Attention (GLA) Transformers, expanding its applicability to diffusion models and offering superior efficiency and effectiveness.
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
By fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
MonoFormer: One Transformer for Both Diffusion and Autoregression
Causal Diffusion Transformers for Generative Modeling
ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
Neural Residual Diffusion Models for Deep Scalable Vision Generation
Scalable Diffusion Models with State Space Backbone
Diffusion Models Without Attention
ZigMa: Zigzag Mamba Diffusion Model
DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis
Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation
Dimba: Transformer-Mamba Diffusion Models
LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba
LinFusion: 1 GPU, 1 Minute, 16K Image
For conventional attention-based backbones (SD, DiT, etc.), computation grows quadratically as resolution, i.e., token length, increases.
Drawing on Mamba2, RWKV6, GLA, etc., we introduce a generalized linear attention paradigm whose computation grows linearly with resolution (token length).
The model is trained by distillation from SD and achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity.
It also extrapolates to zero-shot cross-resolution generation.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
For each training sample, we start with an input-target pair
Infinite-Diff: Infinite Resolution Diffusion with Subsampled Mollified States
Infinite-Diff is a generative diffusion model defined in an infinite dimensional Hilbert space, which can model infinite resolution data. By training on randomly sampled subsets of coordinates and denoising content only at those locations, we learn a continuous function for arbitrary resolution sampling.
Diffusion Models Need Visual Priors for Image Generation
DoD enhances diffusion models by recurrently incorporating previously generated samples as visual priors to guide the subsequent sampling process.
We propose the Latent Embedding Module (LEM) that filters the conditional information using a compression-reconstruction approach to discard redundant details. We reasonably assume that the high-level semantic information extracted from generated images is similar to that obtained from real images. This assumption allows us to use the latents of ground truth images as inputs to LDM during training, simplifying the training strategy. Such simplification allows end-to-end training of DoD on image latents and joint optimization of the backbone model and LEM.
Image Neural Field Diffusion Models
Neural field is also known as Implicit Neural Representations (INR), which represents signals as coordinate-based neural networks.
An Image Neural Field Autoencoder is proposed so that there is a latent distribution that can be modeled and sampled.
Similar to Diff-AE and PDAE, a diffusion model is used to model the latent distribution.
Condition-Aware Neural Network for Controlled Image Generation
In conventional conditional models, all conditions share the same static condition-processing network, which limits modeling capacity. One solution is a separate expert model per condition, but that is extremely costly. Instead, a generator network is learned that dynamically produces the parameters of the condition-processing network from the condition: it introduces a condition-aware weight generation module that generates conditional weights for convolution/linear layers based on the input condition.
Making depthwise convolution layers, the patch embedding layer, and the output projection layers condition-aware brings a significant performance boost.
CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT, on class conditional image generation on ImageNet and text-to-image generation on COCO.
D2C: Diffusion-Decoding Models for Few-Shot Conditional Generation
A diffusion model is used to model the autoencoder's latent distribution.
Tackling the Generative Learning Trilemma with Denoising Diffusion GANs
Addresses the case where the original diffusion formulation is no longer trainable; see Chapter 6 of the theory notes.
When the step size is large,
A discriminator is used to distinguish
The reverse process of DDPMs can also be interpreted as
GANs are known to suffer from training instability and mode collapse, and some possible reasons include the difficulty of directly generating samples from a complex distribution in one-shot, and the overfitting issue when the discriminator only looks at clean samples. Our model breaks the generation process into several conditional denoising diffusion steps in which each step is relatively simple to model, due to the strong conditioning on
DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents
epsilon-VAE: Denoising as Visual Decoding
Boosting Latent Diffusion with Flow Matching
LDM inference cost grows quadratically with image resolution.
Flow Matching is used to model the mapping between upsampled low-resolution latents and high-resolution latents: a low-resolution LDM generates the sample, and Flow Matching lifts it to high resolution.
Ordinary Flow Matching is defined between the data distribution and a Gaussian; here it is defined between data pairs, hence Coupling Flow Matching.
Ultra-High-Resolution Image Synthesis with Pyramid Diffusion Model
Progressively generates high-resolution images, similar in spirit to FM-Boosting.
High-Resolution Image Synthesis with Latent Diffusion Models
LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models
We leverage the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality.
Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models
Simple Diffusion: End-to-end diffusion for high resolution images
BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion
compact UNet: fewer blocks in the down and up stages (Base), further removal of the entire mid-stage (Small), further removal of the innermost stages (Tiny).
distillation-based retraining: besides the diffusion loss, the pretrained large StableDiffusion is used for output-level distillation (MSE loss between outputs given the same input) and feature-level distillation (MSE loss between network features given the same input).
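A hedged sketch of the two distillation terms described above (the `return_features` hook and which features are matched are illustrative assumptions, not BK-SDM's exact code):
```python
import torch
import torch.nn.functional as F

def distill_losses(student, teacher, x_t, t, cond, feat_pairs):
    """Output-level + feature-level distillation on the same noisy input."""
    with torch.no_grad():
        t_out, t_feats = teacher(x_t, t, cond, return_features=True)   # assumed hook API
    s_out, s_feats = student(x_t, t, cond, return_features=True)
    out_kd = F.mse_loss(s_out, t_out)                                   # output-level KD
    feat_kd = sum(F.mse_loss(s_feats[i], t_feats[j]) for i, j in feat_pairs)
    return out_kd, feat_kd
```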
KOALA: Fast and Memory-Efficient Latent Diffusion Models via Self-Attention Distillation
Same procedure as BK-SDM.
Feature-level distillation is further refined: distilling features from different modules was tested, and distilling the self-attention outputs works best, with the self-attention features at the decoder's early blocks being the most effective.
DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention
BLAST: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference
Scalable, Tokenization-Free Diffusion Model Architectures with Efficient Initial Convolution and Fixed-Size Reusable Structures for On-Device Image Generation
SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices
lightweight model architecture + DiffusionGAN + distillation
Spiking Diffusion Models
Spiking-Diffusion: Vector Quantized Discrete Diffusion Model with Spiking Neural Networks
Spiking Denoising Diffusion Probabilistic Models
Fully Spiking Denoising Diffusion Implicit Models
SDiT: Spiking Diffusion Model with Transformer
AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration
Lightweight Diffusion Models with Distillation-Based Block Neural Architecture Search
Denoising Diffusion Step-aware Models
Different steps have different importance; there is no need to use a large model at every step.
slimmable network: a neural network that can be executed at arbitrary model sizes.
Search for the optimal sampling strategy that uses models of different sizes at different steps, reducing computation.
ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection
A theoretical analysis, from the perspective of feature norms, of how the coefficients on the UNet's long skip connections affect training.
Q-DM: An Efficient Low-bit Quantized Diffusion Model
Diffusion Models Without Attention
Analyzing and Improving the Training Dynamics of Diffusion Models
We update all of the operations (e.g., convolutions, activations, concatenation, summation) to maintain magnitudes on expectation.
ReDistill: Residual Encoded Distillation for Peak Memory Reduction
Reducing peak memory, which is the maximum memory consumed during the execution of a neural network, is critical to deploy neural networks on edge devices with limited memory budget. We propose residual encoded distillation (ReDistill) for peak memory reduction in a teacher-student framework, in which a student network with less memory is derived from the teacher network using aggressive pooling.
Quantum Denoising Diffusion Models
Quantum Generative Diffusion Model
Towards Efficient Quantum Hybrid Diffusion Models
Quantum Hybrid Diffusion Models for Image Synthesis
Enhancing Quantum Diffusion Models with Pairwise Bell State Entanglement
Mixed-State Quantum Denoising Diffusion Probabilistic Model
Quantum computing.
Optical Diffusion Models for Image Generation
Optics.
Post-training Quantization on Diffusion Models
Q-Diffusion: Quantizing Diffusion Models
BiDM: Pushing the Limit of Quantization for Diffusion Models
Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models
Efficient Quantization Strategies for Latent Diffusion Models
Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion
TerDiT: Ternary Diffusion Models with Transformers
VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers
PTQ4DiT: Post-training Quantization for Diffusion Transformers
Diffusion Product Quantization
DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing
HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
StableQ: Enhancing Data-Scarce Quantization with Text-to-Image Data
BinaryDM: Towards Accurate Binarization of Diffusion Model
Towards Accurate Post-training Quantization for Diffusion Models
StepbaQ: Stepping backward as Correction for Quantized Diffusion Models
TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models
EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models
Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models
Memory-Efficient Personalization using Quantized Diffusion Model
fine-tune quantized diffusion model
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization
MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models
COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization
QNCD: Quantization Noise Correction for Diffusion Models
Timestep-Aware Correction for Quantized Diffusion Models
TCAQ-DM: Timestep-Channel Adaptive Quantization for Diffusion Models
Temporal Feature Matters: A Framework for Diffusion Model Quantization
Structural Pruning for Diffusion Models
Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection
Successfully Applying Lottery Ticket Hypothesis to Diffusion Model
LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models
LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights
LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging
DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization
Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion
Singular Value Scaling: Efficient Generative Model Compression via Pruned Weights Refinement
DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture
The student is trained so that its output (predicted noise) aligns with the teacher's output.
Improved Denoising Diffusion Probabilistic Models
They achieve similar sample quality using either
Besides the diffusion loss, the variational lower-bound term is additionally optimized (to learn the reverse-process variance)
sampling
Progressive Distillation for Fast Sampling of Diffusion Models
v-prediction, which in theory is equivalent to
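For reference, the v-parameterization as defined in the Progressive Distillation paper (variance-preserving setting, $x_t=\alpha_t x_0+\sigma_t\epsilon$ with $\alpha_t^2+\sigma_t^2=1$):
```latex
v_t = \alpha_t\,\epsilon - \sigma_t\,x_0,
\qquad
\hat{x}_0 = \alpha_t\,x_t - \sigma_t\,\hat{v}_\theta(x_t,t),
\qquad
\hat{\epsilon} = \sigma_t\,x_t + \alpha_t\,\hat{v}_\theta(x_t,t).
```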
Flow Matching for Generative Modeling
Built on Continuous Normalizing Flows (Neural ODEs). Training CNFs requires transforming data samples with the model (ODE simulations) and computing the KL divergence between the transformed samples and a standard Gaussian; flow matching is simulation-free because the ODE path is defined in advance.
Diffusion models and score-based models, with respect to
We find this training alternative to be more stable and robust than existing score-matching approaches in our experiments.
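The conditional flow matching objective, written here for the common linear interpolation path as a reference for the simulation-free claim above (notation is mine):
```latex
\mathcal{L}_{\mathrm{CFM}}(\theta)
 = \mathbb{E}_{t,\,x_0\sim p_0,\,x_1\sim p_1}
   \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2,
\qquad
x_t = (1-t)\,x_0 + t\,x_1 .
```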
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Can model a mapping between any two distributions.
Randomly pair data drawn from the two distributions, interpolate linearly between them, and regress the constant velocity along the interpolation path.
After each round of training, sample from the current flow to obtain coupled data, then retrain on these couplings; iterating this correction ("reflow") straightens the trajectories into non-crossing, straight flows.
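A compact sketch of one rectified-flow training step and the reflow re-pairing loop described above (`model` and its call signature are assumptions; simplified Euler integration):
```python
import torch
import torch.nn.functional as F

def rf_training_step(model, x0, x1):
    """x0 ~ source distribution (e.g., noise), x1 ~ target distribution (data)."""
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1                 # linear interpolation path
    target_v = x1 - x0                          # constant velocity along the path
    return F.mse_loss(model(x_t, t.flatten()), target_v)

@torch.no_grad()
def reflow_pairs(model, x0, num_steps=100):
    """Simulate the current flow to build straighter couplings (x0, x1_hat) for the next round."""
    x, dt = x0.clone(), 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * model(x, t)                # Euler step along the learned ODE
    return x0, x                                # retrain on these couplings (reflow)
```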
Elucidating the Design Space of Diffusion-Based Generative Models
preconditioning: As the input
augmentation: To prevent potential overfitting that often plagues diffusion models with smaller datasets, we apply various geometric transformations to a training image prior to adding noise. To prevent the augmentations from leaking to the generated images, we provide the augmentation parameters as a conditioning input to
Variational Diffusion Models
efficient optimization of the noise schedule jointly with the rest of the model
Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation
DiffEnc: Variational Diffusion with a Learned Encoder
Image generation with shortest path diffusion
Improved Denoising Diffusion Probabilistic Models
Unlike the linear schedule, the cosine schedule directly defines \bar{\alpha}_t via a cosine function.
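The cosine schedule from Improved DDPM, for reference:
```latex
\bar{\alpha}_t = \frac{f(t)}{f(0)},
\qquad
f(t) = \cos\!\Big(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\Big)^{2},
\quad s = 0.008,
\qquad
\beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}\;(\text{clipped at }0.999).
```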
Improved Noise Schedule for Diffusion Training
Common Diffusion Noise Schedules and Sample Steps are Flawed
In the noise schedule used by StableDiffusion, the final noising step is given by
To fix this, the schedule must have zero terminal SNR, i.e.,
After switching to a noise schedule with zero terminal SNR,
Combining the two points above, the existing StableDiffusion can be fine-tuned with the corrected noise schedule and v-prediction, with consistent results.
Rescale Classifier-Free Guidance: with a zero-terminal-SNR noise schedule, the original CFG becomes sensitive and causes over-exposed images, so the CFG output is rescaled.
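A hedged sketch of the guidance-rescale step described above, as I understand it from the paper (the blending weight `phi` is illustrative):
```python
import torch

def rescaled_cfg(noise_cond, noise_uncond, guidance_scale, phi=0.7):
    """Classifier-free guidance with std-based rescaling to reduce over-exposure."""
    noise_cfg = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    dims = list(range(1, noise_cond.ndim))
    std_cond = noise_cond.std(dim=dims, keepdim=True)
    std_cfg = noise_cfg.std(dim=dims, keepdim=True)
    noise_rescaled = noise_cfg * (std_cond / std_cfg)     # match the conditional std
    return phi * noise_rescaled + (1 - phi) * noise_cfg   # partial rescale
```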
Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models
Existing models cannot generate very bright or very dark images.
Because existing models lack zero terminal SNR, they are essentially
At sampling time the first step starts from a Gaussian and samples a
Score-Optimal Diffusion Schedules
Perception Prioritized Training of Diffusion Models
Debias the Training of Diffusion Models
Similar to P2-weighting.
Uses
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training
Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training.
We design an asymmetric time step sampling strategy that reduces the frequency of time steps from the convergence area while increasing the sampling probability for time steps from other areas.
Beta-Tuned Timestep Diffusion Model
The distribution variations are non-uniform throughout the diffusion process and the most drastic variations in distribution occur in the initial stages.
We propose a novel timestep sampling strategy that utilizes the beta distribution.
B-TTDM not only improves the quality of the generated samples but also speedups the training process.
Adaptive Non-uniform Timestep Sampling for Accelerating Diffusion Model Training
Multi-Architecture Multi-Expert Diffusion Models
Addressing Negative Transfer in Diffusion Models
eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Not All Steps are Equal: Efficient Generation with Progressive Diffusion Models
Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture
Denoising Task Routing for Diffusion Models
Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising
Because a diffusion model must handle all noise levels, it needs a large number of parameters.
Different timesteps (experts) use different specialized networks (architectures), lowering the learning difficulty and reducing parameter count.
Decouple-Then-Merge: Towards Better Training for Diffusion Models
Split the timesteps evenly into
Dynamic Dual-Output Diffusion Models
Bring Metric Functions into Diffusion Models
Denoising Task Difficulty-based Curriculum for Training Diffusion Models
Different timesteps of a diffusion model have different learning difficulty. Timesteps are split evenly into 20 intervals and a separate model is trained on each (20 in total), and their convergence speeds are compared. Both in loss and in generation quality (via mixed sampling: a normally trained diffusion model is used, and the interval-specific model is substituted only within its interval), larger timesteps converge faster.
Curriculum Learning: a method of training models in a structured order, starting with easier tasks or examples and gradually increasing difficulty. Accordingly, after partitioning the timesteps, training starts from the last (largest-timestep) interval and proceeds toward earlier intervals; at each stage the previously trained intervals are still included to avoid forgetting.
Faster convergence and better generation quality.
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Somewhat similar to MaskDiT's MAE loss; it improves generation quality.
Masked Diffusion Models are Fast Learners
Uses a U-ViT architecture (pixel space) with mask ratios up to 90%; converges 4x faster than DDPM with better generation quality.
Masked Diffusion Transformer is a Strong Image Synthesizer
Uses a DiT architecture (latent space). To address the distribution shift between training (masked) and inference (unmasked), a side-interpolater fills in the masked tokens during training; converges 3x faster than DiT with better generation quality.
Fast Training of Diffusion Models with Masked Transformers
Uses a DiT architecture (latent space); the DiT encoder can be scaled up while the DiT decoder uses a fixed
Predicting the score of invisible tokens from visible tokens alone is too difficult, so the diffusion loss is split: the diffusion loss is applied to visible tokens, while invisible tokens use an MSE loss against the corresponding noisy patches (note: the target is the noisy invisible patch itself, not its noise or the clean image), similar to MaskDM + MAE.
With mask ratios up to 50%, it converges 3x faster than DiT while reaching the same generation quality.
The MAE part is essential: without it, generation quality drops substantially, but if the MAE loss coefficient is too large it also hurts generation, so the coefficient must be chosen carefully. Without the MAE reconstruction task, the training easily overfits the local subset of unmasked tokens as it lacks a global understanding of the full image, making the gradient update less informative. Understanding aids generation.
SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer
The DiT encoder can be scaled up while the DiT decoder uses a fixed
Instead of inserting learnable mask tokens into the DiT decoder's input, the invisible patches themselves are inserted; the diffusion loss is computed on all patches, rather than predicting invisible patches only from visible tokens as in MaskDiT.
This removes the MAE component, and without the understanding signal it can no longer aid generation, so a self-distillation module is introduced: the last-layer output at each encoder token is passed through an MLP + softmax to predict a
Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models
For
EDM only sees local patches and may have not captured the global cross-region dependency between local patches, in other words, the learned scores from nearby patches should form a coherent score map to induce coherent image sampling. To resolve this issue, we propose two strategies: 1) random patch sizes and 2) involving a small ratio of full-size images.
At sampling time, patches are sampled separately and stitched together.
Through Patch Diffusion, we could achieve
Patched Denoising Diffusion Models For High-Resolution Image Synthesis
Rather than using entire complete images for training, our model only takes patches for training and inference and uses feature collage to systematically combine partial features of neighboring patches.
During training and inference, the first approach is to
Some tasks are themselves naturally a process, and different Markov transition chains can be designed for training.
ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories
Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing
GUD: Generation with Unified Diffusion
The choice of representation in which the diffusion process operates (e.g. pixel-, PCA-, Fourier-, or wavelet-basis).
The prior distribution that data is transformed into during diffusion.
The scheduling of noise levels applied separately to different parts of the data, captured by a component-wise noise schedule.
CARD: Classification and Regression Diffusion Models
The formulation is similar to PriorGrad; the diffusion model outputs the regression value or the classification probabilities.
ExposureDiffusion: Learning to Expose for Low-light Image Enhancement
Models the camera exposure process as a diffusion process.
Residual Denoising Diffusion Models
Similar to ResShift.
Beta Diffusion
Uses Beta distributions and optimizes KL-divergence upper bounds.
Fast Diffusion Model
Connects diffusion with SGD and introduces momentum to speed up training and sampling.
Directly Denoising Diffusion Model
DDDMs train the diffusion model conditioned on an estimated target that was generated from previous training iterations of its own.
Define
DeeDiff: Dynamic Uncertainty-Aware Early Exiting for Accelerating Diffusion Model Generation
early exiting策略:The basic assumption of early exiting is that the input samples of the test set are divided into easy and hard samples. The computation for easy samples terminates once some conditions are satisfied.
Using U-ViT, an uncertainty estimation module (UEM) is trained for each layer's output to assess how uncertain that layer would be if used as the final output. The UEM is an MLP predicting a scalar, trained to regress the MSE between the current layer's output and the last layer's output.
At inference, at each sampling step, as soon as some layer's uncertainty falls below a given threshold, that layer's output is used as the final output, achieving acceleration.
Diffusion Model Patching via Mixture-of-Prompts
For each block of a pretrained DiT, an extra set of parameters is trained
The same prompts are used for each block throughout the training, thus they will learn knowledge that is agnostic to denoising stages. To patch the model with stage-specific knowledge, we introduce dynamic gating. This mechanism blends prompts in varying proportions based on the noise level of an input image. A gating network is learned that
Compensation Sampling for Improved Convergence in Diffusion Models
An extra UNet is trained to predict the compensation term.
Slight Corruption in Pre-training Data Makes Better Diffusion Models
Similar to CADS in that the condition is perturbed, but CADS operates only at sampling time whereas CEP operates during training.
Preliminary experiments: to introduce synthetic corruption into the conditions, we randomly flip the class label into a random class for IN-1K, and randomly swap the text of two sampled image-text pairs for CC3M. As a result, class- and text-conditional models pre-trained with slight corruption achieve significantly lower FID and higher IS and CLIP score. More corruption in pre-training can potentially lead to quality and diversity degradation: as the corruption level increases, almost all metrics first improve and then degrade; however, the degraded metrics with more corruption are sometimes still better than those of the clean models.
More generally, we propose to directly add perturbation to the conditional embeddings of DMs, termed conditional embedding perturbation. The approach adds to the condition embedding a noise following a
Structure-Guided Adversarial Training of Diffusion Models
Besides the diffusion loss, within each batch
Using a pretrained encoder network would lead to shortcuts, so adversarial training is introduced: the encoder is trained to maximize the discrepancy between the two distances above (effectively distinguishing fake from real manifold structure).
Improving Diffusion-Based Image Synthesis with Context Prediction
Besides the conventional diffusion loss (self-denoising), which uses
At sampling time only the self-denoising network is used, the same as a conventional diffusion model.
Learning Quantized Adaptive Conditions for Diffusion Models
An autoencoder similar to Diff-AE and PDAE, except that a BSQ code is used as the representation, so no post-training modeling of the latent is needed; although such a condition cannot fully reconstruct the image, it at least provides some information.
At sampling time a random binary vector code is used as the condition, which speeds up sampling and improves sample quality.
Training Data Synthesis with Difficulty Controlled Diffusion Model
Similar to how CAD feeds coherence as an extra condition into diffusion training, here difficulty is fed as an extra condition, allowing controlled generation of images of different complexity.
The Disappearance of Timestep Embedding in Modern Time-Dependent Neural Networks
Explores better ways of injecting the timestep embedding.
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators
This paper identifies significant redundancy in the query-key interactions within self-attention mechanisms of diffusion transformer models, particularly during the early stages of denoising diffusion steps.
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately
Diffusion Tuning: Transferring Diffusion Models via Chain of Forgetting
Chain of Forgetting:
During transfer, some data from the original dataset also participates in training. For the original-dataset data, the diffusion loss coefficient varies with
Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models
Multi-scale Generative Modeling for Fast Sampling
We propose a multi-scale generative modeling in the wavelet domain that employs distinct strategies for handling low and high-frequency bands. In the wavelet domain, we apply score-based generative modeling with well-conditioned scores for low-frequency bands, while utilizing a multi-scale generative adversarial learning for high-frequency bands.
Learning to Efficiently Sample from Diffusion Probabilistic Models
Learning to Schedule in Diffusion Probabilistic Models
AdaDiff: Adaptive Step Selection for Fast Diffusion
A set of candidate step counts is predefined; a lightweight step-selection network is trained to pick a step count from the set based on the text embedding, the generation result is scored, and the network is optimized with policy gradients.
An extra loss encourages small step counts.
BudgetFusion: Perceptually-Guided Adaptive Diffusion Models
Collect a set of prompts; for each prompt, use the same
Jump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models
Censored Sampling of Diffusion Models Using 3 Minutes of Human Feedback
Train a time-dependent reward model and, at sampling time, use the score of the time-dependent reward function as guidance.
Align Your Steps: Optimizing Sampling Schedules in Diffusion Models
Learning to Discretize Denoising Diffusion ODEs
Directly optimizing
Optimizing Few-step Sampler for Diffusion Probabilistic Model
Beta Sampling is All You Need: Efficient Image Generation Strategy for Diffusion Models using Stepwise Spectral Analysis
Certain steps exhibit significant changes in image content, while others contribute minimally, measured from adjacent timesteps'
Instead of the traditional uniform distribution-based time step sampling, we introduce a Beta distribution-like sampling technique that prioritizes critical steps in the early and late stages of the process: small step sizes and more steps early and late, larger step sizes and fewer steps in the middle.
Distillation methods fall roughly into five categories: Direct Distillation, Progressive Distillation, Adversarial Distillation, Score Distillation (DI), and Consistency Distillation.
Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed
direct distillation,
Essentially, the teacher is used to construct (noise, sample) training pairs for the student.
Accelerating Diffusion Models with One-to-Many Knowledge Distillation
Distillation is performed separately for different timestep segments.
SDXL
We can significantly improve the quality of direct distillation by (1) scaling up the size of the ODE pair dataset and (2) using a perceptual loss, not MSE loss.
A VGG network is retrained on SDXL's latent space, and the LPIPS loss between the student-generated latent and the teacher-generated latent is optimized.
Besides the LPIPS loss, adversarial training is also used, with a multi-scale discriminator similar to GigaGAN's.
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
Main conclusion: 2-Rectified Flow is a better teacher for distilling a one-step student model than the original SD.
First train a k-Rectified Flow from StableDiffusion, then apply direct distillation to that ReFlow (one step fits many steps).
Progressive Distillation for Fast Sampling of Diffusion Models
The student is trained so that one sampling step matches the effect of multiple teacher sampling steps.
v-prediction: during distillation,
On Distillation of Guided Diffusion Models
Distill a CFG teacher into a student.
Stage 1: train a student with the same number of steps as the teacher, taking the guidance strength as an extra condition; the guidance strength is sampled uniformly at random.
Stage 2: as in PD, iteratively train students with fewer steps.
At sampling time, one can call
SFDDM: Single-fold Distillation for Diffusion models
PD halves the number of steps each round until the target step count is reached (multi-fold); SFDDM does it in a single round (single-fold).
Essentially, the student is just a DDPM with very few steps, except that it is supervised by the teacher DDPM; it is unclear whether this works better than training it directly.
SDXL-Lightning: Progressive Adversarial Diffusion Distillation
During PD, an adversarial loss is used instead of MSE.
The discriminator is
TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation
SpeedUpNet: A Plug-and-Play Hyper-Network for Accelerating Text-to-Image Diffusion Models
An extra trainable cross-attention is added to StableDiffusion to interact with the negative prompt.
Two losses, one for single-step and one for multi-step; are they in conflict?
Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models
Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation
For a given
Forward distillation: based on
Tackling the Generative Learning Trilemma with Denoising Diffusion GANs
Semi-Implicit Denoising Diffusion Models
Improves DDGAN.
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
Improves SIDDM.
You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs
define a sequence distribution of clean data
However, adversarial training directly on clean data cannot avoid the usual difficulties of GAN training. To address this, DDGANs train adversarially on corrupted data, but such an approach fails to directly match
Inspired by self-cooperative learning, YOSO still performs adversarial training on clean data, but uses
Before training, the pretrained diffusion model is first fine-tuned: the first stage switches to v-prediction, the second stage changes the noise schedule to achieve zero terminal SNR; the resulting model is then fine-tuned directly or with LoRA as
HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation
Generate images with StableDiffusion at different step counts, extract low- and high-frequency components with the Fourier transform, recombine different high/low-frequency parts, and invert back to images; this shows that the main reason one-step generation is poor is that its high-frequency content is not good enough.
LoRA-finetune StableDiffusion so that the high-frequency part of the one-step LoRA+SD generation matches the high-frequency part of SD's multi-step generation as closely as possible.
Adversarial Diffusion Distillation
The student network is initialized from the teacher, and the student's number of steps is set to
GAN loss:We use a frozen pretrained feature network and a set of trainable lightweight discriminator heads. The trainable discriminator heads are applied on features at different layers of the feature network.
distillation loss: Notably, the teacher is not directly applied on generations of the ADD-student but instead on diffused inputs, as non-diffused inputs would be out-of-distribution for the teacher model. That is, first sample
Training uses only one-step generation; sampling uses
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
In the VAE latent space, the teacher network generates samples
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
Stabilizes the adversarial training in adversarial distillation.
The discriminator consists of a frozen teacher UNet encoder and trainable lightweight heads; generated images are noised to the same timestep and fed into the discriminator for real/fake classification.
We use a dynamic discriminator pool to source these discriminator heads. The heads are timestep-specific, each responsible for a particular timestep; at each training step a batch of heads (of the same timestep) is randomly drawn from the pool, and after the heads are updated they are returned to the pool. The stochasticity of this process through random sampling ensures varied feedback, preventing any single head from dominating the generator's learning and reducing bias. This diversifies feedback and enhances stability in GAN training.
After each training step, 1% of the heads in the pool are randomly discarded and replaced with the same number of freshly re-initialized heads; refreshing discriminator subsets helps maintain a balance between stable feedback from retained heads and variability from re-initialized ones to enhance generator performance.
SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation
Substitute the NeRF rendering with a text-to-image generator that can directly synthesize a text-guided image in one step, effectively converting the text-to-3D generation training into one-step diffusion model distillation.
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models
We have a pre-trained diffusion model with the multi-level score net denoted as
We aim to train an implicit model
In order to receive supervision from the multi-level score functions
The IKL is tailored to incorporate knowledge of pre-trained diffusion models in multiple time levels. It generalizes the concept of KL divergence to involve all time levels of the diffusion process.
Under the same diffusion process, starting from the two distributions respectively, their IKL is optimized. Taking the gradient of the IKL with respect to
The SDS algorithm is a special case of Diff-Instruct when the generator's output is a Dirac delta distribution with learnable parameters. If
ADD can be seen as a combination of Diff-Instruct and adversarial training.
SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
BK-SDM is used to slim down the student model.
Diff-Instruct is used to train the student model, on
One-step Diffusion with Distribution Matching Distillation
The Distribution Matching Loss is exactly DI's IKL.
As the distribution of our generated samples changes throughout training, we dynamically adjust the fake diffusion model; this is why an extra diffusion model must be trained. The fake diffusion model and the one-step generator are trained jointly.
Improved Distribution Matching Distillation for Fast Image Synthesis
Multi-step generator (999, 749, 499, 249); like CM, it alternates between denoising and noise injection steps, e.g., from
To avoid the training/inference mismatch, the training input is not a noised training-set image but is instead produced with the above procedure using
Removing the regression loss: true distribution matching and easier large-scale training.
Stabilizing pure distribution matching with a Two Time-scale Update Rule. fake diffusion model和few-step generator是分开训练的。
Surpassing the teacher model using a GAN loss and real data.
Multistep Distillation of Diffusion Models via Moment Matching
A random variable
Moment matching: fit a probability distribution by matching its moments, e.g., the mean, variance, etc.
Generalized Method of Moments (GMM): define an arbitrary function
distill pre-trained diffusion model
In the case of one-step sampling, our method is a special case of Diff-Instruct, which distill a diffusion model by approximately minimizing the KL divergence between the distilled generator and the teacher model.
Flow Generator Matching
Feels like DI applied on top of flow matching.
Consistency Models
In the concrete implementation, CM is trained on a 40-step EDM, so
For simplicity, we only consider one-step ODE solvers in this work. It is straightforward to generalize our framework to multistep ODE solvers and we leave it as future work.
CM's training objective is the consistency between two adjacent steps, not the diffusion model's reconstruction of
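The consistency distillation objective, for reference: one ODE-solver step with the teacher maps $x_{t_{n+1}}$ to $\hat{x}_{t_n}$, and the student's outputs at the two adjacent points are made to agree ($\theta^-$ is an EMA copy):
```latex
\mathcal{L}_{\mathrm{CD}} =
\mathbb{E}\Big[\, \lambda(t_n)\, d\big( f_\theta(x_{t_{n+1}}, t_{n+1}),\;
  f_{\theta^-}(\hat{x}^{\phi}_{t_n}, t_n) \big) \Big],
\qquad
\hat{x}^{\phi}_{t_n} = \texttt{Solver}_\phi(x_{t_{n+1}}, t_{n+1}, t_n).
```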
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Since
LCM-LoRA: A Universal Stable-Diffusion Acceleration Module
StableDiffusion + LoRA serves as the Consistency Model.
Reward Guided Latent Consistency Distillation
Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion
Essentially the same as TCD, but framed differently: TCD starts from the latter two steps and then introduces the first, while CTM starts from the first and last steps and then introduces a middle one.
TCD plagiarizes CTM.
Phased Consistency Models
The learning objectives of CTMs are redundant, including many trajectories that will never be applied for inference.
Split the diffusion trajectory into
Generalized Consistency Trajectory Models for Image Manipulation
CTMs only allow translation from Gaussian noise to data. This work aims to unlock the full potential of CTMs by proposing generalized CTMs, which translate between arbitrary distributions via ODEs.
Flow Matching is another technique for learning PFODEs between two distributions; CTMs are applied on the PFODEs learned by Flow Matching.
Supports translation, editing, and more.
Simple and Fast Distillation of Diffusion Models
It can speed up TCD training.
Trajectory Consistency Distillation
Define
The left-hand side is CM's multi-step sampling; the right-hand side is TCD's. Each step of CM's multi-step sampling predicts all the way to
SDXL is fine-tuned with LoRA.
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
PD and TCD are combined into what is called Progressive Consistency Distillation: first split the
Consistency distillation uses an adversarial loss and an MSE loss. Empirically, we observe that MSE Loss is more effective when the predictions and target values are proximate (e.g., for
After training, DMD is further used for enhancement.
SDXL is fine-tuned with LoRA.
Multistep Consistency Models
Similar in spirit to TCD, but unlike it, MCM does not redefine
SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation
Uses an SDE solver instead of an ODE solver.
Uses multi-step SDE.
The consistency model is parameterized to predict a mean and variance, so its output is a distribution, optimized with a KL divergence.
Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation
One Step Diffusion via Shortcut Models
Train a single model that supports different sampling budgets, by conditioning the model not only on the timestep
Self-consistency property:
Consistency Flow Matching: Defining Straight Flows with Velocity Consistency
An application of CM to flow matching.
Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps
The CD idea is used to learn a model fCD for inversion; CD maps
fCD has two problems: one is that what it needs to learn
CD and fCD are trained simultaneously, with two extra preservation losses: sample some boundary timestep's
We train fCD and CD separately from each other but initialize them with the same teacher model.
For fCD, we consider the unguided model with a constant
Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation
Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step
SiD with Adversarial Loss
Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation
Samplers can be divided into single-step and multi-step. Single-step samplers predict the next state from the current state only, e.g., DDIM, EDM, DPM-Solver; they are simple to implement and can self-start. Multi-step samplers additionally use past states to predict the next state, e.g., PNDM, DEIS; they give more accurate estimates and better results.
Denoising Diffusion Implicit Models
Pseudo Numerical Methods for Diffusion Models on Manifolds
DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
Distilling ODE Solvers of Diffusion Models into Smaller Steps
We observe that predictions from neighboring timesteps exhibit high correlations in both denoising networks, with cosine similarities close to one. This observation suggests that denoising outputs contain redundant and duplicated information, allowing us to skip the evaluation of denoising networks for most timesteps.
We can combine the history of denoising outputs to better represent the next output, effectively reducing the number of steps required for accurate sampling. This idea is implemented in most ODE solvers, which are formulated based on the theoretical principles of solving differential equations. These solvers often adopt linear combinations or multi-step approaches, leveraging previous denoising outputs to precisely estimate the current prediction.
Existing ODE methods, such as linear multistep methods, use fixed formulas for combining past predictions; D-ODE uses a set of learnable combination coefficients, and the student uses the
Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy
PFDiff: Training-free Acceleration of Diffusion Models through the Gradient Guidance of Past and Future
Based on two key observations: a significant similarity in the model’s outputs at time step size that is not excessively large during the denoising process of existing ODE solvers, and a high resemblance between the denoising process and SGD.
The prediction from a previous timestep (or a combination of several past predictions via an ODE formula) is reused directly as the current timestep's prediction, so the current timestep needs no NFE.
DeepCache: Accelerating Diffusion Models for Free
Every
Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule
Similar in spirit to DeepCache; uses NAS to search for potential inference schedules with non-uniform steps and structures.
Unraveling the Temporal Dynamics of the Unet in Diffusion Models
Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models
The encoder features exhibit a subtle variation at adjacent time-steps, whereas the decoder features exhibit substantial variations across different timesteps; therefore the UNet encoder outputs and features from a previous step can be reused and fed/skip-connected directly into the next step's UNet decoder.
The encoder feature change is larger in the initial inference phase compared to the later phases throughout the inference process, so reuse is concentrated in the middle and later stages of sampling.
Reuse can also span several consecutive steps, allowing those steps to be computed in parallel.
Cache Me if You Can: Accelerating Diffusion Models through Block Caching
UNet的block输出具有三个特点:smooth change over time, distinct patterns of change, small step-to-step difference. A lot of blocks are performing redundant computations during steps where their outputs change very little. Instead of computing new outputs at every step, we reuse the cached outputs from a previous step. Due to the nature of residual connections, we can perform caching at a per block level without interfering with the flow of information through the network otherwise.
Outputs of certain blocks from previous timesteps are reused to reduce computation.
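A minimal sketch of the caching idea shared by DeepCache / Block Caching (the `return_cache` / `reuse_cache` interface, `cache_every`, and `ddim_step` are hypothetical, not any specific implementation):
```python
import torch

@torch.no_grad()
def sample_with_block_cache(unet, x, timesteps, cond, cache_every=5):
    """Recompute the cached (deep) blocks only every `cache_every` steps; reuse otherwise."""
    cache = None
    for i, t in enumerate(timesteps):
        if cache is None or i % cache_every == 0:
            eps, cache = unet(x, t, cond, return_cache=True)   # full forward, store block outputs
        else:
            eps, _ = unet(x, t, cond, reuse_cache=cache)       # shallow layers only, reuse cache
        x = ddim_step(x, eps, t)                               # assumed sampler update
    return x
```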
Accelerating Vision Diffusion Transformers with Skip Branches
Delta-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers
Note the scissor marks in the figure: the methods differ in which part of the computation is skipped.
Unlike previous methods,
Accelerating Diffusion Transformers with Dual Feature Caching
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models
This study reveals that, in text-to-image diffusion models, cross-attention is crucial only in the early inference steps, allowing us to cache and reuse the cross-attention map in later steps.
This saves the computation of the cross-attention map, the most expensive part.
DiTFastAttn: Attention Compression for Diffusion Transformer Models
Post-training compression for self-attention. It can be applied to DiT on ImageNet as well as to the text-to-image PixArt.
Self-attention values concentrate within a window along the diagonal region of the attention matrix. The residual of the self-attention map is computed during the first two sampling steps; afterwards only window self-attention near the diagonal is computed, and this residual is added back to form the final self-attention map.
Self-attention sharing directly shares self-attention results across steps.
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching
Train
During training, two adjacent timesteps are sampled
At inference, a certain layer's
HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration
Improves L2C.
LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers
A similarity estimator is learned for each MHSA and FeedForward module; when the similarity exceeds a threshold, the module is skipped and the cached result from the previous step is reused.
Task-Oriented Diffusion Model Compression
Acceleration specifically for image-to-image translation tasks, e.g., InstructPix2Pix image editing and StableSR image restoration.
Depth-skip compression: same as "(b) Removing deconv blocks" in the Unraveling paper.
Timestep optimization: biased timestep selection
Token Merging: Your ViT But Faster
Token Merging for Fast Stable Diffusion
Importance-based Token Merging for Diffusion Models
Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
Redundant tokens are identified from the attention map and merged.
ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
Similar to PixArt-
Tokens in close spatial proximity exhibit higher similarity, thus providing a basis for merging without the extensive computation of pairwise similarities.
We employ a downsampling function using the Nearest-Neighbor algorithm to the keys and values of the attention mechanism while preserving the original queries.
Token Caching for Diffusion Transformer Acceleration
Accelerating Diffusion Transformers with Token-wise Feature Caching
Essentially the same as the UNet feature-cache approaches, only applied to tokens instead.
Refining Generative Process with Discriminator Guidance in Score-Based Diffusion Models
Generate a batch of samples with the pretrained diffusion model, while recording the generation process's
Sample a batch from the real dataset and noise it to a random timestep; sample a batch from the generated dataset and take the same timestep's
At sampling time, use
Diffusion Rejection Sampling
Rejection sampling is used to refine each sampling step, taking
The probability is ultimately computed with DG's time-dependent discriminator.
Score-based Generative Models with Adaptive Momentum
Similar to FDM but requires no retraining: motivated by Stochastic Gradient Descent (SGD) optimization methods and the close connection between the model's sampling process and SGD, we propose adaptive momentum sampling to accelerate the transformation process without introducing additional hyperparameters.
The Surprising Effectiveness of Skip-Tuning in Diffusion Sampling
Let the UNet encoder features be
Using the simplest strategy, define the innermost
Take some images, noise them and denoise in a single step, and sum the diffusion loss over all steps: skip-tuning does not lower the diffusion loss, but it makes the one-step denoised images closer to the originals in feature space (extracted by InceptionV3, CLIP, etc.), so the FID improvement from skip-tuning comes from optimizing features. One can therefore take an existing model and add a trainable
Boosting Diffusion Models with Moving Average Sampling in Frequency Domain
The diffusion generation process is viewed as a parameter-optimization process, so a moving average can be introduced to improve stability and quality.
The denoising process often prioritizes reconstructing the low-frequency component (layout) in the earlier stage, and then focuses on recovering the high-frequency component (detail) later. Therefore, during IDWT each component is scaled: the low-frequency component is multiplied by a monotonically decreasing constant and the high-frequency components by a monotonically increasing one.
At the same number of steps, FID is better than DDIM.
Towards More Accurate Diffusion Model Acceleration with A Timestep Tuner
Similar to TS-DDPM: at each denoising step, we replace the original parameterization by conditioning the network on a new timestep, enforcing the sampling distribution towards the real one.
Residual Learning in Diffusion Models
Score-based generative models have two sources of error: discretization error and the score network's imperfect fit. A correction network can therefore be learned on top of a pre-trained diffusion model to fit this error.
Only on
DICE: Staleness-Centric Optimizations for Efficient Diffusion MoE Inference
Sampling acceleration targeted at diffusion models with MoE architectures.
Informed Correctors for Discrete Diffusion Models
A sampling algorithm for discrete diffusion models.
For a given timestep
Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps
We search for such a time step within a window surrounding the current time step to restrict the denoising progress.
Input Perturbation Reduces Exposure Bias in Diffusion Models
A Gaussian distribution is used to model what is fed into the network during training
DREAM: Diffusion Rectification and Estimation-Adaptive Models
Markup-to-Image Diffusion Models with Scheduled Sampling
When training the diffusion model, first start from
This method was originally used to address the exposure-bias problem in autoregressive text generation.
E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models
Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models
Generating High Fidelity Data from Low-density Regions using Diffusion Models
Don’t Play Favorites: Minority Guidance for Diffusion Models
Self-Guided Generation of Minority Samples Using Diffusion Models
Diffusion Models as Cartoonists! The Curious Case of High Density Regions
We propose a practical high probability sampler that consistently generates images of higher likelihood than usual samplers.
Manifold-Guided Sampling in Diffusion Models for Unbiased Image Generation
encourage the generated images to be uniformly distributed on the data manifold, without changing the model architecture or requiring labels or retraining.
Uses guidance.
BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference
Last-layer Laplace Approximation (LLLA) is used to estimate the uncertainty of samples generated by the diffusion model, which can indicate the level of clutter and the degree of subject prominence in the image. High-uncertainty samples tend to have cluttered backgrounds and can be filtered out.
CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling
Causes of low diversity: the model is trained on a small dataset; the CFG scale is too large.
The condition fed into the model is
For a class-conditional diffusion model,
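A hedged sketch of the condition-annealing trick as I understand it from CADS (the schedule parameters `tau1`, `tau2`, the noise scale, and the omitted rescaling step are simplifications):
```python
import torch

def anneal_condition(cond, t, tau1=0.6, tau2=0.9, noise_scale=0.25):
    """CADS-style annealing: heavily corrupt the condition early in sampling
    (large t, with t normalized to [0, 1]) and anneal back to the clean condition."""
    gamma = min(max((tau2 - t) / (tau2 - tau1), 0.0), 1.0)   # 0 for t >= tau2, 1 for t <= tau1
    noise = torch.randn_like(cond)
    return (gamma ** 0.5) * cond + noise_scale * ((1.0 - gamma) ** 0.5) * noise
```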
Fixed Point Diffusion Models
No new theory; the larger network in the middle of the DiT block is simply replaced with a smaller fixed-point
The number of fixed-point iterations can be adjusted dynamically according to accuracy requirements or compute budget.
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
Distributed inference.
Patch Parallelism (PP), where a single image is divided into patches and distributed across multiple GPUs for individual and parallel computations.
Using the fact that the difference between the images in adjacent diffusion steps is nearly zero, Patch Parallelism (PP) leverages multiple GPUs communicating asynchronously to compute patches of an image in multiple computing devices based on the entire image (all patches) in the previous diffusion step.
Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference
PCPP develops PP to reduce computation in inference by conditioning only on parts of the neighboring patches in each diffusion step, which also decreases communication among computing devices.
PCPP decreases the communication cost by around 70% compared to DistriFusion.
Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better
Evolutionary Search
Achieves an effect similar to ensembling.
Understanding Hallucinations in Diffusion Models through Mode Interpolation
Diffusion models smoothly “interpolate” between nearby data modes in the training set, to generate samples that are completely outside the support of the original training distribution; this phenomenon leads diffusion models to generate artifacts that never existed in real data (i.e., hallucinations).
Diffusion Models Beat GANs on Image Synthesis
A classifier must be trained on noisy data
Training Diffusion Classifiers with Denoising Assistance
When training the noisy classifier, the pre-trained diffusion model's predicted
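The classifier-guidance update from ADM, for reference: a noisy classifier $p_\phi(y\mid x_t)$ trained on corrupted inputs shifts the reverse-process mean,
```latex
\nabla_{x_t}\log p(x_t\mid y) = \nabla_{x_t}\log p(x_t) + \nabla_{x_t}\log p_\phi(y\mid x_t),
\qquad
\mu_\theta(x_t,t) \leftarrow \mu_\theta(x_t,t) + s\,\Sigma_\theta(x_t,t)\,\nabla_{x_t}\log p_\phi(y\mid x_t).
```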
Classifier-Free Diffusion Guidance
Gradient-Free Classifier Guidance for Diffusion Model Sampling
No Training, No Problem: Rethinking Classifier-Free Guidance for Diffusion Models
Training a conditional diffusion model no longer requires randomly dropping the condition (as a null condition).
The unconditional distribution can be computed from the conditional distribution, i.e.,
TSG:
Classifier-Free Guidance is a Predictor-Corrector
CFG can be viewed as a predictor-corrector sampling procedure in the sense of score-based generative models.
Compress Guidance in Conditional Diffusion Sampling
The denoising process can be viewed as gradient descent on a KL divergence.
Classifier-guided sampling can be viewed as a similar process.
Denoising Likelihood Score Matching for Conditional Score-Based Data Generation
Classification Diffusion Models: Revitalizing Density Ratio Estimation
Train a timestep classifier that, based on
It can be shown that
Simple Guidance Mechanisms for Discrete Diffusion Models
A CFG method for discrete diffusion models.
The recent focus of conditional diffusion research is how to incorporate the conditioning gradient during reverse sampling. This is because, for a given loss function
Use Tweedie's formula, based on
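The common recipe (DPS / FreeDoM style) written out: estimate the clean image with Tweedie's formula and guide with the gradient of a loss defined on that clean estimate,
```latex
\hat{x}_0(x_t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}},
\qquad
x_{t-1} \;\leftarrow\; x_{t-1} \;-\; \rho_t\,\nabla_{x_t}\,\ell\big(c,\;\hat{x}_0(x_t)\big).
```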
Improving Diffusion Models for Inverse Problems using Manifold Constraints
Diffusion Posterior Sampling for General Noisy Inverse Problems
FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
Training a distance function between noisy data and the condition and using its gradient as guidance is too expensive; instead, the noise predicted at each step can be used to compute a predicted clean sample, and an existing distance function between clean data and the condition can be used, i.e.:
This approach is common, but its results are unstable: it works well for small domains (e.g., faces) but poorly for large domains (ImageNet). The reason: the direction of the unconditional score generated by diffusion models in large data domains has more freedom, making it easier to deviate from the direction of conditional control.
Solution: use RePaint's resampling technique, repeatedly performing
Universal Guidance for Diffusion Models
Loss-Guided Diffusion Models for Plug-and-Play Controllable Generation
Manifold Preserving Guided Diffusion
Fisher Information Improved Training-Free Conditional Diffusion Model
Decoupling Training-Free Guided Diffusion by ADMM
Understanding Training-free Diffusion Guidance: Mechanisms and Limitations
Two improvement methods.
TFG: Unified Training-Free Guidance for Diffusion Models
GeoGuide: Geometric Guidance of Diffusion Models
For a random variable
data manifold
Elucidating The Design Space of Classifier-Guided Diffusion Generation
A calibration, but it only applies to off-the-shelf discrete classifiers.
Diffusion Models as Plug-and-Play Priors
Variational Inference
The sampling process is a point-estimate sampling of the introduced variational distribution, and also minimizes the negative ELBO, i.e., the KL divergence between the variational distribution and the true posterior.
Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis
Guidance with Spherical Gaussian Constraint for Conditional Diffusion
DSG enhanced DPS by normalizing gradients in the constraint guidance term and implementing a step size schedule inspired by Spherical Gaussians.
DreamGuider: Improved Training free Diffusion-based Conditional Generation
Computing the gradient does not require backpropagating through the diffusion network, which reduces computation.
Inspired by SGD, a dynamic scale is used, removing the need for handcrafted parameter tuning on a case-by-case basis.
Guiding a Diffusion Model with a Bad Version of Itself
Guiding a high-quality model with a poor model trained on the same task, conditioning, and data distribution, but suffering from certain additional degradations, such as low capacity and/or under-training.
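The guidance rule itself is the familiar CFG-style extrapolation, only with the degraded model in place of the unconditional branch. A minimal sketch (function and parameter names are placeholders, not the paper's code):

```python
def autoguidance(x_t, t, cond, good_model, bad_model, w=2.0):
    """Extrapolate away from a degraded ('bad') denoiser toward the main one,
    analogous to CFG but with the bad model replacing the unconditional branch."""
    d_good = good_model(x_t, t, cond)
    d_bad = bad_model(x_t, t, cond)   # e.g. a smaller or under-trained model, same task and conditioning
    return d_bad + w * (d_good - d_bad)
```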
Self-Improving Diffusion Models with Synthetic Data
Use self-synthesized data to provide negative guidance during the generation process to steer a model’s generative process away from the non-ideal synthetic data manifold and towards the real data distribution.
Train a diffusion model on the training set; once trained, use it to generate a dataset, then train another diffusion model on that generated data and use it for CFG-style guidance in the manner of AutoGuidance.
DDIM reverse process中的
Diffusion Models Already Have a Semantic Latent Space
根据
Fast Diffusion Sampler for Inverse Problems by Geometric Decomposition
用Tweedie's formula根据
这等价于对
CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models
Views conditional sampling of a diffusion model as an inverse problem conditioned on the measurement.
Uses DDS to solve this inverse problem: first sample unconditionally to compute
In the formula, replace the
CFG++ is exactly this asymmetric reverse process,
CFG++ has the same computational cost as CFG: at each step the diffusion model is first run unconditionally to compute
It can also be used for DDIM Inversion, which resolves
Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance
A generative model goes from a latent (usually randomly sampled noise) to a generated sample; Inversion goes the other way: starting from real (non-generated) data, it finds the latent that regenerates that data.
The motivation is editing real data.
Due to mode collapse, GAN Inversion works relatively poorly and its procedure is complicated.
DDIM的生成过程可以表示为
对于unconditional(
如果使用非对称的
通过grid search(PSNR越大越好)可以看到:每一行中,只有
在DiffusionAutoencoder中,如果使用控制stochastic changes的inferred
Good Inversion makes editing straightforward; there is dedicated work on exact Inversion, such as EDICT, Null-text Inversion, Prompt Tuning, and AIDI; see the Image Editing section.
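For reference, a minimal sketch of plain DDIM inversion: run the deterministic DDIM update in reverse to recover a latent for a real image. `eps_model`, `alphas_cumprod`, and `timesteps` are placeholders; exact-inversion methods listed above refine this baseline.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, timesteps, cond=None):
    """Deterministic DDIM inversion: step from each t to the next (noisier) t,
    reusing the noise prediction made at the current latent."""
    x = x0
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):    # increasing noise levels
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(x, t_cur, cond)
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x   # approximate latent that regenerates x0 under DDIM sampling
```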
Zero-shot Image-to-Image Translation
DDIM Inversion时每一步使用两个loss梯度下降优化
EDICT: Exact Diffusion Inversion via Coupled Transformations
Effective Real Image Editing with Accelerated Iterative Diffusion Inversion
Exact Diffusion Inversion via Bi-directional Integration Approximation
BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models
On Exact Inversion of DPM-Solvers
Inversion for higher-order samplers.
Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models
EasyInv: Toward Fast and Better DDIM Inversion
It can be applied to various tasks, such as data-driven fine-tuning, RLHF fine-tuning, and TI fine-tuning.
LoRA: Low-rank adaptation of large language models
Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model
The standard U-Net architecture for diffusion models conditions convolutional layers in residual blocks with scale-and-shift but does not condition attention blocks. Simply adding LoRA conditioning on attention layers improves the image generation quality.
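A minimal sketch of the drop-in idea: wrap an attention projection with a trainable low-rank branch while the pretrained weight stays frozen. The module below is a generic LoRA linear layer (names and defaults are illustrative), not the paper's exact conditioning scheme.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B(A x)."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)             # keep the pretrained projection frozen
        self.down = nn.Linear(base.in_features, r, bias=False)
        self.up = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)          # start as an identity update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# e.g. replace the to_q / to_k / to_v projections of each attention block with LoRALinear(...)
```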
TriLoRA: Integrating SVD for Advanced Style Personalization in Text-to-Image Generation
Compact SVD:
TriLoRA:
PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction
Time-Varying LoRA Towards Effective Cross-Domain Fine-Tuning of Diffusion Models
SVDiff: Compact Parameter Space for Diffusion Fine-Tuning
类似FSGAN,对卷积层参数做SVD,
A Closer Look at Parameter-Efficient Tuning in Diffusion Models
Adds small trainable adapters to a pre-trained StableDiffusion for transfer learning, and studies how adapter placement and architecture affect training.
StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models
改进LoRA:
Navigating Text-To-Image Customization From LyCORIS Fine-Tuning to Model Evaluation
Controlling Text-to-Image Diffusion by Orthogonal Finetuning
OFT is another fine-tuning method; it outperforms LoRA with fewer parameters and faster convergence.
It is unrelated to OrthoAdaptation.
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
An improvement over OFT.
Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models
SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
Only the SC-Tuner is used.
DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
A PEFT method designed for DiT.
还支持将低分辨率模型fine-tune到高分辨率,对positional embedding进行插值,比如提高一倍分辨率时,原来的
Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers
DiT for incremental class-conditional generation.
为incremental class添加class embedding。
Affiner:
SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation
Drawing on pruning theory, the model contains ineffective parameters, i.e., those whose absolute values fall below a certain threshold.
These currently ineffective parameters are caused by the training process and can become effective again in following training.
FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
Singular value decomposition; only the singular values are fine-tuned.
Vector Quantized Diffusion Model for Text-to-Image Synthesis
VQVAE + multinomial diffusion
transformer blocks:input
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
首次提出 text
text
第一种:最后一个vector作为ADM中AdaGN的class embedding的替代。
第二种:
64x64
classifier-free guidance
After training the conditional model above, it is fine-tuned with the text replaced by an empty string 20% of the time, yielding the classifier-free model.
Text-Guided Inpainting Model
Inpainting with a pre-trained DDPM: at each sampling step,
After the conditional model above is trained, randomly mask
Sampling works as above: at each step,
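For the training-free variant, a minimal sketch of the commonly used replace-the-known-region trick: after every reverse step, overwrite the unmasked region with a freshly noised copy of the original image so only the masked region is actually synthesized. `sampler_step`, `add_noise`, and `mask` are placeholders, not the paper's code.

```python
import torch

@torch.no_grad()
def inpaint_step(x_t, t, t_prev, x0_known, mask, sampler_step, add_noise):
    """mask == 1 marks pixels to keep from x0_known; the model only fills the mask == 0 region."""
    x_prev = sampler_step(x_t, t)              # one ordinary reverse-diffusion step
    known_prev = add_noise(x0_known, t_prev)   # forward-diffuse the known image to noise level t_prev
    return mask * known_prev + (1 - mask) * x_prev
```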
Hierarchical Text-Conditional Image Generation with CLIP Latents
Prior:text作为条件,DDPM建模image CLIP embedding。使用GLIDE的text encoder编码text,使用预训练好的CLIP编码text和image,使用TransformerDecoder模型,分别将encoded text,CLIP text embedding,timestep embedding,noised CLIP image embedding,placeholder embedding按顺序输入,使用causal attention mask(当前位置只和前面的做attention),placeholder embedding位置的输出预测unnoised CLIP image embedding。不使用
Decoder:image CLIP embedding和text作为条件,DDPM建模image。用GLIDE的两种condition方法,第一种是将CLIP image embedding映射到指定维度替代ADM中AdaGN的class embedding,第二种是将CLIP image embedding映射成长度为4的token sequence,然后concat到上述encoded text token sequence之后(
CFG:Prior:randomly dropping text conditioning information 10% of the time during training. Decoder:randomly setting the CLIP embeddings to zero (or a learned embedding) 10% of the time, and randomly dropping the text caption 50% of the time during training
Improving Image Generation with Better Captions
Existing text-to-image models struggle to follow detailed image descriptions and often ignore words or confuse the meaning of prompts. We hypothesize that this issue stems from noisy and inaccurate image captions in the training dataset. We address this by training a bespoke image captioner and using it to recaption the training dataset. We then train several text-to-image models and find that training on these synthetic captions reliably improves prompt-following ability. StableDiffusion is trained on the recaptioned dataset, using the recaptions 95% of the time and the original captions 5% of the time.
A text-conditioned convolutional UNet latent diffusion model on top of the latent space learned by the VAE.
Once trained, we used the consistency distillation process to bring it down to two denoising steps.
CogView3: Finer and Faster Text-to-Image via Relay Diffusion
类似DALL·E-3使用recaptioned dataset进行训练。
Base Stage是一个
SR Stage是一个latent space的RDM(原RDM是在pixel space的),只训练了
采样时将Base Stage生成的图像上采样到
High-Resolution Image Synthesis with Latent Diffusion Models
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
架构上:借鉴SimpleDiffusion第3条经验,架构上采用不均一的block分布,SD是[1,1,1,1],即4层每层1个block,downsample 3次,SDXL是[0,2,10],即3层,第一层直接降维,不做其余处理,第二第三层各2个和10个block,downsample 2次。使用了两个text encoder,输出concat在一起。参数量是原SD的3倍。
Micro-Conditioning on Image Size: since dataset images have varying sizes, SD simply discards small images, losing a large fraction of the data; alternatively, small images can be upsampled to the target size, but such images are blurrier than genuinely large ones and make the model output blurry. SDXL feeds the original image size as a condition added to the time embedding. Note: the network still outputs images at the target size, but their sharpness is governed by this condition. The image quality clearly increases when conditioning on larger image sizes. (A minimal sketch of this micro-conditioning appears after this block.)
Micro-Conditioning on Cropping Parameters: a major SD failure mode is that outputs sometimes crop off part of an object, caused by data preprocessing that resizes the shorter side to the target size and then crops the longer side. SDXL feeds the crop coordinates as a condition added to the time embedding; at inference, passing (0, 0) yields images with complete objects.
Multi-Aspect Training: SD uses a fixed output size. After pre-training at the target size, SDXL is fine-tuned on images of multiple aspect ratios: a set of size buckets is defined, each image is assigned to the nearest bucket and resized to that bucket's size, and each training batch is sampled from one randomly chosen bucket. At inference, images of different sizes can be generated simply by feeding noise of the target size.
Improved VAE autoencoder: retrained with batch size 256 (previously 9) and EMA.
先在256x256上训练(带Micro-Conditioning),再在512x512上训练(带Micro-Conditioning),再在1024x1024的分辨率上进行Multi-Aspect Training(划分bucket:以1024x1024为中心,64为步长增减长宽,保持pixel总数和1024x1024接近)。
Refinement Stage:We train a separate LDM in the same latent space, which is specialized on high-quality, high resolution data and employ a noising-denoising process as introduced by SDEdit on the samples from the base model. We follow and specialize this refinement model on the first 200 (discrete) noise scales. During inference, we render latents from the base SDXL, and directly diffuse and denoise them in latent space with the refinement model, using the same text input.
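To make the micro-conditioning above concrete, here is a minimal sketch under the simplifying assumption that the extra scalars (original size, crop offsets, target size) are embedded by a small MLP and added to the timestep embedding; the released SDXL code differs in detail (e.g., it Fourier-embeds each scalar), so treat this as illustrative only.

```python
import torch
import torch.nn as nn

class MicroConditioner(nn.Module):
    """Embed (orig_h, orig_w, crop_top, crop_left, target_h, target_w) and add the
    result to the timestep embedding, SDXL-style micro-conditioning in spirit."""
    def __init__(self, time_dim: int = 1280):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, time_dim), nn.SiLU(), nn.Linear(time_dim, time_dim))

    def forward(self, t_emb, size_and_crop):   # size_and_crop: (B, 6) float tensor
        return t_emb + self.mlp(size_and_crop)
```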
On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models
Straightforward implementation of control conditions in DiT may cause interference between the time-step and class-level or control conditions (macro-conditioning) if their corresponding embeddings are additively combined in the adaptive layer norm conditioning.
For class, we move the class embedding to be fed through the attention layers present in the DiT blocks.
For control conditions, we zero out the control embedding in early denoising steps, and gradually increase its strength.
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
在Conditional Flow Matching中,
MMDiT架构,ViT in latent space,latent channel取16,since text and image embeddings are conceptually quite different, we use two separate sets of weights for the two modalities.
训练时SNR遵循什么样的分布采样很重要。
Rectified Flow (
Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models
We believe that the continuity of information flow through every layer of the LLM is what enables its generative power and that the knowledge within the LLM spans across all its layers, rather than being encapsulated by the output of any single layer.
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models
three-stages to reduce computational demands
stage A:训练4倍降采样率的VQGAN,
Semantic Compressor:将图像从1024 resize到768,训练一个网络将其压缩到
stage B:diffusion建模stage A中图像quantize之前的embedding,以图像经过semantic compressor的输出为条件(Wuerstchen还以text为额外条件),相当于self-condition。
stage C:diffusion建模图像经过semantic compressor的输出,以text为条件。
生成时
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Use noise conditioning augmentation for both super resolution models。
三个模型都有CFG。
Dynamic thresholding(只针对采样)
When a relatively large classifier guidance weight is used, the result obtained at each step (a minimal sketch of dynamic thresholding appears after this block)
纯text encoder比image-text联合训练出来的text encoder要好。
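A minimal sketch of dynamic thresholding as described in the Imagen paper: instead of statically clipping the predicted x0 to [-1, 1], clip each sample to its own absolute-value percentile s (with s ≥ 1) and rescale by s. The percentile value here is illustrative.

```python
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    """Clamp each predicted x0 to its per-sample percentile s (>= 1) and divide by s."""
    b = x0_pred.shape[0]
    s = torch.quantile(x0_pred.abs().reshape(b, -1), percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(b, *([1] * (x0_pred.dim() - 1)))
    x0_clipped = torch.maximum(torch.minimum(x0_pred, s), -s)
    return x0_clipped / s
```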
YaART: Yet Another ART Rendering Technology
fine-tune
RL alignment for
eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
同时使用T5 text encoder和CLIP text encoder。
发现了不同时间步对文本的利用程度不同
提出模型分裂法,每个子模型只针对某个子level的noise进行训练,称为expert,最终模型为Ensemble of Expert Denoisers。
RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
eDiffi是使用不同timesteps的experts进行生成,这里还使用不同space的experts进行生成。
Space MoE:根据cross-attention map使用阈值法确定某个word的mask,再根据word由route网络选择某个expert,由该expert生成该word对应的feature,所有word的feature乘上对应的mask取平均作为输出。
PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
DiT加入cross-attention引入text。
在DiT架构中,AdaLN的参数量竟然占到了DiT的
三阶段训练:使用一个预训练的class-conditional ImageNet模型作为初始化,一方面可以节省text-to-image的训练时间,一方面class-conditional模型训练起来较为容易且不费时;使用高度对齐的、高密度信息的文本的数据集进行训练,实现text-image alignment;类似Emu,使用少量的高质量图像进行fine-tune。
PixArt-sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
使用更高分辨率的图像和细粒度的caption进行训练。
为了减少参数量在self-attention中使用KV compression,因为相邻的
Weak-to-Strong Training Strategy:PixArt-
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
GenTron: Diffusion Transformers for Image and Video Generation
adaLN design yields superior results in terms of the FID, outperforming both cross-attention and in-context conditioning in efficiency for class-based scenarios. However, our observations reveal a limitation of adaLN in handling free-form text conditioning. Cross-attention uniformly excels over adaLN in all evaluated metrics.
PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
Cascaded Training:不同分辨率的三个模型分别训练。Resolution Boost Training:先在低分辨率上训练,再在高分辨率上训练。
Time-Decoupled Training:将时间步分为两个阶段,前一阶段主要负责生成形状,后一阶段负责refine。前一阶段需要使用大量的text-image pair进行训练以让模型学习不同的concept,之前的模型都过滤掉低分辨率,但这里不需要,将低分辨率也上采样到高分辨率进行训练,因为前一阶段生成的是
Coop Diffusion:不同隐空间和不同分辨率训练的扩散模型可以一起用于采样,以image space为中介进行转换。
Paragraph-to-Image Generation with Information-Enriched Diffusion Model
解决长文本复杂场景的生成问题。
使用decoder-only的language model训练t2i模型,好处是gpt已经展现出了强大的能力,对长文本已经有很好的建模,且训练数据多,缺点是pre-trained decoder-only模型feature extraction能力不太行,所以需要adaption。efficiently fine-tuning a more powerful decoder-only language model can yield stronger performance in long-text alignment (up to 512 tokens)
KNN-Diffusion Image Generation via Large-Scale Retrieval
不需要text-image pair进行训练,用image做条件,CLIP做桥梁。训练时根据image间CLIP编码的cosine距离,使用KNN算法找出和训练image相似的N个image作为条件。采样时根据text和image间CLIP编码的cosine距离,使用KNN算法找出和采样text相似的N个image作为条件。
Retrieval-Augmented Diffusion Models
使用训练数据的k-NN的CLIP embedding作为条件进行训练,采样时,可以根据文本挑选k-NN进行生成,也可以直接使用文本的CLIP embedding。
Re-Imagen: Retrieval-Augmented Text-to-Image Generator
Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities. A generative model that uses retrieved information can produce high-fidelity and faithful images, even for rare or unseen entities.
Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. Re-Imagen is augmented with the knowledge of high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities’ visual appearances.
在UNet encoder后加一个cross-attention与neighbors做交互,同样使用UNet encoder编码neighbors作为key-value,t设为0,所有参数一起训练。
采样时可以自己提供reference image作为neighbor,实现类似Textual Inversion的效果。
Transparent Image Layer Diffusion using Latent Transparency
用透明图像数据训练一个编码器和一个解码器:编码器根据RGB图像和alpha图像预测一个VAE latent空间的偏移量latent transparency,该偏移量加在RGB图像的latent上,相当于对latent distribution做修改,这么做的目的是为了让解码器可以根据修改后的latent预测出RGB图像和alpha图像,但同时应该尽可能少地影响VAE重构效果,让StableDiffusion可以正常运行。loss分为两部分,第一部分是解码器重构RGB图像和alpha图像的loss,第二部分是VAE重构loss,约束编码器预测的偏移量latent transparency不要影响latent distribution。
在新的latent distribution上fine-tune StableDiffusion。
LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model
一个background layer和
使用InstructBLIP、SAM、StableDiffusion inpainting模型造数据训练。
Ensembling Diffusion Models via Adaptive Feature Aggregation
集成学习。
AFA dynamically adjusts the contributions of multiple models at the feature level according to various states (i.e., prompts, initial noises, denoising steps, and spatial locations), thereby keeping the advantages of multiple diffusion models, while suppressing their disadvantages.
Diffusion Soup: Model Merging for Text-to-Image Diffusion Models
Diffusion Soup enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging.
Diffusion Soup approximates ensembling, and involves fine-tuning
Not All Noises Are Created Equally: Diffusion Noise Selection and Optimization
A good noise should remain unchanged after generation followed by Inversion; with this criterion, one can select a good noise for a given prompt or directly optimize one.
Designing a Better Asymmetric VQGAN for StableDiffusion
改进StableDiffusion要建模的隐空间。
为decoder设计了一个conditional branch,输入task-specific prior,如unmasked image in inpainting。
decoder远比encoder大,提升细节重构能力。
Counting Guidance for High Fidelity Text-to-Image Synthesis
用pre-trained counting network,输入每一步的
Iterative Object Count Optimization for Text-to-image Diffusion Models
TI的思路解决count问题。
QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain
TI的思路解决count问题,meta-learning。
MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion
a training-free Multimodal-LLM agent that can progressively generate multi-object with planning and feedback control, like a human painter.
DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics
We propose a generic "naturalness" preserving loss function, kurtosis concentration (KC) loss,和diffusion loss一起训练。
FreeU: Free Lunch in Diffusion U-Net
training-free,只用两个系数提高生成效果。
The features in the UNet decoder come from two sources: the backbone features produced by the decoder itself, and the skip features passed over from the encoder at the same resolution through skip connections.
Multiply the backbone features by a factor
Experiments show that the skip features carry more of the high-frequency detail, so the FFT of the skip features is multiplied by a slightly larger coefficient to restore the suppressed high-frequency information.
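A minimal sketch in the spirit of FreeU's two-factor scaling: scale the backbone features by b and re-weight a frequency band of the skip features in the Fourier domain by s before the two are concatenated. Which band is re-weighted and the factor values are hyperparameters chosen in the paper's ablations; this is a generic illustration, not the official implementation.

```python
import torch

def freeu_style_scaling(backbone_feat, skip_feat, b=1.2, s=0.9, radius=0.25):
    """Scale backbone features by b and re-weight the centered low-frequency band
    of the skip features (relative radius `radius`) by s in the Fourier domain."""
    backbone_feat = backbone_feat * b
    fft = torch.fft.fftshift(torch.fft.fft2(skip_feat.float()), dim=(-2, -1))
    h, w = skip_feat.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h, device=skip_feat.device),
        torch.linspace(-1, 1, w, device=skip_feat.device),
        indexing="ij",
    )
    low = (yy ** 2 + xx ** 2).sqrt() < radius
    mask = torch.where(low, torch.full_like(yy, s), torch.ones_like(yy))
    fft = fft * mask
    skip_out = torch.fft.ifft2(torch.fft.ifftshift(fft, dim=(-2, -1))).real
    return backbone_feat, skip_out.to(backbone_feat.dtype)
```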
Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis
不同区域可以使用不同
Fine-grained Text-to-Image Synthesis with Semantic Refinement
KNN-Diffusion (language-free training),采样时根据text中的semantic选取reference image,给reference image加噪,计算noised reference image和
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
对于target concept的纠错或者编辑。
可以通过prompt engineering设计enhanced and suppressed attribute,可以解决hands生成等问题。
Prompt Sliders for Fine-Grained Control, Editing and Erasing of Concepts in Diffusion Models
类似ConceptSliders,但是是在text encoder上训练LoRA。
Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
训练一个adapter以结合不同预训练语言模型和预训练文生图模型。
给定任意预训练的文本编辑器
Multi-LoRA Composition for Image Generation
training-free
直接将每个LoRA的输出
Unlocking Spatial Comprehension in Text-to-Image Diffusion Models
对于含有左右位置关系的两个物体的prompt,先正常生成其中一个物体,再利用place * on the left这样的instruction进行编辑。
编辑模型类似InstructPix2Pix,使用LLM-grounded diffusion,生成两个物体的layout,只用其中一个layout生成原图,两个layout都用生成目标图,instruction一起,LoRA fine-tune InstructPix2Pix。
CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
给定一个带方位的prompt,对其进行改写(语义不变),再对其进行反义、交换等操作(改变语义),之后都送入文本编码器计算编码结果的相似度,理论上应该是改写后的prompt与原prompt最相似,但发现目前流行的文本编码器都有90%以上的失败率。
构造方法数据集SCOP fine-tune diffusion model,fine-tune时类似RoPE给QK加positional embedding to augment the conditioning text signals.
Golden Noise for Diffusion Models: A Learning Framework
图里画错了,NPNet是两个网络,分别是两种预测target noise的方法,输入都是source noise
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
利用diffusion chain回传LVLM的梯度优化更新initial noise。
直接回传梯度计算量较大,这里将所有
ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts
通过引入先验知识提高image-text对齐程度的优化训练算法。
利用NLP工具标注出text中的关键词,并在cross-attention中提高其与image token的attention的权重。
利用object detection检测出text中的object的区域,提高这一区域的diffusion loss的权重。
TokenCompose: Grounding Diffusion with Token-level Supervision
利用SAM提取prompt中名词对应的object的mask,fine-tune StableDiffusion,除了diffusion loss,还加了两个cross-attention map的辅助loss。
Local Conditional Controlling for Text-to-Image Diffusion Models
StableDiffusion + ControlNet
training-free
如果ControlNet的输入只包含一个物体的控制信息,比如对于prompt"a dog and a cat",ControlNet的输入只包含了cat的bounding box,the prompt concept that is most related to the local control condition dominates the generation process, while other prompt concepts are ignored. Consequently, the generated image cannot align with the input prompt. dog容易消失。
对于有local control的物体,使用控制信息大致估算出一个mask,计算该物体对应的token的cross-attention map在mask内最大值和mask外最大值的差作为loss,对于没有local control的物体,将mask外视为自己的区域,mask外视为非自己的区域,用同样的方法计算loss,loss求和,求梯度作为guidance。
将mask用在ControlNet的skip connection feature上,使得ControlNet只影响mask内的feature。
Improving Sample Quality of Diffusion Models Using Self-Attention Guidance
Compared with classifier-free guidance, which derives guidance from the conditional score, self-attention guidance derives guidance from the model's internal information; it is training-free and condition-free, hence generic and applicable to enhancing any diffusion model.
In classifier guidance, u is the target to move away from; for an unconditional model, its output can serve as c and a u can be defined manually; here, what is generated at each step is used
self-attention mask: the self-attention is reused; the unnormalized self-attention map has size
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance
类似SAG,使用
Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy Curvature of Attention
破坏CFG的unconditional prediction的self-attention,和PAG一个思想。
By defining the energy of self-attention, we introduce a method to reduce the curvature of the energy landscape of attention and use the output as the unconditional prediction of CFG.
Guided Diffusion from Self-Supervised Diffusion Features
类似SAG,利用数据本身的UNet feature做guidance。
Our method leverages the inherent guidance capabilities of diffusion models during training by incorporating the optimal-transport loss. In the sampling phase, we can condition the generation on either the learned prototype or by an exemplar image.
需要全部重新训练。
Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models
某个token的cross-attention的dominance导致了其它token的semantic的丢失。
需要额外输入一个token index集合(可以自动提取所有名词),取该集合内每个token对应的cross-attention map,对于每个cross-attention map,计算其
We choose cross-attention layers in the last down-sampling layers and the first up-sampling layers in the U-Net for optimization.
为了稳定,使用EMA更新。
Towards Better Text-to-Image Generation Alignment via Attention Modulation
training-free
self-attention temperature control:计算attention时使用较小的temperature,让softmax的分布更加集中,high attention values between patches with strong correlations are emphasized, while low attention values between unrelated patches are suppressed. After temperature control, the patch only corresponds with patches within a smaller surrounding area, leading to the correct outlines being constructed in the final generated image. We apply the temperature operation to the early generation stage of the diffusion model in the self-attention layer.
object-focused masking mechanism:对prompt进行拆分,分为带形容词的物体、动词、介词等主体,计算prompt中不同主体对应的cross-attention map之和(每个主体可能不止一个word)作为该主体的cross-attention map,之后遍历所有pixel,对于每个pixel,选出其cross-attention map响应值最大的那个主体,将该pixel分配给该主体,在其它主体的所有word的cross-attention map上mask掉该pixel(响应值设为0)。With this masking mechanism, for each patch, we retain semantic information for only the entity group with the highest probability, along with the global information related to the layout. This approach helps reduce occurrences of object dissolution and misalignment of attributes.
MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask
StableDiffusion
training-free
只在16x16的分辨率上进行操作。
cross-attention map有三种bad case:
做法是使用region selection算法,挑选出每个text token对应的区域,提高其cross-attention map的response,在cross-attention map中尽量分开不同token对应的区域。
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models
StableDiffusion
training-free
只在16x16的分辨率上进行操作。
At each step of generation, gradient descent is used to optimize
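A hypothetical sketch of this semantic-guidance loop: a loss is computed from the stored cross-attention maps of the subject tokens (e.g., one minus the maximum attention of the weakest subject), and the current latent is updated along its negative gradient before the denoising step. `loss_from_attention` is a placeholder callable that runs the UNet and reads the 16x16 cross-attention maps; it is not the paper's code.

```python
import torch

def attend_and_excite_update(latent, loss_from_attention, step_size=20.0):
    """One latent update: descend the attention-based loss w.r.t. the current latent."""
    latent = latent.detach().requires_grad_(True)
    loss = loss_from_attention(latent)   # e.g. 1 - max over the weakest subject token's attention map
    grad = torch.autograd.grad(loss, latent)[0]
    return latent.detach() - step_size * grad
```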
Divide and Bind Your Attention for Improved Generative Semantic Nursing
StableDiffusion
training-free
用total variation loss代替上面的loss,这样就不局限在某个patch点上了,激励整个区域。
另外引入了一个bind loss,其动机是prompt中还存在一些修饰subject token的形容词,这些形容词对应的cross-attention map应该和对应名词的cross-attention map是对齐的,所以引入它们(归一化后)之间的JS散度作为loss。
Linguistic Binding in Diffusion Models Enhancing Attribute Correspondence through Attention Map Alignment
StableDiffusion
training-free
利用cross-attention map计算loss,求梯度作为guidance。
Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models
StableDiffusion
training-free
利用cross-attention map计算loss,求梯度作为guidance。
intensity loss: the negative of the maximum value of the cross-attention map, similar to A&E.
binding loss: maximize the cosine similarity between the given object and its syntactically-related modifier tokens, while enforcing the repulsion of grammatically unrelated ones in the feature space.
Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models
Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory
StableDiffusion
training-free
利用cross-attention map计算loss,求梯度作为guidance。
Easing Concept Bleeding in Diffusion via Entity Localization and Anchoring
类似DiffEdit使用cross-attention map估计出mask,之后进行自我增强。
INITNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization
根据第一步生成时的cross-attention map和self-attention map优化initial noise的重参数分布,保证物体的存在性,解决subject mixing问题。
两个score如果都低于各自的阈值,则说明不需要继续优化,直接采样并进行生成。
Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation
针对MMDiT的解决subject neglect/mixing的方法。
计算三个loss,求梯度作为guidance。
Block Alignment Loss: The blocks in the later layers gradually remove the ambiguities present in the earlier ones, 因此使用深层的cross-attention map与浅层的cross-attention map计算相似度。
Text Encoder Alignment Loss: T5与CLIP可能冲突,因此计算两者编码相同token得到对应的cross-attention map的相似度。
Overlap Loss: 计算不同token对应的cross-attention map的overlap,T5和CLIP各算一个,再两两各计算一个。
A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis
StableDiffusion
training-free
The attention-overlap problem. Solution: compute the IoU between the cross-attention maps of different tokens.
The attention-decay problem: the authors observe that the layout in StableDiffusion's cross-attention maps is fairly clear early in generation but becomes increasingly blurred later and is not preserved. A mask is therefore estimated from the previous step's cross-attention map, and the IoU between the current step's cross-attention map and this mask is computed.
Subtract the loss in 4 from the loss in 3, and use the gradient as guidance.
Visual Programming for Text-to-Image Generation and Evaluation
Fine-tune an LLM on text-layout pairs so it can convert text into layouts, which are fed into GLIGEN together with the text to assist precise, controllable generation.
Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models
类似VP。
GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
类似VP。
Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching
类似VP。
LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation
in-context learning:从训练集(COCO,带prompt和bounding box标注)中随机采样一批样本作为candidate set,训练一个策略网络,该策略网络根据查询prompt,从candidate set选取几个样本作为in-context examples,为ChatGPT输入in-context examples和查询prompt,生成prompt中object的bounding box(文本形式)。策略网络根据mIoU和CLIP相似度等reward训练。
GLIGEN fine-tune StableDiffusion。
DivCon: Divide and Conquer for Progressive Text-to-Image Generation
LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts
RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models
利用ChatGPT生成layout后,利用L2I模型(如GLIGEN)和T2I模型的一起生成,做法是每一步生成时使用系数组合两个模型预测的噪声作为DDIM计算下一步的噪声,并根据DDIM的计算结果定义一个loss更新系数作为下一步的系数,以动态调整真实性和组合性。
Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis
CoT reasoning:in-context让GPT3.5根据prompt生成layout。
在StableDiffusion的self-attention和cross-attention之间插入一个可训练的Layout-Aware Cross-Attention,用layout生成mask作用于cross-attention map上。
Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation
StableDiffusion
training-free
从带有位置关系的text中解析出一个粗糙的layout(比如middle对应图像中央一个方框,left对应左边占1/3的框,都是固定大小的),与第一步生成时产生的cross-attention map做比对,阈值法看是否有layout的不匹配,如果匹配就不介入,直接生成;如果不匹配,则进行介入。
介入:首先从
SPDiffusion: Semantic Protection Diffusion for Multi-concept Text-to-image Generation
StableDiffusion
training-free
SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis
Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion
generating intricate visual content from simple abstract text prompts。
自监督训练一个scene graph的discrete diffusion model,根据simple abstract text prompts生成语义更丰富的scene graph。
给StableDiffusion插入scene graph attn进行训练。
Adapting Diffusion Models for Improved Prompt Compliance and Controllable Image Synthesis
除了text,引入其它条件,如semantic or sketch or depth or normal maps等
对于某个condition加噪,使用预训练的StableDiffusion对其进行去噪,使用可训练的T2I-Adapter引入之前的所有condition,输出输入到下一个T2I-Adapter继续往后传递。
使用
训练时随机置空条件,这样采样时可以挑选任意子图进行生成。
ITI-GEN: Inclusive Text-to-Image Generation
make the pre-trained StableDiffusion to generate images which are uniformly distributed across attributes of interest.
有点类似TIME和UCE那种model editing,但这里只是修改prompt (prompt tuning),不对模型做任何更改,需要提供一个reference image dataset作为attributes of interest。
Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs
对某个较长的caption,使用ChatGPT将其分解为n个sub-caption,再对每个sub-caption进行recaption,并为每个sub-caption在图中分配一个layout。
生成时,分别使用这n个sub-caption进行去噪,之后将每个sub-caption对应的去噪结果按照layout进行resize并重新拼成原来的空间尺寸,为了确保concat边界的一致性,使用原caption输出的latent,与拼出来的latent做插值。
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
类似RPG。
Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG
利用knowledge graph的retrieval-augmented generation。
Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance
Semantic Guidance Tuning for Text-To-Image Diffusion Models
将prompt的score拆成不同concept的score的组合,subject concept的score直接计算,abstract concept的score由正交投影计算,组合时计算不同concept的score和prompt的score的相似度决定weight。
Is Your Text-to-Image Model Robust to Caption Noise?
TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation
Saliency Guided Optimization of Diffusion Latents
TweedieMix的升级版。
DreamWalk: Style Space Exploration using Diffusion Guidance
将prompt分解为不同的子prompt,使用不同子prompt的CFG的线性组合进行生成。
Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else
StableDiffusion
training-free,只在text embedding上做文章。
text中首先出现的concept往往在生成中占主导地位,可能抢占其它concept,并且这些首先出现的concept的token embedding往往有比较大的normalization,通过scale down可以缓解。
某些concept的生成可能和它对应的embedding没关系,而是根据其它embedding生成。计算当前embedding和其它embedding的相似性,用其它embedding的加权和表示当前embedding。
Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function
training-free
Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
training-free
ToMe:
ETS: As the semantic information contained in [EOT] can interfere with attribute expression, we mitigate this interference by replacing [EOT] to eliminate attribute information contained within them, retaining only the semantic information of each subject.
原prompt为
During generation, we compute these two novel losses to update the composite token during each time.
Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models
prompt="a tea cup of iced coke",现有的模型大多生成glass cup而非tea cup,这是因为训练数据中iced coke一般和glass cup一起出现,所以提出Mixture of Concept Expert,让GPT规划先生成tea cup再生成iced coke。
On the Fairness, Diversity and Reliability of Text-to-Image Generative Models
Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
不同的LLM各有优劣,比如encoder-decoder架构的T5和decoder-only架构的GPT,后者在文本理解上更好,但是用它们训练出来的text-to-image模型,后者在图像和文本对齐程度上远没有前者好。
将不同LLM集成在一起:使用不同LLM分别对prompt进行编码,使用refiner融合它们输出的feature,使用融合后的feature训练text-to-image DiT。
Decoder-Only LLMs are Better Controllers for Diffusion Models
由于decoder-only LLM更丰富的语义,使用它编码text训练text-to-image效果会更好。
训练一个MLP将LLM text embedding转换为CLIP text embedding输入预训练的cross-attention,同时类似IP-Adapter训练一个并行的cross-attention。
LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation
Cross-Adapter Module和UNet一起使用diffusion loss训练。
使用LLaVA对数据集中的image进行caption,替代其prompt,进行训练,类似DALLE
SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models
收集一些complex prompt,使用T2I模型生成图像,使用BLIP进行caption,得到simple prompt,得到simple-complex prompt pair。
Train SUR-adapter to transfer the semantic understanding and reasoning capabilities of large language models and achieves the representation alignment between complex prompts and simple prompts. 让LLM编码simple prompt达到complex prompt的图像生成效果。
BeautifulPrompt: Towards Automatic Prompt Engineering for Text-to-Image Synthesis
收集low-quality prompt和high-quality prompt pair的数据集,训练一个语言模型,根据low-quality prompt生成high-quality prompt,使得prompt engineer自动化。
DiffChat: Learning to Chat with Text-to-Image Synthesis Models for Interactive Image Creation
DiffChat can effectively make appropriate modifications and generate the target prompt, which can be leveraged to create the target image of high quality.
LLM和用户对话,根据用户需求,只对prompt进行修改,不涉及image识别。
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
fine-tune一个MLLM,可以改写prompt,生成选择哪个模型,生成推理配置参数。
Optimizing Prompts for Text-to-Image Generation
训练优化LLM成为一个prompt改写模型。
NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation
类似Promptist。
Prompt Refinement with Image Pivot for Text-to-Image Generation
使用HPSv2数据集训练模型refine input prompt。
Repairing Catastrophic-Neglect in Text-to-Image Diffusion Models via Attention-Guided Feature Enhancement
自动检测生成结果中丢失的object,并重写prompt。
AP-Adapter: Improving Generalization of Automatic Prompts on Unseen Text-to-Image Diffusion Models
Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models
相比于negative prompt使用一些抽象的prompt例如low quality和ugly,contrastive prompt针对prompt设计,去除一些形容词,或使用一些反义prompt,比如with改为without。
On Discrete Prompt Optimization for Diffusion Models
利用prompt engineering找到合适的negative prompt。
Our main insight is that prompt engineering can be formulated as a discrete optimization problem in the language space.
To the best of our knowledge, this is the first exploratory work on automated negative prompt optimization.
Improving Image Synthesis with Diffusion-Negative Sampling
DNP:使用
Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
增强属性绑定。
利用语法结构,提取text中的noun phrase(共
SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance
对CLIP text embedding进行adaptation,使得生成图像的语义更准确。
使用NLP parser提取text中的subject-relation-object三元组(可能有多个),每个三元组构成一个scene graph,对于每个scene graph,concat三元组单词的CLIP text embedding,过一个线性层得到scene graph embeding,原CLIP text embedding作为Q,scene graph embeding作为KV,进行cross-attention,得到refined text embedding。计算cross-attention map时,只有Q当前的token属于K当前的scene graph时才计算,其余都mask掉。
Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention
cross-attention map为
cross-attention map为
利用这些发现可以做检测。
缓解memorization:直接调节cross-attention的logits,给begining token的logits乘一个较大的数,让cross-attention score大都集中在begining token上。
Towards Memorization-Free Diffusion Models
Anti-Memorization Guidance:设计了三个防止生成memorization sample的度量函数,求梯度作为guidance。
Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models
We propose to localize memorization of individual data samples down to the level of neurons in DMs’ cross-attention layers.
By deactivating these memorization neurons, we can avoid the replication of training data at inference time, increase the diversity in the generated outputs, and mitigate the leakage of private and copyrighted data.
Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models
MemBench: Memorized Image Trigger Prompt Dataset for Diffusion Models
In this work, we present MemBench, the first benchmark for evaluating image memorization mitigation methods.
ProCreate, Don’t Reproduce! Propulsive Energy Diffusion for Creative Generation
ProCreate operates on a set of reference images and actively propels the generated image embedding away from the reference embeddings during the generation process.
Exploring Local Memorization in Diffusion Models via Bright Ending Attention
In this paper, we identify and leverage a novel ‘bright ending’ (BE) anomaly in diffusion models prone to memorizing training images to address a new task: locating localized memorization regions within these models.
BE refers to a distinct cross attention pattern observed in text-to-image generations using diffusion models.
Memorized image patches exhibit significantly greater attention to the end token during the final inference step compared to non-memorized patches.
Memories of Forgotten Concepts
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models
We propose to only apply guidance in a continuous interval of noise levels in the middle of the sampling chain and disable it elsewhere. On EDM, define a
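A minimal sketch of limited-interval guidance (the interval bounds below are illustrative, not the paper's values):

```python
def guided_eps(eps_cond, eps_uncond, sigma, w=3.0, lo=0.3, hi=2.0):
    """Apply CFG only when the noise level sigma lies inside (lo, hi);
    elsewhere fall back to the plain conditional prediction."""
    if lo < sigma < hi:
        return eps_uncond + w * (eps_cond - eps_uncond)
    return eps_cond
```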
Analysis of Classifier-Free Guidance Weight Schedulers
Simple, monotonically increasing weight schedulers consistently lead to improved performances.
Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance
We argue that the CFG scale should be spatially adaptive, allowing for balancing the inconsistency of semantic strengths for diverse semantic units in the image.
cross-attention map的shape是
The segmentation given by the cross-attention map is coarse, so the self-attention map is used to refine it: multiply the self-attention map with the cross-attention map directly, then apply the operation in step 1 to the result.
进一步优化,计算
将CFG中的
Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model
During the denoising process of the stable diffusion model, the overall shape and details of generated images are respectively reconstructed in the early and final stages of it.
The special token [EOS] dominates the influence of text prompt in the early (overall shape reconstruction) stage of denoising process, when the information from text prompt is also conveyed. Subsequently, the model works on filling the details of generated images mainly depending on themselves.
在early stage使用CFG,在final stage只使用unconditional score,因此减少了final stage一半的计算量。
Plug-and-Play Diffusion Distillation
CFG需要两次forward,计算量太大,因此给模型学习一个guide model作为adapter,与ControlNet对称,将scale作为参数输入,蒸馏CFG。
Segmentation-Free Guidance for Text-to-Image Diffusion Models
对于某个prompt(a dog on a cough in an office),如果在生成时negative prompt就是prompt去掉某个object的话(a dog in an office),那么最终生成的图像中,这个object(cough)就会变的更显著。
利用这一特点,在采样时,可以利用cross-attention map估计出每个pixel对应的object,在prompt中去掉这个object作为这个pixel对应的negative prompt。
因为要利用cross-attention map估计出每个pixel对应的object,所以SFG只在采样后期使用,且
FABRIC: Personalizing Diffusion Models with Iterative Feedback
reference image加噪过UNet,保留所有self-attention的key-value,生成时将这些key-value concat在生成时的self-attention的key-value后进行计算。
高分的reference image作为cfg的conditional,低分的reference image作为cfg的unconditional,使用上述方法进行生成。
Don’t Drop Your Samples! Coherence-Aware Training Benefits Conditional Diffusion
使用CLIP对text-to-image数据集进行相似度打分,经过处理后转换为
训练diffusion model时,将coherence作为额外的条件。
生成时使用coherence score的CFG:
The Chosen One Consistent Characters in Text-to-Image Diffusion Models
形容某个character的不同prompt生成具有相同特征的character。
生成、聚类、用选中的类别(具有相同特征的character的图像)进行LoRA fine-tune。
OneActor: Consistent Character Generation via Cluster-Conditioned Guidance
Create consistent images of the same character.
类似Pix2Pix-Zero,使用一个可训练网络预测text embedding中character word的
SFT是在一个固定的数据集上对模型进行fine-tune,只有text-image pair,类似Emu,鼓励模型在这个text上生成对应的image,一般是收集一个高质量的数据集对模型进行fine-tune。
RLFT让模型根据某个text生成一个image,使用reward model对该image进行打分,优化reward-weighted likelihood maximization,即最大化
一些评判标准有现成的模型,如评判text-image alignment的CLIP,可以作为reward model直接使用,一些评判标准没有现成的模型,如human feedback,此时需要训练一个reward model,一般做法是通过样本之间的rank学习一个reward model(类似CLIP),比如下面的HPS。
DPO避开了reward model的训练,只需要两个样本之间的rank关系就可以训练,所以一般是SFT那样在一个固定的数据集上对模型进行fine-tune。
RLFT是一类方法,RLHF是指评判标准是human feedback并且应用RLFT方法的一种应用。
Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
LLM可以通过在高质量小数据集上fine-tune的方式显著提高模型输出质量,并且不会影响其泛化能力。
假设StableDiffusion本身已经具备生成高质量图像的能力,但并没有被有效发掘,导致生成质量参差不齐,Emu通过人工筛选2000张极高质量的图像对StableDiffusion进行fine-tune,让StableDiffusion保持生成高质量图像的能力,同时不失对文本的泛化性。
early stopping(<15k iterations)避免过拟合。
该方法很通用,还适用于pixel-level diffusion models(Imagen)和masked generative models(Muse)。
Progressive Compositionality In Text-to-Image Generative Models
构造contrastive数据集。
训练时使用正样本计算diffusion loss,额外使用负样本计算一个contrastive loss
DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models
将
为了避免fine-tune过拟合,加了fine-tuned model生成的
Training Diffusion Models with Reinforcement Learning
Policy Gradient fine-tune pre-trained diffusion model,公式和DPOK一样,DDPO和DPOK基本是同一时间放出来的。
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
使用某个prompt
目前只用于模型评测,还未用于RLFT。
Improving Compositional Text-to-image Generation with Large Vision-Language Models
使用Large Vision-Language Models评定生成图像与文本的对齐性,主要是object number, attribute binding, spatial relationship, aesthetic quality四个方面的对齐。
RLFT模型(online)。
Human Preference Score: Better Aligning Text-to-Image Models with Human Preference
Human Preference Dataset (HPD):一个prompt生成多张image,其中一张被用户选为preference。
Train human reference classifier:类似CLIP,分别编码image和text到同一embedding空间,然后计算相似度。
Human Preference Score (HPS):
LoRA fine-tune StableDiffusion:不仅使用high-HPS数据进行fine-tune,还使用low-HPS数据,此时给prompt加一个识别符,在采样时给prompt加一个识别符作为negative prompt。
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
Human Preference Dataset v2 (HPDv2):使用不同数据集的prompt,使用ChatGPT进行过滤,得到一个质量不错的prompt数据集,每个prompt输入不同text-to-image模型生成多张image,人工标注preference。
Train human reference classifier:结构和HPS一样,还是编码然后计算相似度的模型,但训练时针对一个prompt只随机选两张image,更prefer的label为
Human Preference Score v2 (HPSv2)同HPS。
Rich Human Feedback for Text-to-Image Generation
RichHF-18K dataset includes two heatmaps (artifact/implausibility and misalignment), four fine-grained scores (plausibility, alignment, aesthetics, overall), and one text sequence (misaligned keywords).
Aligning Text-to-Image Models using Human Feedback
通过引入人工标注反馈提高image-text对齐程度的fine-tune pre-trained StableDiffusion算法。
StableDiffusion对于一些概念生成还是会时好时坏的,比如count和color,为此可以使用count和color进行造句(可以选其它你认为没有对齐好的概念使用该算法,这里仅以count和color举例),再用每个text生成60多张image,由labeler进行0-1标注,0代表没有对齐(count错了或color错了),1代表对齐。
训练一个reward function,根据上述image和text的CLIP编码去预测对齐程度(输出0~1),用标注数据进行训练,使用MSE Loss;同时使用数据增强方法(prompt classification)提升reward function性能:对每个已经标注为对齐的image-text pair,将text中的count或color进行更改,生成N-1个与imgae非对齐的text,输入image和N个text到reward function中并输出N个预测值,softmax后使用交叉熵进行分类训练。
使用reward function RLFT模型(online)。
Behavior Optimized Image Generation
利用DDPO,align SD with a proposed BoigLLM-defined reward。
Avoiding Mode Collapse in Diffusion Models Fine-tuned with Reinforcement Learning
改进DDPO。
Aligning Diffusion Models by Optimizing Human Utility
PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
现有的T2I模型大都使用预训练的text encoder,且生成时都需要prompt engineering,这都说明text encoder是suboptimal的,所以可以将T2I生成时的不对齐归因于suboptimal text encoder,所以提出使用DDPO LoRA fine-tune text encoder,让text更具visual特征。
还可以搭配上DPOK fine-tune UNet的方法一起使用,效果更佳。可以用于fix hands。
TextCraftor: Your Text Encoder Can be Image Quality Controller
类似于TexForce。
Model-Agnostic Human Preference Inversion in Diffusion Models
使用蒸馏出的一步生成的模型进和打分模型,重参数法优化初始噪声的高斯分布的均值和方差。
对于某个prompt,从标准高斯分布中随机一个噪声,再从重参数法的高斯分布中随机一个噪声,使用模型分别生成两个样本,使用打分模型分别打分,交叉熵优化均值和方差,使得后者得分更高。
可以对某个prompt专门优化,也可以使用prompt数据集进行优化。
SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
LoRA + gradient checkpointing,使用reward function fine-tune StableDiffusion。
Aligning Text-to-Image Diffusion Models with Reward Backpropagation
gradient checkpointing,使用reward function fine-tune StableDiffusion。
Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models
不同采样方法都可以表示为
在使用DRaFT和AlignProp时不再需要gradient checkpointing,直接屏蔽
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
预定义K个指标,训练时随机选择一个指标,在prompt前prepend这个指标的reward-specific identifier,使用DDPO进行训练。
生成时把K个reward-specific identifier concat在一起prepend到prompt。
VersaT2I: Improving Text-to-Image Models with Versatile Reward
ChatGPT生成
不同aspect的reward model各训练出一个LoRA
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
DPOK,类似fine-tune版本的TokenCompose。
Discriminative Probing and Tuning for Text-to-Image Generation
提取StableDiffusion的feature,送入一个Q-Former,使用global matching(CLIP loss)和local grounding(classification,bounding box)任务训练Q-Former。
训练完成后,给StableDiffusion的所有cross-attention加上LoRA,使用相同的loss一起训练Q-Former和LoRA。
生成时进行self-correction,对global matching的CLIP loss求梯度作为guidance。
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
A framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation.
We develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models.
We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and reward models over multiple iterations.
Improving Long-Text Alignment for Text-to-Image Diffusion Models
利用DRTune做长文本对齐。
Fine-tuning Diffusion Models for Enhancing Face Quality in Text-to-image Generation
专为人脸设计的score微调模型。
F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration
Elucidating Optimal Reward-Diversity Tradeoffs in Text-to-Image Diffusion Models
An inference-time regularization inspired by Annealed Importance Sampling, which retains the diversity of the base model while achieving Pareto-Optimal reward-diversity tradeoffs.
Reward Incremental Learning in Text-to-Image Generation
RLHF连续学习,解决遗忘问题。
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Diffusion Model Alignment Using Direct Preference Optimization
将DPO拓展到整个diffusion chain上。
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
Existing works including Diffusion-DPO and D3PO measures the quality according to the final generated image
We build the step-aware preference model by the drawing inspiration from the training process of noisy classifier, which is able to classify noisy intermediate images. We assume the preference order between pair of images can be kept when adding the same noise. After training, the step-aware preference model can be used to predict the preference order among
随机采样一个
step-wise resampler:从
Aligning Few-Step Diffusion Models with Dense Reward Difference Learning
Standard alignment methods often struggle with step generalization when directly applied to few-step diffusion models, leading to inconsistent performance across different denoising step scenarios. SDPO is a novel alignment method tailored for few-step diffusion models.
Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization
为timestep-distilled diffusion model设计的DPO算法。This enables the model to retain its few-step generation ability, while allowing for fine-tuning of its output distribution.
Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward
为timestep-distilled diffusion model设计的DPO算法。
Scalable Ranked Preference Optimization for Text-to-Image Generation
DPO可以利用模型计算score:
Rank Loss:
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, linearly weighted and summed to an interpretable and accurate score.
We develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors within preference data. 只有postive image在所有preference dimension上都优于negative image的pair才会被用于进行DPO训练,防止混淆。
PopAlign: Population-Level Alignment for Fair Text-to-Image Generation
之前的preference都是两个单独的样本之间的比较,PopAlign将其拓展到两个群体样本之间的比较。
Aligning Diffusion Models with Noise-Conditioned Perception
A method that utilizes the U-Net encoder’s embedding space for preference optimization. Perform diffusion preference optimization in a more informative perceptual embedding space.
将Diffusion-DPO的四个diffusion loss改为UNet encoder feature之间的MSE loss。
Curriculum Direct Preference Optimization for Diffusion and Consistency Models
结合了Curriculum Learning的Diffusion-DPO,先学简单的(分差大的)再学难的(分差小的)。
PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation
We propose PatchDPO, an advanced model alignment method for personalized image generation by estimating patch quality instead of image quality for model training.
Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences
If the data is curated according to a reward model, then the expected reward of the iterative retraining procedure is maximized.
Generalizing Alignment Paradigm of Text-to-Image Generation with Preferences through f-divergence Minimization
Prioritize Denoising Steps on Diffusion Model Preference Alignment via Explicit Denoised Distribution Estimation
SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation
We enable the application of DPO for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2.
GFlowNets are a class of probabilistic methods to train a sampling policy
Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets
Bridge Diffusion Model: bridge non-English language-native text-to-image diffusion model with English communities
利用ControlNet实现StableDiffusion的中文控制。
ControlNet输入变为
Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation
用feature之间的L2 loss代替KD loss。
AltDiffusion: A Multilingual Text-to-Image Diffusion Model
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
The pre-trained CLIP model can merely encode English with a maximum token length of
在stage 1时,当text length超过
Mixture of Diffusers for Scene Composition and High Resolution Image Generation
Generation is split into regions, each with its own prompt, and the regions are composed via harmonization.
The key to harmonization: fuse at every step, and adjacent regions must overlap; the overlapping parts are combined by a weighted sum, and harmonization is propagated through these overlaps.
MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation
Similar to Mixture of Diffusers; the difference is that MultiDiffusion operates on the denoised outputs
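A minimal sketch of the per-step fusion common to this family of methods: denoise each overlapping crop separately, then write the results back into the full latent and average the overlaps. `views` and `denoise_view` are placeholders; weighting schemes differ between Mixture of Diffusers and MultiDiffusion.

```python
import torch

@torch.no_grad()
def fused_step(latent, views, denoise_view):
    """One fusion step over overlapping crops; overlapping pixels get a uniform mean."""
    out = torch.zeros_like(latent)
    count = torch.zeros_like(latent)
    for (h0, h1, w0, w1) in views:                     # list of overlapping crop coordinates
        crop = latent[..., h0:h1, w0:w1]
        out[..., h0:h1, w0:w1] += denoise_view(crop)   # e.g. one reverse step of the base model
        count[..., h0:h1, w0:w1] += 1
    return out / count.clamp(min=1)
```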
DemoFusion: Democratising High-Resolution Image Generation With No $$$
分辨率由低到高进行生成:对上一分辨率的生成结果进行上采样,再进行diffuse,使用MultiDiffusion进行生成,生成过程中的latent与diffuse得到的latent进行系数为
对MultiDiffusion进行改进,受ScaleCrafter的启发,rather than dilating the convolutional kernel, we directly dilate the sampling within the latent representation,称为dilation sampling,比如图中diffusion model的原生成尺寸是
不进行dilation sampling的为
代码中的dilation sampling就是如右下图中只取了四块。
MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning
cascade的思想。
循环进行:perform diffusion sampling starting from
类似ScaleCrafter,采样时将standard convolution layer改造成dilated convolution layer提高感受野。
类似RDM,不同分辨率使用不同的noise schedule进行diffuse,以方便relay。
FAM: Diffusion Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion
类似DemoFusion。
使用low-res的高频部分来保持structure,high-res的低频部分来refine detail。
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
和DemoFusion一个范式。
restrained dilated convolution: 去噪高分辨率latent时,类似ScaleCrafter使用dilated convolution,but we only apply dilated convolution in the layers of down-blocks and mid-blocks。
scale fusion: 去噪高分辨率latent时,直接计算self-attention称为global attention,类似MultiDiffusion那样分成UNet原分辨率patch进行计算的self-attention称为local attention,
AccDiffusion: An Accurate Method for Higher-Resolution Image Generation
MultiDiffusion这种分patch进行采样再组合的方法很容易出现object repetition的问题,主要原因是不同patch在生成时都是用了相同的prompt,所以每个patch都被迫使去生成prompt中的object,AccDiffusion decouples the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of an image patch.
AccDiffusion introduces dilated sampling with window interaction for better global consistency in higher-resolution image generation.
AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation
低分辨率生成,上采样后提取每个patch的canny,ControlNet控制配合patch-content-aware prompt进行生成。
ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance
先使用原模型根据prompt生成一张低分辨率的图,上采样到目标高分辨率,分成
将高分辨率生成过程的
fine-grained guidance:对
structural guidance:根据
HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
类似ResMaster。
StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control
加速MultiDiffusion。
SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions
MultiDiffusion只能保证相邻的子区域的图片风格一致,无法保证全局风格一致。
选一个子区域作为锚点,每一步去噪前,计算所有子区域的
SyncTweedies: A General Generative Framework Based on Synchronized Diffusions
MultiDiffusion和SyncDiffusion对应case 3。
本文发现case 2效果最好。
Learned representation-guided diffusion models for large-image generation
用图像的某个patch和这个patch对应的预训练SSL模型提取的feature训练diffusion model。
生成时先生成feature,再利用MultiDiffusion的方法,逐个patch进行overlap生成。
CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method
分为两个阶段
第一阶段
第二阶段
ASGDiffusion: Parallel High-Resolution Generation with Asynchronous Structure Guidance
多GPU并行加速CutDiffusion。
InstantAS: Minimum Coverage Sampling for Arbitrary-Size Image Generation
MultiDiffusion慢的原因是相邻patch需要overlap以传递信息。
InstantAS使用non-overlap的patch进行生成,每一步生成后重新划分patch,这样既传递了信息,又加快了速度。思想有点类似GoodDrag,边生成边优化。
Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis
根据attention entropy理论,只需要修改attention的scaling factor就可以使模型生成不同大小的图片。
DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance
先使用StableDiffusion在原分辨率进行采样,得到
上述方法可以重复进行,直到得到目标分辨率的图像。
Low-frequency component represents the low-frequency details of the image, encompassing global structures, uniformly-colored regions, and smooth textures. 所以该方法又称为DWT-based Structure Guidance,避免了使用StableDiffusion直接生成高分辨率图像时structure的不和谐。
ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models
A pre-trained StableDiffusion cannot directly generate images at higher resolutions because the receptive field of its convolution kernels is limited.
We propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference.
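A minimal sketch of the re-dilation idea: at inference time, enlarge the dilation (and padding) of 3x3 convolutions so their receptive field matches the higher test resolution. The actual method re-dilates only selected layers and timesteps; this blanket version is illustrative only and assumes zero-padded convolutions.

```python
import torch.nn as nn

def redilate_convs(unet: nn.Module, factor: int = 2):
    """Inference-time re-dilation of 3x3 convolutions (simplified, applies to all of them)."""
    for m in unet.modules():
        if isinstance(m, nn.Conv2d) and m.kernel_size == (3, 3):
            m.dilation = (factor, factor)
            m.padding = (factor, factor)   # keep the spatial size unchanged
    return unet
```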
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
Similar to ScaleCrafter, the problem is attributed to the convolution kernels: when generating higher-resolution images, the feature maps are low-pass filtered and the convolution kernels are dilated.
HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models
The generated image is highly correlated with the feature map of deep Blocks in structures and feature duplication happens in the deep Blocks. As the higher-resolution feature size of deep blocks is larger than the corresponding size in training, these blocks may fail to incorporate feature information globally to generate a reasonable structure. We contend that if the size of the higher-resolution features of deep blocks is reduced to the corresponding size in training, these blocks can generate reasonable structural information and alleviate feature duplication. Inspired by this motivation, we propose Resolution-aware U-Net (RAU-Net), a simple yet effective method to dynamically resize the features to match the deep blocks.
RAD根据输入分辨率调整第一个conv层的dilation rate,使输出的feature size匹配原模型训练时的feature size。
RAU根据输入分辨率将最后一个conv层前的插值倍数,使输出的feature size匹配当前分辨率。
Both RAD and RAU do not introduce additional trainable parameters. Therefore, RAD and RAU can be integrated into vanilla U-Net without further fine-tuning.
Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images
VAE不动,LoRA fine-tune StableDiffusion,预定义一些长宽比,每个长宽比对应一个图像长宽,训练时,根据图像长宽比找到一个最近的预定义长宽比,将图像resize到其对应的图像长宽,进行训练。这样就可以给定任意长宽比的噪声生成图像。
利用StableSR的tiled sampling进行超分,类似MultiDiffusion,可以超分到任意分辨率。
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation
定义一个递增的分辨率序列,只需要最低分辨率上训练好的diffusion model。
训练时,任选一个分辨率的
采样时,先从最低分辨率采样得到样本,加噪到某一中间步后上采样到下一个更高的分辨率,继续采样,以此循环,直到最高分辨率。
DiffCollage: Parallel Generation of Large Content with Diffusion Models
考虑一张组合图
对应的score为
可以分别训练两个模型,一个拟合原始图像
ElasticDiffusion: Training-free Arbitrary Size Image Generation
扩散模型在
CFG采样公式
对于unconditional score,之前的方法都是带overlap的分patch采样,在overlap处取平均,(每个patch
对于class direction score,将
由于计算class direction score时上下采样都使用nearest-neighbors mode,所以
Reduced-Resolution Guidance:使用unconditional score估计出一个
MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising
MultiDiffusion
FiT: Flexible Vision Transformer for Diffusion Model
BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
针对有pose和prompt的高分辨率人物生成。
direct: 使用一个已有的token,对其token embedding进行优化或适配
transform: 由一个网络将视觉信息转换为token embedding或residual
attach: 附在已有prompt之后
no pseudo word: 不需要使用已有的token或新添加token
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
StableDiffusion
The diffusion loss optimizes only the token embedding (the embedding before the text encoder).
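A minimal sketch of one Textual Inversion step; the call signatures of `unet`, `text_encoder`, and `add_noise` are assumptions for illustration, not a specific library API. The key point is that gradients are kept only for the new token's embedding row.

```python
import torch
import torch.nn.functional as F

def textual_inversion_step(unet, text_encoder, add_noise, embeddings,
                           new_token_id, prompt_ids, x0, optimizer,
                           num_timesteps=1000):
    """One Textual Inversion step (sketch): only the embedding row of the new
    pseudo-token receives gradients; UNet and text encoder stay frozen."""
    t = torch.randint(0, num_timesteps, (x0.shape[0],))
    noise = torch.randn_like(x0)
    xt = add_noise(x0, noise, t)                      # forward diffusion q(x_t | x_0)
    cond = text_encoder(prompt_ids)                   # prompt contains the new <sks> token
    loss = F.mse_loss(unet(xt, t, cond), noise)       # epsilon-prediction loss
    loss.backward()
    grad = embeddings.weight.grad                     # (vocab_size, dim)
    mask = torch.zeros_like(grad)
    mask[new_token_id] = 1.0                          # keep only the new token's row
    grad.mul_(mask)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```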
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Imagen
[V] class, where [V] is a rare token
A token embedding alone has limited expressive power and does not work well, so DreamBooth optimizes the token embedding while also fine-tuning the whole model (including the text encoder).
Fine-tuning suffers from overfitting and language drift, so the Class-specific Prior Preservation Loss is proposed: similar to replay in continual learning, samples generated by the original model are mixed with the new samples as the training set to prevent overfitting.
An improved variant fine-tunes the diffusion model with LoRA.
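A sketch of the combined objective, assuming an epsilon-prediction UNet with signature `unet(x_t, t, cond)` (an assumption for illustration):

```python
import torch.nn.functional as F

def dreambooth_loss(unet, noisy_subj, t_subj, cond_subj, target_subj,
                    noisy_prior, t_prior, cond_prior, target_prior,
                    prior_weight=1.0):
    """DreamBooth objective (sketch): reconstruction loss on the subject images
    plus a class-specific prior-preservation loss on images generated by the
    original model for the plain class prompt (e.g. "a photo of a dog")."""
    loss_subj = F.mse_loss(unet(noisy_subj, t_subj, cond_subj), target_subj)
    loss_prior = F.mse_loss(unet(noisy_prior, t_prior, cond_prior), target_prior)
    return loss_subj + prior_weight * loss_prior
```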
Multi-Concept Customization of Text-to-Image Diffusion
StableDiffusion
[V] class
同时训练token embedding和cross-attention KV projection matrix。类似DreamBooth,构造一个regularization dataset解决language drift问题。相当于只fine-tune cross-attention KV projection matrix的StableDiffusion版本的DreamBooth。
可以同时在多组reference images上进行训练,生成时可以使用多个pseudo words构造prompt。
DreamBooth++: Boosting Subject-Driven Generation via Region-Level References Packing
DreamBooth
组图,修改UNet的计算方式,convolution和self-attention的计算限制在各自的region内,训练时优化pseudo word embedding并且fine-tune整个UNet。
除了DreamBooth的两个loss,还加了一个cross-attention map之间的MSE loss。
An Improved Method for Personalizing Diffusion Models
StableDiffusion
[V] class
借鉴Imagic的两阶段训练法,第一阶段只训练token embedding,第二阶段只fine-tune diffusion model。
ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation
StableDiffusion
将reference image作为visual condition引入网络。
使用
使用reference image的text cross-attention map估计出一个mask,用这个mask过滤KV,只保留mask内的KV(KV长度变小),让Q只与有object的KV进行计算。
HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
使用CelebA-HQ数据集,训练一个HyperNetwork预测StableDiffusion的所有attention层的LoRA参数去重构图像。StableDiffusion输入统一的"a [V] face"的prompt,其中"[V]"是稀有单词,这里不优化"[V]"的token embedding,因为作者发现只需要LoRA参数,就可以用"[V]"随意造句进行生成了。
测试时,先使用HyperNetwork预测LoRA参数作为初始化,然后再进行LoRA fine-tune,fine-tune速度比DreamBooth快25倍。
HyperNetwork架构类似Q-Former,使用迭代法从零初始化的参数生成最终参数,预测出的LoRA参数加在StableDiffusion上计算diffusion loss优化HyperNetwork。
HyperNet Fields Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories
DiffLoRA: Generating Personalized Low-Rank Adaptation Weights with Diffusion
LoRA Diffusion: Zero-Shot LoRA Synthesis for Diffusion Model Personalization
diffusion model生成LoRA参数。
P+: Extended Textual Conditioning in Text-to-Image Generation
StableDiffusion
定义P+空间:UNet每层cross-attention使用的text embedding的集合。不同层使用不同text embedding有不同的效果。
P+空间的TI:对于某个concept,不同cross-attention层使用不同token embedding进行优化,在StableDiffusion中就是16个不同的token embedding。
只优化token embedding,不优化模型参数。
不同层输入不同concept的TI得到的token embedding,还可以达到semantic composition的效果。
CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization
P+空间的TI
reference image作为positive sample,找一些同类的其它图片作为negative sample。
Textual QFormer根据sample提取P+空间的token embedding序列,Visual QFormer根据sample提取visual feature,两个QFormer的query里都有一个cls token,使用两个QFormer的cls token位置的输出计算contrastive loss。
Visual QFormer提取到的reference image的feature以IP-Adapter的形式引入diffusion model。
A contrastive loss is applied to the token-embedding sequences in the P+ space: pull together the token embeddings of the same cross-attention layer across positive samples, and push apart those between positive and negative samples at the same cross-attention layer.
The model is trained with the diffusion loss and the two contrastive losses.
使用multi-view数据集训练,同一个物体的使用三个view,第一个作为reference image,第二个作为reference image的positive sample,第三个作为reconstruction target。
ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models
StableDiffusion
定义
不同阶段使用不同reference的token embedding,可以实现material、style、layout的transfer生成与编辑。
A Neural Space-Time Representation for Text-to-Image Personalization
StableDiffusion
P+空间是spatial层面的扩展,但diffusion model不同时间步性质表现都不同,所以在时间维度上继续扩展P+空间到
During optimization, the outputs of the neural mapper are unconstrained, resulting in representations that may reside far away from the true distribution of token embeddings typically passed to the text encoder. We set the norm of the network output to be equal to the norm of the embedding of the concept’s “supercategory” token. 例如学习一个cat相关的concept,最终输出为
neural mapper是一个MLP,其最后一个hidden layer前的hidden latent
Inverting a concept directly into the UNet’s input space, without going through the text encoder, could potentially lead to much quicker convergence and more accurate reconstructions. 所以让neural mapper输出两个向量,一个向量是token embedding,和其他单词一起送入CLIP text encoder,另一个向量不过CLIP text encoder,而是直接加在该CLIP text encoder输出的text embedding上,同样使用上面的normalization,防止过拟合。但是这个额外的向量只加在UNet的cross-attention层的value上,key使用不加额外向量的text encoder的输出,原理同key-locking。
Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization
将原图分解为低频和高频分量,分别对应三个pseudo word embedding,原图的pseudo word embedding等于低频和高频分量的pseudo word embedding之和,训练时从低频分量、高频分量和原图三者中随机选一个。通过分解并且分别学习,学习效果更好。
生成时使用原图的pseudo word embedding,可以结合style描述进行生成,效果比别的方法要好。
Key-Locked Rank One Editing for Text-to-Image Personalization
StableDiffusion
The two main goals of personalization are to avoid overfitting and to preserve the identity, but there is an inherent trade-off between them; to improve both of these goals simultaneously, our key insight is that models need to disentangle what is generated from where it is generated.
cross-attention中key决定了where it is generated,value决定了what is generated,所以fine-tune diffusion model时只训练
A natural solution is then to edit the weights of the cross-attention layers,
Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting
继承PerFusion的思想,训练和生成都有两条generative trajectory(C和F),使用F的cross-attention map取代C的cross-attention map,只训练pseudo word embedding和
Cross Initialization for Personalized Text-to-Image Generation
TI使用supercategory初始化token embedding
查看token embedding
对于某个token embedding
这说明TI优化是目标是
Learning to Customize Text-to-Image Diffusion In Diverse Context
利用MLM加强pseudo word embedding的语言特性。
A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization
DreamBooth。
构造更好的regularization dataset。
User-Friendly Customized Generation with Multi-Modal Prompts
使用BLIP和ChatGPT构造更好的regularization prompt、customized prompt和generation prompt。
Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization
DreamBooth。
Similar to DP, training uses descriptions that are as detailed as possible, which reduces the bias toward irrelevant content in the pseudo word.
作者总结了几种常见的bias,利用VLM生成含有这些bias描述的句子。
Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance
改进采样过程,
AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models
不同方法在不同stage进行操作,比如TI在stage 1训练token embedding,CatVersion在stage 2训练text encoder,CustomDiffusion在stage 3训练cross-attention KV projection matrix。这三类方法最终都是为了修改最后的KV,送入cross-attention影响最后的图像。
对于现有的方法,对比模型关于"a cat playing with ball"和"a <sks> playing with ball"的cross-attention map会发现,由于pseudo word的引入,其它没有变的word的cross-attention map也被影响了,这是这些方法效果不好的原因。
生成时,分别使用原模型+"a cat playing with ball"和customized model+"* <sks> * * *"进行生成,将前者的pseudo word的cross-attention map替换为后者,可以使用在多种TI方法上。
PALP: Prompt Aligned Personalization of Text-to-Image Models
LoRA版的DreamBooth
test-time fine-tune时,不仅要提供reference image,还要提供生成时需要的prompt,比如"a sketch of [V]",即每次生成前都要进行fine-tune。
One problem with personalization is overfitting: an overfitted model can predict the subject's shape and features from pure noise in a single step. Our key idea is to encourage the model's denoising prediction towards the target prompt.
除了diffusion loss,加入SDS loss,让根据prompt的预测靠近根据clean prompt的预测。
CLiC: Concept Learning in Context
StableDiffusion
Custom-Diffusion的RoI版本,对RoI区域的物体进行TI,同时优化cross-attention KV projection matrix。
SDEdit + Blended进行编辑。
MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models
Visual Concept-driven Image Generation with Text-to-Image Diffusion Model
使用CLIP text encoder编码super class name初始化token,然后使用EM算法优化。
E-step:随机选择50个步数,对reference image加噪,和带pseudo word的prompt一起送入StableDiffusion,提取pseudo word对应的cross-attention map,取平均,阈值法求出一个mask。
M-step:使用上述mask,masked diffusion loss + masked cross-attention loss优化pseudo word embedding。
Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning
Diffusion-DPO fine-tune diffusion model,目标是让模型在以含有pseudo word的prompt为条件时,更加prefer reference image。
similar loss就是reference image上的diffusion loss。
CustomSketching: Sketch Concept Extraction for Sketch-based Image Synthesis and Editing
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation
多reference image的IP-Adapter。
DreamTuner: Single Image is Enough for Subject-Driven Generation
类似ViCo的思想,将reference image的特征引入StableDiffusion就能进行subject-driven generation。
Subject-Encoder:为了解耦内容和背景特征,使用Salient Object Detection去除背景;为了解耦内容和位置特征,可以用预训练的ControlNet引入位置信息,这样学到的都是content特征。
Subject-Encoder-Attention:StableDiffusion的self-attention和cross-attention之间插入一个可训练的cross-attention层(S-E Attention),对reference image进行重构,reference image的self-attention附加到generated image的self-attention中提供参考。
Self-Subject-Attention:The features of reference image extracted by the text-to-image U-Net model are injected to the self-attention layers, which can provide refined and detailed reference because they share the same resolution with the generated features. 生成时每一步直接对reference image随机加噪,输入UNet,提取self-attention layers的key和value,与生成时的self-attention layers的key和value进行如上交互。
With the above method, personalization is possible even without training a pseudo word embedding, but training a pseudo word embedding plus fine-tuning the diffusion model as in DreamBooth works better.
FreeTuner: Any Subject in Any Style with Training-free Diffusion
类似DreamTuner。
three feature swap operations:1) cross-attention map swap: 将reconstruction branch的subjected-related cross-attention map注入personalized branch,如这里的horse。 2) self-attention map swap: 将reconstruction branch的self-attention map的
如果还有style image,使用VGG-19提取feature计算相似度,求梯度作为guidance。
SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
类似ViCo的思想,将reference image的特征引入StableDiffusion就能进行subject-driven generation。
使用CLIP text encoder编码query,使用CLIP image encoder编码reference image得到sequence feature,两个feature计算得到cross-attention map,提取CLIP image encoder不同层的sequence feature作为V,共
使用text-image pair自监督训练,提取关键词作为query,
这说明CLIP image encoder编码图像得到的sequence feature也是可以用于计算相似度的,不只是CLS token feature可以。
Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
diptych: 双连画。
DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
针对TI生成过程的优化,对于不同TI方法都适用,如DreamBooth和CustomDiffusion等。
self-attention有两个作用,一是由QK计算出的attention map控制的图像结构,二是由V控制的visual attributes,如颜色、纹理。
TI方法生成的图像,concept的结构都比较好,但是一些具体细节,如颜色、纹理,都和reference image中的concept有出入,所以本方法通过修改TI生成过程中self-attention的V做appearance保持。
具体做法是dual branch,先对reference image进行DDIM Inversion再重构,得到reconstructive trajectory,另一条从随机噪声出发,带pseudo word的prompt为条件,得到generative trajectory。由于生成图像中concept的位置不确定,和reference image中的concept位置不一致,所以直接用reconstructive trajectory中的V替换generative trajectory中对应的V会出现位置不匹配的问题,所以使用两条trajectory的UNet decoder的一些feature做semantic corresponce,根据semantic corresponce计算出dense displacement field,根据dense displacement field对reconstructive trajectory中的V做warp,使用warp后的V替换generative trajectory中对应的V。
Is This Loss Informative? Speeding Up Textual Inversion with Deterministic Objective Evaluation
提出一种early stopping criterion,加速TI接近15倍,并且效果没有明显下降。
Generate Anything Anywhere in Any Scene
DreamBooth学到的word也可以用在GLIGEN这种plug-and-play模型。但DreamBooth的一个缺点是不能解耦object和位置的信息,使用GLIGEN这种有额外layout信息的模型进行生成时,一旦修改了位置,就无法很好的生成object。
Train DreamBooth with data augmentation: by incorporating a data augmentation technique that involves aggressive random resizing and repositioning of training images, PACGen effectively disentangles object identity and spatial information in personalized image generation.
Compositional Inversion for Stable Diffusion Models
existing methods often suffer from overfitting issues, where the dominant presence of inverted concepts leads to the absence of other desired concepts. It stems from the fact that during inversion, the irrelevant semantics in the user images are also encoded, forcing the inverted concepts to occupy locations far from the core distribution in the embedding space.
Textual Inversion will make the new (pseudo-)embeddings OOD and incompatible to other concepts in the embedding space, because it does not have enough interactions with others during the post-training learning。加入正则项,使得学到的embedding和一些已知的且相关的concept的embedding不要太远,比如给定和猫相关的reference images时,使得学到的embedding和cat, pet等的embedding靠近。这样学到的embedding更具一般性,和其他单词组合造句时就像用cat造句一样,模型可以识别,也可以和其他学到的embedding组合造句进行multi-concept generation。
CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization
相同的
SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
如果用传统方法学到的pseudo word进行造句,比如"[V] is running",模型不能正确生成running,但如果使用base class造同样的句子却可以正确生成,这说明学出来的pseudo word并不能继承base class的属性。
为TI引入正则,让学出来的pseudo word继承base class的属性,最小化
Enhancing Detail Preservation for Customized Text-to-Image Generation A Regularization-Free Approach
使用不加任何正则项的TI得到token embedding。
之前的工作加正则项是为了防止过拟合,但也导致了信息提取不充分。本论文提出Fusion Sampling解决这一问题。
DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation
DreamBooth
之前的工作如TI和DreamBooth都是为reference images优化一个token,DisenBooth除此之外还为每张reference image编码一个独立的subject-unrelated token,这样有助于学习到所有reference images共有的subject的特征,而忽略每张reference image其它细节(如背景等)。
使用LoRA进行fine-tune。
DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Contrastive Prompt-Tuning
类似DisenBooth
借鉴classifier-free guidance,学习两个pseudo-words,positive pseudo word用于提取主要特征(相当于
At generation time, the output conditioned on the negative pseudo word is used as the unconditional branch u for CFG.
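In standard CFG notation (my notation, not the paper's), with $c_{\text{pos}}$ and $c_{\text{neg}}$ the prompts carrying the positive and negative pseudo words:

$$\hat{\epsilon} = \epsilon_\theta(x_t, c_{\text{neg}}) + w\,\big(\epsilon_\theta(x_t, c_{\text{pos}}) - \epsilon_\theta(x_t, c_{\text{neg}})\big)$$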
StyO: Stylize Your Face in Only One-Shot
StableDiffusion
one-shot face stylization: applying the style of a single target image to the source image。
构造content和style单词,使用三个数据集进行TI,同时也fine-tune StableDiffusion,其中target和source都只有一张图像。
之后使用该prompt进行生成:
DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models
有点类似De-Diffusion,但不是显式的caption。
生成时采样即可。
SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing
StableDiffusion
借鉴Break-A-Scene,使用DINO或SAM对intended concept做分割,使用masked diffusion loss进行训练。
两阶段训练:第一阶段做TI,只训练image encoder;第二阶段fine-tune encoder+diffusion model。
输入不带pseudo word的text计算
ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
StableDiffusion
两阶段训练:global和local。
global:使用CLIP作为feature extractor提取reference image feature,使用一个global mapping network将CLIP不同层的feature映射为不同token embedding,最深层的feature预测的token embedding对应subject-related information,浅层的feature预测的token embedding对应subject-unrelated information,同时训练global mapping network和cross-attention KV projection matrix。
local:去除reference image背景,使用CLIP作为feature extractor提取其feature,使用一个local mapping network将CLIP feature映射为token embedding,这里只使用最深层的word,额外添加一组cross-attention KV projection matrix进行训练,同时训练local mapping network和new cross-attention KV projection matrix。此时cross-attention的输出是global与local cross-attention的输出的和,global cross-attention依然使用global阶段生成的token embedding作为输入,且只使用最深层的word,但不参与训练。这一阶段类似LoRA,让模型将更多细节绑定到global阶段生成的word embedding上。
Designing an Encoder for Fast Personalization of Text-to-Image Models
StableDiffusion
Textual Inversion shows that the word embedding space exhibits a trade-off between reconstruction and editability. This is because more accurate concept representations typically reside far from the real word embeddings, leading to poorer performance when using them in novel prompts. StyleGAN inversion has the same problem, and uses a two-step solution which consists of approximate-inversion followed by model tuning. The initial inversion can be constrained to an editable region of the latent space, at the cost of providing only an approximate match for the concept. The generator can then be briefly tuned to shift the content in this region of the latent space, so that the approximate reconstruction becomes more accurate.
每个domain (like face, cat, dog, etc.)训练一个编码器
使用
每个domain先在各自的大数据集上进行预训练,再在给定的几张图像上进行test-time fine-tuning,都用一样的训练方法。
Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models
引入contrastive-based regularization technique,让encoder可以处理不同domain的数据。
Cones: Concept Neurons in Diffusion Models for Customized Generation
StableDiffusion
training-free
对每一组concepts,在cross-attention层的KV参数中,找到那些屏蔽掉后能够降低DreamBooth Loss(Reconstruction Loss+Preservation Loss)的神经元(Concept Neurons),不用训练,直接屏蔽掉这些神经元,就能得到对这组concepts敏感的text2img模型。pseudo word用一些已有但不常用的单词,比如AK47等。
Cones 2: Customizable Image Synthesis with Multiple Subjects
对于某个class的subject,学习一个该class的token的residual token embedding。做法是TI训练text encoder,但这样会使整个句子中的单词偏向subject。加入正则项:使用ChatGPT对class进行造句,分别使用训练后的text encoder和原text encoder对每个句子进行编码,使得句子中非class的单词的token embedding训练前后尽量靠近。最后的residual token embedding也是所有造句中class token embedding的差的均值。(注意,某个单词单独的embedding和其在句子中的embedding是不同的)
这样每个residual token embedding都是可重复利用的,也可以和别的residual token embedding同时使用,还可以操作cross-attention map指定concept的位置。
Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion
StableDiffusion
为参考图写一句话,但不包含pseudo word,而是利用text embedding后面的空位,加上personalized embedding,训练时只优化personalized embedding。
HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models
和HiPer类似,优化最后5个embedding。
CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization
StableDiffusion
将base class word(如dog)输入CLIP,在CLIP的最后3个self-attention层,给key和value分别concat一个可训练的residual embedding,即
Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning
在reference image上利用RL直接fine-tune StableDiffusion。
reward定义为diffusion loss。
Subject-driven Text-to-Image Generation via Apprenticeship Learning
Imagen
对每个concept,使用
这样,使用训练好的大模型,给定3-10张unseen concept的图片和这个concept对应的文本,使用这个文本随便构造prompt,就可以生成和prompt和3-10张unseen concept图片都对齐的图像。
Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
Imagen
只适用于人脸和动物等domain的个性化,并不能做到open domain的个性化。
对于每个domain,使用该domain的数据集进行训练:去除每张image的背景,训练一个object encoder提取object特征,并使用caption模型生成image的text,使用object特征和text两个条件fine-tune Imagen,使用一些正则防止过拟合。
训练好的模型可以根据reference image的物体特征和用户写的prompt自由生成。
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
StableDiffusion
类似Object Encoder,只适用于人脸和动物等domain的个性化,并不能做到open domain的个性化。
对于每个domain,使用该domain的数据集进行训练:把每张image看成一个concept进行训练,训练一个encoder编码image得到两个特征,一个concept特征,一个visual特征,concept特征替换text embedding中pseudo word所在位置的embedding,同时将visual特征通过GLIGEN引入StableDiffusion,同时训练encoder和GLIGEN的adapter,使用数据增强和去除背景等方法防止过拟合。并不优化pseudo word的token embedding。
推理时可以使用pseudo word构造prompt,encoder编码reference images得到的concept特征取均值后替换text embedding中pseudo word所在位置的embedding。
Instruct-Imagen: Image Generation with Multi-modal Instruction
Re-Imagen + Instruction Tuning
Re-Imagen的目的是为了让模型condition on multi-modal input
BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models
类似DreamTuner,训练网络直接识别reference image就可以直接生成,不需要test-time fine-tuning。训练整个Reference UNet和Base UNet的self-attention layers里的四个矩阵。
造数据进行自监督训练。
JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
类似M2M的image sequence生成方法。
Break-A-Scene: Extracting Multiple Concepts from a Single Image
提取一张图中多个concept。
给定有分割标注的图片,一次性提取图片中不同object的pseudo word,利用masked diffusion loss + masked cross-attention loss进行训练。
Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models
从
use the combination of the score of different concepts (a learnable word embedding) to reconstruct images using diffusion loss.
ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction
无监督版的Break-A-Scene。
利用聚类得到多实例的大致分割,为每个实例分配一个可学习的token embedding,使用masked diffusion loss进行学习。
使用了对比损失和正则项进行辅助和增强。
Attention Calibration for Disentangled Text-to-Image Personalization
CustomDiffusion
提取一张图中多个concept。
suppress:cross-attention map的平方(element-wise multiplication),抑制low response,增强high response。
AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization
CustomDiffusion
提取一张图中多个concept。
先不使用pseudo word,使用concept对应的class word,在某个较小的时间步,使用DatasetDiffusion的方法提取每个concept的mask,使用CustomDiffusion的方法学习时,优化每个pseudo word的cross-attention map和对应的mask之间的KL散度。
UnZipLoRA: Separating Content and Style from a Single Image
用ZipLoRA的方法直接从reference image中学
Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models
TI和P+这种只优化token embedding的方法,如果reference image是in-domain的,那就够用了,但如果是out-of-domain reference image效果就不好了。
(c) 对于token embedding和模型参数都优化的方法(如DreamBooth和CustomDiffusion),如果只用优化好的token embedding和原模型参数进行生成,生成的都较为相似,说明token embedding捕捉的还是in-domain的信息,out-of-domain的信息蕴藏在更新的模型参数中。
(d) 为了将更多的信息转移到token中,采用P+的layer-wise embedding并使用multi-word embedding。
单个concept的学完后,如何融合多个LoRA参数
LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models
解决Mix-of-Show中多个单独训练的pseudo word和LoRA如何一起参与生成的问题,如果像Mix-of-Show中训练一个额外的LoRA融合所有LoRA,会导致concept confusion和concept vanishing。
training-free方法,需要提供不同concept的bounding box。
将prompt拆分为local prompt,每个local prompt一个concept(pseudo word),每一步生成时,
ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
正则项
LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
loss和ZipLoRA完全一样,但是训练一个HyperNetwork,输入
使用大量不同LoRA训练,这样可以zero-shot,不需要像ZipLoRA那样每对LoRA都要重新训练。
LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
Given a pre-trained LoRA
Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models
Continual Learning setup: a user may wish to personalize a model on multiple concepts but one at a time, with no access to the data from previous concepts due to storage/privacy concerns.
MultiBooth: Towards Generating All Your Concepts in an Image from Text
思想类似LoRA-Composer。
MC2: Multi-concept Guidance for Customized Multi-concept Generation
类似LoRA-Composer,解决Mix-of-Show中多个单独训练的pseudo word和LoRA如何一起参与生成的问题,如果像Mix-of-Show中训练一个额外的LoRA融合所有LoRA,会导致concept confusion和concept vanishing。
training-free方法,但不需要提供不同concept的bounding box。
将prompt拆分为local prompt,每个local prompt一个concept(pseudo word),每一步生成时,
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models
类似LoRA-Composer,解决Mix-of-Show中多个单独训练的pseudo word和LoRA如何一起参与生成的问题,是一种通用的方法,可以用在不同TI方法上,甚至不同TI方法训练出来的pseudo word和LoRA也可以一起生成。
使用两阶段进行生成。第一阶段先用general class word替代pseudo word,使用原StableDiffusion进行生成,保留生成过程中所有general class word对应的cross-attention map,使用SAM得到生成结果中general class word对应的mask;第二阶段和第一阶段进行一样的生成过程,但在每一步,对于每个concept,使用pseudo word和对应的LoRA进行生成,所有concept预测的噪声使用第一阶段的mask进行blending,同时也使用第一阶段的cross-attention map替换pseudo word对应的cross-attention map,以做到layout preservation。
FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition
training-free方法,需要提供不同concept的mask。
MRSA:inject KV of self-attention in reference path into composition path。
Orthogonal Adaptation for Modular Customization of Diffusion Models
LoRA fine-tune时,不同concept使用互相正交的B,固定B,只训练A,这样学到的多个concept可以同时生成,正交性使得不同concept的LoRA参数可以直接相加在一起使用。
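A sketch of this construction, assuming each concept receives disjoint columns of one random orthogonal matrix as its frozen B (the paper's exact construction may differ):

```python
import torch

def make_orthogonal_B(d_out, rank, n_concepts, seed=0):
    """Give each concept a fixed B whose columns are mutually orthogonal to
    every other concept's, by slicing disjoint columns of a random orthogonal
    matrix (illustrative sketch)."""
    g = torch.Generator().manual_seed(seed)
    Q, _ = torch.linalg.qr(torch.randn(d_out, d_out, generator=g))
    return [Q[:, i * rank:(i + 1) * rank] for i in range(n_concepts)]

def merged_lora_delta(Bs, As):
    """Because the B_i are mutually orthogonal, the per-concept updates
    B_i @ A_i can simply be summed at merge time without interference."""
    return sum(B @ A for B, A in zip(Bs, As))

d_out, d_in, rank = 320, 768, 4
Bs = make_orthogonal_B(d_out, rank, n_concepts=3)        # frozen, one per concept
As = [torch.randn(rank, d_in) * 0.01 for _ in range(3)]  # trained, one per concept
delta_W = merged_lora_delta(Bs, As)                      # added to the base weight
```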
Mixture of LoRA Experts
解决Mix-of-Show中多个单独训练的pseudo word和LoRA如何一起参与生成的问题,如果像Mix-of-Show中训练一个额外的LoRA融合所有LoRA,会导致concept confusion和concept vanishing。
类似MoE,训练一个gating function,其根据LoRA的输出计算一个gating value,使用gating value线性组合不同LoRA的输出,使用训练LoRA时的数据和loss进行训练。
Break-for-Make: Modular Low-Rank Adaptations for Composable Content-Style Customization
DreamBooth+LoRA(加在cross-attention上),需要同时学两个pseudo word(不同的reference image),一个是content,一个是style。有两个baseline:一个是公用同一个LoRA联合训练,另一个是分开学LoRA然后直接加在一起使用。
做矩阵分解,
Cones 2: Customizable Image Synthesis with Multiple Subjects
Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation
StableDiffusion
任务:给一个句子,和句子中某些单词对应的图像,生成句子对应的图像,其中给定图像的单词对应的object要和给定图像相似,相当于可以做composition。类似PbE的self-supervised learning:利用预训练目标检测模型,在LAION数据集上,标注出句子中具体单词对应的object在图像中的位置,构建新的数据集。
不fine-tune模型,只训练一个MLP,将给定图像的CLIP image embedding转换为token embedding,用TI方法训练这个MLP,类似FastComposer。
Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning
StableDiffusion
训练一个open domain并且不需要test-time fine-tuning的模型。
数据集:使用BLIP为图像生成caption,提取caption中的subject,使用DINO+SAM分割出每个subject对应的bounding box,在caption后加[subject_0] is [placeholder_0], [subject_1] is [placeholder_1]...,构成数据集。
训练时,使用CLIP image encoder编码每个subject对应的bounding box内的内容,使用编码结果直接替换上述的[placeholder_i]的embedding,并且重新训练text encoder,这样就在建模text前引入融合了图像信息(实验发现这样比建模句子后再融合要好);同时训练cross-attention KV projection matrix(因为他们负责转换text feature);类似GLIGEN在self-attention和cross-attention之间加一个adapter,引入bounding box信息(帮助识别区分多物体)。
推理时,给定一个caption,在caption之后加[subject_0] is [placeholder_0], [subject_1] is [placeholder_1]...,为每个[placeholder_i]提供一张reference image,还可以为每个[placeholder_i]指定一个bounding box。
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation
Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation
基于IP-Adapter的多subject组合生成。
为text prompt中某些subject token提供image prompt,阈值法使用subject token对应的text cross-attention map估计出一个mask,乘到对应的image prompt的image cross-attention map上,所有image prompt的image cross-attention的输出加权和。
A&E防止object missing。
侧重于发掘之前没有的concept。
The Hidden Language of Diffusion Models
decomposing an input text prompt into a small set of interpretable elements.
对于某个concept,造句生成的100张图像,找一堆base word,学习一个MLP,为每个base word预测一个权重,所有base word的线性组合去重构这100张图像。目的是学习这个concept可以由哪些base word解释。
CusConcept: Customized Visual Concept Decomposition with Diffusion Models
类似Conceptor。
Exploiting Interpretable Capabilities with Concept-Enhanced Diffusion and Prototype Networks
ConceptLab: Creative Generation using Diffusion Prior Constraints
利用DALL
PartCraft: Crafting Creative Objects by Parts
StableDiffusion
使用DINOv2对数据集进行unsupervised part discovery,分为三阶段k-means,第一阶段
可以实现不同part的任意组合,生成新物种。
ReVersion: Diffusion-Based Relation Inversion from Images
TI训练优化一个relation token,提取reference images中共同存在的relation特征而不是object特征,比如握手,之后用relation token造句可以生成具有相同relation的图像。
Relation-Steering Contrastive Learning:relation token应该具有介词词性,使用一个contrastive loss,拉近relation token与已有的介词的距离,拉远relation token与其它词性的单词的距离。
Customizing Text-to-Image Generation with Inverted Interaction
类似ReVersion,TI学习物体之间交互关系。
Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models
invert any concepts in exemplar images, such as "frozen in ice", "burnt and melted", and "closed eyes"
Contrastive learning is used: synonyms of the concept are constructed as positives and antonyms as negatives, and an InfoNCE loss is computed.
Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
TI训练优化一个action token embedding,提取reference images中共同存在的action特征而不是object特征,比如倒立,之后用action token造句可以生成具有相同relation的图像。
对于某个要学习的action,在每一个cross-attention层都优化一个token,这样就不必局限于单个token,语义更丰富。
避免学到与action无关的特征:
ImPoster: Text and Frequency Guidance for Personalization in Diffusion Models
左上角是source image,左下角是driving image,先用这两张image fine-tune diffusion model。
Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models
FSViewFusion: Few-Shots View Generation of Novel Objects
Customizing Text-to-Image Diffusion with Camera Viewpoint Control
Given multi-view images of a new object, we create a customized text-to-image diffusion model with camera pose control.
Learning Continuous 3D Words for Text-to-Image Generation
Learn a continuous function that maps a set of attributes from some continuous domain to the token embedding domain.
CustomNet: Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models
构造同一object不同view的图像作为数据,编码object和view作为条件,使用IP-Adapter进行训练。
FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
DreamIdentity: Improved Editability for Efficient Face-identity Preserved Image Generation
使用预训练的ViT架构的人脸识别模型,提取3,6,9,12和最后一层的CLS token位置的feature,concat在一起,分别使用2个MLP将其转化成2个token embedding,使用diffusion loss和token embedding的L2正则进行训练。
使用不同层的feature的原因是最后一层的feature蕴含的都是比较高级的语义信息,缺少一些细节。
类似DreamIdentity利用预训练人脸模型的multi-scale feature,同时使用一个预训练expression encoder提取表情feature,以20%概率替换为一个可学习的代表无表情的向量,两个feature concat在一起,使用mapping network转化为token embedding,使用diffusion loss进行训练。
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
StableDiffusion
训练集由多个不同的人物id组成,每个人物id包含同一个人的多个image-text pair,text中包含man或woman描述image。训练时,使用CLIP image encoder将某个id的
生成时不再需要额外训练,任意给定某个人物的几张image,编写prompt进行生成。
PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization
类似PhotoMaker。
Foundation Cures Personalization: Recovering Facial Personalized Models Prompt Consistency
MegaPortrait: Revisiting Diffusion Control for High-fidelity Portrait Generation
Arc2Face: A Foundation Model of Human Faces
IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models
图里没画出来,在prompt后加了"the woman is sks",并且at the first embedding layer of the text encoder, we replace the text embedding of the identifier word “sks” with the identity text embedding,但没有优化sks的token embedding,而是用学到的embedding取代。
Dense-Face: Personalized Face Generation Model via Dense Annotation Prediction
We use internal features of the SD UNet to predict dense face annotations, enabling the proposed method to gain domain knowledge in face generation.
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation
使用多人脸图像自监督训练。
采样时只需要提供aligned faces。
Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation
给定一张人脸图像,和对场景和表情的描述,先使用StableDiffusion根据场景描述生成一张图像作为训练数据,再根据表情描述从数据库中选择一张具有该表情的图像作为表情条件,人脸图像作为id条件。
将训练数据的人脸部分mask掉(保留场景),concat在
使用diffusion loss + identity loss + expression loss 一起训练diffusion model,不需要自监督。
DemoCaricature: Democratising Caricature Generation with a Rough Sketch
ROME
Identity-Preserving Aging of Face Images via Latent Diffusion Models
DreamBooth
计算Class-specific Prior Preservation Loss时,将人脸数据按age分组,每组一个组名,如child,old等,使用带有组名的prompt和图像作为数据集。
训练后,使用photo of a
Inserting Anybody in Diffusion Models via Celeb Basis
StableDiffusion's text embeddings can be interpolated for generation. Based on this observation, collect the names of celebrities that the CLIP text encoder can recognize and use PCA to compute a basis of their token embeddings; this basis can be viewed as a representation of facial features in the token embedding space.
训练时,给定任意一张人脸的图片,训练一个MLP去modulate这组基,组成该人脸对应的pseudo word的embedding,插入"a photo of _",使用Textual Inversion方法训练这个MLP。
CharacterFactory: Sampling Consistent Characters with GANs for Diffusion Models
人物工厂,不是TI,不需要reference image,直接生成随机的可用的pseudo word embedding。
使用GAN生成fake embedding,采样名人的人名作为real embedding,对抗训练。
StableIdentity: Inserting Anybody into Anywhere at First Sight
受Celeb Basis启发,寻找一些名人的人名,得到他们的word embedding。通过一个MLP将输入人脸图像转化为两个word embedding,通过AdaIN转化到celeb word embedding空间(celeb word embedding的均值和方差分别充当shift和scale),TI训练这个MLP。学到的两个word embedding可以用于任何text-based generative model,比如ControlNet,text2video。
SeFi-IDE: Semantic-Fidelity Identity Embedding for Personalized Diffusion-Based Generation
LCM-Lookahead for Encoder-based Text-to-Image Personalization
专注人脸的IP-Adapter。
RealFill: Reference-Driven Generation for Authentic Image Completion
有reference images的inpainting任务,借助TI技术提取reference images的信息,辅助inpainting。
Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention
类似RealFill,有reference images的inpainting任务,借助TI技术提取reference images的信息,辅助inpainting。
Personalized Restoration via Dual-Pivot Tuning
有reference images的restoration任务。
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive.
DreamBench++ is a human-aligned benchmark automated by advanced multimodal GPT models.
Hollowed Net for On-Device Personalization of Text-to-Image Diffusion Models
Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion
training-free的方法大致有两种,一种是类似nursing的操作,设计一种
Sketch-Guided Text-to-Image Diffusion Models
为预训练好的StableDiffusion引入sketch。
使用预训练好的edge提取器生成训练数据(自监督),训练一个可以根据UNet的各层feature maps预测edge的MLP。方法类似于Label-Efficient Semantic Segmentation With Diffusion Models。
采样时用MLP损失函数的梯度做classifier guidance,只在T到0.5T加guidance。
使用dynamic guidance scheme:
Sketch-Guided Scene Image Generation
先利用每个object的sketch和只含有object的prompt单独生成该object,之后对该object进行TI学习。
将所有object按sketch的位置拼在一起进行blended生成。
It’s All About Your Sketch: Democratising Sketch Control in Diffusion Models
Trained only on a small set of sketch-image pairs, without text.
使用CLIP编码sketch,取最后一层的feature sequence,只训练一个sketch adapter,将其转化成CLIP text embedding,送入StableDiffusion的cross-attention进行训练,除了diffusion loss,还有两个额外的loss:每一步预测的
ToddlerDiffusion: Flash Interpretable Controllable Diffusion Model
模拟人类画图的思路,先生成sketch,再生成palette,最后生成图像。使用ShiftDDPMs的公式,以sketch或palette而不是pure noise为起点进行训练。
Training-Free Sketch-Guided Diffusion with Latent Optimization
training-free
KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models
CGC将sketch和text feature进行融合,融合后当做text输入diffusion model。
FGC是ControlNet或者T2I-Adapter,乘一个系数进行knob。
Compositional 3D Scene Generation using Locally Conditioned Diffusion
给定
Semantic-Driven Initial Image Construction for Guided Image Synthesis in Diffusion Model
受Initial Image Editing的启发,只需要精心构建
利用StableDiffusion,最深层cross-attention map的一个值对应
生成时,从物体对应的noise block数据库中采样,填在指定的bounding box内进行生成。
NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging
masked cross-attention:layout之内的image feature与object prompt进行cross-attention,layout之外的image feature与global prompt进行cross-attention,两者结果相加。
The Crystal Ball Hypothesis in Diffusion Models: Anticipating Object Positions from Initial Noise
A trigger patch is a patch in the noise space with the following properties: (1) Triggering Effect: When it presents in the initial noise, the trigger patch consistently induces object generation at its corresponding location; (2) Universality Across Prompts: The same trigger patch can trigger the generation of various objects, depending on the given prompt.
We try to train a trigger patch detector, which functions similarly to an object detector but operates in the noise space. 随机噪声,生成图像,使用预训练好的object detector检测物体,检测得到的结果作为该噪声的ground truth,训练trigger patch detector。
生成时,随机噪声,检测trigger patch,移动trigger patch到目标位置。
LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation
adapt pre-trained unconditional or conditional diffusion models,在每个attention layer后加一个带residual的layout attention layer,即h=LayoutAttn(h)+h。
LayoutAttn(h)将layout分成每个instance单独的layout(即只标识了一个object),每个layout当成mask,提取h中该object的region feature map,然后为每个feature加上该object对应的class label或者caption的learnable embedding,然后做self-attention;对于h,使用空标签或者空字符串的learnable embedding加到每个feature上,做self-attention,作为背景;然后乘上mask加在一起,重叠部分取平均。类似ControlNet,参数初始化为0,LayoutAttn(h)一开始输出为0,训练开始前不影响原网络。
LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation
重新设计UNet,全部重新训练。
CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation
MMDiT
Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation
The proposed regional cross-attention layer is inserted into the original diffusion model right after each self-attention layer. The weights of the output linear layer are initialized to zero, ensuring that the model equals to the foundational model at the very beginning.
IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation
PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
layout control map:将layout转换为semantic mask,让对应的word的cross-attention map只有semantic mask内的响应值,但由于StableDiffusion是在8倍下采样的latent上运行的(深层的feature map更小),对mask采取同样的下采样可能会导致一些小物体被忽略,所以这里通过感受野计算mask,对于feature map上每个image token,如果其在原图尺寸上的感受野与当前物体的semantic mask有交集,则设为1,否则设为0。使用原cross-attention map与乘上mask后的cross-attention map的插值。
Semantic Alignment Loss:encourages image tokens to interact more with the same and related semantic regions in the self-attention module, thereby further improving the layout alignment of the generated images. 通过cross-attention控制self-attention,对于某个word,将其cross-attention map(
Layout-Free Prior Preservation Loss:由于数据集较小,为了防止过拟合,使用一些文生图数据计算diffusion loss,此时把layout control map中的semantic mask cross-attention map的插值系数设为0即可。
MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis
在StableDiffusion原有的cross-attention output上乘上mask,再额外训练一个并行的cross-attention(enhancement attention),在output上乘上mask,两者相加作为当前instance的shading result;再额外训练一个并行的self-attention(layout-attention),在output上分别乘上前景和背景的mask,得到两个shading result;
只在mid-layers (i.e., 8 × 8)和the lowest-resolution decoder layers (i.e., 16 × 16)上应用MIGC。
在COCO上使用diffusion loss训练,同时还优化cross-attention map上背景区域的响应值之和(类似TokenCompose)。
HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation
Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models
StableDiffusion
training-free
box: 对text中有bounding box的object对应的cross-attention map,定义一些bounding box附近的sliding box,bounding box内的响应值减去bounding box外的响应值再加上这些sliding box内的响应值与bounding box内的响应值的IoU(保证均匀),作为object reward。
bind: attribute的cross-attention map与对应的object的cross-attention map在bounding box内的响应值的KL散度的相反数,作为attribute reward。
两个reward加在一起求梯度作为guidance。
R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation
StableDiffusion
training-free
LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts
有点类似SpaText,每个object都对应一个region map,其大小和图像一致,并在bounding box内填上可训练的对应object的embedding,bounding box外填上可训练的background的embedding。所有region map分成patch,不同region map的同一个位置的patch组成一个序列,序列前再prepend一个agg embedding,送入一个ViT,不需要线性映射,不需要加positional embedding,取agg embedding的输出。所有位置都按此处理,按位置排列所有输出,组合成图像大小的一个layout embedding。训练一个diffusion model,将layout embedding与
Spatial-Aware Latent Initialization for Controllable Image Generation
Directed Diffusion: Direct Control of Object Placement through Attention Guidance
StableDiffusion
training-free
在生成时,提高text token对应的cross-attention map的bounding box区域的权重。
Grounded Text-to-Image Synthesis with Attention Refocusing
StableDiffusion
training-free
attention refocusing
cross-attention refocusing:类似Attend-and-Excite,
self-attention refocusing:
采样时计算上述loss,用
Boundary Attention Constrained Zero-Shot Layout-To-Image Generation
StableDiffusion
training-free
BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
StableDiffusion
training-free
只在16x16的分辨率上进行操作
Similar to Attention Refocusing: at generation time, bounding boxes are given for certain substrings of the text, and three constraints are applied to the corresponding cross-attention maps: the Inner-Box Constraint (strengthen responses inside the bounding box, encouraging the object to appear there), the Outer-Box Constraint (weaken responses outside the bounding box, preventing the object from appearing elsewhere), and the Corner Constraint (encourage the object to fill the bounding box instead of collapsing into a small object inside it); the gradient of the summed losses with respect to x_t is used as guidance.
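A hedged sketch of the inner/outer-box constraints on a single token's 16x16 cross-attention map (the corner constraint is omitted; the top-k reduction and all names are my assumptions):

```python
import torch

def box_constraint_loss(attn_map, box, topk=10):
    """BoxDiff-style constraints on one text token's cross-attention map
    (sketch): reward response inside the box, penalize response outside.
    attn_map: (H, W) attention of one token; box: (x0, y0, x1, y1)."""
    H, W = attn_map.shape
    mask = torch.zeros(H, W)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = 1.0
    inner = (attn_map * mask).flatten().topk(topk).values.mean()
    outer = (attn_map * (1 - mask)).flatten().topk(topk).values.mean()
    # Inner-Box: make responses inside the box large; Outer-Box: suppress outside.
    return (1 - inner) + outer

# The summed losses over all boxed tokens are differentiated w.r.t. x_t and the
# gradient (scaled) is used as guidance before the next denoising step.
attn = torch.rand(16, 16, requires_grad=True)
loss = box_constraint_loss(attn, box=(2, 2, 10, 10))
loss.backward()
```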
Training-free Composite Scene Generation for Layout-to-Image Synthesis
StableDiffusion
training-free
只在16x16的分辨率上进行操作
Similar to BoxDiff, several constraint losses are designed, and the gradient of their sum with respect to x_t is used as guidance.
Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis
StableDiffusion + StableInpainting
training-free
PACA:增大除了SOT之外所有token的cross-attention map中的mask区域内的响应值。对于SOT有一个很有意思的特点,其cross-attention map中的值哪里被增大了,最终输出的图像哪里就会变成背景,所以可以利用这一特点,对SOT的cross-attention map进行反向操作,增大mask区域外的响应值。
ReGCA:inpainting的cross-attention,背景和前景使用不同的KV,只对背景使用global prompt。
Localized Text-to-Image Generation for Free via Cross Attention Control
Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control
Enhancing Image Layout Control with Loss-Guided Diffusion Models
StableDiffusion
training-free
Cross Attention Control
除了text之外,额外提供了
SpaText: Spatio-Textual Representation for Controllable Image Generation
每个segment对应一个text,可以分区域生成,指定物体之间的空间关系。
自监督训练,使用预训练分割模型提取图像segments,用CLIP提取每个segment的CLIP image embedding,初始化一个全为0的segmentation map,大小和图像一样,通道数和CLIP image embedding维数一样,将每个segment的CLIP image embedding放到segmentation map中对应位置。
改造DALL
推理时用DALL
Enhancing Object Coherence in Layout-to-Image Synthesis
修改StableDiffusion网络结构,fine-tune。
Freestyle Layout-to-Image Synthesis
将StableDiffusion的cross-attention改为rectified cross-attention:将text token对应的cross-attention map中,在bounding box之内的保留原值,在bounding box之外的设为负无穷。By forcing each text token to affect only pixels in the region specified by the layout, the spatial alignment between the generated image and the given layout is guaranteed。再使用任何layout-based数据fine-tune StableDiffusion。
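A sketch of rectified cross-attention for a single attention head, with per-token region masks supplied as flattened 0/1 maps (shapes and names are my assumptions):

```python
import torch

def rectified_cross_attention(q, k, v, token_boxes, scale):
    """Rectified cross-attention (sketch): a text token's attention logits are
    kept only at image positions inside that token's box and set to -inf
    elsewhere, so each token can only affect its own region.
    q: (N_img, d); k, v: (N_txt, d); token_boxes: (N_txt, N_img) 0/1 masks."""
    logits = q @ k.T * scale                          # (N_img, N_txt)
    neg_inf = torch.finfo(logits.dtype).min
    logits = logits.masked_fill(token_boxes.T == 0, neg_inf)
    return torch.softmax(logits, dim=-1) @ v

n_img, n_txt, d = 16 * 16, 4, 64
q, k, v = torch.randn(n_img, d), torch.randn(n_txt, d), torch.randn(n_txt, d)
boxes = torch.ones(n_txt, n_img)                      # all-ones mask = vanilla attention
out = rectified_cross_attention(q, k, v, boxes, scale=d ** -0.5)
```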
Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive
传统训练方法只是将layout作为条件输入模型优化diffusion loss,并没有对layout的显式监督,可能导致生成结果和layout不匹配。一个解决方法是使用预训练的segmentor对
引入对抗训练,判别器:训练将ground truth每个pixel正确分类到N个real class,将
multistep unrolling:由于layout是diffusion生成早期阶段就决定的,但此时
Dense Text-to-Image Generation with Attention Modulation
StableDiffusion
training-free
和rectified cross-attention一样的思路,只不过是training-free的,可以直接采样:At cross-attention layers, we modulate the attention scores between paired image and text tokens to have higher values. At self-attention layers, the modulation is applied so that pairs of image tokens belonging to the same object exhibit higher values。这里的paired image and text tokens意思是当前image token的位置在text token所描述的object的bounding box内。
Stochastic Conditional Diffusion Models for Robust Semantic Image Synthesis
In real-world applications, semantic image synthesis often encounters noisy user inputs. SCDM enhances robustness by stochastically perturbing the semantic label maps through Label Diffusion, which diffuses the labels with discrete diffusion.
MagicMix: Semantic Mixing with Diffusion Models
noisy latents linear combination版本的SDEdit,削弱原图的细节,只保留基本的结构和外观信息。
DiffFashion: Reference-based Fashion Design with Structure-aware Transfer by Diffusion Models
DiffEdit+MagicMix
Integrating Geometric Control into Text-to-Image Diffusion Models for High-Quality Detection Data Generation via Text Prompt
translate geometric conditions to text(包括object坐标等),fine-tune StableDiffusion。
GLoD: Composing Global Contexts and Local Details in Image Generation
Masked SEGA.
StablePose: Leveraging Transformers for Pose-Guided Text-to-Image Generation
Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
相机参数:bokeh blur, lens, shutter speed, temperature等。
对于每个相机参数
用这些视频LoRA fine-tune T2V模型,同时训练一个contrastive camera encoder编码相机参数,编码结果拼在invariant scene description的编码结果之后。
contrastive camera encoder: 因为前后帧之间只有某个相机参数不同,所以做差取feature。
推理时,既可以根据给定的相机参数和prompt生成图像(所有帧使用相同相机参数),也可以对已有图像进行相机参数的编辑(从原图相机参数平滑过渡到目标相机参数)。
用T2V的原因:即使固定随机种子,只要prompt稍有差异,T2I生成图像也会有很大差异,但T2V可以保持前后帧scene的一致性。
Joint Generative Modeling of Scene Graphs and Images via Diffusion Models
Train a DiffuseSG model (Graph Transformer) to produce layout and then utilize a pretrained layout-to-image model to generate images.
Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation
R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion
将scene graph分为多个三元组(object1-relation-object2),所有三元组拼在一起作为条件输入denoising model进行训练。
除了diffusion loss,还加了两个contrastive loss,从同一个batch中采样具有相同relation的三元组作为positive,batch内其余三元组作为negtive,利用relation的cross-attention map之间的cosine similarity计算一个contrastive loss,再利用三元组的diffusion loss之间的MSE计算一个contrastive loss。
Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
A large-scale dataset with high-quality structural annotations of scene graphs (SG).
TraDiffusion: Trajectory-Based Training-Free Image Generation
定义cross-attention map和trajectory之间的energy function,求梯度作为guidance进行采样。
Compositional Text-to-Image Generation with Dense Blob Representations
GLIGEN with blob tokens
DiffUHaul: A Training-Free Method for Object Dragging in Images
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
StableDiffusion
CLIP image encoder提取image embedding,训练一个线性层将其映射到长为4的sequence,类似StyleAdapter,加一个和text cross-attention layer并行的可训练的image cross-attention layer,使用原来的数据集,训练线性层和image cross-attention layer。
训练好的模型可以与ControlNet和T2IAdapter一起使用,无需额外训练。
IP-Adapter+:在text cross-attention layer之后加可训练的image cross-attention layer。
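A sketch of the decoupled cross-attention, with the image branch weighted by a `scale` factor (single head, no learned projections, names are mine):

```python
import torch

def ip_adapter_cross_attention(q, text_kv, image_kv, scale=1.0):
    """IP-Adapter-style decoupled cross-attention (sketch): the frozen text
    cross-attention output plus a parallel, trainable image cross-attention
    output weighted by `scale`. q: (N, d); each kv is a (K, V) tuple."""
    def attn(q, k, v):
        w = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        return w @ v
    k_t, v_t = text_kv
    k_i, v_i = image_kv
    return attn(q, k_t, v_t) + scale * attn(q, k_i, v_i)

d = 64
q = torch.randn(256, d)
text_kv = (torch.randn(77, d), torch.randn(77, d))
image_kv = (torch.randn(4, d), torch.randn(4, d))    # 4 image tokens from the projector
out = ip_adapter_cross_attention(q, text_kv, image_kv, scale=0.8)
```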
IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts
基于IP-Adapter+,在image cross-attention layer再加一个text cross-attention layer,与instruction进行交互,使用instruction editing数据进行训练。
使用prompt,ip image,instruction一起生成。
使用成对的图像数据集,其中一张作为condition,另一张作为target,重新训练一个U-ViT的diffusion model,we do not use any text inputs and only rely on image conditioning.
使用预训练的CLIP或者DINO编码图像得到的token sequence或者CLS token作为condition,当使用token sequence时使用cross-attention,当使用CLS token时使用FiLM。
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
We constructed the first fine-grained visual attributes dataset (FiVA) to the best of our knowledge. This FiVA dataset features a well-organized taxonomy for visual attributes and includes 1 M high quality generated images with visual attribute annotations.
PuLID: Pure and Lightning ID Customization via Contrastive Alignment
IP-Adapter在训练时使用从原图中提取的feature,这一定程度上会导致模型过拟合,除了diffusion loss,还引入了两个alignment loss和一个ID loss。
训练时构造两条contrastive paths,one path with ID:两个cross-attention都用;the other path without ID:只用text cross-attention。为了确保semantic alignment使用text作为Q,image feature作为KV,计算cross-attention map,优化两条paths的cross-attention map之间的MSE loss。The insight behind our semantic alignment loss is simple: if the embedding of ID does not affect the original model’s behavior, then the response of the UNet features to the prompt should be similar in both paths.
为了确保layout alignment,同时优化两条paths的image feature的MSE loss。
使用
InstantID: Zero-shot Identity-Preserving Generation in Seconds
上半部分类似IP-Adapter,只是将CLIP image embedding换成了face id embedding。但是作者认为这种方法不够好,因为image token和text token本身提供的信息就不同,控制的方式和力度也不同,但是IP-Adapter却把他们concat在一起,有互相dominate和impair的可能。
提出使用另一个IdentityNet(ControlNet架构)提供额外的空间信息,根据上述原因,这里的ControlNet去掉了text的cross-attention,只保留face id embedding的cross-attention。这里只提供双眼、鼻子、嘴巴的key points作为输入,一方面是因为数据集比较多样,更多的key points会导致检测困难,让数据变脏;另一方面是为了方便生成,也可以增加使用文本或者其他ControlNet的可编辑性。
在人脸数据集上自监督训练。
ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning
A general framework to achieve identity preservation via feedback learning.
Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
类似ObjectStitch,训练一个SeeCoder将reference image转换为CLIP text embedding,然后使用其替换StableDiffusion的CLIP text encoder,实现只使用reference image生成图像。还可以使用ControlNet引入其它条件。
Many-to-many Image Generation with Auto-regressive Diffusion Models
构造一个image sequence数据集。
训练时每个样本是一个image sequence
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Manga Generation via Layout-controllable Diffusion
Late-Constraint Diffusion Guidance for Controllable Image Synthesis
为预训练好的StableDiffusion引入各种条件,算是SKG的升级版。
使用预训练好的模型抽取image的各种conditions(如mask、edge等),训练一个可以根据UNet的各层feature maps预测conditions的condition adapter。
采样时,用当前的feature maps输入到condition adapter得到预测的conditions,与给定的conditions计算距离,求梯度作为guidance。
这类方法本质上还是训练一个noisy classifier,但使用的是diffusion model的feature。
Readout Guidance: Learning Control from Diffusion Features
和Late-Constraint类似,分为spatial和relative两种head。
spatial包含pose,edge,depth等,训练模型根据diffusion feature预测ground truth,采样时根据预测和给定的label计算MSE loss,求梯度作为guidance。
relative包含corresponce feature和appearance similarity,训练模型根据两个不同图像的diffusion feature进行预测。
drag:corresponce feature head uses image pairs with labeled point correspondences and trains a network such that the feature distance between corresponding points is minimized, i.e., the target point feature is the nearest neighbor for a given source point feature. We compute pseudo-labels using a point tracking algorithm to track a grid of query points across the entire video. We randomly select two frames from the same video and a subset of the tracked points that are visible in both frames. 训练时,将输入的diffusion feature转化为一个feature map,image pairs的feature map之间的corresponding point feature之间计算loss;编辑时,先将原图输入UNet得到diffusion feature,再送入网络提取feature map,计算其staring point处的feature与生成图像的feature map的target point处的feature的距离,求梯度作为guidance。
Modulating Pretrained Diffusion Models for Multimodal Image
将
Amazing Combinatorial Creation Acceptable Swap-Sampling for Text-to-Image Generation
给定两个object text,生成两个concept融合在一起的图像,类似MagicMix。
对于一个0-1的列交换向量,其长度和CLIP编码结果的维度相同,若向量某位置为0,则选取第二个object text的CLIP编码结果的该位置的列向量,若向量某位置为1,则选取第一个object text的CLIP编码结果的该位置的列向量,组合成一个新的CLIP编码结果,将其输入到StableDiffusion是可以生成两个concept融合在一起的图像的。
实践中,随机采样一堆列交换向量,每个列交换向量按上述流程生成图像,再使用一些选取策略从所有图像中选出最符合标准的图像。
SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
Fine-tuning or controllable generation is achieved by editing the features on the skip connections.
SC-Tuner:
CSC-Tuner:
GLIGEN: Open-Set Grounded Text-to-Image Generation
StableDiffusion
除了caption,额外给定一组entity和对应的grounding信息(比如layout),进行spatial control。
A trainable gated self-attention layer is inserted between self-attention and cross-attention: the grounding tokens and visual tokens are concatenated and passed through self-attention, only the outputs at the visual-token positions are kept, multiplied by a trainable gate scalar, and added back via a residual connection. The gate scalar is initialized to 0, analogous to ControlNet's zero-conv, so that the network initially behaves exactly like StableDiffusion.
grounding token由entity和对应的grounding的feature同时输入一个可训练的MLP预测。entity可以是文本或者图像,为文本时就用预训练文本编码器提取其feature,为图像时就用预训练图像编码器提取其feature,grounding使用Fourier embedding提取其feature,如果是layout,就是左上右下两个坐标,如果是keypoint,就是一个坐标,如果是depth map,此时就没有entity了,直接使用一个网络将其转换为
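A sketch of the gated self-attention layer described above, using `nn.MultiheadAttention` as a stand-in for the actual attention implementation; the zero-initialized gate reproduces the "no effect at initialization" behaviour:

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """GLIGEN-style gated self-attention (sketch): visual and grounding tokens
    are concatenated, self-attention is applied, only the visual part of the
    output is kept, and it is added back through a zero-initialized gate."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))    # tanh(0) = 0: no effect at init

    def forward(self, visual, grounding):
        x = torch.cat([visual, grounding], dim=1)   # (B, Nv + Ng, D)
        out, _ = self.attn(x, x, x)
        return visual + torch.tanh(self.gate) * out[:, : visual.shape[1]]

layer = GatedSelfAttention(dim=64)
visual, grounding = torch.randn(1, 256, 64), torch.randn(1, 5, 64)
h = layer(visual, grounding)                        # equals `visual` at initialization
```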
ReGround: Improving Textual and Spatial Grounding at No Cost
把GLIGEN改成类似IP-Adapter的并行attention形式,不用重新训练,直接把训练好的GLIGEN改成ReGround的形式,效果也能变好。
InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
定义interaction是一个三元组,分别是主体(subject)、动作(action)和客体(object),三者分别对应一个文本描述和一个bounding box,主体和客体使用同一个MLP,将文本(预训练文本编码)和bounding box(Fourier embedding)转化一个token,动作用另一个MLP也转化为一个token。
如果一张图中有多个interaction,那么不同interaction之间无法区分,所以为每个interaction加一个可训练的embedding,类似positional embedding。同样,一个interaction中三元组之间也无法区分,所以为三者各加一个可训练的embedding,所有interaction公用该embedding。
得到最终的embedding后,类似GLIGEN进行训练。
InstanceDiffusion: Instance-level Control for Image Generation
Adding Conditional Control to Text-to-Image Diffusion Models
为预训练好的StableDiffusion引入类似PDAE的条件模块ControlNet。
ControlNet:固定StableDiffusion,复制StableDiffusion的UNet的encoder和middle block的每个block进行训练,输出与UNet对应的decoder的输出进行加和。zero convolution是所有参数都初始化为0的1x1卷积层,这样在训练前整个trainable copy的输出为0,不影响原网络。
condition一般和原图尺寸一样。由于要和原网络的input相加,所以尺寸必须和原网络的input相同。StableDiffusion的input是降维后latent,所以condition也需要降维,所以就需要额外训练一个encoder对condition进行编码降维。
多个ControlNet可以组合使用。
StableDiffusion一般必须用classifier-free guidance才能生成较好的图像,此时ControlNet可用于both unconditional and conditional prediction,也可只用于conditional prediction。但是如果想不使用prompt进行生成,此时如果将ControlNet用于both,cfg退化,效果不好;如果将ControlNet只用于conditional prediction,会导致guidance太强,解决方案为resolution weighting。
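Returning to the zero convolution mentioned above, a minimal sketch of it and of how the control branch's output is added onto a frozen decoder feature (channel sizes are arbitrary):

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution with all parameters initialized to zero, so the control
    branch contributes nothing before training starts (ControlNet sketch)."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# control features enter and leave the trainable copy through zero convs and
# are added onto the corresponding frozen-decoder features
zc = zero_conv(320)
control_feat = torch.randn(1, 320, 64, 64)
decoder_feat = torch.randn(1, 320, 64, 64)
fused = decoder_feat + zc(control_feat)       # equals decoder_feat at initialization
```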
ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models
ControlNet存在information delay的问题,即某个时间步的去噪时,SD encoder不知道control信息,ControlNet encoder不知道generative的信息。
ControlNet-XS让两个encoder之间同步information,一个的feature map过一个可训练的convolution后加在另一个上,反之亦然,这样ControlNet encoder就不需要复制SD encoder了,而是可以使用参数量更少的处理同维度feature map的网络,随机初始化进行训练即可,效果还比ControlNet要好。
CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation
FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection
StableDiffusion + ControlNet
training-free
将多实例输入进行分离,修改cross-attention,每个实例过一次cross-attention,所有实例的输出相加得到最后输出。在UNet feature上进行操作,所以在UNet encoder部分,只融合text信息,在UNet decoder部分,同时融合control信息和text信息。
SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
Relax the visual condition on the areas that are conflicted with text prompts. 如使用deer的depth map生成tiger时,鹿角部分需要舍去。
ControlNet可以使用一个
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
加噪后去噪一步,使用
X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
train a universal compatible adapter so that plugins of the base stable diffusion model (such as ControlNet on SD) can be directly utilized in the upgraded diffusion model (such as SDXL).
训练一个mapper,将base model的decoder的feature映射到upgraded model的decoder的feature维度并加上去,使用upgraded model的diffusion loss训练mapper。注意训练时,upgraded model输入的是empty prompt。
CCM: Real-Time Controllable Visual Content Creation Using Text-to-Image Consistency Models
The way to add new conditional controls to the pre-trained CMs.
ControlNet can be successfully established through the consistency training technique.
CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation
和CCM的目标一样,使用consistency distillation在预训练diffusion model的基础上训练一个类似ControlNet的网络进行快速的条件生成。
类似ControlNet,也是复制一个UNet encoder出来,但并不是skip connect到预训练diffusion model UNet decoder,而是将其每一层输出的feature与预训练diffusion model UNet encoder对应的每一层的输出进行线性插值插值,插值系数也是可学习的,初始化为
Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
类似X-Adapter。Pretrained ControlNet cannot be directly plugged into new backbone models due to the mismatch of feature spaces, and the cost of training ControlNets for new backbones is a big burden for many users.
ControlNeXt: Powerful and Efficient Control for Image and Video Generation
We remove the control branch and replace it with a lightweight convolution module composed solely of multiple ResNet blocks. We integrate the controls into the denoising branch at a single selected middle block by directly adding them to the denoising features after normalization through Cross Normalization.
不再是复制原模型,而是使用一个轻量级的模块处理条件,并且只将结果在原模型的某个中间的block引入。极大的降低了参数量和计算量。
Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion
针对visual controls are misaligned with text prompts的问题,比如prompt中提到了某个object,但visual control中没有对应的edge,这样使用ControlNet生成出的图像会丢失这个object。
这本质上是ControlNet主导了生成的结果,所以提出了一种training-free的方法,根据每个object的edge提取mask,所有mask组合在一起,将ControlNet的feature乘上该mask再加到UNet decoder的feature上,目的是让ControlNet只负责生成有visual controls的objects,our experimental results show that the application of masks to ControlNet features substantially mitigates conflicts between mismatched textual and visual controls, effectively addressing the problem of object missing in generated images.
针对属性不绑定的问题,计算attribute和object的cross-attention map之间的overlap,梯度下降优化
Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
StableDiffusion + ControlNet
自监督训练,对于某张图像,提取salient object的mask,图像乘上mask即为foreground图像,图像乘上mask的补码再对salient object部分进行inpainting得到background图像。分别对foreground和background图像提取depth。
提取foreground和background图像的CLIP image embedding,经过一个网络后concat在text embedding后,在ControlNet的cross-attention层用上mask,让Q和foreground K只在mask区域有值,让Q和background K只在mask区域之外有值。
foreground和background是不对等的,对调它们的输入会生成不同位置关系的图像,所以叫3D depth aware。
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
StableDiffusion
training-free
DDIM Inversion时,UNet decoder第一个self-attention之前的feature(query, key, value)为
利用这一属性,先生成一些target concept的图片,得到
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
为预训练好的StableDiffusion的encoder输出的各分辨率的feature map加上由condition计算出的同尺寸的feature map,只优化T2I-Adapter。
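A toy sketch of such an adapter: a small conv stack produces multi-scale feature maps that are added to the frozen encoder's features at matching resolutions (channel counts and strides here are made up for illustration):

```python
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """T2I-Adapter-style sketch: map the condition image to multi-scale
    feature maps that are simply added to the frozen UNet encoder's feature
    maps at the matching resolutions."""
    def __init__(self, cond_ch=3, chans=(320, 640, 1280)):
        super().__init__()
        self.stem = nn.Conv2d(cond_ch, chans[0], 3, stride=8, padding=1)
        self.downs = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
            for c_in, c_out in zip(chans[:-1], chans[1:]))

    def forward(self, cond):
        feats = [self.stem(cond)]
        for down in self.downs:
            feats.append(down(feats[-1]))
        return feats       # added to the UNet encoder features, scale by scale

adapter = TinyAdapter()
feats = adapter(torch.randn(1, 3, 512, 512))
print([f.shape for f in feats])   # 64x64, 32x32, 16x16 feature maps
```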
Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples
训练时不需要text,且只需要几十到几百个样本。
类似T2I-Adapter,训练一个prompt-free condition encoder,其输出的feature map加在StableDiffusion的encoder输出的各分辨率的feature map上。prompt-free condition encoder从StableDiffusion的encoder复制而来,去掉了cross-attention层,每个尺寸的feature map输入一个额外的zero convolution层。
DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
StableDiffusion
self-attention和cross-attention之间插入可训练的local self-attention和global self-attention进行多模态训练。
Universal Guidance for Diffusion Models
StableDiffusion
forward guidance: Tweedie's formula is used to estimate a clean x_0 from x_t, the estimate is scored by an off-the-shelf guidance network against the given condition, and the gradient of that loss serves as guidance (see the sketch at the end of this entry).
backward guidance:在上述guidance的基础上,使用Decomposed Diffusion Sampling优化一个
采样的每一步都使用resample technique重复多次forward guidance + backward guidance。
Baseline: models such as ControlNet and T2I-Adapter, after being trained separately for different conditions, can be combined via feature interpolation to achieve multi-condition control.
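A hedged sketch of the forward-guidance step described above: estimate x_0 with Tweedie's formula, score it with any off-the-shelf loss defined on clean images, and fold the gradient back into the noise prediction (the exact scaling is an assumption; the backward-guidance and resampling parts are omitted):

```python
import torch

def forward_guidance(xt, t, unet_eps, alpha_bar_t, guidance_fn, scale):
    """Universal-guidance-style forward guidance (sketch): Tweedie estimate of
    x_0 from x_t, an external loss on that estimate, and its gradient folded
    into the predicted noise."""
    xt = xt.detach().requires_grad_(True)
    eps = unet_eps(xt, t)
    x0_hat = (xt - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    loss = guidance_fn(x0_hat)                    # e.g. classifier / CLIP / segmentor loss
    grad = torch.autograd.grad(loss, xt)[0]
    return eps + scale * (1 - alpha_bar_t).sqrt() * grad   # guided noise prediction

# toy stand-ins: unet_eps would be the diffusion UNet, guidance_fn any loss on x_0
unet_eps = lambda x, t: torch.zeros_like(x)
guidance_fn = lambda x0: (x0 ** 2).mean()
xt = torch.randn(1, 4, 64, 64)
eps_guided = forward_guidance(xt, t=500, unet_eps=unet_eps,
                              alpha_bar_t=torch.tensor(0.3),
                              guidance_fn=guidance_fn, scale=1.0)
```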
Composer: Creative and Controllable Image Synthesis with Composable Conditions
用各种预训练网络提取图像的各种结构、语义、特征信息,然后作为条件训练GLIDE。
训练技巧:以0.1的概率丢弃全部conditions,以0.7的概率包含全部conditions,每个condition独立以0.5概率丢弃。
MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models
一个ControlNet接收不同模态输入进行训练,图中的不同task使用的是相同的网络。
将不同模态在每层计算完成后得到的feature进行merge然后skip-connect到UNet decoder,merge后的feature再unmerge为原来的数量输入到下一层。
merge策略:对于每个spatial位置,计算两个feature之间的相关性,如果大于某个预设的阈值,就取两个feature的平均;如果小于阈值,就分别计算它们相对于各自整个feature的标准差,选择标准差较大的那个feature。
baseline是Multi-T2I Adapter和Multi-ControlNet,即每个task单独训练一个T2I Adapter或ControlNet,然后一起使用。
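A sketch of the per-location merge rule described above, where the "larger standard deviation" criterion is approximated by comparing standardized feature magnitudes (an assumption on my part; the paper's exact statistic may differ):

```python
import torch

def maxfusion_merge(f1, f2, threshold=0.8):
    """MaxFusion-style merge (sketch): per spatial location, average the two
    modality features if their cosine similarity is high; otherwise keep the
    feature that deviates more from its own map's statistics.
    f1, f2: (C, H, W) feature maps from the two control branches."""
    cos = torch.nn.functional.cosine_similarity(f1, f2, dim=0)      # (H, W)
    z1 = (f1 - f1.mean()) / (f1.std() + 1e-6)                       # standardize each map
    z2 = (f2 - f2.mean()) / (f2.std() + 1e-6)
    pick_f1 = z1.norm(dim=0) >= z2.norm(dim=0)                      # (H, W)
    chosen = torch.where(pick_f1.unsqueeze(0), f1, f2)
    return torch.where(cos.unsqueeze(0) > threshold, (f1 + f2) / 2, chosen)

merged = maxfusion_merge(torch.randn(320, 64, 64), torch.randn(320, 64, 64))
```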
OmniControlNet: Dual-stage Integration for Conditional Image Generation
先为不同模态分别学习一个pseudo word,例如使用几张depth map images和"use <depth> as feature"利用TI学习"<depth>"的word embedding。
之后使用不同模态训练ControlNet,其中trainable copy的prompt之前加上对应条件的模态的"use <depth> as feature",这样一个ControlNet就可以处理不同模态的条件。
Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation
多个模态的condition融合,输入到一个ControlNet进行训练,实现任意种模态的condition组合生成。
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation
ControlNet
multi-control fusion block是cross-attention,让query token与visual token进行交互(text token不参与,直接输入下一层),visual token要加positional embedding以区分不同spatial control,multi-control alignment block就是self-attention,让query token获取信息。
query token最终的输出送入ControlNet的cross-attention。
训练时随机drop不同spatial control,以让模型适用于不同数量的spatial control。
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation
FaceComposer: A Unified Model for Versatile Facial Content Creation
类似Composer,专做人脸,还支持talking face生成。
Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models
解决prompt中有但是depth map中没有的物体在生成时丢失的问题。
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
Multi-Source Diffusion Models for Simultaneous Music Generation and Separation
多个音乐源拼接在一起进行训练,训练时所有音乐源都使用相同的时间步,噪声不一样。
total generation
partial generation:blended inpainting,配乐。
source separation:将某个要分离出来的音乐源视为所有音乐源的和减去其它音乐源的和。
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
使用预训练编码器将image和text都转换为token,额外训练两个decoder,可以根据token重构image和text。
text-image联合训练,使用U-ViT架构,训练时两者采样不同的时间步和噪声,这样可以做到unconditional(另一个模态一直输入噪声),conditional(另一个模态一直输入条件),joint(同步生成) sampling。
One Diffusion to Generate Them All
类似UniDiffuser。
Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC
Making Multimodal Generation Easier When Diffusion Models Meet LLMS
BiDiffuser:fine-tune UniDiffuser,只进行image-to-text和text-to-image,
将BiDiffuser和LLM联合。
Any-to-Any Generation via Composable Diffusion
目标:generate any combination of output modalities from any combination of input modalities.
We begin with a pretrained text-image paired encoder, i.e., CLIP. We then train audio and video prompt encoders on audio-text and video-text paired datasets using contrastive learning, with text and image encoder weights frozen。这样每个模态就能得到一个encoder,且编码结果共享一个common embedding space。每个模态以编码结果为条件训练一个diffusion model。
上面训练得到的是单模态的diffusion model,只能单对单自生成,还不能多对多生成。使用text-image数据,为text diffusion model和image diffusion model的UNet各自加入新的cross-attention层,训练时只训练这个cross-attention层,cross-attention的方式是为每个模态的noisy latent设计一个independent encoder,将不同模态的noisy latent嵌入到一个common embedding space,attend这个embedding token,除了diffusion loss同时也利用contrastive learning进行训练,这样text和image的noisy latent就可以通过它们的encoder对齐。之后固定住text的encoder和cross-attention weights,用text-audio数据,重复该方法,训练得到audio的encoder和cross-attention weights。之后固定audio的encoder和cross-attention weights,用audio-video数据,重复该方法,训练得到video的encoder和cross-attention weights。这样在cross-attention中,四种模态的noisy latent都被对齐了,之后可以interpolation不同noisy latent的encoder embedding进行joint sampling,即使这种combination可能没训练过。
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
对于多模态数据,利用Codi的multimodal encoder,将其它模态的编码结果(feature sequence)送入LLM进行训练,对输出(feature sequence)进行回归,同时将其输入对应模态的diffusion model计算diffusion loss,两个loss一起训练。
text还是token prediction loss进行训练。
本质还是feature-based而非token-based。
GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
为不同模态语料(如语音、外文等)学习一个编码网络,使编码结果(分布)与现有的StableDiffusion的text encoder的编码结果(分布)对齐。
这样就可以无缝切换,使用训练好的编码网络为StableDiffusion提供cross-attention的kv,做不同模态的生成。
不用fine-tune StableDiffusion,而且fine-tune会导致对之前模态的遗忘。
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
模仿InstructGPT训练可根据instruction进行生成的StableDiffusion。
将不同任务整理成统一形式的task,每个task包含task instruction(如segmentation to image),prompt,visual conditon(segmentation)和target image,训练时使用ControlNet架构,prompt输入StableDiffusion,task instruction和visual condition输入ControlNet,多个task一起训练。可以泛化到zero-shot task和zero-shot task combination(如segmentation + skeleton to image)。
In-Context Learning Unlocked for Diffusion Models
Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts
prompt由一个example pair和一个text构成,example pair由query image(如segmentation、edge map等)和query image对应的real image组成,之后给定一个新的query image,模型需要根据example pair和text生成对齐的图像。
训练好的模型还可以适用于unseen example pair,即In-Context Learning(无需训练的学习框架)。
模型架构和ControlNet一致,只是输入的条件变成了example pair和新的query image的组合。
Context Diffusion: In-Context Aware Image Generation
ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
和PromptDiffusion一样的In-Context Learning,example pair + query image + target image组成一个2
InstructGIE: Towards Generalizable Image Editing
和ImageBrush类似。
Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model
和ImageBrush类似。
ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models
和ImageBrush类似。
HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation
skeleton也用VAE encoder编码,concat在
把
HairDiffusion: Vivid Multi-Colored Hair Editing via Latent Diffusion
From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation
Appearance Encoder的输入不加噪,且每个part image独立输入提供reference feature,输入的text为该part image对应的类别,如face、hair等。
Shared Self-Attention的思想类似GLIGEN,进行self-attention后只保留image feature。如果有part image的mask,attention时只attend unmask部分的pixel。
Decoupled Cross-Attention是IP-Adapter,两个并行的cross-attention layer分别处理text和part image。
HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting
hand depth map + ControlNet
Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation
先生成手再生成body。
HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
以hand params为中介进行生成。
RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance
将畸形的手从原图中割下来,输入RHanDS进行修复,之后再粘贴回原图。
RHanDS的训练包含两个阶段,第一阶段构造数据集(同一个人的两只手作为一对数据)训练保持style,第二阶段使用一个3D模型提取mesh训练根据structure重构。该3D模型也可以根据畸形的手提取出正常手的mesh。
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance
AbHuman数据集:使用StableDiffusion生成human图像,人工标注了异常分数以及异常的区域,之后训练一个打分模型和一个异常目标检测模型。
在AbHuman上fine-tune一下StableDiffusion,不然StableDiffusion无法识别含有异常描述的prompt,之后CFG + score guidance进行生成。
之后的refine是可选项。
Hand1000: Generating Realistic Hands from Text with Only 1,000 Images
MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
MoE的方法组合LoRA参数。
TextDiffuser: Diffusion Models as Text Painters
生成带文字的图片。
先训练一个Transformer生成文字的layout,再训练一个以layout的mask为条件的diffusion model生成图片。
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
训练一个LLM对text rendering进行layout planning,之后训练一个diffusion model根据layout planning进行生成。
CustomText: Customized Textual Image Generation using Diffusion Models
Conditional Text-to-Image Generation with Reference Guidance
GlyphControl: Glyph Conditional Control for Visual Text Generation
自监督训练,使用OCR模型识别带文字图像中的文字,并将其输入ControlNet训练重构原图。
GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently
所有条件输入UNet重新训练。
GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models
ControlNet
How Control Information Influences Multilingual Text Image Generation and Editing?
ControlNet
AnyText: Multilingual Visual Text Generation And Editing
UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model
ControlNet + cross-attention mask constraint
LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions
object-layout control module由GLIGEN实现。
visual-text rendering module由ControlNet实现(在GLIGEN的基础上),类似ControlNet-XS解决information delay问题一样,为了让layout与glyph信息有交互,让skip feature与backbone feature进行cross-attention后再进行skip-connection。
AMOSampler: Enhancing Text Rendering with Overshooting
training-free
使用Text Rendering部分计算cross-attention map,we then average the attention map over different layers and heads and rescale its values between 0 and 1.
ODE Overshooting: 从
根据ODE Overshooting时的步子,对不同image patch加不同大小的噪,让所有image patch回到相同时间步,得到
TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation
training-free.
ARTIST: Improving the Generation of Text-rich Images by Disentanglement
text module:先使用只有text的黑白图片训练一个diffusion model。
visual module:固定text module,使用带text的真实图片训练一个diffusion model,for each intermediate feature from the mid-block and up-block layers of text module, we propose to use a trainable convolutional layer to project the feature and add it element-wisely onto the corresponding intermediate output feature of the visual module.
JoyType: A Robust Design for Multilingual Visual Text Creation
ControlNet
diffusion loss +
Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training
TextMaster: Universal Controllable Text Edit
Towards Visual Text Design Transfer Across Languages
Collage Diffusion
将不同collage拼在一起并保证harmonization(无重叠)。
使用TI将每个collage编码进text embedding,同时修改StableDiffusion的cross-attention,类似MaskDiffusion引入mask信息,一起训练。
生成时为每个collage的pseudo word对应的cross-attenion map引入mask。
Zero-Shot Image Harmonization with Generative Model Prior
Given a composite image, our method can achieve its harmonized result, where the color space of the foreground is aligned with that of the background.
To achieve image harmonization, we can leverage a word whose attention is mainly constrained to the foreground area of the composite image, and replace it with another word that can illustrate the background environment.
DiffHarmony: Latent Diffusion Model Meets Image Harmonization
DiffHarmony++: Enhancing Image Harmonization with Harmony-VAE and Inverse Harmonization Model
RecDiffusion: Rectangling for Image Stitching with Diffusion Models
task:rectangling
PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering
pixel composition:按照mask直接拼接在一起。
correlation diffuser:object的inversion过程中的self-attention layer的KV取代pixel composition的self-attention layer的KV,注意只取代
RCA:限制object对应的cross-attention在mask内,mask之外的响应值赋为负无穷。
每一步latent都要和background的inversion过程中的latent再做pixel composition,以保持背景。
FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior
利用DDS优化图像进行image composition。
对于object removal:
对于image composition:带T2I-Adapter的DDS,
对于image harmonization:
Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation
TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition
将reference image注入到main image中,并且符合为main image的风格。
使用exceptional inversion将两个image编码到噪声,然后将reference image的编码噪声resize并注入到main image的编码噪声中,再生成。
TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization
Composite Diffusion
scaffolding stage: 根据condition生成到某一中间步,只有大致的结构。
harmonization:text-guided generation or blended(若有segmentation condition)
Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis
vision guidance: 给
MagicFace: Training-free Universal-Style Human Image Customized Synthesis
图中画错了,应该是
利用cross-attention和self-attention估计出每个concept的mask。
RSA: self-attention时concat上所有concept的K和V,计算self-attention map时乘上一个mask(也是concat在一起),抬高不同concept对应区域的权重。
RBA: 每个concept单独计算出一个self-attention map,只留下mask区域内的。
Make-A-Storyboard: A General Framework for Storyboard with Disentangled and Merged Control
使用TI分别学习concept和scene,如果直接用concept+scene造句,生成效果不佳。可以先用concept生成,然后提取mask。然后分别用concept和scene进行生成,到某一步
AnyScene: Customized Image Synthesis with Composited Foreground
Foreground Injection Module是ControlNet架构自监督训练。
Generative Photomontage
Note that we inject the initially generated self-attention features for all images except for
MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
通用框架,类似survey。
除了提供text,还需要指定需要编辑的区域,编辑时使用text-guided inpainting方法,保持unmask部分不变,参考Inpainting部分。
Guided Image Synthesis via Initial Image Editing in Diffusion Model
对生成图像不满意的地方,可以修改初始噪声中对应区域(如重新采样该区域的噪声),从而改变该区域的生成结果,其余区域基本保持不变。
Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion
Image Inpainting Models are Effective Tools for Instruction-guided Image Editing
Grounded-SAM获取mask后使用inpainting方法进行编辑。
MagicQuill: An Intelligent Interactive Image Editing System
难点在于如何保持图像除编辑外的背景和其它内容与原图一致。
DDIM Inversion + Conditional Generation
Text-Guided SDEdit
LASPA: Latent Spatial Alignment for Fast Training-free Single Image Editing
为了保持原图的细节,最直接的做法就是将原图注入生成过程中,SDEdit相当于只是单步注入,LASPA在每一步都注入,使用最简单的插值法。
Text Guided Image Editing with Automatic Concept Locating and Forgetting
Text-Guided SDEdit
Text-Guided SDEdit方法会使编辑生成的concept受限于原图,如shape等,因此使用语法分析器分析出要忘记的concept
Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations
Rectified Flows (RFs) offer a promising alternative to diffusion models, yet their inversion has been underexplored. We propose RF inversion using dynamic optimal control derived via a linear quadratic regulator.
Taming Rectified Flow for Inversion and Editing
RF-Solver not only significantly enhances the accuracy of inversion and reconstruction, but also improves performance on fundamental tasks such as T2I generation.
用RF-Solver进行inversion后进行类似P2P的编辑。
Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing
RF inversion
Prompt-to-Prompt Image Editing with Cross Attention Control
Imagen
text2img模型生成的图片的结构主要由随机种子和cross-attention决定,通过保持随机种子不变(使用DDIM时就是控制起始噪声不变),操控cross-attention可以实现内容保持。
此方法并不是对已有图片做编辑,而是从高斯噪声开始的,并行地生成两张图,一张根据source prompt生成,一张根据target prompt生成(程序运行前并不知道原图是什么样),相当于两条并行的使用source prompt的reconstruction generative trajectory和使用target prompt的editing generative trajectory,前者为后者提供cross-attention map用于修改自身的cross-attention map以达到编辑的效果。
对Imagen做操纵:Imagen的text条件既通过cross-attention注入,也通过hybrid attention注入(self-attention时把text token拼接到visual token后面),此时
KV都变成了visual token+target prompt token,对新的QK的计算结果即cross-attention map做操纵,主要有三种:word swap:除了被换的词,其它都用原来的cross-attention map;adding a new phrase:旧phrase部分都用原来的cross-attention map;attention re-weighting:给原来的cross-attention map要增强/减弱的词乘常数系数。
上述都是generated image editing方法,如果想做real image editing,需要进行DDIM Inversion。先使用source prompt对原图进行DDIM Inversion加噪,从得到的噪声开始,并行运行reconstruction和editing两条generative trajectory进行编辑。
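A toy sketch (hypothetical code, not the official implementation) of two of the cross-attention operations above, operating on attention maps of shape (batch, heads, image_tokens, text_tokens).
```python
import torch

def reweight_attention(attn: torch.Tensor, token_idx: int, scale: float) -> torch.Tensor:
    """Attention re-weighting: scale the map of one word."""
    out = attn.clone()
    out[..., token_idx] = out[..., token_idx] * scale
    return out

def word_swap(attn_target: torch.Tensor, attn_source: torch.Tensor,
              swapped_idx: int) -> torch.Tensor:
    """Word swap: keep the source cross-attention maps for every token except the swapped word."""
    out = attn_source.clone()
    out[..., swapped_idx] = attn_target[..., swapped_idx]
    return out

# During sampling, the editing trajectory replaces its cross-attention maps with the
# manipulated ones for the first N steps, then runs freely for the remaining steps.
```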
Null-text Inversion for Editing Real Images using Guided Diffusion Models
StableDiffusion
解决P2P做real image editing时的矛盾:使用较大的CFG scale会放大DDIM Inversion的误差导致重构失真,而使用较小的CFG scale又会削弱编辑能力。
先使用CFG scale=1对原图做DDIM Inversion得到一条latent轨迹,然后以较大的CFG scale做重构,每一步优化null-text embedding,使重构轨迹贴近inversion轨迹。
editing时,从inversion得到的噪声开始,使用优化好的null-text embedding和较大的CFG scale,配合P2P进行编辑。
Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models
StableDiffusion
不需要source prompt,所以DDIM Inversion时使用unconditional(空prompt)。
类似Null-text Inversion,先使用unconditional DDIM Inversion得到latent轨迹,再在重构的每一步优化一个可学习的conditional embedding,使重构轨迹贴近inversion轨迹。
editing是非P2P模式时:从inversion得到的噪声开始,使用优化得到的embedding与target prompt embedding的线性插值进行生成,插值系数权衡重构保真与编辑强度。
InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models
需要先perform manual pixel-level editing using techniques such as brush strokes, image pasting, or selective edits得到大概的编辑图,再进行refine。这与NTI这种从原图开始编辑的方法不同。
refine的过程:使用edit prompt对大概的编辑图进行DDIM Inversion到某一中间步后再CFG生成,类似purification。
DDIM Inversion过程中,根据
BARET: Balanced Attention based Real image Editing driven by Target-text Inversion
类似Prompt Tuning Inversion,但以target prompt embedding初始化
Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models
不用像Null-text Inversion优化
这样DDIM Inversion和reconstruction时,无论
editing时使用source prompt作为negative prompt。
ProxEdit: Improving Tuning-Free Real Image Editing with Proximal Guidance
improved NPI
StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing
类似NTI,先使用
除了使用
editing时,从
FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing
第一阶段:对DDIM Inversion得到的
第二阶段:对第一阶段得到的图像进行SDEdit,生成过程中注入reconstruction generative trajectory的self-attention的KV,与原图的特征对齐。
Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code
P2P的reconstruction generative trajectory每一步都做修正,使
training-free,不需要任何优化。
SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing
改进P2P,disentangle the guidance scale for the source and target branches to reduce the error.
Schedule Your Edit: A Simple yet Effective Diffusion Noise Schedule for Image Editing
We introduce the Logistic Schedule, a novel noise schedule designed to eliminate singularities, improve inversion stability, and provide a better noise space for image editing. This schedule reduces noise prediction errors, enabling more faithful editing that preserves the original content of the source image.
Inversion-Free Image Editing with Natural Language
DDIM选取
利用这一点就不需要对原图DDIM Inversion就可以进行编辑。
IterInv: Iterative Inversion for Pixel-Level T2I Models
针对含有super-resolution stage的inversion。
KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing
The contents (texture and identity) are mainly controled in the self-attention layer, we choose to learn the K and V embeddings in the self-attention layer.
先使用
editing时,从
EDICT: Exact Diffusion Inversion via Coupled Transformations
StableDiffusion
非P2P模式,直接用source prompt进行DDIM Inversion,然后用target prompt生成,都使用较大的CFG scale。
利用Flow-based Generative Models中的Affine Coupling Layer的思想,设计了可逆的denoising过程,确保使用较大的CFG scale时也能精确重构。
Exact Diffusion Inversion via Bi-directional Integration Approximation
BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models
Effective Real Image Editing with Accelerated Iterative Diffusion Inversion
DDIM的生成过程是确定性映射,其精确的inversion需要求解一个隐式方程,本论文用不动点迭代(配合加速技巧)求解,得到更精确的DDIM Inversion。
(这一条与本论文提出的算法无关)本论文发现,P2P中使用非对称的CFG scale(inversion和reconstruction用较小的,editing用较大的)效果更好。
本论文使用P2P算法,在DDIM Inversion时使用上述不动点算法,DDIM Inversion和reconstruction generative trajectory都使用较小的CFG scale,editing generative trajectory使用较大的CFG scale。
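A minimal sketch of fixed-point DDIM inversion as described above (hypothetical epsilon-model interface, no acceleration): each inversion step solves the implicit equation by iterating the update a few times instead of the usual one-shot linearization.
```python
import torch

@torch.no_grad()
def ddim_invert_step(eps_model, x_t, t, t_next, alphas_cumprod, n_iter: int = 5):
    """One inversion step x_t -> x_{t_next} (t_next > t) via fixed-point iteration.
    eps_model(x, t) is assumed to return the predicted noise; sketch only."""
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    x_next = x_t                                   # initial guess
    for _ in range(n_iter):
        eps = eps_model(x_next, t_next)            # evaluate at the current estimate
        x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x_next = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps
    return x_next
```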
Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models
类似AIDI,也是求不动点。DDIM生成过程的逆过程同样写成隐式方程,用不动点迭代求解。
可以用于多种编辑方法,如P2P,MasaCtrl,PNP,ELITE
Fixed-Point Inversion for Text-to-Image Diffusion Models
不动点。
Exploring Fixed Point in Image Editing Theoretical Support and Convergence Optimization
不动点。
AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing
基于P2P的soft editing。
Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
P2P是替换cross-attention map,但是需要找到real image的prompt,虽然可行但效果不好。本文发现替换self-attention map也是可以的。
real image editing时,DDIM Inversion不需要prompt,reconstruction也不需要prompt。
FateZero: Fusing Attentions for Zero-shot Text-based Video Editing
StableDiffusion
P2P是将generative trajectory的cross-attention map注入到editing trajectory里,本论文直接将DDIM Inversion时的attention map注入到editing trajectory,此时就不需要generative trajectory了。这样做重构的效果也很好。
都使用较大的
Addressing Attribute Leakages in Diffusion-based Image Editing without Training
解决editing attribute leakage的问题,其它物体受被编辑物体的影响也被改变了。
每张图有
ORE: 编码base target prompt得到text embedding
RGB-CAM: 分别使用base target prompt和
BB: 根据background mask对reconstrution generative trajectory和editing generative trajectory的latent进行blend(加权和)。
MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
Tuning-Free Inversion-Enhanced Control for Consistent Image Editing
StableDiffusion
先用source prompt对原图进行DDIM Inversion加噪,从得到的
只修改UNet decoder的后几层的self-attention:the Query features in the shallow layers of U-Net (e.g., encoder part) cannot obtain clear layout and structure corresponding to the modified prompt。
只在中间的几步进行操作:performing self-attention control in the early steps can disrupt the layout formation of the target image. In the premature step, the target image layout has not yet been formed.
同时,每一步,两条generative trajectory都使用阈值法根据cross-attention map计算一个object的mask,限制editing generative trajectory的object区域的self-attention只参考reconstruction generative trajectory的object区域的信息。
相比于P2P只操控cross-attention,MasaCtrl只操控self-attention,操控cross-attention适合做物体增删,操控self-attention适合做动作改变。
DiT4Edit: Diffusion Transformer for Image Editing
DiT版本的MasaCtrl。
Multi-Region Text-Driven Manipulation of Diffusion Imagery
MultiDiffusion版本的P2P,对不同region进行编辑。
Localizing Object-level Shape Variations with Text-to-Image Diffusion Models
StableDiffusion
对图像中某个物体做变换,而其它部分不改变,如将篮子变成盘子。两条并行的generative trajectory,在某个时间段内将句子中的单词替换。
shape preservation:在cross-attention map上使用阈值法标定出某个需要shape preservation的word对应的object的位置,然后在之前的self-attention map中,将该object所有的pixel对应的self-attention map的行和列注入到新的generative trajectory上。也可以将要编辑的object标定出来,然后把标定之外的pixel当做背景,对这些pixel做shape preservation。
使用Null-text Inversion可以做real image editing。
Vision-guided and Mask-enhanced Adaptive Denoising for Prompt-based Image Editing
Delta Denoising Score
将图像本身看成参数,就可以利用SDS进行编辑(输入target prompt,梯度更新图像),但这样会导致图像模糊,如图中上半部分。
导致这种情况的原因是SDS loss中含有偏离项,因此将SDS loss分为两项,一项是用于编辑的,一项是使得图像变模糊的偏离项。提出DDS loss,即用原图+source prompt的噪声预测作为偏离项的估计,从编辑图+target prompt的噪声预测中减去:grad = eps(z_t_edit, y_target) − eps(z_t_src, y_source)。
DDS对于每个编辑需求都要进行反向传播更新,比较消耗计算资源,进一步可以通过DDS训练一个编辑模型,如图所示。
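A sketch of one DDS update on the edited latent (hypothetical model interface): the same timestep and noise are applied to both latents, and the gradient is the difference of the two noise predictions.
```python
import torch

def dds_step(eps_model, z_edit, z_src, emb_tgt, emb_src, alphas_cumprod, lr=0.1):
    """One Delta Denoising Score update on z_edit (sketch only)."""
    t = torch.randint(50, 950, (1,), device=z_edit.device)      # sample a timestep
    noise = torch.randn_like(z_edit)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_edit_t = a_t.sqrt() * z_edit + (1 - a_t).sqrt() * noise   # noise both latents
    z_src_t = a_t.sqrt() * z_src + (1 - a_t).sqrt() * noise     # with the same noise
    with torch.no_grad():
        eps_tgt = eps_model(z_edit_t, t, emb_tgt)
        eps_src = eps_model(z_src_t, t, emb_src)
    grad = eps_tgt - eps_src        # SDS(target) minus the bias term from the source
    return (z_edit - lr * grad).detach()
```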
DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation
将DDS的随机采样时间步和噪声改为在DDIM采样过程中进行score distillation,不需要提供原图的prompt也能进行DDS编辑。
Specifically, in contrast to the original DDS method that adds newly sampled Gaussian noise to
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
确保diffusion model的latent space smoothness,smooth latent spaces ensure that a perturbation on an input latent (
做法是训练时加一个正则项Step-wise Variation Regularization。
对
InstructPix2Pix: Learning to Follow Image Editing Instructions
利用GPT3,StableDiffusion,P2P(generated image editing)创建一个数据集,每条数据包含原图,原图描述,目标描述和目标图片,训练一个新的StableDiffusion,以原图和目标描述为条件,建模目标图片,这样在推理时就不需要原图描述了。
Emu Edit: Precise Image Editing via Recognition and Generation Tasks
类似IP2P,创建新数据集进行训练。
像Emu一样,训练完后使用少量高质量数据进行fine-tune。
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
A large-scale (~4M editing samples), automatically generated dataset for instruction-based image editing.
SeedEdit: Align Image Re-generation to Image Editing
自举式迭代训练。
Instruction-based Image Manipulation by Watching How Things Move
类似AnyDoor标注instruction-based editing dataset。
将source image concat到noisy latent上作为条件进行训练。
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
与IP2P使用P2P构造数据集不同,PbI使用PbE的思想构造数据集。
editing model和IP2P一样。
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
使用分割数据集+inpainting模型造数据。
训练时根据inpainting结果和text去预测原图,除了diffusion loss,还训练一个小模型预测mask,对最终结果进行blend。
Referring Image Editing: Object-level Image Editing via Referring Expressions
比general image editing更加精细
利用现有的image composition model、region-based image editing model、image inpainting model构造数据集进行训练。
编辑模型是一个conditional diffusion model,source image和referring expression作为条件送入cross-attention进行训练。
EditWorld: Simulating World Dynamics for Instruction-Following Image Editing
使用GPT生成input text,instruction和output text,使用SDXL根据input text生成
editing model与IP2P一样。
LIME: Localized Image Editing via Attention Regularization in Diffusion Models
预训练好的InstructPix2Pix。
提取原图的UNet features,resize,concat,normalization,聚类,得到segmentation。
提取目标描述中related token的cross-attention map,算出响应值最高的几个点,这几个点所在的segment拼在一起,即为RoI区域。
在IP2P生成时做blended editing,同时利用RoI修改cross-attention map,对于unrelated token的cross-attention map,RoI区域内的都减去一个较大的常数值,避免unrelated token对编辑造成影响。
Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
预训练好的InstructPix2Pix。
从instruction中提取关键词,使用该关键词对应的cross-attention map,多次进行平方+norm的操作拉开高低值之间的差距,使用阈值法估算出一个mask。
对instruction的所有token的cross-attention map,mask区域内的响应值做增强,mask区域外的响应值使用
采样时,对
Watch Your Steps: Local Image and Scene Editing by Text Instructions
预训练好的InstructPix2Pix。
类似DiffEdit,在编辑之前先计算一个mask,在InstructPix2Pix生成时做blended editing。
ZONE: Zero-Shot Instruction-Guided Local Editing
预训练好的InstructPix2Pix。
description-guided model类似StableDiffusion的cross-attention map是token-wise的,instruction-guided model类似InstructPix2Pix的cross-attention map是consistent的。所以在InstructPix2Pix的cross-attention map上利用阈值法估计出一个mask。但这个mask过于粗糙,所以将InstructPix2Pix的编辑结果送入SAM,利用IoU选出重叠最大的segment作为mask。得到mask后,用原图的mask之外的部分替换InstructPix2Pix的编辑结果的mask之外的部分,再利用一些平滑操作去除artifact。
Visual Instruction Inversion: Image Editing via Visual Prompting
基于IP2P做Visual Instruction的Textual Inversion。
IP2P的输入是原图和instruction,输出是编辑后的图像。现给定一对原图和编辑后的图像的示例,在IP2P上利用TI的思想学习一个instruction的embedding,之后就可以把这个学到的instruction embedding用在其它图像上,实现与示例类似的编辑效果。
E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance
DDIM Inversion使用
Queries for structure and layout, whereas keys and values for textures and appearance. 对于保持layout的编辑,选择替换Q,此时就不需要下面的优化;对于需要编辑layout的编辑,选择替换KV,此时需要下面的优化。
类似DiffusionCLIP,两个loss优化Q的projection matrix,一个是CLIP direction loss
Imagic: Text-Based Real Image Editing with Diffusion Models
Imagen
只给原图和target prompt
先以target prompt embedding为起点,使用TI优化出一个source prompt embedding,之后fix source prompt embedding,fine-tune Imagen,之后使用source prompt embedding和target prompt embedding线性插值进行生成。
不fine-tune Imagen做不到图像保持,类似DragDiffusion,所以fine-tune很重要。
FastEdit: Fast Text-Guided Single-Image Editing via Semantic-Aware Diffusion Fine-Tuning
改进优化Imagic,使得每次编辑的速度提升20多倍。
优化一:不使用text-to-image模型,而是使用image-to-image模型,其可以根据CLIP image embedding生成图像,这样就不需要TI优化source prompt embedding了,原图的CLIP image embedding就可以作为source embedding,这里利用了CLIP embedding的对齐性质。
优化二:使用原图的CLIP image embedding对diffusion model进行fine-tune重构原图,这里根据原图的CLIP image embedding和target prompt的CLIP text embedding的差异度选择fine-tune的时间步范围,减少fine-tune次数。
优化三:使用LoRA fine-tune,减少fine-tune参数量。
fine-tune结束后,类似Imagic,可以使用原图的CLIP image embedding和target prompt的CLIP text embedding的插值进行编辑生成。
Forgedit: Text Guided Image Editing via Learning and Forgetting
setting与Imagic相同,做法稍有差异。
vision language joint learning:使用BLIP为原图生成source prompt,将source prompt输入CLIP得到source prompt embedding,再使用该embedding和原图一起fine-tune Imagen,这里embedding也参与优化。fine-tune Imagen时只更新一部分参数,并且发现The encoder of UNets learns the pose, angle and overall layout of the image. The decoder learns the appearance and textures instead.所以可以forget参数:If the target prompt tends to edit the pose and layout, we choose to forget parameters of encoder. If the target prompt aims to edit the appearance, the parameters of decoder should be forgotten.
生成时,计算target prompt embedding与优化得到的source prompt embedding正交的部分作为editing embedding,使用优化得到的source prompt embedding与editing embedding的线性组合进行生成,目的是为了保持原图细节。
On Manipulating Scene Text in the Wild with Diffusion Models
和Imagic顺序相反,因为这里提供了source prompt。
先fine-tune diffusion model,再使用预训练好的text recognition model的交叉熵loss优化target prompt embedding。
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
StableDiffusion
使用unconditional DDIM Inversion(输入空prompt)对原图加噪,之后并行运行一条reconstruction generative trajectory和一条使用target prompt的editing generative trajectory。
feature injection:和MasaCtrl得出一样的结论,UNet深层的feature有更好的structure信息。使用reconstruction generative trajectory的UNet较深层的feature map替换editing generative trajectory的。但这样虽然很好了的保留了原图的structure信息,但也有一些纹理信息泄露到了生成图像中。
self-attention map injection:使用reconstruction generative trajectory的self-attention map(由Q、K计算得到)替换editing generative trajectory的,在保留structure的同时避免feature injection带来的纹理泄露。
Diffusion Self-Guidance for Controllable Image Generation
用cross-attention map或者UNet feature map计算loss并求梯度作为guidance,实现物体移动、改变大小、改变外观等编辑功能。
position:object对应的word对应的cross-attention map的质心位置。
shape:对object对应的word对应的cross-attention map使用阈值法得到一个二值mask。
apperance:使用上述mask乘上UNet feature map后求均值。
编辑时两条trajectory,一条generative或者reconstruction trajectory,一条editing trajectory,计算所有不想改变的物体对应的word对应的shape和apperance之间的MSE loss,再根据编辑需求计算loss,求梯度指导editing trajectory生成。
物体移动:计算某个object对应的word对应的position与期望位置之间的MSE loss。
改变大小:计算某个object对应的word对应的shape与期望的shape之间的MSE loss。
改变外观:计算某个object对应的word对应的appearance与期望apperance之间的MSE loss。
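A sketch of the three properties above, computed from one word's cross-attention map A of shape (H, W) and a UNet feature map F of shape (C, H, W) (hypothetical tensors).
```python
import torch

def centroid(attn: torch.Tensor) -> torch.Tensor:
    """Position: attention-weighted centroid (y, x) of a (H, W) map."""
    h, w = attn.shape
    ys = torch.arange(h, dtype=attn.dtype).view(h, 1)
    xs = torch.arange(w, dtype=attn.dtype).view(1, w)
    total = attn.sum() + 1e-8
    return torch.stack([(attn * ys).sum() / total, (attn * xs).sum() / total])

def shape_mask(attn: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    """Shape: binary mask by thresholding the normalized attention map."""
    a = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    return (a > thresh).float()

def appearance(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Appearance: masked spatial mean of the (C, H, W) feature map."""
    return (feat * mask).sum(dim=(1, 2)) / (mask.sum() + 1e-8)

# e.g. moving an object penalizes ||centroid(attn_word) - target_position||^2, and the
# gradient of that loss w.r.t. the latent is used as guidance at every sampling step.
```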
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
使用DDIM Inversion过程中的latent和editing过程中的latent输入UNet,计算两者self-attention map和feature之间的MSE loss,求梯度进行guidance。
注意计算loss时输入的都是source prompt,目的是保持layout一致。
Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance
结合了MCG和DDS的guidance方法,使用任意loss指导采样。
Diffusion Models Already Have a Semantic Latent Space
训练时先对数据集图像使用
Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models
虽然还是在h-space做,但采样不再是非对称的了,还是用原DDIM的公式,过UNet时修改
unsupervised global: 生成一些样本,保存所有时间步的
unsupervised image-specific: 比如一个睁眼闭眼的编辑方向,对某张带着墨镜的人脸是没有意义的。使用类似h-space微分几何的方法,在h-space中找到能使
supervised: 使用标注的数据对,每对数据中正例含有某个属性,负例不含该属性,每对正例的
ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation
CLIP Direcitonal Loss为Diff-AE的
Zero-Shot Inversion Process for Image Attribute Editing with Diffusion Models
Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models
把StyleGAN已经学到的interpretable direction迁移到StbaleDiffusion上,使用两个loss学习一个CLIP text embedding
NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models
identify interpretable directions in text embedding space of text-to-image diffusion models
In noisy space, for edits carried out by the same direction to be attracted towards each other, while edits conducted by different directions to repel one another, in line with the core principles of contrastive learning.
Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
StableDiffusion
训练好之后还可用于image editing,只能用在符合
SINE: SINgle Image Editing with Text-to-Image Diffusion Models
StableDiffusion
类似DreamBooth,用原图和带有pseudo word的prompt,fine tune pseudo word embedding和StableDiffusion,每编辑一张图就要fine-tune一次模型。
提出Patch-Based Fine-Tuning,假设StableDiffusion LDM尺寸为
编辑时使用model-based classifier-free guidance,把fine-tuned模型看作专门生成这个single image的unconditional模型。
不需要DDIM Inversion。
SEGA: Instructing Diffusion using Semantic Dimensions
CFG的线性组合
DiffEdit: Diffusion-based Semantic Image Editing with Mask Guidance
自动计算mask的Blended Diffusion。
对于text-to-image模型,在加噪后的图像上分别输入query prompt和reference prompt(或空prompt)预测噪声,两者预测的差异经过多次采样平均并阈值化后即为要编辑区域的mask,之后做blended生成。
理论证明了,使用unconditional DDIM Inversion加噪,比SDE直接一步加噪,重构效果更好。
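A minimal sketch of the mask estimation above (hypothetical epsilon-model interface): contrast the noise predictions under the two prompts, average over several noise draws, then threshold.
```python
import torch

@torch.no_grad()
def diffedit_mask(eps_model, x0, query_emb, ref_emb, alphas_cumprod,
                  t: int = 500, n_samples: int = 10, thresh: float = 0.5):
    """Estimate an edit mask by contrasting noise predictions (sketch only)."""
    a_t = alphas_cumprod[t]
    diffs = []
    for _ in range(n_samples):
        noise = torch.randn_like(x0)
        x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise
        d = eps_model(x_t, t, query_emb) - eps_model(x_t, t, ref_emb)
        diffs.append(d.abs().mean(dim=1, keepdim=True))   # average over channels
    diff = torch.stack(diffs).mean(dim=0)                 # average over noise samples
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    return (diff > thresh).float()                        # binary mask for blending
```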
LIPE: Learning Personalized Identity Prior for Non-rigid Image Editing
有待编辑物体的reference image的编辑任务。
先利用TI技术学习一个待编辑物体的pseudo word,利用pseudo word简单造句得到identity-aware prompt,之后使用source prompt对原图进行DDIM Inversion,记录所有latent,再使用source prompt和identity-aware prompt进行重构,利用重构最后一步时pseudo word的cross-attention map估算出一个mask,从inversion得到的噪声开始使用target prompt和identity-aware prompt进行编辑生成,编辑的每一步,根据pseudo word的cross-attention map估算出一个mask,取两个mask的并,根据这个mask进行latent的blend,mask区域内取编辑生成的latent,mask区域外取inversion时的latent,以保持背景不变。
DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images
自动计算mask,转换为inpainting问题。
FISEdit: Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference
类似DiffEdit自动计算mask:利用P2P的方法操作cross-attention map,使用两个generative trajectory输出的feature map计算出difference mask记为要编辑的区域。
Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks
类似DiffEdit自动计算mask:利用target prompt的start token对应的cross-attention map具有全局语义信息的性质,计算其余token的cross-attention map与其的相似度,使用最相似的那个token的cross-attention map,处理后估计一个mask。
Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
Unsupervised Representation Learning from Pre-trained Probabilistic Diffusion Models
训练自编码器,在隐空间训练线性分类器,利用属性超平面的法向量作为编辑方向。
Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models
DisControlFace: Disentangled Control for Personalized Facial Image Editing
使用预训练的Diff-AE,额外训练一个ControlNet引入控制信息,但这样训练有一个问题,the pre-trained Diff-AE backbone can already allow near-exact image reconstruction, only limited gradients can be generated during error back propagation, which are far from sufficient to effectively train ControlNet。所以引入masked-autoencoding的思想,训练时使用masked
采样时,先估计出原图的控制信息,然后可以对控制信息进行编辑,再生成,同时使用
User-friendly Image Editing with Minimal Text Input: Leveraging Captioning and Injection Techniques
传统的编辑方法类似P2P需要用户提供source prompt和target prompt,本论文使用现有的caption模型为原图生成source prompt,用户只需要指出需要修改source prompt中哪些concept即可。
HIVE: Harnessing Human Feedback for Instructional Visual Editing
训练一个StableDiffusion,以原图和target prompt为条件,对目标图像进行去噪。
引入human feedback,使用learned reward function fine-tune上述StableDiffusion。
DialogPaint: A Dialog-based Image Editing Model
StableDiffusion
multi-turn editing
Iterative Multi-granular Image Editing using Diffusion Models
StableDiffusion
multi-turn editing,在StableDiffusion的latent space上进行多轮编辑。
GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
利用GPT-4调度使用各种编辑方法进行编辑。
GPT-4是只能生成text的MLLM,所以只能帮助做plan,无法直接根据需求生成图像。
Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing
针对复杂编辑要求的DDS方法。
使用GPT-4V分解编辑需求和编辑区域,得到原图的prompt序列
GPT-4V是只能生成text的MLLM,所以只能帮助做plan,无法直接根据需求生成图像。
Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing
利用LLM将ambiguous instruction改写为多个specific instructions,利用IP2P模型组合多个instructions进行编辑。
Image Translation as Diffusion Visual Programmers
CFG的strength很敏感,很小的改动会导致生成图像很大的不同,每张图都去要调整strength,不实用。受style transfer的instance normalization的启发,提出Instance Normalization Guidance:
TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing
构造CoT数据fine-tune MLLM。
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing
fine-tune VLM,根据reference image和text instruction,generates much more precise editing instructions.
Guiding Instruction-based Image Editing via Multimodal Large Language Models
使用InstructPix2Pix的数据集,让MLLM根据图像和old instruction生成new instruction,给new instruction后加一些可训练的[IMG] token。
将old instruction、原图和new instruction输入LLaVA,训练生成new instruction的text部分,同时将[IMG]部分的feature作为editing command,和原图一起输入一个diffusion model,生成目标图像,所有可训练模块一起训练。
LLaVA是只能生成text的MLLM,无法直接根据需求生成图像,这里借助了MLLM的编码能力,为其feature训练一个diffusion decoder。
Customization Assistant for Text-to-image Generation
和MGIE类似。
EmoEdit: Evoking Emotions through Image Manipulation
根据emotion生成instruction,使用预训练IP2P进行编辑。
ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing
We introduce ReasonPix2Pix, a dataset specifically tailored for instruction-based image editing with a focus on reasoning capabilities. 构造数据集时生成具有联想能力的instruction,比如使用the owner of the castle is a vampire代替make the castles dark.
原图和instruction输入MLLM,使用MLLM输出的feature和原图作为条件fine-tune StableDiffusion,生成目标图像。
Pathways on the Image Manifold: Image Editing via Video Generation
可以看成image-guided inpainting,参考Inpainting部分的text-guided inpainting,只是将条件从text换成了image。
ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models
Latent Variable Refinement: match the low-pass filter feature of noisy latents to that of reference image
Filter-Guided Diffusion for Controllable Image Generation
类似ILVR,设计filter让生成样本与reference image在特定属性上一致。
Paint by Example: Exemplar-based Image Editing with Diffusion Models
StableDiffusion
输入原图,mask,reference image,输出原图mask部分被reference image取代并融合的图片。整体架构和text-guided image inpainting类似,将reference image看成text,作为condition输入到StableDiffusion中,masked image、mask和noisy latent concat在一起作为UNet的输入。
self-supervised learning:使用带有bounding box的图像数据集进行自监督训练,即将bounding box内区域作为mask,bounding box内图片作为参考图片。这样训练时模型很容易过拟合,模型只学到学到一个简单的复制粘贴,提出两个解决方案:Information Bottleneck:因为我们需要将参考图片移植到原图mask区域,模型很容易去记忆图片空间信息而不是去理解上下文信息,所以我们将参考图片压缩,提高重构难度,即将其剪切并使用CLIP image encoder编码,结果作为StableDiffusion的KV进行cross-attention。Strong Augmentation:自己造的数据集存在domain gap between train-test,因为训练集中的参考图片本来就是原图切下来的,而测试集中基本都是无关的,所以我们对训练集中的参考图片进行数据增强(翻转、旋转、模糊等),又由于bounding box都是紧贴物体的,不利于模型泛化,所以对mask区域也进行数据增强,先用Bessel曲线拟合bounding box,再在曲线上均匀采样20个点,随机延伸1~5个像素点。
类似inpainting的blended采样。
classifier-free guidance:20%的概率用可一个训练的向量替代CLIP image encoder编码结果,采样时guidance scale可以控制融合程度。
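A sketch of the 20% condition dropout and image-conditioned classifier-free guidance mentioned above (all module names hypothetical).
```python
import torch
import torch.nn as nn

class RefConditioner(nn.Module):
    """Wraps the CLIP image embedding with a learnable 'null' vector (sketch)."""
    def __init__(self, dim: int = 768, p_drop: float = 0.2):
        super().__init__()
        self.null_embed = nn.Parameter(torch.zeros(1, dim))
        self.p_drop = p_drop

    def forward(self, clip_embed: torch.Tensor) -> torch.Tensor:
        if self.training:
            drop = (torch.rand(clip_embed.shape[0], 1,
                               device=clip_embed.device) < self.p_drop).float()
            return drop * self.null_embed + (1 - drop) * clip_embed
        return clip_embed

def guided_eps(eps_model, x_t, t, cond, null_cond, scale: float = 5.0):
    """Classifier-free guidance at sampling time; scale controls how strongly
    the reference image is fused into the masked region."""
    eps_u = eps_model(x_t, t, null_cond)
    return eps_u + scale * (eps_model(x_t, t, cond) - eps_u)
```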
Reference-based Image Composition with Sketch via Structure-aware Diffusion Model
在PbE的基础上,还需要提供mask部分的sketch作为条件(concat),进一步提高可控性。
ControlCom: Controllable Image Composition using Diffusion Model
挖掉图像前景做自监督训练。
一个额外的indicator决定是否改变被挖出来的前景的illumination和pose,indicator也作为条件输入diffusion model进行训练。
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
SAM + inpainting挖掉图像前景做自监督训练。
The model supports flexible prompts
IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation
使用multi-view数据集训练一个image encoder(主体DINOv2 + 小adapter,两者都参与训练),输入一个view的图像生成embedding序列,送入StableDiffusion,重构另一个view的图像。训练image encoder和StableDiffusion的decoder。
固定image encoder的主体部分,重新训练一个diffusion model,自监督训练,image encoder的adapter也参与训练。
DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models
两个条件,一个reference image,一个text,这样不仅可以将reference image填入mask,还能通过text进行控制,比如动作等。
之前的方法使用CLIP编码reference image,缺少了对细节的提取,这里使用预训练扩散模型UNet encoder编码reference image,时间步为0,取
Paste, Inpaint and Harmonize via Denoising Subject-Driven Image Editing with Pre-Trained Diffusion Model
将exemplar去除背景,直接paste在目标区域,作为条件输入ControlNet进行类似PbE的self-supervised learning。
Reference-based Painterly Inpainting via Diffusion Crossing the Wild Reference Domain Gap
在Versatile Diffusion基础加了一个mask branch,reference image(训练时是被mask掉的部分)做context flow,masked image做mask branch,进行self-supervised的inpainting训练。
ObjectStitch: Generative Object Compositing
用的是pre-trained text2img diffusion model,由于给的是object图片而不是text,所以需要一个模块将object图片转换为text embedding,即content adaptor,类似TI:使用训练好的CLIP和大规模image-caption数据训练一个content adaptor,content adaptor将CLIP的image embedding映射到text embedding空间,得到translated embedding,然后让它尽量靠近CLIP的text embedding。训练好之后再用pre-trained text2img diffusion model和textual inversion方法fine-tune content adaptor。
固定content adaptor,fine-tune pre-trained text2img diffusion model。
类似inpainting的blended采样,diffusion model只输入translated embedding。
LogoSticker: Inserting Logos into Diffusion Models for Customized Generation
基于TI。
AnyDoor: Zero-shot Object-level Image Customization
使用DINOv2提取物体的ID tokens,既用了global token(class token),也用了patch tokens。
之前的使用图像自监督训练的方法虽然有数据增强,但还是会导致多样性不足的问题,所以提出使用视频数据集造数据:对同一场景随机采样两帧,提取一帧的物体作为target,另一帧作为目标。
BIFRÖST: 3D-Aware Image compositing with Language Instructions
类似AnyDoor,额外加上了language instruction作为条件。
AnyLogo: Symbiotic Subject-Driven Diffusion System with Gemini Status
Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance
Locate:StableInpainting
Assign:IP-Adapter
第一阶段:StableInpainting+IP-Adapter训练Diffusion UNet。
第二阶段:把第一阶段训练好的Diffusion UNet复制出一个RefineNet,RefineNet UNet decoder的self-attention前的feature送入Diffusion UNet,与对应的feature concat在一起进行self-attention,只训练RefineNet的image cross-attention。
self-supervised learning,训练时subject image是从scene image中挖出来的,使用LLaVA生成subject image的caption作为text。
blended采样。
PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models
使用预训练模型提取图像的segmentation map作为图像的structure特征,再使用一个预训练的图像编码器编码图像,提取浅层feature map,取segmentation map中每个segment对应区域的feature map的spatial pool作为该segment的appearance特征,两者作为条件训练diffusion model。
structure编辑:对分割图进行编辑(比如改变某个object的形状、去掉某个object)
appearance编辑:提供一张reference image,用其全图的或者其中某个object的appearance特征替换某个segment对应的appearance特征,进行生成。
注意,编辑时不需要DDIM Inversion,直接根据条件从噪声开始生成即可。但毕竟structure和appearance不包含图像全部特征,所以未编辑部分会有一些变化。但编辑时可以对未编辑的segment进行mask,类似inpainting的blended采样。
CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models
使用SAM提取出原图中的object和background,估算出object的viewpoints,使用zero-1-to-3生成一个随机viewpoints的novel view object,训练一个diffusion model,novel view object、background和viewpoints作为条件,预测原图。
生成时,可以指定object的角度、在图像中的位置以及背景。
Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
给定一张image和几张reference images,将image中某个object替换为reference image中的concept。
Custom-Diffusion方法提取reference image中的concept到pseudo word。
Prompt2Prompt + Null-text Inversion做real image editing,用pseudo word替换prompt中object对应的word。
DreamEdit: Subject-driven Image Editing
和CustomEdit一样,但是基于mask的,DreamBooth做完TI后,做text-guided inpainting采样(blended)。
DreamCom: Finetuning Text-guided Inpainting Model for Image Composition
self-supervised learning,给定3~5张reference images,每张都有bounding box (mask)标注其中物体,将mask和masked image concat在
生成时,给定背景图和想要object出现的位置的bounding box (mask),使用上述句子进行生成。
SpecRef: A Fast Training-free Baseline of Specific Reference-Condition Real Image Editing
有reference image的P2P。
Tuning-Free Visual Customization via View Iterative Self-Attention Control
CLiC的无pseudo word版本,直接使用self-attention KV注入实现concept替换。
FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction
构造数据集训练。
Thinking Outside the BBox: Unconstrained Generative Object Compositing
准备自监督训练数据,对背景图进行inpainting,训练时有
提取mask时,还提取了object的shadow mask和reflection mask,使得模型在object stitch的同时可以生成影子。
TryOnDiffusion: A Tale of Two UNets
cascade模式
使用Parallel UNet是为了解决channel-wise concatenation效果不行的问题,所以改用cross-attention机制,绿线代表将feature当成KV送入主UNet。
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
使用PbE的StableDiffusion,其cross-attention是接收CLIP image embedding过一个MLP的结果。
CLIP image embedding丢了很多信息,所以在decoder block之间再插入一个zero cross-attention block引入细节。
在text cross-attention里,某个word对应的cross-attention map是这个物体的大致轮廓,但是在zero cross-attention block里是image cross-attention,query里衣服上某个image token对应的cross-attention map应该是key中同样位置的image token,而非整个衣服区域,所以cross-attention map应该是尽量集中于一点的,所以额外使用了一个attention total variation loss, which is designed to enforce the center coordinates on the attention map uniformly distributed, thereby alleviating interference among attention scores located at dispersed positions. 即让query里不同image token对应的cross-attention map差异尽量大。
TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On
MMDiT
MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation
StableDiffusion的cross-attention换为Multi-Modal Attention block,self-attention换为Multi-Reference Attention block。
TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On
Try-On-Adapter: A Simple and Flexible Try-On Paradigm
Product-Level Try-on: Characteristics-preserving Try-on with Realistic Clothes Shading and Wrinkles
类似StableVITON,使用PbE的StableDiffusion,其cross-attention是接收CLIP image embedding过一个MLP的结果,Dynamic Extractor使用CLIP image encoder编码图像,但是之后的MLP是可训练的。
HF-Map输入一个可训练的ControlNet。
StableGarment: Garment-Centric Generation via Stable Diffusion
Improving Virtual Try-On with Garment-focused Diffusion Models
Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
Paint by Example是重新训练整个conditional StableDiffusion,这里改用ControlNet架构。
Improving Diffusion Models for Authentic Virtual Try-on in the Wild
Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment
利用semantic correspondecce,分别将穿着garment的person图像和garment图像输入同一个StableDiffusion,提取feature,计算相似性,可以得到correspondecce作为监督数据,这样生成时可以指定衣服的穿着方式,比如衣角扬起等。
Learning Implicit Features with Flow Infused Attention for Realistic Virtual Try-On
Controllable Human Image Generation with Personalized Multi-Garments
AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On
FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on
使用额外的off-the-shelf clothes flattening network进行监督。
M&M VTO: Multi-Garment Virtual Try-On and Editing
ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model
构造数据自监督训练。
BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training
FaceStudio: Put Your Face Everywhere in Seconds
人脸挖出来,自监督训练。
HS-Diffusion: Semantic-Mixing Diffusion for Head Swapping
换头,预训练模型进行blended inpainting生成。
EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation
Stable-Makeup: When Real-World Makeup Transfer Meets Diffusion Model
使用ChatGPT生成不同makeup style的prompt,使用LEDITS对没有makeup的人脸图像进行编辑,生成带makeup的人脸图像,监督训练。
类似IP-Adapter,将CLIP提取的global token加patch tokens送入cross-attention。
SHMT: Self-supervised Hierarchical Makeup Transfer via Latent Diffusion Models
Stable-Hair: Real-World Hair Transfer via Diffusion Model
造数据:要transfer什么就把什么留下,对其它部分进行inpainting。
可以实现多种task,如text2img generation,personalization,editing等。
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
利用BLIP的方法,先使用大规模image-text数据预训练一个multimodal image encoder,可以从image中提取text-aligned特征。
给定subject image和subject text,输入multimodal image encoder,得到subject image的特征,再训练一个MLP将其转化为text embedding。之后利用subject image构造训练image(如替换背景等)和对应的prompt,将subject image特征转化后的text embedding接在prompt之后,输入text encoder,输出再输入StableDiffuion进行训练。multimodal image encoder、MLP、text encoder和StableDiffuion一起训练。
给定subject image、subject text和prompt就能生成,不需要test-time fine-tune了。
UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion
自监督训练:先用caption模型得到图像的caption,再用Grounding DINO和SAM得到caption中的object的图像,将caption中的object word替换为图像,得到interleaved数据集,输入预训练MLLM进行编码,编码结果(所有token的last hidden layer的输出)送入StableDiffusion重构图像,只训练StableDiffusion。
因为MLLM输入中可能包含image entity,为了让生成结果更好地保持image entity的细节,在StableDiffusion的cross-attention增加
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
MLLM:使用CLIP提取image embedding,use attentive pooling mechanism to reduce the number of image embeddings,用interleaved text image数据进行next-token prediction训练MLLM和CLIP的最后一层,只在text token上算loss,类似Emu2的caption阶段。
AlignerNet:为了直接使用StableDiffusion(不需要训练)进行生成,训练一个AlignerNet,将Kosmos-G的输出转换到CLIP text embedding的domain,训练时只给一个text,分别使用Kosmos-G(所有token的last hidden layer的输出)和CLIP text encoder编码,得到
We can also align MLLM with Kosmos-G through directly using diffusion loss with the help of AlignerNet. While it is more costly and leads to worse performance under the same GPU days.
Generative Multimodal Models are In-Context Learners
caption:使用CLIP提取image embedding,use mean pooling mechanism to reduce the number of image embeddings,用interleaved text image数据进行next-token prediction训练CLIP(注意不是MLLM),只在text token上算loss,该阶段的目的是得到一个image encoder。
caption+regression:固定image encoder,用interleaved text image数据进行next-token prediction训练MLLM,在text token上算分类loss,在image feature上算regression loss。
StableDiffusion:训练StableDiffusion对image encoder的编码结果进行解码。
Generating images with multimodal language models
caption:类似LLaVa,用image-caption数据进行next-token prediction训练一个projection layer,只在text token上算loss。
producing image:给LLM的词表和embedding层加入若干个可学习的[IMG] token。
类似Kosmos-G,训练一个Q-Former将这些[IMG] token的hidden states映射到StableDiffusion的text condition空间,用caption经CLIP text encoder的编码结果做回归训练。
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
Transfusion的多任务版本。
Diffusion Self-Guidance for Controllable Image Generation
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
StableDiffusion
先用待编辑图像LoRA fine-tune StableDiffusion,然后将图像DDIM Inversion到某个中间时间步t,在该步的latent上交替进行motion supervision(优化latent,使handle point邻域的feature向target point方向移动一小步)和point tracking(在邻域内做特征最近邻搜索,更新handle point的位置),优化完成后从该latent继续去噪得到编辑结果,见下面的示例代码。
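A hedged sketch of the two alternating steps (hypothetical tensor shapes; feat is a (C, H, W) UNet feature map, points are (y, x) integer tuples away from the border).
```python
import torch
import torch.nn.functional as F

def motion_supervision(feat, handle, target, radius: int = 3):
    """The feature patch one small step toward the target should match the
    (detached) patch around the current handle point (sketch)."""
    y, x = handle
    d = torch.tensor([target[0] - y, target[1] - x], dtype=torch.float32)
    d = d / (d.norm() + 1e-8)                       # unit step direction
    y2, x2 = int(round(y + d[0].item())), int(round(x + d[1].item()))
    patch = feat[:, y - radius:y + radius + 1, x - radius:x + radius + 1]
    moved = feat[:, y2 - radius:y2 + radius + 1, x2 - radius:x2 + radius + 1]
    return F.l1_loss(moved, patch.detach())

@torch.no_grad()
def point_tracking(feat, feat0, handle, handle0, radius: int = 3):
    """Update the handle point by nearest-neighbour search of the original
    handle feature within a local window (sketch)."""
    ref = feat0[:, handle0[0], handle0[1]]          # original handle feature, (C,)
    y, x = handle
    win = feat[:, y - radius:y + radius + 1, x - radius:x + radius + 1]
    dist = (win - ref.view(-1, 1, 1)).norm(dim=0)   # (2r+1, 2r+1)
    idx = int(torch.argmin(dist))
    dy, dx = idx // dist.shape[1] - radius, idx % dist.shape[1] - radius
    return (y + dy, x + dx)
```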
Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing
和DragDiffusion方法类似,只是在Eq(4)的motion supervision时引入了一个额外的CLIP direction loss,使用文本提高drag编辑的效果。
对
GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models
DragDiffusion在
DRAGTEXT: Rethinking Text Embedding in Point-based Image Editing
DragDiffusion中计算
Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation
和DragDiffusion一样,先用待编辑图像LoRA fine-tune StableDiffusion,然后将图像DDIM Inversion到某个时间步
we observe a forgetting issue where subsequent denoising processes tend to overlook the manipulation effect by simply performing diffusion semantic optimization on one timestep. Propagating the bottleneck feature to later timesteps does not have a significant influence on the overall semantics, we copy this optimized bottleneck feature
AdaptiveDrag: Semantic-Driven Dragging on Diffusion-Based Image Editing
EasyDrag: Efficient Point-based Manipulation on Diffusion Models
不需要LoRA fine-tune,直接将图像DDIM Inversion到某个时间步
motion supervision时,EasyDrag始终使用原图的
reference guidance使用DDIM Inversion时的
StableDrag: Stable Dragging for Point-based Image Editing
在point tracking时,除了使用传统的training-free的差异计算法,还使用一个可训练的track model,其是一个可训练的
在进行long-range drag时,图像内容难免会发生较大变化,point feature也会发生改变,此时让它和原图的starting point feature保持一致就不科学了,not only ensuring high-quality and comprehensive supervision at each step but also allowing for suitable modifications to accommodate the novel content creation for the updated states. 因此根据point tracking的结果计算一个confidence score,当confidence score较大时,就使用上一步的point feature作为监督优化latent,当confidence score较小时,就使用原图的starting point feature作为监督优化latent。
FreeDrag: Feature Dragging for Reliable Point-based Image Editing
feature dragging:之前的方法的point dragging是一片区域内的feature计算point-to-point的损失函数再求和,feature dragging计算一片区域内的feature aggregate
line search with backtracking:we constraint
DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models
StableDiffusion
从DIFT获取灵感,模型输出的feature具有correspondence性质,相同物体对应区域的feature具有很高的相似性。
类似P2P+self-guidance,两条并行的generative trajectory,一条是reconstruction,一条是editing,用各自第2,3层的输出feature(self-guidance是用attention)计算loss(原区域和目标区域的feature的相似度),求梯度作为guidance。
将editing generative trajectory的UNet decoder的self-attention的key-value替换为reconstruction generative trajectory的。
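A sketch of the feature-correspondence guidance above: cosine similarity between the source-region features of the reconstruction branch and the target-region features of the editing branch, turned into an energy whose gradient steers sampling (hypothetical shapes).
```python
import torch
import torch.nn.functional as F

def drag_energy(feat_edit, feat_recon, mask_src, mask_tgt):
    """feat_*: (C, H, W); mask_*: (H, W) binary masks of the source / target
    regions. Higher similarity -> lower energy (sketch)."""
    f_src = (feat_recon * mask_src).sum(dim=(1, 2)) / (mask_src.sum() + 1e-8)
    f_tgt = (feat_edit * mask_tgt).sum(dim=(1, 2)) / (mask_tgt.sum() + 1e-8)
    return 1.0 - F.cosine_similarity(f_src.unsqueeze(0), f_tgt.unsqueeze(0)).squeeze()

# During sampling, z_t of the editing branch requires grad, the UNet features are
# extracted, and z_t is nudged by -grad(drag_energy) before the next denoising step.
```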
DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing
DragonDiffusion的改进版
先使用LAION训练一个image prompt encoder:具体做法是先使用预训练的CLIP image encoder将图像编码为长257的embedding sequence,作为cross-attention的key-value送入一个QFormer,输出长64的embedding sequence,送入StableDiffusion的cross-attention,只训练这个QFormer。在编辑时,在editing generative trajectory上使用原图的image prompt,效果更好。
作者发现如果在DragonDiffusion中使用随机初始化而非DDIM inversion得到的
利用RePaint的resample technique,即从
Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner
editing guidance就是DragonDiffusion的guidance。
Fine-grained Image Editing by Pixel-wise Guidance Using Diffusion Models
与SKG和Late-Constraint类似。
对于分割数据集
对图像的segmentation map进行编辑,同时根据编辑结果计算一个mask,算mask-based方法。
编辑时先DDIM Inversion到某一中间步,再生成,将生成时的UNet feature map输入语义分割模型生成segmentation map,计算其和编辑后的segmentation map之间的loss,求梯度作为guidance。
RegionDrag: Fast Region-Based Image Editing with Diffusion Models
Point-Based Drag有两个缺点,一是语义不明确,只给一个起点和一个终点,合理的但语义不同的编辑结果可能有很多,二是过程过于复杂,所以提出Region-Based Drag。
Readout Guidance: Learning Control from Diffusion Features
The Blessing of Randomness: SDE beats ODE in General Diffusion-based Image Editing
方法就是CycleDiffusion
unified framework
The first stage initially produces an intermediate latent variable
The second stage starts from
We show that the additional noise in the SDE formulation (including both the original SDE and Cycle-SDE) provides a way to reduce the gap caused by mismatched prior distributions (between
操控过的
Drag
When the target point is far from the source point, it is challenging to drag the content in a single operation. To this end, we divide the process of Drag-SDE into
RotationDrag: Point-based Image Editing with Rotated Diffusion Features
the point-based editing method under rotation scenario
Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators
使用a differentiable off-the-shelf optical flow estimator,估算diffusion model每一步预测出的clean image与原图之间的光流,与用户给定的目标光流计算loss,求梯度作为guidance。
根据用户给定的光流估算一个mask,blended生成。
采用RePaint的resample technique。
Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos
detail extractor和synthesizer都用StableDiffusion初始化,都去掉了cross-attention,input block都做了扩展,两者都参与训练。
相当于给synthesizer的self-attention之后加了个cross-attention,Q是自己,KV是detail extractor的self-attention之前的feature。
类似AnyDoor,使用视频造数据进行训练。
生成时从
Move Anything with Layered Scene Diffusion
类似Locally-Conditioned-Diffusion和MultiDiffusion,给定图像和其对应的layout,可以通过对layout进行移动从而实现对物体的移动。
除了移动,增删layout可以实现物体的增删,调整layout图层顺序可以实现物体的前后调整。
InstantDrag: Improving Interactivity in Drag-based Image Editing
两个模型:a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion).
InstantDrag learns motion dynamics for drag-based image editing in real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation.
FlowGen:根据drag生成光流图。
FlowDiffusion:使用视频数据集自监督训练一个根据当前帧和光流生成目标帧的模型。
TIME: Editing Implicit Assumptions in Text-to-Image Diffusion Models
当prompt没有指明时,模型会做一些Implicit Assumptions进行生成,比如生成的玫瑰都是红色,医生都是男性。本方法将编辑这种Implicit Assumptions(是编辑,不是去除),比如将玫瑰是红色编辑为玫瑰是蓝色,这样模型以后再见到带有玫瑰的prompt时,就会默认生成蓝色的玫瑰。
做法是为所有cross-attention训练新的KV projection matrix,让新矩阵与玫瑰的乘积靠近原矩阵与蓝色玫瑰的乘积,这样新矩阵就会默认将玫瑰映射到原来模型里的蓝色玫瑰的投影。
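A sketch of a closed-form update for one cross-attention projection matrix W, assuming a regularized least-squares objective of the form sum_i ||W' e_src_i - W e_dst_i||^2 + lambda ||W' - W||^2 as described above; the exact formulation in the paper may differ in details.
```python
import torch

def edit_projection(W: torch.Tensor, e_src: torch.Tensor, e_dst: torch.Tensor,
                    lam: float = 0.1) -> torch.Tensor:
    """W: (d_out, d_in); e_src / e_dst: (n, d_in) token embeddings of the source
    ("rose") and destination ("blue rose") phrases. Returns the edited W'."""
    d_in = W.shape[1]
    # W' = (lam*W + W @ E_dst^T @ E_src) @ (lam*I + E_src^T @ E_src)^{-1}
    A = lam * W + W @ e_dst.t() @ e_src
    B = lam * torch.eye(d_in) + e_src.t() @ e_src
    return A @ torch.linalg.inv(B)
```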
Unified Concept Editing in Diffusion Models
和TIME类似,闭式解修改所有cross-attention的KV projection matrix。
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
改进UCE的公式。
MACE: Mass Concept Erasure in Diffusion Models
最后的融合多个LoRA成一个LoRA的方法类似Mix-of-Show中的方法。
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
Erasing Concepts from Diffusion Models
fine-tune StableDiffusion
反向编辑,对图像中与文本相关的内容进行擦除。
反向利用classifier guidance,fine-tune模型让预测的噪声与预训练模型的反向classifier guidance的噪声靠近。
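A sketch of the erasure objective described above (hypothetical interface): the frozen original model provides a negatively-guided noise prediction, and the fine-tuned model is regressed onto it for the concept prompt.
```python
import torch
import torch.nn.functional as F

def esd_loss(student, teacher, x_t, t, concept_emb, uncond_emb, eta: float = 1.0):
    """Concept-erasure loss (sketch); teacher is the frozen original model."""
    with torch.no_grad():
        eps_c = teacher(x_t, t, concept_emb)      # conditional prediction
        eps_u = teacher(x_t, t, uncond_emb)       # unconditional prediction
        target = eps_u - eta * (eps_c - eps_u)    # guide *away* from the concept
    return F.mse_loss(student(x_t, t, concept_emb), target)
```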
Ablating Concepts in Text-to-Image Diffusion Models
让StableDiffuion忘记一些concept,比如使用带有"in the style of Van Gogh"的prompt时,模型就会忽略"Van Gogh",生成正常style的图片。
使用"in the style of Van Gogh"构造一些prompt
Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient
对抗训练,让模型在grumpy cat和cat时预测的noise无法分辨,这样修改后的模型遇到grumpy cat时会按cat生成,忽略grumpy。
Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
对于一些想让StableDiffuion忘记的concept,收集一些reference images,并用concept造一些prompt,fine-tune整个StableDiffusion,loss为所有cross-attention layer处的concept对应的cross-attention map的所有响应值的平方和。
注意fine-tune时不需要diffusion loss。
Pruning for Robust Concept Erasing in Diffusion Models
Our method selectively prunes critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons. Our method can be easily integrated with existing concept-erasing techniques, offering a robust improvement against adversarial inputs.
stage 1: We use a numerical criterion to identify concept neurons.
stage 2: We validate concept neurons are sensitive to adversarial prompts.
ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning
We first identify critical regions within pre-trained models responsible for generating undesirable concepts, thereby facilitating straightforward concept unlearning via weight pruning.
EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts
original prompt是"a girl",erasure prompt是"naked",erasure prompt的cross-attention map注入original prompt的cross-attention map中并进行抑制。
Removing Undesirable Concepts in Text-to-Image Generative Models with Learnable Prompts
学习一个prompt embedding,其可以直接concat在CLIP text emebdding后送入cross-attention。
类似EM算法,轮流更新prompt embedding
Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models
只针对"... without xxx"句型的prompt仍然生成带有"xxx"的图像的情况。
zero xxx和zero EOT都不解决问题,只有同时zero才有效;EOT之间距离也很近。
对x和EOT的矩阵(
Separable Multi-Concept Erasure from Diffusion Models
All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models
Geom-Erasing: Geometry-Driven Removal of Implicit Concept in Diffusion Models
使用带有二维码、水印、文字的image-text pair数据集,将二维码、水印、文字的位置信息加进text,fine-tune StableDiffusion,这样生成时只用原text就可以避免生成二维码、水印、文字。
Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?
Localizing and Editing Knowledge in Text-to-Image Generative Models
不同属性的知识(objects style color action)分布在UNet中不同block中,只对想要编辑或者ablate的concept对应的属性对应的block做fine-tune。
EraseDiff: Erasing Data Influence in Diffusion Models
在训练时,对于需要遗忘的数据使用非高斯分布的噪声进行加噪,这样采样时就不会生成这些数据。
Robust Concept Erasure Using Task Vectors
Training-free Editioning of Text-to-Image Models
和erasing相反,让模型专注于某个concept的生成。
Position: Towards Implicit Prompt For Text-To-Image Models
erase concept后,用户依然可以通过implicit prompt生成该concept,比如erase了"Eiffel Tower",使用"Located in France, an iconic iron lattice tower, symbolizing the romance of Paris and French engineering prowess."依然可以生成。
针对这一问题提出了Benchmark,但没有提出解决方案。
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models
Existing research faces several challenges: (1) a lack of consistent comparisons on a comprehensive dataset, (2) ineffective prompts in harmful and nudity concepts, (3) overlooked evaluation of the ability to generate the benign part within prompts containing malicious concepts. To address these gaps, we propose to benchmark the concept removal methods by introducing a new dataset, Six-CD, along with a novel evaluation metric.
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models
针对image而非prompt的unlearning,prompt unlearning虽然阻止了模型在碰到特定的prompt时触发生成相应的内容,但diffusion model还是有生成该内容的能力的,而image unlearning直接让diffusion model失去生成该内容的能力。
做法是针对某个prompt分别收集retain和forget样本,使用Diffusion-DPO优化diffusion model。
Meta-Unlearning on Diffusion Models Preventing Relearning Unlearned Concepts
Boosting Alignment for Post-Unlearning Text-to-Image Generative Models
Unveiling Concept Attribution in Diffusion Models
SDEdit: Image Synthesis and Editing with Stochastic Differential Equations
需要source domain和target domain上训练好的diffusion model。
Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training
two-stage SDEdit
UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models
unpaired
A domain translation function extracts the domain information.
LaDiffGAN: Training GANs with Diffusion Supervision in Latent Spaces
Similar to Diff-Instruct: use a diffusion model to train a GAN for image-to-image translation.
CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
unpaired
Uses ControlNet to condition on the source domain
Palette: Image-to-Image diffusion models
paired,self-supervised learning,自动生成paired数据,如colorization,inpainting等
condition source image through concatenation
Denoising Diffusion Bridge Models
paired
The diffusion process diffuses from a point of one distribution to its paired point in another distribution; the training formulation is modified accordingly,
similar to ShiftDDPMs.
Diffusion Bridge Implicit Models
The DDIM version of DDBM: DBIM is to DDBM what DDIM is to DDPM, enabling accelerated sampling with a pretrained DDBM.
Latent Schrodinger Bridge: Prompting Latent Diffusion for Fast Unpaired Image-to-Image Translation
EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models
Consistency Diffusion Bridge Models
Score-Based Image-to-Image Brownian Bridge
Feedback Schrodinger Bridge Matching
ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models
unpaired
DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
unpaired
One word corresponds to one domain, which corresponds to one fine-tuned model.
pre-trained unconditional DDPM + pre-trained CLIP
For the dataset images, use
GPU-efficient: generation starting from latents
High-Fidelity Diffusion-based Image Editing
unpaired
DiffusionCLIP
Train a network to predict LoRA parameters for the convolutional layers, so the iterative per-style optimization of DiffusionCLIP is no longer needed.
EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations
Requires only a diffusion model trained on the target domain: given a source-domain image, perform SDEdit while guiding sampling with two pretrained energy functions.
Change domain-specific features: train a domain classifier, remove its classification layer to obtain an encoder, compute the cosine similarity between the features of the generated latent and the noisy latent of the source image, and use its gradient as guidance.
Preserve domain-independent features: apply a low-pass filter, compute the L2 distance between the low-pass-filtered generated latent and the noisy latent of the source image, and use its gradient as guidance.
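A minimal sketch of how the two expert gradients described above could be combined into one guidance term at a sampling step; the function names and weights are illustrative assumptions, not EGSDE's released code.

```python
import torch
import torch.nn.functional as F

def egsde_guidance(x_t, y_t, domain_encoder, low_pass, lambda_s=1.0, lambda_i=1.0):
    """Compute an EGSDE-style guidance gradient at one step.

    x_t: current sample on the target-domain trajectory
    y_t: source image noised to the same timestep
    domain_encoder: domain classifier with its classification head removed
    low_pass: a differentiable low-pass filter (e.g. downsample + upsample)
    lambda_s, lambda_i: trade-off weights (illustrative values)
    """
    x_t = x_t.detach().requires_grad_(True)

    # (1) Discard domain-specific features: penalize feature similarity to the source.
    feat_x = domain_encoder(x_t).flatten(1)
    feat_y = domain_encoder(y_t).flatten(1)
    e_realism = F.cosine_similarity(feat_x, feat_y, dim=1).sum()

    # (2) Keep domain-independent (low-frequency) content close to the source.
    e_faithful = (low_pass(x_t) - low_pass(y_t)).pow(2).sum()

    energy = lambda_s * e_realism + lambda_i * e_faithful
    grad = torch.autograd.grad(energy, x_t)[0]
    # The sampler steps *against* this gradient (scaled by the step size).
    return grad
```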
Dual Diffusion Implicit Bridges for Image-to-Image Translation
Requires diffusion models trained on both the source and target domains.
The Probability Flow ODEs form a Schrödinger bridge between the source and target domains.
Cycle consistency: a sample from the source domain
The first half of the cycle is the translation.
DECDM: Document Enhancement using Cycle-Consistent Diffusion Models
An application of DDIB to document images.
Unifying Diffusion Models' Latent Space, With Applications to Cyclediffusion and Guidance
Requires diffusion models trained on both the source and target domains.
The translation procedure is the same as DDIB, but uses the DPM-Encoder instead of the Probability Flow ODE.
If the same text-to-image model is used with two different texts as conditions, it can be viewed as two DPMs trained on the source and target domains respectively, so this method can do both image-to-image translation and image editing.
First encode with the source-domain model
then decode with the target-domain model
Note that the DPM-Encoder is designed for stochastic diffusion models.
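A rough sketch of this DPM-Encoder idea: record the per-step noises that reproduce the source trajectory under the source-domain model, then replay them under the target-domain model. `sample_forward_trajectory` and the `(mu, sigma)` model interface are assumed helpers, not an actual library API.

```python
import torch

@torch.no_grad()
def dpm_encode(x0, source_model, scheduler):
    """Encode a real image into the noise space of a *stochastic* DPM.
    scheduler.sample_forward_trajectory(x0) is assumed to return [x_0, ..., x_T];
    source_model(x_t, t) is assumed to return the posterior mean mu and std sigma
    of p(x_{t-1} | x_t)."""
    traj = scheduler.sample_forward_trajectory(x0)
    zs = []
    for t in reversed(range(1, len(traj))):
        mu, sigma = source_model(traj[t], t)
        # The noise the stochastic sampler would need to land exactly on x_{t-1}.
        zs.append((traj[t - 1] - mu) / sigma)
    return {"x_T": traj[-1], "z": zs}

@torch.no_grad()
def dpm_decode(code, target_model):
    """Replay the recorded noises under a target-domain model (or the same
    text-to-image model conditioned on a different prompt)."""
    x = code["x_T"]
    T = len(code["z"])
    for t, z in zip(range(T, 0, -1), code["z"]):
        mu, sigma = target_model(x, t)
        x = mu + sigma * z
    return x
```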
An Edit Friendly DDPM Noise Space
The method is the same as the DPM-Encoder (the authors claim it differs, but no real difference is apparent; they may be referring to an earlier version of the DPM-Encoder?).
LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance
DDPM-Inversion + SEGA (a combination of multiple guidances).
LEDITS++: Limitless Image Editing using Text-to-Image Models
Use DPM-Solver for inversion, estimate a mask using both cross-attention maps and the DiffEdit approach, and perform mask-based editing.
TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models
SDXL-Turbo + DDPM-Inversion for accelerated editing.
Zero-shot Image-to-Image Translation
Requires a pretrained StableDiffusion; performs translations such as cat
Use BLIP to generate a caption of the source image (the cat image), and encode the caption with CLIP to obtain
Use
Use
Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation
Same task as Pix2Pix-Zero.
After DDIM-inverting the source image with the source prompt, simply swapping in the target prompt for generation gives poor translation results, because of the abrupt transition of the text embedding in the early denoising stage.
we formulate a noise prediction strategy for the text-driven image-to-image translation by progressively updating the text prompt embedding via time-dependent interpolations of the source and target prompt embeddings.
FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation
Requires a pretrained StableDiffusion; performs translations such as man
unpaired
Train two encoders together with the diffusion model: one encodes content and one encodes style, exploiting inductive biases. Content is a spatial layout mask, down-/up-sampled to the feature-map size when used; style is a vector encoding high-level semantics. Every UNet layer uses AdaGN: style applies a channel-wise affine transformation, and content is multiplied spatially with the AdaGN output.
At sampling time, first DDIM-invert the image to noise using its own encodings, then generate with the content or style of the target image.
Diffusion-based Image Translation using Disentangled Style and Content Representation
SDEdit + guidance + resample technique
Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption
unpaired
Model adaptation using a diffusion model pretrained on source domain A and a few samples from target domain B, yielding a target-domain diffusion model. Initialize the model with the diffusion model pretrained on source domain A, and use an arbitrary source-domain image
Directional Distribution Consistency Loss: first use the datasets and CLIP to obtain a cross-domain direction vector
Translation is then done like SDEdit, using only the target-domain diffusion model.
Fine-grained Appearance Transfer with Diffusion Models
unpaired
Use DIFT for semantic matching and feature transfer.
S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion
unpaired
DDIM-invert the source image to
structure loss: MSE between the Sobel gradients of the generated image and the source image
appearance loss: take several target-domain images, encode them with the autoencoder and average the results; compare this with the generated
Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation
Self-supervised ControlNet training: training to reconstruct the lossless image features.
At translation time, first DDIM-invert the source image, then generate using the target prompt and different frequency bands of the source image.
DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization
Noise for 150 steps and denoise for 50 steps; at each step, apply Tweedie's formula based on
Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer
First noise the source image to an intermediate step, then denoise from that intermediate noise; at each step, apply Tweedie's formula based on
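Both notes above rely on Tweedie's formula to form a clean estimate from the current noisy sample; in the usual DDPM epsilon-prediction notation (notation assumed here) it reads:

```latex
\hat{x}_0 \;=\; \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
```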
Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style
StableDiffusion
For each style (e.g., Flatten Design, Fantasy, Food doodle), collect a few dozen text-image pairs, apply data augmentation, and fine-tune StableDiffusion as that style's specialist diffusion model; text input then generates images in that style.
StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation
StableDiffusion
Similar to IP-Adapter: encode all reference images with CLIP and feed them to a trainable StyEmb network to obtain a style feature. Insert a trainable cross-attention layer into StableDiffusion in which image tokens cross-attend to the style feature; its output is added to the output of the text cross-attention layer and passed to the next layer (Two-Path Cross-Attention). Train StyEmb and the newly inserted cross-attention layers to reconstruct the reference images.
At sampling time, generate from the prompt and a style image.
For data augmentation, we apply the random crop, resize, horizontal flipping, rotation, etc., to generate K = 3 style references for each input image during training.
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
StableDiffusion
Use a frozen CLIP to extract features from the reference image; the Q-Former queries cross-attend to these features together with the word "content"/"style". The Q-Former output is fed into StableDiffusion's text cross-attention through a newly trained KV projection matrix (reusing the Q of the text cross-attention); the projected output is concatenated with the text KV for the attention computation. Essentially a variant of IP-Adapter.
During training, when "style" is used, train on image pairs with the same style but different content, and analogously for "content". At inference only "style" is used; the "content" branch during training serves to make the style representation more disentangled.
ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text
Self-supervised training: extract the sketch of an image, noise the image, add the noised result and the sketch together as the UNet input, replace the UNet's cross-attention with linear layers, feed the image embedding from a pretrained CLIP into those linear layers, and train all parameters jointly for reconstruction.
At sampling time, extract the sketch of the source image and the CLIP image embedding of the reference image and feed them to the network, preserving the structure of the source image while transferring the style of the reference image.
The reference image embedding can also be manipulated with text: since CLIP text and image embeddings are aligned, the reference image embedding can be manipulated in CLIP embedding space according to a given text and scale.
ArtFusion: Arbitrary Style Transfer using Dual Conditional Latent Diffusion Models
Train a diffusion model conditioned on content and style, doing self-reconstruction conditioned on the input's own content (extracted by the LDM VAE) and style (VGG features).
At sampling time, use different content and style images.
SGDiff: A Style Guided Diffusion Model for Fashion Synthesis
Similar to ArtFusion, but uses patches of the input as the style.
StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models
First use a pretrained Style Removal model to strip the style from the source image and the reference image; then, similar to DiffusionCLIP, fine-tune the model with a CLIP directional loss (one model per style): in CLIP image-embedding space, the difference of the first pair should be similar to the difference of the second pair.
ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors
ControlNet + DiffusionCLIP。
One-Shot Structure-Aware Stylized Image Synthesis
Given
Using the SDEdit approach, use
SPN:a structure-preserving network (SPN), which utilizes a
CSGO: Content-Style Composition in Text-to-Image Generation
Construct a dataset to train ControlNet.
Visual Style Prompting with Swapping Self-Attention
At generation time, replace the keys and values of all self-attention layers after a certain decoder layer with the keys and values from the corresponding self-attention layers of the reference image's generation process.
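A toy, self-contained illustration of this key/value swapping, using a single-head attention function rather than a real UNet; the projection matrices and token shapes are made-up stand-ins.

```python
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv, kv_override=None):
    """Toy single-head self-attention. When kv_override is given, the current keys and
    values are replaced with those cached from the reference (style) image's pass,
    which is the core of the swapping trick described above."""
    q = x @ wq
    k, v = (x @ wk, x @ wv) if kv_override is None else kv_override
    attn = F.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

torch.manual_seed(0)
d = 8
wq, wk, wv = (torch.randn(d, d) * 0.1 for _ in range(3))

# Reference pass: cache K/V of the chosen layer while generating the style image.
ref_tokens = torch.randn(1, 16, d)            # stand-in for reference-image tokens
kv_cache = (ref_tokens @ wk, ref_tokens @ wv)

# Target pass: queries come from the content being generated, K/V from the reference.
tgt_tokens = torch.randn(1, 16, d)
out = self_attention(tgt_tokens, wq, wk, wv, kv_override=kv_cache)
print(out.shape)  # torch.Size([1, 16, 8])
```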
Training-free Color-Style Disentanglement for Constrained Text-to-Image Synthesis
Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
Similar to Visual Style Prompting: self-attention KV injection.
Portrait Diffusion: Training-free Face Stylization with Chain-of-Painting
If there is no
Style-guidance CFG is used so that the target image deviates from the content image as much as possible.
ZePo: Zero-Shot Portrait Stylization with Faster Sampling
Compared against Portrait Diffusion.
Diffusion Cocktail: Fused Generation from Diffusion Models
Usually each style gets its own fine-tuned model; use any pair of models for any-to-any style transfer by taking an image generated by one model as content and using another model to transfer its style.
The approach is similar to PnP, injecting features and self-attention maps; but since storing the source image's features and self-attention maps is memory-heavy, this paper stores only the latents of the source generation trajectory and re-infers the features and self-attention maps with the current model during style transfer, which performs nearly as well as using the original model's features and self-attention maps.
Training-free Content Injection using h-space in Diffusion Models
Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models
No training at all; only the predicted
FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models
Inspired by FreeU: feed the source (content) image through the UNet encoder and decoder, take the resulting features as backbone features, which carry mostly low-frequency information (content), and multiply them by a coefficient;
Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer
With P2P, directly appending the style prompt to the original (content) prompt and then doing text-guided style transfer destroys source-image information such as hair.
Compute cross-attention separately with the content prompt and the style prompt to obtain features
MagicStyle: Portrait Stylization Based on Reference Image
AdaIN technique.
Harnessing the Latent Diffusion Model for Training-Free Image Style Transfer
AdaIN technique.
Inversion-Based Creativity Transfer with Diffusion Models
StableDiffusion
Encode the reference image with CLIP and train a network that predicts a text token embedding (not CLIP-encoded) from the image embedding; this embedding is fed into the pretrained StableDiffusion (after passing through CLIP), and the network is trained with the TI objective.
StyleBooth: Image Style Editing with Multimodal Instruction
InstructPix2Pix-style: construct a dataset and train
DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning
Given a few-shot target dataset of a specific domain such as sketches painted by an artist, we expect to generate images that fall into the domain.
Customizing Text-to-Image Models with a Single Image Pair
ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank
ISPB: each style corresponds to a learnable parameter matrix, which a style-specific SSAM converts into a token embedding; train with TI using several images of that style, optimizing only the ISPB.
Stochastic Inversion:Random noise is hard to predict, and incorrectly predicted noise can cause a content mismatch between the stylized image and the content image. To this end, we first add random noise to the content image and use the denoising U-Net in the diffusion model to predict the noise in the image. The predicted noise is used as the initial input noise during inference to preserve content structure.
Towards Highly Realistic Artistic Style Transfer via Stable Diffusion with Step-aware and Layer-aware Prompt
Similar to ProSpect,
At generation time, in addition to DDIM Inversion, a pretrained edge ControlNet is used to preserve the structure of the content image.
Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models
An art-generation dataset containing prompts, negative prompts, and the generated images.
HiCAST: Highly Customized Arbitrary Style Transfer with Adapter Enhanced Diffusion Models
RB-Modulation: Training-Free Personalization of Diffusion Models using Stochastic Optimal Control
AttnMod: Attention-Based New Art Styles
modify attention for creating new unpromptable art styles out of existing diffusion models
Improving Diffusion Models for Inverse Problems using Manifold Constraints
Predecessor of DPS.
Diffusion Posterior Sampling for General Noisy Inverse Problems
Diffusion Posterior Proximal Sampling for Image Restoration
Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models trained on Corrupted Data
Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient Viewpoint
Improving Diffusion-Based Image Restoration with Error Contraction and Error Correction
Consistency Models Improve Diffusion Inverse Solvers
Deep Data Consistency: a Fast and Robust Diffusion Model-based Solver for Inverse Problems
Learning Diffusion Priors from Observations by Expectation Maximization
Zero-Shot Adaptation for Approximate Posterior Sampling of Diffusion Models in Inverse Problems
Prototype Clustered Diffusion Models for Versatile Inverse Problems
Reducing the cost of Posterior Sampling in Linear Inverse Problems via task-dependent Score Learning
Gaussian is All You Need: A Unified Framework for Solving Inverse Problems via Diffusion Posterior Sampling
Think Twice Before You Act: Improving Inverse Problem Solving With MCMC
Online Posterior Sampling with a Diffusion Prior
Variational Diffusion Posterior Sampling with Midpoint Guidance
Free Hunch: Denoiser Covariance Estimation for Diffusion Models Without Extra Costs
Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration
DEFT: Efficient Finetuning of Conditional Diffusion Models by Learning the Generalised h-transform
Similar to PDAE: use paired data from the inverse problem to train a gradient estimator for guided sampling.
Pseudoinverse-Guided Diffusion Models for Inverse Problems
Inverse Problems with Diffusion Models: A MAP Estimation Perspective
DreamGuider: Improved Training free Diffusion-based Conditional Generation
CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems
Similar to CCM: train a ControlNet for a consistency model, solving inverse problems in very few steps.
Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion
Previous methods all use the first-order Tweedie's formula to compute
DMPlug: A Plug-in Method for Solving Inverse Problems with Diffusion Models
On one side is a DPS-like method that uses the per-step DDIM prediction of
Fast Samplers for Inverse Problems in Iterative Refinement Models
Conditional Conjugate Integrators
Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis
Fast Diffusion EM: a diffusion model for blind inverse problems with application to deconvolution
real world:
Variational inference could perhaps also be used
Blind Inversion using Latent Diffusion Priors
Uses the EM algorithm.
An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations
Uses the EM algorithm.
Adapt and Diffuse: Sample-adaptive Reconstruction via Latent Diffusion Models
At inference time, only
Bayesian Conditioned Diffusion Models for Inverse Problems
Amortized Posterior Sampling with Diffusion Prior Distillation
Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems
amortized variational inference
Denoising Diffusion Restoration Models
Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model
Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance
Image Restoration with Mean-Reverting Stochastic Differential Equations
The SDE form of PriorShift from ShiftDDPMs.
Deep Equilibrium Diffusion Restoration with Parallel Sampling
DEQ-based
Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)
We propose a zero-shot restoration scheme that uses CMs and operates well with as little as 4 NFEs.
Parallel Diffusion Models of Operator and Image for Blind Inverse Problems
Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models
Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion
Generative Diffusion Prior for Unified Image Restoration and Enhancement
Diffusion Priors for Variational Likelihood Estimation and Image Denoising
Blind Image Restoration via Fast Diffusion Inversion
Similar to DMPlug: we aim to find the initial noise sample that generates the image when run through DDIM.
FlowIE: Efficient Image Enhancement via Rectified Flow
Directly uses a flow to model the path between two distributions, applicable to many tasks such as inpainting, colorization, and super resolution.
AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion
Train an image restoration network that can handle different degradations.
Train a network to identify which predefined degradation (e.g., blur) the input image has, and fill it into a template to form a prompt (e.g., "a photo needs {blur} artifact reduction").
Train an LDM on data covering multiple predefined degradations, with the degraded image concatenated to
At inference, feed the input image into the network above to obtain the prompt, then feed both into the LDM for restoration.
UIR-LoRA: Achieving Universal Image Restoration through Multiple Low-Rank Adaptation
DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior
PromptIR: Prompting for All-in-One Blind Image Restoration
Exploiting Diffusion Priors for All-in-One Image Restoration
Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks
类似AutoDIR
TIP: Text-Driven Image Processing with Semantic and Restoration Instructions
ControlNet-based: the ControlNet takes the degradation instruction and StableDiffusion takes the prompt; trained in a self-supervised manner.
Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation
create pairs of (clean, corrupted) images and utilize them for fine-tuning to enable the recovery of corrupted images to their clean states.
PromptFix: You Prompt and We Fix the Photo
We compile approximately two million raw data points across eight tasks: image inpainting, object creation, image dehazing, image colorization, super-resolution, low-light enhancement, snow removal, and watermark removal. For each low-level task, we utilized GPT-4 to generate diverse training instruction prompts Pinstruction. These prompts include task-specific and general instructions. The task-specific prompts, exceeding 250 entries, clearly define the task objectives. For example, "Improve the visibility of the image by reducing haze" for dehazing.
For watermark removal, super-resolution, image dehazing, snow removal, low-light enhancement, and image colorization tasks, we also generate "auxiliary prompts" for each instance. These auxiliary prompts describe the quality issues for the input image and provide semantic captions.
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
Use an MLLM to generate the prompt, inject the LQ image via ControlNet, and feed it to SDXL to generate the HQ image.
Diff-Restorer: Unleashing Visual Prompts for Diffusion-based Universal Image Restoration
Similar to SUPIR.
InstantIR: Blind Image Restoration with Instant Generative Reference
ReFIR: Grounding Large Restoration Models with Retrieval Augmentation
A Modular Conditional Diffusion Framework for Image Reconstruction
Taming Generative Diffusion Prior for Universal Blind Image Restoration
guidance
PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance
Partial guidance, similar to GradPaint; unifies many tasks under one framework.
PFStorer: Personalized Face Restoration and Super-Resolution
Restoration with a reference image: the LQ image is injected as in StableSR, and the reference image is injected in a ControlNet-like way.
RestorerID: Towards Tuning-Free Face Restoration with ID Preservation
CLR-Face: Conditional Latent Refinement for Blind Face Restoration Using Score-Based Diffusion Models
DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior
ControlNet
Towards Unsupervised Blind Face Restoration using Diffusion Prior
AuthFace: Towards Authentic Blind Face Restoration with Face-oriented Generative Diffusion Prior
DR-BFR: Degradation Representation with Diffusion Models for Blind Face Restoration
OSDFace: One-Step Diffusion Model for Face Restoration
Upsample the LR image to the HR resolution; the problem then becomes LQ-to-HQ restoration.
SRDiff: Single Image Super-Resolution with Diffusion Probabilistic Models
The diffusion model, conditioned on the LR image, models the residual between the HR image and upsample(LR).
Image Super-Resolution via Iterative Refinement
The low-resolution image is upsampled to high resolution and concatenated to
Exploiting Diffusion Prior for Real-World Image Super-Resolution
Upsample the LR image to HR resolution, encode it with the VAE encoder, and feed it to a trainable time-aware encoder to obtain multi-scale features; then train a small convolutional network (SFT) that predicts scale and shift from these features to affine-transform the corresponding StableDiffusion features. Only the encoder and the SFT are trained.
color correction: for each channel, subtract the prediction's own mean and divide by its own standard deviation, then multiply by the LR image's standard deviation in that channel and add its mean (see the sketch after these notes).
Train a CFW module that uses the VAE encoder's features
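A direct transcription of the channel-wise color-correction rule above into code (the epsilon for numerical stability is my addition):

```python
import torch

def color_correct(pred, lr):
    """Re-normalize each channel of the prediction to match the LR image's
    per-channel mean/std. Tensors are (B, C, H, W)."""
    dims = (2, 3)
    p_mean = pred.mean(dim=dims, keepdim=True)
    p_std = pred.std(dim=dims, keepdim=True)
    lr_mean = lr.mean(dim=dims, keepdim=True)
    lr_std = lr.std(dim=dims, keepdim=True)
    return (pred - p_mean) / (p_std + 1e-6) * lr_std + lr_mean
```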
ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting
Super resolution where the diffusion process starts from the HR image and ends at the LR image, progressively adding the LR-HR residual; the posterior is derived as in ShiftDDPMs and the reverse process is modeled.
SinSR: Diffusion-Based Image Super-Resolution in a Single Step
Extends ResShift to deterministic DDIM-style sampling, then distills it into a single step.
Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs
TASR: Timestep-Aware Diffusion Model for Image Super-Resolution
PatchScaler: An Efficient Patch-independent Diffusion Model for Super-Resolution
confidence-driven loss:
Use
GRM produces coarse HR features
Regularization by Texts for Latent Diffusion Inverse Solvers
Text-guided super-resolution and deblurring.
Image Super-Resolution with Text Prompt Diffusion
The upsampled LR image is concatenated to
Text-guided Explorable Image Super-resolution
CoSeR: Bridging Image and Language for Cognitive Super-Resolution
Similar to PromptSR: generate rough HR reference images and a prompt from the LR input, and train a diffusion model conditioned on both for super-resolution.
FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution
CasSR: Activating Image Power for Real-World Image Super-Resolution
Generate rough HR reference images from the LR input and condition the diffusion model on them together with the LR image for super-resolution.
SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution
Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization
XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution
Similar to SUPIR: use an MLLM to generate the prompt.
SAM-DiffSR: Structure-Modulated Diffusion Model for Image Super-Resolution
SAM-assisted.
Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution
SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution
action 0 is to perform the reverse diffusion process with the current state, while action 1 is to skip the diffusion process.
Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution
Retrains a score model with two losses.
At each training iteration, first sample a result from the LR input via the score model's PF ODE and compute a perceptual loss against the HR image, which is
Frequency-Domain Refinement with Multiscale Diffusion for Super Resolution
BlindDiff: Empowering Degradation Modelling in Diffusion Models for Blind Image Super-Resolution
most methods are tailored to solving non-blind inverse problems with fixed known degradation settings, limiting their adaptability to real-world applications that involve complex unknown degradations.
Introduces an estimate of the degradation level.
CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution
Blind Image Super-Resolution
DiffFNO: Diffusion Fourier Neural Operator
RFSR: Improving ISR Diffusion Models via Reward Feedback Learning
One-Step Effective Diffusion Network for Real-World Image Super-Resolution
Similar in spirit to Diff-Instruct.
Arbitrary-steps Image Super-resolution via Diffusion Inversion
AdaDiffSR: Adaptive Region-aware Dynamic Acceleration Diffusion Model for Real-World Image Super-Resolution
Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors
TDDSR: Single-Step Diffusion with Two Discriminators for Super Resolution
Adversarial Diffusion Compression for Real-World Image Super-Resolution
HF-Diff: High-Frequency Perceptual Loss and Distribution Matching for One-Step Diffusion-Based Image Super-Resolution
TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution
OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs
Step-distillation for acceleration.
Blended Diffusion for Text-driven Editing of Natural Images
Blended Latent Diffusion
training-free,text-free + text-guided
pre-trained unconditional diffusion model + pre-trained CLIP as guidance.
Similar to inpainting: at each step, the unmasked part of the sample is replaced using
extending augmentations
LatentPaint: Image Inpainting in Latent Space with Diffusion Models
training-free,text-free
Applies the blended approach to a latent representation (e.g., the h-space).
RePaint: Inpainting using Denoising Diffusion Probabilistic Models
training-free,text-free
resample technique
TD-Paint: Faster Diffusion Inpainting Through Time Aware Pixel Conditioning
training-free,text-free
Removes RePaint's resampling procedure, yielding a speedup.
Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models
training-free,text-free
Apply a gradient update to each stochastic DDIM sampling step, with a loss given by its
Also uses the resample technique.
GradPaint: Gradient-Guided Inpainting with Diffusion Models
training-free,text-free
A gradient-guidance version of CoPaint: at each step, compute the MSE between the current result and the unmasked region of the original image and use its gradient as guidance, similar to posterior sampling.
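A minimal sketch of this gradient guidance, assuming an `x0_pred_fn` that maps the current sample to a clean estimate (e.g., via Tweedie's formula); the function names and step scale are illustrative, not the paper's code.

```python
import torch

def gradpaint_guidance(x_t, x0_pred_fn, x0_known, mask, scale=1.0):
    """One guidance step: penalize the mismatch between the current clean estimate and
    the known (unmasked) region, and step x_t against the gradient of that penalty.

    mask: 1 for missing (to-be-inpainted) pixels, 0 for known pixels.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = x0_pred_fn(x_t)
    loss = ((1 - mask) * (x0_pred - x0_known)).pow(2).mean()
    grad = torch.autograd.grad(loss, x_t)[0]
    return x_t.detach() - scale * grad
```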
Image Inpainting via Tractable Steering of Diffusion Models
Tractable Probabilistic Models
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
training-based,text-guided
See its Text-Guided Inpainting Model.
Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
training-based,text-guided
The Imagen counterpart of GLIDE's text-guided inpainting model: directly downsampling and concatenating causes artifacts at the mask boundary, so an encoder is trained to do the downsampling.
High-Resolution Image Synthesis with Latent Diffusion Models
training-based,text-guided
The StableDiffusion counterpart of GLIDE's text-guided inpainting model, trained on LAION with random masks; the masked image is also encoded by the VAE encoder, and the mask is downsampled to
Improving Text-guided Object Inpainting with Semantic Pre-inpainting
training-based,text-guided
StableInpainting有两个缺点:Solely relying on one single U-Net to align text prompt and visual object across all the denoising timesteps is insufficient to generate desired objects; The controllability of object generation is not guaranteed in the intricate sampling space of diffusion model.
pre-inpainting
SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model
training-based,text-guided
self supervised learning using panoptic segmentation dataset
mask augmentation + background preservation with mask prediction
At editing time, the shape can also be specified via the mask.
A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
training-based,text-guided
Same training recipe as StableInpainting, with an additional trainable prompt inserted into the text as the task prompt.
Adding Conditional Control to Text-to-Image Diffusion Models
training-based,text-guided
BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion
training-based,text-guided
An improved ControlNet: the added branch removes the cross-attention layers and processes only the image.
Brush2Prompt: Contextual Prompt Generator for Object Inpainting
Automatically generate an inpainting prompt from the unmasked content and the mask shape, then inpaint with a text-guided inpainting model.
LoMOE: Localized Multi-Object Editing via Multi-Diffusion
training-free,text-guided。
Use BLIP to generate the image's prompt and regularized DDIM Inversion to obtain
Since multiple regions are edited, mask-based MultiDiffusion is used: each region is denoised once with its own edit prompt, and the results are combined according to the masks.
The classic two-branch approach: a loss is computed between the two branches and optimized by gradient descent
HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models
A training-free, text-guided method built on StableInpainting.
Using the pretrained StableInpainting model, replace all self-attention layers with Prompt-Aware Introverted Attention (PAIntA) layers. These still compute self-attention, but modify each masked pixel's self-attention map: the response to each unmasked pixel is multiplied by a coefficient equal to the sum of that unmasked pixel's cross-attention responses over all words, so masked pixels attend more to the unmasked pixels that are relevant to the text. Since every self-attention layer (now a PAIntA layer) in StableInpainting precedes a cross-attention layer, the computation borrows the parameters of the following cross-attention layer.
Reweighting Attention Score Guidance: compute each word's cross-attention map and a cross-entropy objective against the mask, maximizing the cross-attention scores inside the masked region and minimizing them outside; sum over words and take the gradient as guidance. Ordinary guidance shifts samples off-distribution and hurts quality, so here the guidance is divided by its standard deviation and substituted for the noise term in the stochastic DDIM update: that update keeps samples on-distribution when its noise is standard normal, so rescaling the guidance to unit variance (while keeping its mean) preserves quality while still steering generation; see the sketch after these notes.
Train a super-resolution LDM to upscale the inpainting result.
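A sketch of the standardized-guidance trick described in the RASG note above: rescale the guidance gradient to unit standard deviation and use it in place of the Gaussian noise of stochastic DDIM. The sign convention (ascent vs. descent on the attention objective) and variable names are assumptions.

```python
import torch

def rasg_step(mu_prev, sigma_t, guidance_grad):
    """One stochastic-DDIM update where the noise term is replaced by the guidance
    gradient rescaled to (roughly) unit standard deviation; `mu_prev` is the
    deterministic part of the update and `sigma_t` its noise scale."""
    g = guidance_grad / (guidance_grad.std() + 1e-8)
    return mu_prev + sigma_t * g
```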
MagicRemover: Tuning-free Text-guided Image inpainting with Diffusion Models
Training-free, text-guided, specialized for object removal; the text names the object to remove.
optimizing
Inject the self-attention K/V of the reconstructive generation trajectory into the inpainting trajectory; following the MasaCtrl idea, a mask of the object can be estimated from the reconstructive trajectory's cross-attention, so that the object region of the inpainting trajectory's self-attention only attends to the reconstructive K/V outside the mask.
Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance
Training-free; requires a mask of the object to remove.
AAS: increase the self-attention weights between the masked region and the background, so the removed area better matches the background; decrease the weights within the masked region, since there is nothing there to reference while inpainting it; and decrease the weights from the background to the masked region, to keep the background unaffected.
SARG: attention guidance similar to PAG,
MagicEraser: Erasing Any Objects via Semantics-Aware Control
Use TI to learn an adjective token placed before background-related words, e.g. $\text{A photo of } S_{\star} \text{ sky}$, while LoRA fine-tuning the diffusion model on a constructed dataset.
Increase the self-attention weights of regions related to the masked region and decrease those of unrelated regions.
At zero-shot inference, use a prompt containing
Coherent and Multi-modality Image Inpainting via Latent Space Optimization
training-free,text-guided
We believe that the early stage of the reverse process determines the semantics of the generated image; therefore, only in the first
Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model
training-based,text-free + text-guided
We found blending alone insufficient: since the known information is inserted externally rather than generated by the model itself, the model lacks full context awareness, potentially causing incoherent semantic transitions near the hole boundary. A brief masked fine-tuning of the model suffices, after which blended generation is used as before.
Further adopts masked attention: for cross-attention, only the pixels inside the masked region attend to the text; for self-attention, only the pixels inside the masked region attend to each other.
Multi-modality Guided Image Completion
training-based,text-based,StableInpainting Model。
Each modality has its own encoder that extracts multi-scale features; each scale is injected into the corresponding scale of the UNet encoder features. Structure-form modalities (e.g., segmentation, edges) are added directly; context-form modalities (e.g., text, style) are pooled and injected into cross-attention as context vectors. During training, the StableDiffusion Inpainting Model is frozen and only the modality encoders are trained, each modality separately. Somewhat similar to ControlNet.
At sampling time, multiple modality encoders can be used together, but not via the injection scheme above (the features are not additive). Instead, compute an MSE loss between the multi-scale features of the StableDiffusion Inpainting Model's UNet and the multi-scale features obtained with each single modality encoder, and use the gradients as guidance; since gradients are additive, this enables multi-modal guidance without retraining on multiple modalities.
Inpaint Anything: Segment Anything Meets Image Inpainting
SAM + any inpainting model
Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting
training-based
Uses the IR-SDE formulation, with the masked image as
Sparse structure: e.g., the grayscale map and the edge map.
ByteEdit: Boost, Comply and Accelerate Generative Image Editing
StableInpainting with feedback learning
Assume the diffusion model uses
Sketch-guided Image Inpainting with Partial Discrete Diffusion Process
Apply discrete diffusion only to the tokens in the masked region; construct a dataset for self-supervised training.
Lazy Diffusion Transformer for Interactive Image Editing
Uses Pixel-
Uses Pixel-
Constructs a dataset as in SmartBrush for self-supervised training.
AsyncDSB: Schedule-Asynchronous Diffusion Schrödinger Bridge for Image Inpainting
Generative Powers of Ten
zoom stack
Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach
Randomly crop two views, an anchor view and a target view, resize them to the same shape, compute RPE from the top-left coordinates, and train to generate the target view conditioned on the anchor view.
Salient Object-Aware Background Generation using Text-Guided Diffusion Models
We use Stable Inpainting as a base model and add the ControlNet model on top to adapt it to the salient object outpainting task.
Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
SODA: Bottleneck Diffusion Models for Representation Learning
The UNet has
Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models
Diffusion Bridge AutoEncoders for Unsupervised Representation Learning
The encoder encodes
Compared with Diff-AE and PDAE, the data's information is stored separately in
Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation
DiffuseGAE: Controllable and High-fidelity Image Manipulation from Disentangled Representation
Learn disentangled representations on Diff-AE's latent space.
DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models
Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement
Closed-Loop Unsupervised Representation Disentanglement with beta-VAE Distillation and Diffusion Probabilistic Feedback
Factorized Diffusion Autoencoder for Unsupervised Disentangled Representation Learning
Exploring Diffusion Time-steps for Unsupervised Representation Learning
Causal Diffusion Autoencoders: Toward Counterfactual Generation via Diffusion Probabilistic Models
Object-Centric Slot Diffusion
Learning to Compose: Improving Object Centric Learning by Injecting Compositionality
Denoising Diffusion Autoencoders are Unified Self-supervised Learners
Diffusion Models as Masked Autoencoders
Unified Auto-Encoding with Masked Diffusion
Use the MAE loss on the masked region and the diffusion loss on the noisy part, trained jointly.
Additionally introduce
Masked Diffusion as Self-supervised Representation Learner
MAE with a dynamic mask ratio.
StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
Generate images from text as training data.
Can Generative Models Improve Self-Supervised Representation Learning?
Generate images from the source images as data; instance-guided generation serves as an augmentation for SSL.
Unlike StableRep, we do not replace a real dataset with a synthetic one. Instead, we leverage conditional generative models to enrich augmentations for self-supervised learning. In addition, our method does not require text prompts and directly uses images as input to the generative model.
Personalized Representation from Personalized Generation
Contrastive Learning with Synthetic Positives
Multi Positive Contrastive Learning with Pose-Consistent Generated Images
GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning
Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images
DreamDA: Generative Data Augmentation with Diffusion Models
Add Gaussian noise to the h-space feature, which is used to predict
DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling
Deconstructing Denoising Diffusion Models for Self-Supervised Learning
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models
Diffusion Model as Representation Learner
Distill the intermediate representation from a pre-trained diffusion model to a recognition student.
After the distillation phase, the student is reapplied as a feature extractor and fine-tuned with the task label.
Reinforced Time Selection for Distillation.
De-Diffusion Makes Text a Strong Cross-Modal Interface
text as representation, encoder is a captioning model, decoder is a text2img model
gumbel softmax
Do text-free diffusion models learn discriminative visual representations?
Use UNet intermediate feature maps for discrimination.
Diffusion Feedback Helps CLIP See Better
DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, with only images (without corresponding text).
CLIP image and text embeddings live in the same space, so the image embedding can be fed to StableDiffusion as the condition.
Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
One Diffusion to Generate Them All
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Different tasks are converted into different instructions; (source image, instruction, target image) triplets form the data, with the instruction as the text input. A StableDiffusion is trained to generate the target image, with the source image concatenated to
Training on InstructPix2Pix data also enables editing.
DreamOmni: Unified Image Generation and Editing
InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists
The instruction is the text input to StableDiffusion, and the source image
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Toward a Diffusion-Based Generalist for Dense Vision Tasks
DiffX: Guide Your Layout to Cross-Modal Generative Modeling
Robust Classification via a Single Diffusion Model
Diffusion Models are Certifiably Robust Classifiers
Few-shot Learner Parameterization by Diffusion Time-steps
LoRA fine-tune StableDiffusion on the few-shot dataset with the prompt "a photo of [C]"; inference uses a formula similar to RDC above, but with a timestep weight added to the formula, which is shown to be important.
Image Captions are Natural Prompts for Text-to-Image Models
For datasets with only class labels, such as ImageNet: use a pretrained captioning model to caption each image, append the caption to "a photo of class" to form a prompt, generate an image for that prompt with pretrained StableDiffusion, and replace the original image with the generated one. The synthetic dataset has the same size as the original, and training a classifier on it works better.
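A possible sketch of this caption-then-regenerate pipeline using off-the-shelf BLIP and StableDiffusion checkpoints; the model identifiers and the exact prompt template are illustrative, not the paper's.

```python
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Captioner (any pretrained captioning model would do; BLIP is used for illustration).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

# Text-to-image generator.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

def synthesize(image, class_name):
    """Replace one labeled real image with a synthetic one: caption the real image,
    prepend the class template, and regenerate with StableDiffusion."""
    inputs = processor(images=image, return_tensors="pt").to(device)
    caption = processor.decode(captioner.generate(**inputs)[0], skip_special_tokens=True)
    prompt = f"a photo of {class_name}, {caption}"   # template is an assumption
    return pipe(prompt).images[0], prompt
```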
Classification-Denoising Networks
Just Leaf It: Accelerating Diffusion Classifiers with Hierarchical Class Pruning
A Simple and Efficient Baseline for Zero-Shot Generative Classification
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
Unified Text-to-Image Generation and Retrieval
DiffusionDet: Diffusion Model for Object Detection
CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models
FocusDiffuser: Perceiving Local Disparities for Camouflaged Object Detection
DiffRef3D: A Diffusion-based Proposal Refinement Framework for 3D Object
DiffuBox: Refining 3D Object Detection with Point Diffusion
Monocular: 3D Object Detection and Pose Estimation with Diffusion Models
SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection
CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection
DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection
Digging into contrastive learning for robust depth estimation with diffusion models
BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation
Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation
SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation
SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models
Depth Any Video with Scalable Synthetic Data
FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models
DFormer: Diffusion-guided Transformer for Universal Image Segmentation
Diffusion Models for Zero-Shot Open-Vocabulary Segmentation
Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis
Concatenate the image and its segmentation along the channel dimension as one sample and train a text-guided diffusion model with a new Gaussian-Categorical distribution formulation; it can generate the image and segmentation jointly from text, and also generate one from the other.
SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow
We train an ordinary differential equation (ODE) model to transport between the distributions of real images and semantic masks.
UniGS: Unified Representation for Image Generation and Segmentation
Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation
Domain-adaptive semantic segmentation: use image translation to transfer the segmentation model.
Requires source-domain images with segmentation maps and target-domain images.
Train a segmentation model on the source-domain images and maps, and a diffusion model on the target-domain images. Apply SDEdit to a source-domain image, using the loss between the segmentation model's prediction and the ground-truth map as a gradient correction, to generate the image corresponding to that segmentation map in the target domain; then fine-tune the source-domain segmentation model on these pairs to obtain a target-domain segmentation model.
DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control
A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting
pix2gestalt: Amodal Segmentation by Synthesizing Wholes
Diffusion Model for Dense Matching
Apply Diffusion Model on Image Captioning
DiffCap: Exploring Continuous Diffusion on Image Captioning
Text-Only Image Captioning with Multi-Context Data Generation
Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Parallel Vertex Diffusion for Unified Visual Grounding
Language-Guided Diffusion Model for Visual Grounding
Exploring Iterative Refinement with Diffusion Models for Video Grounding
DDP: Diffusion Model for Dense Visual Prediction
DIFFANT: Diffusion Models for Action Anticipation
DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion
Faster Diffusion Action Segmentation
ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation
DiffusionTrack: Diffusion Model For Multi-Object Tracking
DINTR: Tracking via Diffusion-based Interpolation
DiffusionTrack: Point Set Diffusion Model for Visual Object Tracking
DeTrack: In-model Latent Denoising Learning for Visual Object Tracking
MomentDiff: Generative Video Moment Retrieval from Random to Real
Conditional Diffusion Model for Open-ended Video Question Answering
DiffSED: Sound Event Detection with Denoising Diffusion
Is Synthetic Data From Diffusion Models Ready for Knowledge Distillation?
Use data generated by a diffusion model as the training set: feed it to the pretrained teacher network and distill into the student network. This removes the restriction to real datasets and works well; generated low-fidelity images (e.g., from fewer sampling steps) work even better.
Knowledge Diffusion for Distillation
Train a diffusion model on the features extracted by the teacher network; treat the features extracted by the student network as noisy versions of the teacher features and denoise them, then compute a KL loss between the denoised features and the teacher features to optimize the student network.
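A highly simplified sketch of this feature-denoising distillation step; the tiny MLP stands in for the diffusion model trained on teacher features, and collapsing the multi-step denoising into a single call is my simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDenoiser(nn.Module):
    """Stand-in for the diffusion model trained on teacher features."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feat):
        return self.net(feat)

def distill_step(student_feat, teacher_feat, denoiser, temperature=1.0):
    # Treat the student feature as a noisy teacher feature and "denoise" it.
    denoised = denoiser(student_feat)
    # Align the denoised student feature with the teacher feature via a KL loss
    # on softened distributions (an MSE would be a reasonable alternative).
    p = F.log_softmax(denoised / temperature, dim=-1)
    q = F.softmax(teacher_feat / temperature, dim=-1)
    return F.kl_div(p, q, reduction="batchmean")
```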
data attribution: for a generated image, which training samples contributed most to it?
Evaluating Data Attribution for Text-to-Image Models
Intriguing Properties of Data Attribution on Diffusion Models
Detecting Image Attribution for Text-to-Image Diffusion Models in RGB and Beyond
Efficient Shapley Values for Attributing Global Properties of Diffusion Models to Data Group
SemGIR: Semantic-Guided Image Regeneration based method for AI-generated Image Detection and Attribution
MONTRAGE: Monitoring Training for Attribution of Generative Diffusion Models
Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Model
Latent Dataset Distillation with Diffusion Models
Dataset distillation aims to generate a small set of representative synthetic samples from the original training set.
D4M: Dataset Distillation via Disentangled Diffusion Model
Feature Denoising Diffusion Model for Blind Image Quality Assessment
eDifFIQA: Towards Efficient Face Image Quality Assessment Based On Denoising Diffusion Probabilistic Models
DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild
Comparison of No-Reference Image Quality Models via MAP Estimation in Diffusion Latents
Use pretrained generative models to assist perception models, or use generated data to improve them.
DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models
CleanDIFT: Diffusion Features without Noise
AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks
Upgrading VAE Training With Unlimited Data Plans Provided by Diffusion Models
Exploiting Diffusion Prior for Generalizable Pixel-Level Semantic Prediction
Scaling Laws of Synthetic Images for Model Training
Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors
To effectively transfer learned features to discriminative tasks while ensuring compatibility, an intuitive approach is to introduce the prior knowledge of the recognition model. A pretrained ResNet-18 is used to introduce this discriminative prior.
The U-head has two flows: a down-sample flow producing global features for tasks like classification, and an up-sample flow producing spatial features for tasks like segmentation.
Scaling Properties of Diffusion Models for Perceptual Tasks
Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models
Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features
Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques
Diffusion Models Trained with Large Data Are Transferable Visual Models
We show that, simply initializing image understanding models using a pre-trained UNet (or transformer) of diffusion models, it is possible to achieve remarkable transferable performance on fundamental vision perception tasks using a moderate amount of target data.
Take a pretrained diffusion model, feed it the clean image with timestep 1, and fine-tune it to predict the target (e.g., depth).
Add-SD: Rational Generation without Manual Reference
Use a diffusion model to edit images by adding objects, addressing the long-tailed class distribution in downstream classification, segmentation, and detection.
Diffusion Models as Data Mining Tools
Diffusion Models Beat GANs on Image Classification
UNet feature + classification head
Feedback-Guided Data Synthesis for Imbalanced Classification
Analyzing and Explaining Image Classifiers via Diffusion Guidance
Diversify, Don’t Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images
Active Generation for Image Classification
Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model
Efficient Exploration of Image Classifier Failures with Bayesian Optimization and Text-to-Image Models
Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers
DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection
Use StableDiffusion to produce detection data, similar to Attention as Annotation.
First, feed existing detection data with one step of added noise into StableDiffusion and train a Detection Adaptor that predicts bounding boxes from the UNet feature-map pyramid. Then freeze the Detection Adaptor, construct simple generic prompts, add noise to existing detection images and regenerate them (similar to SDEdit), feed the last step's feature-map pyramid into the Detection Adaptor, and use its output as the bounding-box annotation of the generated image.
Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation
Use StableDiffusion to produce detection data by generating foreground and background separately and compositing them.
Data Augmentation for Object Detection via Controllable Diffusion Models
Learning Compositional Language-based Object Detection with Diffusion-based Synthetic Data
3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features
DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception
Diffusion Domain Teacher: Diffusion Guided Domain Adaptive Object Detector
No Annotations for Object Detection in Art through Stable Diffusion
Representative Feature Extraction During Diffusion Process for Sketch Extraction with One Example
Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model
Use pretrained networks to annotate StableDiffusion-generated images, building depth and saliency datasets.
Extract the intermediate output of some self-attention layer at some sampling step. Interpolate lower resolution predictions to the size of synthesized images. A linear classifier is trained on it to predict the pixel-level logits.
StableDiffusion plus the linear classifier can then be used for prediction.
JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling
a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps).
ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation
PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage
Label-Efficient Semantic Segmentation With Diffusion Models
Noise the input image and feed it into the UNet of a pretrained DDPM; upsample the feature maps output by several decoder layers to the image size and concatenate them, so each pixel has a feature vector; feed these vectors into an MLP for label prediction and train it.
Experiments select
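A sketch of this per-pixel feature extraction and MLP classifier; `add_noise` and the `return_features=True` UNet interface are assumed helpers, since real implementations hook the chosen decoder blocks directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pixel_features(unet, x0, timesteps, blocks, scheduler):
    """Collect per-pixel features from a frozen pretrained DDPM.
    unet(x_t, t, return_features=True) is assumed to also return the intermediate
    decoder feature maps; scheduler.add_noise(x0, noise, t) is an assumed helper."""
    feats = []
    for t in timesteps:
        noise = torch.randn_like(x0)
        x_t = scheduler.add_noise(x0, noise, t)
        _, dec_feats = unet(x_t, t, return_features=True)
        for b in blocks:
            f = F.interpolate(dec_feats[b], size=x0.shape[-2:], mode="bilinear",
                              align_corners=False)
            feats.append(f)
    return torch.cat(feats, dim=1)                       # (B, sum_C, H, W)

class PixelClassifier(nn.Module):
    """Per-pixel MLP label predictor over the concatenated features."""
    def __init__(self, in_dim, num_classes, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, feat_map):                         # (B, C, H, W)
        b, c, h, w = feat_map.shape
        logits = self.mlp(feat_map.permute(0, 2, 3, 1).reshape(-1, c))
        return logits.reshape(b, h, w, -1).permute(0, 3, 1, 2)
```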
EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models
we leverage the semantic knowledge extracted from Stable Diffusion (SD) and aim to develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.
MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
Estimate the segmentation from the cross-attention maps of the attribution prompt across multiple timesteps and different layers.
The attribution prompt is not necessarily the best description; borrowing the TI idea, token optimization on some data (i.e., optimizing the attribution prompt's token embedding) works noticeably better.
Diffusion-Guided Weakly Supervised Semantic Segmentation
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models
training-free
Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation
Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation
Diffusion Features to Bridge Domain Gap for Semantic Segmentation
Unleashing Text-to-Image Diffusion Models for Visual Perception
cross-attention map
EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment
enhanced VPD
Harnessing Diffusion Models for Visual Perception with Meta Prompts
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
Diffuse, Attend, and Segment Unsupervised Zero-Shot Segmentation using Stable Diffusion
DiffSeg utilizes a pre-trained StableDiffusion model and specifically its self-attention layers to produce high quality segmentation masks.
Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter
Design a prompt and feed it with the image into StableDiffusion; use a word's cross-attention map to obtain a rough segmentation of that object, then refine and complete it with the self-attention maps.
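A very rough sketch of this training-free recipe: take one word's cross-attention response as a seed mask and propagate it with (a power of) the averaged self-attention map. The exponent and threshold are illustrative, and the paper's actual refinement is more involved.

```python
import torch

def rough_mask_from_attention(cross_attn, self_attn, word_index, power=4, thresh=0.5):
    """cross_attn: (HW, num_tokens) averaged cross-attention maps
    self_attn:  (HW, HW) averaged self-attention maps
    Returns a flattened binary mask for the object named by token `word_index`."""
    word_map = cross_attn[:, word_index]                 # rough per-pixel response
    word_map = word_map / (word_map.max() + 1e-8)
    # Propagate/complete the response using repeated self-attention.
    refined = torch.matrix_power(self_attn, power) @ word_map
    refined = refined / (refined.max() + 1e-8)
    return (refined > thresh).float()
```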
Attention as Annotation: Generating Images and Pseudo-masks for Weakly Supervised Semantic Segmentation with Diffusion
Without relying on manual annotation, use StableDiffusion to generate large amounts of image and segmentation-map (cross-attention map) data to train a segmentation model.
MaskFactory: Towards High-quality Synthetic Data Generation for Dichotomous Image Segmentation
SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis
Using an existing segmentation dataset (image, segmentation map): encode the segmentation map as a three-channel image, obtain the image's caption with BLIP2, and fine-tune SDXL (caption
Train a ControlNet on (image, segmentation map) pairs to obtain a Mask2Img model.
These two networks can then generate new segmentation training data: take an image from the existing segmentation dataset, caption it with BLIP2, feed the caption into the Text2Mask model to obtain a set of segmentation maps, then feed those into the Mask2Img model to obtain the corresponding images, forming new data pairs.
For the same segmentation model, training on the existing dataset plus the generated data clearly outperforms training on the existing dataset alone.
Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models
Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation
Train a ControlNet on scribbles to generate segmentation training data.
Outline-Guided Object Inpainting with Diffusion Models
Using a small amount of instance segmentation data, apply StableInpainting to create object variations of these data, augmenting the dataset.
Explore In-Context Segmentation via Latent Diffusion Models
ScribbleGen: Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation
Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
Given an image and a text describing some object in it, use the pretrained text2img model StableDiffusion to predict the target object's mask.
Guiding Text-to-Image Diffusion Model Towards Grounded Generation
Using the pretrained text2img model StableDiffusion, while generating the image from text, also output the corresponding segmentation mask.
First generate images with StableDiffusion and use a pretrained object detector to produce their segmentation masks, building a dataset; then train a grounding module on this dataset, similarly to Label-Efficient Semantic Segmentation With Diffusion Models.
Generative Prompt Model for Weakly Supervised Object Localization
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence
exploit Stable Diffusion features for semantic and dense correspondence
Emergent Correspondence from Image Diffusion
No training is needed; Stable Diffusion features can be used directly for matching.
SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching
prompt tuning
Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
for a given feature map
DDIM generation and inversion behave similarly, so the method applies to both synthetic and real images.
For semantic correspondence, we flatten the descriptor maps for a pair of images and compute the cosine similarity between every possible pair of points. We then supervise with the labeled corresponding keypoints using a symmetric cross entropy loss in the same fashion as CLIP.
DiffGlue: Diffusion-Aided Image Feature Matching
SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models
SynthVLM is a novel data synthesis pipeline for VLLMs.
Unlike existing methods that generate captions from images, SynthVLM employs advanced diffusion models and high-quality captions to automatically generate and select high-resolution images from captions, creating precisely aligned image-text pairs.
TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models
Use a layout-to-image model to generate video sequences from tracklets as training data for MOT.
DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction
CycleHOI: Improving Human-Object Interaction Detection with Cycle Consistency of Detection and Generation
EGC: Image Generation and Classification via a Diffusion Energy-Based Model
Energy function; optimization requires second-order derivatives.
Similar to Denoising Likelihood Score Matching for Conditional Score-based Data Generation.
DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability
Unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
Factorized Diffusion Architectures for Unsupervised Image Generation and Segmentation
A model unifying generation and understanding, similar to MAGE.
Duplicate a UNet decoder as the Mask Generator, generating at each step
Can also segment real images: just add noise for one step and denoise for one step.
Unseen Image Synthesis with Diffusion Models
Use a diffusion model pretrained on one domain to generate out-of-domain samples.
DDIM-invert 2k OOD samples to step 500, obtaining 2k
IMPUS: Image Morphing with Perceptually-Uniform Sampling Using Diffusion Models
Goal: given two images, interpolate between them.
Run TI on SD for each image to obtain the two images' text embeddings.
LoRA fine-tune SD with the two text embeddings above.
Use
Interpolate the text embeddings and generate with CFG.
DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion
Generates an interpolation sequence (a video).
AID: Attention Interpolation of Text-to-Image Diffusion
Interpolate the cross-attention K/V of the two images' generation processes and use the result to replace the cross-attention K/V in the generation at the current interpolation point.
NoiseDiffusion: Correcting Noise for Image Interpolation with Diffusion Models beyond Spherical Linear Interpolation
For diffusion-model-generated images, DDIM Inversion + slerp interpolation works well, but it does not for real images; correcting the noise with a few techniques resolves this.
Prompt Mixing in Diffusion Models using the Black Scholes Algorithm
Generation with prompt interpolation.
Concept-centric Personalization with Large-scale Diffusion Priors
A new task: personalize StableDiffusion into a model dedicated to generating a certain concept. Unlike TI, it focuses on a more abstract concept (e.g., human faces) rather than the concept in a few reference images, and emphasizes fidelity and diversity in the generative results, so at least thousands of images of the concept are required.
The approach separates the concept from other control conditions: fine-tune StableDiffusion on the provided concept dataset (always with null text) to obtain a concept-centric diffusion model, generate with CFG, and introduce other controls also through CFG, e.g., text and ControlNet.
Neural Network Diffusion
parameter autoencoder + latent diffusion model
Diffusion-based Neural Network Weights Generation
Conditional LoRA Parameter Generation
FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes
Fine-tune large pre-trained diffusion models scaling to large-scale fine-grained image generation with 10,000 categories.
A new form of CFG: replace the null embedding with the superclass label embedding during both training and sampling.
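The sampling-time combination implied by this note, written out explicitly (a sketch; `eps_fine`/`eps_super` denote the noise predictions under the fine-grained class embedding and the superclass embedding, and the scale is illustrative):

```python
def superclass_cfg(eps_fine, eps_super, guidance_scale=3.0):
    # Standard CFG, except the "unconditional" branch is conditioned on the superclass
    # embedding rather than the null embedding, so guidance pushes the sample toward
    # its fine-grained class within the superclass.
    return eps_super + guidance_scale * (eps_fine - eps_super)
```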
Factorized Diffusion: Perceptual Illusions by Noise Decomposition
Generates illusions.
Through different decomposition methods (e.g., high/low frequencies, color, motion), the image