One can estimate the mean of a Gaussian distribution, given a random variable
In diffusion models, Tweedie's formula is used to estimate the posterior mean of the clean data from a noisy sample and the score.
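For reference, a minimal statement of the formula and its usual diffusion-model form (standard DDPM notation assumed):
```latex
% Tweedie: for z ~ N(\mu, \sigma^2 I),  E[\mu | z] = z + \sigma^2 \nabla_z \log p(z).
% With x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon this gives the x_0 estimate:
\hat{x}_0 \;=\; \frac{x_t + (1-\bar{\alpha}_t)\,\nabla_{x_t}\log p(x_t)}{\sqrt{\bar{\alpha}_t}}
\;=\; \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}.
```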
Cascaded Diffusion Models for High Fidelity Image Generation
Matryoshka Diffusion Models
All resolutions are trained jointly; during training, the same noising timestep is used for every resolution of a given sample to avoid information leakage. The noise schedule shift proposed in SimpleDiffusion is also used.
Similar to ProgressiveGAN, training starts at low resolution, then the UNet is gradually widened and more loss terms are added to train higher resolutions; when training at high resolution, the low-resolution network is still trained jointly.
Diffusion Models Beat GANs on Image Synthesis
AdaGN: the class is mapped to a fixed dimension and added to the time embedding; an MLP then predicts the per-channel scale and shift applied after GroupNorm.
super resolution model: the low-resolution image is upsampled and concatenated channel-wise with the noisy input as conditioning.
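A minimal PyTorch sketch of the AdaGN idea described above (module and argument names are my own, not taken from the ADM code):
```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """GroupNorm whose scale/shift are predicted from a (time + class) embedding."""
    def __init__(self, channels, emb_dim, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        # MLP predicting a per-channel scale and shift from the embedding
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(emb_dim, 2 * channels))

    def forward(self, h, emb):
        # emb = time embedding + class embedding (summed upstream)
        scale, shift = self.to_scale_shift(emb).chunk(2, dim=1)
        h = self.norm(h)
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```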
AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation
AsCAN is a hybrid architecture, combining both convolutional and transformer blocks.
Structured Denoising Diffusion Models in Discrete State-Spaces
Discrete Diffusion Model
Glauber Generative Model: Discrete Diffusion Models via Binary Classification
In D3PM's forward process, every step randomly corrupts all tokens; GGM corrupts only one token per step, so the reverse process only needs to predict a single token at each step.
Randomly sample a sequence of length
ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis
VQGAN + multinomial diffusion
Transformer Encoder:
Vector Quantized Diffusion Model for Text-to-Image Synthesis
VQVAE + multinomial diffusion
Transformer Blocks: input
Mitigating Embedding Collapse in Diffusion Models for Categorical Data
While jointly learning the embedding (via reconstruction loss) and the latent diffusion model (via score matching loss) could enhance performance, our analysis shows that end-to-end training risks embedding collapse, degrading generation quality. To address this issue, we introduce CATDM, a continuous diffusion framework within the embedding space that stabilizes training.
High-Resolution Image Synthesis with Latent Diffusion Models
AutoEncoder:
The AutoEncoder is frozen, and a UNet-based DDPM is trained to model the latent space of the dimension-reduced data. Benefits: less computation; the image structure is preserved, so the UNet's inductive bias still helps; and a reusable latent space is learned.
slight regularization: KL or VQ, to avoid high-variance latent spaces.
Cross-attention layers are injected into the UNet, placed after self-attention. Q comes from the UNet feature map, while K and V come from the condition encoder's output (e.g., text embeddings).
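A schematic cross-attention block in the spirit of the LDM conditioning described above (a sketch with illustrative names, not the original implementation):
```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Q from UNet feature tokens, K/V from the condition (e.g., text) embeddings."""
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(cond_dim, dim, bias=False)
        self.to_v = nn.Linear(cond_dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, cond):
        # x: (B, N, dim) flattened UNet features, cond: (B, M, cond_dim)
        B, N, D = x.shape
        q = self.to_q(x).view(B, N, self.heads, D // self.heads).transpose(1, 2)
        k = self.to_k(cond).view(B, -1, self.heads, D // self.heads).transpose(1, 2)
        v = self.to_v(cond).view(B, -1, self.heads, D // self.heads).transpose(1, 2)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))
```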
Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models
three stages to reduce computational demands
Stage A: train a VQGAN with 4x downsampling,
Semantic Compressor: the image is resized from 1024 to 768, and a network is trained to compress it to
Stage B: a diffusion model is trained on the pre-quantization embeddings of the image from stage A, conditioned on the semantic compressor's output for that image (Wuerstchen additionally conditions on text), which amounts to self-conditioning.
Stage C: a diffusion model is trained on the semantic compressor's output of the image, conditioned on text.
At generation time, stage C first generates the compressed semantic latent from text, stage B then generates the stage-A latent conditioned on it, and the stage-A decoder reconstructs the image.
Binary Latent Diffusion
Same idea as LDM, except the latent is binarized.
Following the idea of VQ, Bernoulli sampling replaces VQ's nearest-neighbor lookup, and a binary AutoEncoder with binarized latents is trained:
Derive a Bernoulli diffusion process and model it with a DPM
Simple Diffusion: End-to-end diffusion for high resolution images
Existing high-resolution diffusion models come in two flavors: StableDiffusion's latent-space (dimensionality-reduction) approach and the coarse-to-fine cascaded super-resolution approach. SimpleDiffusion uses the following techniques to train high-resolution diffusion models directly in pixel space.
Adjust the noise schedule: the observation is that at the same noise level, a high-resolution image still retains much of its global structure, so the schedule is shifted to add more noise at high resolution (noise schedule shift).
Multi-scale training: one difficulty of training high-resolution diffusion models directly in pixel space is that high-frequency information (object edges, etc.) is hard to model and dominates the training loss, so the paper proposes a multi-scale training loss:
To address memory and compute issues, network depth (the number of blocks) is added at low-resolution feature maps (the paper chooses 16x16); a downsampling layer is added at the very front of the model and an upsampling layer at the very end to avoid computation at the highest resolution.
Dropout is only applied on low-resolution feature maps.
Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion
Loss Weight: sigmoid shift
Flop Heavy Scaling: grow the token sequence length instead of the model size
Residual U-ViT
On the Scalability of Diffusion-based Text-to-Image Generation
For model scaling, the location and amount of cross-attention distinguish the performance of existing UNet designs, and increasing the number of transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel count.
On the data scaling side, the quality and diversity of the training set matter more than simple dataset size. Increasing caption density and diversity improves text-image alignment and learning efficiency.
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
When operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results.
Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction
We recommend diffusion for applications targeting image quality and low latency; and next-token prediction when prompt following or throughput is more important.
Scaling Laws For Diffusion Transformers
Relay Diffusion: Unifying diffusion process across resolutions for image synthesis
Cascaded models perform better than end-to-end models under a fair setting. RDM also uses a cascaded scheme, but unlike traditional cascaded models, RDM cascades over timesteps, which reduces the number of training and sampling steps.
Both the low- and high-resolution stages use the EDM formulation, i.e.,
In order to
The low-resolution generated
Note that the blurring diffusion is trained with Block Noise for corruption (rather than directly sampling
All are Worth Words: A ViT Backbone for Diffusion Models
ViT in Pixel Space
Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models
RWKV improves on the standard RNN architecture: it is computed in parallel during training while running like an RNN at inference. It enhances the linear attention mechanism and designs the receptance-weighted key-value (RWKV) mechanism.
Scalable Diffusion Models with Transformers
ViT in Latent Space
adaLN: instead of learning the scale and shift inside LayerNorm, an extra MLP (one per block) predicts a scale and shift from the timestep and condition.
adaLN-Zero: an additional predicted scale is multiplied in before the skip-connection, and the MLP is initialized so that this scale outputs zero, making each residual block an identity mapping at initialization.
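A rough sketch of adaLN-Zero as described above (simplified to a single sub-block; the real DiT block modulates both the attention and MLP branches):
```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """One residual sub-block with adaLN-Zero modulation; the gate is zero-initialized."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)  # no learned scale/shift
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.modulation = nn.Linear(cond_dim, 3 * dim)
        nn.init.zeros_(self.modulation.weight)   # adaLN-Zero: block starts as identity
        nn.init.zeros_(self.modulation.bias)

    def forward(self, x, cond):
        # cond = timestep embedding (+ class embedding)
        shift, scale, gate = self.modulation(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift
        return x + gate * self.mlp(h)            # gate == 0 at init => identity mapping
```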
Scaling Diffusion Transformers to 16 Billion Parameters
In DiT, every
Increasing
Besides the diffusion loss, an additional balance loss is added to avoid imbalanced experts.
EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing
MoE
Dynamic Diffusion Transformer
slimmable neural network; a finer-grained form of MoE.
PoM: Efficient Image and Video Generation with the Polynomial Mixer
We propose a drop-in replacement for MHA called the Polynomial Mixer (PoM) that has the benefit of encoding the entire sequence into an explicit state. PoM has a linear complexity with respect to the number of tokens.
U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers
The encoder processes the input image with stage-wise downsampling, and the decoder scales the encoded representation back up from the most compressed stage to the input size. At each encoder stage transition, spatial downsampling by a factor of
Similar to ToDo, token downsampling is used to reduce computation, but without losing information: the
Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts
MoE is introduced: each block uses a timestep-based gating network to predict a probability distribution and takes the TopK experts, achieving parameter isolation and mitigating conflicts between different timesteps.
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
design space + DiT
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
Every block of a vanilla Transformer has the same computational cost; here the UNet idea is applied to the Transformer: the middle layers operate on fewer tokens, reducing computation.
Following SimpleDiffusion, less computation is done at high resolution, and the network is made deeper and wider at low resolution.
As a result, the pixel-level computational complexity grows only linearly with image resolution.
EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching
A combination of HDiT and U-ViT.
Inspired by how humans sketch: first draw the whole (global), then a local region, then check whether the whole is coherent, then pick another local region to refine, and so on. AMM is a mask computed from distances between tokens that turns global attention into local attention.
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
U-ViT
Transformers trade off computation against quality: a small patch size means a long token sequence and high compute but better results, while a large patch size means a short token sequence and less compute but worse results.
feature cascade: divided into
A U-ViT architecture is used at low resolution, and ConvNeXt is used at high resolution to save computation.
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
Flag-DiT substitutes all LayerNorm with RMSNorm to improve training stability. Moreover, it incorporates key-query normalization (KQ-Norm) before key-query dot product attention computation. The introduction of KQ-Norm aims to prevent loss divergence by eliminating extremely large values within attention logits.
We introduce learnable special tokens including the [nextline] and [nextframe] tokens to transform training samples with different scales and durations into a unified one-dimensional sequence. We add [PAD] tokens to transform 1-D sequences into the same length for better parallelism.
Since data of different modalities are all flattened into a single 1D sequence and modeled uniformly with 1D RoPE, [nextline] and [nextframe] tokens are needed. If images are the only modality, 2D RoPE can be used and the [nextline] token becomes unnecessary. Essentially, the [nextline] and [nextframe] tokens compensate for the positional information lost when flattening high-dimensional modalities into 1D.
When text is present, self-attention and cross-attention are placed in parallel.
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations.
NTK
Token Merge
FiT: Flexible Vision Transformer for Diffusion Model
ViT in Latent Space
No cropping; the aspect ratio is preserved, and the image is resized so that
2D RoPE positional encoding is used; its extrapolation ability allows generating images at arbitrary resolutions and aspect ratios.
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
ViT in Latent Space
a vision transformer architecture similar to LLaMA to reduce the architectural differences between language and vision.
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention
ViT in Latent Space
DiT models have faced challenges with scalability and quadratic complexity efficiency. We leverage the long sequence modeling capability of Gated Linear Attention (GLA) Transformers, expanding its applicability to diffusion models and offering superior efficiency and effectiveness.
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
By fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
MonoFormer: One Transformer for Both Diffusion and Autoregression
Causal Diffusion Transformers for Generative Modeling
ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
Neural Residual Diffusion Models for Deep Scalable Vision Generation
Scalable Diffusion Models with State Space Backbone
Diffusion Models Without Attention
ZigMa: Zigzag Mamba Diffusion Model
DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis
Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation
Dimba: Transformer-Mamba Diffusion Models
LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba
LinFusion: 1 GPU, 1 Minute, 16K Image
For conventional attention-based backbones (SD, DiT, etc.), computation grows quadratically as resolution, i.e., token length, increases.
Drawing on Mamba2, RWKV6, GLA, etc., we introduce a generalized linear attention paradigm whose computation grows linearly with resolution (token length).
The model is trained by distillation from SD and achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity.
It also extrapolates to zero-shot cross-resolution generation.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
For each training sample, we start with an input-target pair
Infinite-Diff: Infinite Resolution Diffusion with Subsampled Mollified States
Infinite-Diff is a generative diffusion model defined in an infinite dimensional Hilbert space, which can model infinite resolution data. By training on randomly sampled subsets of coordinates and denoising content only at those locations, we learn a continuous function for arbitrary resolution sampling.
Diffusion Models Need Visual Priors for Image Generation
DoD enhances diffusion models by recurrently incorporating previously generated samples as visual priors to guide the subsequent sampling process.
We propose the Latent Embedding Module (LEM) that filters the conditional information using a compression-reconstruction approach to discard redundant details. We reasonably assume that the high-level semantic information extracted from generated images is similar to that obtained from real images. This assumption allows us to use the latents of ground truth images as inputs to LDM during training, simplifying the training strategy. Such simplification allows end-to-end training of DoD on image latents and joint optimization of the backbone model and LEM.
Image Neural Field Diffusion Models
Neural field is also known as Implicit Neural Representations (INR), which represents signals as coordinate-based neural networks.
An Image Neural Field Autoencoder is proposed so that there is a latent distribution that can be modeled and sampled.
Similar to Diff-AE and PDAE, a diffusion model is used to model the latent distribution.
Condition-Aware Neural Network for Controlled Image Generation
In conventional conditional models, all conditions share the same static condition-processing network, which limits modeling capacity. One solution is a separate expert model per condition, but that is extremely costly. Instead, a generator network is learned that dynamically produces the parameters of the condition-processing network from the condition: it introduces a condition-aware weight generation module that generates conditional weights for convolution/linear layers based on the input condition.
Making depthwise convolution layers, the patch embedding layer, and the output projection layers condition-aware brings a significant performance boost.
CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT, on class conditional image generation on ImageNet and text-to-image generation on COCO.
D2C: Diffusion-Decoding Models for Few-Shot Conditional Generation
A diffusion model is used to model the autoencoder's latent distribution.
Tackling the Generative Learning Trilemma with Denoising Diffusion GANs
Addresses the case where the original diffusion formulation is no longer trainable; see Chapter 6 of the theory notes.
When the step size is large,
A discriminator is used to distinguish
The reverse process of DDPMs can also be interpreted as
GANs are known to suffer from training instability and mode collapse, and some possible reasons include the difficulty of directly generating samples from a complex distribution in one-shot, and the overfitting issue when the discriminator only looks at clean samples. Our model breaks the generation process into several conditional denoising diffusion steps in which each step is relatively simple to model, due to the strong conditioning on
DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents
epsilon-VAE: Denoising as Visual Decoding
Boosting Latent Diffusion with Flow Matching
LDM inference cost grows quadratically with image resolution.
Flow Matching is used to model the mapping between upsampled low-resolution latents and high-resolution latents: a low-resolution LDM generates the sample, and Flow Matching lifts it to high resolution.
Ordinary Flow Matching is defined between the data distribution and a Gaussian; here it is defined between data pairs, hence Coupling Flow Matching.
Ultra-High-Resolution Image Synthesis with Pyramid Diffusion Model
Progressively generates high-resolution images, similar in spirit to FM-Boosting.
High-Resolution Image Synthesis with Latent Diffusion Models
LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models
We leverage the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality.
Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models
Simple Diffusion: End-to-end diffusion for high resolution images
BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion
compact UNet: fewer blocks in the down and up stages (Base), further removal of the entire mid-stage (Small), further removal of the innermost stages (Tiny).
distillation-based retraining: besides the diffusion loss, the pretrained large StableDiffusion is used for output-level distillation (MSE loss between outputs given the same input) and feature-level distillation (MSE loss between network features given the same input).
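A hedged sketch of the two distillation terms described above (the `return_features` hook and which features are matched are illustrative assumptions, not BK-SDM's exact code):
```python
import torch
import torch.nn.functional as F

def distill_losses(student, teacher, x_t, t, cond, feat_pairs):
    """Output-level + feature-level distillation on the same noisy input."""
    with torch.no_grad():
        t_out, t_feats = teacher(x_t, t, cond, return_features=True)   # assumed hook API
    s_out, s_feats = student(x_t, t, cond, return_features=True)
    out_kd = F.mse_loss(s_out, t_out)                                   # output-level KD
    feat_kd = sum(F.mse_loss(s_feats[i], t_feats[j]) for i, j in feat_pairs)
    return out_kd, feat_kd
```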
KOALA: Fast and Memory-Efficient Latent Diffusion Models via Self-Attention Distillation
Same procedure as BK-SDM.
Feature-level distillation is further refined: distilling features from different modules was tested, and distilling the self-attention outputs works best, with the self-attention features at the decoder's early blocks being the most effective.
DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention
BLAST: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference
Scalable, Tokenization-Free Diffusion Model Architectures with Efficient Initial Convolution and Fixed-Size Reusable Structures for On-Device Image Generation
SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices
lightweight model architecture + DiffusionGAN + distillation
Spiking Diffusion Models
Spiking-Diffusion: Vector Quantized Discrete Diffusion Model with Spiking Neural Networks
Spiking Denoising Diffusion Probabilistic Models
Fully Spiking Denoising Diffusion Implicit Models
SDiT: Spiking Diffusion Model with Transformer
AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration
Lightweight Diffusion Models with Distillation-Based Block Neural Architecture Search
Denoising Diffusion Step-aware Models
Different steps have different importance; there is no need to use a large model at every step.
slimmable network: a neural network that can be executed at arbitrary model sizes.
Search for the optimal sampling strategy that uses models of different sizes at different steps, reducing computation.
ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection
A theoretical analysis, from the perspective of feature norms, of how the coefficients on the UNet's long skip connections affect training.
Q-DM: An Efficient Low-bit Quantized Diffusion Model
Diffusion Models Without Attention
Analyzing and Improving the Training Dynamics of Diffusion Models
We update all of the operations (e.g., convolutions, activations, concatenation, summation) to maintain magnitudes on expectation.
ReDistill: Residual Encoded Distillation for Peak Memory Reduction
Reducing peak memory, which is the maximum memory consumed during the execution of a neural network, is critical to deploy neural networks on edge devices with limited memory budget. We propose residual encoded distillation (ReDistill) for peak memory reduction in a teacher-student framework, in which a student network with less memory is derived from the teacher network using aggressive pooling.
Quantum Denoising Diffusion Models
Quantum Generative Diffusion Model
Towards Efficient Quantum Hybrid Diffusion Models
Quantum Hybrid Diffusion Models for Image Synthesis
Enhancing Quantum Diffusion Models with Pairwise Bell State Entanglement
Mixed-State Quantum Denoising Diffusion Probabilistic Model
Quantum computing.
Optical Diffusion Models for Image Generation
Optics.
Post-training Quantization on Diffusion Models
Q-Diffusion: Quantizing Diffusion Models
BiDM: Pushing the Limit of Quantization for Diffusion Models
Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models
Efficient Quantization Strategies for Latent Diffusion Models
Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion
TerDiT: Ternary Diffusion Models with Transformers
VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers
PTQ4DiT: Post-training Quantization for Diffusion Transformers
Diffusion Product Quantization
DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing
HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
StableQ: Enhancing Data-Scarce Quantization with Text-to-Image Data
BinaryDM: Towards Accurate Binarization of Diffusion Model
Towards Accurate Post-training Quantization for Diffusion Models
StepbaQ: Stepping backward as Correction for Quantized Diffusion Models
TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models
EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models
Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models
Memory-Efficient Personalization using Quantized Diffusion Model
fine-tune quantized diffusion model
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization
MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models
COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization
QNCD: Quantization Noise Correction for Diffusion Models
Timestep-Aware Correction for Quantized Diffusion Models
TCAQ-DM: Timestep-Channel Adaptive Quantization for Diffusion Models
Temporal Feature Matters: A Framework for Diffusion Model Quantization
Structural Pruning for Diffusion Models
Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection
Successfully Applying Lottery Ticket Hypothesis to Diffusion Model
LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models
LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights
LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging
DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization
Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion
Singular Value Scaling: Efficient Generative Model Compression via Pruned Weights Refinement
DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture
The student is trained so that its output (predicted noise) aligns with the teacher's output.
Improved Denoising Diffusion Probabilistic Models
They achieve similar sample quality using either
Besides the diffusion loss, the variational lower-bound term is additionally optimized (to learn the reverse-process variance)
sampling
Progressive Distillation for Fast Sampling of Diffusion Models
v-prediction, which in theory is equivalent to
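For reference, the v-parameterization as defined in the Progressive Distillation paper (variance-preserving setting, $x_t=\alpha_t x_0+\sigma_t\epsilon$ with $\alpha_t^2+\sigma_t^2=1$):
```latex
v_t = \alpha_t\,\epsilon - \sigma_t\,x_0,
\qquad
\hat{x}_0 = \alpha_t\,x_t - \sigma_t\,\hat{v}_\theta(x_t,t),
\qquad
\hat{\epsilon} = \sigma_t\,x_t + \alpha_t\,\hat{v}_\theta(x_t,t).
```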
Flow Matching for Generative Modeling
Built on Continuous Normalizing Flows (Neural ODEs). Training CNFs requires transforming data samples with the model (ODE simulations) and computing the KL divergence between the transformed samples and a standard Gaussian; flow matching is simulation-free because the ODE path is defined in advance.
Diffusion models and score-based models, with respect to
We find this training alternative to be more stable and robust than existing score-matching approaches in our experiments.
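The conditional flow matching objective, written here for the common linear interpolation path as a reference for the simulation-free claim above (notation is mine):
```latex
\mathcal{L}_{\mathrm{CFM}}(\theta)
 = \mathbb{E}_{t,\,x_0\sim p_0,\,x_1\sim p_1}
   \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2,
\qquad
x_t = (1-t)\,x_0 + t\,x_1 .
```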
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Can model a mapping between any two distributions.
Randomly pair data drawn from the two distributions, interpolate linearly between them, and regress the constant velocity along the interpolation path.
After each round of training, sample from the current flow to obtain coupled data, then retrain on these couplings; iterating this correction ("reflow") straightens the trajectories into non-crossing, straight flows.
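A compact sketch of one rectified-flow training step and the reflow re-pairing loop described above (`model` and its call signature are assumptions; simplified Euler integration):
```python
import torch
import torch.nn.functional as F

def rf_training_step(model, x0, x1):
    """x0 ~ source distribution (e.g., noise), x1 ~ target distribution (data)."""
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1                 # linear interpolation path
    target_v = x1 - x0                          # constant velocity along the path
    return F.mse_loss(model(x_t, t.flatten()), target_v)

@torch.no_grad()
def reflow_pairs(model, x0, num_steps=100):
    """Simulate the current flow to build straighter couplings (x0, x1_hat) for the next round."""
    x, dt = x0.clone(), 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * model(x, t)                # Euler step along the learned ODE
    return x0, x                                # retrain on these couplings (reflow)
```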
Elucidating the Design Space of Diffusion-Based Generative Models
preconditioning: As the input
augmentation: To prevent potential overfitting that often plagues diffusion models with smaller datasets, we apply various geometric transformations to a training image prior to adding noise. To prevent the augmentations from leaking to the generated images, we provide the augmentation parameters as a conditioning input to
Variational Diffusion Models
efficient optimization of the noise schedule jointly with the rest of the model
Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation
DiffEnc: Variational Diffusion with a Learned Encoder
Image generation with shortest path diffusion
Improved Denoising Diffusion Probabilistic Models
Unlike the linear schedule, the cosine schedule directly defines \bar{\alpha}_t via a cosine function.
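The cosine schedule from Improved DDPM, for reference:
```latex
\bar{\alpha}_t = \frac{f(t)}{f(0)},
\qquad
f(t) = \cos\!\Big(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\Big)^{2},
\quad s = 0.008,
\qquad
\beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}\;(\text{clipped at }0.999).
```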
Improved Noise Schedule for Diffusion Training
Common Diffusion Noise Schedules and Sample Steps are Flawed
In the noise schedule used by StableDiffusion, the final noising step is given by
To fix this, the schedule must have zero terminal SNR, i.e.,
After switching to a noise schedule with zero terminal SNR,
Combining the two points above, the existing StableDiffusion can be fine-tuned with the corrected noise schedule and v-prediction, with consistent results.
Rescale Classifier-Free Guidance: with a zero-terminal-SNR noise schedule, the original CFG becomes sensitive and causes over-exposed images, so the CFG output is rescaled.
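A hedged sketch of the guidance-rescale step described above, as I understand it from the paper (the blending weight `phi` is illustrative):
```python
import torch

def rescaled_cfg(noise_cond, noise_uncond, guidance_scale, phi=0.7):
    """Classifier-free guidance with std-based rescaling to reduce over-exposure."""
    noise_cfg = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    dims = list(range(1, noise_cond.ndim))
    std_cond = noise_cond.std(dim=dims, keepdim=True)
    std_cfg = noise_cfg.std(dim=dims, keepdim=True)
    noise_rescaled = noise_cfg * (std_cond / std_cfg)     # match the conditional std
    return phi * noise_rescaled + (1 - phi) * noise_cfg   # partial rescale
```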
Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models
Existing models cannot generate very bright or very dark images.
Because existing models lack zero terminal SNR, they are essentially
At sampling time the first step starts from a Gaussian and samples a
Score-Optimal Diffusion Schedules
Perception Prioritized Training of Diffusion Models
Debias the Training of Diffusion Models
Similar to P2-weighting.
Uses
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training
Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training.
We design an asymmetric time step sampling strategy that reduces the frequency of time steps from the convergence area while increasing the sampling probability for time steps from other areas.
Beta-Tuned Timestep Diffusion Model
The distribution variations are non-uniform throughout the diffusion process and the most drastic variations in distribution occur in the initial stages.
We propose a novel timestep sampling strategy that utilizes the beta distribution.
B-TTDM not only improves the quality of the generated samples but also speedups the training process.
Adaptive Non-uniform Timestep Sampling for Accelerating Diffusion Model Training
Multi-Architecture Multi-Expert Diffusion Models
Addressing Negative Transfer in Diffusion Models
eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Not All Steps are Equal: Efficient Generation with Progressive Diffusion Models
Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture
Denoising Task Routing for Diffusion Models
Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising
Because a diffusion model must handle all noise levels, it needs a large number of parameters.
Different timesteps (experts) use different specialized networks (architectures), lowering the learning difficulty and reducing parameter count.
Decouple-Then-Merge: Towards Better Training for Diffusion Models
Split the timesteps evenly into
Dynamic Dual-Output Diffusion Models
Bring Metric Functions into Diffusion Models
Denoising Task Difficulty-based Curriculum for Training Diffusion Models
Different timesteps of a diffusion model have different learning difficulty. Timesteps are split evenly into 20 intervals and a separate model is trained on each (20 in total), and their convergence speeds are compared. Both in loss and in generation quality (via mixed sampling: a normally trained diffusion model is used, and the interval-specific model is substituted only within its interval), larger timesteps converge faster.
Curriculum Learning: a method of training models in a structured order, starting with easier tasks or examples and gradually increasing difficulty. Accordingly, after partitioning the timesteps, training starts from the last (largest-timestep) interval and proceeds toward earlier intervals; at each stage the previously trained intervals are still included to avoid forgetting.
Faster convergence and better generation quality.
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Somewhat similar to MaskDiT's MAE loss; it improves generation quality.
Masked Diffusion Models are Fast Learners
Uses a U-ViT architecture (pixel space) with mask ratios up to 90%; converges 4x faster than DDPM with better generation quality.
Masked Diffusion Transformer is a Strong Image Synthesizer
Uses a DiT architecture (latent space). To address the distribution shift between training (masked) and inference (unmasked), a side-interpolater fills in the masked tokens during training; converges 3x faster than DiT with better generation quality.
Fast Training of Diffusion Models with Masked Transformers
Uses a DiT architecture (latent space); the DiT encoder can be scaled up while the DiT decoder uses a fixed
Predicting the score of invisible tokens from visible tokens alone is too difficult, so the diffusion loss is split: the diffusion loss is applied to visible tokens, while invisible tokens use an MSE loss against the corresponding noisy patches (note: the target is the noisy invisible patch itself, not its noise or the clean image), similar to MaskDM + MAE.
With mask ratios up to 50%, it converges 3x faster than DiT while reaching the same generation quality.
The MAE part is essential: without it, generation quality drops substantially, but if the MAE loss coefficient is too large it also hurts generation, so the coefficient must be chosen carefully. Without the MAE reconstruction task, the training easily overfits the local subset of unmasked tokens as it lacks a global understanding of the full image, making the gradient update less informative. Understanding aids generation.
SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer
The DiT encoder can be scaled up while the DiT decoder uses a fixed
Instead of inserting learnable mask tokens into the DiT decoder's input, the invisible patches themselves are inserted; the diffusion loss is computed on all patches, rather than predicting invisible patches only from visible tokens as in MaskDiT.
This removes the MAE component, and without the understanding signal it can no longer aid generation, so a self-distillation module is introduced: the last-layer output at each encoder token is passed through an MLP + softmax to predict a
Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models
For
EDM only sees local patches and may have not captured the global cross-region dependency between local patches, in other words, the learned scores from nearby patches should form a coherent score map to induce coherent image sampling. To resolve this issue, we propose two strategies: 1) random patch sizes and 2) involving a small ratio of full-size images.
At sampling time, patches are sampled separately and stitched together.
Through Patch Diffusion, we could achieve
Patched Denoising Diffusion Models For High-Resolution Image Synthesis
Rather than using entire complete images for training, our model only takes patches for training and inference and uses feature collage to systematically combine partial features of neighboring patches.
During training and inference, the first approach is to
Some tasks are themselves naturally a process, and different Markov transition chains can be designed for training.
ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories
Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing
GUD: Generation with Unified Diffusion
The choice of representation in which the diffusion process operates (e.g. pixel-, PCA-, Fourier-, or wavelet-basis).
The prior distribution that data is transformed into during diffusion.
The scheduling of noise levels applied separately to different parts of the data, captured by a component-wise noise schedule.
CARD: Classification and Regression Diffusion Models
The formulation is similar to PriorGrad; the diffusion model outputs the regression value or the classification probabilities.
ExposureDiffusion: Learning to Expose for Low-light Image Enhancement
Models the camera exposure process as a diffusion process.
Residual Denoising Diffusion Models
Similar to ResShift.
Beta Diffusion
Uses Beta distributions and optimizes KL-divergence upper bounds.
Fast Diffusion Model
Connects diffusion with SGD and introduces momentum to speed up training and sampling.
Directly Denoising Diffusion Model
DDDMs train the diffusion model conditioned on an estimated target that was generated from previous training iterations of its own.
Define
DeeDiff: Dynamic Uncertainty-Aware Early Exiting for Accelerating Diffusion Model Generation
early exiting策略:The basic assumption of early exiting is that the input samples of the test set are divided into easy and hard samples. The computation for easy samples terminates once some conditions are satisfied.
Using U-ViT, an uncertainty estimation module (UEM) is trained for each layer's output to assess how uncertain that layer would be if used as the final output. The UEM is an MLP predicting a scalar, trained to regress the MSE between the current layer's output and the last layer's output.
At inference, at each sampling step, as soon as some layer's uncertainty falls below a given threshold, that layer's output is used as the final output, achieving acceleration.
Diffusion Model Patching via Mixture-of-Prompts
For each block of a pretrained DiT, an extra set of parameters is trained
The same prompts are used for each block throughout the training, thus they will learn knowledge that is agnostic to denoising stages. To patch the model with stage-specific knowledge, we introduce dynamic gating. This mechanism blends prompts in varying proportions based on the noise level of an input image. A gating network is learned that
Compensation Sampling for Improved Convergence in Diffusion Models
An extra UNet is trained to predict the compensation term.
Slight Corruption in Pre-training Data Makes Better Diffusion Models
Similar to CADS in that the condition is perturbed, but CADS operates only at sampling time whereas CEP operates during training.
Preliminary experiments: to introduce synthetic corruption into the conditions, we randomly flip the class label into a random class for IN-1K, and randomly swap the text of two sampled image-text pairs for CC3M. As a result, class- and text-conditional models pre-trained with slight corruption achieve significantly lower FID and higher IS and CLIP score. More corruption in pre-training can potentially lead to quality and diversity degradation: as the corruption level increases, almost all metrics first improve and then degrade; however, the degraded metrics with more corruption are sometimes still better than those of the clean models.
More generally, we propose to directly add perturbation to the conditional embeddings of DMs, termed conditional embedding perturbation. The approach adds to the condition embedding a noise following a
Structure-Guided Adversarial Training of Diffusion Models
Besides the diffusion loss, within each batch
Using a pretrained encoder network would lead to shortcuts, so adversarial training is introduced: the encoder is trained to maximize the discrepancy between the two distances above (effectively distinguishing fake from real manifold structure).
Improving Diffusion-Based Image Synthesis with Context Prediction
Besides the conventional diffusion loss (self-denoising), which uses
At sampling time only the self-denoising network is used, the same as a conventional diffusion model.
Learning Quantized Adaptive Conditions for Diffusion Models
An autoencoder similar to Diff-AE and PDAE, except that a BSQ code is used as the representation, so no post-training modeling of the latent is needed; although such a condition cannot fully reconstruct the image, it at least provides some information.
At sampling time a random binary vector code is used as the condition, which speeds up sampling and improves sample quality.
Training Data Synthesis with Difficulty Controlled Diffusion Model
Similar to how CAD feeds coherence as an extra condition into diffusion training, here difficulty is fed as an extra condition, allowing controlled generation of images of different complexity.
The Disappearance of Timestep Embedding in Modern Time-Dependent Neural Networks
Explores better ways of injecting the timestep embedding.
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators
This paper identifies significant redundancy in the query-key interactions within self-attention mechanisms of diffusion transformer models, particularly during the early stages of denoising diffusion steps.
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately
Diffusion Tuning: Transferring Diffusion Models via Chain of Forgetting
Chain of Forgetting:
During transfer, some data from the original dataset also participates in training. For the original-dataset data, the diffusion loss coefficient varies with
Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models
Multi-scale Generative Modeling for Fast Sampling
We propose a multi-scale generative modeling in the wavelet domain that employs distinct strategies for handling low and high-frequency bands. In the wavelet domain, we apply score-based generative modeling with well-conditioned scores for low-frequency bands, while utilizing a multi-scale generative adversarial learning for high-frequency bands.
Learning to Efficiently Sample from Diffusion Probabilistic Models
Learning to Schedule in Diffusion Probabilistic Models
AdaDiff: Adaptive Step Selection for Fast Diffusion
A set of candidate step counts is predefined; a lightweight step-selection network is trained to pick a step count from the set based on the text embedding, the generation result is scored, and the network is optimized with policy gradients.
An extra loss encourages small step counts.
BudgetFusion: Perceptually-Guided Adaptive Diffusion Models
Collect a set of prompts; for each prompt, use the same
Jump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models
Censored Sampling of Diffusion Models Using 3 Minutes of Human Feedback
Train a time-dependent reward model and, at sampling time, use the score of the time-dependent reward function as guidance.
Align Your Steps: Optimizing Sampling Schedules in Diffusion Models
Learning to Discretize Denoising Diffusion ODEs
Directly optimizing
Optimizing Few-step Sampler for Diffusion Probabilistic Model
Beta Sampling is All You Need: Efficient Image Generation Strategy for Diffusion Models using Stepwise Spectral Analysis
Certain steps exhibit significant changes in image content, while others contribute minimally, measured from adjacent timesteps'
Instead of the traditional uniform distribution-based time step sampling, we introduce a Beta distribution-like sampling technique that prioritizes critical steps in the early and late stages of the process: small step sizes and more steps early and late, larger step sizes and fewer steps in the middle.
Distillation methods fall roughly into five categories: Direct Distillation, Progressive Distillation, Adversarial Distillation, Score Distillation (DI), and Consistency Distillation.
Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed
direct distillation,
Essentially, the teacher is used to construct (noise, sample) training pairs for the student.
Accelerating Diffusion Models with One-to-Many Knowledge Distillation
Distillation is performed separately for different timestep segments.
SDXL
We can significantly improve the quality of direct distillation by (1) scaling up the size of the ODE pair dataset and (2) using a perceptual loss, not MSE loss.
A VGG network is retrained on SDXL's latent space, and the LPIPS loss between the student-generated latent and the teacher-generated latent is optimized.
Besides the LPIPS loss, adversarial training is also used, with a multi-scale discriminator similar to GigaGAN's.
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
Main conclusion: 2-Rectified Flow is a better teacher for distilling a one-step student model than the original SD.
First train a k-Rectified Flow from StableDiffusion, then apply direct distillation to that ReFlow (one step fits many steps).
Progressive Distillation for Fast Sampling of Diffusion Models
The student is trained so that one sampling step matches the effect of multiple teacher sampling steps.
v-prediction: during distillation,
On Distillation of Guided Diffusion Models
Distill a CFG teacher into a student.
Stage 1: train a student with the same number of steps as the teacher, taking the guidance strength as an extra condition; the guidance strength is sampled uniformly at random.
Stage 2: as in PD, iteratively train students with fewer steps.
At sampling time, one can call
SFDDM: Single-fold Distillation for Diffusion models
PD halves the number of steps each round until the target step count is reached (multi-fold); SFDDM does it in a single round (single-fold).
Essentially, the student is just a DDPM with very few steps, except that it is supervised by the teacher DDPM; it is unclear whether this works better than training it directly.
SDXL-Lightning: Progressive Adversarial Diffusion Distillation
During PD, an adversarial loss is used instead of MSE.
The discriminator is
TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation
SpeedUpNet: A Plug-and-Play Hyper-Network for Accelerating Text-to-Image Diffusion Models
An extra trainable cross-attention is added to StableDiffusion to interact with the negative prompt.
Two losses, one for single-step and one for multi-step; are they in conflict?
Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models
Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation
For a given
Forward distillation: based on
Tackling the Generative Learning Trilemma with Denoising Diffusion GANs
Semi-Implicit Denoising Diffusion Models
Improves DDGAN.
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
Improves SIDDM.
You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs
define a sequence distribution of clean data
However, adversarial training directly on clean data cannot avoid the usual difficulties of GAN training. To address this, DDGANs train adversarially on corrupted data, but such an approach fails to directly match
Inspired by self-cooperative learning, YOSO still performs adversarial training on clean data, but uses
Before training, the pretrained diffusion model is first fine-tuned: the first stage switches to v-prediction, the second stage changes the noise schedule to achieve zero terminal SNR; the resulting model is then fine-tuned directly or with LoRA as
HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation
Generate images with StableDiffusion at different step counts, extract low- and high-frequency components with the Fourier transform, recombine different high/low-frequency parts, and invert back to images; this shows that the main reason one-step generation is poor is that its high-frequency content is not good enough.
LoRA-finetune StableDiffusion so that the high-frequency part of the one-step LoRA+SD generation matches the high-frequency part of SD's multi-step generation as closely as possible.
Adversarial Diffusion Distillation
The student network is initialized from the teacher, and the student's number of steps is set to
GAN loss:We use a frozen pretrained feature network and a set of trainable lightweight discriminator heads. The trainable discriminator heads are applied on features at different layers of the feature network.
distillation loss: Notably, the teacher is not directly applied on generations of the ADD-student but instead on diffused inputs, as non-diffused inputs would be out-of-distribution for the teacher model. That is, first sample
Training uses only one-step generation; sampling uses
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
In the VAE latent space, the teacher network generates samples
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
Stabilizes the adversarial training in adversarial distillation.
The discriminator consists of a frozen teacher UNet encoder and trainable lightweight heads; generated images are noised to the same timestep and fed into the discriminator for real/fake classification.
We use a dynamic discriminator pool to source these discriminator heads. The heads are timestep-specific, each responsible for a particular timestep; at each training step a batch of heads (of the same timestep) is randomly drawn from the pool, and after the heads are updated they are returned to the pool. The stochasticity of this process through random sampling ensures varied feedback, preventing any single head from dominating the generator's learning and reducing bias. This diversifies feedback and enhances stability in GAN training.
After each training step, 1% of the heads in the pool are randomly discarded and replaced with the same number of freshly re-initialized heads; refreshing discriminator subsets helps maintain a balance between stable feedback from retained heads and variability from re-initialized ones to enhance generator performance.
SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation
Substitute the NeRF rendering with a text-to-image generator that can directly synthesize a text-guided image in one step, effectively converting the text-to-3D generation training into one-step diffusion model distillation.
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models
We have a pre-trained diffusion model with the multi-level score net denoted as
We aim to train an implicit model
In order to receive supervision from the multi-level score functions
The IKL is tailored to incorporate knowledge of pre-trained diffusion models in multiple time levels. It generalizes the concept of KL divergence to involve all time levels of the diffusion process.
Under the same diffusion process, starting from the two distributions respectively, their IKL is optimized. Taking the gradient of the IKL with respect to
The SDS algorithm is a special case of Diff-Instruct when the generator's output is a Dirac delta distribution with learnable parameters. If
ADD can be seen as a combination of Diff-Instruct and adversarial training.
SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
BK-SDM is used to slim down the student model.
Diff-Instruct is used to train the student model, on
One-step Diffusion with Distribution Matching Distillation
The Distribution Matching Loss is exactly DI's IKL.
As the distribution of our generated samples changes throughout training, we dynamically adjust the fake diffusion model; this is why an extra diffusion model must be trained. The fake diffusion model and the one-step generator are trained jointly.
Improved Distribution Matching Distillation for Fast Image Synthesis
Multi-step generator (999, 749, 499, 249); like CM, it alternates between denoising and noise injection steps, e.g., from
To avoid the training/inference mismatch, the training input is not a noised training-set image but is instead produced with the above procedure using
Removing the regression loss: true distribution matching and easier large-scale training.
Stabilizing pure distribution matching with a Two Time-scale Update Rule. fake diffusion model和few-step generator是分开训练的。
Surpassing the teacher model using a GAN loss and real data.
Multistep Distillation of Diffusion Models via Moment Matching
A random variable
Moment matching: fit a probability distribution by matching its moments, e.g., the mean, variance, etc.
Generalized Method of Moments (GMM): define an arbitrary function
distill pre-trained diffusion model
In the case of one-step sampling, our method is a special case of Diff-Instruct, which distill a diffusion model by approximately minimizing the KL divergence between the distilled generator and the teacher model.
Flow Generator Matching
Feels like DI applied on top of flow matching.
Consistency Models
In the concrete implementation, CM is trained on a 40-step EDM, so
For simplicity, we only consider one-step ODE solvers in this work. It is straightforward to generalize our framework to multistep ODE solvers and we leave it as future work.
CM's training objective is the consistency between two adjacent steps, not the diffusion model's reconstruction of
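The consistency distillation objective, for reference: one ODE-solver step with the teacher maps $x_{t_{n+1}}$ to $\hat{x}_{t_n}$, and the student's outputs at the two adjacent points are made to agree ($\theta^-$ is an EMA copy):
```latex
\mathcal{L}_{\mathrm{CD}} =
\mathbb{E}\Big[\, \lambda(t_n)\, d\big( f_\theta(x_{t_{n+1}}, t_{n+1}),\;
  f_{\theta^-}(\hat{x}^{\phi}_{t_n}, t_n) \big) \Big],
\qquad
\hat{x}^{\phi}_{t_n} = \texttt{Solver}_\phi(x_{t_{n+1}}, t_{n+1}, t_n).
```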
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Since
LCM-LoRA: A Universal Stable-Diffusion Acceleration Module
StableDiffusion + LoRA serves as the Consistency Model.
Reward Guided Latent Consistency Distillation
Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion
Essentially the same as TCD, but framed differently: TCD starts from the latter two steps and then introduces the first, while CTM starts from the first and last steps and then introduces a middle one.
TCD plagiarizes CTM.
Phased Consistency Models
The learning objectives of CTMs are redundant, including many trajectories that will never be applied for inference.
Split the diffusion trajectory into
Generalized Consistency Trajectory Models for Image Manipulation
CTMs only allow translation from Gaussian noise to data. This work aims to unlock the full potential of CTMs by proposing generalized CTMs, which translate between arbitrary distributions via ODEs.
Flow Matching is another technique for learning PFODEs between two distributions; CTMs are applied on the PFODEs learned by Flow Matching.
Supports translation, editing, and more.
Simple and Fast Distillation of Diffusion Models
It can speed up TCD training.
Trajectory Consistency Distillation
Define
The left-hand side is CM's multi-step sampling; the right-hand side is TCD's. Each step of CM's multi-step sampling predicts all the way to
SDXL is fine-tuned with LoRA.
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
PD and TCD are combined into what is called Progressive Consistency Distillation: first split the
Consistency distillation uses an adversarial loss and an MSE loss. Empirically, we observe that MSE Loss is more effective when the predictions and target values are proximate (e.g., for
After training, DMD is further used for enhancement.
SDXL is fine-tuned with LoRA.
Multistep Consistency Models
Similar in spirit to TCD, but unlike it, MCM does not redefine
SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation
Uses an SDE solver instead of an ODE solver.
Uses multi-step SDE.
The consistency model is parameterized to predict a mean and variance, so its output is a distribution, optimized with a KL divergence.
Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation
One Step Diffusion via Shortcut Models
Train a single model that supports different sampling budgets, by conditioning the model not only on the timestep
Self-consistency property:
Consistency Flow Matching: Defining Straight Flows with Velocity Consistency
An application of CM to flow matching.
Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps
The CD idea is used to learn a model fCD for inversion; CD maps
fCD has two problems: one is that what it needs to learn
CD and fCD are trained simultaneously, with two extra preservation losses: sample some boundary timestep's
We train fCD and CD separately from each other but initialize them with the same teacher model.
For fCD, we consider the unguided model with a constant
Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation
Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step
SiD with Adversarial Loss
Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation
Samplers can be divided into single-step and multi-step. Single-step samplers predict the next state from the current state only, e.g., DDIM, EDM, DPM-Solver; they are simple to implement and can self-start. Multi-step samplers additionally use past states to predict the next state, e.g., PNDM, DEIS; they give more accurate estimates and better results.
Denoising Diffusion Implicit Models
Pseudo Numerical Methods for Diffusion Models on Manifolds
DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
Distilling ODE Solvers of Diffusion Models into Smaller Steps
We observe that predictions from neighboring timesteps exhibit high correlations in both denoising networks, with cosine similarities close to one. This observation suggests that denoising outputs contain redundant and duplicated information, allowing us to skip the evaluation of denoising networks for most timesteps.
We can combine the history of denoising outputs to better represent the next output, effectively reducing the number of steps required for accurate sampling. This idea is implemented in most ODE solvers, which are formulated based on the theoretical principles of solving differential equations. These solvers often adopt linear combinations or multi-step approaches, leveraging previous denoising outputs to precisely estimate the current prediction.
Existing ODE methods, such as linear multistep methods, use fixed formulas for combining past predictions; D-ODE uses a set of learnable combination coefficients, and the student uses the
Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy
PFDiff: Training-free Acceleration of Diffusion Models through the Gradient Guidance of Past and Future
Based on two key observations: a significant similarity in the model’s outputs at time step size that is not excessively large during the denoising process of existing ODE solvers, and a high resemblance between the denoising process and SGD.
The prediction from a previous timestep (or a combination of several past predictions via an ODE formula) is reused directly as the current timestep's prediction, so the current timestep needs no NFE.
DeepCache: Accelerating Diffusion Models for Free
Every
Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule
Similar in spirit to DeepCache; uses NAS to search for potential inference schedules with non-uniform steps and structures.
Unraveling the Temporal Dynamics of the Unet in Diffusion Models
Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models
The encoder features exhibit a subtle variation at adjacent time-steps, whereas the decoder features exhibit substantial variations across different timesteps; therefore the UNet encoder outputs and features from a previous step can be reused and fed/skip-connected directly into the next step's UNet decoder.
The encoder feature change is larger in the initial inference phase compared to the later phases throughout the inference process, so reuse is concentrated in the middle and later stages of sampling.
Reuse can also span several consecutive steps, allowing those steps to be computed in parallel.
Cache Me if You Can: Accelerating Diffusion Models through Block Caching
UNet的block输出具有三个特点:smooth change over time, distinct patterns of change, small step-to-step difference. A lot of blocks are performing redundant computations during steps where their outputs change very little. Instead of computing new outputs at every step, we reuse the cached outputs from a previous step. Due to the nature of residual connections, we can perform caching at a per block level without interfering with the flow of information through the network otherwise.
Outputs of certain blocks from previous timesteps are reused to reduce computation.
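A minimal sketch of the caching idea shared by DeepCache / Block Caching (the `return_cache` / `reuse_cache` interface, `cache_every`, and `ddim_step` are hypothetical, not any specific implementation):
```python
import torch

@torch.no_grad()
def sample_with_block_cache(unet, x, timesteps, cond, cache_every=5):
    """Recompute the cached (deep) blocks only every `cache_every` steps; reuse otherwise."""
    cache = None
    for i, t in enumerate(timesteps):
        if cache is None or i % cache_every == 0:
            eps, cache = unet(x, t, cond, return_cache=True)   # full forward, store block outputs
        else:
            eps, _ = unet(x, t, cond, reuse_cache=cache)       # shallow layers only, reuse cache
        x = ddim_step(x, eps, t)                               # assumed sampler update
    return x
```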
Accelerating Vision Diffusion Transformers with Skip Branches
Delta-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers
Note the scissor marks in the figure: the methods differ in which part of the computation is skipped.
Unlike previous methods,
Accelerating Diffusion Transformers with Dual Feature Caching
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models
This study reveals that, in text-to-image diffusion models, cross-attention is crucial only in the early inference steps, allowing us to cache and reuse the cross-attention map in later steps.
This saves the computation of the cross-attention map, the most expensive part.
DiTFastAttn: Attention Compression for Diffusion Transformer Models
Post-training compression for self-attention. It can be applied to DiT on ImageNet as well as to the text-to-image PixArt.
Self-attention values concentrate within a window along the diagonal region of the attention matrix. The residual of the self-attention map is computed during the first two sampling steps; afterwards only window self-attention near the diagonal is computed, and this residual is added back to form the final self-attention map.
Self-attention sharing directly shares self-attention results across steps.
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching
Train
During training, two adjacent timesteps are sampled
At inference, a certain layer's
HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration
Improves L2C.
LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers
A similarity estimator is learned for each MHSA and FeedForward module; when the similarity exceeds a threshold, the module is skipped and the cached result from the previous step is reused.
Task-Oriented Diffusion Model Compression
Acceleration specifically for image-to-image translation tasks, e.g., InstructPix2Pix image editing and StableSR image restoration.
Depth-skip compression: same as "(b) Removing deconv blocks" in the Unraveling paper.
Timestep optimization: biased timestep selection
Token Merging: Your ViT But Faster
Token Merging for Fast Stable Diffusion
Importance-based Token Merging for Diffusion Models
Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
Redundant tokens are identified from the attention map and merged.
ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
Similar to PixArt-
Tokens in close spatial proximity exhibit higher similarity, thus providing a basis for merging without the extensive computation of pairwise similarities.
We employ a downsampling function using the Nearest-Neighbor algorithm to the keys and values of the attention mechanism while preserving the original queries.
Token Caching for Diffusion Transformer Acceleration
Accelerating Diffusion Transformers with Token-wise Feature Caching
Essentially the same as the UNet feature-cache approaches, only applied to tokens instead.
Refining Generative Process with Discriminator Guidance in Score-Based Diffusion Models
Generate a batch of samples with the pretrained diffusion model, while recording the generation process's
Sample a batch from the real dataset and noise it to a random timestep; sample a batch from the generated dataset and take the same timestep's
At sampling time, use
Diffusion Rejection Sampling
Rejection sampling is used to refine each sampling step, taking
The probability is ultimately computed with DG's time-dependent discriminator.
Score-based Generative Models with Adaptive Momentum
Similar to FDM but requires no retraining: motivated by Stochastic Gradient Descent (SGD) optimization methods and the close connection between the model's sampling process and SGD, we propose adaptive momentum sampling to accelerate the transformation process without introducing additional hyperparameters.
The Surprising Effectiveness of Skip-Tuning in Diffusion Sampling
Let the UNet encoder features be
Using the simplest strategy, define the innermost
Take some images, noise them and denoise in a single step, and sum the diffusion loss over all steps: skip-tuning does not lower the diffusion loss, but it makes the one-step denoised images closer to the originals in feature space (extracted by InceptionV3, CLIP, etc.), so the FID improvement from skip-tuning comes from optimizing features. One can therefore take an existing model and add a trainable
Boosting Diffusion Models with Moving Average Sampling in Frequency Domain
The diffusion generation process is viewed as a parameter-optimization process, so a moving average can be introduced to improve stability and quality.
The denoising process often prioritizes reconstructing the low-frequency component (layout) in the earlier stage, and then focuses on recovering the high-frequency component (detail) later. Therefore, during IDWT each component is scaled: the low-frequency component is multiplied by a monotonically decreasing constant and the high-frequency components by a monotonically increasing one.
At the same number of steps, FID is better than DDIM.
Towards More Accurate Diffusion Model Acceleration with A Timestep Tuner
Similar to TS-DDPM: at each denoising step, we replace the original parameterization by conditioning the network on a new timestep, enforcing the sampling distribution towards the real one.
Residual Learning in Diffusion Models
Score-based generative models have two sources of error: discretization error and the score network's imperfect fit. A correction network can therefore be learned on top of a pre-trained diffusion model to fit this error.
Only on
DICE: Staleness-Centric Optimizations for Efficient Diffusion MoE Inference
Sampling acceleration targeted at diffusion models with MoE architectures.
Informed Correctors for Discrete Diffusion Models
A sampling algorithm for discrete diffusion models.
For a given timestep
Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps
We search for such a time step within a window surrounding the current time step to restrict the denoising progress.
Input Perturbation Reduces Exposure Bias in Diffusion Models
A Gaussian distribution is used to model what is fed into the network during training
DREAM: Diffusion Rectification and Estimation-Adaptive Models
Markup-to-Image Diffusion Models with Scheduled Sampling
When training the diffusion model, first start from
This method was originally used to address the exposure-bias problem in autoregressive text generation.
E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models
Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models
Generating High Fidelity Data from Low-density Regions using Diffusion Models
Don’t Play Favorites: Minority Guidance for Diffusion Models
Self-Guided Generation of Minority Samples Using Diffusion Models
Diffusion Models as Cartoonists! The Curious Case of High Density Regions
We propose a practical high probability sampler that consistently generates images of higher likelihood than usual samplers.
Manifold-Guided Sampling in Diffusion Models for Unbiased Image Generation
encourage the generated images to be uniformly distributed on the data manifold, without changing the model architecture or requiring labels or retraining.
Uses guidance.
BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference
Last-layer Laplace Approximation (LLLA) is used to estimate the uncertainty of samples generated by the diffusion model, which can indicate the level of clutter and the degree of subject prominence in the image. High-uncertainty samples tend to have cluttered backgrounds and can be filtered out.
CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling
Causes of low diversity: the model is trained on a small dataset; the CFG scale is too large.
The condition fed into the model is
For a class-conditional diffusion model,
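A hedged sketch of the condition-annealing trick as I understand it from CADS (the schedule parameters `tau1`, `tau2`, the noise scale, and the omitted rescaling step are simplifications):
```python
import torch

def anneal_condition(cond, t, tau1=0.6, tau2=0.9, noise_scale=0.25):
    """CADS-style annealing: heavily corrupt the condition early in sampling
    (large t, with t normalized to [0, 1]) and anneal back to the clean condition."""
    gamma = min(max((tau2 - t) / (tau2 - tau1), 0.0), 1.0)   # 0 for t >= tau2, 1 for t <= tau1
    noise = torch.randn_like(cond)
    return (gamma ** 0.5) * cond + noise_scale * ((1.0 - gamma) ** 0.5) * noise
```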
Fixed Point Diffusion Models
No new theory; the larger network in the middle of the DiT block is simply replaced with a smaller fixed-point
The number of fixed-point iterations can be adjusted dynamically according to accuracy requirements or compute budget.
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
Distributed inference.
Patch Parallelism (PP), where a single image is divided into patches and distributed across multiple GPUs for individual and parallel computations.
Using the fact that the difference between the images in adjacent diffusion steps is nearly zero, Patch Parallelism (PP) leverages multiple GPUs communicating asynchronously to compute patches of an image in multiple computing devices based on the entire image (all patches) in the previous diffusion step.
Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference
PCPP develops PP to reduce computation in inference by conditioning only on parts of the neighboring patches in each diffusion step, which also decreases communication among computing devices.
PCPP decreases the communication cost by around 70% compared to DistriFusion.
Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better
Evolutionary Search
Achieves an effect similar to ensembling.
Understanding Hallucinations in Diffusion Models through Mode Interpolation
Diffusion models smoothly “interpolate” between nearby data modes in the training set, to generate samples that are completely outside the support of the original training distribution; this phenomenon leads diffusion models to generate artifacts that never existed in real data (i.e., hallucinations).
Diffusion Models Beat GANs on Image Synthesis
A classifier must be trained on noisy data
Training Diffusion Classifiers with Denoising Assistance
When training the noisy classifier, the pre-trained diffusion model's predicted
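The classifier-guidance update from ADM, for reference: a noisy classifier $p_\phi(y\mid x_t)$ trained on corrupted inputs shifts the reverse-process mean,
```latex
\nabla_{x_t}\log p(x_t\mid y) = \nabla_{x_t}\log p(x_t) + \nabla_{x_t}\log p_\phi(y\mid x_t),
\qquad
\mu_\theta(x_t,t) \leftarrow \mu_\theta(x_t,t) + s\,\Sigma_\theta(x_t,t)\,\nabla_{x_t}\log p_\phi(y\mid x_t).
```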
Classifier-Free Diffusion Guidance
Gradient-Free Classifier Guidance for Diffusion Model Sampling
No Training, No Problem: Rethinking Classifier-Free Guidance for Diffusion Models
Training a conditional diffusion model no longer requires randomly dropping the condition (as a null condition).
The unconditional distribution can be computed from the conditional distribution, i.e.,
TSG:
Classifier-Free Guidance is a Predictor-Corrector
CFG can be viewed as a predictor-corrector sampling procedure in the sense of score-based generative models.
Compress Guidance in Conditional Diffusion Sampling
The denoising process can be viewed as gradient descent on a KL divergence.
Classifier-guided sampling can be viewed as a similar process.
Denoising Likelihood Score Matching for Conditional Score-Based Data Generation
Classification Diffusion Models: Revitalizing Density Ratio Estimation
Train a timestep classifier that, based on
It can be shown that
Simple Guidance Mechanisms for Discrete Diffusion Models
A CFG method for discrete diffusion models.
The recent focus of conditional diffusion research is how to incorporate the conditioning gradient during reverse sampling. This is because, for a given loss function
Use Tweedie's formula, based on
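The common recipe (DPS / FreeDoM style) written out: estimate the clean image with Tweedie's formula and guide with the gradient of a loss defined on that clean estimate,
```latex
\hat{x}_0(x_t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}},
\qquad
x_{t-1} \;\leftarrow\; x_{t-1} \;-\; \rho_t\,\nabla_{x_t}\,\ell\big(c,\;\hat{x}_0(x_t)\big).
```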
Improving Diffusion Models for Inverse Problems using Manifold Constraints
Diffusion Posterior Sampling for General Noisy Inverse Problems
FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
Training a distance function between noisy data and the condition and using its gradient as guidance is too expensive; instead, the noise predicted at each step can be used to compute a predicted clean sample, and an existing distance function between clean data and the condition can be used, i.e.:
This approach is common, but its results are unstable: it works well for small domains (e.g., faces) but poorly for large domains (ImageNet). The reason: the direction of the unconditional score generated by diffusion models in large data domains has more freedom, making it easier to deviate from the direction of conditional control.
Solution: use RePaint's resampling technique, repeatedly performing
Universal Guidance for Diffusion Models
Loss-Guided Diffusion Models for Plug-and-Play Controllable Generation
Manifold Preserving Guided Diffusion
Fisher Information Improved Training-Free Conditional Diffusion Model
Decoupling Training-Free Guided Diffusion by ADMM
Understanding Training-free Diffusion Guidance: Mechanisms and Limitations
Two improvement methods.
TFG: Unified Training-Free Guidance for Diffusion Models
GeoGuide: Geometric Guidance of Diffusion Models
For a random variable
data manifold
Elucidating The Design Space of Classifier-Guided Diffusion Generation
A calibration, but it only applies to off-the-shelf discrete classifiers.
Diffusion Models as Plug-and-Play Priors
Variational Inference
The sampling process is a point-estimate sampling of the introduced variational distribution, and also minimizes the negative ELBO, i.e., the KL divergence between the variational distribution and the true posterior.
Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis
Guidance with Spherical Gaussian Constraint for Conditional Diffusion
DSG enhanced DPS by normalizing gradients in the constraint guidance term and implementing a step size schedule inspired by Spherical Gaussians.
DreamGuider: Improved Training free Diffusion-based Conditional Generation
Computing the gradient does not require backpropagating through the diffusion network, which reduces computation.
Inspired by SGD, a dynamic scale is used, removing the need for handcrafted parameter tuning on a case-by-case basis.
Guiding a Diffusion Model with a Bad Version of Itself
Guiding a high-quality model with a poor model trained on the same task, conditioning, and data distribution, but suffering from certain additional degradations, such as low capacity and/or under-training.
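The guidance rule itself is the familiar CFG-style extrapolation, only with the degraded model in place of the unconditional branch. A minimal sketch (function and parameter names are placeholders, not the paper's code):

```python
def autoguidance(x_t, t, cond, good_model, bad_model, w=2.0):
    """Extrapolate away from a degraded ('bad') denoiser toward the main one,
    analogous to CFG but with the bad model replacing the unconditional branch."""
    d_good = good_model(x_t, t, cond)
    d_bad = bad_model(x_t, t, cond)   # e.g. a smaller or under-trained model, same task and conditioning
    return d_bad + w * (d_good - d_bad)
```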
Self-Improving Diffusion Models with Synthetic Data
Use self-synthesized data to provide negative guidance during the generation process to steer a model’s generative process away from the non-ideal synthetic data manifold and towards the real data distribution.
Train a diffusion model on the training set; once trained, use it to generate a dataset, then train another diffusion model on that generated data and use it for CFG-style guidance in the manner of AutoGuidance.
DDIM reverse process中的
Diffusion Models Already Have a Semantic Latent Space
根据
Fast Diffusion Sampler for Inverse Problems by Geometric Decomposition
用Tweedie's formula根据
这等价于对
CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models
Views conditional sampling of a diffusion model as an inverse problem conditioned on the measurement.
Uses DDS to solve this inverse problem: first sample unconditionally to compute
In the formula, replace the
CFG++ is exactly this asymmetric reverse process,
CFG++ has the same computational cost as CFG: at each step the diffusion model is first run unconditionally to compute
It can also be used for DDIM Inversion, which resolves
Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance
A generative model goes from a latent (usually randomly sampled noise) to a generated sample; Inversion goes the other way: starting from real (non-generated) data, it finds the latent that regenerates that data.
The motivation is editing real data.
Due to mode collapse, GAN Inversion works relatively poorly and its procedure is complicated.
DDIM的生成过程可以表示为
对于unconditional(
如果使用非对称的
通过grid search(PSNR越大越好)可以看到:每一行中,只有
在DiffusionAutoencoder中,如果使用控制stochastic changes的inferred
Good Inversion makes editing straightforward; there is dedicated work on exact Inversion, such as EDICT, Null-text Inversion, Prompt Tuning, and AIDI; see the Image Editing section.
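For reference, a minimal sketch of plain DDIM inversion: run the deterministic DDIM update in reverse to recover a latent for a real image. `eps_model`, `alphas_cumprod`, and `timesteps` are placeholders; exact-inversion methods listed above refine this baseline.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, timesteps, cond=None):
    """Deterministic DDIM inversion: step from each t to the next (noisier) t,
    reusing the noise prediction made at the current latent."""
    x = x0
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):    # increasing noise levels
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(x, t_cur, cond)
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x   # approximate latent that regenerates x0 under DDIM sampling
```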
Zero-shot Image-to-Image Translation
DDIM Inversion时每一步使用两个loss梯度下降优化
EDICT: Exact Diffusion Inversion via Coupled Transformations
Effective Real Image Editing with Accelerated Iterative Diffusion Inversion
Exact Diffusion Inversion via Bi-directional Integration Approximation
BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models
On Exact Inversion of DPM-Solvers
Inversion for higher-order samplers.
Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models
EasyInv: Toward Fast and Better DDIM Inversion
It can be applied to various tasks, such as data-driven fine-tuning, RLHF fine-tuning, and TI fine-tuning.
LoRA: Low-rank adaptation of large language models
Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model
The standard U-Net architecture for diffusion models conditions convolutional layers in residual blocks with scale-and-shift but does not condition attention blocks. Simply adding LoRA conditioning on attention layers improves the image generation quality.
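A minimal sketch of the drop-in idea: wrap an attention projection with a trainable low-rank branch while the pretrained weight stays frozen. The module below is a generic LoRA linear layer (names and defaults are illustrative), not the paper's exact conditioning scheme.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B(A x)."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)             # keep the pretrained projection frozen
        self.down = nn.Linear(base.in_features, r, bias=False)
        self.up = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)          # start as an identity update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# e.g. replace the to_q / to_k / to_v projections of each attention block with LoRALinear(...)
```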
TriLoRA: Integrating SVD for Advanced Style Personalization in Text-to-Image Generation
Compact SVD:
TriLoRA:
PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction
Time-Varying LoRA Towards Effective Cross-Domain Fine-Tuning of Diffusion Models
SVDiff: Compact Parameter Space for Diffusion Fine-Tuning
类似FSGAN,对卷积层参数做SVD,
A Closer Look at Parameter-Efficient Tuning in Diffusion Models
Adds small trainable adapters to a pre-trained StableDiffusion for transfer learning, and studies how adapter placement and architecture affect training.
StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models
改进LoRA:
Navigating Text-To-Image Customization From LyCORIS Fine-Tuning to Model Evaluation
Controlling Text-to-Image Diffusion by Orthogonal Finetuning
OFT is another fine-tuning method; it outperforms LoRA with fewer parameters and faster convergence.
It is unrelated to OrthoAdaptation.
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
An improvement over OFT.
Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models
SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
Only the SC-Tuner is used.
DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
A PEFT method designed for DiT.
还支持将低分辨率模型fine-tune到高分辨率,对positional embedding进行插值,比如提高一倍分辨率时,原来的
Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers
DiT for incremental class-conditional generation.
为incremental class添加class embedding。
Affiner:
SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation
Drawing on pruning theory, the model contains ineffective parameters, i.e., those whose absolute values fall below a certain threshold.
These currently ineffective parameters are caused by the training process and can become effective again in following training.
FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
Singular value decomposition; only the singular values are fine-tuned.
Vector Quantized Diffusion Model for Text-to-Image Synthesis
VQVAE + multinomial diffusion
transformer blocks:input
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
首次提出 text
text
第一种:最后一个vector作为ADM中AdaGN的class embedding的替代。
第二种:
64x64
classifier-free guidance
After training the conditional model above, it is fine-tuned with the text replaced by an empty string 20% of the time, yielding the classifier-free model.
Text-Guided Inpainting Model
Inpainting with a pre-trained DDPM: at each sampling step,
After the conditional model above is trained, randomly mask
Sampling works as above: at each step,
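For the training-free variant, a minimal sketch of the commonly used replace-the-known-region trick: after every reverse step, overwrite the unmasked region with a freshly noised copy of the original image so only the masked region is actually synthesized. `sampler_step`, `add_noise`, and `mask` are placeholders, not the paper's code.

```python
import torch

@torch.no_grad()
def inpaint_step(x_t, t, t_prev, x0_known, mask, sampler_step, add_noise):
    """mask == 1 marks pixels to keep from x0_known; the model only fills the mask == 0 region."""
    x_prev = sampler_step(x_t, t)              # one ordinary reverse-diffusion step
    known_prev = add_noise(x0_known, t_prev)   # forward-diffuse the known image to noise level t_prev
    return mask * known_prev + (1 - mask) * x_prev
```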
Hierarchical Text-Conditional Image Generation with CLIP Latents
Prior:text作为条件,DDPM建模image CLIP embedding。使用GLIDE的text encoder编码text,使用预训练好的CLIP编码text和image,使用TransformerDecoder模型,分别将encoded text,CLIP text embedding,timestep embedding,noised CLIP image embedding,placeholder embedding按顺序输入,使用causal attention mask(当前位置只和前面的做attention),placeholder embedding位置的输出预测unnoised CLIP image embedding。不使用
Decoder:image CLIP embedding和text作为条件,DDPM建模image。用GLIDE的两种condition方法,第一种是将CLIP image embedding映射到指定维度替代ADM中AdaGN的class embedding,第二种是将CLIP image embedding映射成长度为4的token sequence,然后concat到上述encoded text token sequence之后(
CFG:Prior:randomly dropping text conditioning information 10% of the time during training. Decoder:randomly setting the CLIP embeddings to zero (or a learned embedding) 10% of the time, and randomly dropping the text caption 50% of the time during training
Improving Image Generation with Better Captions
Existing text-to-image models struggle to follow detailed image descriptions and often ignore words or confuse the meaning of prompts. We hypothesize that this issue stems from noisy and inaccurate image captions in the training dataset. We address this by training a bespoke image captioner and using it to recaption the training dataset. We then train several text-to-image models and find that training on these synthetic captions reliably improves prompt-following ability. StableDiffusion is trained on the recaptioned dataset, using the recaptions 95% of the time and the original captions 5% of the time.
A text-conditioned convolutional UNet latent diffusion model on top of the latent space learned by the VAE.
Once trained, we used the consistency distillation process to bring it down to two denoising steps.
CogView3: Finer and Faster Text-to-Image via Relay Diffusion
类似DALL·E-3使用recaptioned dataset进行训练。
Base Stage是一个
SR Stage是一个latent space的RDM(原RDM是在pixel space的),只训练了
采样时将Base Stage生成的图像上采样到
High-Resolution Image Synthesis with Latent Diffusion Models
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
架构上:借鉴SimpleDiffusion第3条经验,架构上采用不均一的block分布,SD是[1,1,1,1],即4层每层1个block,downsample 3次,SDXL是[0,2,10],即3层,第一层直接降维,不做其余处理,第二第三层各2个和10个block,downsample 2次。使用了两个text encoder,输出concat在一起。参数量是原SD的3倍。
Micro-Conditioning on Image Size: since dataset images have varying sizes, SD simply discards small images, losing a large fraction of the data; alternatively, small images can be upsampled to the target size, but such images are blurrier than genuinely large ones and make the model output blurry. SDXL feeds the original image size as a condition added to the time embedding. Note: the network still outputs images at the target size, but their sharpness is governed by this condition. The image quality clearly increases when conditioning on larger image sizes. (A minimal sketch of this micro-conditioning appears after this block.)
Micro-Conditioning on Cropping Parameters: a major SD failure mode is that outputs sometimes crop off part of an object, caused by data preprocessing that resizes the shorter side to the target size and then crops the longer side. SDXL feeds the crop coordinates as a condition added to the time embedding; at inference, passing (0, 0) yields images with complete objects.
Multi-Aspect Training: SD uses a fixed output size. After pre-training at the target size, SDXL is fine-tuned on images of multiple aspect ratios: a set of size buckets is defined, each image is assigned to the nearest bucket and resized to that bucket's size, and each training batch is sampled from one randomly chosen bucket. At inference, images of different sizes can be generated simply by feeding noise of the target size.
Improved VAE autoencoder: retrained with batch size 256 (previously 9) and EMA.
先在256x256上训练(带Micro-Conditioning),再在512x512上训练(带Micro-Conditioning),再在1024x1024的分辨率上进行Multi-Aspect Training(划分bucket:以1024x1024为中心,64为步长增减长宽,保持pixel总数和1024x1024接近)。
Refinement Stage:We train a separate LDM in the same latent space, which is specialized on high-quality, high resolution data and employ a noising-denoising process as introduced by SDEdit on the samples from the base model. We follow and specialize this refinement model on the first 200 (discrete) noise scales. During inference, we render latents from the base SDXL, and directly diffuse and denoise them in latent space with the refinement model, using the same text input.
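To make the micro-conditioning above concrete, here is a minimal sketch under the simplifying assumption that the extra scalars (original size, crop offsets, target size) are embedded by a small MLP and added to the timestep embedding; the released SDXL code differs in detail (e.g., it Fourier-embeds each scalar), so treat this as illustrative only.

```python
import torch
import torch.nn as nn

class MicroConditioner(nn.Module):
    """Embed (orig_h, orig_w, crop_top, crop_left, target_h, target_w) and add the
    result to the timestep embedding, SDXL-style micro-conditioning in spirit."""
    def __init__(self, time_dim: int = 1280):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, time_dim), nn.SiLU(), nn.Linear(time_dim, time_dim))

    def forward(self, t_emb, size_and_crop):   # size_and_crop: (B, 6) float tensor
        return t_emb + self.mlp(size_and_crop)
```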
On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models
Straightforward implementation of control conditions in DiT may cause interference between the time-step and class-level or control conditions (macro-conditioning) if their corresponding embeddings are additively combined in the adaptive layer norm conditioning.
For class, we move the class embedding to be fed through the attention layers present in the DiT blocks.
For control conditions, we zero out the control embedding in early denoising steps, and gradually increase its strength.
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
在Conditional Flow Matching中,
MMDiT架构,ViT in latent space,latent channel取16,since text and image embeddings are conceptually quite different, we use two separate sets of weights for the two modalities.
训练时SNR遵循什么样的分布采样很重要。
Rectified Flow (
Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models
We believe that the continuity of information flow through every layer of the LLM is what enables its generative power and that the knowledge within the LLM spans across all its layers, rather than being encapsulated by the output of any single layer.
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models
three-stages to reduce computational demands
stage A:训练4倍降采样率的VQGAN,
Semantic Compressor:将图像从1024 resize到768,训练一个网络将其压缩到
stage B:diffusion建模stage A中图像quantize之前的embedding,以图像经过semantic compressor的输出为条件(Wuerstchen还以text为额外条件),相当于self-condition。
stage C:diffusion建模图像经过semantic compressor的输出,以text为条件。
生成时
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Use noise conditioning augmentation for both super resolution models。
三个模型都有CFG。
Dynamic thresholding(只针对采样)
When a relatively large classifier guidance weight is used, the result obtained at each step (a minimal sketch of dynamic thresholding appears after this block)
纯text encoder比image-text联合训练出来的text encoder要好。
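A minimal sketch of dynamic thresholding as described in the Imagen paper: instead of statically clipping the predicted x0 to [-1, 1], clip each sample to its own absolute-value percentile s (with s ≥ 1) and rescale by s. The percentile value here is illustrative.

```python
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    """Clamp each predicted x0 to its per-sample percentile s (>= 1) and divide by s."""
    b = x0_pred.shape[0]
    s = torch.quantile(x0_pred.abs().reshape(b, -1), percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(b, *([1] * (x0_pred.dim() - 1)))
    x0_clipped = torch.maximum(torch.minimum(x0_pred, s), -s)
    return x0_clipped / s
```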
YaART: Yet Another ART Rendering Technology
fine-tune
RL alignment for
eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
同时使用T5 text encoder和CLIP text encoder。
发现了不同时间步对文本的利用程度不同
提出模型分裂法,每个子模型只针对某个子level的noise进行训练,称为expert,最终模型为Ensemble of Expert Denoisers。
RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
eDiffi是使用不同timesteps的experts进行生成,这里还使用不同space的experts进行生成。
Space MoE:根据cross-attention map使用阈值法确定某个word的mask,再根据word由route网络选择某个expert,由该expert生成该word对应的feature,所有word的feature乘上对应的mask取平均作为输出。
PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
DiT加入cross-attention引入text。
在DiT架构中,AdaLN的参数量竟然占到了DiT的
三阶段训练:使用一个预训练的class-conditional ImageNet模型作为初始化,一方面可以节省text-to-image的训练时间,一方面class-conditional模型训练起来较为容易且不费时;使用高度对齐的、高密度信息的文本的数据集进行训练,实现text-image alignment;类似Emu,使用少量的高质量图像进行fine-tune。
PixArt-sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
使用更高分辨率的图像和细粒度的caption进行训练。
为了减少参数量在self-attention中使用KV compression,因为相邻的
Weak-to-Strong Training Strategy:PixArt-
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
GenTron: Diffusion Transformers for Image and Video Generation
adaLN design yields superior results in terms of the FID, outperforming both cross-attention and in-context conditioning in efficiency for class-based scenarios. However, our observations reveal a limitation of adaLN in handling free-form text conditioning. Cross-attention uniformly excels over adaLN in all evaluated metrics.
PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
Cascaded Training:不同分辨率的三个模型分别训练。Resolution Boost Training:先在低分辨率上训练,再在高分辨率上训练。
Time-Decoupled Training:将时间步分为两个阶段,前一阶段主要负责生成形状,后一阶段负责refine。前一阶段需要使用大量的text-image pair进行训练以让模型学习不同的concept,之前的模型都过滤掉低分辨率,但这里不需要,将低分辨率也上采样到高分辨率进行训练,因为前一阶段生成的是
Coop Diffusion:不同隐空间和不同分辨率训练的扩散模型可以一起用于采样,以image space为中介进行转换。
Paragraph-to-Image Generation with Information-Enriched Diffusion Model
解决长文本复杂场景的生成问题。
使用decoder-only的language model训练t2i模型,好处是gpt已经展现出了强大的能力,对长文本已经有很好的建模,且训练数据多,缺点是pre-trained decoder-only模型feature extraction能力不太行,所以需要adaption。efficiently fine-tuning a more powerful decoder-only language model can yield stronger performance in long-text alignment (up to 512 tokens)
KNN-Diffusion Image Generation via Large-Scale Retrieval
不需要text-image pair进行训练,用image做条件,CLIP做桥梁。训练时根据image间CLIP编码的cosine距离,使用KNN算法找出和训练image相似的N个image作为条件。采样时根据text和image间CLIP编码的cosine距离,使用KNN算法找出和采样text相似的N个image作为条件。
Retrieval-Augmented Diffusion Models
使用训练数据的k-NN的CLIP embedding作为条件进行训练,采样时,可以根据文本挑选k-NN进行生成,也可以直接使用文本的CLIP embedding。
Re-Imagen: Retrieval-Augmented Text-to-Image Generator
Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities. A generative model that uses retrieved information can produce high-fidelity and faithful images, even for rare or unseen entities.
Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. Re-Imagen is augmented with the knowledge of high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities’ visual appearances.
在UNet encoder后加一个cross-attention与neighbors做交互,同样使用UNet encoder编码neighbors作为key-value,t设为0,所有参数一起训练。
采样时可以自己提供reference image作为neighbor,实现类似Textual Inversion的效果。
Transparent Image Layer Diffusion using Latent Transparency
用透明图像数据训练一个编码器和一个解码器:编码器根据RGB图像和alpha图像预测一个VAE latent空间的偏移量latent transparency,该偏移量加在RGB图像的latent上,相当于对latent distribution做修改,这么做的目的是为了让解码器可以根据修改后的latent预测出RGB图像和alpha图像,但同时应该尽可能少地影响VAE重构效果,让StableDiffusion可以正常运行。loss分为两部分,第一部分是解码器重构RGB图像和alpha图像的loss,第二部分是VAE重构loss,约束编码器预测的偏移量latent transparency不要影响latent distribution。
在新的latent distribution上fine-tune StableDiffusion。
LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model
一个background layer和
使用InstructBLIP、SAM、StableDiffusion inpainting模型造数据训练。
Ensembling Diffusion Models via Adaptive Feature Aggregation
集成学习。
AFA dynamically adjusts the contributions of multiple models at the feature level according to various states (i.e., prompts, initial noises, denoising steps, and spatial locations), thereby keeping the advantages of multiple diffusion models, while suppressing their disadvantages.
Diffusion Soup: Model Merging for Text-to-Image Diffusion Models
Diffusion Soup enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging.
Diffusion Soup approximates ensembling, and involves fine-tuning
Not All Noises Are Created Equally: Diffusion Noise Selection and Optimization
A good noise should remain unchanged after generation followed by Inversion; with this criterion, one can select a good noise for a given prompt or directly optimize one.
Designing a Better Asymmetric VQGAN for StableDiffusion
改进StableDiffusion要建模的隐空间。
为decoder设计了一个conditional branch,输入task-specific prior,如unmasked image in inpainting。
decoder远比encoder大,提升细节重构能力。
Counting Guidance for High Fidelity Text-to-Image Synthesis
用pre-trained counting network,输入每一步的
Iterative Object Count Optimization for Text-to-image Diffusion Models
TI的思路解决count问题。
QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain
TI的思路解决count问题,meta-learning。
MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion
a training-free Multimodal-LLM agent that can progressively generate multi-object with planning and feedback control, like a human painter.
DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics
We propose a generic "naturalness" preserving loss function, kurtosis concentration (KC) loss,和diffusion loss一起训练。
FreeU: Free Lunch in Diffusion U-Net
training-free,只用两个系数提高生成效果。
The features in the UNet decoder come from two sources: the backbone features produced by the decoder itself, and the skip features passed over from the encoder at the same resolution through skip connections.
Multiply the backbone features by a factor
Experiments show that the skip features carry more of the high-frequency detail, so the FFT of the skip features is multiplied by a slightly larger coefficient to restore the suppressed high-frequency information.
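A minimal sketch in the spirit of FreeU's two-factor scaling: scale the backbone features by b and re-weight a frequency band of the skip features in the Fourier domain by s before the two are concatenated. Which band is re-weighted and the factor values are hyperparameters chosen in the paper's ablations; this is a generic illustration, not the official implementation.

```python
import torch

def freeu_style_scaling(backbone_feat, skip_feat, b=1.2, s=0.9, radius=0.25):
    """Scale backbone features by b and re-weight the centered low-frequency band
    of the skip features (relative radius `radius`) by s in the Fourier domain."""
    backbone_feat = backbone_feat * b
    fft = torch.fft.fftshift(torch.fft.fft2(skip_feat.float()), dim=(-2, -1))
    h, w = skip_feat.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h, device=skip_feat.device),
        torch.linspace(-1, 1, w, device=skip_feat.device),
        indexing="ij",
    )
    low = (yy ** 2 + xx ** 2).sqrt() < radius
    mask = torch.where(low, torch.full_like(yy, s), torch.ones_like(yy))
    fft = fft * mask
    skip_out = torch.fft.ifft2(torch.fft.ifftshift(fft, dim=(-2, -1))).real
    return backbone_feat, skip_out.to(backbone_feat.dtype)
```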
Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis
不同区域可以使用不同
Fine-grained Text-to-Image Synthesis with Semantic Refinement
KNN-Diffusion (language-free training),采样时根据text中的semantic选取reference image,给reference image加噪,计算noised reference image和
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
对于target concept的纠错或者编辑。
可以通过prompt engineering设计enhanced and suppressed attribute,可以解决hands生成等问题。
Prompt Sliders for Fine-Grained Control, Editing and Erasing of Concepts in Diffusion Models
类似ConceptSliders,但是是在text encoder上训练LoRA。
Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
训练一个adapter以结合不同预训练语言模型和预训练文生图模型。
给定任意预训练的文本编辑器
Multi-LoRA Composition for Image Generation
training-free
直接将每个LoRA的输出
Unlocking Spatial Comprehension in Text-to-Image Diffusion Models
对于含有左右位置关系的两个物体的prompt,先正常生成其中一个物体,再利用place * on the left这样的instruction进行编辑。
编辑模型类似InstructPix2Pix,使用LLM-grounded diffusion,生成两个物体的layout,只用其中一个layout生成原图,两个layout都用生成目标图,instruction一起,LoRA fine-tune InstructPix2Pix。
CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
给定一个带方位的prompt,对其进行改写(语义不变),再对其进行反义、交换等操作(改变语义),之后都送入文本编码器计算编码结果的相似度,理论上应该是改写后的prompt与原prompt最相似,但发现目前流行的文本编码器都有90%以上的失败率。
构造方法数据集SCOP fine-tune diffusion model,fine-tune时类似RoPE给QK加positional embedding to augment the conditioning text signals.
Golden Noise for Diffusion Models: A Learning Framework
图里画错了,NPNet是两个网络,分别是两种预测target noise的方法,输入都是source noise
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
利用diffusion chain回传LVLM的梯度优化更新initial noise。
直接回传梯度计算量较大,这里将所有
ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts
通过引入先验知识提高image-text对齐程度的优化训练算法。
利用NLP工具标注出text中的关键词,并在cross-attention中提高其与image token的attention的权重。
利用object detection检测出text中的object的区域,提高这一区域的diffusion loss的权重。
TokenCompose: Grounding Diffusion with Token-level Supervision
利用SAM提取prompt中名词对应的object的mask,fine-tune StableDiffusion,除了diffusion loss,还加了两个cross-attention map的辅助loss。
Local Conditional Controlling for Text-to-Image Diffusion Models
StableDiffusion + ControlNet
training-free
如果ControlNet的输入只包含一个物体的控制信息,比如对于prompt"a dog and a cat",ControlNet的输入只包含了cat的bounding box,the prompt concept that is most related to the local control condition dominates the generation process, while other prompt concepts are ignored. Consequently, the generated image cannot align with the input prompt. dog容易消失。
对于有local control的物体,使用控制信息大致估算出一个mask,计算该物体对应的token的cross-attention map在mask内最大值和mask外最大值的差作为loss,对于没有local control的物体,将mask外视为自己的区域,mask外视为非自己的区域,用同样的方法计算loss,loss求和,求梯度作为guidance。
将mask用在ControlNet的skip connection feature上,使得ControlNet只影响mask内的feature。
Improving Sample Quality of Diffusion Models Using Self-Attention Guidance
Compared with classifier-free guidance, which derives guidance from the conditional score, self-attention guidance derives guidance from the model's internal information; it is training-free and condition-free, hence generic and applicable to enhancing any diffusion model.
In classifier guidance, u is the target to move away from; for an unconditional model, its output can serve as c and a u can be defined manually; here, what is generated at each step is used
self-attention mask: the self-attention is reused; the unnormalized self-attention map has size
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance
类似SAG,使用
Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy Curvature of Attention
破坏CFG的unconditional prediction的self-attention,和PAG一个思想。
By defining the energy of self-attention, we introduce a method to reduce the curvature of the energy landscape of attention and use the output as the unconditional prediction of CFG.
Guided Diffusion from Self-Supervised Diffusion Features
类似SAG,利用数据本身的UNet feature做guidance。
Our method leverages the inherent guidance capabilities of diffusion models during training by incorporating the optimal-transport loss. In the sampling phase, we can condition the generation on either the learned prototype or by an exemplar image.
需要全部重新训练。
Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models
某个token的cross-attention的dominance导致了其它token的semantic的丢失。
需要额外输入一个token index集合(可以自动提取所有名词),取该集合内每个token对应的cross-attention map,对于每个cross-attention map,计算其
We choose cross-attention layers in the last down-sampling layers and the first up-sampling layers in the U-Net for optimization.
为了稳定,使用EMA更新。
Towards Better Text-to-Image Generation Alignment via Attention Modulation
training-free
self-attention temperature control:计算attention时使用较小的temperature,让softmax的分布更加集中,high attention values between patches with strong correlations are emphasized, while low attention values between unrelated patches are suppressed. After temperature control, the patch only corresponds with patches within a smaller surrounding area, leading to the correct outlines being constructed in the final generated image. We apply the temperature operation to the early generation stage of the diffusion model in the self-attention layer.
object-focused masking mechanism:对prompt进行拆分,分为带形容词的物体、动词、介词等主体,计算prompt中不同主体对应的cross-attention map之和(每个主体可能不止一个word)作为该主体的cross-attention map,之后遍历所有pixel,对于每个pixel,选出其cross-attention map响应值最大的那个主体,将该pixel分配给该主体,在其它主体的所有word的cross-attention map上mask掉该pixel(响应值设为0)。With this masking mechanism, for each patch, we retain semantic information for only the entity group with the highest probability, along with the global information related to the layout. This approach helps reduce occurrences of object dissolution and misalignment of attributes.
MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask
StableDiffusion
training-free
只在16x16的分辨率上进行操作。
cross-attention map有三种bad case:
做法是使用region selection算法,挑选出每个text token对应的区域,提高其cross-attention map的response,在cross-attention map中尽量分开不同token对应的区域。
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models
StableDiffusion
training-free
只在16x16的分辨率上进行操作。
At each step of generation, gradient descent is used to optimize
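A hypothetical sketch of this semantic-guidance loop: a loss is computed from the stored cross-attention maps of the subject tokens (e.g., one minus the maximum attention of the weakest subject), and the current latent is updated along its negative gradient before the denoising step. `loss_from_attention` is a placeholder callable that runs the UNet and reads the 16x16 cross-attention maps; it is not the paper's code.

```python
import torch

def attend_and_excite_update(latent, loss_from_attention, step_size=20.0):
    """One latent update: descend the attention-based loss w.r.t. the current latent."""
    latent = latent.detach().requires_grad_(True)
    loss = loss_from_attention(latent)   # e.g. 1 - max over the weakest subject token's attention map
    grad = torch.autograd.grad(loss, latent)[0]
    return latent.detach() - step_size * grad
```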
Divide and Bind Your Attention for Improved Generative Semantic Nursing
StableDiffusion
training-free
用total variation loss代替上面的loss,这样就不局限在某个patch点上了,激励整个区域。
另外引入了一个bind loss,其动机是prompt中还存在一些修饰subject token的形容词,这些形容词对应的cross-attention map应该和对应名词的cross-attention map是对齐的,所以引入它们(归一化后)之间的JS散度作为loss。
Linguistic Binding in Diffusion Models Enhancing Attribute Correspondence through Attention Map Alignment
StableDiffusion
training-free
利用cross-attention map计算loss,求梯度作为guidance。
Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models
StableDiffusion
training-free
利用cross-attention map计算loss,求梯度作为guidance。
intensity loss: the negative of the maximum value of the cross-attention map, similar to A&E.
binding loss: maximize the cosine similarity between the given object and its syntactically-related modifier tokens, while enforcing the repulsion of grammatically unrelated ones in the feature space.
Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models
Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory
StableDiffusion
training-free
利用cross-attention map计算loss,求梯度作为guidance。
Easing Concept Bleeding in Diffusion via Entity Localization and Anchoring
类似DiffEdit使用cross-attention map估计出mask,之后进行自我增强。
INITNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization
根据第一步生成时的cross-attention map和self-attention map优化initial noise的重参数分布,保证物体的存在性,解决subject mixing问题。
两个score如果都低于各自的阈值,则说明不需要继续优化,直接采样并进行生成。
Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation
针对MMDiT的解决subject neglect/mixing的方法。
计算三个loss,求梯度作为guidance。
Block Alignment Loss: The blocks in the later layers gradually remove the ambiguities present in the earlier ones, 因此使用深层的cross-attention map与浅层的cross-attention map计算相似度。
Text Encoder Alignment Loss: T5与CLIP可能冲突,因此计算两者编码相同token得到对应的cross-attention map的相似度。
Overlap Loss: 计算不同token对应的cross-attention map的overlap,T5和CLIP各算一个,再两两各计算一个。
A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis
StableDiffusion
training-free
The attention-overlap problem. Solution: compute the IoU between the cross-attention maps of different tokens.
The attention-decay problem: the authors observe that the layout in StableDiffusion's cross-attention maps is fairly clear early in generation but becomes increasingly blurred later and is not preserved. A mask is therefore estimated from the previous step's cross-attention map, and the IoU between the current step's cross-attention map and this mask is computed.
Subtract the loss in 4 from the loss in 3, and use the gradient as guidance.
Visual Programming for Text-to-Image Generation and Evaluation
Fine-tune an LLM on text-layout pairs so it can convert text into layouts, which are fed into GLIGEN together with the text to assist precise, controllable generation.
Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models
类似VP。
GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
类似VP。
Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching
类似VP。
LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation
in-context learning:从训练集(COCO,带prompt和bounding box标注)中随机采样一批样本作为candidate set,训练一个策略网络,该策略网络根据查询prompt,从candidate set选取几个样本作为in-context examples,为ChatGPT输入in-context examples和查询prompt,生成prompt中object的bounding box(文本形式)。策略网络根据mIoU和CLIP相似度等reward训练。
GLIGEN fine-tune StableDiffusion。
DivCon: Divide and Conquer for Progressive Text-to-Image Generation
LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts
RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models
利用ChatGPT生成layout后,利用L2I模型(如GLIGEN)和T2I模型的一起生成,做法是每一步生成时使用系数组合两个模型预测的噪声作为DDIM计算下一步的噪声,并根据DDIM的计算结果定义一个loss更新系数作为下一步的系数,以动态调整真实性和组合性。
Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis
CoT reasoning:in-context让GPT3.5根据prompt生成layout。
在StableDiffusion的self-attention和cross-attention之间插入一个可训练的Layout-Aware Cross-Attention,用layout生成mask作用于cross-attention map上。
Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation
StableDiffusion
training-free
从带有位置关系的text中解析出一个粗糙的layout(比如middle对应图像中央一个方框,left对应左边占1/3的框,都是固定大小的),与第一步生成时产生的cross-attention map做比对,阈值法看是否有layout的不匹配,如果匹配就不介入,直接生成;如果不匹配,则进行介入。
介入:首先从
SPDiffusion: Semantic Protection Diffusion for Multi-concept Text-to-image Generation
StableDiffusion
training-free
SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis
Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion
generating intricate visual content from simple abstract text prompts。
自监督训练一个scene graph的discrete diffusion model,根据simple abstract text prompts生成语义更丰富的scene graph。
给StableDiffusion插入scene graph attn进行训练。
Adapting Diffusion Models for Improved Prompt Compliance and Controllable Image Synthesis
除了text,引入其它条件,如semantic or sketch or depth or normal maps等
对于某个condition加噪,使用预训练的StableDiffusion对其进行去噪,使用可训练的T2I-Adapter引入之前的所有condition,输出输入到下一个T2I-Adapter继续往后传递。
使用
训练时随机置空条件,这样采样时可以挑选任意子图进行生成。
ITI-GEN: Inclusive Text-to-Image Generation
make the pre-trained StableDiffusion to generate images which are uniformly distributed across attributes of interest.
有点类似TIME和UCE那种model editing,但这里只是修改prompt (prompt tuning),不对模型做任何更改,需要提供一个reference image dataset作为attributes of interest。
Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs
对某个较长的caption,使用ChatGPT将其分解为n个sub-caption,再对每个sub-caption进行recaption,并为每个sub-caption在图中分配一个layout。
生成时,分别使用这n个sub-caption进行去噪,之后将每个sub-caption对应的去噪结果按照layout进行resize并重新拼成原来的空间尺寸,为了确保concat边界的一致性,使用原caption输出的latent,与拼出来的latent做插值。
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
类似RPG。
Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG
利用knowledge graph的retrieval-augmented generation。
Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance
Semantic Guidance Tuning for Text-To-Image Diffusion Models
将prompt的score拆成不同concept的score的组合,subject concept的score直接计算,abstract concept的score由正交投影计算,组合时计算不同concept的score和prompt的score的相似度决定weight。
Is Your Text-to-Image Model Robust to Caption Noise?
TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation
Saliency Guided Optimization of Diffusion Latents
TweedieMix的升级版。
DreamWalk: Style Space Exploration using Diffusion Guidance
将prompt分解为不同的子prompt,使用不同子prompt的CFG的线性组合进行生成。
Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else
StableDiffusion
training-free,只在text embedding上做文章。
text中首先出现的concept往往在生成中占主导地位,可能抢占其它concept,并且这些首先出现的concept的token embedding往往有比较大的normalization,通过scale down可以缓解。
某些concept的生成可能和它对应的embedding没关系,而是根据其它embedding生成。计算当前embedding和其它embedding的相似性,用其它embedding的加权和表示当前embedding。
Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function
training-free
Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
training-free
ToMe:
ETS: As the semantic information contained in [EOT] can interfere with attribute expression, we mitigate this interference by replacing [EOT] to eliminate attribute information contained within them, retaining only the semantic information of each subject.
原prompt为
During generation, we compute these two novel losses to update the composite token during each time.
Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models
prompt="a tea cup of iced coke",现有的模型大多生成glass cup而非tea cup,这是因为训练数据中iced coke一般和glass cup一起出现,所以提出Mixture of Concept Expert,让GPT规划先生成tea cup再生成iced coke。
On the Fairness, Diversity and Reliability of Text-to-Image Generative Models
Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
不同的LLM各有优劣,比如encoder-decoder架构的T5和decoder-only架构的GPT,后者在文本理解上更好,但是用它们训练出来的text-to-image模型,后者在图像和文本对齐程度上远没有前者好。
将不同LLM集成在一起:使用不同LLM分别对prompt进行编码,使用refiner融合它们输出的feature,使用融合后的feature训练text-to-image DiT。
Decoder-Only LLMs are Better Controllers for Diffusion Models
由于decoder-only LLM更丰富的语义,使用它编码text训练text-to-image效果会更好。
训练一个MLP将LLM text embedding转换为CLIP text embedding输入预训练的cross-attention,同时类似IP-Adapter训练一个并行的cross-attention。
LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation
Cross-Adapter Module和UNet一起使用diffusion loss训练。
使用LLaVA对数据集中的image进行caption,替代其prompt,进行训练,类似DALLE
SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models
收集一些complex prompt,使用T2I模型生成图像,使用BLIP进行caption,得到simple prompt,得到simple-complex prompt pair。
Train SUR-adapter to transfer the semantic understanding and reasoning capabilities of large language models and achieves the representation alignment between complex prompts and simple prompts. 让LLM编码simple prompt达到complex prompt的图像生成效果。
BeautifulPrompt: Towards Automatic Prompt Engineering for Text-to-Image Synthesis
收集low-quality prompt和high-quality prompt pair的数据集,训练一个语言模型,根据low-quality prompt生成high-quality prompt,使得prompt engineer自动化。
DiffChat: Learning to Chat with Text-to-Image Synthesis Models for Interactive Image Creation
DiffChat can effectively make appropriate modifications and generate the target prompt, which can be leveraged to create the target image of high quality.
LLM和用户对话,根据用户需求,只对prompt进行修改,不涉及image识别。
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
fine-tune一个MLLM,可以改写prompt,生成选择哪个模型,生成推理配置参数。
Optimizing Prompts for Text-to-Image Generation
训练优化LLM成为一个prompt改写模型。
NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation
类似Promptist。
Prompt Refinement with Image Pivot for Text-to-Image Generation
使用HPSv2数据集训练模型refine input prompt。
Repairing Catastrophic-Neglect in Text-to-Image Diffusion Models via Attention-Guided Feature Enhancement
自动检测生成结果中丢失的object,并重写prompt。
AP-Adapter: Improving Generalization of Automatic Prompts on Unseen Text-to-Image Diffusion Models
Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models
相比于negative prompt使用一些抽象的prompt例如low quality和ugly,contrastive prompt针对prompt设计,去除一些形容词,或使用一些反义prompt,比如with改为without。
On Discrete Prompt Optimization for Diffusion Models
利用prompt engineering找到合适的negative prompt。
Our main insight is that prompt engineering can be formulated as a discrete optimization problem in the language space.
To the best of our knowledge, this is the first exploratory work on automated negative prompt optimization.
Improving Image Synthesis with Diffusion-Negative Sampling
DNP:使用
Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
增强属性绑定。
利用语法结构,提取text中的noun phrase(共
SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance
对CLIP text embedding进行adaptation,使得生成图像的语义更准确。
使用NLP parser提取text中的subject-relation-object三元组(可能有多个),每个三元组构成一个scene graph,对于每个scene graph,concat三元组单词的CLIP text embedding,过一个线性层得到scene graph embeding,原CLIP text embedding作为Q,scene graph embeding作为KV,进行cross-attention,得到refined text embedding。计算cross-attention map时,只有Q当前的token属于K当前的scene graph时才计算,其余都mask掉。
Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention
cross-attention map为
cross-attention map为
利用这些发现可以做检测。
缓解memorization:直接调节cross-attention的logits,给begining token的logits乘一个较大的数,让cross-attention score大都集中在begining token上。
Towards Memorization-Free Diffusion Models
Anti-Memorization Guidance:设计了三个防止生成memorization sample的度量函数,求梯度作为guidance。
Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models
We propose to localize memorization of individual data samples down to the level of neurons in DMs’ cross-attention layers.
By deactivating these memorization neurons, we can avoid the replication of training data at inference time, increase the diversity in the generated outputs, and mitigate the leakage of private and copyrighted data.
Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models
MemBench: Memorized Image Trigger Prompt Dataset for Diffusion Models
In this work, we present MemBench, the first benchmark for evaluating image memorization mitigation methods.
ProCreate, Don’t Reproduce! Propulsive Energy Diffusion for Creative Generation
ProCreate operates on a set of reference images and actively propels the generated image embedding away from the reference embeddings during the generation process.
Exploring Local Memorization in Diffusion Models via Bright Ending Attention
In this paper, we identify and leverage a novel ‘bright ending’ (BE) anomaly in diffusion models prone to memorizing training images to address a new task: locating localized memorization regions within these models.
BE refers to a distinct cross attention pattern observed in text-to-image generations using diffusion models.
Memorized image patches exhibit significantly greater attention to the end token during the final inference step compared to non-memorized patches.
Memories of Forgotten Concepts
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models
We propose to only apply guidance in a continuous interval of noise levels in the middle of the sampling chain and disable it elsewhere. On EDM, define a
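A minimal sketch of limited-interval guidance (the interval bounds below are illustrative, not the paper's values):

```python
def guided_eps(eps_cond, eps_uncond, sigma, w=3.0, lo=0.3, hi=2.0):
    """Apply CFG only when the noise level sigma lies inside (lo, hi);
    elsewhere fall back to the plain conditional prediction."""
    if lo < sigma < hi:
        return eps_uncond + w * (eps_cond - eps_uncond)
    return eps_cond
```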
Analysis of Classifier-Free Guidance Weight Schedulers
Simple, monotonically increasing weight schedulers consistently lead to improved performances.
Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance
We argue that the CFG scale should be spatially adaptive, allowing for balancing the inconsistency of semantic strengths for diverse semantic units in the image.
cross-attention map的shape是
The segmentation given by the cross-attention map is coarse, so the self-attention map is used to refine it: multiply the self-attention map with the cross-attention map directly, then apply the operation in step 1 to the result.
进一步优化,计算
将CFG中的
Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model
During the denoising process of the stable diffusion model, the overall shape and details of generated images are respectively reconstructed in the early and final stages of it.
The special token [EOS] dominates the influence of text prompt in the early (overall shape reconstruction) stage of denoising process, when the information from text prompt is also conveyed. Subsequently, the model works on filling the details of generated images mainly depending on themselves.
在early stage使用CFG,在final stage只使用unconditional score,因此减少了final stage一半的计算量。
Plug-and-Play Diffusion Distillation
CFG需要两次forward,计算量太大,因此给模型学习一个guide model作为adapter,与ControlNet对称,将scale作为参数输入,蒸馏CFG。
Segmentation-Free Guidance for Text-to-Image Diffusion Models
对于某个prompt(a dog on a cough in an office),如果在生成时negative prompt就是prompt去掉某个object的话(a dog in an office),那么最终生成的图像中,这个object(cough)就会变的更显著。
利用这一特点,在采样时,可以利用cross-attention map估计出每个pixel对应的object,在prompt中去掉这个object作为这个pixel对应的negative prompt。
因为要利用cross-attention map估计出每个pixel对应的object,所以SFG只在采样后期使用,且
FABRIC: Personalizing Diffusion Models with Iterative Feedback
reference image加噪过UNet,保留所有self-attention的key-value,生成时将这些key-value concat在生成时的self-attention的key-value后进行计算。
高分的reference image作为cfg的conditional,低分的reference image作为cfg的unconditional,使用上述方法进行生成。
Don’t Drop Your Samples! Coherence-Aware Training Benefits Conditional Diffusion
使用CLIP对text-to-image数据集进行相似度打分,经过处理后转换为
训练diffusion model时,将coherence作为额外的条件。
生成时使用coherence score的CFG:
The Chosen One Consistent Characters in Text-to-Image Diffusion Models
形容某个character的不同prompt生成具有相同特征的character。
生成、聚类、用选中的类别(具有相同特征的character的图像)进行LoRA fine-tune。
OneActor: Consistent Character Generation via Cluster-Conditioned Guidance
Create consistent images of the same character.
类似Pix2Pix-Zero,使用一个可训练网络预测text embedding中character word的
SFT是在一个固定的数据集上对模型进行fine-tune,只有text-image pair,类似Emu,鼓励模型在这个text上生成对应的image,一般是收集一个高质量的数据集对模型进行fine-tune。
RLFT让模型根据某个text生成一个image,使用reward model对该image进行打分,优化reward-weighted likelihood maximization,即最大化
一些评判标准有现成的模型,如评判text-image alignment的CLIP,可以作为reward model直接使用,一些评判标准没有现成的模型,如human feedback,此时需要训练一个reward model,一般做法是通过样本之间的rank学习一个reward model(类似CLIP),比如下面的HPS。
DPO避开了reward model的训练,只需要两个样本之间的rank关系就可以训练,所以一般是SFT那样在一个固定的数据集上对模型进行fine-tune。
RLFT是一类方法,RLHF是指评判标准是human feedback并且应用RLFT方法的一种应用。
Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
LLM可以通过在高质量小数据集上fine-tune的方式显著提高模型输出质量,并且不会影响其泛化能力。
假设StableDiffusion本身已经具备生成高质量图像的能力,但并没有被有效发掘,导致生成质量参差不齐,Emu通过人工筛选2000张极高质量的图像对StableDiffusion进行fine-tune,让StableDiffusion保持生成高质量图像的能力,同时不失对文本的泛化性。
early stopping(<15k iterations)避免过拟合。
该方法很通用,还适用于pixel-level diffusion models(Imagen)和masked generative models(Muse)。
Progressive Compositionality In Text-to-Image Generative Models
构造contrastive数据集。
训练时使用正样本计算diffusion loss,额外使用负样本计算一个contrastive loss
DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models
将
为了避免fine-tune过拟合,加了fine-tuned model生成的
Training Diffusion Models with Reinforcement Learning
Policy Gradient fine-tune pre-trained diffusion model,公式和DPOK一样,DDPO和DPOK基本是同一时间放出来的。
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
使用某个prompt
目前只用于模型评测,还未用于RLFT。
Improving Compositional Text-to-image Generation with Large Vision-Language Models
使用Large Vision-Language Models评定生成图像与文本的对齐性,主要是object number, attribute binding, spatial relationship, aesthetic quality四个方面的对齐。
RLFT模型(online)。
Human Preference Score: Better Aligning Text-to-Image Models with Human Preference
Human Preference Dataset (HPD):一个prompt生成多张image,其中一张被用户选为preference。
Train human reference classifier:类似CLIP,分别编码image和text到同一embedding空间,然后计算相似度。
Human Preference Score (HPS):
LoRA fine-tune StableDiffusion:不仅使用high-HPS数据进行fine-tune,还使用low-HPS数据,此时给prompt加一个识别符,在采样时给prompt加一个识别符作为negative prompt。
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
Human Preference Dataset v2 (HPDv2):使用不同数据集的prompt,使用ChatGPT进行过滤,得到一个质量不错的prompt数据集,每个prompt输入不同text-to-image模型生成多张image,人工标注preference。
Train human reference classifier:结构和HPS一样,还是编码然后计算相似度的模型,但训练时针对一个prompt只随机选两张image,更prefer的label为
Human Preference Score v2 (HPSv2)同HPS。
Rich Human Feedback for Text-to-Image Generation
RichHF-18K dataset includes two heatmaps (artifact/implausibility and misalignment), four fine-grained scores (plausibility, alignment, aesthetics, overall), and one text sequence (misaligned keywords).
Aligning Text-to-Image Models using Human Feedback
通过引入人工标注反馈提高image-text对齐程度的fine-tune pre-trained StableDiffusion算法。
StableDiffusion对于一些概念生成还是会时好时坏的,比如count和color,为此可以使用count和color进行造句(可以选其它你认为没有对齐好的概念使用该算法,这里仅以count和color举例),再用每个text生成60多张image,由labeler进行0-1标注,0代表没有对齐(count错了或color错了),1代表对齐。
训练一个reward function,根据上述image和text的CLIP编码去预测对齐程度(输出0~1),用标注数据进行训练,使用MSE Loss;同时使用数据增强方法(prompt classification)提升reward function性能:对每个已经标注为对齐的image-text pair,将text中的count或color进行更改,生成N-1个与imgae非对齐的text,输入image和N个text到reward function中并输出N个预测值,softmax后使用交叉熵进行分类训练。
使用reward function RLFT模型(online)。
Behavior Optimized Image Generation
利用DDPO,align SD with a proposed BoigLLM-defined reward。
Avoiding Mode Collapse in Diffusion Models Fine-tuned with Reinforcement Learning
改进DDPO。
Aligning Diffusion Models by Optimizing Human Utility
PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
现有的T2I模型大都使用预训练的text encoder,且生成时都需要prompt engineering,这都说明text encoder是suboptimal的,所以可以将T2I生成时的不对齐归因于suboptimal text encoder,所以提出使用DDPO LoRA fine-tune text encoder,让text更具visual特征。
还可以搭配上DPOK fine-tune UNet的方法一起使用,效果更佳。可以用于fix hands。
TextCraftor: Your Text Encoder Can be Image Quality Controller
类似于TexForce。
Model-Agnostic Human Preference Inversion in Diffusion Models
使用蒸馏出的一步生成的模型进和打分模型,重参数法优化初始噪声的高斯分布的均值和方差。
对于某个prompt,从标准高斯分布中随机一个噪声,再从重参数法的高斯分布中随机一个噪声,使用模型分别生成两个样本,使用打分模型分别打分,交叉熵优化均值和方差,使得后者得分更高。
可以对某个prompt专门优化,也可以使用prompt数据集进行优化。
SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
LoRA + gradient checkpointing,使用reward function fine-tune StableDiffusion。
Aligning Text-to-Image Diffusion Models with Reward Backpropagation
gradient checkpointing,使用reward function fine-tune StableDiffusion。
Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models
不同采样方法都可以表示为
在使用DRaFT和AlignProp时不再需要gradient checkpointing,直接屏蔽
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
预定义K个指标,训练时随机选择一个指标,在prompt前prepend这个指标的reward-specific identifier,使用DDPO进行训练。
生成时把K个reward-specific identifier concat在一起prepend到prompt。
VersaT2I: Improving Text-to-Image Models with Versatile Reward
ChatGPT生成
不同aspect的reward model各训练出一个LoRA
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
DPOK,类似fine-tune版本的TokenCompose。
Discriminative Probing and Tuning for Text-to-Image Generation
提取StableDiffusion的feature,送入一个Q-Former,使用global matching(CLIP loss)和local grounding(classification,bounding box)任务训练Q-Former。
训练完成后,给StableDiffusion的所有cross-attention加上LoRA,使用相同的loss一起训练Q-Former和LoRA。
生成时进行self-correction,对global matching的CLIP loss求梯度作为guidance。
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
A framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation.
We develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models.
We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and reward models over multiple iterations.
Improving Long-Text Alignment for Text-to-Image Diffusion Models
利用DRTune做长文本对齐。
Fine-tuning Diffusion Models for Enhancing Face Quality in Text-to-image Generation
专为人脸设计的score微调模型。
F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration
Elucidating Optimal Reward-Diversity Tradeoffs in Text-to-Image Diffusion Models
An inference-time regularization inspired by Annealed Importance Sampling, which retains the diversity of the base model while achieving Pareto-Optimal reward-diversity tradeoffs.
Reward Incremental Learning in Text-to-Image Generation
RLHF连续学习,解决遗忘问题。
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Diffusion Model Alignment Using Direct Preference Optimization
将DPO拓展到整个diffusion chain上。
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
Existing works including Diffusion-DPO and D3PO measures the quality according to the final generated image
We build the step-aware preference model by the drawing inspiration from the training process of noisy classifier, which is able to classify noisy intermediate images. We assume the preference order between pair of images can be kept when adding the same noise. After training, the step-aware preference model can be used to predict the preference order among
随机采样一个
step-wise resampler:从
Aligning Few-Step Diffusion Models with Dense Reward Difference Learning
Standard alignment methods often struggle with step generalization when directly applied to few-step diffusion models, leading to inconsistent performance across different denoising step scenarios. SDPO is a novel alignment method tailored for few-step diffusion models.
Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization
为timestep-distilled diffusion model设计的DPO算法。This enables the model to retain its few-step generation ability, while allowing for fine-tuning of its output distribution.
Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward
为timestep-distilled diffusion model设计的DPO算法。
Scalable Ranked Preference Optimization for Text-to-Image Generation
DPO可以利用模型计算score:
Rank Loss:
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, linearly weighted and summed to an interpretable and accurate score.
We develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors within preference data. 只有postive image在所有preference dimension上都优于negative image的pair才会被用于进行DPO训练,防止混淆。
PopAlign: Population-Level Alignment for Fair Text-to-Image Generation
之前的preference都是两个单独的样本之间的比较,PopAlign将其拓展到两个群体样本之间的比较。
Aligning Diffusion Models with Noise-Conditioned Perception
A method that utilizes the U-Net encoder’s embedding space for preference optimization. Perform diffusion preference optimization in a more informative perceptual embedding space.
将Diffusion-DPO的四个diffusion loss改为UNet encoder feature之间的MSE loss。
Curriculum Direct Preference Optimization for Diffusion and Consistency Models
结合了Curriculum Learning的Diffusion-DPO,先学简单的(分差大的)再学难的(分差小的)。
PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation
We propose PatchDPO, an advanced model alignment method for personalized image generation by estimating patch quality instead of image quality for model training.
Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences
If the data is curated according to a reward model, then the expected reward of the iterative retraining procedure is maximized.
Generalizing Alignment Paradigm of Text-to-Image Generation with Preferences through f-divergence Minimization
Prioritize Denoising Steps on Diffusion Model Preference Alignment via Explicit Denoised Distribution Estimation
SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation
We enable the application of DPO for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2.
GFlowNets are a class of probabilistic methods to train a sampling policy
Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets
Bridge Diffusion Model: bridge non-English language-native text-to-image diffusion model with English communities
利用ControlNet实现StableDiffusion的中文控制。
ControlNet输入变为
Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation
用feature之间的L2 loss代替KD loss。
AltDiffusion: A Multilingual Text-to-Image Diffusion Model
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
The pre-trained CLIP model can merely encode English with a maximum token length of
在stage 1时,当text length超过
Mixture of Diffusers for Scene Composition and High Resolution Image Generation
Generation is split into regions, each with its own prompt, and the regions are composed via harmonization.
The key to harmonization: fuse at every step, and adjacent regions must overlap; the overlapping parts are combined by a weighted sum, and harmonization is propagated through these overlaps.
MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation
Similar to Mixture of Diffusers; the difference is that MultiDiffusion operates on the denoised outputs
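A minimal sketch of the per-step fusion common to this family of methods: denoise each overlapping crop separately, then write the results back into the full latent and average the overlaps. `views` and `denoise_view` are placeholders; weighting schemes differ between Mixture of Diffusers and MultiDiffusion.

```python
import torch

@torch.no_grad()
def fused_step(latent, views, denoise_view):
    """One fusion step over overlapping crops; overlapping pixels get a uniform mean."""
    out = torch.zeros_like(latent)
    count = torch.zeros_like(latent)
    for (h0, h1, w0, w1) in views:                     # list of overlapping crop coordinates
        crop = latent[..., h0:h1, w0:w1]
        out[..., h0:h1, w0:w1] += denoise_view(crop)   # e.g. one reverse step of the base model
        count[..., h0:h1, w0:w1] += 1
    return out / count.clamp(min=1)
```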
DemoFusion: Democratising High-Resolution Image Generation With No $$$
分辨率由低到高进行生成:对上一分辨率的生成结果进行上采样,再进行diffuse,使用MultiDiffusion进行生成,生成过程中的latent与diffuse得到的latent进行系数为
对MultiDiffusion进行改进,受ScaleCrafter的启发,rather than dilating the convolutional kernel, we directly dilate the sampling within the latent representation,称为dilation sampling,比如图中diffusion model的原生成尺寸是
不进行dilation sampling的为
代码中的dilation sampling就是如右下图中只取了四块。
MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning
cascade的思想。
循环进行:perform diffusion sampling starting from
类似ScaleCrafter,采样时将standard convolution layer改造成dilated convolution layer提高感受野。
类似RDM,不同分辨率使用不同的noise schedule进行diffuse,以方便relay。
FAM: Diffusion Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion
类似DemoFusion。
使用low-res的高频部分来保持structure,high-res的低频部分来refine detail。
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
和DemoFusion一个范式。
restrained dilated convolution: 去噪高分辨率latent时,类似ScaleCrafter使用dilated convolution,but we only apply dilated convolution in the layers of down-blocks and mid-blocks。
scale fusion: 去噪高分辨率latent时,直接计算self-attention称为global attention,类似MultiDiffusion那样分成UNet原分辨率patch进行计算的self-attention称为local attention,
AccDiffusion: An Accurate Method for Higher-Resolution Image Generation
MultiDiffusion这种分patch进行采样再组合的方法很容易出现object repetition的问题,主要原因是不同patch在生成时都是用了相同的prompt,所以每个patch都被迫使去生成prompt中的object,AccDiffusion decouples the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of an image patch.
AccDiffusion introduces dilated sampling with window interaction for better global consistency in higher-resolution image generation.
AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation
低分辨率生成,上采样后提取每个patch的canny,ControlNet控制配合patch-content-aware prompt进行生成。
ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance
先使用原模型根据prompt生成一张低分辨率的图,上采样到目标高分辨率,分成
将高分辨率生成过程的
fine-grained guidance:对
structural guidance:根据
HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
类似ResMaster。
StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control
加速MultiDiffusion。
SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions
MultiDiffusion只能保证相邻的子区域的图片风格一致,无法保证全局风格一致。
选一个子区域作为锚点,每一步去噪前,计算所有子区域的
SyncTweedies: A General Generative Framework Based on Synchronized Diffusions
MultiDiffusion和SyncDiffusion对应case 3。
本文发现case 2效果最好。
Learned representation-guided diffusion models for large-image generation
用图像的某个patch和这个patch对应的预训练SSL模型提取的feature训练diffusion model。
生成时先生成feature,再利用MultiDiffusion的方法,逐个patch进行overlap生成。
CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method
分为两个阶段
第一阶段
第二阶段
ASGDiffusion: Parallel High-Resolution Generation with Asynchronous Structure Guidance
多GPU并行加速CutDiffusion。
InstantAS: Minimum Coverage Sampling for Arbitrary-Size Image Generation
MultiDiffusion慢的原因是相邻patch需要overlap以传递信息。
InstantAS使用non-overlap的patch进行生成,每一步生成后重新划分patch,这样既传递了信息,又加快了速度。思想有点类似GoodDrag,边生成边优化。
Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis
根据attention entropy理论,只需要修改attention的scaling factor就可以使模型生成不同大小的图片。
DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance
先使用StableDiffusion在原分辨率进行采样,得到
上述方法可以重复进行,直到得到目标分辨率的图像。
Low-frequency component represents the low-frequency details of the image, encompassing global structures, uniformly-colored regions, and smooth textures. 所以该方法又称为DWT-based Structure Guidance,避免了使用StableDiffusion直接生成高分辨率图像时structure的不和谐。
ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models
A pre-trained StableDiffusion cannot directly generate images at higher resolutions because the receptive field of its convolution kernels is limited.
We propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference.
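A minimal sketch of the re-dilation idea: at inference time, enlarge the dilation (and padding) of 3x3 convolutions so their receptive field matches the higher test resolution. The actual method re-dilates only selected layers and timesteps; this blanket version is illustrative only and assumes zero-padded convolutions.

```python
import torch.nn as nn

def redilate_convs(unet: nn.Module, factor: int = 2):
    """Inference-time re-dilation of 3x3 convolutions (simplified, applies to all of them)."""
    for m in unet.modules():
        if isinstance(m, nn.Conv2d) and m.kernel_size == (3, 3):
            m.dilation = (factor, factor)
            m.padding = (factor, factor)   # keep the spatial size unchanged
    return unet
```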
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
Similar to ScaleCrafter, the problem is attributed to the convolution kernels: when generating higher-resolution images, the feature maps are low-pass filtered and the convolution kernels are dilated.
HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models
The generated image is highly correlated with the feature map of deep Blocks in structures and feature duplication happens in the deep Blocks. As the higher-resolution feature size of deep blocks is larger than the corresponding size in training, these blocks may fail to incorporate feature information globally to generate a reasonable structure. We contend that if the size of the higher-resolution features of deep blocks is reduced to the corresponding size in training, these blocks can generate reasonable structural information and alleviate feature duplication. Inspired by this motivation, we propose Resolution-aware U-Net (RAU-Net), a simple yet effective method to dynamically resize the features to match the deep blocks.
RAD根据输入分辨率调整第一个conv层的dilation rate,使输出的feature size匹配原模型训练时的feature size。
RAU根据输入分辨率将最后一个conv层前的插值倍数,使输出的feature size匹配当前分辨率。
Both RAD and RAU do not introduce additional trainable parameters. Therefore, RAD and RAU can be integrated into vanilla U-Net without further fine-tuning.
Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images
VAE不动,LoRA fine-tune StableDiffusion,预定义一些长宽比,每个长宽比对应一个图像长宽,训练时,根据图像长宽比找到一个最近的预定义长宽比,将图像resize到其对应的图像长宽,进行训练。这样就可以给定任意长宽比的噪声生成图像。
利用StableSR的tiled sampling进行超分,类似MultiDiffusion,可以超分到任意分辨率。
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation
定义一个递增的分辨率序列,只需要最低分辨率上训练好的diffusion model。
训练时,任选一个分辨率的
采样时,先从最低分辨率采样得到样本,加噪到某一中间步后上采样到下一个更高的分辨率,继续采样,以此循环,直到最高分辨率。
DiffCollage: Parallel Generation of Large Content with Diffusion Models
考虑一张组合图
对应的score为
可以分别训练两个模型,一个拟合原始图像
ElasticDiffusion: Training-free Arbitrary Size Image Generation
扩散模型在
CFG采样公式
对于unconditional score,之前的方法都是带overlap的分patch采样,在overlap处取平均,(每个patch
对于class direction score,将
由于计算class direction score时上下采样都使用nearest-neighbors mode,所以
Reduced-Resolution Guidance:使用unconditional score估计出一个
MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising
MultiDiffusion
FiT: Flexible Vision Transformer for Diffusion Model
BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
针对有pose和prompt的高分辨率人物生成。
direct: 使用一个已有的token,对其token embedding进行优化或适配
transform: 由一个网络将视觉信息转换为token embedding或residual
attach: 附在已有prompt之后
no pseudo word: 不需要使用已有的token或新添加token
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
StableDiffusion
The diffusion loss optimizes only the token embedding (the embedding before the text encoder).
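A minimal sketch of one Textual Inversion step; the call signatures of `unet`, `text_encoder`, and `add_noise` are assumptions for illustration, not a specific library API. The key point is that gradients are kept only for the new token's embedding row.

```python
import torch
import torch.nn.functional as F

def textual_inversion_step(unet, text_encoder, add_noise, embeddings,
                           new_token_id, prompt_ids, x0, optimizer,
                           num_timesteps=1000):
    """One Textual Inversion step (sketch): only the embedding row of the new
    pseudo-token receives gradients; UNet and text encoder stay frozen."""
    t = torch.randint(0, num_timesteps, (x0.shape[0],))
    noise = torch.randn_like(x0)
    xt = add_noise(x0, noise, t)                      # forward diffusion q(x_t | x_0)
    cond = text_encoder(prompt_ids)                   # prompt contains the new <sks> token
    loss = F.mse_loss(unet(xt, t, cond), noise)       # epsilon-prediction loss
    loss.backward()
    grad = embeddings.weight.grad                     # (vocab_size, dim)
    mask = torch.zeros_like(grad)
    mask[new_token_id] = 1.0                          # keep only the new token's row
    grad.mul_(mask)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```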
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Imagen
[V] class, where [V] is a rare token
A token embedding alone has limited expressive power and does not work well, so DreamBooth optimizes the token embedding while also fine-tuning the whole model (including the text encoder).
Fine-tuning suffers from overfitting and language drift, so the Class-specific Prior Preservation Loss is proposed: similar to replay in continual learning, samples generated by the original model are mixed with the new samples as the training set to prevent overfitting.
An improved variant fine-tunes the diffusion model with LoRA.
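A sketch of the combined objective, assuming an epsilon-prediction UNet with signature `unet(x_t, t, cond)` (an assumption for illustration):

```python
import torch.nn.functional as F

def dreambooth_loss(unet, noisy_subj, t_subj, cond_subj, target_subj,
                    noisy_prior, t_prior, cond_prior, target_prior,
                    prior_weight=1.0):
    """DreamBooth objective (sketch): reconstruction loss on the subject images
    plus a class-specific prior-preservation loss on images generated by the
    original model for the plain class prompt (e.g. "a photo of a dog")."""
    loss_subj = F.mse_loss(unet(noisy_subj, t_subj, cond_subj), target_subj)
    loss_prior = F.mse_loss(unet(noisy_prior, t_prior, cond_prior), target_prior)
    return loss_subj + prior_weight * loss_prior
```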
Multi-Concept Customization of Text-to-Image Diffusion
StableDiffusion
[V] class
同时训练token embedding和cross-attention KV projection matrix。类似DreamBooth,构造一个regularization dataset解决language drift问题。相当于只fine-tune cross-attention KV projection matrix的StableDiffusion版本的DreamBooth。
可以同时在多组reference images上进行训练,生成时可以使用多个pseudo words构造prompt。
DreamBooth++: Boosting Subject-Driven Generation via Region-Level References Packing
DreamBooth
组图,修改UNet的计算方式,convolution和self-attention的计算限制在各自的region内,训练时优化pseudo word embedding并且fine-tune整个UNet。
除了DreamBooth的两个loss,还加了一个cross-attention map之间的MSE loss。
An Improved Method for Personalizing Diffusion Models
StableDiffusion
[V] class
借鉴Imagic的两阶段训练法,第一阶段只训练token embedding,第二阶段只fine-tune diffusion model。
ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation
StableDiffusion
将reference image作为visual condition引入网络。
使用
使用reference image的text cross-attention map估计出一个mask,用这个mask过滤KV,只保留mask内的KV(KV长度变小),让Q只与有object的KV进行计算。
HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
使用CelebA-HQ数据集,训练一个HyperNetwork预测StableDiffusion的所有attention层的LoRA参数去重构图像。StableDiffusion输入统一的"a [V] face"的prompt,其中"[V]"是稀有单词,这里不优化"[V]"的token embedding,因为作者发现只需要LoRA参数,就可以用"[V]"随意造句进行生成了。
测试时,先使用HyperNetwork预测LoRA参数作为初始化,然后再进行LoRA fine-tune,fine-tune速度比DreamBooth快25倍。
HyperNetwork架构类似Q-Former,使用迭代法从零初始化的参数生成最终参数,预测出的LoRA参数加在StableDiffusion上计算diffusion loss优化HyperNetwork。
HyperNet Fields Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories
DiffLoRA: Generating Personalized Low-Rank Adaptation Weights with Diffusion
LoRA Diffusion: Zero-Shot LoRA Synthesis for Diffusion Model Personalization
diffusion model生成LoRA参数。
P+: Extended Textual Conditioning in Text-to-Image Generation
StableDiffusion
定义P+空间:UNet每层cross-attention使用的text embedding的集合。不同层使用不同text embedding有不同的效果。
P+空间的TI:对于某个concept,不同cross-attention层使用不同token embedding进行优化,在StableDiffusion中就是16个不同的token embedding。
只优化token embedding,不优化模型参数。
不同层输入不同concept的TI得到的token embedding,还可以达到semantic composition的效果。
CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization
P+空间的TI
reference image作为positive sample,找一些同类的其它图片作为negative sample。
Textual QFormer根据sample提取P+空间的token embedding序列,Visual QFormer根据sample提取visual feature,两个QFormer的query里都有一个cls token,使用两个QFormer的cls token位置的输出计算contrastive loss。
Visual QFormer提取到的reference image的feature以IP-Adapter的形式引入diffusion model。
A contrastive loss is applied to the token-embedding sequences in the P+ space: pull together the token embeddings of the same cross-attention layer across positive samples, and push apart those between positive and negative samples at the same cross-attention layer.
The model is trained with the diffusion loss and the two contrastive losses.
使用multi-view数据集训练,同一个物体的使用三个view,第一个作为reference image,第二个作为reference image的positive sample,第三个作为reconstruction target。
ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models
StableDiffusion
定义
不同阶段使用不同reference的token embedding,可以实现material、style、layout的transfer生成与编辑。
A Neural Space-Time Representation for Text-to-Image Personalization
StableDiffusion
P+空间是spatial层面的扩展,但diffusion model不同时间步性质表现都不同,所以在时间维度上继续扩展P+空间到
During optimization, the outputs of the neural mapper are unconstrained, resulting in representations that may reside far away from the true distribution of token embeddings typically passed to the text encoder. We set the norm of the network output to be equal to the norm of the embedding of the concept’s “supercategory” token. 例如学习一个cat相关的concept,最终输出为
neural mapper是一个MLP,其最后一个hidden layer前的hidden latent
Inverting a concept directly into the UNet’s input space, without going through the text encoder, could potentially lead to much quicker convergence and more accurate reconstructions. 所以让neural mapper输出两个向量,一个向量是token embedding,和其他单词一起送入CLIP text encoder,另一个向量不过CLIP text encoder,而是直接加在该CLIP text encoder输出的text embedding上,同样使用上面的normalization,防止过拟合。但是这个额外的向量只加在UNet的cross-attention层的value上,key使用不加额外向量的text encoder的输出,原理同key-locking。
Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization
将原图分解为低频和高频分量,分别对应三个pseudo word embedding,原图的pseudo word embedding等于低频和高频分量的pseudo word embedding之和,训练时从低频分量、高频分量和原图三者中随机选一个。通过分解并且分别学习,学习效果更好。
生成时使用原图的pseudo word embedding,可以结合style描述进行生成,效果比别的方法要好。
Key-Locked Rank One Editing for Text-to-Image Personalization
StableDiffusion
The two main goals of personalization are to avoid overfitting and to preserve the identity, but there is an inherent trade-off between them; to improve both of these goals simultaneously, our key insight is that models need to disentangle what is generated from where it is generated.
cross-attention中key决定了where it is generated,value决定了what is generated,所以fine-tune diffusion model时只训练
A natural solution is then to edit the weights of the cross-attention layers,
Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting
继承PerFusion的思想,训练和生成都有两条generative trajectory(C和F),使用F的cross-attention map取代C的cross-attention map,只训练pseudo word embedding和
Cross Initialization for Personalized Text-to-Image Generation
TI使用supercategory初始化token embedding
查看token embedding
对于某个token embedding
这说明TI优化是目标是
Learning to Customize Text-to-Image Diffusion In Diverse Context
利用MLM加强pseudo word embedding的语言特性。
A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization
DreamBooth。
构造更好的regularization dataset。
User-Friendly Customized Generation with Multi-Modal Prompts
使用BLIP和ChatGPT构造更好的regularization prompt、customized prompt和generation prompt。
Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization
DreamBooth。
Similar to DP, training uses descriptions that are as detailed as possible, which reduces the bias toward irrelevant content in the pseudo word.
作者总结了几种常见的bias,利用VLM生成含有这些bias描述的句子。
Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance
改进采样过程,
AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models
不同方法在不同stage进行操作,比如TI在stage 1训练token embedding,CatVersion在stage 2训练text encoder,CustomDiffusion在stage 3训练cross-attention KV projection matrix。这三类方法最终都是为了修改最后的KV,送入cross-attention影响最后的图像。
对于现有的方法,对比模型关于"a cat playing with ball"和"a <sks> playing with ball"的cross-attention map会发现,由于pseudo word的引入,其它没有变的word的cross-attention map也被影响了,这是这些方法效果不好的原因。
生成时,分别使用原模型+"a cat playing with ball"和customized model+"* <sks> * * *"进行生成,将前者的pseudo word的cross-attention map替换为后者,可以使用在多种TI方法上。
PALP: Prompt Aligned Personalization of Text-to-Image Models
LoRA版的DreamBooth
test-time fine-tune时,不仅要提供reference image,还要提供生成时需要的prompt,比如"a sketch of [V]",即每次生成前都要进行fine-tune。
One problem with personalization is overfitting: an overfitted model can predict the subject's shape and features from pure noise in a single step. Our key idea is to encourage the model's denoising prediction towards the target prompt.
除了diffusion loss,加入SDS loss,让根据prompt的预测靠近根据clean prompt的预测。
CLiC: Concept Learning in Context
StableDiffusion
Custom-Diffusion的RoI版本,对RoI区域的物体进行TI,同时优化cross-attention KV projection matrix。
SDEdit + Blended进行编辑。
MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models
Visual Concept-driven Image Generation with Text-to-Image Diffusion Model
使用CLIP text encoder编码super class name初始化token,然后使用EM算法优化。
E-step:随机选择50个步数,对reference image加噪,和带pseudo word的prompt一起送入StableDiffusion,提取pseudo word对应的cross-attention map,取平均,阈值法求出一个mask。
M-step:使用上述mask,masked diffusion loss + masked cross-attention loss优化pseudo word embedding。
Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning
Diffusion-DPO fine-tune diffusion model,目标是让模型在以含有pseudo word的prompt为条件时,更加prefer reference image。
similar loss就是reference image上的diffusion loss。
CustomSketching: Sketch Concept Extraction for Sketch-based Image Synthesis and Editing
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation
多reference image的IP-Adapter。
DreamTuner: Single Image is Enough for Subject-Driven Generation
类似ViCo的思想,将reference image的特征引入StableDiffusion就能进行subject-driven generation。
Subject-Encoder:为了解耦内容和背景特征,使用Salient Object Detection去除背景;为了解耦内容和位置特征,可以用预训练的ControlNet引入位置信息,这样学到的都是content特征。
Subject-Encoder-Attention:StableDiffusion的self-attention和cross-attention之间插入一个可训练的cross-attention层(S-E Attention),对reference image进行重构,reference image的self-attention附加到generated image的self-attention中提供参考。
Self-Subject-Attention:The features of reference image extracted by the text-to-image U-Net model are injected to the self-attention layers, which can provide refined and detailed reference because they share the same resolution with the generated features. 生成时每一步直接对reference image随机加噪,输入UNet,提取self-attention layers的key和value,与生成时的self-attention layers的key和value进行如上交互。
With the above method, personalization is possible even without training a pseudo word embedding, but training a pseudo word embedding plus fine-tuning the diffusion model as in DreamBooth works better.
FreeTuner: Any Subject in Any Style with Training-free Diffusion
类似DreamTuner。
three feature swap operations:1) cross-attention map swap: 将reconstruction branch的subjected-related cross-attention map注入personalized branch,如这里的horse。 2) self-attention map swap: 将reconstruction branch的self-attention map的
如果还有style image,使用VGG-19提取feature计算相似度,求梯度作为guidance。
SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
类似ViCo的思想,将reference image的特征引入StableDiffusion就能进行subject-driven generation。
使用CLIP text encoder编码query,使用CLIP image encoder编码reference image得到sequence feature,两个feature计算得到cross-attention map,提取CLIP image encoder不同层的sequence feature作为V,共
使用text-image pair自监督训练,提取关键词作为query,
这说明CLIP image encoder编码图像得到的sequence feature也是可以用于计算相似度的,不只是CLS token feature可以。
Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
diptych: 双连画。
DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
针对TI生成过程的优化,对于不同TI方法都适用,如DreamBooth和CustomDiffusion等。
self-attention有两个作用,一是由QK计算出的attention map控制的图像结构,二是由V控制的visual attributes,如颜色、纹理。
TI方法生成的图像,concept的结构都比较好,但是一些具体细节,如颜色、纹理,都和reference image中的concept有出入,所以本方法通过修改TI生成过程中self-attention的V做appearance保持。
具体做法是dual branch,先对reference image进行DDIM Inversion再重构,得到reconstructive trajectory,另一条从随机噪声出发,带pseudo word的prompt为条件,得到generative trajectory。由于生成图像中concept的位置不确定,和reference image中的concept位置不一致,所以直接用reconstructive trajectory中的V替换generative trajectory中对应的V会出现位置不匹配的问题,所以使用两条trajectory的UNet decoder的一些feature做semantic corresponce,根据semantic corresponce计算出dense displacement field,根据dense displacement field对reconstructive trajectory中的V做warp,使用warp后的V替换generative trajectory中对应的V。
Is This Loss Informative? Speeding Up Textual Inversion with Deterministic Objective Evaluation
提出一种early stopping criterion,加速TI接近15倍,并且效果没有明显下降。
Generate Anything Anywhere in Any Scene
DreamBooth学到的word也可以用在GLIGEN这种plug-and-play模型。但DreamBooth的一个缺点是不能解耦object和位置的信息,使用GLIGEN这种有额外layout信息的模型进行生成时,一旦修改了位置,就无法很好的生成object。
Train DreamBooth with data augmentation: by incorporating a data augmentation technique that involves aggressive random resizing and repositioning of training images, PACGen effectively disentangles object identity and spatial information in personalized image generation.
Compositional Inversion for Stable Diffusion Models
existing methods often suffer from overfitting issues, where the dominant presence of inverted concepts leads to the absence of other desired concepts. It stems from the fact that during inversion, the irrelevant semantics in the user images are also encoded, forcing the inverted concepts to occupy locations far from the core distribution in the embedding space.
Textual Inversion will make the new (pseudo-)embeddings OOD and incompatible to other concepts in the embedding space, because it does not have enough interactions with others during the post-training learning。加入正则项,使得学到的embedding和一些已知的且相关的concept的embedding不要太远,比如给定和猫相关的reference images时,使得学到的embedding和cat, pet等的embedding靠近。这样学到的embedding更具一般性,和其他单词组合造句时就像用cat造句一样,模型可以识别,也可以和其他学到的embedding组合造句进行multi-concept generation。
CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization
相同的
SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
如果用传统方法学到的pseudo word进行造句,比如"[V] is running",模型不能正确生成running,但如果使用base class造同样的句子却可以正确生成,这说明学出来的pseudo word并不能继承base class的属性。
为TI引入正则,让学出来的pseudo word继承base class的属性,最小化
Enhancing Detail Preservation for Customized Text-to-Image Generation A Regularization-Free Approach
使用不加任何正则项的TI得到token embedding。
之前的工作加正则项是为了防止过拟合,但也导致了信息提取不充分。本论文提出Fusion Sampling解决这一问题。
DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation
DreamBooth
之前的工作如TI和DreamBooth都是为reference images优化一个token,DisenBooth除此之外还为每张reference image编码一个独立的subject-unrelated token,这样有助于学习到所有reference images共有的subject的特征,而忽略每张reference image其它细节(如背景等)。
使用LoRA进行fine-tune。
DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Contrastive Prompt-Tuning
类似DisenBooth
借鉴classifier-free guidance,学习两个pseudo-words,positive pseudo word用于提取主要特征(相当于
At generation time, the output conditioned on the negative pseudo word is used as the unconditional branch u for CFG.
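In standard CFG notation (my notation, not the paper's), with $c_{\text{pos}}$ and $c_{\text{neg}}$ the prompts carrying the positive and negative pseudo words:

$$\hat{\epsilon} = \epsilon_\theta(x_t, c_{\text{neg}}) + w\,\big(\epsilon_\theta(x_t, c_{\text{pos}}) - \epsilon_\theta(x_t, c_{\text{neg}})\big)$$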
StyO: Stylize Your Face in Only One-Shot
StableDiffusion
one-shot face stylization: applying the style of a single target image to the source image。
构造content和style单词,使用三个数据集进行TI,同时也fine-tune StableDiffusion,其中target和source都只有一张图像。
之后使用该prompt进行生成:
DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models
有点类似De-Diffusion,但不是显式的caption。
生成时采样即可。
SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing
StableDiffusion
借鉴Break-A-Scene,使用DINO或SAM对intended concept做分割,使用masked diffusion loss进行训练。
两阶段训练:第一阶段做TI,只训练image encoder;第二阶段fine-tune encoder+diffusion model。
输入不带pseudo word的text计算
ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
StableDiffusion
两阶段训练:global和local。
global:使用CLIP作为feature extractor提取reference image feature,使用一个global mapping network将CLIP不同层的feature映射为不同token embedding,最深层的feature预测的token embedding对应subject-related information,浅层的feature预测的token embedding对应subject-unrelated information,同时训练global mapping network和cross-attention KV projection matrix。
local:去除reference image背景,使用CLIP作为feature extractor提取其feature,使用一个local mapping network将CLIP feature映射为token embedding,这里只使用最深层的word,额外添加一组cross-attention KV projection matrix进行训练,同时训练local mapping network和new cross-attention KV projection matrix。此时cross-attention的输出是global与local cross-attention的输出的和,global cross-attention依然使用global阶段生成的token embedding作为输入,且只使用最深层的word,但不参与训练。这一阶段类似LoRA,让模型将更多细节绑定到global阶段生成的word embedding上。
Designing an Encoder for Fast Personalization of Text-to-Image Models
StableDiffusion
Textual Inversion shows that the word embedding space exhibits a trade-off between reconstruction and editability. This is because more accurate concept representations typically reside far from the real word embeddings, leading to poorer performance when using them in novel prompts. StyleGAN inversion has the same problem, and uses a two-step solution which consists of approximate-inversion followed by model tuning. The initial inversion can be constrained to an editable region of the latent space, at the cost of providing only an approximate match for the concept. The generator can then be briefly tuned to shift the content in this region of the latent space, so that the approximate reconstruction becomes more accurate.
每个domain (like face, cat, dog, etc.)训练一个编码器
使用
每个domain先在各自的大数据集上进行预训练,再在给定的几张图像上进行test-time fine-tuning,都用一样的训练方法。
Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models
引入contrastive-based regularization technique,让encoder可以处理不同domain的数据。
Cones: Concept Neurons in Diffusion Models for Customized Generation
StableDiffusion
training-free
对每一组concepts,在cross-attention层的KV参数中,找到那些屏蔽掉后能够降低DreamBooth Loss(Reconstruction Loss+Preservation Loss)的神经元(Concept Neurons),不用训练,直接屏蔽掉这些神经元,就能得到对这组concepts敏感的text2img模型。pseudo word用一些已有但不常用的单词,比如AK47等。
Cones 2: Customizable Image Synthesis with Multiple Subjects
对于某个class的subject,学习一个该class的token的residual token embedding。做法是TI训练text encoder,但这样会使整个句子中的单词偏向subject。加入正则项:使用ChatGPT对class进行造句,分别使用训练后的text encoder和原text encoder对每个句子进行编码,使得句子中非class的单词的token embedding训练前后尽量靠近。最后的residual token embedding也是所有造句中class token embedding的差的均值。(注意,某个单词单独的embedding和其在句子中的embedding是不同的)
这样每个residual token embedding都是可重复利用的,也可以和别的residual token embedding同时使用,还可以操作cross-attention map指定concept的位置。
Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion
StableDiffusion
为参考图写一句话,但不包含pseudo word,而是利用text embedding后面的空位,加上personalized embedding,训练时只优化personalized embedding。
HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models
和HiPer类似,优化最后5个embedding。
CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization
StableDiffusion
将base class word(如dog)输入CLIP,在CLIP的最后3个self-attention层,给key和value分别concat一个可训练的residual embedding,即
Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning
在reference image上利用RL直接fine-tune StableDiffusion。
reward定义为diffusion loss。
Subject-driven Text-to-Image Generation via Apprenticeship Learning
Imagen
对每个concept,使用
这样,使用训练好的大模型,给定3-10张unseen concept的图片和这个concept对应的文本,使用这个文本随便构造prompt,就可以生成和prompt和3-10张unseen concept图片都对齐的图像。
Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
Imagen
只适用于人脸和动物等domain的个性化,并不能做到open domain的个性化。
对于每个domain,使用该domain的数据集进行训练:去除每张image的背景,训练一个object encoder提取object特征,并使用caption模型生成image的text,使用object特征和text两个条件fine-tune Imagen,使用一些正则防止过拟合。
训练好的模型可以根据reference image的物体特征和用户写的prompt自由生成。
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
StableDiffusion
类似Object Encoder,只适用于人脸和动物等domain的个性化,并不能做到open domain的个性化。
对于每个domain,使用该domain的数据集进行训练:把每张image看成一个concept进行训练,训练一个encoder编码image得到两个特征,一个concept特征,一个visual特征,concept特征替换text embedding中pseudo word所在位置的embedding,同时将visual特征通过GLIGEN引入StableDiffusion,同时训练encoder和GLIGEN的adapter,使用数据增强和去除背景等方法防止过拟合。并不优化pseudo word的token embedding。
推理时可以使用pseudo word构造prompt,encoder编码reference images得到的concept特征取均值后替换text embedding中pseudo word所在位置的embedding。
Instruct-Imagen: Image Generation with Multi-modal Instruction
Re-Imagen + Instruction Tuning
Re-Imagen的目的是为了让模型condition on multi-modal input
BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models
类似DreamTuner,训练网络直接识别reference image就可以直接生成,不需要test-time fine-tuning。训练整个Reference UNet和Base UNet的self-attention layers里的四个矩阵。
造数据进行自监督训练。
JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
类似M2M的image sequence生成方法。
Break-A-Scene: Extracting Multiple Concepts from a Single Image
提取一张图中多个concept。
给定有分割标注的图片,一次性提取图片中不同object的pseudo word,利用masked diffusion loss + masked cross-attention loss进行训练。
Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models
从
use the combination of the score of different concepts (a learnable word embedding) to reconstruct images using diffusion loss.
ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction
无监督版的Break-A-Scene。
利用聚类得到多实例的大致分割,为每个实例分配一个可学习的token embedding,使用masked diffusion loss进行学习。
使用了对比损失和正则项进行辅助和增强。
Attention Calibration for Disentangled Text-to-Image Personalization
CustomDiffusion
提取一张图中多个concept。
suppress:cross-attention map的平方(element-wise multiplication),抑制low response,增强high response。
AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization
CustomDiffusion
提取一张图中多个concept。
先不使用pseudo word,使用concept对应的class word,在某个较小的时间步,使用DatasetDiffusion的方法提取每个concept的mask,使用CustomDiffusion的方法学习时,优化每个pseudo word的cross-attention map和对应的mask之间的KL散度。
UnZipLoRA: Separating Content and Style from a Single Image
用ZipLoRA的方法直接从reference image中学
Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models
TI和P+这种只优化token embedding的方法,如果reference image是in-domain的,那就够用了,但如果是out-of-domain reference image效果就不好了。
(c) 对于token embedding和模型参数都优化的方法(如DreamBooth和CustomDiffusion),如果只用优化好的token embedding和原模型参数进行生成,生成的都较为相似,说明token embedding捕捉的还是in-domain的信息,out-of-domain的信息蕴藏在更新的模型参数中。
(d) 为了将更多的信息转移到token中,采用P+的layer-wise embedding并使用multi-word embedding。
单个concept的学完后,如何融合多个LoRA参数
LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models
解决Mix-of-Show中多个单独训练的pseudo word和LoRA如何一起参与生成的问题,如果像Mix-of-Show中训练一个额外的LoRA融合所有LoRA,会导致concept confusion和concept vanishing。
training-free方法,需要提供不同concept的bounding box。
将prompt拆分为local prompt,每个local prompt一个concept(pseudo word),每一步生成时,
ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
正则项
LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
loss和ZipLoRA完全一样,但是训练一个HyperNetwork,输入
使用大量不同LoRA训练,这样可以zero-shot,不需要像ZipLoRA那样每对LoRA都要重新训练。
LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
Given a pre-trained LoRA
Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models
Continual Learning setup: a user may wish to personalize a model on multiple concepts but one at a time, with no access to the data from previous concepts due to storage/privacy concerns.
MultiBooth: Towards Generating All Your Concepts in an Image from Text
思想类似LoRA-Composer。
MC2: Multi-concept Guidance for Customized Multi-concept Generation
类似LoRA-Composer,解决Mix-of-Show中多个单独训练的pseudo word和LoRA如何一起参与生成的问题,如果像Mix-of-Show中训练一个额外的LoRA融合所有LoRA,会导致concept confusion和concept vanishing。
training-free方法,但不需要提供不同concept的bounding box。
将prompt拆分为local prompt,每个local prompt一个concept(pseudo word),每一步生成时,
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models
类似LoRA-Composer,解决Mix-of-Show中多个单独训练的pseudo word和LoRA如何一起参与生成的问题,是一种通用的方法,可以用在不同TI方法上,甚至不同TI方法训练出来的pseudo word和LoRA也可以一起生成。
使用两阶段进行生成。第一阶段先用general class word替代pseudo word,使用原StableDiffusion进行生成,保留生成过程中所有general class word对应的cross-attention map,使用SAM得到生成结果中general class word对应的mask;第二阶段和第一阶段进行一样的生成过程,但在每一步,对于每个concept,使用pseudo word和对应的LoRA进行生成,所有concept预测的噪声使用第一阶段的mask进行blending,同时也使用第一阶段的cross-attention map替换pseudo word对应的cross-attention map,以做到layout preservation。
FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition
training-free方法,需要提供不同concept的mask。
MRSA:inject KV of self-attention in reference path into composition path。
Orthogonal Adaptation for Modular Customization of Diffusion Models
LoRA fine-tune时,不同concept使用互相正交的B,固定B,只训练A,这样学到的多个concept可以同时生成,正交性使得不同concept的LoRA参数可以直接相加在一起使用。
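A sketch of this construction, assuming each concept receives disjoint columns of one random orthogonal matrix as its frozen B (the paper's exact construction may differ):

```python
import torch

def make_orthogonal_B(d_out, rank, n_concepts, seed=0):
    """Give each concept a fixed B whose columns are mutually orthogonal to
    every other concept's, by slicing disjoint columns of a random orthogonal
    matrix (illustrative sketch)."""
    g = torch.Generator().manual_seed(seed)
    Q, _ = torch.linalg.qr(torch.randn(d_out, d_out, generator=g))
    return [Q[:, i * rank:(i + 1) * rank] for i in range(n_concepts)]

def merged_lora_delta(Bs, As):
    """Because the B_i are mutually orthogonal, the per-concept updates
    B_i @ A_i can simply be summed at merge time without interference."""
    return sum(B @ A for B, A in zip(Bs, As))

d_out, d_in, rank = 320, 768, 4
Bs = make_orthogonal_B(d_out, rank, n_concepts=3)        # frozen, one per concept
As = [torch.randn(rank, d_in) * 0.01 for _ in range(3)]  # trained, one per concept
delta_W = merged_lora_delta(Bs, As)                      # added to the base weight
```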
Mixture of LoRA Experts
解决Mix-of-Show中多个单独训练的pseudo word和LoRA如何一起参与生成的问题,如果像Mix-of-Show中训练一个额外的LoRA融合所有LoRA,会导致concept confusion和concept vanishing。
类似MoE,训练一个gating function,其根据LoRA的输出计算一个gating value,使用gating value线性组合不同LoRA的输出,使用训练LoRA时的数据和loss进行训练。
Break-for-Make: Modular Low-Rank Adaptations for Composable Content-Style Customization
DreamBooth+LoRA(加在cross-attention上),需要同时学两个pseudo word(不同的reference image),一个是content,一个是style。有两个baseline:一个是公用同一个LoRA联合训练,另一个是分开学LoRA然后直接加在一起使用。
做矩阵分解,
Cones 2: Customizable Image Synthesis with Multiple Subjects
Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation
StableDiffusion
任务:给一个句子,和句子中某些单词对应的图像,生成句子对应的图像,其中给定图像的单词对应的object要和给定图像相似,相当于可以做composition。类似PbE的self-supervised learning:利用预训练目标检测模型,在LAION数据集上,标注出句子中具体单词对应的object在图像中的位置,构建新的数据集。
不fine-tune模型,只训练一个MLP,将给定图像的CLIP image embedding转换为token embedding,用TI方法训练这个MLP,类似FastComposer。
Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning
StableDiffusion
训练一个open domain并且不需要test-time fine-tuning的模型。
数据集:使用BLIP为图像生成caption,提取caption中的subject,使用DINO+SAM分割出每个subject对应的bounding box,在caption后加[subject_0] is [placeholder_0], [subject_1] is [placeholder_1]...,构成数据集。
训练时,使用CLIP image encoder编码每个subject对应的bounding box内的内容,使用编码结果直接替换上述的[placeholder_i]的embedding,并且重新训练text encoder,这样就在建模text前引入融合了图像信息(实验发现这样比建模句子后再融合要好);同时训练cross-attention KV projection matrix(因为他们负责转换text feature);类似GLIGEN在self-attention和cross-attention之间加一个adapter,引入bounding box信息(帮助识别区分多物体)。
推理时,给定一个caption,在caption之后加[subject_0] is [placeholder_0], [subject_1] is [placeholder_1]...,为每个[placeholder_i]提供一张reference image,还可以为每个[placeholder_i]指定一个bounding box。
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation
Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation
基于IP-Adapter的多subject组合生成。
为text prompt中某些subject token提供image prompt,阈值法使用subject token对应的text cross-attention map估计出一个mask,乘到对应的image prompt的image cross-attention map上,所有image prompt的image cross-attention的输出加权和。
A&E防止object missing。
侧重于发掘之前没有的concept。
The Hidden Language of Diffusion Models
decomposing an input text prompt into a small set of interpretable elements.
对于某个concept,造句生成的100张图像,找一堆base word,学习一个MLP,为每个base word预测一个权重,所有base word的线性组合去重构这100张图像。目的是学习这个concept可以由哪些base word解释。
CusConcept: Customized Visual Concept Decomposition with Diffusion Models
类似Conceptor。
Exploiting Interpretable Capabilities with Concept-Enhanced Diffusion and Prototype Networks
ConceptLab: Creative Generation using Diffusion Prior Constraints
利用DALL
PartCraft: Crafting Creative Objects by Parts
StableDiffusion
使用DINOv2对数据集进行unsupervised part discovery,分为三阶段k-means,第一阶段
可以实现不同part的任意组合,生成新物种。
ReVersion: Diffusion-Based Relation Inversion from Images
TI训练优化一个relation token,提取reference images中共同存在的relation特征而不是object特征,比如握手,之后用relation token造句可以生成具有相同relation的图像。
Relation-Steering Contrastive Learning:relation token应该具有介词词性,使用一个contrastive loss,拉近relation token与已有的介词的距离,拉远relation token与其它词性的单词的距离。
Customizing Text-to-Image Generation with Inverted Interaction
类似ReVersion,TI学习物体之间交互关系。
Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models
invert any concepts in exemplar images, such as "frozen in ice", "burnt and melted", and "closed eyes"
Contrastive learning is used: synonyms of the concept are constructed as positives and antonyms as negatives, and an InfoNCE loss is computed.
Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
TI训练优化一个action token embedding,提取reference images中共同存在的action特征而不是object特征,比如倒立,之后用action token造句可以生成具有相同relation的图像。
对于某个要学习的action,在每一个cross-attention层都优化一个token,这样就不必局限于单个token,语义更丰富。
避免学到与action无关的特征:
ImPoster: Text and Frequency Guidance for Personalization in Diffusion Models
左上角是source image,左下角是driving image,先用这两张image fine-tune diffusion model。
Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models
FSViewFusion: Few-Shots View Generation of Novel Objects
Customizing Text-to-Image Diffusion with Camera Viewpoint Control
Given multi-view images of a new object, we create a customized text-to-image diffusion model with camera pose control.
Learning Continuous 3D Words for Text-to-Image Generation
Learn a continuous function that maps a set of attributes from some continuous domain to the token embedding domain.
CustomNet: Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models
构造同一object不同view的图像作为数据,编码object和view作为条件,使用IP-Adapter进行训练。
FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
DreamIdentity: Improved Editability for Efficient Face-identity Preserved Image Generation
使用预训练的ViT架构的人脸识别模型,提取3,6,9,12和最后一层的CLS token位置的feature,concat在一起,分别使用2个MLP将其转化成2个token embedding,使用diffusion loss和token embedding的L2正则进行训练。
使用不同层的feature的原因是最后一层的feature蕴含的都是比较高级的语义信息,缺少一些细节。
类似DreamIdentity利用预训练人脸模型的multi-scale feature,同时使用一个预训练expression encoder提取表情feature,以20%概率替换为一个可学习的代表无表情的向量,两个feature concat在一起,使用mapping network转化为token embedding,使用diffusion loss进行训练。
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
StableDiffusion
训练集由多个不同的人物id组成,每个人物id包含同一个人的多个image-text pair,text中包含man或woman描述image。训练时,使用CLIP image encoder将某个id的
生成时不再需要额外训练,任意给定某个人物的几张image,编写prompt进行生成。
PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization
类似PhotoMaker。
Foundation Cures Personalization: Recovering Facial Personalized Models Prompt Consistency
MegaPortrait: Revisiting Diffusion Control for High-fidelity Portrait Generation
Arc2Face: A Foundation Model of Human Faces
IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models
图里没画出来,在prompt后加了"the woman is sks",并且at the first embedding layer of the text encoder, we replace the text embedding of the identifier word “sks” with the identity text embedding,但没有优化sks的token embedding,而是用学到的embedding取代。
Dense-Face: Personalized Face Generation Model via Dense Annotation Prediction
We use internal features of the SD UNet to predict dense face annotations, enabling the proposed method to gain domain knowledge in face generation.
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation
使用多人脸图像自监督训练。
采样时只需要提供aligned faces。
Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation
给定一张人脸图像,和对场景和表情的描述,先使用StableDiffusion根据场景描述生成一张图像作为训练数据,再根据表情描述从数据库中选择一张具有该表情的图像作为表情条件,人脸图像作为id条件。
将训练数据的人脸部分mask掉(保留场景),concat在
使用diffusion loss + identity loss + expression loss 一起训练diffusion model,不需要自监督。
DemoCaricature: Democratising Caricature Generation with a Rough Sketch
ROME
Identity-Preserving Aging of Face Images via Latent Diffusion Models
DreamBooth
计算Class-specific Prior Preservation Loss时,将人脸数据按age分组,每组一个组名,如child,old等,使用带有组名的prompt和图像作为数据集。
训练后,使用photo of a
Inserting Anybody in Diffusion Models via Celeb Basis
StableDiffusion's text embeddings can be interpolated for generation. Based on this observation, collect the names of celebrities that the CLIP text encoder can recognize and use PCA to compute a basis of their token embeddings; this basis can be viewed as a representation of facial features in the token embedding space.
训练时,给定任意一张人脸的图片,训练一个MLP去modulate这组基,组成该人脸对应的pseudo word的embedding,插入"a photo of _",使用Textual Inversion方法训练这个MLP。
CharacterFactory: Sampling Consistent Characters with GANs for Diffusion Models
人物工厂,不是TI,不需要reference image,直接生成随机的可用的pseudo word embedding。
使用GAN生成fake embedding,采样名人的人名作为real embedding,对抗训练。
StableIdentity: Inserting Anybody into Anywhere at First Sight
受Celeb Basis启发,寻找一些名人的人名,得到他们的word embedding。通过一个MLP将输入人脸图像转化为两个word embedding,通过AdaIN转化到celeb word embedding空间(celeb word embedding的均值和方差分别充当shift和scale),TI训练这个MLP。学到的两个word embedding可以用于任何text-based generative model,比如ControlNet,text2video。
SeFi-IDE: Semantic-Fidelity Identity Embedding for Personalized Diffusion-Based Generation
LCM-Lookahead for Encoder-based Text-to-Image Personalization
专注人脸的IP-Adapter。
RealFill: Reference-Driven Generation for Authentic Image Completion
有reference images的inpainting任务,借助TI技术提取reference images的信息,辅助inpainting。
Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention
类似RealFill,有reference images的inpainting任务,借助TI技术提取reference images的信息,辅助inpainting。
Personalized Restoration via Dual-Pivot Tuning
有reference images的restoration任务。
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive.
DreamBench++ is a human-aligned benchmark automated by advanced multimodal GPT models.
Hollowed Net for On-Device Personalization of Text-to-Image Diffusion Models
Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion
training-free的方法大致有两种,一种是类似nursing的操作,设计一种
Sketch-Guided Text-to-Image Diffusion Models
为预训练好的StableDiffusion引入sketch。
使用预训练好的edge提取器生成训练数据(自监督),训练一个可以根据UNet的各层feature maps预测edge的MLP。方法类似于Label-Efficient Semantic Segmentation With Diffusion Models。
采样时用MLP损失函数的梯度做classifier guidance,只在T到0.5T加guidance。
使用dynamic guidance scheme:
Sketch-Guided Scene Image Generation
先利用每个object的sketch和只含有object的prompt单独生成该object,之后对该object进行TI学习。
将所有object按sketch的位置拼在一起进行blended生成。
It’s All About Your Sketch: Democratising Sketch Control in Diffusion Models
Trained only on a small set of sketch-image pairs, without text.
使用CLIP编码sketch,取最后一层的feature sequence,只训练一个sketch adapter,将其转化成CLIP text embedding,送入StableDiffusion的cross-attention进行训练,除了diffusion loss,还有两个额外的loss:每一步预测的
ToddlerDiffusion: Flash Interpretable Controllable Diffusion Model
模拟人类画图的思路,先生成sketch,再生成palette,最后生成图像。使用ShiftDDPMs的公式,以sketch或palette而不是pure noise为起点进行训练。
Training-Free Sketch-Guided Diffusion with Latent Optimization
training-free
KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models
CGC将sketch和text feature进行融合,融合后当做text输入diffusion model。
FGC是ControlNet或者T2I-Adapter,乘一个系数进行knob。
Compositional 3D Scene Generation using Locally Conditioned Diffusion
给定
Semantic-Driven Initial Image Construction for Guided Image Synthesis in Diffusion Model
受Initial Image Editing的启发,只需要精心构建
利用StableDiffusion,最深层cross-attention map的一个值对应
生成时,从物体对应的noise block数据库中采样,填在指定的bounding box内进行生成。
NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging
masked cross-attention:layout之内的image feature与object prompt进行cross-attention,layout之外的image feature与global prompt进行cross-attention,两者结果相加。
The Crystal Ball Hypothesis in Diffusion Models: Anticipating Object Positions from Initial Noise
A trigger patch is a patch in the noise space with the following properties: (1) Triggering Effect: When it presents in the initial noise, the trigger patch consistently induces object generation at its corresponding location; (2) Universality Across Prompts: The same trigger patch can trigger the generation of various objects, depending on the given prompt.
We try to train a trigger patch detector, which functions similarly to an object detector but operates in the noise space. 随机噪声,生成图像,使用预训练好的object detector检测物体,检测得到的结果作为该噪声的ground truth,训练trigger patch detector。
生成时,随机噪声,检测trigger patch,移动trigger patch到目标位置。
LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation
adapt pre-trained unconditional or conditional diffusion models,在每个attention layer后加一个带residual的layout attention layer,即h=LayoutAttn(h)+h。
LayoutAttn(h)将layout分成每个instance单独的layout(即只标识了一个object),每个layout当成mask,提取h中该object的region feature map,然后为每个feature加上该object对应的class label或者caption的learnable embedding,然后做self-attention;对于h,使用空标签或者空字符串的learnable embedding加到每个feature上,做self-attention,作为背景;然后乘上mask加在一起,重叠部分取平均。类似ControlNet,参数初始化为0,LayoutAttn(h)一开始输出为0,训练开始前不影响原网络。
LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation
重新设计UNet,全部重新训练。
CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation
MMDiT
Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation
The proposed regional cross-attention layer is inserted into the original diffusion model right after each self-attention layer. The weights of the output linear layer are initialized to zero, ensuring that the model equals to the foundational model at the very beginning.
IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation
PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
layout control map:将layout转换为semantic mask,让对应的word的cross-attention map只有semantic mask内的响应值,但由于StableDiffusion是在8倍下采样的latent上运行的(深层的feature map更小),对mask采取同样的下采样可能会导致一些小物体被忽略,所以这里通过感受野计算mask,对于feature map上每个image token,如果其在原图尺寸上的感受野与当前物体的semantic mask有交集,则设为1,否则设为0。使用原cross-attention map与乘上mask后的cross-attention map的插值。
Semantic Alignment Loss:encourages image tokens to interact more with the same and related semantic regions in the self-attention module, thereby further improving the layout alignment of the generated images. 通过cross-attention控制self-attention,对于某个word,将其cross-attention map(
Layout-Free Prior Preservation Loss:由于数据集较小,为了防止过拟合,使用一些文生图数据计算diffusion loss,此时把layout control map中的semantic mask cross-attention map的插值系数设为0即可。
MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis
在StableDiffusion原有的cross-attention output上乘上mask,再额外训练一个并行的cross-attention(enhancement attention),在output上乘上mask,两者相加作为当前instance的shading result;再额外训练一个并行的self-attention(layout-attention),在output上分别乘上前景和背景的mask,得到两个shading result;
只在mid-layers (i.e., 8 × 8)和the lowest-resolution decoder layers (i.e., 16 × 16)上应用MIGC。
在COCO上使用diffusion loss训练,同时还优化cross-attention map上背景区域的响应值之和(类似TokenCompose)。
HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation
Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models
StableDiffusion
training-free
box: 对text中有bounding box的object对应的cross-attention map,定义一些bounding box附近的sliding box,bounding box内的响应值减去bounding box外的响应值再加上这些sliding box内的响应值与bounding box内的响应值的IoU(保证均匀),作为object reward。
bind: attribute的cross-attention map与对应的object的cross-attention map在bounding box内的响应值的KL散度的相反数,作为attribute reward。
两个reward加在一起求梯度作为guidance。
R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation
StableDiffusion
training-free
LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts
有点类似SpaText,每个object都对应一个region map,其大小和图像一致,并在bounding box内填上可训练的对应object的embedding,bounding box外填上可训练的background的embedding。所有region map分成patch,不同region map的同一个位置的patch组成一个序列,序列前再prepend一个agg embedding,送入一个ViT,不需要线性映射,不需要加positional embedding,取agg embedding的输出。所有位置都按此处理,按位置排列所有输出,组合成图像大小的一个layout embedding。训练一个diffusion model,将layout embedding与
Spatial-Aware Latent Initialization for Controllable Image Generation
Directed Diffusion: Direct Control of Object Placement through Attention Guidance
StableDiffusion
training-free
在生成时,提高text token对应的cross-attention map的bounding box区域的权重。
Grounded Text-to-Image Synthesis with Attention Refocusing
StableDiffusion
training-free
attention refocusing
cross-attention refocusing:类似Attend-and-Excite,
self-attention refocusing:
采样时计算上述loss,用
Boundary Attention Constrained Zero-Shot Layout-To-Image Generation
StableDiffusion
training-free
BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
StableDiffusion
training-free
只在16x16的分辨率上进行操作
Similar to Attention Refocusing: at generation time, bounding boxes are given for certain substrings of the text, and three constraints are applied to the corresponding cross-attention maps: the Inner-Box Constraint (strengthen responses inside the bounding box, encouraging the object to appear there), the Outer-Box Constraint (weaken responses outside the bounding box, preventing the object from appearing elsewhere), and the Corner Constraint (encourage the object to fill the bounding box instead of collapsing into a small object inside it); the gradient of the summed losses with respect to x_t is used as guidance.
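A hedged sketch of the inner/outer-box constraints on a single token's 16x16 cross-attention map (the corner constraint is omitted; the top-k reduction and all names are my assumptions):

```python
import torch

def box_constraint_loss(attn_map, box, topk=10):
    """BoxDiff-style constraints on one text token's cross-attention map
    (sketch): reward response inside the box, penalize response outside.
    attn_map: (H, W) attention of one token; box: (x0, y0, x1, y1)."""
    H, W = attn_map.shape
    mask = torch.zeros(H, W)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = 1.0
    inner = (attn_map * mask).flatten().topk(topk).values.mean()
    outer = (attn_map * (1 - mask)).flatten().topk(topk).values.mean()
    # Inner-Box: make responses inside the box large; Outer-Box: suppress outside.
    return (1 - inner) + outer

# The summed losses over all boxed tokens are differentiated w.r.t. x_t and the
# gradient (scaled) is used as guidance before the next denoising step.
attn = torch.rand(16, 16, requires_grad=True)
loss = box_constraint_loss(attn, box=(2, 2, 10, 10))
loss.backward()
```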
Training-free Composite Scene Generation for Layout-to-Image Synthesis
StableDiffusion
training-free
只在16x16的分辨率上进行操作
Similar to BoxDiff, several constraint losses are designed, and the gradient of their sum with respect to x_t is used as guidance.
Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis
StableDiffusion + StableInpainting
training-free
PACA:增大除了SOT之外所有token的cross-attention map中的mask区域内的响应值。对于SOT有一个很有意思的特点,其cross-attention map中的值哪里被增大了,最终输出的图像哪里就会变成背景,所以可以利用这一特点,对SOT的cross-attention map进行反向操作,增大mask区域外的响应值。
ReGCA:inpainting的cross-attention,背景和前景使用不同的KV,只对背景使用global prompt。
Localized Text-to-Image Generation for Free via Cross Attention Control
Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control
Enhancing Image Layout Control with Loss-Guided Diffusion Models
StableDiffusion
training-free
Cross Attention Control
除了text之外,额外提供了
SpaText: Spatio-Textual Representation for Controllable Image Generation
每个segment对应一个text,可以分区域生成,指定物体之间的空间关系。
自监督训练,使用预训练分割模型提取图像segments,用CLIP提取每个segment的CLIP image embedding,初始化一个全为0的segmentation map,大小和图像一样,通道数和CLIP image embedding维数一样,将每个segment的CLIP image embedding放到segmentation map中对应位置。
改造DALL
推理时用DALL
Enhancing Object Coherence in Layout-to-Image Synthesis
修改StableDiffusion网络结构,fine-tune。
Freestyle Layout-to-Image Synthesis
将StableDiffusion的cross-attention改为rectified cross-attention:将text token对应的cross-attention map中,在bounding box之内的保留原值,在bounding box之外的设为负无穷。By forcing each text token to affect only pixels in the region specified by the layout, the spatial alignment between the generated image and the given layout is guaranteed。再使用任何layout-based数据fine-tune StableDiffusion。
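A sketch of rectified cross-attention for a single attention head, with per-token region masks supplied as flattened 0/1 maps (shapes and names are my assumptions):

```python
import torch

def rectified_cross_attention(q, k, v, token_boxes, scale):
    """Rectified cross-attention (sketch): a text token's attention logits are
    kept only at image positions inside that token's box and set to -inf
    elsewhere, so each token can only affect its own region.
    q: (N_img, d); k, v: (N_txt, d); token_boxes: (N_txt, N_img) 0/1 masks."""
    logits = q @ k.T * scale                          # (N_img, N_txt)
    neg_inf = torch.finfo(logits.dtype).min
    logits = logits.masked_fill(token_boxes.T == 0, neg_inf)
    return torch.softmax(logits, dim=-1) @ v

n_img, n_txt, d = 16 * 16, 4, 64
q, k, v = torch.randn(n_img, d), torch.randn(n_txt, d), torch.randn(n_txt, d)
boxes = torch.ones(n_txt, n_img)                      # all-ones mask = vanilla attention
out = rectified_cross_attention(q, k, v, boxes, scale=d ** -0.5)
```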
Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive
传统训练方法只是将layout作为条件输入模型优化diffusion loss,并没有对layout的显式监督,可能导致生成结果和layout不匹配。一个解决方法是使用预训练的segmentor对
引入对抗训练,判别器:训练将ground truth每个pixel正确分类到N个real class,将
multistep unrolling:由于layout是diffusion生成早期阶段就决定的,但此时
Dense Text-to-Image Generation with Attention Modulation
StableDiffusion
training-free
和rectified cross-attention一样的思路,只不过是training-free的,可以直接采样:At cross-attention layers, we modulate the attention scores between paired image and text tokens to have higher values. At self-attention layers, the modulation is applied so that pairs of image tokens belonging to the same object exhibit higher values。这里的paired image and text tokens意思是当前image token的位置在text token所描述的object的bounding box内。
Stochastic Conditional Diffusion Models for Robust Semantic Image Synthesis
In real-world applications, semantic image synthesis often encounters noisy user inputs. SCDM enhances robustness by stochastically perturbing the semantic label maps through Label Diffusion, which diffuses the labels with discrete diffusion.
MagicMix: Semantic Mixing with Diffusion Models
noisy latents linear combination版本的SDEdit,削弱原图的细节,只保留基本的结构和外观信息。
DiffFashion: Reference-based Fashion Design with Structure-aware Transfer by Diffusion Models
DiffEdit+MagicMix
Integrating Geometric Control into Text-to-Image Diffusion Models for High-Quality Detection Data Generation via Text Prompt
translate geometric conditions to text(包括object坐标等),fine-tune StableDiffusion。
GLoD: Composing Global Contexts and Local Details in Image Generation
Masked SEGA.
StablePose: Leveraging Transformers for Pose-Guided Text-to-Image Generation
Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
相机参数:bokeh blur, lens, shutter speed, temperature等。
对于每个相机参数
用这些视频LoRA fine-tune T2V模型,同时训练一个contrastive camera encoder编码相机参数,编码结果拼在invariant scene description的编码结果之后。
contrastive camera encoder: 因为前后帧之间只有某个相机参数不同,所以做差取feature。
推理时,既可以根据给定的相机参数和prompt生成图像(所有帧使用相同相机参数),也可以对已有图像进行相机参数的编辑(从原图相机参数平滑过渡到目标相机参数)。
用T2V的原因:即使固定随机种子,只要prompt稍有差异,T2I生成图像也会有很大差异,但T2V可以保持前后帧scene的一致性。
Joint Generative Modeling of Scene Graphs and Images via Diffusion Models
Train a DiffuseSG model (Graph Transformer) to produce layout and then utilize a pretrained layout-to-image model to generate images.
Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation
R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion
将scene graph分为多个三元组(object1-relation-object2),所有三元组拼在一起作为条件输入denoising model进行训练。
除了diffusion loss,还加了两个contrastive loss,从同一个batch中采样具有相同relation的三元组作为positive,batch内其余三元组作为negtive,利用relation的cross-attention map之间的cosine similarity计算一个contrastive loss,再利用三元组的diffusion loss之间的MSE计算一个contrastive loss。
Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
A large-scale dataset with high-quality structural annotations of scene graphs (SG).
TraDiffusion: Trajectory-Based Training-Free Image Generation
定义cross-attention map和trajectory之间的energy function,求梯度作为guidance进行采样。
Compositional Text-to-Image Generation with Dense Blob Representations
GLIGEN with blob tokens
DiffUHaul: A Training-Free Method for Object Dragging in Images
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
StableDiffusion
CLIP image encoder提取image embedding,训练一个线性层将其映射到长为4的sequence,类似StyleAdapter,加一个和text cross-attention layer并行的可训练的image cross-attention layer,使用原来的数据集,训练线性层和image cross-attention layer。
训练好的模型可以与ControlNet和T2IAdapter一起使用,无需额外训练。
IP-Adapter+:在text cross-attention layer之后加可训练的image cross-attention layer。
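A sketch of the decoupled cross-attention, with the image branch weighted by a `scale` factor (single head, no learned projections, names are mine):

```python
import torch

def ip_adapter_cross_attention(q, text_kv, image_kv, scale=1.0):
    """IP-Adapter-style decoupled cross-attention (sketch): the frozen text
    cross-attention output plus a parallel, trainable image cross-attention
    output weighted by `scale`. q: (N, d); each kv is a (K, V) tuple."""
    def attn(q, k, v):
        w = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        return w @ v
    k_t, v_t = text_kv
    k_i, v_i = image_kv
    return attn(q, k_t, v_t) + scale * attn(q, k_i, v_i)

d = 64
q = torch.randn(256, d)
text_kv = (torch.randn(77, d), torch.randn(77, d))
image_kv = (torch.randn(4, d), torch.randn(4, d))    # 4 image tokens from the projector
out = ip_adapter_cross_attention(q, text_kv, image_kv, scale=0.8)
```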
IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts
基于IP-Adapter+,在image cross-attention layer再加一个text cross-attention layer,与instruction进行交互,使用instruction editing数据进行训练。
使用prompt,ip image,instruction一起生成。
使用成对的图像数据集,其中一张作为condition,另一张作为target,重新训练一个U-ViT的diffusion model,we do not use any text inputs and only rely on image conditioning.
使用预训练的CLIP或者DINO编码图像得到的token sequence或者CLS token作为condition,当使用token sequence时使用cross-attention,当使用CLS token时使用FiLM。
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
We constructed the first fine-grained visual attributes dataset (FiVA) to the best of our knowledge. This FiVA dataset features a well-organized taxonomy for visual attributes and includes 1 M high quality generated images with visual attribute annotations.
PuLID: Pure and Lightning ID Customization via Contrastive Alignment
IP-Adapter在训练时使用从原图中提取的feature,这一定程度上会导致模型过拟合,除了diffusion loss,还引入了两个alignment loss和一个ID loss。
训练时构造两条contrastive paths,one path with ID:两个cross-attention都用;the other path without ID:只用text cross-attention。为了确保semantic alignment使用text作为Q,image feature作为KV,计算cross-attention map,优化两条paths的cross-attention map之间的MSE loss。The insight behind our semantic alignment loss is simple: if the embedding of ID does not affect the original model’s behavior, then the response of the UNet features to the prompt should be similar in both paths.
为了确保layout alignment,同时优化两条paths的image feature的MSE loss。
使用
InstantID: Zero-shot Identity-Preserving Generation in Seconds
上半部分类似IP-Adapter,只是将CLIP image embedding换成了face id embedding。但是作者认为这种方法不够好,因为image token和text token本身提供的信息就不同,控制的方式和力度也不同,但是IP-Adapter却把他们concat在一起,有互相dominate和impair的可能。
提出使用另一个IdentityNet(ControlNet架构)提供额外的空间信息,根据上述原因,这里的ControlNet去掉了text的cross-attention,只保留face id embedding的cross-attention。这里只提供双眼、鼻子、嘴巴的key points作为输入,一方面是因为数据集比较多样,更多的key points会导致检测困难,让数据变脏;另一方面是为了方便生成,也可以增加使用文本或者其他ControlNet的可编辑性。
在人脸数据集上自监督训练。
ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning
A general framework to achieve identity preservation via feedback learning.
Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
类似ObjectStitch,训练一个SeeCoder将reference image转换为CLIP text embedding,然后使用其替换StableDiffusion的CLIP text encoder,实现只使用reference image生成图像。还可以使用ControlNet引入其它条件。
Many-to-many Image Generation with Auto-regressive Diffusion Models
构造一个image sequence数据集。
训练时每个样本是一个image sequence
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Manga Generation via Layout-controllable Diffusion
Late-Constraint Diffusion Guidance for Controllable Image Synthesis
为预训练好的StableDiffusion引入各种条件,算是SKG的升级版。
使用预训练好的模型抽取image的各种conditions(如mask、edge等),训练一个可以根据UNet的各层feature maps预测conditions的condition adapter。
采样时,用当前的feature maps输入到condition adapter得到预测的conditions,与给定的conditions计算距离,求梯度作为guidance。
这类方法本质上还是训练一个noisy classifier,但使用的是diffusion model的feature。
Readout Guidance: Learning Control from Diffusion Features
和Late-Constraint类似,分为spatial和relative两种head。
spatial包含pose,edge,depth等,训练模型根据diffusion feature预测ground truth,采样时根据预测和给定的label计算MSE loss,求梯度作为guidance。
relative包含corresponce feature和appearance similarity,训练模型根据两个不同图像的diffusion feature进行预测。
drag:corresponce feature head uses image pairs with labeled point correspondences and trains a network such that the feature distance between corresponding points is minimized, i.e., the target point feature is the nearest neighbor for a given source point feature. We compute pseudo-labels using a point tracking algorithm to track a grid of query points across the entire video. We randomly select two frames from the same video and a subset of the tracked points that are visible in both frames. 训练时,将输入的diffusion feature转化为一个feature map,image pairs的feature map之间的corresponding point feature之间计算loss;编辑时,先将原图输入UNet得到diffusion feature,再送入网络提取feature map,计算其staring point处的feature与生成图像的feature map的target point处的feature的距离,求梯度作为guidance。
Modulating Pretrained Diffusion Models for Multimodal Image
将
Amazing Combinatorial Creation Acceptable Swap-Sampling for Text-to-Image Generation
给定两个object text,生成两个concept融合在一起的图像,类似MagicMix。
对于一个0-1的列交换向量,其长度和CLIP编码结果的维度相同,若向量某位置为0,则选取第二个object text的CLIP编码结果的该位置的列向量,若向量某位置为1,则选取第一个object text的CLIP编码结果的该位置的列向量,组合成一个新的CLIP编码结果,将其输入到StableDiffusion是可以生成两个concept融合在一起的图像的。
实践中,随机采样一堆列交换向量,每个列交换向量按上述流程生成图像,再使用一些选取策略从所有图像中选出最符合标准的图像。
SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
Fine-tuning or controllable generation is achieved by editing the features on the skip connections.
SC-Tuner:
CSC-Tuner:
GLIGEN: Open-Set Grounded Text-to-Image Generation
StableDiffusion
除了caption,额外给定一组entity和对应的grounding信息(比如layout),进行spatial control。
A trainable gated self-attention layer is inserted between self-attention and cross-attention: the grounding tokens and visual tokens are concatenated and passed through self-attention, only the outputs at the visual-token positions are kept, multiplied by a trainable gate scalar, and added back via a residual connection. The gate scalar is initialized to 0, analogous to ControlNet's zero-conv, so that the network initially behaves exactly like StableDiffusion.
grounding token由entity和对应的grounding的feature同时输入一个可训练的MLP预测。entity可以是文本或者图像,为文本时就用预训练文本编码器提取其feature,为图像时就用预训练图像编码器提取其feature,grounding使用Fourier embedding提取其feature,如果是layout,就是左上右下两个坐标,如果是keypoint,就是一个坐标,如果是depth map,此时就没有entity了,直接使用一个网络将其转换为
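A sketch of the gated self-attention layer described above, using `nn.MultiheadAttention` as a stand-in for the actual attention implementation; the zero-initialized gate reproduces the "no effect at initialization" behaviour:

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """GLIGEN-style gated self-attention (sketch): visual and grounding tokens
    are concatenated, self-attention is applied, only the visual part of the
    output is kept, and it is added back through a zero-initialized gate."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))    # tanh(0) = 0: no effect at init

    def forward(self, visual, grounding):
        x = torch.cat([visual, grounding], dim=1)   # (B, Nv + Ng, D)
        out, _ = self.attn(x, x, x)
        return visual + torch.tanh(self.gate) * out[:, : visual.shape[1]]

layer = GatedSelfAttention(dim=64)
visual, grounding = torch.randn(1, 256, 64), torch.randn(1, 5, 64)
h = layer(visual, grounding)                        # equals `visual` at initialization
```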
ReGround: Improving Textual and Spatial Grounding at No Cost
把GLIGEN改成类似IP-Adapter的并行attention形式,不用重新训练,直接把训练好的GLIGEN改成ReGround的形式,效果也能变好。
InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
定义interaction是一个三元组,分别是主体(subject)、动作(action)和客体(object),三者分别对应一个文本描述和一个bounding box,主体和客体使用同一个MLP,将文本(预训练文本编码)和bounding box(Fourier embedding)转化一个token,动作用另一个MLP也转化为一个token。
如果一张图中有多个interaction,那么不同interaction之间无法区分,所以为每个interaction加一个可训练的embedding,类似positional embedding。同样,一个interaction中三元组之间也无法区分,所以为三者各加一个可训练的embedding,所有interaction公用该embedding。
得到最终的embedding后,类似GLIGEN进行训练。
InstanceDiffusion: Instance-level Control for Image Generation
Adding Conditional Control to Text-to-Image Diffusion Models
为预训练好的StableDiffusion引入类似PDAE的条件模块ControlNet。
ControlNet:固定StableDiffusion,复制StableDiffusion的UNet的encoder和middle block的每个block进行训练,输出与UNet对应的decoder的输出进行加和。zero convolution是所有参数都初始化为0的1x1卷积层,这样在训练前整个trainable copy的输出为0,不影响原网络。
condition一般和原图尺寸一样。由于要和原网络的input相加,所以尺寸必须和原网络的input相同。StableDiffusion的input是降维后latent,所以condition也需要降维,所以就需要额外训练一个encoder对condition进行编码降维。
多个ControlNet可以组合使用。
StableDiffusion一般必须用classifier-free guidance才能生成较好的图像,此时ControlNet可用于both unconditional and conditional prediction,也可只用于conditional prediction。但是如果想不使用prompt进行生成,此时如果将ControlNet用于both,cfg退化,效果不好;如果将ControlNet只用于conditional prediction,会导致guidance太强,解决方案为resolution weighting。
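Returning to the zero convolution mentioned above, a minimal sketch of it and of how the control branch's output is added onto a frozen decoder feature (channel sizes are arbitrary):

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution with all parameters initialized to zero, so the control
    branch contributes nothing before training starts (ControlNet sketch)."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# control features enter and leave the trainable copy through zero convs and
# are added onto the corresponding frozen-decoder features
zc = zero_conv(320)
control_feat = torch.randn(1, 320, 64, 64)
decoder_feat = torch.randn(1, 320, 64, 64)
fused = decoder_feat + zc(control_feat)       # equals decoder_feat at initialization
```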
ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models
ControlNet存在information delay的问题,即某个时间步的去噪时,SD encoder不知道control信息,ControlNet encoder不知道generative的信息。
ControlNet-XS让两个encoder之间同步information,一个的feature map过一个可训练的convolution后加在另一个上,反之亦然,这样ControlNet encoder就不需要复制SD encoder了,而是可以使用参数量更少的处理同维度feature map的网络,随机初始化进行训练即可,效果还比ControlNet要好。
CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation
FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection
StableDiffusion + ControlNet
training-free
将多实例输入进行分离,修改cross-attention,每个实例过一次cross-attention,所有实例的输出相加得到最后输出。在UNet feature上进行操作,所以在UNet encoder部分,只融合text信息,在UNet decoder部分,同时融合control信息和text信息。
SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
Relax the visual condition on the areas that are conflicted with text prompts. 如使用deer的depth map生成tiger时,鹿角部分需要舍去。
ControlNet可以使用一个
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
加噪后去噪一步,使用
X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
train a universal compatible adapter so that plugins of the base stable diffusion model (such as ControlNet on SD) can be directly utilized in the upgraded diffusion model (such as SDXL).
训练一个mapper,将base model的decoder的feature映射到upgraded model的decoder的feature维度并加上去,使用upgraded model的diffusion loss训练mapper。注意训练时,upgraded model输入的是empty prompt。
CCM: Real-Time Controllable Visual Content Creation Using Text-to-Image Consistency Models
The way to add new conditional controls to the pre-trained CMs.
ControlNet can be successfully established through the consistency training technique.
CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation
和CCM的目标一样,使用consistency distillation在预训练diffusion model的基础上训练一个类似ControlNet的网络进行快速的条件生成。
类似ControlNet,也是复制一个UNet encoder出来,但并不是skip connect到预训练diffusion model UNet decoder,而是将其每一层输出的feature与预训练diffusion model UNet encoder对应的每一层的输出进行线性插值插值,插值系数也是可学习的,初始化为
Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
类似X-Adapter。Pretrained ControlNet cannot be directly plugged into new backbone models due to the mismatch of feature spaces, and the cost of training ControlNets for new backbones is a big burden for many users.
ControlNeXt: Powerful and Efficient Control for Image and Video Generation
We remove the control branch and replace it with a lightweight convolution module composed solely of multiple ResNet blocks. We integrate the controls into the denoising branch at a single selected middle block by directly adding them to the denoising features after normalization through Cross Normalization.
不再是复制原模型,而是使用一个轻量级的模块处理条件,并且只将结果在原模型的某个中间的block引入。极大的降低了参数量和计算量。
Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion
针对visual controls are misaligned with text prompts的问题,比如prompt中提到了某个object,但visual control中没有对应的edge,这样使用ControlNet生成出的图像会丢失这个object。
这本质上是ControlNet主导了生成的结果,所以提出了一种training-free的方法,根据每个object的edge提取mask,所有mask组合在一起,将ControlNet的feature乘上该mask再加到UNet decoder的feature上,目的是让ControlNet只负责生成有visual controls的objects,our experimental results show that the application of masks to ControlNet features substantially mitigates conflicts between mismatched textual and visual controls, effectively addressing the problem of object missing in generated images.
针对属性不绑定的问题,计算attribute和object的cross-attention map之间的overlap,梯度下降优化
Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
StableDiffusion + ControlNet
自监督训练,对于某张图像,提取salient object的mask,图像乘上mask即为foreground图像,图像乘上mask的补码再对salient object部分进行inpainting得到background图像。分别对foreground和background图像提取depth。
提取foreground和background图像的CLIP image embedding,经过一个网络后concat在text embedding后,在ControlNet的cross-attention层用上mask,让Q和foreground K只在mask区域有值,让Q和background K只在mask区域之外有值。
foreground和background是不对等的,对调它们的输入会生成不同位置关系的图像,所以叫3D depth aware。
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
StableDiffusion
training-free
DDIM Inversion时,UNet decoder第一个self-attention之前的feature(query, key, value)为
利用这一属性,先生成一些target concept的图片,得到
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
为预训练好的StableDiffusion的encoder输出的各分辨率的feature map加上由condition计算出的同尺寸的feature map,只优化T2I-Adapter。
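A toy sketch of such an adapter: a small conv stack produces multi-scale feature maps that are added to the frozen encoder's features at matching resolutions (channel counts and strides here are made up for illustration):

```python
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """T2I-Adapter-style sketch: map the condition image to multi-scale
    feature maps that are simply added to the frozen UNet encoder's feature
    maps at the matching resolutions."""
    def __init__(self, cond_ch=3, chans=(320, 640, 1280)):
        super().__init__()
        self.stem = nn.Conv2d(cond_ch, chans[0], 3, stride=8, padding=1)
        self.downs = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
            for c_in, c_out in zip(chans[:-1], chans[1:]))

    def forward(self, cond):
        feats = [self.stem(cond)]
        for down in self.downs:
            feats.append(down(feats[-1]))
        return feats       # added to the UNet encoder features, scale by scale

adapter = TinyAdapter()
feats = adapter(torch.randn(1, 3, 512, 512))
print([f.shape for f in feats])   # 64x64, 32x32, 16x16 feature maps
```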
Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples
训练时不需要text,且只需要几十到几百个样本。
类似T2I-Adapter,训练一个prompt-free condition encoder,其输出的feature map加在StableDiffusion的encoder输出的各分辨率的feature map上。prompt-free condition encoder从StableDiffusion的encoder复制而来,去掉了cross-attention层,每个尺寸的feature map输入一个额外的zero convolution层。
DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
StableDiffusion
self-attention和cross-attention之间插入可训练的local self-attention和global self-attention进行多模态训练。
Universal Guidance for Diffusion Models
StableDiffusion
forward guidance: Tweedie's formula is used to estimate a clean x_0 from x_t, the estimate is scored by an off-the-shelf guidance network against the given condition, and the gradient of that loss serves as guidance (see the sketch at the end of this entry).
backward guidance:在上述guidance的基础上,使用Decomposed Diffusion Sampling优化一个
采样的每一步都使用resample technique重复多次forward guidance + backward guidance。
Baseline: models such as ControlNet and T2I-Adapter, after being trained separately for different conditions, can be combined via feature interpolation to achieve multi-condition control.
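A hedged sketch of the forward-guidance step described above: estimate x_0 with Tweedie's formula, score it with any off-the-shelf loss defined on clean images, and fold the gradient back into the noise prediction (the exact scaling is an assumption; the backward-guidance and resampling parts are omitted):

```python
import torch

def forward_guidance(xt, t, unet_eps, alpha_bar_t, guidance_fn, scale):
    """Universal-guidance-style forward guidance (sketch): Tweedie estimate of
    x_0 from x_t, an external loss on that estimate, and its gradient folded
    into the predicted noise."""
    xt = xt.detach().requires_grad_(True)
    eps = unet_eps(xt, t)
    x0_hat = (xt - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    loss = guidance_fn(x0_hat)                    # e.g. classifier / CLIP / segmentor loss
    grad = torch.autograd.grad(loss, xt)[0]
    return eps + scale * (1 - alpha_bar_t).sqrt() * grad   # guided noise prediction

# toy stand-ins: unet_eps would be the diffusion UNet, guidance_fn any loss on x_0
unet_eps = lambda x, t: torch.zeros_like(x)
guidance_fn = lambda x0: (x0 ** 2).mean()
xt = torch.randn(1, 4, 64, 64)
eps_guided = forward_guidance(xt, t=500, unet_eps=unet_eps,
                              alpha_bar_t=torch.tensor(0.3),
                              guidance_fn=guidance_fn, scale=1.0)
```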
Composer: Creative and Controllable Image Synthesis with Composable Conditions
用各种预训练网络提取图像的各种结构、语义、特征信息,然后作为条件训练GLIDE。
训练技巧:以0.1的概率丢弃全部conditions,以0.7的概率包含全部conditions,每个condition独立以0.5概率丢弃。
MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models
一个ControlNet接收不同模态输入进行训练,图中的不同task使用的是相同的网络。
将不同模态在每层计算完成后得到的feature进行merge然后skip-connect到UNet decoder,merge后的feature再unmerge为原来的数量输入到下一层。
merge策略:对于每个spatial位置,计算两个feature之间的相关性,如果大于某个预设的阈值,就取两个feature的平均;如果小于阈值,就分别计算它们相对于各自整个feature的标准差,选择标准差较大的那个feature。
baseline是Multi-T2I Adapter和Multi-ControlNet,即每个task单独训练一个T2I Adapter或ControlNet,然后一起使用。
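A sketch of the per-location merge rule described above, where the "larger standard deviation" criterion is approximated by comparing standardized feature magnitudes (an assumption on my part; the paper's exact statistic may differ):

```python
import torch

def maxfusion_merge(f1, f2, threshold=0.8):
    """MaxFusion-style merge (sketch): per spatial location, average the two
    modality features if their cosine similarity is high; otherwise keep the
    feature that deviates more from its own map's statistics.
    f1, f2: (C, H, W) feature maps from the two control branches."""
    cos = torch.nn.functional.cosine_similarity(f1, f2, dim=0)      # (H, W)
    z1 = (f1 - f1.mean()) / (f1.std() + 1e-6)                       # standardize each map
    z2 = (f2 - f2.mean()) / (f2.std() + 1e-6)
    pick_f1 = z1.norm(dim=0) >= z2.norm(dim=0)                      # (H, W)
    chosen = torch.where(pick_f1.unsqueeze(0), f1, f2)
    return torch.where(cos.unsqueeze(0) > threshold, (f1 + f2) / 2, chosen)

merged = maxfusion_merge(torch.randn(320, 64, 64), torch.randn(320, 64, 64))
```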
OmniControlNet: Dual-stage Integration for Conditional Image Generation
先为不同模态分别学习一个pseudo word,例如使用几张depth map images和"use <depth> as feature"利用TI学习"<depth>"的word embedding。
之后使用不同模态训练ControlNet,其中trainable copy的prompt之前加上对应条件的模态的"use <depth> as feature",这样一个ControlNet就可以处理不同模态的条件。
Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation
多个模态的condition融合,输入到一个ControlNet进行训练,实现任意种模态的condition组合生成。
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation
ControlNet
multi-control fusion block是cross-attention,让query token与visual token进行交互(text token不参与,直接输入下一层),visual token要加positional embedding以区分不同spatial control,multi-control alignment block就是self-attention,让query token获取信息。
query token最终的输出送入ControlNet的cross-attention。
训练时随机drop不同spatial control,以让模型适用于不同数量的spatial control。
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation
FaceComposer: A Unified Model for Versatile Facial Content Creation
类似Composer,专做人脸,还支持talking face生成。
Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models
解决prompt中有但是depth map中没有的物体在生成时丢失的问题。
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
Multi-Source Diffusion Models for Simultaneous Music Generation and Separation
多个音乐源拼接在一起进行训练,训练时所有音乐源都使用相同的时间步,噪声不一样。
total generation
partial generation:blended inpainting,配乐。
source separation:将某个要分离出来的音乐源视为所有音乐源的和减去其它音乐源的和。
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
使用预训练编码器将image和text都转换为token,额外训练两个decoder,可以根据token重构image和text。
text-image联合训练,使用U-ViT架构,训练时两者采样不同的时间步和噪声,这样可以做到unconditional(另一个模态一直输入噪声),conditional(另一个模态一直输入条件),joint(同步生成) sampling。
One Diffusion to Generate Them All
类似UniDiffuser。
Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC
Making Multimodal Generation Easier When Diffusion Models Meet LLMS
BiDiffuser:fine-tune UniDiffuser,只进行image-to-text和text-to-image,
将BiDiffuser和LLM联合。
Any-to-Any Generation via Composable Diffusion
目标:generate any combination of output modalities from any combination of input modalities.
We begin with a pretrained text-image paired encoder, i.e., CLIP. We then train audio and video prompt encoders on audio-text and video-text paired datasets using contrastive learning, with text and image encoder weights frozen。这样每个模态就能得到一个encoder,且编码结果共享一个common embedding space。每个模态以编码结果为条件训练一个diffusion model。
上面训练得到的是单模态的diffusion model,只能单对单自生成,还不能多对多生成。使用text-image数据,为text diffusion model和image diffusion model的UNet各自加入新的cross-attention层,训练时只训练这个cross-attention层,cross-attention的方式是为每个模态的noisy latent设计一个independent encoder,将不同模态的noisy latent嵌入到一个common embedding space,attend这个embedding token,除了diffusion loss同时也利用contrastive learning进行训练,这样text和image的noisy latent就可以通过它们的encoder对齐。之后固定住text的encoder和cross-attention weights,用text-audio数据,重复该方法,训练得到audio的encoder和cross-attention weights。之后固定audio的encoder和cross-attention weights,用audio-video数据,重复该方法,训练得到video的encoder和cross-attention weights。这样在cross-attention中,四种模态的noisy latent都被对齐了,之后可以interpolation不同noisy latent的encoder embedding进行joint sampling,即使这种combination可能没训练过。
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
对于多模态数据,利用Codi的multimodal encoder,将其它模态的编码结果(feature sequence)送入LLM进行训练,对输出(feature sequence)进行回归,同时将其输入对应模态的diffusion model计算diffusion loss,两个loss一起训练。
text还是token prediction loss进行训练。
本质还是feature-based而非token-based。
GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
为不同模态语料(如语音、外文等)学习一个编码网络,使编码结果(分布)与现有的StableDiffusion的text encoder的编码结果(分布)对齐。
这样就可以无缝切换,使用训练好的编码网络为StableDiffusion提供cross-attention的kv,做不同模态的生成。
不用fine-tune StableDiffusion,而且fine-tune会导致对之前模态的遗忘。
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
模仿InstructGPT训练可根据instruction进行生成的StableDiffusion。
将不同任务整理成统一形式的task,每个task包含task instruction(如segmentation to image),prompt,visual conditon(segmentation)和target image,训练时使用ControlNet架构,prompt输入StableDiffusion,task instruction和visual condition输入ControlNet,多个task一起训练。可以泛化到zero-shot task和zero-shot task combination(如segmentation + skeleton to image)。
In-Context Learning Unlocked for Diffusion Models
Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts
prompt由一个example pair和一个text构成,example pair由query image(如segmentation、edge map等)和query image对应的real image组成,之后给定一个新的query image,模型需要根据example pair和text生成对齐的图像。
训练好的模型还可以适用于unseen example pair,即In-Context Learning(无需训练的学习框架)。
模型架构和ControlNet一致,只是输入的条件变成了example pair和新的query image的组合。
Context Diffusion: In-Context Aware Image Generation
ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
和PromptDiffusion一样的In-Context Learning,example pair + query image + target image组成一个2
InstructGIE: Towards Generalizable Image Editing
和ImageBrush类似。
Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model
和ImageBrush类似。
ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models
和ImageBrush类似。
HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation
skeleton也用VAE encoder编码,concat在
把
HairDiffusion: Vivid Multi-Colored Hair Editing via Latent Diffusion
From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation
Appearance Encoder的输入不加噪,且每个part image独立输入提供reference feature,输入的text为该part image对应的类别,如face、hair等。
Shared Self-Attention的思想类似GLIGEN,进行self-attention后只保留image feature。如果有part image的mask,attention时只attend unmask部分的pixel。
Decoupled Cross-Attention是IP-Adapter,两个并行的cross-attention layer分别处理text和part image。
HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting
hand depth map + ControlNet
Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation
先生成手再生成body。
HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
以hand params为中介进行生成。
RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance
将畸形的手从原图中割下来,输入RHanDS进行修复,之后再粘贴回原图。
RHanDS的训练包含两个阶段,第一阶段构造数据集(同一个人的两只手作为一对数据)训练保持style,第二阶段使用一个3D模型提取mesh训练根据structure重构。该3D模型也可以根据畸形的手提取出正常手的mesh。
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance
AbHuman数据集:使用StableDiffusion生成human图像,人工标注了异常分数以及异常的区域,之后训练一个打分模型和一个异常目标检测模型。
在AbHuman上fine-tune一下StableDiffusion,不然StableDiffusion无法识别含有异常描述的prompt,之后CFG + score guidance进行生成。
之后的refine是可选项。
Hand1000: Generating Realistic Hands from Text with Only 1,000 Images
MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
MoE的方法组合LoRA参数。
TextDiffuser: Diffusion Models as Text Painters
生成带文字的图片。
先训练一个Transformer生成文字的layout,再训练一个以layout的mask为条件的diffusion model生成图片。
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
训练一个LLM对text rendering进行layout planning,之后训练一个diffusion model根据layout planning进行生成。
CustomText: Customized Textual Image Generation using Diffusion Models
Conditional Text-to-Image Generation with Reference Guidance
GlyphControl: Glyph Conditional Control for Visual Text Generation
自监督训练,使用OCR模型识别带文字图像中的文字,并将其输入ControlNet训练重构原图。
GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently
所有条件输入UNet重新训练。
GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models
ControlNet
How Control Information Influences Multilingual Text Image Generation and Editing?
ControlNet
AnyText: Multilingual Visual Text Generation And Editing
UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model
ControlNet + cross-attention mask constraint
LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions
object-layout control module由GLIGEN实现。
visual-text rendering module由ControlNet实现(在GLIGEN的基础上),类似ControlNet-XS解决information delay问题一样,为了让layout与glyph信息有交互,让skip feature与backbone feature进行cross-attention后再进行skip-connection。
AMOSampler: Enhancing Text Rendering with Overshooting
training-free
使用Text Rendering部分计算cross-attention map,we then average the attention map over different layers and heads and rescale its values between 0 and 1.
ODE Overshooting: 从
根据ODE Overshooting时的步子,对不同image patch加不同大小的噪,让所有image patch回到相同时间步,得到
TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation
training-free.
ARTIST: Improving the Generation of Text-rich Images by Disentanglement
text module:先使用只有text的黑白图片训练一个diffusion model。
visual module:固定text module,使用带text的真实图片训练一个diffusion model,for each intermediate feature from the mid-block and up-block layers of text module, we propose to use a trainable convolutional layer to project the feature and add it element-wisely onto the corresponding intermediate output feature of the visual module.
JoyType: A Robust Design for Multilingual Visual Text Creation
ControlNet
diffusion loss +
Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training
TextMaster: Universal Controllable Text Edit
Towards Visual Text Design Transfer Across Languages
Collage Diffusion
将不同collage拼在一起并保证harmonization(无重叠)。
使用TI将每个collage编码进text embedding,同时修改StableDiffusion的cross-attention,类似MaskDiffusion引入mask信息,一起训练。
生成时为每个collage的pseudo word对应的cross-attenion map引入mask。
Zero-Shot Image Harmonization with Generative Model Prior
Given a composite image, our method can achieve its harmonized result, where the color space of the foreground is aligned with that of the background.
To achieve image harmonization, we can leverage a word whose attention is mainly constrained to the foreground area of the composite image, and replace it with another word that can illustrate the background environment.
DiffHarmony: Latent Diffusion Model Meets Image Harmonization
DiffHarmony++: Enhancing Image Harmonization with Harmony-VAE and Inverse Harmonization Model
RecDiffusion: Rectangling for Image Stitching with Diffusion Models
task:rectangling
PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering
pixel composition:按照mask直接拼接在一起。
correlation diffuser:object的inversion过程中的self-attention layer的KV取代pixel composition的self-attention layer的KV,注意只取代
RCA:限制object对应的cross-attention在mask内,mask之外的响应值赋为负无穷。
每一步latent都要和background的inversion过程中的latent再做pixel composition,以保持背景。
FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior
利用DDS优化图像进行image composition。
对于object removal:
对于image composition:带T2I-Adapter的DDS,
对于image harmonization:
Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation
TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition
将reference image注入到main image中,并且符合为main image的风格。
使用exceptional inversion将两个image编码到噪声,然后将reference image的编码噪声resize并注入到main image的编码噪声中,再生成。
TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization
Composite Diffusion
scaffolding stage: 根据condition生成到某一中间步,只有大致的结构。
harmonization:text-guided generation or blended(若有segmentation condition)
Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis
vision guidance: 给
MagicFace: Training-free Universal-Style Human Image Customized Synthesis
图中画错了,应该是
利用cross-attention和self-attention估计出每个concept的mask。
RSA: self-attention时concat上所有concept的K和V,计算self-attention map时乘上一个mask(也是concat在一起),抬高不同concept对应区域的权重。
RBA: 每个concept单独计算出一个self-attention map,只留下mask区域内的。
Make-A-Storyboard: A General Framework for Storyboard with Disentangled and Merged Control
使用TI分别学习concept和scene,如果直接用concept+scene造句,生成效果不佳。可以先用concept生成,然后提取mask。然后分别用concept和scene进行生成,到某一步
AnyScene: Customized Image Synthesis with Composited Foreground
Foreground Injection Module是ControlNet架构自监督训练。
Generative Photomontage
Note that we inject the initially generated self-attention features for all images except for
MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
通用框架,类似survey。
除了提供text,还需要指定需要编辑的区域,编辑时使用text-guided inpainting方法,保持unmask部分不变,参考Inpainting部分。
Guided Image Synthesis via Initial Image Editing in Diffusion Model
对生成图像不满意的地方,可以修改初始噪声中对应区域(如重新采样该区域的噪声),从而改变该区域的生成结果,其余区域基本保持不变。
Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion
Image Inpainting Models are Effective Tools for Instruction-guided Image Editing
Grounded-SAM获取mask后使用inpainting方法进行编辑。
MagicQuill: An Intelligent Interactive Image Editing System
难点在于如何保持图像除编辑外的背景和其它内容与原图一致。
DDIM Inversion + Conditional Generation
Text-Guided SDEdit
LASPA: Latent Spatial Alignment for Fast Training-free Single Image Editing
为了保持原图的细节,最直接的做法就是将原图注入生成过程中,SDEdit相当于只是单步注入,LASPA在每一步都注入,使用最简单的插值法。
Text Guided Image Editing with Automatic Concept Locating and Forgetting
Text-Guided SDEdit
Text-Guided SDEdit方法会使编辑生成的concept受限于原图,如shape等,因此使用语法分析器分析出要忘记的concept
Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations
Rectified Flows (RFs) offer a promising alternative to diffusion models, yet their inversion has been underexplored. We propose RF inversion using dynamic optimal control derived via a linear quadratic regulator.
Taming Rectified Flow for Inversion and Editing
RF-Solver not only significantly enhances the accuracy of inversion and reconstruction, but also improves performance on fundamental tasks such as T2I generation.
用RF-Solver进行inversion后进行类似P2P的编辑。
Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing
RF inversion
Prompt-to-Prompt Image Editing with Cross Attention Control
Imagen
text2img模型生成的图片的结构主要由随机种子和cross-attention决定,通过保持随机种子不变(使用DDIM时就是控制起始噪声不变),操控cross-attention可以实现内容保持。
此方法并不是对已有图片做编辑,而是从高斯噪声开始的,并行地生成两张图,一张根据source prompt生成,一张根据target prompt生成(程序运行前并不知道原图是什么样),相当于两条并行的使用source prompt的reconstruction generative trajectory和使用target prompt的editing generative trajectory,前者为后者提供cross-attention map用于修改自身的cross-attention map以达到编辑的效果。
对Imagen做操纵:Imagen的text条件既通过cross-attention注入,也通过hybrid attention注入(self-attention时把text token拼接到visual token后面),此时
KV都变成了visual token+target prompt token,对新的QK的计算结果即cross-attention map做操纵,主要有三种:word swap:除了被换的词,其它都用原来的cross-attention map;adding a new phrase:旧phrase部分都用原来的cross-attention map;attention re-weighting:给原来的cross-attention map要增强/减弱的词乘常数系数。
上述都是generated image editing方法,如果想做real image editing,需要进行DDIM Inversion。先使用source prompt对原图进行DDIM Inversion加噪,从得到的噪声开始,并行运行reconstruction和editing两条generative trajectory进行编辑。
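A toy sketch (hypothetical code, not the official implementation) of two of the cross-attention operations above, operating on attention maps of shape (batch, heads, image_tokens, text_tokens).
```python
import torch

def reweight_attention(attn: torch.Tensor, token_idx: int, scale: float) -> torch.Tensor:
    """Attention re-weighting: scale the map of one word."""
    out = attn.clone()
    out[..., token_idx] = out[..., token_idx] * scale
    return out

def word_swap(attn_target: torch.Tensor, attn_source: torch.Tensor,
              swapped_idx: int) -> torch.Tensor:
    """Word swap: keep the source cross-attention maps for every token except the swapped word."""
    out = attn_source.clone()
    out[..., swapped_idx] = attn_target[..., swapped_idx]
    return out

# During sampling, the editing trajectory replaces its cross-attention maps with the
# manipulated ones for the first N steps, then runs freely for the remaining steps.
```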
Null-text Inversion for Editing Real Images using Guided Diffusion Models
StableDiffusion
解决P2P做real image editing时的矛盾:使用较大的CFG scale会放大DDIM Inversion的误差导致重构失真,而使用较小的CFG scale又会削弱编辑能力。
先使用CFG scale=1对原图做DDIM Inversion得到一条latent轨迹,然后以较大的CFG scale做重构,每一步优化null-text embedding,使重构轨迹贴近inversion轨迹。
editing时,从inversion得到的噪声开始,使用优化好的null-text embedding和较大的CFG scale,配合P2P进行编辑。
Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models
StableDiffusion
不需要source prompt,所以DDIM Inversion时使用unconditional(空prompt)。
类似Null-text Inversion,先使用unconditional DDIM Inversion得到latent轨迹,再在重构的每一步优化一个可学习的conditional embedding,使重构轨迹贴近inversion轨迹。
editing是非P2P模式时:从inversion得到的噪声开始,使用优化得到的embedding与target prompt embedding的线性插值进行生成,插值系数权衡重构保真与编辑强度。
InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models
需要先perform manual pixel-level editing using techniques such as brush strokes, image pasting, or selective edits得到大概的编辑图,再进行refine。这与NTI这种从原图开始编辑的方法不同。
refine的过程:使用edit prompt对大概的编辑图进行DDIM Inversion到某一中间步后再CFG生成,类似purification。
DDIM Inversion过程中,根据
BARET: Balanced Attention based Real image Editing driven by Target-text Inversion
类似Prompt Tuning Inversion,但以target prompt embedding初始化
Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models
不用像Null-text Inversion优化
这样DDIM Inversion和reconstruction时,无论
editing时使用source prompt作为negative prompt。
ProxEdit: Improving Tuning-Free Real Image Editing with Proximal Guidance
improved NPI
StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing
类似NTI,先使用
除了使用
editing时,从
FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing
第一阶段:对DDIM Inversion得到的
第二阶段:对第一阶段得到的图像进行SDEdit,生成过程中注入reconstruction generative trajectory的self-attention的KV,与原图的特征对齐。
Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code
P2P的reconstruction generative trajectory每一步都做修正,使
training-free,不需要任何优化。
SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing
改进P2P,disentangle the guidance scale for the source and target branches to reduce the error.
Schedule Your Edit: A Simple yet Effective Diffusion Noise Schedule for Image Editing
We introduce the Logistic Schedule, a novel noise schedule designed to eliminate singularities, improve inversion stability, and provide a better noise space for image editing. This schedule reduces noise prediction errors, enabling more faithful editing that preserves the original content of the source image.
Inversion-Free Image Editing with Natural Language
DDIM选取
利用这一点就不需要对原图DDIM Inversion就可以进行编辑。
IterInv: Iterative Inversion for Pixel-Level T2I Models
针对含有super-resolution stage的inversion。
KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing
The contents (texture and identity) are mainly controled in the self-attention layer, we choose to learn the K and V embeddings in the self-attention layer.
先使用
editing时,从
EDICT: Exact Diffusion Inversion via Coupled Transformations
StableDiffusion
非P2P模式,直接用source prompt进行DDIM Inversion,然后用target prompt生成,都使用较大的CFG scale。
利用Flow-based Generative Models中的Affine Coupling Layer的思想,设计了可逆的denoising过程,确保使用较大的CFG scale时也能精确重构。
Exact Diffusion Inversion via Bi-directional Integration Approximation
BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models
Effective Real Image Editing with Accelerated Iterative Diffusion Inversion
DDIM的生成过程是确定性映射,其精确的inversion需要求解一个隐式方程,本论文用不动点迭代(配合加速技巧)求解,得到更精确的DDIM Inversion。
(这一条与本论文提出的算法无关)本论文发现,P2P中使用非对称的CFG scale(inversion和reconstruction用较小的,editing用较大的)效果更好。
本论文使用P2P算法,在DDIM Inversion时使用上述不动点算法,DDIM Inversion和reconstruction generative trajectory都使用较小的CFG scale,editing generative trajectory使用较大的CFG scale。
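A minimal sketch of fixed-point DDIM inversion as described above (hypothetical epsilon-model interface, no acceleration): each inversion step solves the implicit equation by iterating the update a few times instead of the usual one-shot linearization.
```python
import torch

@torch.no_grad()
def ddim_invert_step(eps_model, x_t, t, t_next, alphas_cumprod, n_iter: int = 5):
    """One inversion step x_t -> x_{t_next} (t_next > t) via fixed-point iteration.
    eps_model(x, t) is assumed to return the predicted noise; sketch only."""
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    x_next = x_t                                   # initial guess
    for _ in range(n_iter):
        eps = eps_model(x_next, t_next)            # evaluate at the current estimate
        x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x_next = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps
    return x_next
```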
Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models
类似AIDI,也是求不动点。DDIM生成过程的逆过程同样写成隐式方程,用不动点迭代求解。
可以用于多种编辑方法,如P2P,MasaCtrl,PNP,ELITE
Fixed-Point Inversion for Text-to-Image Diffusion Models
不动点。
Exploring Fixed Point in Image Editing Theoretical Support and Convergence Optimization
不动点。
AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing
基于P2P的soft editing。
Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
P2P是替换cross-attention map,但是需要找到real image的prompt,虽然可行但效果不好。本文发现替换self-attention map也是可以的。
real image editing时,DDIM Inversion不需要prompt,reconstruction也不需要prompt。
FateZero: Fusing Attentions for Zero-shot Text-based Video Editing
StableDiffusion
P2P是将generative trajectory的cross-attention map注入到editing trajectory里,本论文直接将DDIM Inversion时的attention map注入到editing trajectory,此时就不需要generative trajectory了。这样做重构的效果也很好。
都使用较大的
Addressing Attribute Leakages in Diffusion-based Image Editing without Training
解决editing attribute leakage的问题,其它物体受被编辑物体的影响也被改变了。
每张图有
ORE: 编码base target prompt得到text embedding
RGB-CAM: 分别使用base target prompt和
BB: 根据background mask对reconstrution generative trajectory和editing generative trajectory的latent进行blend(加权和)。
MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
Tuning-Free Inversion-Enhanced Control for Consistent Image Editing
StableDiffusion
先用source prompt对原图进行DDIM Inversion加噪,从得到的
只修改UNet decoder的后几层的self-attention:the Query features in the shallow layers of U-Net (e.g., encoder part) cannot obtain clear layout and structure corresponding to the modified prompt。
只在中间的几步进行操作:performing self-attention control in the early steps can disrupt the layout formation of the target image. In the premature step, the target image layout has not yet been formed.
同时,每一步,两条generative trajectory都使用阈值法根据cross-attention map计算一个object的mask,限制editing generative trajectory的object区域的self-attention只参考reconstruction generative trajectory的object区域的信息。
相比于P2P只操控cross-attention,MasaCtrl只操控self-attention,操控cross-attention适合做物体增删,操控self-attention适合做动作改变。
DiT4Edit: Diffusion Transformer for Image Editing
DiT版本的MasaCtrl。
Multi-Region Text-Driven Manipulation of Diffusion Imagery
MultiDiffusion版本的P2P,对不同region进行编辑。
Localizing Object-level Shape Variations with Text-to-Image Diffusion Models
StableDiffusion
对图像中某个物体做变换,而其它部分不改变,如将篮子变成盘子。两条并行的generative trajectory,在某个时间段内将句子中的单词替换。
shape preservation:在cross-attention map上使用阈值法标定出某个需要shape preservation的word对应的object的位置,然后在之前的self-attention map中,将该object所有的pixel对应的self-attention map的行和列注入到新的generative trajectory上。也可以将要编辑的object标定出来,然后把标定之外的pixel当做背景,对这些pixel做shape preservation。
使用Null-text Inversion可以做real image editing。
Vision-guided and Mask-enhanced Adaptive Denoising for Prompt-based Image Editing
Delta Denoising Score
将图像本身看成参数,就可以利用SDS进行编辑(输入target prompt,梯度更新图像),但这样会导致图像模糊,如图中上半部分。
导致这种情况的原因是SDS loss中含有偏离项,因此将SDS loss分为两项,一项是用于编辑的,一项是使得图像变模糊的偏离项。提出DDS loss,即用原图+source prompt的噪声预测作为偏离项的估计,从编辑图+target prompt的噪声预测中减去:grad = eps(z_t_edit, y_target) − eps(z_t_src, y_source)。
DDS对于每个编辑需求都要进行反向传播更新,比较消耗计算资源,进一步可以通过DDS训练一个编辑模型,如图所示。
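A sketch of one DDS update on the edited latent (hypothetical model interface): the same timestep and noise are applied to both latents, and the gradient is the difference of the two noise predictions.
```python
import torch

def dds_step(eps_model, z_edit, z_src, emb_tgt, emb_src, alphas_cumprod, lr=0.1):
    """One Delta Denoising Score update on z_edit (sketch only)."""
    t = torch.randint(50, 950, (1,), device=z_edit.device)      # sample a timestep
    noise = torch.randn_like(z_edit)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_edit_t = a_t.sqrt() * z_edit + (1 - a_t).sqrt() * noise   # noise both latents
    z_src_t = a_t.sqrt() * z_src + (1 - a_t).sqrt() * noise     # with the same noise
    with torch.no_grad():
        eps_tgt = eps_model(z_edit_t, t, emb_tgt)
        eps_src = eps_model(z_src_t, t, emb_src)
    grad = eps_tgt - eps_src        # SDS(target) minus the bias term from the source
    return (z_edit - lr * grad).detach()
```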
DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation
将DDS的随机采样时间步和噪声改为在DDIM采样过程中进行score distillation,不需要提供原图的prompt也能进行DDS编辑。
Specifically, in contrast to the original DDS method that adds newly sampled Gaussian noise to
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
确保diffusion model的latent space smoothness,smooth latent spaces ensure that a perturbation on an input latent (
做法是训练时加一个正则项Step-wise Variation Regularization。
对
InstructPix2Pix: Learning to Follow Image Editing Instructions
利用GPT3,StableDiffusion,P2P(generated image editing)创建一个数据集,每条数据包含原图,原图描述,目标描述和目标图片,训练一个新的StableDiffusion,以原图和目标描述为条件,建模目标图片,这样在推理时就不需要原图描述了。
Emu Edit: Precise Image Editing via Recognition and Generation Tasks
类似IP2P,创建新数据集进行训练。
像Emu一样,训练完后使用少量高质量数据进行fine-tune。
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
A large-scale (~4M editing samples), automatically generated dataset for instruction-based image editing.
SeedEdit: Align Image Re-generation to Image Editing
自举式迭代训练。
Instruction-based Image Manipulation by Watching How Things Move
类似AnyDoor标注instruction-based editing dataset。
将source image concat到noisy latent上作为条件进行训练。
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
与IP2P使用P2P构造数据集不同,PbI使用PbE的思想构造数据集。
editing model和IP2P一样。
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
使用分割数据集+inpainting模型造数据。
训练时根据inpainting结果和text去预测原图,除了diffusion loss,还训练一个小模型预测mask,对最终结果进行blend。
Referring Image Editing: Object-level Image Editing via Referring Expressions
比general image editing更加精细
利用现有的image composition model、region-based image editing model、image inpainting model构造数据集进行训练。
编辑模型是一个conditional diffusion model,source image和referring expression作为条件送入cross-attention进行训练。
EditWorld: Simulating World Dynamics for Instruction-Following Image Editing
使用GPT生成input text,instruction和output text,使用SDXL根据input text生成
editing model与IP2P一样。
LIME: Localized Image Editing via Attention Regularization in Diffusion Models
预训练好的InstructPix2Pix。
提取原图的UNet features,resize,concat,normalization,聚类,得到segmentation。
提取目标描述中related token的cross-attention map,算出响应值最高的几个点,这几个点所在的segment拼在一起,即为RoI区域。
在IP2P生成时做blended editing,同时利用RoI修改cross-attention map,对于unrelated token的cross-attention map,RoI区域内的都减去一个较大的常数值,避免unrelated token对编辑造成影响。
Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
预训练好的InstructPix2Pix。
从instruction中提取关键词,使用该关键词对应的cross-attention map,多次进行平方+norm的操作拉开高低值之间的差距,使用阈值法估算出一个mask。
对instruction的所有token的cross-attention map,mask区域内的响应值做增强,mask区域外的响应值使用
采样时,对
Watch Your Steps: Local Image and Scene Editing by Text Instructions
预训练好的InstructPix2Pix。
类似DiffEdit,在编辑之前先计算一个mask,在InstructPix2Pix生成时做blended editing。
ZONE: Zero-Shot Instruction-Guided Local Editing
预训练好的InstructPix2Pix。
description-guided model类似StableDiffusion的cross-attention map是token-wise的,instruction-guided model类似InstructPix2Pix的cross-attention map是consistent的。所以在InstructPix2Pix的cross-attention map上利用阈值法估计出一个mask。但这个mask过于粗糙,所以将InstructPix2Pix的编辑结果送入SAM,利用IoU选出重叠最大的segment作为mask。得到mask后,用原图的mask之外的部分替换InstructPix2Pix的编辑结果的mask之外的部分,再利用一些平滑操作去除artifact。
Visual Instruction Inversion: Image Editing via Visual Prompting
基于IP2P做Visual Instruction的Textual Inversion。
IP2P的输入是原图和instruction,输出是编辑后的图像。现给定一对原图和编辑后的图像的示例,在IP2P上利用TI的思想学习一个instruction的embedding,之后就可以把这个学到的instruction embedding用在其它图像上,实现与示例类似的编辑效果。
E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance
DDIM Inversion使用
Queries for structure and layout, whereas keys and values for textures and appearance. 对于保持layout的编辑,选择替换Q,此时就不需要下面的优化;对于需要编辑layout的编辑,选择替换KV,此时需要下面的优化。
类似DiffusionCLIP,两个loss优化Q的projection matrix,一个是CLIP direction loss
Imagic: Text-Based Real Image Editing with Diffusion Models
Imagen
只给原图和target prompt
先以target prompt embedding为起点,使用TI优化出一个source prompt embedding,之后fix source prompt embedding,fine-tune Imagen,之后使用source prompt embedding和target prompt embedding线性插值进行生成。
不fine-tune Imagen做不到图像保持,类似DragDiffusion,所以fine-tune很重要。
FastEdit: Fast Text-Guided Single-Image Editing via Semantic-Aware Diffusion Fine-Tuning
改进优化Imagic,使得每次编辑的速度提升20多倍。
优化一:不使用text-to-image模型,而是使用image-to-image模型,其可以根据CLIP image embedding生成图像,这样就不需要TI优化source prompt embedding了,原图的CLIP image embedding就可以作为source embedding,这里利用了CLIP embedding的对齐性质。
优化二:使用原图的CLIP image embedding对diffusion model进行fine-tune重构原图,这里根据原图的CLIP image embedding和target prompt的CLIP text embedding的差异度选择fine-tune的时间步范围,减少fine-tune次数。
优化三:使用LoRA fine-tune,减少fine-tune参数量。
fine-tune结束后,类似Imagic,可以使用原图的CLIP image embedding和target prompt的CLIP text embedding的插值进行编辑生成。
Forgedit: Text Guided Image Editing via Learning and Forgetting
setting与Imagic相同,做法稍有差异。
vision language joint learning:使用BLIP为原图生成source prompt,将source prompt输入CLIP得到source prompt embedding,再使用该embedding和原图一起fine-tune Imagen,这里embedding也参与优化。fine-tune Imagen时只更新一部分参数,并且发现The encoder of UNets learns the pose, angle and overall layout of the image. The decoder learns the appearance and textures instead.所以可以forget参数:If the target prompt tends to edit the pose and layout, we choose to forget parameters of encoder. If the target prompt aims to edit the appearance, the parameters of decoder should be forgotten.
生成时,计算target prompt embedding与优化得到的source prompt embedding正交的部分作为editing embedding,使用优化得到的source prompt embedding与editing embedding的线性组合进行生成,目的是为了保持原图细节。
On Manipulating Scene Text in the Wild with Diffusion Models
和Imagic顺序相反,因为这里提供了source prompt。
先fine-tune diffusion model,再使用预训练好的text recognition model的交叉熵loss优化target prompt embedding。
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
StableDiffusion
使用unconditional DDIM Inversion(输入空prompt)对原图加噪,之后并行运行一条reconstruction generative trajectory和一条使用target prompt的editing generative trajectory。
feature injection:和MasaCtrl得出一样的结论,UNet深层的feature有更好的structure信息。使用reconstruction generative trajectory的UNet较深层的feature map替换editing generative trajectory的。但这样虽然很好了的保留了原图的structure信息,但也有一些纹理信息泄露到了生成图像中。
self-attention map injection:使用reconstruction generative trajectory的self-attention map(由Q、K计算得到)替换editing generative trajectory的,在保留structure的同时避免feature injection带来的纹理泄露。
Diffusion Self-Guidance for Controllable Image Generation
用cross-attention map或者UNet feature map计算loss并求梯度作为guidance,实现物体移动、改变大小、改变外观等编辑功能。
position:object对应的word对应的cross-attention map的质心位置。
shape:对object对应的word对应的cross-attention map使用阈值法得到一个二值mask。
apperance:使用上述mask乘上UNet feature map后求均值。
编辑时两条trajectory,一条generative或者reconstruction trajectory,一条editing trajectory,计算所有不想改变的物体对应的word对应的shape和apperance之间的MSE loss,再根据编辑需求计算loss,求梯度指导editing trajectory生成。
物体移动:计算某个object对应的word对应的position与期望位置之间的MSE loss。
改变大小:计算某个object对应的word对应的shape与期望的shape之间的MSE loss。
改变外观:计算某个object对应的word对应的appearance与期望apperance之间的MSE loss。
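A sketch of the three properties above, computed from one word's cross-attention map A of shape (H, W) and a UNet feature map F of shape (C, H, W) (hypothetical tensors).
```python
import torch

def centroid(attn: torch.Tensor) -> torch.Tensor:
    """Position: attention-weighted centroid (y, x) of a (H, W) map."""
    h, w = attn.shape
    ys = torch.arange(h, dtype=attn.dtype).view(h, 1)
    xs = torch.arange(w, dtype=attn.dtype).view(1, w)
    total = attn.sum() + 1e-8
    return torch.stack([(attn * ys).sum() / total, (attn * xs).sum() / total])

def shape_mask(attn: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    """Shape: binary mask by thresholding the normalized attention map."""
    a = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    return (a > thresh).float()

def appearance(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Appearance: masked spatial mean of the (C, H, W) feature map."""
    return (feat * mask).sum(dim=(1, 2)) / (mask.sum() + 1e-8)

# e.g. moving an object penalizes ||centroid(attn_word) - target_position||^2, and the
# gradient of that loss w.r.t. the latent is used as guidance at every sampling step.
```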
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
使用DDIM Inversion过程中的latent和editing过程中的latent输入UNet,计算两者self-attention map和feature之间的MSE loss,求梯度进行guidance。
注意计算loss时输入的都是source prompt,目的是保持layout一致。
Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance
结合了MCG和DDS的guidance方法,使用任意loss指导采样。
Diffusion Models Already Have a Semantic Latent Space
训练时先对数据集图像使用
Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models
虽然还是在h-space做,但采样不再是非对称的了,还是用原DDIM的公式,过UNet时修改
unsupervised global: 生成一些样本,保存所有时间步的
unsupervised image-specific: 比如一个睁眼闭眼的编辑方向,对某张带着墨镜的人脸是没有意义的。使用类似h-space微分几何的方法,在h-space中找到能使
supervised: 使用标注的数据对,每对数据中正例含有某个属性,负例不含该属性,每对正例的
ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation
CLIP Direcitonal Loss为Diff-AE的
Zero-Shot Inversion Process for Image Attribute Editing with Diffusion Models
Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models
把StyleGAN已经学到的interpretable direction迁移到StbaleDiffusion上,使用两个loss学习一个CLIP text embedding
NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models
identify interpretable directions in text embedding space of text-to-image diffusion models
In noisy space, for edits carried out by the same direction to be attracted towards each other, while edits conducted by different directions to repel one another, in line with the core principles of contrastive learning.
Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
StableDiffusion
训练好之后还可用于image editing,只能用在符合
SINE: SINgle Image Editing with Text-to-Image Diffusion Models
StableDiffusion
类似DreamBooth,用原图和带有pseudo word的prompt,fine tune pseudo word embedding和StableDiffusion,每编辑一张图就要fine-tune一次模型。
提出Patch-Based Fine-Tuning,假设StableDiffusion LDM尺寸为
编辑时使用model-based classifier-free guidance,把fine-tuned模型看作专门生成这个single image的unconditional模型。
不需要DDIM Inversion。
SEGA: Instructing Diffusion using Semantic Dimensions
CFG的线性组合
DiffEdit: Diffusion-based Semantic Image Editing with Mask Guidance
自动计算mask的Blended Diffusion。
对于text-to-image模型,在加噪后的图像上分别输入query prompt和reference prompt(或空prompt)预测噪声,两者预测的差异经过多次采样平均并阈值化后即为要编辑区域的mask,之后做blended生成。
理论证明了,使用unconditional DDIM Inversion加噪,比SDE直接一步加噪,重构效果更好。
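A minimal sketch of the mask estimation above (hypothetical epsilon-model interface): contrast the noise predictions under the two prompts, average over several noise draws, then threshold.
```python
import torch

@torch.no_grad()
def diffedit_mask(eps_model, x0, query_emb, ref_emb, alphas_cumprod,
                  t: int = 500, n_samples: int = 10, thresh: float = 0.5):
    """Estimate an edit mask by contrasting noise predictions (sketch only)."""
    a_t = alphas_cumprod[t]
    diffs = []
    for _ in range(n_samples):
        noise = torch.randn_like(x0)
        x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise
        d = eps_model(x_t, t, query_emb) - eps_model(x_t, t, ref_emb)
        diffs.append(d.abs().mean(dim=1, keepdim=True))   # average over channels
    diff = torch.stack(diffs).mean(dim=0)                 # average over noise samples
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    return (diff > thresh).float()                        # binary mask for blending
```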
LIPE: Learning Personalized Identity Prior for Non-rigid Image Editing
有待编辑物体的reference image的编辑任务。
先利用TI技术学习一个待编辑物体的pseudo word,利用pseudo word简单造句得到identity-aware prompt,之后使用source prompt对原图进行DDIM Inversion,记录所有latent,再使用source prompt和identity-aware prompt进行重构,利用重构最后一步时pseudo word的cross-attention map估算出一个mask,从inversion得到的噪声开始使用target prompt和identity-aware prompt进行编辑生成,编辑的每一步,根据pseudo word的cross-attention map估算出一个mask,取两个mask的并,根据这个mask进行latent的blend,mask区域内取编辑生成的latent,mask区域外取inversion时的latent,以保持背景不变。
DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images
自动计算mask,转换为inpainting问题。
FISEdit: Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference
类似DiffEdit自动计算mask:利用P2P的方法操作cross-attention map,使用两个generative trajectory输出的feature map计算出difference mask记为要编辑的区域。
Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks
类似DiffEdit自动计算mask:利用target prompt的start token对应的cross-attention map具有全局语义信息的性质,计算其余token的cross-attention map与其的相似度,使用最相似的那个token的cross-attention map,处理后估计一个mask。
Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
Unsupervised Representation Learning from Pre-trained Probabilistic Diffusion Models
训练自编码器,在隐空间训练线性分类器,利用属性超平面的法向量作为编辑方向。
Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models
DisControlFace: Disentangled Control for Personalized Facial Image Editing
使用预训练的Diff-AE,额外训练一个ControlNet引入控制信息,但这样训练有一个问题,the pre-trained Diff-AE backbone can already allow near-exact image reconstruction, only limited gradients can be generated during error back propagation, which are far from sufficient to effectively train ControlNet。所以引入masked-autoencoding的思想,训练时使用masked
采样时,先估计出原图的控制信息,然后可以对控制信息进行编辑,再生成,同时使用
User-friendly Image Editing with Minimal Text Input: Leveraging Captioning and Injection Techniques
传统的编辑方法类似P2P需要用户提供source prompt和target prompt,本论文使用现有的caption模型为原图生成source prompt,用户只需要指出需要修改source prompt中哪些concept即可。
HIVE: Harnessing Human Feedback for Instructional Visual Editing
训练一个StableDiffusion,以原图和target prompt为条件,对目标图像进行去噪。
引入human feedback,使用learned reward function fine-tune上述StableDiffusion。
DialogPaint: A Dialog-based Image Editing Model
StableDiffusion
multi-turn editing
Iterative Multi-granular Image Editing using Diffusion Models
StableDiffusion
multi-turn editing,在StableDiffusion的latent space上进行多轮编辑。
GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
利用GPT-4调度使用各种编辑方法进行编辑。
GPT-4是只能生成text的MLLM,所以只能帮助做plan,无法直接根据需求生成图像。
Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing
针对复杂编辑要求的DDS方法。
使用GPT-4V分解编辑需求和编辑区域,得到原图的prompt序列
GPT-4V是只能生成text的MLLM,所以只能帮助做plan,无法直接根据需求生成图像。
Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing
利用LLM将ambiguous instruction改写为多个specific instructions,利用IP2P模型组合多个instructions进行编辑。
Image Translation as Diffusion Visual Programmers
CFG的strength很敏感,很小的改动会导致生成图像很大的不同,每张图都去要调整strength,不实用。受style transfer的instance normalization的启发,提出Instance Normalization Guidance:
TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing
构造CoT数据fine-tune MLLM。
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing
fine-tune VLM,根据reference image和text instruction,generates much more precise editing instructions.
Guiding Instruction-based Image Editing via Multimodal Large Language Models
使用InstructPix2Pix的数据集,让MLLM根据图像和old instruction生成new instruction,给new instruction后加一些可训练的[IMG] token。
将old instruction、原图和new instruction输入LLaVA,训练生成new instruction的text部分,同时将[IMG]部分的feature作为editing command,和原图一起输入一个diffusion model,生成目标图像,所有可训练模块一起训练。
LLaVA是只能生成text的MLLM,无法直接根据需求生成图像,这里借助了MLLM的编码能力,为其feature训练一个diffusion decoder。
Customization Assistant for Text-to-image Generation
和MGIE类似。
EmoEdit: Evoking Emotions through Image Manipulation
根据emotion生成instruction,使用预训练IP2P进行编辑。
ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing
We introduce ReasonPix2Pix, a dataset specifically tailored for instruction-based image editing with a focus on reasoning capabilities. 构造数据集时生成具有联想能力的instruction,比如使用the owner of the castle is a vampire代替make the castles dark.
原图和instruction输入MLLM,使用MLLM输出的feature和原图作为条件fine-tune StableDiffusion,生成目标图像。
Pathways on the Image Manifold: Image Editing via Video Generation
可以看成image-guided inpainting,参考Inpainting部分的text-guided inpainting,只是将条件从text换成了image。
ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models
Latent Variable Refinement: match the low-pass filter feature of noisy latents to that of reference image
Filter-Guided Diffusion for Controllable Image Generation
类似ILVR,设计filter让生成样本与reference image在特定属性上一致。
Paint by Example: Exemplar-based Image Editing with Diffusion Models
StableDiffusion
输入原图,mask,reference image,输出原图mask部分被reference image取代并融合的图片。整体架构和text-guided image inpainting类似,将reference image看成text,作为condition输入到StableDiffusion中,masked image、mask和noisy latent concat在一起作为UNet的输入。
self-supervised learning:使用带有bounding box的图像数据集进行自监督训练,即将bounding box内区域作为mask,bounding box内图片作为参考图片。这样训练时模型很容易过拟合,模型只学到学到一个简单的复制粘贴,提出两个解决方案:Information Bottleneck:因为我们需要将参考图片移植到原图mask区域,模型很容易去记忆图片空间信息而不是去理解上下文信息,所以我们将参考图片压缩,提高重构难度,即将其剪切并使用CLIP image encoder编码,结果作为StableDiffusion的KV进行cross-attention。Strong Augmentation:自己造的数据集存在domain gap between train-test,因为训练集中的参考图片本来就是原图切下来的,而测试集中基本都是无关的,所以我们对训练集中的参考图片进行数据增强(翻转、旋转、模糊等),又由于bounding box都是紧贴物体的,不利于模型泛化,所以对mask区域也进行数据增强,先用Bessel曲线拟合bounding box,再在曲线上均匀采样20个点,随机延伸1~5个像素点。
类似inpainting的blended采样。
classifier-free guidance:20%的概率用可一个训练的向量替代CLIP image encoder编码结果,采样时guidance scale可以控制融合程度。
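A sketch of the 20% condition dropout and image-conditioned classifier-free guidance mentioned above (all module names hypothetical).
```python
import torch
import torch.nn as nn

class RefConditioner(nn.Module):
    """Wraps the CLIP image embedding with a learnable 'null' vector (sketch)."""
    def __init__(self, dim: int = 768, p_drop: float = 0.2):
        super().__init__()
        self.null_embed = nn.Parameter(torch.zeros(1, dim))
        self.p_drop = p_drop

    def forward(self, clip_embed: torch.Tensor) -> torch.Tensor:
        if self.training:
            drop = (torch.rand(clip_embed.shape[0], 1,
                               device=clip_embed.device) < self.p_drop).float()
            return drop * self.null_embed + (1 - drop) * clip_embed
        return clip_embed

def guided_eps(eps_model, x_t, t, cond, null_cond, scale: float = 5.0):
    """Classifier-free guidance at sampling time; scale controls how strongly
    the reference image is fused into the masked region."""
    eps_u = eps_model(x_t, t, null_cond)
    return eps_u + scale * (eps_model(x_t, t, cond) - eps_u)
```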
Reference-based Image Composition with Sketch via Structure-aware Diffusion Model
在PbE的基础上,还需要提供mask部分的sketch作为条件(concat),进一步提高可控性。
ControlCom: Controllable Image Composition using Diffusion Model
挖掉图像前景做自监督训练。
一个额外的indicator决定是否改变被挖出来的前景的illumination和pose,indicator也作为条件输入diffusion model进行训练。
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
SAM + inpainting挖掉图像前景做自监督训练。
The model supports flexible prompts
IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation
使用multi-view数据集训练一个image encoder(主体DINOv2 + 小adapter,两者都参与训练),输入一个view的图像生成embedding序列,送入StableDiffusion,重构另一个view的图像。训练image encoder和StableDiffusion的decoder。
固定image encoder的主体部分,重新训练一个diffusion model,自监督训练,image encoder的adapter也参与训练。
DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models
两个条件,一个reference image,一个text,这样不仅可以将reference image填入mask,还能通过text进行控制,比如动作等。
之前的方法使用CLIP编码reference image,缺少了对细节的提取,这里使用预训练扩散模型UNet encoder编码reference image,时间步为0,取
Paste, Inpaint and Harmonize via Denoising Subject-Driven Image Editing with Pre-Trained Diffusion Model
将exemplar去除背景,直接paste在目标区域,作为条件输入ControlNet进行类似PbE的self-supervised learning。
Reference-based Painterly Inpainting via Diffusion Crossing the Wild Reference Domain Gap
在Versatile Diffusion基础加了一个mask branch,reference image(训练时是被mask掉的部分)做context flow,masked image做mask branch,进行self-supervised的inpainting训练。
ObjectStitch: Generative Object Compositing
用的是pre-trained text2img diffusion model,由于给的是object图片而不是text,所以需要一个模块将object图片转换为text embedding,即content adaptor,类似TI:使用训练好的CLIP和大规模image-caption数据训练一个content adaptor,content adaptor将CLIP的image embedding映射到text embedding空间,得到translated embedding,然后让它尽量靠近CLIP的text embedding。训练好之后再用pre-trained text2img diffusion model和textual inversion方法fine-tune content adaptor。
固定content adaptor,fine-tune pre-trained text2img diffusion model。
类似inpainting的blended采样,diffusion model只输入translated embedding。
LogoSticker: Inserting Logos into Diffusion Models for Customized Generation
基于TI。
AnyDoor: Zero-shot Object-level Image Customization
使用DINOv2提取物体的ID tokens,既用了global token(class token),也用了patch tokens。
之前的使用图像自监督训练的方法虽然有数据增强,但还是会导致多样性不足的问题,所以提出使用视频数据集造数据:对同一场景随机采样两帧,提取一帧的物体作为target,另一帧作为目标。
BIFRÖST: 3D-Aware Image compositing with Language Instructions
类似AnyDoor,额外加上了language instruction作为条件。
AnyLogo: Symbiotic Subject-Driven Diffusion System with Gemini Status
Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance
Locate:StableInpainting
Assign:IP-Adapter
第一阶段:StableInpainting+IP-Adapter训练Diffusion UNet。
第二阶段:把第一阶段训练好的Diffusion UNet复制出一个RefineNet,RefineNet UNet decoder的self-attention前的feature送入Diffusion UNet,与对应的feature concat在一起进行self-attention,只训练RefineNet的image cross-attention。
self-supervised learning,训练时subject image是从scene image中挖出来的,使用LLaVA生成subject image的caption作为text。
blended采样。
PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models
使用预训练模型提取图像的segmentation map作为图像的structure特征,再使用一个预训练的图像编码器编码图像,提取浅层feature map,取segmentation map中每个segment对应区域的feature map的spatial pool作为该segment的appearance特征,两者作为条件训练diffusion model。
structure编辑:对分割图进行编辑(比如改变某个object的形状、去掉某个object)
appearance编辑:提供一张reference image,用其全图的或者其中某个object的appearance特征替换某个segment对应的appearance特征,进行生成。
注意,编辑时不需要DDIM Inversion,直接根据条件从噪声开始生成即可。但毕竟structure和appearance不包含图像全部特征,所以未编辑部分会有一些变化。但编辑时可以对未编辑的segment进行mask,类似inpainting的blended采样。
CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models
使用SAM提取出原图中的object和background,估算出object的viewpoints,使用zero-1-to-3生成一个随机viewpoints的novel view object,训练一个diffusion model,novel view object、background和viewpoints作为条件,预测原图。
生成时,可以指定object的角度、在图像中的位置以及背景。
Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
给定一张image和几张reference images,将image中某个object替换为reference image中的concept。
Custom-Diffusion方法提取reference image中的concept到pseudo word。
Prompt2Prompt + Null-text Inversion做real image editing,用pseudo word替换prompt中object对应的word。
DreamEdit: Subject-driven Image Editing
和CustomEdit一样,但是基于mask的,DreamBooth做完TI后,做text-guided inpainting采样(blended)。
DreamCom: Finetuning Text-guided Inpainting Model for Image Composition
self-supervised learning,给定3~5张reference images,每张都有bounding box (mask)标注其中物体,将mask和masked image concat在
生成时,给定背景图和想要object出现的位置的bounding box (mask),使用上述句子进行生成。
SpecRef: A Fast Training-free Baseline of Specific Reference-Condition Real Image Editing
有reference image的P2P。
Tuning-Free Visual Customization via View Iterative Self-Attention Control
CLiC的无pseudo word版本,直接使用self-attention KV注入实现concept替换。
FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction
构造数据集训练。
Thinking Outside the BBox: Unconstrained Generative Object Compositing
准备自监督训练数据,对背景图进行inpainting,训练时有
提取mask时,还提取了object的shadow mask和reflection mask,使得模型在object stitch的同时可以生成影子。
TryOnDiffusion: A Tale of Two UNets
cascade模式
使用Parallel UNet是为了解决channel-wise concatenation效果不行的问题,所以改用cross-attention机制,绿线代表将feature当成KV送入主UNet。
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
使用PbE的StableDiffusion,其cross-attention是接收CLIP image embedding过一个MLP的结果。
CLIP image embedding丢了很多信息,所以在decoder block之间再插入一个zero cross-attention block引入细节。
在text cross-attention里,某个word对应的cross-attention map是这个物体的大致轮廓,但是在zero cross-attention block里是image cross-attention,query里衣服上某个image token对应的cross-attention map应该是key中同样位置的image token,而非整个衣服区域,所以cross-attention map应该是尽量集中于一点的,所以额外使用了一个attention total variation loss, which is designed to enforce the center coordinates on the attention map uniformly distributed, thereby alleviating interference among attention scores located at dispersed positions. 即让query里不同image token对应的cross-attention map差异尽量大。
TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On
MMDiT
MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation
StableDiffusion的cross-attention换为Multi-Modal Attention block,self-attention换为Multi-Reference Attention block。
TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On
Try-On-Adapter: A Simple and Flexible Try-On Paradigm
Product-Level Try-on: Characteristics-preserving Try-on with Realistic Clothes Shading and Wrinkles
类似StableVITON,使用PbE的StableDiffusion,其cross-attention是接收CLIP image embedding过一个MLP的结果,Dynamic Extractor使用CLIP image encoder编码图像,但是之后的MLP是可训练的。
HF-Map输入一个可训练的ControlNet。
StableGarment: Garment-Centric Generation via Stable Diffusion
Improving Virtual Try-On with Garment-focused Diffusion Models
Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
Paint by Example是重新训练整个conditional StableDiffusion,这里改用ControlNet架构。
Improving Diffusion Models for Authentic Virtual Try-on in the Wild
Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment
利用semantic correspondecce,分别将穿着garment的person图像和garment图像输入同一个StableDiffusion,提取feature,计算相似性,可以得到correspondecce作为监督数据,这样生成时可以指定衣服的穿着方式,比如衣角扬起等。
Learning Implicit Features with Flow Infused Attention for Realistic Virtual Try-On
Controllable Human Image Generation with Personalized Multi-Garments
AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On
FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on
使用额外的off-the-shelf clothes flattening network进行监督。
M&M VTO: Multi-Garment Virtual Try-On and Editing
ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model
构造数据自监督训练。
BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training
FaceStudio: Put Your Face Everywhere in Seconds
人脸挖出来,自监督训练。
HS-Diffusion: Semantic-Mixing Diffusion for Head Swapping
换头,预训练模型进行blended inpainting生成。
EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation
Stable-Makeup: When Real-World Makeup Transfer Meets Diffusion Model
使用ChatGPT生成不同makeup style的prompt,使用LEDITS对没有makeup的人脸图像进行编辑,生成带makeup的人脸图像,监督训练。
类似IP-Adapter,将CLIP提取的global token加patch tokens送入cross-attention。
SHMT: Self-supervised Hierarchical Makeup Transfer via Latent Diffusion Models
Stable-Hair: Real-World Hair Transfer via Diffusion Model
造数据:要transfer什么就把什么留下,对其它部分进行inpainting。
可以实现多种task,如text2img generation,personalization,editing等。
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
利用BLIP的方法,先使用大规模image-text数据预训练一个multimodal image encoder,可以从image中提取text-aligned特征。
给定subject image和subject text,输入multimodal image encoder,得到subject image的特征,再训练一个MLP将其转化为text embedding。之后利用subject image构造训练image(如替换背景等)和对应的prompt,将subject image特征转化后的text embedding接在prompt之后,输入text encoder,输出再输入StableDiffuion进行训练。multimodal image encoder、MLP、text encoder和StableDiffuion一起训练。
给定subject image、subject text和prompt就能生成,不需要test-time fine-tune了。
UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion
自监督训练:先用caption模型得到图像的caption,再用Grounding DINO和SAM得到caption中的object的图像,将caption中的object word替换为图像,得到interleaved数据集,输入预训练MLLM进行编码,编码结果(所有token的last hidden layer的输出)送入StableDiffusion重构图像,只训练StableDiffusion。
因为MLLM输入中可能包含image entity,为了让生成结果更好地保持image entity的细节,在StableDiffusion的cross-attention增加
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
MLLM:使用CLIP提取image embedding,use attentive pooling mechanism to reduce the number of image embeddings,用interleaved text image数据进行next-token prediction训练MLLM和CLIP的最后一层,只在text token上算loss,类似Emu2的caption阶段。
AlignerNet:为了直接使用StableDiffusion(不需要训练)进行生成,训练一个AlignerNet,将Kosmos-G的输出转换到CLIP text embedding的domain,训练时只给一个text,分别使用Kosmos-G(所有token的last hidden layer的输出)和CLIP text encoder编码,得到
We can also align MLLM with Kosmos-G through directly using diffusion loss with the help of AlignerNet. While it is more costly and leads to worse performance under the same GPU days.
Generative Multimodal Models are In-Context Learners
caption:使用CLIP提取image embedding,use mean pooling mechanism to reduce the number of image embeddings,用interleaved text image数据进行next-token prediction训练CLIP(注意不是MLLM),只在text token上算loss,该阶段的目的是得到一个image encoder。
caption+regression:固定image encoder,用interleaved text image数据进行next-token prediction训练MLLM,在text token上算分类loss,在image feature上算regression loss。
StableDiffusion:训练StableDiffusion对image encoder的编码结果进行解码。
Generating images with multimodal language models
caption:类似LLaVa,用image-caption数据进行next-token prediction训练一个projection layer,只在text token上算loss。
producing image:给LLM的词表和embedding层加入若干个可学习的[IMG] token。
类似Kosmos-G,训练一个Q-Former将这些[IMG] token的hidden states映射到StableDiffusion的text condition空间,用caption经CLIP text encoder的编码结果做回归训练。
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
Transfusion的多任务版本。
Diffusion Self-Guidance for Controllable Image Generation
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
StableDiffusion
先用待编辑图像LoRA fine-tune StableDiffusion,然后将图像DDIM Inversion到某个中间时间步t,在该步的latent上交替进行motion supervision(优化latent,使handle point邻域的feature向target point方向移动一小步)和point tracking(在邻域内做特征最近邻搜索,更新handle point的位置),优化完成后从该latent继续去噪得到编辑结果,见下面的示例代码。
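A hedged sketch of the two alternating steps (hypothetical tensor shapes; feat is a (C, H, W) UNet feature map, points are (y, x) integer tuples away from the border).
```python
import torch
import torch.nn.functional as F

def motion_supervision(feat, handle, target, radius: int = 3):
    """The feature patch one small step toward the target should match the
    (detached) patch around the current handle point (sketch)."""
    y, x = handle
    d = torch.tensor([target[0] - y, target[1] - x], dtype=torch.float32)
    d = d / (d.norm() + 1e-8)                       # unit step direction
    y2, x2 = int(round(y + d[0].item())), int(round(x + d[1].item()))
    patch = feat[:, y - radius:y + radius + 1, x - radius:x + radius + 1]
    moved = feat[:, y2 - radius:y2 + radius + 1, x2 - radius:x2 + radius + 1]
    return F.l1_loss(moved, patch.detach())

@torch.no_grad()
def point_tracking(feat, feat0, handle, handle0, radius: int = 3):
    """Update the handle point by nearest-neighbour search of the original
    handle feature within a local window (sketch)."""
    ref = feat0[:, handle0[0], handle0[1]]          # original handle feature, (C,)
    y, x = handle
    win = feat[:, y - radius:y + radius + 1, x - radius:x + radius + 1]
    dist = (win - ref.view(-1, 1, 1)).norm(dim=0)   # (2r+1, 2r+1)
    idx = int(torch.argmin(dist))
    dy, dx = idx // dist.shape[1] - radius, idx % dist.shape[1] - radius
    return (y + dy, x + dx)
```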
Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing
和DragDiffusion方法类似,只是在Eq(4)的motion supervision时引入了一个额外的CLIP direction loss,使用文本提高drag编辑的效果。
对
GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models
DragDiffusion在
DRAGTEXT: Rethinking Text Embedding in Point-based Image Editing
DragDiffusion中计算
Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation
和DragDiffusion一样,先用待编辑图像LoRA fine-tune StableDiffusion,然后将图像DDIM Inversion到某个时间步
we observe a forgetting issue where subsequent denoising processes tend to overlook the manipulation effect by simply performing diffusion semantic optimization on one timestep. Propagating the bottleneck feature to later timesteps does not have a significant influence on the overall semantics, we copy this optimized bottleneck feature
AdaptiveDrag: Semantic-Driven Dragging on Diffusion-Based Image Editing
EasyDrag: Efficient Point-based Manipulation on Diffusion Models
不需要LoRA fine-tune,直接将图像DDIM Inversion到某个时间步
motion supervision时,EasyDrag始终使用原图的
reference guidance使用DDIM Inversion时的
StableDrag: Stable Dragging for Point-based Image Editing
在point tracking时,除了使用传统的training-free的差异计算法,还使用一个可训练的track model,其是一个可训练的
在进行long-range drag时,图像内容难免会发生较大变化,point feature也会发生改变,此时让它和原图的starting point feature保持一致就不科学了,not only ensuring high-quality and comprehensive supervision at each step but also allowing for suitable modifications to accommodate the novel content creation for the updated states. 因此根据point tracking的结果计算一个confidence score,当confidence score较大时,就使用上一步的point feature作为监督优化latent,当confidence score较小时,就使用原图的starting point feature作为监督优化latent。
FreeDrag: Feature Dragging for Reliable Point-based Image Editing
feature dragging:之前的方法的point dragging是一片区域内的feature计算point-to-point的损失函数再求和,feature dragging计算一片区域内的feature aggregate
line search with backtracking:we constraint
DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models
StableDiffusion
从DIFT获取灵感,模型输出的feature具有correspondence性质,相同物体对应区域的feature具有很高的相似性。
类似P2P+self-guidance,两条并行的generative trajectory,一条是reconstruction,一条是editing,用各自第2,3层的输出feature(self-guidance是用attention)计算loss(原区域和目标区域的feature的相似度),求梯度作为guidance。
将editing generative trajectory的UNet decoder的self-attention的key-value替换为reconstruction generative trajectory的。
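A sketch of the feature-correspondence guidance above: cosine similarity between the source-region features of the reconstruction branch and the target-region features of the editing branch, turned into an energy whose gradient steers sampling (hypothetical shapes).
```python
import torch
import torch.nn.functional as F

def drag_energy(feat_edit, feat_recon, mask_src, mask_tgt):
    """feat_*: (C, H, W); mask_*: (H, W) binary masks of the source / target
    regions. Higher similarity -> lower energy (sketch)."""
    f_src = (feat_recon * mask_src).sum(dim=(1, 2)) / (mask_src.sum() + 1e-8)
    f_tgt = (feat_edit * mask_tgt).sum(dim=(1, 2)) / (mask_tgt.sum() + 1e-8)
    return 1.0 - F.cosine_similarity(f_src.unsqueeze(0), f_tgt.unsqueeze(0)).squeeze()

# During sampling, z_t of the editing branch requires grad, the UNet features are
# extracted, and z_t is nudged by -grad(drag_energy) before the next denoising step.
```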
DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing
DragonDiffusion的改进版
先使用LAION训练一个image prompt encoder:具体做法是先使用预训练的CLIP image encoder将图像编码为长257的embedding sequence,作为cross-attention的key-value送入一个QFormer,输出长64的embedding sequence,送入StableDiffusion的cross-attention,只训练这个QFormer。在编辑时,在editing generative trajectory上使用原图的image prompt,效果更好。
作者发现如果在DragonDiffusion中使用随机初始化而非DDIM inversion得到的
利用RePaint的resample technique,即从
Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner
editing guidance就是DragonDiffusion的guidance。
Fine-grained Image Editing by Pixel-wise Guidance Using Diffusion Models
与SKG和Late-Constraint类似。
对于分割数据集
对图像的segmentation map进行编辑,同时根据编辑结果计算一个mask,算mask-based方法。
编辑时先DDIM Inversion到某一中间步,再生成,将生成时的UNet feature map输入语义分割模型生成segmentation map,计算其和编辑后的segmentation map之间的loss,求梯度作为guidance。
RegionDrag: Fast Region-Based Image Editing with Diffusion Models
Point-Based Drag有两个缺点,一是语义不明确,只给一个起点和一个终点,合理的但语义不同的编辑结果可能有很多,二是过程过于复杂,所以提出Region-Based Drag。
Readout Guidance: Learning Control from Diffusion Features
The Blessing of Randomness: SDE beats ODE in General Diffusion-based Image Editing
方法就是CycleDiffusion
unified framework
The first stage initially produces an intermediate latent variable
The second stage starts from
We show that the additional noise in the SDE formulation (including both the original SDE and Cycle-SDE) provides a way to reduce the gap caused by mismatched prior distributions (between
操控过的
Drag
When the target point is far from the source point, it is challenging to drag the content in a single operation. To this end, we divide the process of Drag-SDE into
RotationDrag: Point-based Image Editing with Rotated Diffusion Features
the point-based editing method under rotation scenario
Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators
使用a differentiable off-the-shelf optical flow estimator,估算diffusion model每一步预测出的clean image与原图之间的光流,与用户给定的目标光流计算loss,求梯度作为guidance。
根据用户给定的光流估算一个mask,blended生成。
采用RePaint的resample technique。
Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos
detail extractor和synthesizer都用StableDiffusion初始化,都去掉了cross-attention,input block都做了扩展,两者都参与训练。
相当于给synthesizer的self-attention之后加了个cross-attention,Q是自己,KV是detail extractor的self-attention之前的feature。
类似AnyDoor,使用视频造数据进行训练。
生成时从
Move Anything with Layered Scene Diffusion
类似Locally-Conditioned-Diffusion和MultiDiffusion,给定图像和其对应的layout,可以通过对layout进行移动从而实现对物体的移动。
除了移动,增删layout可以实现物体的增删,调整layout图层顺序可以实现物体的前后调整。
InstantDrag: Improving Interactivity in Drag-based Image Editing
两个模型:a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion).
InstantDrag learns motion dynamics for drag-based image editing in real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation.
FlowGen:根据drag生成光流图。
FlowDiffusion:使用视频数据集自监督训练一个根据当前帧和光流生成目标帧的模型。
TIME: Editing Implicit Assumptions in Text-to-Image Diffusion Models
当prompt没有指明时,模型会做一些Implicit Assumptions进行生成,比如生成的玫瑰都是红色,医生都是男性。本方法将编辑这种Implicit Assumptions(是编辑,不是去除),比如将玫瑰是红色编辑为玫瑰是蓝色,这样模型以后再见到带有玫瑰的prompt时,就会默认生成蓝色的玫瑰。
做法是为所有cross-attention训练新的KV projection matrix,让新矩阵与玫瑰的乘积靠近原矩阵与蓝色玫瑰的乘积,这样新矩阵就会默认将玫瑰映射到原来模型里的蓝色玫瑰的投影。
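A sketch of a closed-form update for one cross-attention projection matrix W, assuming a regularized least-squares objective of the form sum_i ||W' e_src_i - W e_dst_i||^2 + lambda ||W' - W||^2 as described above; the exact formulation in the paper may differ in details.
```python
import torch

def edit_projection(W: torch.Tensor, e_src: torch.Tensor, e_dst: torch.Tensor,
                    lam: float = 0.1) -> torch.Tensor:
    """W: (d_out, d_in); e_src / e_dst: (n, d_in) token embeddings of the source
    ("rose") and destination ("blue rose") phrases. Returns the edited W'."""
    d_in = W.shape[1]
    # W' = (lam*W + W @ E_dst^T @ E_src) @ (lam*I + E_src^T @ E_src)^{-1}
    A = lam * W + W @ e_dst.t() @ e_src
    B = lam * torch.eye(d_in) + e_src.t() @ e_src
    return A @ torch.linalg.inv(B)
```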
Unified Concept Editing in Diffusion Models
和TIME类似,闭式解修改所有cross-attention的KV projection matrix。
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
改进UCE的公式。
MACE: Mass Concept Erasure in Diffusion Models
最后的融合多个LoRA成一个LoRA的方法类似Mix-of-Show中的方法。
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
Erasing Concepts from Diffusion Models
fine-tune StableDiffusion
反向编辑,对图像中与文本相关的内容进行擦除。
反向利用classifier guidance,fine-tune模型让预测的噪声与预训练模型的反向classifier guidance的噪声靠近。
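A sketch of the erasure objective described above (hypothetical interface): the frozen original model provides a negatively-guided noise prediction, and the fine-tuned model is regressed onto it for the concept prompt.
```python
import torch
import torch.nn.functional as F

def esd_loss(student, teacher, x_t, t, concept_emb, uncond_emb, eta: float = 1.0):
    """Concept-erasure loss (sketch); teacher is the frozen original model."""
    with torch.no_grad():
        eps_c = teacher(x_t, t, concept_emb)      # conditional prediction
        eps_u = teacher(x_t, t, uncond_emb)       # unconditional prediction
        target = eps_u - eta * (eps_c - eps_u)    # guide *away* from the concept
    return F.mse_loss(student(x_t, t, concept_emb), target)
```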
Ablating Concepts in Text-to-Image Diffusion Models
让StableDiffuion忘记一些concept,比如使用带有"in the style of Van Gogh"的prompt时,模型就会忽略"Van Gogh",生成正常style的图片。
使用"in the style of Van Gogh"构造一些prompt
Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient
对抗训练,让模型在grumpy cat和cat时预测的noise无法分辨,这样修改后的模型遇到grumpy cat时会按cat生成,忽略grumpy。
Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
对于一些想让StableDiffuion忘记的concept,收集一些reference images,并用concept造一些prompt,fine-tune整个StableDiffusion,loss为所有cross-attention layer处的concept对应的cross-attention map的所有响应值的平方和。
注意fine-tune时不需要diffusion loss。
Pruning for Robust Concept Erasing in Diffusion Models
Our method selectively prunes critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons. Our method can be easily integrated with existing concept-erasing techniques, offering a robust improvement against adversarial inputs.
stage 1: We use a numerical criterion to identify concept neurons.
stage 2: We validate concept neurons are sensitive to adversarial prompts.
ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning
We first identify critical regions within pre-trained models responsible for generating undesirable concepts, thereby facilitating straightforward concept unlearning via weight pruning.
EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts
original prompt是"a girl",erasure prompt是"naked",erasure prompt的cross-attention map注入original prompt的cross-attention map中并进行抑制。
Removing Undesirable Concepts in Text-to-Image Generative Models with Learnable Prompts
学习一个prompt embedding,其可以直接concat在CLIP text emebdding后送入cross-attention。
类似EM算法,轮流更新prompt embedding
Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models
只针对"... without xxx"句型的prompt仍然生成带有"xxx"的图像的情况。
zero xxx和zero EOT都不解决问题,只有同时zero才有效;EOT之间距离也很近。
对x和EOT的矩阵(
Separable Multi-Concept Erasure from Diffusion Models
All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models
Geom-Erasing: Geometry-Driven Removal of Implicit Concept in Diffusion Models
使用带有二维码、水印、文字的image-text pair数据集,将二维码、水印、文字的位置信息加进text,fine-tune StableDiffusion,这样生成时只用原text就可以避免生成二维码、水印、文字。
Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?
Localizing and Editing Knowledge in Text-to-Image Generative Models
不同属性的知识(objects style color action)分布在UNet中不同block中,只对想要编辑或者ablate的concept对应的属性对应的block做fine-tune。
EraseDiff: Erasing Data Influence in Diffusion Models
在训练时,对于需要遗忘的数据使用非高斯分布的噪声进行加噪,这样采样时就不会生成这些数据。
Robust Concept Erasure Using Task Vectors
Training-free Editioning of Text-to-Image Models
和erasing相反,让模型专注于某个concept的生成。
Position: Towards Implicit Prompt For Text-To-Image Models
erase concept后,用户依然可以通过implicit prompt生成该concept,比如erase了"Eiffel Tower",使用"Located in France, an iconic iron lattice tower, symbolizing the romance of Paris and French engineering prowess."依然可以生成。
针对这一问题提出了Benchmark,但没有提出解决方案。
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models
Existing research faces several challenges: (1) a lack of consistent comparisons on a comprehensive dataset, (2) ineffective prompts in harmful and nudity concepts, (3) overlooked evaluation of the ability to generate the benign part within prompts containing malicious concepts. To address these gaps, we propose to benchmark the concept removal methods by introducing a new dataset, Six-CD, along with a novel evaluation metric.
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models
针对image而非prompt的unlearning,prompt unlearning虽然阻止了模型在碰到特定的prompt时触发生成相应的内容,但diffusion model还是有生成该内容的能力的,而image unlearning直接让diffusion model失去生成该内容的能力。
做法是针对某个prompt分别收集retain和forget样本,使用Diffusion-DPO优化diffusion model。
Meta-Unlearning on Diffusion Models Preventing Relearning Unlearned Concepts
Boosting Alignment for Post-Unlearning Text-to-Image Generative Models
Unveiling Concept Attribution in Diffusion Models
SDEdit: Image Synthesis and Editing with Stochastic Differential Equations
需要source domain和target domain上训练好的diffusion model。
Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training
two-stage SDEdit
UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models
unpaired
A domain translation function extracts the domain information.
LaDiffGAN: Training GANs with Diffusion Supervision in Latent Spaces
Similar to Diff-Instruct: use a diffusion model to train a GAN for image-to-image translation.
CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
unpaired
Uses ControlNet to condition on the source domain
Palette: Image-to-Image diffusion models
paired,self-supervised learning,自动生成paired数据,如colorization,inpainting等
condition source image through concatenation
Denoising Diffusion Bridge Models
paired
The diffusion process diffuses from a point of one distribution to its paired point in another distribution; the training formulation is modified accordingly,
similar to ShiftDDPMs.
Diffusion Bridge Implicit Models
The DDIM version of DDBM: DBIM is to DDBM what DDIM is to DDPM, enabling accelerated sampling with a pretrained DDBM.
Latent Schrodinger Bridge: Prompting Latent Diffusion for Fast Unpaired Image-to-Image Translation
EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models
Consistency Diffusion Bridge Models
Score-Based Image-to-Image Brownian Bridge
Feedback Schrodinger Bridge Matching
ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models
unpaired
DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
unpaired
One word corresponds to one domain, which corresponds to one fine-tuned model.
pre-trained unconditional DDPM + pre-trained CLIP
For the dataset images, use
GPU-efficient: generation starting from latents
High-Fidelity Diffusion-based Image Editing
unpaired
DiffusionCLIP
Train a network to predict LoRA parameters for the convolutional layers, so the iterative per-style optimization of DiffusionCLIP is no longer needed.
EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations
Requires only a diffusion model trained on the target domain: given a source-domain image, perform SDEdit while guiding sampling with two pretrained energy functions.
Change domain-specific features: train a domain classifier, remove its classification layer to obtain an encoder, compute the cosine similarity between the features of the generated latent and the noisy latent of the source image, and use its gradient as guidance.
Preserve domain-independent features: apply a low-pass filter, compute the L2 distance between the low-pass-filtered generated latent and the noisy latent of the source image, and use its gradient as guidance.
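A minimal sketch of how the two expert gradients described above could be combined into one guidance term at a sampling step; the function names and weights are illustrative assumptions, not EGSDE's released code.

```python
import torch
import torch.nn.functional as F

def egsde_guidance(x_t, y_t, domain_encoder, low_pass, lambda_s=1.0, lambda_i=1.0):
    """Compute an EGSDE-style guidance gradient at one step.

    x_t: current sample on the target-domain trajectory
    y_t: source image noised to the same timestep
    domain_encoder: domain classifier with its classification head removed
    low_pass: a differentiable low-pass filter (e.g. downsample + upsample)
    lambda_s, lambda_i: trade-off weights (illustrative values)
    """
    x_t = x_t.detach().requires_grad_(True)

    # (1) Discard domain-specific features: penalize feature similarity to the source.
    feat_x = domain_encoder(x_t).flatten(1)
    feat_y = domain_encoder(y_t).flatten(1)
    e_realism = F.cosine_similarity(feat_x, feat_y, dim=1).sum()

    # (2) Keep domain-independent (low-frequency) content close to the source.
    e_faithful = (low_pass(x_t) - low_pass(y_t)).pow(2).sum()

    energy = lambda_s * e_realism + lambda_i * e_faithful
    grad = torch.autograd.grad(energy, x_t)[0]
    # The sampler steps *against* this gradient (scaled by the step size).
    return grad
```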
Dual Diffusion Implicit Bridges for Image-to-Image Translation
Requires diffusion models trained on both the source and target domains.
The Probability Flow ODEs form a Schrödinger bridge between the source and target domains.
Cycle consistency: a sample from the source domain
The first half of the cycle is the translation.
DECDM: Document Enhancement using Cycle-Consistent Diffusion Models
An application of DDIB to document images.
Unifying Diffusion Models' Latent Space, With Applications to Cyclediffusion and Guidance
Requires diffusion models trained on both the source and target domains.
The translation procedure is the same as DDIB, but uses the DPM-Encoder instead of the Probability Flow ODE.
If the same text-to-image model is used with two different texts as conditions, it can be viewed as two DPMs trained on the source and target domains respectively, so this method can do both image-to-image translation and image editing.
First encode with the source-domain model
then decode with the target-domain model
Note that the DPM-Encoder is designed for stochastic diffusion models.
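A rough sketch of this DPM-Encoder idea: record the per-step noises that reproduce the source trajectory under the source-domain model, then replay them under the target-domain model. `sample_forward_trajectory` and the `(mu, sigma)` model interface are assumed helpers, not an actual library API.

```python
import torch

@torch.no_grad()
def dpm_encode(x0, source_model, scheduler):
    """Encode a real image into the noise space of a *stochastic* DPM.
    scheduler.sample_forward_trajectory(x0) is assumed to return [x_0, ..., x_T];
    source_model(x_t, t) is assumed to return the posterior mean mu and std sigma
    of p(x_{t-1} | x_t)."""
    traj = scheduler.sample_forward_trajectory(x0)
    zs = []
    for t in reversed(range(1, len(traj))):
        mu, sigma = source_model(traj[t], t)
        # The noise the stochastic sampler would need to land exactly on x_{t-1}.
        zs.append((traj[t - 1] - mu) / sigma)
    return {"x_T": traj[-1], "z": zs}

@torch.no_grad()
def dpm_decode(code, target_model):
    """Replay the recorded noises under a target-domain model (or the same
    text-to-image model conditioned on a different prompt)."""
    x = code["x_T"]
    T = len(code["z"])
    for t, z in zip(range(T, 0, -1), code["z"]):
        mu, sigma = target_model(x, t)
        x = mu + sigma * z
    return x
```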
An Edit Friendly DDPM Noise Space
The method is the same as the DPM-Encoder (the authors claim it differs, but no real difference is apparent; they may be referring to an earlier version of the DPM-Encoder?).
LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance
DDPM-Inversion + SEGA (a combination of multiple guidances).
LEDITS++: Limitless Image Editing using Text-to-Image Models
Use DPM-Solver for inversion, estimate a mask using both cross-attention maps and the DiffEdit approach, and perform mask-based editing.
TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models
SDXL-Turbo + DDPM-Inversion for accelerated editing.
Zero-shot Image-to-Image Translation
Requires a pretrained StableDiffusion; performs translations such as cat
Use BLIP to generate a caption of the source image (the cat image), and encode the caption with CLIP to obtain
Use
Use
Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation
Same task as Pix2Pix-Zero.
After DDIM-inverting the source image with the source prompt, simply swapping in the target prompt for generation gives poor translation results, because of the abrupt transition of the text embedding in the early denoising stage.
we formulate a noise prediction strategy for the text-driven image-to-image translation by progressively updating the text prompt embedding via time-dependent interpolations of the source and target prompt embeddings.
FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation
Requires a pretrained StableDiffusion; performs translations such as man
unpaired
Train two encoders together with the diffusion model: one encodes content and one encodes style, exploiting inductive biases. Content is a spatial layout mask, down-/up-sampled to the feature-map size when used; style is a vector encoding high-level semantics. Every UNet layer uses AdaGN: style applies a channel-wise affine transformation, and content is multiplied spatially with the AdaGN output.
At sampling time, first DDIM-invert the image to noise using its own encodings, then generate with the content or style of the target image.
Diffusion-based Image Translation using Disentangled Style and Content Representation
SDEdit + guidance + resample technique
Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption
unpaired
Model adaptation using a diffusion model pretrained on source domain A and a few samples from target domain B, yielding a target-domain diffusion model. Initialize the model with the diffusion model pretrained on source domain A, and use an arbitrary source-domain image
Directional Distribution Consistency Loss: first use the datasets and CLIP to obtain a cross-domain direction vector
Translation is then done like SDEdit, using only the target-domain diffusion model.
Fine-grained Appearance Transfer with Diffusion Models
unpaired
Use DIFT for semantic matching and feature transfer.
S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion
unpaired
DDIM-invert the source image to
structure loss: MSE between the Sobel gradients of the generated image and the source image
appearance loss: take several target-domain images, encode them with the autoencoder and average the results; compare this with the generated
Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation
Self-supervised ControlNet training: training to reconstruct the lossless image features.
At translation time, first DDIM-invert the source image, then generate using the target prompt and different frequency bands of the source image.
DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization
Noise for 150 steps and denoise for 50 steps; at each step, apply Tweedie's formula based on
Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer
First noise the source image to an intermediate step, then denoise from that intermediate noise; at each step, apply Tweedie's formula based on
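Both notes above rely on Tweedie's formula to form a clean estimate from the current noisy sample; in the usual DDPM epsilon-prediction notation (notation assumed here) it reads:

```latex
\hat{x}_0 \;=\; \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
```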
Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style
StableDiffusion
For each style (e.g., Flatten Design, Fantasy, Food doodle), collect a few dozen text-image pairs, apply data augmentation, and fine-tune StableDiffusion as that style's specialist diffusion model; text input then generates images in that style.
StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation
StableDiffusion
Similar to IP-Adapter: encode all reference images with CLIP and feed them to a trainable StyEmb network to obtain a style feature. Insert a trainable cross-attention layer into StableDiffusion in which image tokens cross-attend to the style feature; its output is added to the output of the text cross-attention layer and passed to the next layer (Two-Path Cross-Attention). Train StyEmb and the newly inserted cross-attention layers to reconstruct the reference images.
At sampling time, generate from the prompt and a style image.
For data augmentation, we apply the random crop, resize, horizontal flipping, rotation, etc., to generate K = 3 style references for each input image during training.
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
StableDiffusion
Use a frozen CLIP to extract features from the reference image; the Q-Former queries cross-attend to these features together with the word "content"/"style". The Q-Former output is fed into StableDiffusion's text cross-attention through a newly trained KV projection matrix (reusing the Q of the text cross-attention); the projected output is concatenated with the text KV for the attention computation. Essentially a variant of IP-Adapter.
During training, when "style" is used, train on image pairs with the same style but different content, and analogously for "content". At inference only "style" is used; the "content" branch during training serves to make the style representation more disentangled.
ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text
Self-supervised training: extract the sketch of an image, noise the image, add the noised result and the sketch together as the UNet input, replace the UNet's cross-attention with linear layers, feed the image embedding from a pretrained CLIP into those linear layers, and train all parameters jointly for reconstruction.
At sampling time, extract the sketch of the source image and the CLIP image embedding of the reference image and feed them to the network, preserving the structure of the source image while transferring the style of the reference image.
The reference image embedding can also be manipulated with text: since CLIP text and image embeddings are aligned, the reference image embedding can be manipulated in CLIP embedding space according to a given text and scale.
ArtFusion: Arbitrary Style Transfer using Dual Conditional Latent Diffusion Models
Train a diffusion model conditioned on content and style, doing self-reconstruction conditioned on the input's own content (extracted by the LDM VAE) and style (VGG features).
At sampling time, use different content and style images.
SGDiff: A Style Guided Diffusion Model for Fashion Synthesis
Similar to ArtFusion, but uses patches of the input as the style.
StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models
First use a pretrained Style Removal model to strip the style from the source image and the reference image; then, similar to DiffusionCLIP, fine-tune the model with a CLIP directional loss (one model per style): in CLIP image-embedding space, the difference of the first pair should be similar to the difference of the second pair.
ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors
ControlNet + DiffusionCLIP。
One-Shot Structure-Aware Stylized Image Synthesis
Given
Using the SDEdit approach, use
SPN:a structure-preserving network (SPN), which utilizes a
CSGO: Content-Style Composition in Text-to-Image Generation
Construct a dataset to train ControlNet.
Visual Style Prompting with Swapping Self-Attention
At generation time, replace the keys and values of all self-attention layers after a certain decoder layer with the keys and values from the corresponding self-attention layers of the reference image's generation process.
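A toy, self-contained illustration of this key/value swapping, using a single-head attention function rather than a real UNet; the projection matrices and token shapes are made-up stand-ins.

```python
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv, kv_override=None):
    """Toy single-head self-attention. When kv_override is given, the current keys and
    values are replaced with those cached from the reference (style) image's pass,
    which is the core of the swapping trick described above."""
    q = x @ wq
    k, v = (x @ wk, x @ wv) if kv_override is None else kv_override
    attn = F.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

torch.manual_seed(0)
d = 8
wq, wk, wv = (torch.randn(d, d) * 0.1 for _ in range(3))

# Reference pass: cache K/V of the chosen layer while generating the style image.
ref_tokens = torch.randn(1, 16, d)            # stand-in for reference-image tokens
kv_cache = (ref_tokens @ wk, ref_tokens @ wv)

# Target pass: queries come from the content being generated, K/V from the reference.
tgt_tokens = torch.randn(1, 16, d)
out = self_attention(tgt_tokens, wq, wk, wv, kv_override=kv_cache)
print(out.shape)  # torch.Size([1, 16, 8])
```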
Training-free Color-Style Disentanglement for Constrained Text-to-Image Synthesis
Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
Similar to Visual Style Prompting: self-attention KV injection.
Portrait Diffusion: Training-free Face Stylization with Chain-of-Painting
If there is no
Style-guidance CFG is used so that the target image deviates from the content image as much as possible.
ZePo: Zero-Shot Portrait Stylization with Faster Sampling
Compared against Portrait Diffusion.
Diffusion Cocktail: Fused Generation from Diffusion Models
Usually each style gets its own fine-tuned model; use any pair of models for any-to-any style transfer by taking an image generated by one model as content and using another model to transfer its style.
The approach is similar to PnP, injecting features and self-attention maps; but since storing the source image's features and self-attention maps is memory-heavy, this paper stores only the latents of the source generation trajectory and re-infers the features and self-attention maps with the current model during style transfer, which performs nearly as well as using the original model's features and self-attention maps.
Training-free Content Injection using h-space in Diffusion Models
Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models
No training at all; only the predicted
FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models
Inspired by FreeU: feed the source (content) image through the UNet encoder and decoder, take the resulting features as backbone features, which carry mostly low-frequency information (content), and multiply them by a coefficient;
Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer
With P2P, directly appending the style prompt to the original (content) prompt and then doing text-guided style transfer destroys source-image information such as hair.
Compute cross-attention separately with the content prompt and the style prompt to obtain features
MagicStyle: Portrait Stylization Based on Reference Image
AdaIN technique.
Harnessing the Latent Diffusion Model for Training-Free Image Style Transfer
AdaIN technique.
Inversion-Based Creativity Transfer with Diffusion Models
StableDiffusion
Encode the reference image with CLIP and train a network that predicts a text token embedding (not CLIP-encoded) from the image embedding; this embedding is fed into the pretrained StableDiffusion (after passing through CLIP), and the network is trained with the TI objective.
StyleBooth: Image Style Editing with Multimodal Instruction
InstructPix2Pix-style: construct a dataset and train
DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning
Given a few-shot target dataset of a specific domain such as sketches painted by an artist, we expect to generate images that fall into the domain.
Customizing Text-to-Image Models with a Single Image Pair
ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank
ISPB: each style corresponds to a learnable parameter matrix, which a style-specific SSAM converts into a token embedding; train with TI using several images of that style, optimizing only the ISPB.
Stochastic Inversion:Random noise is hard to predict, and incorrectly predicted noise can cause a content mismatch between the stylized image and the content image. To this end, we first add random noise to the content image and use the denoising U-Net in the diffusion model to predict the noise in the image. The predicted noise is used as the initial input noise during inference to preserve content structure.
Towards Highly Realistic Artistic Style Transfer via Stable Diffusion with Step-aware and Layer-aware Prompt
Similar to ProSpect,
At generation time, in addition to DDIM Inversion, a pretrained edge ControlNet is used to preserve the structure of the content image.
Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models
An art-generation dataset containing prompts, negative prompts, and the generated images.
HiCAST: Highly Customized Arbitrary Style Transfer with Adapter Enhanced Diffusion Models
RB-Modulation: Training-Free Personalization of Diffusion Models using Stochastic Optimal Control
AttnMod: Attention-Based New Art Styles
modify attention for creating new unpromptable art styles out of existing diffusion models
Improving Diffusion Models for Inverse Problems using Manifold Constraints
Predecessor of DPS.
Diffusion Posterior Sampling for General Noisy Inverse Problems
Diffusion Posterior Proximal Sampling for Image Restoration
Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models trained on Corrupted Data
Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient Viewpoint
Improving Diffusion-Based Image Restoration with Error Contraction and Error Correction
Consistency Models Improve Diffusion Inverse Solvers
Deep Data Consistency: a Fast and Robust Diffusion Model-based Solver for Inverse Problems
Learning Diffusion Priors from Observations by Expectation Maximization
Zero-Shot Adaptation for Approximate Posterior Sampling of Diffusion Models in Inverse Problems
Prototype Clustered Diffusion Models for Versatile Inverse Problems
Reducing the cost of Posterior Sampling in Linear Inverse Problems via task-dependent Score Learning
Gaussian is All You Need: A Unified Framework for Solving Inverse Problems via Diffusion Posterior Sampling
Think Twice Before You Act: Improving Inverse Problem Solving With MCMC
Online Posterior Sampling with a Diffusion Prior
Variational Diffusion Posterior Sampling with Midpoint Guidance
Free Hunch: Denoiser Covariance Estimation for Diffusion Models Without Extra Costs
Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration
DEFT: Efficient Finetuning of Conditional Diffusion Models by Learning the Generalised h-transform
Similar to PDAE: use paired data from the inverse problem to train a gradient estimator for guided sampling.
Pseudoinverse-Guided Diffusion Models for Inverse Problems
Inverse Problems with Diffusion Models: A MAP Estimation Perspective
DreamGuider: Improved Training free Diffusion-based Conditional Generation
CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems
Similar to CCM: train a ControlNet for a consistency model, solving inverse problems in very few steps.
Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion
Previous methods all use the first-order Tweedie's formula to compute
DMPlug: A Plug-in Method for Solving Inverse Problems with Diffusion Models
On one side is a DPS-like method that uses the per-step DDIM prediction of
Fast Samplers for Inverse Problems in Iterative Refinement Models
Conditional Conjugate Integrators
Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis
Fast Diffusion EM: a diffusion model for blind inverse problems with application to deconvolution
real world:
Variational inference could perhaps also be used
Blind Inversion using Latent Diffusion Priors
Uses the EM algorithm.
An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations
Uses the EM algorithm.
Adapt and Diffuse: Sample-adaptive Reconstruction via Latent Diffusion Models
At inference time, only
Bayesian Conditioned Diffusion Models for Inverse Problems
Amortized Posterior Sampling with Diffusion Prior Distillation
Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems
amortized variational inference
Denoising Diffusion Restoration Models
Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model
Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance
Image Restoration with Mean-Reverting Stochastic Differential Equations
The SDE form of PriorShift from ShiftDDPMs.
Deep Equilibrium Diffusion Restoration with Parallel Sampling
DEQ-based
Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)
We propose a zero-shot restoration scheme that uses CMs and operates well with as little as 4 NFEs.
Parallel Diffusion Models of Operator and Image for Blind Inverse Problems
Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models
Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion
Generative Diffusion Prior for Unified Image Restoration and Enhancement
Diffusion Priors for Variational Likelihood Estimation and Image Denoising
Blind Image Restoration via Fast Diffusion Inversion
Similar to DMPlug: we aim to find the initial noise sample that generates the image when run through DDIM.
FlowIE: Efficient Image Enhancement via Rectified Flow
Directly uses a flow to model the path between two distributions, applicable to many tasks such as inpainting, colorization, and super resolution.
AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion
Train an image restoration network that can handle different degradations.
Train a network to identify which predefined degradation (e.g., blur) the input image has, and fill it into a template to form a prompt (e.g., "a photo needs {blur} artifact reduction").
Train an LDM on data covering multiple predefined degradations, with the degraded image concatenated to
At inference, feed the input image into the network above to obtain the prompt, then feed both into the LDM for restoration.
UIR-LoRA: Achieving Universal Image Restoration through Multiple Low-Rank Adaptation
DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior
PromptIR: Prompting for All-in-One Blind Image Restoration
Exploiting Diffusion Priors for All-in-One Image Restoration
Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks
类似AutoDIR
TIP: Text-Driven Image Processing with Semantic and Restoration Instructions
ControlNet-based: the ControlNet takes the degradation instruction and StableDiffusion takes the prompt; trained in a self-supervised manner.
Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation
create pairs of (clean, corrupted) images and utilize them for fine-tuning to enable the recovery of corrupted images to their clean states.
PromptFix: You Prompt and We Fix the Photo
We compile approximately two million raw data points across eight tasks: image inpainting, object creation, image dehazing, image colorization, super-resolution, low-light enhancement, snow removal, and watermark removal. For each low-level task, we utilized GPT-4 to generate diverse training instruction prompts Pinstruction. These prompts include task-specific and general instructions. The task-specific prompts, exceeding 250 entries, clearly define the task objectives. For example, "Improve the visibility of the image by reducing haze" for dehazing.
For watermark removal, super-resolution, image dehazing, snow removal, low-light enhancement, and image colorization tasks, we also generate "auxiliary prompts" for each instance. These auxiliary prompts describe the quality issues for the input image and provide semantic captions.
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
Use an MLLM to generate the prompt, inject the LQ image via ControlNet, and feed it to SDXL to generate the HQ image.
Diff-Restorer: Unleashing Visual Prompts for Diffusion-based Universal Image Restoration
Similar to SUPIR.
InstantIR: Blind Image Restoration with Instant Generative Reference
ReFIR: Grounding Large Restoration Models with Retrieval Augmentation
A Modular Conditional Diffusion Framework for Image Reconstruction
Taming Generative Diffusion Prior for Universal Blind Image Restoration
guidance
PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance
Partial guidance, similar to GradPaint; unifies many tasks under one framework.
PFStorer: Personalized Face Restoration and Super-Resolution
Restoration with a reference image: the LQ image is injected as in StableSR, and the reference image is injected in a ControlNet-like way.
RestorerID: Towards Tuning-Free Face Restoration with ID Preservation
CLR-Face: Conditional Latent Refinement for Blind Face Restoration Using Score-Based Diffusion Models
DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior
ControlNet
Towards Unsupervised Blind Face Restoration using Diffusion Prior
AuthFace: Towards Authentic Blind Face Restoration with Face-oriented Generative Diffusion Prior
DR-BFR: Degradation Representation with Diffusion Models for Blind Face Restoration
OSDFace: One-Step Diffusion Model for Face Restoration
Upsample the LR image to the HR resolution; the problem then becomes LQ-to-HQ restoration.
SRDiff: Single Image Super-Resolution with Diffusion Probabilistic Models
The diffusion model, conditioned on the LR image, models the residual between the HR image and upsample(LR).
Image Super-Resolution via Iterative Refinement
The low-resolution image is upsampled to high resolution and concatenated to
Exploiting Diffusion Prior for Real-World Image Super-Resolution
Upsample the LR image to HR resolution, encode it with the VAE encoder, and feed it to a trainable time-aware encoder to obtain multi-scale features; then train a small convolutional network (SFT) that predicts scale and shift from these features to affine-transform the corresponding StableDiffusion features. Only the encoder and the SFT are trained.
color correction: for each channel, subtract the prediction's own mean and divide by its own standard deviation, then multiply by the LR image's standard deviation in that channel and add its mean (see the sketch after these notes).
Train a CFW module that uses the VAE encoder's features
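A direct transcription of the channel-wise color-correction rule above into code (the epsilon for numerical stability is my addition):

```python
import torch

def color_correct(pred, lr):
    """Re-normalize each channel of the prediction to match the LR image's
    per-channel mean/std. Tensors are (B, C, H, W)."""
    dims = (2, 3)
    p_mean = pred.mean(dim=dims, keepdim=True)
    p_std = pred.std(dim=dims, keepdim=True)
    lr_mean = lr.mean(dim=dims, keepdim=True)
    lr_std = lr.std(dim=dims, keepdim=True)
    return (pred - p_mean) / (p_std + 1e-6) * lr_std + lr_mean
```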
ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting
Super resolution where the diffusion process starts from the HR image and ends at the LR image, progressively adding the LR-HR residual; the posterior is derived as in ShiftDDPMs and the reverse process is modeled.
SinSR: Diffusion-Based Image Super-Resolution in a Single Step
Extends ResShift to deterministic DDIM-style sampling, then distills it into a single step.
Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs
TASR: Timestep-Aware Diffusion Model for Image Super-Resolution
PatchScaler: An Efficient Patch-independent Diffusion Model for Super-Resolution
confidence-driven loss:
Use
GRM produces coarse HR features
Regularization by Texts for Latent Diffusion Inverse Solvers
Text-guided super-resolution and deblurring.
Image Super-Resolution with Text Prompt Diffusion
The upsampled LR image is concatenated to
Text-guided Explorable Image Super-resolution
CoSeR: Bridging Image and Language for Cognitive Super-Resolution
Similar to PromptSR: generate rough HR reference images and a prompt from the LR input, and train a diffusion model conditioned on both for super-resolution.
FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution
CasSR: Activating Image Power for Real-World Image Super-Resolution
Generate rough HR reference images from the LR input and condition the diffusion model on them together with the LR image for super-resolution.
SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution
Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization
XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution
Similar to SUPIR: use an MLLM to generate the prompt.
SAM-DiffSR: Structure-Modulated Diffusion Model for Image Super-Resolution
SAM-assisted.
Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution
SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution
action 0 is to perform the reverse diffusion process with the current state, while action 1 is to skip the diffusion process.
Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution
Retrains a score model with two losses.
At each training iteration, first sample a result from the LR input via the score model's PF ODE and compute a perceptual loss against the HR image, which is
Frequency-Domain Refinement with Multiscale Diffusion for Super Resolution
BlindDiff: Empowering Degradation Modelling in Diffusion Models for Blind Image Super-Resolution
most methods are tailored to solving non-blind inverse problems with fixed known degradation settings, limiting their adaptability to real-world applications that involve complex unknown degradations.
Introduces an estimate of the degradation level.
CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution
Blind Image Super-Resolution
DiffFNO: Diffusion Fourier Neural Operator
RFSR: Improving ISR Diffusion Models via Reward Feedback Learning
One-Step Effective Diffusion Network for Real-World Image Super-Resolution
Similar in spirit to Diff-Instruct.
Arbitrary-steps Image Super-resolution via Diffusion Inversion
AdaDiffSR: Adaptive Region-aware Dynamic Acceleration Diffusion Model for Real-World Image Super-Resolution
Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors
TDDSR: Single-Step Diffusion with Two Discriminators for Super Resolution
Adversarial Diffusion Compression for Real-World Image Super-Resolution
HF-Diff: High-Frequency Perceptual Loss and Distribution Matching for One-Step Diffusion-Based Image Super-Resolution
TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution
OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs
Step-distillation for acceleration.
Blended Diffusion for Text-driven Editing of Natural Images
Blended Latent Diffusion
training-free,text-free + text-guided
pre-trained unconditional diffusion model + pre-trained CLIP as guidance.
Similar to inpainting: at each step, the unmasked part of the sample is replaced using
extending augmentations
LatentPaint: Image Inpainting in Latent Space with Diffusion Models
training-free,text-free
Applies the blended approach to a latent representation (e.g., the h-space).
RePaint: Inpainting using Denoising Diffusion Probabilistic Models
training-free,text-free
resample technique
TD-Paint: Faster Diffusion Inpainting Through Time Aware Pixel Conditioning
training-free,text-free
Removes RePaint's resampling procedure, yielding a speedup.
Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models
training-free,text-free
Apply a gradient update to each stochastic DDIM sampling step, with a loss given by its
Also uses the resample technique.
GradPaint: Gradient-Guided Inpainting with Diffusion Models
training-free,text-free
A gradient-guidance version of CoPaint: at each step, compute the MSE between the current result and the unmasked region of the original image and use its gradient as guidance, similar to posterior sampling.
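A minimal sketch of this gradient guidance, assuming an `x0_pred_fn` that maps the current sample to a clean estimate (e.g., via Tweedie's formula); the function names and step scale are illustrative, not the paper's code.

```python
import torch

def gradpaint_guidance(x_t, x0_pred_fn, x0_known, mask, scale=1.0):
    """One guidance step: penalize the mismatch between the current clean estimate and
    the known (unmasked) region, and step x_t against the gradient of that penalty.

    mask: 1 for missing (to-be-inpainted) pixels, 0 for known pixels.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = x0_pred_fn(x_t)
    loss = ((1 - mask) * (x0_pred - x0_known)).pow(2).mean()
    grad = torch.autograd.grad(loss, x_t)[0]
    return x_t.detach() - scale * grad
```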
Image Inpainting via Tractable Steering of Diffusion Models
Tractable Probabilistic Models
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
training-based,text-guided
See its Text-Guided Inpainting Model.
Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
training-based,text-guided
The Imagen counterpart of GLIDE's text-guided inpainting model: directly downsampling and concatenating causes artifacts at the mask boundary, so an encoder is trained to do the downsampling.
High-Resolution Image Synthesis with Latent Diffusion Models
training-based,text-guided
The StableDiffusion counterpart of GLIDE's text-guided inpainting model, trained on LAION with random masks; the masked image is also encoded by the VAE encoder, and the mask is downsampled to
Improving Text-guided Object Inpainting with Semantic Pre-inpainting
training-based,text-guided
StableInpainting有两个缺点:Solely relying on one single U-Net to align text prompt and visual object across all the denoising timesteps is insufficient to generate desired objects; The controllability of object generation is not guaranteed in the intricate sampling space of diffusion model.
pre-inpainting
SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model
training-based,text-guided
self supervised learning using panoptic segmentation dataset
mask augmentation + background preservation with mask prediction
At editing time, the shape can also be specified via the mask.
A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
training-based,text-guided
Same training recipe as StableInpainting, with an additional trainable prompt inserted into the text as the task prompt.
Adding Conditional Control to Text-to-Image Diffusion Models
training-based,text-guided
BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion
training-based,text-guided
An improved ControlNet: the added branch removes the cross-attention layers and processes only the image.
Brush2Prompt: Contextual Prompt Generator for Object Inpainting
Automatically generate an inpainting prompt from the unmasked content and the mask shape, then inpaint with a text-guided inpainting model.
LoMOE: Localized Multi-Object Editing via Multi-Diffusion
training-free,text-guided。
Use BLIP to generate the image's prompt and regularized DDIM Inversion to obtain
Since multiple regions are edited, mask-based MultiDiffusion is used: each region is denoised once with its own edit prompt, and the results are combined according to the masks.
The classic two-branch approach: a loss is computed between the two branches and optimized by gradient descent
HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models
A training-free, text-guided method built on StableInpainting.
Using the pretrained StableInpainting model, replace all self-attention layers with Prompt-Aware Introverted Attention (PAIntA) layers. These still compute self-attention, but modify each masked pixel's self-attention map: the response to each unmasked pixel is multiplied by a coefficient equal to the sum of that unmasked pixel's cross-attention responses over all words, so masked pixels attend more to the unmasked pixels that are relevant to the text. Since every self-attention layer (now a PAIntA layer) in StableInpainting precedes a cross-attention layer, the computation borrows the parameters of the following cross-attention layer.
Reweighting Attention Score Guidance: compute each word's cross-attention map and a cross-entropy objective against the mask, maximizing the cross-attention scores inside the masked region and minimizing them outside; sum over words and take the gradient as guidance. Ordinary guidance shifts samples off-distribution and hurts quality, so here the guidance is divided by its standard deviation and substituted for the noise term in the stochastic DDIM update: that update keeps samples on-distribution when its noise is standard normal, so rescaling the guidance to unit variance (while keeping its mean) preserves quality while still steering generation; see the sketch after these notes.
Train a super-resolution LDM to upscale the inpainting result.
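A sketch of the standardized-guidance trick described in the RASG note above: rescale the guidance gradient to unit standard deviation and use it in place of the Gaussian noise of stochastic DDIM. The sign convention (ascent vs. descent on the attention objective) and variable names are assumptions.

```python
import torch

def rasg_step(mu_prev, sigma_t, guidance_grad):
    """One stochastic-DDIM update where the noise term is replaced by the guidance
    gradient rescaled to (roughly) unit standard deviation; `mu_prev` is the
    deterministic part of the update and `sigma_t` its noise scale."""
    g = guidance_grad / (guidance_grad.std() + 1e-8)
    return mu_prev + sigma_t * g
```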
MagicRemover: Tuning-free Text-guided Image inpainting with Diffusion Models
Training-free, text-guided, specialized for object removal; the text names the object to remove.
optimizing
Inject the self-attention K/V of the reconstructive generation trajectory into the inpainting trajectory; following the MasaCtrl idea, a mask of the object can be estimated from the reconstructive trajectory's cross-attention, so that the object region of the inpainting trajectory's self-attention only attends to the reconstructive K/V outside the mask.
Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance
Training-free; requires a mask of the object to remove.
AAS: increase the self-attention weights between the masked region and the background, so the removed area better matches the background; decrease the weights within the masked region, since there is nothing there to reference while inpainting it; and decrease the weights from the background to the masked region, to keep the background unaffected.
SARG: attention guidance similar to PAG,
MagicEraser: Erasing Any Objects via Semantics-Aware Control
Use TI to learn an adjective token placed before background-related words, e.g. $\text{A photo of } S_{\star} \text{ sky}$, while LoRA fine-tuning the diffusion model on a constructed dataset.
Increase the self-attention weights of regions related to the masked region and decrease those of unrelated regions.
At zero-shot inference, use a prompt containing
Coherent and Multi-modality Image Inpainting via Latent Space Optimization
training-free,text-guided
We believe that the early stage of the reverse process determines the semantics of the generated image; therefore, only in the first
Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model
training-based,text-free + text-guided
We found blending alone insufficient: since the known information is inserted externally rather than generated by the model itself, the model lacks full context awareness, potentially causing incoherent semantic transitions near the hole boundary. A brief masked fine-tuning of the model suffices, after which blended generation is used as before.
Further adopts masked attention: for cross-attention, only the pixels inside the masked region attend to the text; for self-attention, only the pixels inside the masked region attend to each other.
Multi-modality Guided Image Completion
training-based,text-based,StableInpainting Model。
Each modality has its own encoder that extracts multi-scale features; each scale is injected into the corresponding scale of the UNet encoder features. Structure-form modalities (e.g., segmentation, edges) are added directly; context-form modalities (e.g., text, style) are pooled and injected into cross-attention as context vectors. During training, the StableDiffusion Inpainting Model is frozen and only the modality encoders are trained, each modality separately. Somewhat similar to ControlNet.
At sampling time, multiple modality encoders can be used together, but not via the injection scheme above (the features are not additive). Instead, compute an MSE loss between the multi-scale features of the StableDiffusion Inpainting Model's UNet and the multi-scale features obtained with each single modality encoder, and use the gradients as guidance; since gradients are additive, this enables multi-modal guidance without retraining on multiple modalities.
Inpaint Anything: Segment Anything Meets Image Inpainting
SAM + any inpainting model
Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting
training-based
Uses the IR-SDE formulation, with the masked image as
Sparse structure: e.g., the grayscale map and the edge map.
ByteEdit: Boost, Comply and Accelerate Generative Image Editing
StableInpainting with feedback learning
Assume the diffusion model uses
Sketch-guided Image Inpainting with Partial Discrete Diffusion Process
Apply discrete diffusion only to the tokens in the masked region; construct a dataset for self-supervised training.
Lazy Diffusion Transformer for Interactive Image Editing
Uses Pixel-
Uses Pixel-
Constructs a dataset as in SmartBrush for self-supervised training.
AsyncDSB: Schedule-Asynchronous Diffusion Schrödinger Bridge for Image Inpainting
Generative Powers of Ten
zoom stack
Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach
Randomly crop two views, an anchor view and a target view, resize them to the same shape, compute RPE from the top-left coordinates, and train to generate the target view conditioned on the anchor view.
Salient Object-Aware Background Generation using Text-Guided Diffusion Models
We use Stable Inpainting as a base model and add the ControlNet model on top to adapt it to the salient object outpainting task.
Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
SODA: Bottleneck Diffusion Models for Representation Learning
The UNet has
Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models
Diffusion Bridge AutoEncoders for Unsupervised Representation Learning
The encoder encodes
Compared with Diff-AE and PDAE, the data's information is stored separately in
Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation
DiffuseGAE: Controllable and High-fidelity Image Manipulation from Disentangled Representation
Learn disentangled representations on Diff-AE's latent space.
DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models
Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement
Closed-Loop Unsupervised Representation Disentanglement with beta-VAE Distillation and Diffusion Probabilistic Feedback
Factorized Diffusion Autoencoder for Unsupervised Disentangled Representation Learning
Exploring Diffusion Time-steps for Unsupervised Representation Learning
Causal Diffusion Autoencoders: Toward Counterfactual Generation via Diffusion Probabilistic Models
Object-Centric Slot Diffusion
Learning to Compose: Improving Object Centric Learning by Injecting Compositionality
Denoising Diffusion Autoencoders are Unified Self-supervised Learners
Diffusion Models as Masked Autoencoders
Unified Auto-Encoding with Masked Diffusion
Use the MAE loss on the masked region and the diffusion loss on the noisy part, trained jointly.
Additionally introduce
Masked Diffusion as Self-supervised Representation Learner
MAE with a dynamic mask ratio.
StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
Generate images from text as training data.
Can Generative Models Improve Self-Supervised Representation Learning?
Generate images from the source images as data; instance-guided generation serves as an augmentation for SSL.
Unlike StableRep, we do not replace a real dataset with a synthetic one. Instead, we leverage conditional generative models to enrich augmentations for self-supervised learning. In addition, our method does not require text prompts and directly uses images as input to the generative model.
Personalized Representation from Personalized Generation
Contrastive Learning with Synthetic Positives
Multi Positive Contrastive Learning with Pose-Consistent Generated Images
GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning
Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images
DreamDA: Generative Data Augmentation with Diffusion Models
Add Gaussian noise to the h-space feature, which is used to predict
DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling
Deconstructing Denoising Diffusion Models for Self-Supervised Learning
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models
Diffusion Model as Representation Learner
Distill the intermediate representation from a pre-trained diffusion model to a recognition student.
After the distillation phase, the student is reapplied as a feature extractor and fine-tuned with the task label.
Reinforced Time Selection for Distillation.
De-Diffusion Makes Text a Strong Cross-Modal Interface
text as representation, encoder is a captioning model, decoder is a text2img model
gumbel softmax
Do text-free diffusion models learn discriminative visual representations?
Use UNet intermediate feature maps for discrimination.
Diffusion Feedback Helps CLIP See Better
DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, with only images (without corresponding text).
CLIP image and text embeddings live in the same space, so the image embedding can be fed to StableDiffusion as the condition.
Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
One Diffusion to Generate Them All
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Different tasks are converted into different instructions; (source image, instruction, target image) triplets form the data, with the instruction as the text input. A StableDiffusion is trained to generate the target image, with the source image concatenated to
Training on InstructPix2Pix data also enables editing.
DreamOmni: Unified Image Generation and Editing
InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists
The instruction is the text input to StableDiffusion, and the source image
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Toward a Diffusion-Based Generalist for Dense Vision Tasks
DiffX: Guide Your Layout to Cross-Modal Generative Modeling
Robust Classification via a Single Diffusion Model
Diffusion Models are Certifiably Robust Classifiers
Few-shot Learner Parameterization by Diffusion Time-steps
LoRA fine-tune StableDiffusion on the few-shot dataset with the prompt "a photo of [C]"; inference uses a formula similar to RDC above, but with a timestep weight added to the formula, which is shown to be important.
Image Captions are Natural Prompts for Text-to-Image Models
For datasets with only class labels, such as ImageNet: use a pretrained captioning model to caption each image, append the caption to "a photo of class" to form a prompt, generate an image for that prompt with pretrained StableDiffusion, and replace the original image with the generated one. The synthetic dataset has the same size as the original, and training a classifier on it works better.
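A possible sketch of this caption-then-regenerate pipeline using off-the-shelf BLIP and StableDiffusion checkpoints; the model identifiers and the exact prompt template are illustrative, not the paper's.

```python
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Captioner (any pretrained captioning model would do; BLIP is used for illustration).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

# Text-to-image generator.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

def synthesize(image, class_name):
    """Replace one labeled real image with a synthetic one: caption the real image,
    prepend the class template, and regenerate with StableDiffusion."""
    inputs = processor(images=image, return_tensors="pt").to(device)
    caption = processor.decode(captioner.generate(**inputs)[0], skip_special_tokens=True)
    prompt = f"a photo of {class_name}, {caption}"   # template is an assumption
    return pipe(prompt).images[0], prompt
```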
Classification-Denoising Networks
Just Leaf It: Accelerating Diffusion Classifiers with Hierarchical Class Pruning
A Simple and Efficient Baseline for Zero-Shot Generative Classification
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
Unified Text-to-Image Generation and Retrieval
DiffusionDet: Diffusion Model for Object Detection
CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models
FocusDiffuser: Perceiving Local Disparities for Camouflaged Object Detection
DiffRef3D: A Diffusion-based Proposal Refinement Framework for 3D Object
DiffuBox: Refining 3D Object Detection with Point Diffusion
Monocular: 3D Object Detection and Pose Estimation with Diffusion Models
SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection
CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection
DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection
Digging into contrastive learning for robust depth estimation with diffusion models
BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation
Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation
SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation
SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models
Depth Any Video with Scalable Synthetic Data
FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models
DFormer: Diffusion-guided Transformer for Universal Image Segmentation
Diffusion Models for Zero-Shot Open-Vocabulary Segmentation
Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis
Concatenate the image and its segmentation along the channel dimension as one sample and train a text-guided diffusion model with a new Gaussian-Categorical distribution formulation; it can generate the image and segmentation jointly from text, and also generate one from the other.
SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow
We train an ordinary differential equation (ODE) model to transport between the distributions of real images and semantic masks.
UniGS: Unified Representation for Image Generation and Segmentation
Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation
Domain-adaptive semantic segmentation: use image translation to transfer the segmentation model.
Requires source-domain images with segmentation maps and target-domain images.
Train a segmentation model on the source-domain images and maps, and a diffusion model on the target-domain images. Apply SDEdit to a source-domain image, using the loss between the segmentation model's prediction and the ground-truth map as a gradient correction, to generate the image corresponding to that segmentation map in the target domain; then fine-tune the source-domain segmentation model on these pairs to obtain a target-domain segmentation model.
DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control
A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting
pix2gestalt: Amodal Segmentation by Synthesizing Wholes
Diffusion Model for Dense Matching
Apply Diffusion Model on Image Captioning
DiffCap: Exploring Continuous Diffusion on Image Captioning
Text-Only Image Captioning with Multi-Context Data Generation
Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Parallel Vertex Diffusion for Unified Visual Grounding
Language-Guided Diffusion Model for Visual Grounding
Exploring Iterative Refinement with Diffusion Models for Video Grounding
DDP: Diffusion Model for Dense Visual Prediction
DIFFANT: Diffusion Models for Action Anticipation
DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion
Faster Diffusion Action Segmentation
ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation
DiffusionTrack: Diffusion Model For Multi-Object Tracking
DINTR: Tracking via Diffusion-based Interpolation
DiffusionTrack: Point Set Diffusion Model for Visual Object Tracking
DeTrack: In-model Latent Denoising Learning for Visual Object Tracking
MomentDiff: Generative Video Moment Retrieval from Random to Real
Conditional Diffusion Model for Open-ended Video Question Answering
DiffSED: Sound Event Detection with Denoising Diffusion
Is Synthetic Data From Diffusion Models Ready for Knowledge Distillation?
Use data generated by a diffusion model as the training set: feed it to the pretrained teacher network and distill into the student network. This removes the restriction to real datasets and works well; generated low-fidelity images (e.g., from fewer sampling steps) work even better.
Knowledge Diffusion for Distillation
Train a diffusion model on the features extracted by the teacher network; treat the features extracted by the student network as noisy versions of the teacher features and denoise them, then compute a KL loss between the denoised features and the teacher features to optimize the student network.
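A highly simplified sketch of this feature-denoising distillation step; the tiny MLP stands in for the diffusion model trained on teacher features, and collapsing the multi-step denoising into a single call is my simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDenoiser(nn.Module):
    """Stand-in for the diffusion model trained on teacher features."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feat):
        return self.net(feat)

def distill_step(student_feat, teacher_feat, denoiser, temperature=1.0):
    # Treat the student feature as a noisy teacher feature and "denoise" it.
    denoised = denoiser(student_feat)
    # Align the denoised student feature with the teacher feature via a KL loss
    # on softened distributions (an MSE would be a reasonable alternative).
    p = F.log_softmax(denoised / temperature, dim=-1)
    q = F.softmax(teacher_feat / temperature, dim=-1)
    return F.kl_div(p, q, reduction="batchmean")
```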
data attribution: for a generated image, which training samples contributed most to it?
Evaluating Data Attribution for Text-to-Image Models
Intriguing Properties of Data Attribution on Diffusion Models
Detecting Image Attribution for Text-to-Image Diffusion Models in RGB and Beyond
Efficient Shapley Values for Attributing Global Properties of Diffusion Models to Data Group
SemGIR: Semantic-Guided Image Regeneration based method for AI-generated Image Detection and Attribution
MONTRAGE: Monitoring Training for Attribution of Generative Diffusion Models
Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Model
Latent Dataset Distillation with Diffusion Models
Dataset distillation aims to generate a small set of representative synthetic samples from the original training set.
D4M: Dataset Distillation via Disentangled Diffusion Model
Feature Denoising Diffusion Model for Blind Image Quality Assessment
eDifFIQA: Towards Efficient Face Image Quality Assessment Based On Denoising Diffusion Probabilistic Models
DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild
Comparison of No-Reference Image Quality Models via MAP Estimation in Diffusion Latents
Use pretrained generative models to assist perception models, or use generated data to improve them.
DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models
CleanDIFT: Diffusion Features without Noise
AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks
Upgrading VAE Training With Unlimited Data Plans Provided by Diffusion Models
Exploiting Diffusion Prior for Generalizable Pixel-Level Semantic Prediction
Scaling Laws of Synthetic Images for Model Training
Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors
To effectively transfer learned features to discriminative tasks while ensuring compatibility, an intuitive approach is to introduce the prior knowledge of the recognition model. A pretrained ResNet-18 is used to introduce this discriminative prior.
The U-head has two flows: a down-sample flow producing global features for tasks like classification, and an up-sample flow producing spatial features for tasks like segmentation.
Scaling Properties of Diffusion Models for Perceptual Tasks
Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models
Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features
Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques
Diffusion Models Trained with Large Data Are Transferable Visual Models
We show that, simply initializing image understanding models using a pre-trained UNet (or transformer) of diffusion models, it is possible to achieve remarkable transferable performance on fundamental vision perception tasks using a moderate amount of target data.
Take a pretrained diffusion model, feed it the clean image with timestep 1, and fine-tune it to predict the target (e.g., depth).
Add-SD: Rational Generation without Manual Reference
Use a diffusion model to edit images by adding objects, addressing the long-tailed class distribution in downstream classification, segmentation, and detection.
Diffusion Models as Data Mining Tools
Diffusion Models Beat GANs on Image Classification
UNet feature + classification head
Feedback-Guided Data Synthesis for Imbalanced Classification
Analyzing and Explaining Image Classifiers via Diffusion Guidance
Diversify, Don’t Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images
Active Generation for Image Classification
Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model
Efficient Exploration of Image Classifier Failures with Bayesian Optimization and Text-to-Image Models
Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers
DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection
Use StableDiffusion to produce detection data, similar to Attention as Annotation.
First, feed existing detection data with one step of added noise into StableDiffusion and train a Detection Adaptor that predicts bounding boxes from the UNet feature-map pyramid. Then freeze the Detection Adaptor, construct simple generic prompts, add noise to existing detection images and regenerate them (similar to SDEdit), feed the last step's feature-map pyramid into the Detection Adaptor, and use its output as the bounding-box annotation of the generated image.
Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation
Use StableDiffusion to produce detection data by generating foreground and background separately and compositing them.
Data Augmentation for Object Detection via Controllable Diffusion Models
Learning Compositional Language-based Object Detection with Diffusion-based Synthetic Data
3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features
DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception
Diffusion Domain Teacher: Diffusion Guided Domain Adaptive Object Detector
No Annotations for Object Detection in Art through Stable Diffusion
Representative Feature Extraction During Diffusion Process for Sketch Extraction with One Example
Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model
Use pretrained networks to annotate StableDiffusion-generated images, building depth and saliency datasets.
Extract the intermediate output of some self-attention layer at some sampling step. Interpolate lower resolution predictions to the size of synthesized images. A linear classifier is trained on it to predict the pixel-level logits.
StableDiffusion plus the linear classifier can then be used for prediction.
JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling
a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps).
ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation
PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage
Label-Efficient Semantic Segmentation With Diffusion Models
Noise the input image and feed it into the UNet of a pretrained DDPM; upsample the feature maps output by several decoder layers to the image size and concatenate them, so each pixel has a feature vector; feed these vectors into an MLP for label prediction and train it.
Experiments select
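A sketch of this per-pixel feature extraction and MLP classifier; `add_noise` and the `return_features=True` UNet interface are assumed helpers, since real implementations hook the chosen decoder blocks directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pixel_features(unet, x0, timesteps, blocks, scheduler):
    """Collect per-pixel features from a frozen pretrained DDPM.
    unet(x_t, t, return_features=True) is assumed to also return the intermediate
    decoder feature maps; scheduler.add_noise(x0, noise, t) is an assumed helper."""
    feats = []
    for t in timesteps:
        noise = torch.randn_like(x0)
        x_t = scheduler.add_noise(x0, noise, t)
        _, dec_feats = unet(x_t, t, return_features=True)
        for b in blocks:
            f = F.interpolate(dec_feats[b], size=x0.shape[-2:], mode="bilinear",
                              align_corners=False)
            feats.append(f)
    return torch.cat(feats, dim=1)                       # (B, sum_C, H, W)

class PixelClassifier(nn.Module):
    """Per-pixel MLP label predictor over the concatenated features."""
    def __init__(self, in_dim, num_classes, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, feat_map):                         # (B, C, H, W)
        b, c, h, w = feat_map.shape
        logits = self.mlp(feat_map.permute(0, 2, 3, 1).reshape(-1, c))
        return logits.reshape(b, h, w, -1).permute(0, 3, 1, 2)
```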
EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models
we leverage the semantic knowledge extracted from Stable Diffusion (SD) and aim to develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.
MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
Estimate the segmentation from the cross-attention maps of the attribution prompt across multiple timesteps and different layers.
The attribution prompt is not necessarily the best description; borrowing the TI idea, token optimization on some data (i.e., optimizing the attribution prompt's token embedding) works noticeably better.
Diffusion-Guided Weakly Supervised Semantic Segmentation
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models
training-free
Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation
Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation
Diffusion Features to Bridge Domain Gap for Semantic Segmentation
Unleashing Text-to-Image Diffusion Models for Visual Perception
cross-attention map
EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment
enhanced VPD
Harnessing Diffusion Models for Visual Perception with Meta Prompts
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
Diffuse, Attend, and Segment Unsupervised Zero-Shot Segmentation using Stable Diffusion
DiffSeg utilizes a pre-trained StableDiffusion model and specifically its self-attention layers to produce high quality segmentation masks.
Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter
Design a prompt and feed it with the image into StableDiffusion; use a word's cross-attention map to obtain a rough segmentation of that object, then refine and complete it with the self-attention maps.
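A very rough sketch of this training-free recipe: take one word's cross-attention response as a seed mask and propagate it with (a power of) the averaged self-attention map. The exponent and threshold are illustrative, and the paper's actual refinement is more involved.

```python
import torch

def rough_mask_from_attention(cross_attn, self_attn, word_index, power=4, thresh=0.5):
    """cross_attn: (HW, num_tokens) averaged cross-attention maps
    self_attn:  (HW, HW) averaged self-attention maps
    Returns a flattened binary mask for the object named by token `word_index`."""
    word_map = cross_attn[:, word_index]                 # rough per-pixel response
    word_map = word_map / (word_map.max() + 1e-8)
    # Propagate/complete the response using repeated self-attention.
    refined = torch.matrix_power(self_attn, power) @ word_map
    refined = refined / (refined.max() + 1e-8)
    return (refined > thresh).float()
```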
Attention as Annotation: Generating Images and Pseudo-masks for Weakly Supervised Semantic Segmentation with Diffusion
Without relying on manual annotation, use StableDiffusion to generate large amounts of image and segmentation-map (cross-attention map) data to train a segmentation model.
MaskFactory: Towards High-quality Synthetic Data Generation for Dichotomous Image Segmentation
SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis
Using an existing segmentation dataset (image, segmentation map): encode the segmentation map as a three-channel image, obtain the image's caption with BLIP2, and fine-tune SDXL (caption
Train a ControlNet on (image, segmentation map) pairs to obtain a Mask2Img model.
These two networks can then generate new segmentation training data: take an image from the existing segmentation dataset, caption it with BLIP2, feed the caption into the Text2Mask model to obtain a set of segmentation maps, then feed those into the Mask2Img model to obtain the corresponding images, forming new data pairs.
For the same segmentation model, training on the existing dataset plus the generated data clearly outperforms training on the existing dataset alone.
Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models
Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation
Train a ControlNet on scribbles to generate segmentation training data.
Outline-Guided Object Inpainting with Diffusion Models
Using a small amount of instance segmentation data, apply StableInpainting to create object variations of these data, augmenting the dataset.
Explore In-Context Segmentation via Latent Diffusion Models
ScribbleGen: Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation
Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
Given an image and a text describing some object in it, use the pretrained text2img model StableDiffusion to predict the target object's mask.
Guiding Text-to-Image Diffusion Model Towards Grounded Generation
Using the pretrained text2img model StableDiffusion, while generating the image from text, also output the corresponding segmentation mask.
First generate images with StableDiffusion and use a pretrained object detector to produce their segmentation masks, building a dataset; then train a grounding module on this dataset, similarly to Label-Efficient Semantic Segmentation With Diffusion Models.
Generative Prompt Model for Weakly Supervised Object Localization
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence
exploit Stable Diffusion features for semantic and dense correspondence
Emergent Correspondence from Image Diffusion
No training is needed; Stable Diffusion features can be used directly for matching.
SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching
prompt tuning
Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
for a given feature map
DDIM generation and inversion behave similarly, so the method applies to both synthetic and real images.
For semantic correspondence, we flatten the descriptor maps for a pair of images and compute the cosine similarity between every possible pair of points. We then supervise with the labeled corresponding keypoints using a symmetric cross entropy loss in the same fashion as CLIP.
DiffGlue: Diffusion-Aided Image Feature Matching
SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models
SynthVLM is a novel data synthesis pipeline for VLLMs.
Unlike existing methods that generate captions from images, SynthVLM employs advanced diffusion models and high-quality captions to automatically generate and select high-resolution images from captions, creating precisely aligned image-text pairs.
TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models
Use a layout-to-image model to generate video sequences from tracklets as training data for MOT.
DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction
CycleHOI: Improving Human-Object Interaction Detection with Cycle Consistency of Detection and Generation
EGC: Image Generation and Classification via a Diffusion Energy-Based Model
Energy function; optimization requires second-order derivatives.
Similar to Denoising Likelihood Score Matching for Conditional Score-based Data Generation.
DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability
Unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
Factorized Diffusion Architectures for Unsupervised Image Generation and Segmentation
A model unifying generation and understanding, similar to MAGE.
Duplicate a UNet decoder as the Mask Generator, generating at each step
Can also segment real images: just add noise for one step and denoise for one step.
Unseen Image Synthesis with Diffusion Models
Use a diffusion model pretrained on one domain to generate out-of-domain samples.
DDIM-invert 2k OOD samples to step 500, obtaining 2k
IMPUS: Image Morphing with Perceptually-Uniform Sampling Using Diffusion Models
Goal: given two images, interpolate between them.
Run TI on SD for each image to obtain the two images' text embeddings.
LoRA fine-tune SD with the two text embeddings above.
Use
Interpolate the text embeddings and generate with CFG.
DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion
Generates an interpolation sequence (a video).
AID: Attention Interpolation of Text-to-Image Diffusion
Interpolate the cross-attention K/V of the two images' generation processes and use the result to replace the cross-attention K/V in the generation at the current interpolation point.
NoiseDiffusion: Correcting Noise for Image Interpolation with Diffusion Models beyond Spherical Linear Interpolation
For diffusion-model-generated images, DDIM Inversion + slerp interpolation works well, but it does not for real images; correcting the noise with a few techniques resolves this.
Prompt Mixing in Diffusion Models using the Black Scholes Algorithm
Generation with prompt interpolation.
Concept-centric Personalization with Large-scale Diffusion Priors
A new task: personalize StableDiffusion into a model dedicated to generating a certain concept. Unlike TI, it focuses on a more abstract concept (e.g., human faces) rather than the concept in a few reference images, and emphasizes fidelity and diversity in the generative results, so at least thousands of images of the concept are required.
The approach separates the concept from other control conditions: fine-tune StableDiffusion on the provided concept dataset (always with null text) to obtain a concept-centric diffusion model, generate with CFG, and introduce other controls also through CFG, e.g., text and ControlNet.
Neural Network Diffusion
parameter autoencoder + latent diffusion model
Diffusion-based Neural Network Weights Generation
Conditional LoRA Parameter Generation
FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes
Fine-tune large pre-trained diffusion models scaling to large-scale fine-grained image generation with 10,000 categories.
A new form of CFG: replace the null embedding with the superclass label embedding during both training and sampling.
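The sampling-time combination implied by this note, written out explicitly (a sketch; `eps_fine`/`eps_super` denote the noise predictions under the fine-grained class embedding and the superclass embedding, and the scale is illustrative):

```python
def superclass_cfg(eps_fine, eps_super, guidance_scale=3.0):
    # Standard CFG, except the "unconditional" branch is conditioned on the superclass
    # embedding rather than the null embedding, so guidance pushes the sample toward
    # its fine-grained class within the superclass.
    return eps_super + guidance_scale * (eps_fine - eps_super)
```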
Factorized Diffusion: Perceptual Illusions by Noise Decomposition
Generates illusions.
Through different decomposition methods (e.g., high/low frequencies, color, motion), the image