
Theory


 

Tweedie's formula

Given a random variable $z \sim \mathcal{N}(\mu_z, \Sigma_z)$, one can estimate the mean of the Gaussian by Tweedie's formula: $\mathbb{E}[\mu_z \mid z] = z + \Sigma_z \nabla_z \log p(z)$.

In diffusion models, $x_t \sim \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t)I)$, so given an $x_t$, the estimate of the mean $\sqrt{\bar\alpha_t}\,x_0$ is $x_t + (1-\bar\alpha_t)\nabla_{x_t}\log q(x_t \mid x_0)$. Since $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon_t$, it follows that $\nabla_{x_t}\log q(x_t \mid x_0) = -\frac{\epsilon_t}{\sqrt{1-\bar\alpha_t}}$.
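
A minimal PyTorch sketch of this estimate, assuming a noise-prediction network `eps_model` (hypothetical) and a precomputed `alpha_bar` schedule:

```python
import torch

def tweedie_x0_estimate(x_t, t, eps_model, alpha_bar):
    """Estimate x0 from x_t via Tweedie's formula.

    With score(x_t) = -eps / sqrt(1 - alpha_bar_t), the posterior mean of
    sqrt(alpha_bar_t) * x0 is x_t + (1 - alpha_bar_t) * score(x_t).
    """
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)        # broadcast over (B, C, H, W)
    eps = eps_model(x_t, t)                       # predicted noise
    score = -eps / torch.sqrt(1.0 - a_bar)        # grad of log q(x_t | x_0)
    mean_est = x_t + (1.0 - a_bar) * score        # estimate of sqrt(a_bar) * x0
    return mean_est / torch.sqrt(a_bar)           # estimate of x0
```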

 

Awesome Architecture

Cascaded

Cascaded Diffusion Models for High Fidelity Image Generation

 

MDM

Matryoshka Diffusion Models

Matryoshka-Diffusion

  1. All resolutions are trained jointly; during training, every resolution of a given sample uses the same noising timestep to avoid information leakage. The noise schedule shift proposed in SimpleDiffusion is also used.

  2. Similar to ProgressiveGAN, training starts from the lowest resolution; the UNet is then gradually widened and more loss terms are added to train higher resolutions. When training a higher resolution, the lower-resolution network is trained jointly as well.

 

ADM

Diffusion Models Beat GANs on Image Synthesis

  1. AdaGN: the class label is mapped to a fixed dimension and added to the time embedding; an MLP then predicts $y_s$ and $y_b$, which apply an affine transform to the GroupNorm output: $y_s \cdot \mathrm{GroupNorm}(h) + y_b$ (see the sketch after this list).

  2. Super-resolution model: the low-resolution image $x_0^{low}$ is upsampled to the size of $x_t^{high}$ and concatenated with $x_t^{high}$ along the channel dimension; the UNet takes the 6-channel input and outputs 3 channels to predict the noise of $x_t^{high}$. Note that for every $t$, what is concatenated to $x_t^{high}$ is $x_0^{low}$, not $x_t^{low}$.
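
A minimal PyTorch sketch of AdaGN as described in item 1; the module name and shapes are assumptions, not the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaGN(nn.Module):
    """GroupNorm whose scale/shift are predicted from time + class embeddings."""

    def __init__(self, channels, emb_dim, num_groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, channels, affine=False)
        self.proj = nn.Linear(emb_dim, 2 * channels)   # MLP predicting y_s, y_b

    def forward(self, h, t_emb, c_emb):
        emb = t_emb + c_emb                             # class embedding added to time embedding
        y_s, y_b = self.proj(F.silu(emb)).chunk(2, dim=1)
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(h) + y_b                 # affine on the GroupNorm output
```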

 

AsCAN

AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation

AsCAN

  1. AsCAN is a hybrid architecture, combining both convolutional and transformer blocks.

 

D3PM

Structured Denoising Diffusion Models in Discrete State-Spaces

  1. Discrete Diffusion Model

 

GGM

Glauber Generative Model: Discrete Diffusion Models via Binary Classification

  1. In D3PM the forward process randomly corrupts all tokens at every step; GGM corrupts only one token per step, so the reverse process only needs to predict one token per step.

  2. Sample a random index sequence $\{i_t\}_{t=0}^{T-1}$ of length $T$. When adding noise, all tokens of $x_{t+1}$ except the one at position $i_t$ stay unchanged, while the token at $i_t$ is randomly corrupted (it may also remain the same). The same index sequence is used for training and sampling.

 

ImageBART

ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

  1. VQGAN + multinomial diffusion

  2. Transformer encoder: $x_{t+1}$; Transformer decoder: autoregressive, shifted prediction of $x_t$.

ImageBART

 

VQ-Diffusion

Vector Quantized Diffusion Model for Text-to-Image Synthesis

  1. VQVAE + multinomial diffusion

  2. Transformer blocks: input $x_{t+1}$, cross-attention with text, non-autoregressive prediction of $x_t$.

 

CATDM

Mitigating Embedding Collapse in Diffusion Models for Categorical Data

  1. While jointly learning the embedding (via reconstruction loss) and the latent diffusion model (via score matching loss) could enhance performance, our analysis shows that end-to-end training risks embedding collapse, degrading generation quality. To address this issue, we introduce CATDM, a continuous diffusion framework within the embedding space that stabilizes training.

 

LDM

High-Resolution Image Synthesis with Latent Diffusion Models

AutoEncoder: $H\times W\times 3 \rightarrow \frac{H}{f}\times\frac{W}{f}\times c$, with $f = 2^m$; an autoencoder is trained to moderately reduce the image dimensionality.

With the autoencoder frozen, a UNet-based DDPM is trained to model the latent space of the downscaled data. Benefits: less computation; the spatial structure of the image is preserved, so the UNet's inductive bias still applies; and a reusable latent space is learned. A downsampling factor $f$ of 4 or 8 generally works best.

Slight regularization: KL or VQ, to avoid high-variance latent spaces.

Cross-attention layers are injected into the UNet after the self-attention layers. Q: the $x_t$ feature map; K, V: the conditions. A minimal sketch follows.
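
A minimal PyTorch sketch of this conditioning pattern (queries from the image feature map, keys/values from condition tokens); names and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Q from the x_t feature map, K/V from condition embeddings (e.g. text tokens)."""

    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)

    def forward(self, x, cond):
        # x:    (B, H*W, dim)   flattened feature-map tokens
        # cond: (B, L, cond_dim) condition tokens
        out, _ = self.attn(query=x, key=cond, value=cond)
        return x + out  # residual, as in the UNet transformer blocks
```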

 

Wuerstchen

StableCascade

Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

Three stages to reduce computational demands:

  1. Stage A: train a VQGAN with 4x downsampling, $1024 \rightarrow 256$.

  2. Semantic Compressor: resize the image from 1024 to 768 and train a network that compresses it to $16\times 24\times 24$.

  3. Stage B: a diffusion model models the pre-quantization embeddings of Stage A, conditioned on the semantic-compressor output of the image (Wuerstchen additionally conditions on text), which amounts to self-conditioning.

  4. Stage C: a diffusion model models the semantic-compressor output of the image, conditioned on text.

  5. At generation time: C -> B -> A.

 

BinaryLDM

Binary Latent Diffusion

Same idea as LDM, except that the latent is binarized.

Following the VQ idea but replacing the nearest-neighbor lookup with Bernoulli sampling, a binary autoencoder with binarized latents is trained: $x \in \mathbb{R}^{h\times w\times 3}$, $y = \mathrm{Sigmoid}(E(x))$, $z = \mathrm{Bernoulli}(y)$, $\hat z = \mathrm{sg}(z) + y - \mathrm{sg}(y)$, $\hat x = D(\hat z)$ (a straight-through estimator; see the sketch below).
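
A minimal PyTorch sketch of the Bernoulli sampling with a straight-through estimator (sg denotes stop-gradient); the encoder/decoder are placeholders:

```python
import torch

def binary_latent(y):
    """Sample a binary latent from Bernoulli(y) with a straight-through gradient.

    y: encoder output after Sigmoid, values in (0, 1).
    Returns z_hat = sg(z) + y - sg(y): the forward pass uses the hard sample z,
    the backward pass routes gradients through y.
    """
    z = torch.bernoulli(y)                    # hard binary sample
    z_hat = z.detach() + y - y.detach()       # straight-through estimator
    return z_hat
```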

A Bernoulli diffusion process is derived, and a DPM is used to model the distribution of $z$.

 

SiD

Simple Diffusion: End-to-end diffusion for high resolution images

SiD

  1. Existing high-resolution diffusion models come in two flavors: the dimensionality-reduction approach of StableDiffusion and the coarse-to-fine cascaded super-resolution approach. SimpleDiffusion uses the following techniques to make direct pixel-space training of high-resolution diffusion models work.

  2. Change the noise schedule: an observation is that the noised version $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ of a high-resolution image $x_0$ stays more recognizable than that of a low-resolution image, because a high-resolution image uses many pixels to express a single detail and neighboring pixels provide a lot of information redundancy (see On the Importance of Noise Scheduling for Diffusion Models). Consequently the forward process destroys information late, and at generation time the global structure of the image has to be laid down in an early and rather short window, which makes training work poorly. The fix is to pick a base resolution of 64x64, assign it an SNR function that works well empirically, and compute the corresponding SNR function for the target resolution, yielding a shifted noise schedule.

  3. Multi-scale training: one difficulty of direct pixel-space training at high resolution is that high-frequency content (object edges, etc.) is hard to model and dominates the training loss. The paper therefore proposes a multi-scale training loss $L^{d\times d}_\theta(x) = \frac{1}{d^2}\,\mathbb{E}_{\epsilon,t}\big\|D^{d\times d}[\epsilon] - D^{d\times d}[\epsilon_\theta(x_t,t)]\big\|_2^2$, where $D^{d\times d}[\cdot]$ denotes downsampling to resolution $d\times d$; since downsampling is a linear operator, $D^{d\times d}[\epsilon_\theta]$ can be viewed as a $d\times d$ diffusion model. The final objective is $\sum_{s\in\{32,64,128,\dots,d\}} \frac{1}{s} L^{s\times s}_\theta(x)$ (a sketch follows this list).

  4. To address memory and compute limits, network depth (number of blocks) is increased on low-resolution feature maps (16x16 in the paper), and a downsampling layer is added at the very beginning of the model with an upsampling layer at the very end, avoiding computation at the highest resolution.

  5. Dropout is applied only on low-resolution feature maps.
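
A minimal PyTorch sketch of the multi-scale loss in item 3; the base resolution and the placeholder predictions are assumptions:

```python
import torch
import torch.nn.functional as F

def multiscale_loss(eps, eps_pred, base=32):
    """Sum of downsampled epsilon-prediction losses, weighted by 1/s.

    eps, eps_pred: (B, C, d, d) true and predicted noise at full resolution d.
    Downsampling (adaptive average pooling) is linear, so the downsampled
    prediction behaves like an s x s diffusion model.
    """
    d = eps.shape[-1]
    loss, s = 0.0, base
    while s <= d:
        e = F.adaptive_avg_pool2d(eps, s)
        p = F.adaptive_avg_pool2d(eps_pred, s)
        loss = loss + (1.0 / s) * F.mse_loss(p, e)
        s *= 2
    return loss
```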

 

SiD2

Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion

SiD2

  1. Loss Weight: sigmoid shift

  2. Flop-heavy scaling: grow the token sequence length rather than the model size.

  3. Residual U-ViT

 

Scaling-Law-1

On the Scalability of Diffusion-based Text-to-Image Generation

  1. For model scaling, we find the location and amount of cross attention distinguishes the performance of existing UNet designs. And increasing the transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel numbers.

  2. On the data scaling side, we show the quality and diversity of the training set matters more than simply dataset size. Increasing caption density and diversity improves text-image alignment performance and the learning efficiency.

 

Scaling-Law-2

Bigger is not Always Better: Scaling Properties of Latent Diffusion Models

  1. When operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results.

 

Scaling-Law-3

Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction

  1. We recommend diffusion for applications targeting image quality and low latency; and next-token prediction when prompt following or throughput is more important.

 

DiT-Scaling-Law

Scaling Laws For Diffusion Transformers

 

RDM

Relay Diffusion: Unifying diffusion process across resolutions for image synthesis

RDM-1

RDM-2

  1. Cascaded models perform better than end-to-end models under a fair setting. RDM also uses a cascaded scheme, but unlike traditional cascaded models, RDM cascades over timesteps, which reduces the number of training and sampling steps.

  2. Both the low and high resolutions use the EDM formulation, i.e. $x_t = x + \sigma\epsilon$.

  3. To relay at timestep $t$, the SNR of $x_t$ must match between the separately trained low- and high-resolution diffusion models, but the same noise level at a higher resolution results in a higher signal-to-noise ratio in the frequency domain. Instead of a shifted noise schedule, RDM proposes Block Noise: sample a noise at $64\times 64$ and upsample it to $256\times 256$ with Block Noise; after adding each noise to the image at its own resolution, the SNRs match.

  4. There is still a gap between the $x_t$ generated at low resolution and the true high-resolution $x_t$, so the high-resolution stage is modeled with blurring diffusion: the forward process not only adds noise but also blurs $x$, so the upsampled low-resolution $x_t$ can be treated as the noised version of a blurred high-resolution image.

  5. Note that the blurring diffusion is trained with Block Noise (rather than directly sampling $256\times 256$ noise), so that the upsampled $x_t$ can be fed straight into the blurring diffusion for generation.

 

U-ViT

All are Worth Words: A ViT Backbone for Diffusion Models

U-ViT

  1. ViT in Pixel Space

 

Diffusion-RWKV

Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models

Diffusion-RWKV

  1. RWKV brought improvements for standard RNN architecture, which is computed in parallel during training while inference like RNN. It involves enhancing the linear attention mechanism and designing the receptance weight key value (RWKV) mechanism.

 

DiT

Scalable Diffusion Models with Transformers

DiT

  1. ViT in Latent Space

  2. adaLN: instead of learning the scale and shift inside LayerNorm, an extra MLP (one per block) predicts a scale and shift from the timestep and condition.

  3. adaLN-Zero: an additional predicted scale is multiplied in before the skip connection, and the MLP is initialized so that this scale is zero, so every DiT block starts as the identity function, which helps training (see the sketch after this list).
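
A minimal PyTorch sketch of adaLN-Zero modulation around one sub-block; the module names and shapes are assumptions:

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """One DiT-style sub-block with adaLN-Zero modulation."""

    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mod = nn.Linear(cond_dim, 3 * dim)   # predicts shift, scale, gate
        nn.init.zeros_(self.mod.weight)           # zero init -> block starts as identity
        nn.init.zeros_(self.mod.bias)

    def forward(self, x, cond):
        shift, scale, gate = self.mod(cond).chunk(3, dim=-1)        # (B, dim) each
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        h, _ = self.attn(h, h, h)
        return x + gate.unsqueeze(1) * h          # gated residual (zero at init)
```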

 

DiT-MoE

Scaling Diffusion Transformers to 16 Billion Parameters

DiT-MoE

  1. In DiT, every $e$ blocks the MLP is replaced by an MoE with $K$ experts, each a smaller MLP; the top-$n$ experts are selected according to the router-predicted scores, plus $n_s$ shared experts that always participate, so $n + n_s$ experts are active in total. This approach enables super-linear scaling of the number of model parameters relative to the computational cost of inference and training.

  2. Increasing $K$ without increasing $n$ leaves training speed unchanged but improves quality.

  3. Besides the diffusion loss, a balance loss is added to avoid imbalanced experts.

 

EC-DiT

EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

  1. MoE

 

DyDiT

Dynamic Diffusion Transformer

DyDiT

  1. A slimmable neural network; a finer-grained form of MoE.

 

PoM

PoM: Efficient Image and Video Generation with the Polynomial Mixer

  1. We propose a drop-in replacement for MHA called the Polynomial Mixer (PoM) that has the benefit of encoding the entire sequence into an explicit state. PoM has a linear complexity with respect to the number of tokens.

 

U-DiT

U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers

U-DiT

  1. When the encoder processes the input image by downsampling the image as stage-level amounts, the decoder scales up the encoded image from the most compressed stage to input size. At each encoder stage transition, spatial downsampling by the factor of 2 is performed while the feature dimension is doubled as well. Skip connections are provided at each stage transition. The skipped feature is concatenated and fused with the upsampled output from the previous decoder stage, replenishing information loss to decoders brought by feature downsampling.

  2. Similar to ToDo, token downsampling is used to cut computation, but without discarding information: the $N\times N$ feature is downsampled (via depthwise convolution) into four $\frac{N}{2}\times\frac{N}{2}$ features, each of which runs self-attention independently, and the four results are reassembled back to $N\times N$, e.g. the first elements of the four outputs form a $2\times 2$ grid as the first element of the final result. The overall self-attention cost drops to roughly 1/4 with no information loss (see the sketch after this list). Unlike U-Net downsampling, we are not reducing or increasing the number of elements in the feature during the downsampling process. The substitution of downsampled self-attention for full-scale self-attention brings a slight improvement in the FID metric despite a significant reduction in FLOPs.
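
A minimal PyTorch sketch of splitting a feature map into four interleaved sub-grids for attention; the paper uses a depthwise convolution for the downsampling, whereas this sketch uses plain interleaved slicing for illustration:

```python
import torch

def split_grid_attention(x, attn):
    """Run self-attention on four interleaved N/2 x N/2 sub-grids, then reassemble.

    x:    (B, C, N, N) feature map (N even).
    attn: any module mapping (B, L, C) -> (B, L, C), e.g. a wrapper around
          nn.MultiheadAttention. Attention cost drops to ~1/4 of full-resolution
          attention, and no tokens are discarded.
    """
    B, C, N, _ = x.shape
    out = torch.empty_like(x)
    for di in (0, 1):
        for dj in (0, 1):
            sub = x[:, :, di::2, dj::2]                        # (B, C, N/2, N/2)
            tokens = sub.flatten(2).transpose(1, 2)            # (B, (N/2)^2, C)
            tokens = attn(tokens)
            out[:, :, di::2, dj::2] = tokens.transpose(1, 2).reshape(B, C, N // 2, N // 2)
    return out
```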

 

Switch-DiT

Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts

Switch-DiT

  1. Introduces MoE: every block uses a timestep-based gating network to predict a probability distribution and takes the top-K experts, which isolates parameters and alleviates conflicts between different timesteps.

 

SiT

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

  1. design space + DiT

 

HDiT

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

  1. In a vanilla Transformer every block costs the same; here the UNet idea is applied to the Transformer, reducing the number of tokens in the middle layers to cut computation.

  2. Following SimpleDiffusion, less computation is spent at high resolution, while the low-resolution part is made deeper and wider.

  3. As a result, the pixel-level computational complexity grows only linearly with image resolution.

 

EDT

EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching

EDT-1 EDT-2
  1. A combination of HDiT and U-ViT.

  2. Inspired by how humans sketch: first draw the whole (global), then a local region; after finishing a local region, check whether the whole remains coherent, then pick another local region to refine, and repeat. AMM is a mask computed from inter-token distances that turns global attention into local attention.

 

DiMR

Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

DiMR

  1. U-ViT

  2. Transformers trade off computation against quality: a small patch size means a long token sequence, more computation, and better results; a large patch size means a short token sequence, less computation, and worse results.

  3. Feature cascade: there are $R$ branches, all fed the same noised $x_t$. For the $r$-th branch, a $2^{R-r}\times 2^{R-r}$ convolution reduces the resolution of the input $x_t$; the branch output is upsampled by 2 and concatenated onto the input of the next branch. A diffusion loss is computed on the output of every branch, with the target being the same $\epsilon$ average-pooled down to that branch's resolution.

  4. The low-resolution branches use a U-ViT architecture, while the high-resolution branches use ConvNeXt to save computation.

 

Flag-DiT

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

Flag-DiT-1

Flag-DiT-2

  1. Flag-DiT substitutes all LayerNorm with RMSNorm to improve training stability. Moreover, it incorporates key-query normalization (KQ-Norm) before key-query dot product attention computation. The introduction of KQ-Norm aims to prevent loss divergence by eliminating extremely large values within attention logits.

  2. We introduce learnable special tokens including the [nextline] and [nextframe] tokens to transform training samples with different scales and durations into a unified one-dimensional sequence. We add [PAD] tokens to transform 1-D sequences into the same length for better parallelism.

  3. Since data of different modalities are all flattened into a single 1D sequence and modeled with 1D RoPE, the [nextline] and [nextframe] tokens are needed; if images are the only modality, 2D RoPE can be used and the [nextline] token becomes unnecessary. Essentially, the [nextline] and [nextframe] tokens compensate for the positional information lost when flattening high-dimensional modalities into 1D.

  4. When text is present, self-attention and cross-attention are applied in parallel.

 

Next-DiT

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Next-DiT

  1. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations.

  2. NTK

  3. Token Merge

 

FiT

FiT: Flexible Vision Transformer for Diffusion Model

FiT

  1. ViT in Latent Space

  2. No cropping; the aspect ratio is preserved and the image is resized so that $HW \le 256^2$. With 8x VAE downsampling and a patch size of 2, the token length is at most $\left(\frac{256}{8\cdot 2}\right)^2 = 256$; shorter sequences are padded to 256, pad tokens are masked out in MHSA, and the loss is computed only on unmasked tokens.

  3. 2D RoPE positional encoding is used; its extrapolation ability enables generation at arbitrary resolutions and aspect ratios.

 

VisionLLaMA

VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

  1. ViT in Latent Space

  2. a vision transformer architecture similar to LLaMA to reduce the architectural differences between language and vision.

 

DiG

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

DiG

  1. ViT in Latent Space

  2. DiT models have faced challenges with scalability and quadratic complexity efficiency. We leverage the long sequence modeling capability of Gated Linear Attention (GLA) Transformers, expanding its applicability to diffusion models and offering superior efficiency and effectiveness.

 

CLEAR

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

  1. By fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model.

 

Transfusion

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

 

MonoFormer

MonoFormer: One Transformer for Both Diffusion and Autoregression

 

CausalFusion

Causal Diffusion Transformers for Generative Modeling

CausalFusion

 

ACDiT

ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

ACDiT

 

DART

DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

 

Neural-RDM

Neural Residual Diffusion Models for Deep Scalable Vision Generation

 

Mamba

DiS

Scalable Diffusion Models with State Space Backbone

 

DiffuSSM

Diffusion Models Without Attention

 

ZigMa

ZigMa: Zigzag Mamba Diffusion Model

 

DiM

DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

 

DiM

Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

 

Dimba

Dimba: Transformer-Mamba Diffusion Models

 

LaMamba-Diff

LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba

 

LinFusion

LinFusion: 1 GPU, 1 Minute, 16K Image

  1. For conventional models up to DiT, computation grows quadratically with the token length as resolution increases.

  2. Drawing on Mamba2, RWKV6, GLA, etc., we introduce a generalized linear attention paradigm, so computation grows only linearly with resolution (token length).

  3. The model is trained by distillation from SD and achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity.

  4. It extrapolates to zero-shot cross-resolution generation.

 

Others

CrossFlow

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

CrossFlow

  1. For each training sample, we start with an input-target pair (x,z1). We apply the VE to x to encode it to a latent z0 with the same shape as z1. Next, we employ a transformer model vθ trained for flow matching. The VE can be trained prior to training vθ or concurrently.

 

Infinite-Diff

Infinite-Diff: Infinite Resolution Diffusion with Subsampled Mollified States

Infinite-Diff

  1. Infinite-Diff is a generative diffusion model defined in an infinite dimensional Hilbert space, which can model infinite resolution data. By training on randomly sampled subsets of coordinates and denoising content only at those locations, we learn a continuous function for arbitrary resolution sampling.

 

DoD

Diffusion Models Need Visual Priors for Image Generation

DoD

  1. DoD enhances diffusion models by recurrently incorporating previously generated samples as visual priors to guide the subsequent sampling process.

  2. We propose the Latent Embedding Module (LEM) that filters the conditional information using a compression-reconstruction approach to discard redundant details. We reasonably assume that the high-level semantic information extracted from generated images is similar to that obtained from real images. This assumption allows us to use the latents of ground truth images as inputs to LDM during training, simplifying the training strategy. Such simplification allows end-to-end training of DoD on image latents and joint optimization of the backbone model and LEM.

 

INFD

Image Neural Field Diffusion Models

INFD

  1. Neural field is also known as Implicit Neural Representations (INR), which represents signals as coordinate-based neural networks.

  2. Proposes an Image Neural Field Autoencoder so that there is a latent distribution that can be modeled and sampled.

  3. Similar to Diff-AE and PDAE, a diffusion model is used to model the latent distribution.

 

CAN

Condition-Aware Neural Network for Controlled Image Generation

CAN-1

CAN-2

  1. In conventional conditional models, all conditions share the same static condition-processing network, which limits modeling capacity. One remedy is an expert model per condition, but that is extremely expensive. CAN instead learns a generator network that dynamically produces the parameters of the condition-processing layers: it introduces a condition-aware weight generation module that generates conditional weights for convolution/linear layers based on the input condition.

  2. Making depthwise convolution layers, the patch embedding layer, and the output projection layers condition-aware brings a significant performance boost.

  3. CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT, on class conditional image generation on ImageNet and text-to-image generation on COCO.

 

Effectiveness and Efficiency Enhancement

hybrid generative model

D2C

D2C: Diffusion-Decoding Models for Few-Shot Conditional Generation

A diffusion model is used to model the autoencoder's latent distribution.

 

DDGAN

Tackling the Generative Learning Trilemma with Denoising Diffusion GANs

DDGAN-1 DDGAN-2

 

  1. Addresses the problem that the original diffusion formulation becomes intractable to train with large step sizes; see Chapter 6 of the theory notes.

  2. With large step sizes, $q(x_{t-1}|x_t)$ is no longer Gaussian but more complex and multimodal, so adversarial learning can be used to fit $p_\theta(x_{t-1}|x_t)$, i.e. $\min_\theta \sum_{t\ge 1}\mathbb{E}_{q(x_t)}\big[D_{adv}\big(q(x_{t-1}|x_t)\,\|\,p_\theta(x_{t-1}|x_t)\big)\big]$, which becomes jointly training a discriminator $\min_\phi \sum_{t\ge 1}\mathbb{E}_{q(x_t)}\big\{\mathbb{E}_{q(x_{t-1}|x_t)}[-\log D_\phi(x_{t-1},x_t,t)] + \mathbb{E}_{p_\theta(x_{t-1}|x_t)}[-\log(1-D_\phi(x_{t-1},x_t,t))]\big\}$ and a generator $\max_\theta \sum_{t\ge 1}\mathbb{E}_{q(x_t)}\mathbb{E}_{p_\theta(x_{t-1}|x_t)}[\log D_\phi(x_{t-1},x_t,t)]$.

  3. $q(x_t)q(x_{t-1}|x_t) = q(x_{t-1},x_t) = \int q(x_0,x_{t-1},x_t)\,dx_0 = \int q(x_0)q(x_{t-1}|x_0)q(x_t|x_{t-1},x_0)\,dx_0 = \int q(x_0)q(x_{t-1}|x_0)q(x_t|x_{t-1})\,dx_0$, so one first noises $x_0$ to $x_{t-1}$, then noises $x_{t-1}$ to $x_t$.

  4. $p_\theta(x_{t-1}|x_t) = \int p_\theta(x_0|x_t)\,q(x_{t-1}|x_t,x_0)\,dx_0 = \int p(z)\,q\big(x_{t-1}|x_t, x_0 = G_\theta(z,x_t,t)\big)\,dz$, so the generator produces $x_0$ from $x_t$ and a latent $z$, and $q(x_{t-1}|x_t,x_0)$ is then used to sample an $x_{t-1}$.

  5. The discriminator judges whether $x_{t-1}$ is real or fake. Note that for different $t$, $x_t$ has different levels of perturbation, and hence using a single network to predict $x_{t-1}$ directly at different $t$ may be difficult. However, in our case the generator only needs to predict the unperturbed $x_0$ and then add back perturbation using $q(x_{t-1}|x_t,x_0)$.

  6. The reverse process of DDPMs can also be interpreted as $p_\theta(x_{t-1}|x_t) = q\big(x_{t-1}|x_t, x_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}\big)$. It is always a Gaussian, which works poorly when steps are large and few. The difference from DDGAN lies in whether the predicted $x_0$ is deterministic: with small steps $p_\theta(x_{t-1}|x_t)$ is Gaussian, so the predicted $x_0$ is deterministic, whereas DDGAN uses large steps and its predicted $x_0$ is stochastic, making $p_\theta(x_{t-1}|x_t)$ multimodal.

  7. GANs are known to suffer from training instability and mode collapse, and some possible reasons include the difficulty of directly generating samples from a complex distribution in one-shot, and the overfitting issue when the discriminator only looks at clean samples. Our model breaks the generation process into several conditional denoising diffusion steps in which each step is relatively simple to model, due to the strong conditioning on xt. Moreover, the diffusion process smoothens the data distribution, making the discriminator less likely to overfit.

 

DiffuseVAE

DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents

DiffuseVAE

ϵ-VAE

epsilon-VAE: Denoising as Visual Decoding

epsilon-VAE

 

FM-Boosting

Boosting Latent Diffusion with Flow Matching

FM-Boosting

  1. LDM inference cost grows quadratically with image resolution.

  2. Flow Matching is used to model the mapping between the upsampled low-resolution latent and the high-resolution latent: generate with a low-resolution LDM, then convert to high resolution with Flow Matching.

  3. Ordinary Flow Matching models a path between the data distribution and a Gaussian; here it is modeled between data pairs, hence Coupling Flow Matching.

 

PDM

Ultra-High-Resolution Image Synthesis with Pyramid Diffusion Model

  1. Generates high-resolution images progressively, in a similar spirit to FM-Boosting.

 

refinement of network architectures

LDM

High-Resolution Image Synthesis with Latent Diffusion Models

 

LiteVAE

LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models

LiteVAE

  1. We leverage the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality.

 

Wuerstchen

Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

 

SimpleDiffusion

Simple Diffusion: End-to-end diffusion for high resolution images

 

BK-SDM

BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion

  1. compact UNet:fewer blocks in the down and up stages (Base), further removal of the entire mid-stage (Small), further removal of the innermost stages (Tiny).

  2. Distillation-based retraining: besides the diffusion loss, the trained full-size StableDiffusion can be used for output-level distillation (MSE loss between outputs for the same input) and feature-level distillation (MSE loss between network features for the same input).

 

KOALA

KOALA: Fast and Memory-Efficient Latent Diffusion Models via Self-Attention Distillation

KOALA

  1. Same approach as BK-SDM.

  2. Improves feature-level distillation further: testing distillation on features from different modules shows that distilling the self-attention output features works best, especially those from the early decoder blocks.

 

DuoDiff

DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach

DuoDiff

 

HDiT

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

 

SiT

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

 

DiG

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

 

BLAST

BLAST: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference

 

HFDM

Scalable, Tokenization-Free Diffusion Model Architectures with Efficient Initial Convolution and Fixed-Size Reusable Structures for On-Device Image Generation

 

SnapFusion

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

 

SnapGen

SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

 

MobileDiffusion

MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices

  1. lightweight model architecture + DiffusionGAN + distillation

 

Spiking Network

Spiking Diffusion Models

Spiking-Diffusion Vector Quantized Discrete Diffusion Model with Spiking Neural Networks

Spiking Denoising Diffusion Probabilistic Models

Fully Spiking Denoising Diffusion Implicit Models

SDiT: Spiking Diffusion Model with Transformer

 

NAS

AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration

Lightweight Diffusion Models with Distillation-Based Block Neural Architecture Search

 

DDSM

Denoising Diffusion Step-aware Models

  1. Different steps have different importance; there is no need to use a large model at every step.

  2. Slimmable network: a neural network that can be executed at arbitrary model sizes.

  3. The optimal sampling strategy is searched so that different steps use models of different sizes, reducing computation.

 

ScaleLong

ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection

  1. A theoretical analysis, from the feature-norm perspective, of how the UNet's long skip connection coefficients affect training stability.

 

Q-DM

Q-DM: An Efficient Low-bit Quantized Diffusion Model

 

DiffuSSM

Diffusion Models Without Attention

 

EDM-2

Analyzing and Improving the Training Dynamics of Diffusion Models

  1. We update all of the operations (e.g., convolutions, activations, concatenation, summation) to maintain magnitudes on expectation.

 

ReDistill

ReDistill: Residual Encoded Distillation for Peak Memory Reduction

  1. Reducing peak memory, which is the maximum memory consumed during the execution of a neural network, is critical to deploy neural networks on edge devices with limited memory budget. We propose residual encoded distillation (ReDistill) for peak memory reduction in a teacher-student framework, in which a student network with less memory is derived from the teacher network using aggressive pooling.

 

Quantum

Quantum Denoising Diffusion Models

Quantum Generative Diffusion Model

Towards Efficient Quantum Hybrid Diffusion Models

Quantum Hybrid Diffusion Models for Image Synthesis

Enhancing Quantum Diffusion Models with Pairwise Bell State Entanglement

Mixed-State Quantum Denoising Diffusion Probabilistic Model

Quantum computing.

 

Optical

Optical Diffusion Models for Image Generation

Optical computing.

 

pre-trained network compression

quantization

PTQ

Post-training Quantization on Diffusion Models

 

Q-Diffusion

Q-Diffusion: Quantizing Diffusion Models

 

BiDM

BiDM: Pushing the Limit of Quantization for Diffusion Models

 

SD-PTQ

Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models

 

LDM-PTQ

Efficient Quantization Strategies for Latent Diffusion Models

 

EMF

Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion

 

TerDiT

TerDiT: Ternary Diffusion Models with Transformers

 

VQ4DiT

VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers

 

PTQ4DiT

PTQ4DiT: Post-training Quantization for Diffusion Transformers

 

DPQ

Diffusion Product Quantization

 

DiTAS

DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing

 

HQ-DiT

HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization

 

SVDQuant

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

 

StableQ

StableQ: Enhancing Data-Scarce Quantization with Text-to-Image Data

 

BinaryDM

BinaryDM: Towards Accurate Binarization of Diffusion Model

 

APQ-DM

Towards Accurate Post-training Quantization for Diffusion Models

 

StepbaQ

StepbaQ: Stepping backward as Correction for Quantized Diffusion Models

 

TFMQ-DM

TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models

 

EfficientDM

EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models

 

EDA-DM

Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models

 

Memory-Efficient

Memory-Efficient Personalization using Quantized Diffusion Model

fine-tune quantized diffusion model

 

MixDQ

MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

 

MPQ-DM

MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models

 

COMQ

COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization

 

QNCD

QNCD: Quantization Noise Correction for Diffusion Models

 

TAC-Diffusion

Timestep-Aware Correction for Quantized Diffusion Models

 

TCAQ-DM

TCAQ-DM: Timestep-Channel Adaptive Quantization for Diffusion Models

 

TFM

Temporal Feature Matters: A Framework for Diffusion Model Quantization

 

network pruning

DiffPruning

Structural Pruning for Diffusion Models

 

DiffPruning

Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection

 

LTDDPM

Successfully Applying Lottery Ticket Hypothesis to Diffusion Model

 

LAPTOP-Diff

LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models

 

LD-Pruner

LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights

 

LayerMerge

LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging

 

DiP-GO

DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization

 

SD2-Pruning

Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion

 

SVS

Singular Value Scaling: Efficient Generative Model Compression via Pruned Weights Refinement

 

Distillation

DKDM

DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture

DKDM

  1. The student is trained so that its output (predicted noise) aligns with the teacher's output.

 

enhancement of training

novel model family

iDDPM

Improved Denoising Diffusion Probabilistic Models

  1. They achieve similar sample quality using either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde\beta_t$, which are the upper and lower bounds on the variance corresponding to $q(x_0)$ being isotropic Gaussian noise or a delta function ($\tilde\beta_1 = 0$), respectively. The variance is parameterized as an interpolation between $\beta_t$ and $\tilde\beta_t$ in the log domain: $\Sigma_\theta(x_t,t) = \exp\big(v\log\beta_t + (1-v)\log\tilde\beta_t\big)$. No constraint is applied to the model output $v$, theoretically allowing variances outside the interpolated range, but the network was not observed to do this in practice, suggesting that the bounds for $\Sigma_\theta(x_t,t)$ are indeed expressive enough (see the sketch after this list).

  2. Besides the diffusion loss, an extra $0.001\,L_{vlb}$ term is optimized; a stop-gradient is applied to the $\mu_\theta(x_t,t)$ output for the $L_{vlb}$ term.

  3. Sampling $t$ uniformly causes unnecessary noise in $L_{vlb}$; instead, the sampling probability of $t$ is set according to the historical proportion of $L_t$ among all $L_t$ values.
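
A minimal PyTorch sketch of the log-domain variance interpolation from item 1; the schedule tensors are assumed to be precomputed:

```python
import torch

def interpolated_variance(v, t, betas, betas_tilde):
    """Sigma_theta = exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t)).

    v:           (B, C, H, W) interpolation weight, assumed already mapped to [0, 1]
                 (iDDPM maps the raw network output from [-1, 1] to [0, 1]).
    betas:       (T,) forward-process variances beta_t.
    betas_tilde: (T,) posterior variances beta_tilde_t (clamped away from 0 for the log).
    """
    log_beta = torch.log(betas[t]).view(-1, 1, 1, 1)
    log_beta_tilde = torch.log(betas_tilde[t].clamp(min=1e-20)).view(-1, 1, 1, 1)
    return torch.exp(v * log_beta + (1.0 - v) * log_beta_tilde)
```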

 

PD

Progressive Distillation for Fast Sampling of Diffusion Models

  1. v-prediction, which is theoretically equivalent to $\epsilon$-prediction since the two can be converted into each other.

 

FM

Flow Matching for Generative Modeling

  1. Based on Continuous Normalizing Flows (Neural ODEs). CNFs are trained by first transforming data samples with the model (ODE simulation) and then optimizing the KL divergence between the transformed samples and the standard Gaussian; flow matching is simulation-free because the ODE path is defined in advance.

  2. Diffusion models and score-based models fit $\nabla_{x_t}\log p_t(x_t)$, while flow matching fits $\frac{d}{dt}x_t$; if the same diffusion kernel is used, flow matching is theoretically equivalent to diffusion and score-based models (since $v_\theta$, $\epsilon_\theta$, and $s_\theta$ can all be converted into one another, similar to v-prediction in PD).

  3. We find this training alternative to be more stable and robust in our experiments to existing score matching approaches.

 

RectFlow

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

  1. Can model a transport between any two distributions.

  2. Randomly pick data $X_0$ and $X_1$ from the two distributions (no pairing required) and use the simplest interpolation $X_t = tX_1 + (1-t)X_0$, so that $\frac{d}{dt}X_t = X_1 - X_0$; the loss is $\int_0^1 \mathbb{E}\,\|(X_1 - X_0) - v_\theta(X_t, t)\|^2\,dt$ (see the sketch after this list).

  3. After each round of training, sample from the current flow to obtain coupled data, then retrain on these couplings; iterating this procedure straightens the trajectories into flows without crossings.
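
A minimal PyTorch sketch of one training step for the loss in item 2; `v_model` is a placeholder velocity network:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(v_model, x0, x1):
    """One training step of rectified flow on the linear interpolation.

    x0: samples from the source distribution (e.g. Gaussian noise).
    x1: samples from the target distribution (data); no pairing is required.
    """
    t = torch.rand(x1.shape[0], device=x1.device)          # t ~ U(0, 1)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))               # broadcast to the data shape
    x_t = t_ * x1 + (1.0 - t_) * x0                        # X_t = t*X1 + (1-t)*X0
    target = x1 - x0                                       # d/dt X_t
    return F.mse_loss(v_model(x_t, t), target)
```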

 

EDM

Elucidating the Design Space of Diffusion-Based Generative Models

  1. $x_t = x_0 + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma_t^2 I)$, and $D_\theta(x_t, \sigma_t) \approx x_0$.

  2. Preconditioning: as the input $x_t$ is a combination of clean signal and noise, its magnitude varies immensely depending on the noise level $\sigma_t$. For this reason, the common practice is to not represent $D_\theta$ as a neural network directly, but instead train a different network $F_\theta$ from which $D_\theta$ can be derived. VE trains $F_\theta$ to predict $\epsilon$ scaled to unit variance, from which the signal is then reconstructed via $D_\theta(x_t,\sigma_t) = x_t - \sigma_t F_\theta(x_t,\sigma_t)$. This has the drawback that at large $\sigma$, the network needs to fine-tune its output carefully to cancel out the existing noise $\epsilon$ exactly and give the output at the correct scale; note that any errors made by the network are amplified by a factor of $\sigma_t$. In this situation, it would seem much easier to predict the expected output $D_\theta(x_t,\sigma_t)$ directly. To this end, we propose to precondition the neural network with a $\sigma$-dependent skip connection that allows it to estimate either $x_0$ or $\epsilon$, or something in between: $D_\theta(x_t,\sigma_t) = c_{skip}(\sigma_t)\,x_t + c_{out}(\sigma_t)\,F_\theta\big(c_{in}(\sigma_t)\,x_t, \sigma_t\big) \approx x_0$, i.e. $F_\theta\big(c_{in}(\sigma_t)x_t,\sigma_t\big) \approx \frac{1}{c_{out}(\sigma_t)}\big(x_0 - c_{skip}(\sigma_t)\,x_t\big)$. We choose $c_{in}$ and $c_{out}$ to make the network inputs and training targets have unit variance, and $c_{skip}$ to amplify the errors of $F_\theta$ as little as possible. Other diffusion models always have $c_{skip}=1$ (a sketch is given after this list).

  3. augmentation: To prevent potential overfitting that often plagues diffusion models with smaller datasets, we apply various geometric transformations to a training image prior to adding noise. To prevent the augmentations from leaking to the generated images, we provide the augmentation parameters as a conditioning input to Fθ; during inference we set the them to zero to guarantee that only non-augmented images are generated. (macro conditioning as augmentation)
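
A minimal PyTorch sketch of the preconditioning wrapper; the coefficient formulas follow the EDM paper with $\sigma_{data}$ the data standard deviation, and the inner network `F_model` is a placeholder:

```python
import torch

def edm_precondition(F_model, x_t, sigma, sigma_data=0.5):
    """D(x; sigma) = c_skip * x + c_out * F(c_in * x, c_noise)."""
    sigma = sigma.view(-1, 1, 1, 1)
    c_skip = sigma_data ** 2 / (sigma ** 2 + sigma_data ** 2)
    c_out = sigma * sigma_data / torch.sqrt(sigma ** 2 + sigma_data ** 2)
    c_in = 1.0 / torch.sqrt(sigma ** 2 + sigma_data ** 2)
    c_noise = 0.25 * torch.log(sigma).flatten()        # noise-level conditioning
    return c_skip * x_t + c_out * F_model(c_in * x_t, c_noise)
```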

 

VDM

Variational Diffusion Models

efficient optimization of the noise schedule jointly with the rest of the model

 

VDM++

Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation

 

DiffEnc

DiffEnc: Variational Diffusion with a Learned Encoder

 

SPD

Image generation with shortest path diffusion

 

optimal noise schedule

Cosine

Improved Denoising Diffusion Probabilistic Models

  1. Unlike the linear schedule, the cosine schedule directly defines $\bar\alpha_t = \cos^2\!\big(\frac{t + 0.008}{1.008}\cdot\frac{\pi}{2}\big)$ (with $t$ normalized to $[0,1]$) and then computes $\beta_t$ from it (a sketch follows).
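
A minimal sketch computing the discrete $\bar\alpha_t$ and $\beta_t$ for this schedule (the clipping of $\beta_t$ at 0.999 follows the iDDPM paper):

```python
import math
import torch

def cosine_schedule(T, s=0.008):
    """Return (alpha_bar, betas) for the cosine noise schedule."""
    t = torch.linspace(0, T, T + 1) / T                   # normalized timesteps in [0, 1]
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]                                  # alpha_bar at t=0 is 1
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]            # beta_t = 1 - abar_t / abar_{t-1}
    return alpha_bar[1:], betas.clamp(max=0.999)
```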

 

Laplace

Improved Noise Schedule for Diffusion Training

  1. $\mathrm{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2}$, $\lambda(t) = \log\mathrm{SNR}(t)$, $\alpha_t^2 = \frac{\exp(\lambda(t))}{\exp(\lambda(t))+1}$, $\sigma_t^2 = \frac{1}{\exp(\lambda(t))+1}$, $\lambda(t) = \mu - b\,\mathrm{sgn}(0.5-t)\log\big(1-2|t-0.5|\big)$, $p(\lambda) = \frac{1}{2b}\exp\big(-\frac{|\lambda-\mu|}{b}\big)$.

 

FixFlaw

Common Diffusion Noise Schedules and Sample Steps are Flawed

FixFlaw-1

FixFlaw-2

  1. With StableDiffusion's noise schedule, the final noising step is $z_T = 0.068265\,z_0 + 0.997667\,\epsilon$, which is not a standard Gaussian: $z_T$ still contains some information and its mean is non-zero; a black (-1) image has negative mean and a white (1) image has positive mean. At generation time a random $z_T$ with zero mean is used, so all generated images have medium brightness. Moreover, if a random $\epsilon$ is used to noise an image to $z_T$, generating from that $z_T$ and generating from $\epsilon$ give different results; see Magic-Fixup.

  2. The fix is to enforce zero terminal SNR, i.e. $\bar\alpha_T = 0$. The existing noise schedule is corrected by rescaling: keep $\bar\alpha_1$ unchanged, set $\bar\alpha_T = 0$, rescale $\bar\alpha_t$ for the intermediate timesteps accordingly, and then retrain the model (a sketch is given after this list).

  3. With a zero-terminal-SNR schedule, the denoising loss at time $T$ becomes meaningless, so the v-prediction parameterization from PD is recommended: $v_t = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,x_0$, so that $v_T = -x_0$ and $v_1 = \sqrt{\bar\alpha_1}\,\epsilon - \sqrt{1-\bar\alpha_1}\,x_0$, making the prediction meaningful at every timestep.

  4. Combining the two points above, the existing StableDiffusion can be fine-tuned with the corrected noise schedule and v-prediction, with comparable quality.

  5. Rescale Classifier-Free Guidance: with a zero-terminal-SNR schedule, the usual CFG becomes sensitive and causes over-exposed images, so the CFG output is rescaled.
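
A minimal sketch of the schedule rescaling in item 2; the paper's algorithm operates on $\sqrt{\bar\alpha_t}$ in the same shift-and-scale fashion:

```python
import torch

def rescale_zero_terminal_snr(alphas_cumprod):
    """Rescale a schedule so that alpha_bar_T = 0 (zero terminal SNR).

    Keeps sqrt(alpha_bar_1) fixed, shifts sqrt(alpha_bar_T) to 0, and linearly
    rescales the values in between, following the fix described above.
    """
    sqrt_ab = alphas_cumprod.sqrt()
    ab_1, ab_T = sqrt_ab[0].clone(), sqrt_ab[-1].clone()
    sqrt_ab = sqrt_ab - ab_T                              # shift: last value becomes 0
    sqrt_ab = sqrt_ab * ab_1 / (ab_1 - ab_T)              # scale: first value is unchanged
    return sqrt_ab ** 2
```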

 

SingDiffusion

Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models

SingDiffusion-1 SingDiffusion-2
  1. Existing models cannot generate very bright or very dark images.

  2. Because existing models lack zero terminal SNR, they are essentially trained on $(0, 1-\epsilon]$; one can therefore fine-tune a pre-trained diffusion model with one extra step, or train a separate diffusion model for just the final step. As in FixFlaw, since the denoising loss at time $T$ is meaningless, this step is trained with x-prediction.

  3. At sampling time, the first step starts from a Gaussian and samples an $x_{1-\epsilon}$; the rest proceeds as before.

 

SODC

Score-Optimal Diffusion Schedules

 

optimal loss weighting

P2-weighting

Perception Prioritized Training of Diffusion Models

 

Debias

Debias the Training of Diffusion Models

Similar to P2-weighting.

$\hat x_0 = \frac{1}{\sqrt{\bar\alpha_t}}x_t - \frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\epsilon_\theta(x_t,t) = \frac{1}{\sqrt{\bar\alpha_t}}\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\big) - \frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\epsilon_\theta(x_t,t) = x_0 + \frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\big(\epsilon - \epsilon_\theta(x_t,t)\big) = x_0 + \frac{1}{\sqrt{\mathrm{SNR}(t)}}\big(\epsilon - \epsilon_\theta(x_t,t)\big)$

Using $\frac{1}{\sqrt{\mathrm{SNR}(t)}}$ as the weight of $\|\epsilon_\theta(x_t,t) - \epsilon\|_2^2$ makes $\hat x_0$ closer to $x_0$.
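
A minimal sketch of this weighting applied to the noise-prediction loss; `alpha_bar` is the assumed cumulative schedule:

```python
import torch

def debiased_eps_loss(eps, eps_pred, t, alpha_bar):
    """Epsilon MSE weighted by 1/sqrt(SNR(t)) = sqrt(1 - alpha_bar_t) / sqrt(alpha_bar_t)."""
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)
    weight = torch.sqrt(1.0 - a_bar) / torch.sqrt(a_bar)     # 1 / sqrt(SNR(t))
    return (weight * (eps_pred - eps) ** 2).mean()
```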

 

SpeeD

A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

  1. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training.

  2. We design an asymmetric time step sampling strategy that reduces the frequency of time steps from the convergence area while increasing the sampling probability for time steps from other areas.

 

optimal timestep sampling

B-TTDM

Beta-Tuned Timestep Diffusion Model

  1. The distribution variations are non-uniform throughout the diffusion process and the most drastic variations in distribution occur in the initial stages.

  2. We propose a novel timestep sampling strategy that utilizes the beta distribution.

  3. B-TTDM not only improves the quality of the generated samples but also speedups the training process.

 

AdaTS

Adaptive Non-uniform Timestep Sampling for Accelerating Diffusion Model Training

AdaTS

 

Multi-Task Learning

MoE

Multi-Architecture Multi-Expert Diffusion Models

Addressing Negative Transfer in Diffusion Models

eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Not All Steps are Equal: Efficient Generation with Progressive Diffusion Models

Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture

Denoising Task Routing for Diffusion Models

Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising

  1. The fact that a diffusion model has to recognize and handle every level of noise means it needs a large number of parameters to train.

  2. Using different specialized networks (architectures) for different timesteps (experts) lowers the learning difficulty and reduces the parameter count per model.

 

DeMe

Decouple-Then-Merge: Towards Better Training for Diffusion Models

  1. Split the timesteps evenly into N ranges; after the diffusion model is trained, make N copies, fine-tune each on one time range, and then merge them back into a single model for inference.

 

Dual-Output

Dynamic Dual-Output Diffusion Models

dual-output

Cas-DM

Bring Metric Functions into Diffusion Models

Cas-DM

 

DBCL

Denoising Task Difficulty-based Curriculum for Training Diffusion Models

  1. Different timesteps of a diffusion model have different learning difficulty. Split the timesteps evenly into 20 intervals and train a separate model on each (20 models in total), then measure their convergence speed. Both in loss and in sample quality (mixed sampling: use a normally trained diffusion model and switch to the interval-specific model only inside its interval), convergence is faster for larger timesteps.

  2. Curriculum Learning: a method of training models in a structured order, starting with easier tasks or examples and gradually increasing difficulty. Accordingly, after partitioning the timesteps, training starts from the last (largest-t) interval and proceeds toward earlier intervals; each stage still trains on the timesteps of the previously covered intervals to avoid forgetting.

  3. Faster convergence and better generation quality.

 

REPA

Representation Alignment for Generation Training Diffusion Transformers Is Easier Than You Think

REPA

  1. Somewhat similar to the MAE loss in MaskDiT; it improves generation quality.

 

Masked Diffusion

MaskDM

Masked Diffusion Models are Fast Learners

Uses the U-ViT architecture (pixel space) with a mask ratio of up to 90%; it converges 4x faster than DDPM with better generation quality.

MaskDM

 

MDT

Masked Diffusion Transformer is a Strong Image Synthesizer

Uses the DiT architecture (latent space). To mitigate the distribution shift between masked training and unmasked inference, a side-interpolater fills in the masked tokens during training; it converges 3x faster than DiT with better generation quality.

 

MaskDiT

Fast Training of Diffusion Models with Masked Transformers

MaskDiT

  1. Uses the DiT architecture (latent space); the DiT encoder can be scaled up, while the DiT decoder is a fixed stack of 8 DiT blocks.

  2. Predicting the score of the invisible tokens from the visible tokens alone is too hard, so the diffusion loss is split: visible tokens use the standard diffusion loss, while invisible tokens use an MSE loss against the corresponding noisy patches (note: the target is the noised invisible patch itself, not its noise or the clean patch), similar to MaskDM + MAE (a minimal sketch follows this list).

  3. With a mask ratio of up to 50%, training converges 3x faster than DiT at the same generation quality.

  4. The MAE term is essential: without it, generation quality drops a lot; but if its loss coefficient is too large it also hurts generation, so the coefficient must be chosen carefully. Without the MAE reconstruction task, the training easily overfits the local subset of unmasked tokens as it lacks a global understanding of the full image, making the gradient update less informative. Understanding aids generation.
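
A minimal sketch of the split loss in item 2 (diffusion loss on visible tokens, MAE-style MSE toward the noisy patches on masked tokens); tensor shapes and the loss coefficient are assumptions:

```python
import torch

def maskdit_loss(pred, eps, x_t_tokens, mask, lambda_mae=0.1):
    """pred / eps / x_t_tokens: (B, N, D) token tensors; mask: (B, N) bool, True = masked.
    Visible tokens get the usual eps-prediction loss; masked tokens regress the
    *noisy* patches of x_t (not the noise, not the clean patch)."""
    vis = ~mask
    diff_loss = ((pred[vis] - eps[vis]) ** 2).mean()
    mae_loss = ((pred[mask] - x_t_tokens[mask]) ** 2).mean()
    return diff_loss + lambda_mae * mae_loss
```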

 

SD-DiT

SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

SD-DiT

  1. DiT encoder可以scaling up,DiT decoder使用固定的8个DiT block。

  2. DiT decoder的输入插入的不是learnable mask token,而是直接插入invisible patch,diffusion loss在所有patch上计算,而不是像MaskDiT那样只根据visible token去预测invisible patch。

  3. 这样去掉了MAE,没有了理解就无法辅助生成,所以引入self-distilling模块。encoder的每个token处的最后一层输出再经过一个MLP+softmax预测一个K维的分布,以teacher encoder处的预测结果为label计算cross-entropy loss,只计算所有unmask token和class token处的cross-entropy loss。

 

 

Patch

PatchDiffusion

Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models

  1. x进行random crop得到xi,j,s,其中i,j为左上角坐标,s为patch size,对xi,j,s​进行加噪训练EDM,i,j,s也作为输入条件。

  2. EDM only sees local patches and may have not captured the global cross-region dependency between local patches, in other words, the learned scores from nearby patches should form a coherent score map to induce coherent image sampling. To resolve this issue, we propose two strategies: 1) random patch sizes and 2) involving a small ratio of full-size images.

  3. 采样时分patch采样后拼在一起。

  4. Through Patch Diffusion, we could achieve 2 faster training, while maintaining comparable or better generation quality.

 

Patch-DM

Patched Denoising Diffusion Models For High-Resolution Image Synthesis

Patch-DM

  1. Rather than using entire complete images for training, our model only takes patches for training and inference and uses feature collage to systematically combine partial features of neighboring patches.

  2. 训练和推导时,第一种方法是将xt分成patch,每个patch输入模型单独预测;第二种方法是将xt分成patch,相邻patch输入模型预测公共部分;第三种方法将第二种方法细化到UNet的feature上。

 

novel diffusion formula

一些具体任务本身就是某种过程,可以设计不同的马尔科夫转移链进行训练。

ShiftDDPMs

ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories

 

ContextDiff

Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing

 

GUD

GUD: Generation with Unified Diffusion

  1. The choice of representation in which the diffusion process operates (e.g. pixel-, PCA-, Fourier-, or wavelet-basis).

  2. The prior distribution that data is transformed into during diffusion.

  3. The scheduling of noise levels applied separately to different parts of the data, captured by a component-wise noise schedule.

 

CARD

CARD: Classification and Regression Diffusion Models

  1. 公式类似PriorGrad,diffusion model输出regression的值或者classification的概率。

 

ExposureDiffusion

ExposureDiffusion: Learning to Expose for Low-light Image Enhancement

  1. 把照相机图像曝光过程当做一种扩散过程进行建模。

 

RDDM

Residual Denoising Diffusion Models

  1. 类似ResShift。

 

Beta-Diffusion

Beta Diffusion

  1. Beta分布,优化KL-divergence upper bounds。

 

others

FDM

Fast Diffusion Model

  1. 与SGD建立联系,引入momentum,加快训练和采样。

 

DDDM

Directly Denoising Diffusion Model

DDDM

  1. DDDMs train the diffusion model conditioned on an estimated target that was generated from previous training iterations of its own.

  2. Define fθ(x0,xt,t)=xt+t012β(s)[xsxslogq(xs)]ds as the solution of the VP PF ODE from initial time t to final time 0,使用神经网络表示Fθ(x0,xt,t)=t012β(s)[xsxslogq(xs)]ds,所以fθ(x0,xt,t)=xtFθ(x0,xt,t)。虽然使用了PF ODE,但不需要预训练的score model。

 

DeeDiff

DeeDiff: Dynamic Uncertainty-Aware Early Exiting for Accelerating Diffusion Model Generation

  1. early exiting策略:The basic assumption of early exiting is that the input samples of the test set are divided into easy and hard samples. The computation for easy samples terminates once some conditions are satisfied.

  2. 使用U-ViT,在训练时为每层的输出额外训练一个uncertainty estimation module,用于评估当前层作为输出的不确定度。UEM是一个预测标量的MLP,目标是当前层的输出与最后一层的输出的MSE Loss。

  3. 推理时,对于采样的每一步,一旦某层的输出的不确定度低于给定阈值,就将该层输出作为最终输出,达到加速的效果。

 

DMP

Diffusion Model Patching via Mixture-of-Prompts

DMP

  1. 为预训练好的DiT的每层block额外训练一组参数pi,其和输入xi维度相同,类似positional embedding一样加在xi上。

  2. The same prompts are used for each block throughout the training, thus they will learn knowledge that is agnostic to denoising stages. To patch the model with stage-specific knowledge, we introduce dynamic gating. This mechanism blends prompts in varying proportions based on the noise level of an input image. 学习一个gating网络,xi=blocki(σ(G([t;i]))pi1+xi1)

 

Compensation

Compensation Sampling for Improved Convergence in Diffusion Models

  1. 额外训练一个UNet预测补全项。

 

CEP

Slight Corruption in Pre-training Data Makes Better Diffusion Models

  1. Similar to CADS's operation on the condition, but CADS only operates at sampling time, whereas CEP operates at training time.

  2. Preliminary experiments: To introduce synthetic corruption into the conditions, we randomly flip the class label into a random class for IN-1K, and randomly swap the text of two sampled image-text pairs for CC3M. As a result, class- and text-conditional models pre-trained with slight corruption achieve significantly lower FID and higher IS and CLIP score. More corruption in pre-training can potentially lead to quality and diversity degradation. As the degradation level increases, almost all metrics first improve and then degrade. However, the degraded measure with more corruption is sometimes still better than the clean one.

  3. 更一般的,we propose to directly add perturbation to the conditional embeddings of DMs, which is termed as conditional embedding perturbation。做法是对condition embedding加一个符合N(0,γdI)的高斯噪声。

 

SADM

Structure-Guided Adversarial Training of Diffusion Models

SADM

  1. 除了diffusion loss,用batch内x0计算两两之间的manifold距离(用预训练编码网络将图像编码为一个向量并计算它们的欧式距离),同时使用预测的x^0也计算出batch内两两之间的manifold距离,优化使这两个距离极小化,目的是让diffusion预测出的图像保持和原数据集同样的manifold structure。

  2. 如果使用预训练好的编码网络会导致shortcut,所以引入对抗训练,训练编码网络极大化上述两个距离之间的差异(相当于区分fake和real的manifold structure)。

 

ConPreDiff

Improving Diffusion-Based Image Synthesis with Context Prediction

ConPreDiff

  1. Besides the traditional diffusion loss (self-denoising) that predicts every $x_{t-1}^i$ from $x_t$, an extra network takes each predicted $x_{t-1}^i$ and predicts its neighborhoods $x_{t-1}^j$; the ConPreDiff loss is an upper bound of the negative log likelihood. The gradient of this loss also flows back into the self-denoising network (UNet or Transformer).

  2. 采样时只用self-denoising网络,和传统diffusion model一致。

 

QAC

Learning Quantized Adaptive Conditions for Diffusion Models

QAC

  1. 类似Diff-AE和PDAE的自编码器,只不过这里用一个bsq code作为表征,不需要post training建模latent,虽然这种condition不能完全复原图像,但至少提供了一些信息。

  2. 采样时随机一个binary vector code作为条件,可以加速采样,提高采样质量。

 

DCDM

Training Data Synthesis with Difficulty Controlled Diffusion Model

  1. 和CAD把coherence作为额外条件输入diffusion model训练类似,这里将difficulty作为额外条件输入diffusion model训练,可以控制生成不同复杂度的图像。

 

TDNN

The Disappearance of Timestep Embedding in Modern Time-Dependent Neural Networks

  1. 探索更好的引入timestep embedding的方式。

 

Attention-Mediators

Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

  1. This paper identifies significant redundancy in the query-key interactions within self-attention mechanisms of diffusion transformer models, particularly during the early stages of denoising diffusion steps.

  2. We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately

 

Diff-Tuning

Diffusion Tuning: Transferring Diffusion Models via Chain of Forgetting

  1. Chain of Forgetting:t较小时,diffusion model可以做zero-shot denoising,和数据集无关;t较大时,diffusion model的泛化性受数据集影响较大。

  2. transfer时,同时使用一些原数据集的数据参与训练。使用原数据集的数据时,diffusion loss系数随着t的增大单调递减;使用transfer数据集的数据时,diffusion loss系数随着t的增大单调递增。

 

SFUNet

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models

 

WMGM

Multi-scale Generative Modeling for Fast Sampling

WMGM

  1. We propose a multi-scale generative modeling in the wavelet domain that employs distinct strategies for handling low and high-frequency bands. In the wavelet domain, we apply score-based generative modeling with well-conditioned scores for low-frequency bands, while utilizing a multi-scale generative adversarial learning for high-frequency bands.

 

enhancement of sampling

optimal sampling schedule

DP

Learning to Efficiently Sample from Diffusion Probabilistic Models

 

RL

Learning to Schedule in Diffusion Probabilistic Models

 

AdaDiff

AdaDiff: Adaptive Step Selection for Fast Diffusion

AdaDiff

  1. 预定义一个步数集合,训练一个轻量级的步数选择网络,根据text embedding从集合中选择一个步数进行生成,根据生成结果打分,policy gradient优化网络。

  2. 有一个额外的loss鼓励小步数。

 

BudgetFusion

BudgetFusion: Perceptually-Guided Adaptive Diffusion Models

BudgetFusion

  1. 收集一批prompt,每个prompt用相同xT和不同步数生成样本,计算metric,根据metric确定当前prompt的最efficient的生成步数,训练一个网络可以根据prompt预测最efficient的生成步数。

 

JYS

Jump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models

 

CS

Censored Sampling of Diffusion Models Using 3 Minutes of Human Feedback

  1. 训练一个time-dependent reward model,采样时use the score of time-dependent reward function as guidance。

 

AYS

Align Your Steps: Optimizing Sampling Schedules in Diffusion Models

 

LD3

Learning to Discretize Denoising Diffusion ODEs

  1. $\xi$ is a learnable, fixed-length, monotonically decreasing sampling schedule from $T$ to $0$; $\Psi$ is the ODE solver, $\Psi(x_T)$ denotes the teacher solution, and $\Psi_\xi(x_T)$ is the student solution sampled with schedule $\xi$: $\min_\xi L_{hard}=\min_\xi E_{x_T\sim\mathcal{N}(0,\sigma_T^2 I)}\big[\mathrm{LPIPS}(\Psi_\xi(x_T),\Psi(x_T))\big]$.

  2. Directly optimizing $L_{hard}$ could lead to severe underfitting: to minimize the objective, we need to ensure $\Psi_\xi(x_T)\approx\Psi(x_T)$ for any $x_T$, which is hard as we are only allowed to optimize $\xi$, which typically contains no more than 20 parameters for student ODE solvers with low NFE. We only require the existence of an input $x_T'$ that is close to $x_T$, such that the student's output $\Psi_\xi(x_T')$ matches $\Psi(x_T)$. Formally, define $B(x,r\sigma_T)=\{x'\,|\,\|x'-x\|_2\le r\sigma_T\}$ as the L2 ball of radius $r\sigma_T$ around $x$: $\min_\xi L_{soft}=\min_\xi E_{x_T\sim\mathcal{N}(0,\sigma_T^2 I),\,x_T'\in B(x_T,r\sigma_T)}\big[\mathrm{LPIPS}(\Psi_\xi(x_T'),\Psi(x_T))\big]$.

 

OFS

Optimizing Few-step Sampler for Diffusion Probabilistic Model

 

Beta-Sampling

Beta Sampling is All You Need: Efficient Image Generation Strategy for Diffusion Models using Stepwise Spectral Analysis

  1. Certain steps exhibit significant changes in image content, while others contribute minimally. 根据相邻时间步xtxt1之间的频谱的差异,发现在早期阶段,低频的差异较大,说明早期阶段主要合成低频,在后期阶段,高频的差异较大,说明后期阶段主要合成高频。

  2. Instead of the traditional uniform distribution-based time step sampling, we introduce a Beta distribution-like sampling technique that prioritizes critical steps in the early and late stages of the process. 早期和后期步长小步数多,中间步长大步数少。

 

Distillation-based

Roughly five categories: Direct Distillation, Progressive Distillation, Adversarial Distillation, Score Distillation (DI), and Consistency Distillation.

 

DenoisingStudent

Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

  1. Direct distillation: $L=\frac12 E_{x_T}\big[\|F_{student}(x_T)-F_{teacher}(x_T)\|_2^2\big]$, where $F_{student}(x_T)$ is the sample the student generates in one step and $F_{teacher}(x_T)$ is the sample the teacher generates with multi-step DDIM.

  2. In essence, the teacher is used to build a dataset of $(x_T, x_0)$ pairs on which a one-step student is trained (see the sketch below).
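
A minimal sketch of this direct distillation objective; `student` and `teacher_ddim_sample` are placeholders for a one-step network and a frozen multi-step DDIM sampler:

```python
import torch

def direct_distillation_loss(student, teacher_ddim_sample, x_T):
    """L = 1/2 * E_{x_T} || F_student(x_T) - F_teacher(x_T) ||^2."""
    with torch.no_grad():
        x0_teacher = teacher_ddim_sample(x_T)   # multi-step DDIM sample, treated as the target
    x0_student = student(x_T)                   # one forward pass
    return 0.5 * ((x0_student - x0_teacher) ** 2).mean()
```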

 

O2MKD

Accelerating Diffusion Models with One-to-Many Knowledge Distillation

O2MKD

  1. 分时间段进行蒸馏。

 

Diffusion2GAN

  1. SDXL

  2. We can significantly improve the quality of direct distillation by (1) scaling up the size of the ODE pair dataset and (2) using a perceptual loss, not MSE loss.

  3. Retrain a VGG network on SDXL's latent space and optimize the LPIPS loss between the student-generated latent and the teacher-generated latent.

  4. Besides the LPIPS loss, adversarial training is also used, with a GigaGAN-style multi-scale discriminator.

 

InstaFlow

InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

  1. 主要结论:2-Rectified Flow is a better teacher model to distill an one-step student model than the original SD.

  2. 先使用StableDiffusion训练一个k-Rectified Flow,再对该ReFlow进行direct distillation(一步拟合多步)。

 

PD

Progressive Distillation for Fast Sampling of Diffusion Models

PD-1

PD-2

  1. 训练student一步采样模拟teacher多步采样的效果。

  2. v-prediction: the $\epsilon$-prediction parameterization no longer works for distillation, because every DDIM step first computes $\hat{x}_\theta=\frac{1}{\sqrt{\bar\alpha_t}}\big(x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)\big)$; in the first few steps $\bar\alpha_t\approx 0$, so tiny changes in $\epsilon_\theta$ get amplified, and in the extreme case of distilling down to 1 step the formula becomes meaningless. A new parameterization is needed that keeps $\hat{x}_\theta$ stable at any SNR. Since $(\sqrt{\bar\alpha_t})^2+(\sqrt{1-\bar\alpha_t})^2=1$, set $\sqrt{\bar\alpha_t}=\cos(\phi)$, $\sqrt{1-\bar\alpha_t}=\sin(\phi)$, $x_\phi=\cos(\phi)x_0+\sin(\phi)\epsilon$, and define $v_\phi=\frac{dx_\phi}{d\phi}=\cos(\phi)\epsilon-\sin(\phi)x_0$. Solving these two linear relations gives $x_0=\cos(\phi)x_\phi-\sin(\phi)v_\phi$ and, likewise, $\epsilon=\sin(\phi)x_\phi+\cos(\phi)v_\phi$. If a network $v_\Omega(x_{\phi_t},\phi_t)$ predicts $v_{\phi_t}$, the two equivalent parameterizations are $\hat{x}_\Omega=\cos(\phi_t)x_{\phi_t}-\sin(\phi_t)v_\Omega(x_{\phi_t},\phi_t)$ and $\hat\epsilon_\Omega=\sin(\phi_t)x_{\phi_t}+\cos(\phi_t)v_\Omega(x_{\phi_t},\phi_t)$, so $\hat{x}_\Omega$ stays stable. Rewriting the DDIM update gives $x_{\phi_s}=\cos(\phi_s)\hat{x}_\Omega+\sin(\phi_s)\hat\epsilon_\Omega=\cos(\phi_s)\big[\cos(\phi_t)x_{\phi_t}-\sin(\phi_t)v_\Omega\big]+\sin(\phi_s)\big[\sin(\phi_t)x_{\phi_t}+\cos(\phi_t)v_\Omega\big]$, which simplifies to $x_{\phi_s}=\cos(\phi_s-\phi_t)x_{\phi_t}+\sin(\phi_s-\phi_t)v_\Omega$, i.e. $x_{\phi_t-\Delta}=\cos(\Delta)x_{\phi_t}-\sin(\Delta)v_\Omega$. As the triangle in the figure shows, DDIM sampling amounts to starting from $\epsilon$ and repeatedly moving along the tangent direction. Training a diffusion model from scratch with this parameterization also works well. Distillation then fits the $\hat{x}_\Omega$ computed from the student's one-step $v_\Omega$ to the $\hat{x}_\theta$ obtained from the teacher's multi-step prediction (whatever the parameterization, everything is ultimately converted to matching the predicted $\hat{x}_0$); see the sketch below.
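
A small PyTorch sketch of the conversions above, written with $\cos\phi=\sqrt{\bar\alpha_t}$ and $\sin\phi=\sqrt{1-\bar\alpha_t}$; function names are mine:

```python
import torch

def v_from_x0_eps(x0, eps, alpha_bar_t):
    # v = cos(phi) * eps - sin(phi) * x0
    return alpha_bar_t.sqrt() * eps - (1 - alpha_bar_t).sqrt() * x0

def x0_from_v(x_t, v, alpha_bar_t):
    # x0_hat = cos(phi) * x_t - sin(phi) * v
    return alpha_bar_t.sqrt() * x_t - (1 - alpha_bar_t).sqrt() * v

def eps_from_v(x_t, v, alpha_bar_t):
    # eps_hat = sin(phi) * x_t + cos(phi) * v
    return (1 - alpha_bar_t).sqrt() * x_t + alpha_bar_t.sqrt() * v

def ddim_step_with_v(x_t, v_pred, alpha_bar_t, alpha_bar_s):
    # Deterministic DDIM step: x_s = cos(phi_s) * x0_hat + sin(phi_s) * eps_hat
    x0 = x0_from_v(x_t, v_pred, alpha_bar_t)
    eps = eps_from_v(x_t, v_pred, alpha_bar_t)
    return alpha_bar_s.sqrt() * x0 + (1 - alpha_bar_s).sqrt() * eps
```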

 

CFG-PD

On Distillation of Guided Diffusion Models

CFG-PD

  1. distill CFG teacher to student。

  2. stage 1:训练一个和teacher相同步数的student,将guidance strength作为额外的条件,guidance strength也是随机均匀采样。

  3. stage 2:和PD一样,迭代训练更少步数的student。

  4. 采样时可以调用N次student达到类似2N步的随机采样。

 

SFDDM

SFDDM: Single-fold Distillation for Diffusion models

SFDDM

  1. PD是每次减少一半步数,直到目标步数,属于multi-fold。SFDDM一步到位,属于single-fold。

  2. T是teacher的时间步,T是student的时间步,TT=c,定义student的前向过程q(xt|xt1)=N(α¯ctα¯ctcxt1,(1α¯ctα¯ctc)I),确保q(xt|x0)=q(xct|x0),并推导出q(xt1|xt,x0)​,使用student进行拟合。

  3. 本质上,student就是步数很少时的DDPM,只不过是用teacher DDPM做监督,感觉没有直接训练效果好?

 

SDXL-Lightning

SDXL-Lightning: Progressive Adversarial Diffusion Distillation

  1. PD时不是用MSE而是用对抗损失进行训练。

  2. 判别器为D(xt,xtns,t,tns,c),使用UNet encoder结构,分别输入xt,t,cxtns,tns,c得到两个输出,融合后预测一个分数。The condition on xt is important for preserving the ODE flow. This is because the teacher’s generation of xtns is deterministic from xt. By providing the discriminator both xtns and xt, the discriminator learns the underlying ODE flow and the student must also follow the same flow to fool the discriminator.

 

TRACT

TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

 

SpeedUpNet

SpeedUpNet: A Plug-and-Play Hyper-Network for Accelerating Text-to-Image Diffusion Models

  1. Adds an extra trainable cross-attention to StableDiffusion that interacts with the negative prompt.

  2. Two losses, one fitting a single step and one fitting multiple steps — aren't they in conflict?

 

PDAE-PD

Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models

 

Imagine-Flash

Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation

  1. 对于某个xt,训练student一步预测的x^0拟合teacher多步预测的x^0

  2. forward distillation:根据q(xt|x0)使用真实数据加噪得到xt;backward distillation:随机xT,使用student采样得到xt。由于训练目标是加速采样,采样时是没有ground-truth signal的,所以使用forward distillation是有exposure bias的,for forward distillation, the model learns to denoise taking into account information from the ground-truth signal, backward distillation eliminates information leakage at all time steps, preventing the model from relying on a ground-truth signal.

 

DDGAN

Tackling the Generative Learning Trilemma with Denoising Diffusion GANs

 

SIDDM

Semi-Implicit Denoising Diffusion Models

改进DDGAN。

 

UFOGen

UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs

改进SIDDM。

 

YOSO

You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs

YOSO

  1. define a sequence distribution of clean data pθt(x0)=q(xt)pθ(x0|xt)dxtpθ0(x0)=q(x0)pθ(x0|xt)=δ(Gθ(xt,t)),使用Et[Dadv(q(x)pθt(x))+λKL(q(x)pθt(x))]优化学习Gθ(xt,t)直接预测clean data,前一项是distribution层面对齐,后一项是point层面对齐。

  2. 然而直接在clean data上进行对抗训练无法避免GAN训练时遇到的困难,为了解决这一问题,DDGANs在corrupted data上进行对抗训练,but such an approach fails to directly match pθ(x0)​, curtailing the efficacy of one-step generation,这算一个两难问题。

  3. 受Self-Cooperative的启发,YOSO依然在clean data上进行对抗训练,但使用pθt1(x)作为ground truth学习pθt(x),即Et[Dadv(pθt1(sg(x))pθt(x))+λGθ(xt,t)x22]pθt1(sg(x))pθt(x)的样本都是用Gθ进行生成的。这个思想类似CM,不过CM是point-to-point match,且xt1xt进行ODE采样得到,YOSO是distribution match,且xt1xt是独立采样。额外使用一个类似CM的MSE loss。

  4. 训练之前,先对训练扩散模型进行微调,第一阶段转换到v-prediction,第二阶段改变noise schedule实现zero terminal SNR,之后对得到的模型直接fine-tune或LoRA fine-tune作为Gθ(xt,t)

 

HiPA

HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation

HiPA

  1. StableDiffusion用不同步数生成图像,FT提取高低频信息,组合不同高低频,ITF转换为图像,可以看出单步生成不好的主要原因是高频信息不够好。

  2. 使用LoRA finetune StableDiffusion,让LoRA+SD单步生成的图像的高频部分和SD多步生成的图像的高频部分尽量靠近。

 

ADD (SDXL-Turbo)

Adversarial Diffusion Distillation

ADD-1

ADD-2ADD-3
  1. The student network is initialized from the teacher; its number of sampling steps is set to 4, $\{\tau_1,\tau_2,\tau_3,\tau_4\}$ with $\tau_4=1000$. During training, data is randomly drawn from the dataset, noised to one of the 4 steps, and fed to the student, which generates $\hat{x}_\theta$ in one step; two losses are used for training.

  2. GAN loss:We use a frozen pretrained feature network and a set of trainable lightweight discriminator heads. The trainable discriminator heads are applied on features at different layers of the feature network.

  3. Distillation loss: Notably, the teacher is not directly applied on generations of the ADD-student but instead on diffused inputs, as non-diffused inputs would be out-of-distribution for the teacher model. Concretely, sample $t$ and a noise $\epsilon$, re-noise the student's prediction to $\hat{x}_{\theta,t}=\sqrt{\bar\alpha_t}\hat{x}_\theta+\sqrt{1-\bar\alpha_t}\epsilon$, feed it to the teacher, take its one-step prediction $\frac{\hat{x}_{\theta,t}-\sqrt{1-\bar\alpha_t}\,\epsilon_\psi(\hat{x}_{\theta,t},t)}{\sqrt{\bar\alpha_t}}$, and compute the MSE against the student's prediction $\hat{x}_\theta$. In fact, $\frac{\hat{x}_{\theta,t}-\sqrt{1-\bar\alpha_t}\,\epsilon_\psi(\hat{x}_{\theta,t},t)}{\sqrt{\bar\alpha_t}}-\hat{x}_\theta=\frac{\sqrt{\bar\alpha_t}\hat{x}_\theta+\sqrt{1-\bar\alpha_t}\epsilon-\sqrt{1-\bar\alpha_t}\,\epsilon_\psi(\hat{x}_{\theta,t},t)}{\sqrt{\bar\alpha_t}}-\hat{x}_\theta=\frac{\sqrt{1-\bar\alpha_t}\big(\epsilon-\epsilon_\psi(\hat{x}_{\theta,t},t)\big)}{\sqrt{\bar\alpha_t}}$, which is equivalent to SDS: the student network plays the role of $g_\theta$ in SDS, and its samples are pushed to match the teacher's.

  4. Only one-step generation is used during training; at sampling time 4-step DDIM is used.

 

LADD

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

LADD

  1. 在VAE latent空间,teacher网络生成样本z0,加噪为zt,输入student网络,一步生成z^θ,对z^θ加同样的噪声为z^θ,tztz^θ,t分别输入teacher网络,用产生的feature做判别。

 

NitroFusion

NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training

NitroFusion

  1. Stabilizes the adversarial training used in Adversarial Distillation.

  2. 判别器由frozen teacher UNet encoder和trainable lightweight head构成,生成的图像加噪到相同timestep后输入判别器判别真假。

  3. We use a dynamic discriminator pool to source these discriminator heads,head是分时间步的,每个head只负责某个时间步,每步训练时随机从pool中随机挑选一批head(相同时间步)进行训练,head优化更新后重新放回pool,the stochasticity of this process through random sampling ensures varied feedback, preventing any single head from dominating the generator’s learning and reducing bias. This diversifies feedback and enhances stability in GAN training.

  4. 每步训练后随机扔掉pool中1%的head,补充回相同数量的重新随机初始化的head,refreshing discriminator subsets helps maintain a balance between stable feedback from retained heads and variability from re-initialized ones to enhance generator performance.

 

SwiftBrush

SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation

  1. Substitute the NeRF rendering with a text-to-image generator that can directly synthesize a text-guided image in one step, effectively converting the text-to-3D generation training into one-step diffusion model distillation.

 

SwiftBrushv2

SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

SwiftBrushv2

 

DI

Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models

DI-1

DI-2

DI-3

DI-4

  1. We have a pre-trained diffusion model with the multi-level score net denoted as sq(t)=xtlogq(t)(xt).

  2. We aim to train an implicit model gθ without any training data, such that the distribution of the generated samples, denoted as pg​, matches that of the pre-trained diffusion model.

  3. In order to receive supervision from the multi-level score functions sq(t), introducing the same diffusion process to the generated samples seems inevitable. Consider diffusing pg along the same forward process as the instructor diffusion model and let p(t) be the corresponding densities at time t. sp(t)=xtlogp(t)(xt).

  4. The IKL is tailored to incorporate knowledge of pre-trained diffusion models in multiple time levels. It generalizes the concept of KL divergence to involve all time levels of the diffusion process.

  5. 在同一个扩散过程下,分别以两个分布为起点,优化它们的IKL。对IKL求关于θ的梯度,使用链式法则可以引入xt,得到θlogp(t)(xt)q(t)(xt)=xtlogp(t)(xt)q(t)(xt)xtθ=[xtlogp(t)(xt)xtlogq(t)(xt)]xtθ。因此先训练sϕ(xt,t)sp(t)=xtlogp(t)(xt)进行估计,再训练θ优化IKL。

  6. SDS algorithm is a special case of Diff-Instruct when the generator’s output is a Dirac’s Delta distribution with learnable parameters. 如果gθ的输出是确定性的(相同输入会得到相同输出),则IKL退化到和SDS一样形式的loss(无需省略和近似),此时就需要训练sp(t)了。这说明SDS是Diff-Instruct的一个特例。

  7. ADD可以看成是Diff-Instruct和对抗训练的结合。

 

SDXS

SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

SDXS

  1. 使用BK-SDM精简student model。

  2. Diff-Instruct训练student model,在[αT,T]上使用Diff-Instruct的训练方法(LDM用于训练sϕ(xt,t)LIKL用于训练student model),在[0,αT]上使用LFM​进行训练。

  3. LFM=lwlSSIM(fl(xθ(ϵ)),fl(ψ(xϕ(ϵ))))是feature matching loss,fl is l-th intermediate feature map encoded by the encoder fxθ是student model,xϕ是teacher model,ψ是ODE sampler。LFM yields favorable results with a comparison to MSE loss.

 

DMD

One-step Diffusion with Distribution Matching Distillation

DMD-1 DMD-2
  1. Distribution Matching Loss就是DI的IKL。

  2. As the distribution of our generated samples changes throughout training, we dynamically adjust the fake diffusion model,这就是为什么要额外训练一个diffusion model的原因。fake diffusion model和one-step generator是一起训练的。

 

DMD2

Improved Distribution Matching Distillation for Fast Image Synthesis

DMD2

  1. Multi-step generator (999, 749, 499, 249),和CM一样alternate between denoising and noise injection steps,如根据x999直接生成x^0,对x^0加噪得到x749,依此循环,所以训练时Gθ的输出总是x^0

  2. 为了避免training/inference mismatch,训练时的input不使用训练集的图像的加噪结果,而是使用上述方法使用Gθ生成的噪声图像。

  3. Removing the regression loss: true distribution matching and easier large-scale training.

  4. Stabilizing pure distribution matching with a Two Time-scale Update Rule. fake diffusion model和few-step generator是分开训练的。

  5. Surpassing the teacher model using a GAN loss and real data.

 

MomentMatching

Multistep Distillation of Diffusion Models via Moment Matching

MomentMatching

  1. 随机变量X的概率密度函数为f(x),其相对于值ck阶矩被定义为+(xc)kf(x)dx=Exp(x)[(xc)k],当c=0时为原点矩,当c=E(X)时为中心矩,1阶原点矩就是X的期望,2阶中心矩就是X的方差。矩的阶数可以一直到无穷大。如同PDF和CDF,矩生成函数MGF (Moment-Generating Function) 也能刻画随机变量的概率分布。随着矩阶数的升高,每一阶矩都提供了更细节的概率分布信息,与较低阶的矩一起对概率分布的刻画越完整(从均值、方差、偏度、峰度、……),这点与泰勒级数和傅立叶级数的思想一致。

  2. moment matching:通过拟合概率分布的moment来拟合概率分布,如拟合期望、方差等。

  3. Generalized Method of Moments (GMM) :定义任一函数h:RdRd,矩向量m=Exp(x)[h(x)]Rd,此时moment matching可以拓展为Expθ(x)h(x)Expdata(x)h(x)2

  4. distill pre-trained diffusion model gθ to distilled model gη,并且gη是多步的,之前的distillation都是优化gθ(xt,t)gη(xt,t)之间的距离,考虑到diffusion model ancestral sampling公式为q(xs|xt,x^0=gθ(xt,t)),因此可以使用xs作为桥梁,优化12Exqdata(x),xtq(xt|x),x^0gη(xt,t),xsq(xs|xt,x^0)Eg[x^0|xs]Eq[x^0|xs]2,对η求导,使用单个x^0进行蒙特卡洛模拟,we evaluate the moments using a sample xs from our generator distribution,由于第一项的Eg[x^0|xs]是未知的,所以需要额外训练一个模型gϕ拟合它。

  5. In the case of one-step sampling, our method is a special case of Diff-Instruct, which distill a diffusion model by approximately minimizing the KL divergence between the distilled generator and the teacher model.

 

FGM

Flow Generator Matching

FGM

  1. 感觉类似FM上的DI。

 

CM

Consistency Models

Consistency-Models

  1. In the concrete implementation, CM is trained on a 40-step EDM discretization, so $t_{n+1}$ and $t_n$ are adjacent steps among those 40.

  2. For simplicity, we only consider one-step ODE solvers in this work. It is straightforward to generalize our framework to multistep ODE solvers and we leave it as future work.

  3. The training target of CM is the consistency between two adjacent steps, not the diffusion model's reconstruction of $x_0$, so $x_{t_{n+1}}$ is sampled directly from $x_0$ while $x_{t_n}$ must be solved from $x_{t_{n+1}}$; both must map to the same output (which need not be $x_0$). Sampling $x_{t_{n+1}}$ directly is valid because the PF ODE is derived precisely from sharing the same marginal distributions as the SDE. Only when training happens to sample $t_n=\epsilon$ is the target actually reconstructing $x_0$, so the generative ability is anchored at the $t_n=\epsilon$ step and propagated in a chain to the later timesteps (a minimal sketch of one training step follows).
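
A minimal sketch of one consistency-distillation training step under the EDM-style setup the note mentions; `f_theta`, `f_ema`, and `ode_solver_step` (one teacher solver step from $t_{n+1}$ to $t_n$) are placeholders:

```python
import torch

def consistency_distillation_loss(f_theta, f_ema, ode_solver_step, x0, t_next, t_cur):
    """x_{t_{n+1}} is drawn from q(x_{t_{n+1}} | x0); x_{t_n} is *solved* from it with one
    teacher ODE step; the online and EMA consistency models must agree on both points."""
    eps = torch.randn_like(x0)
    x_next = x0 + t_next.view(-1, 1, 1, 1) * eps            # EDM-style forward perturbation
    with torch.no_grad():
        x_cur = ode_solver_step(x_next, t_next, t_cur)       # one step of the pre-trained solver
        target = f_ema(x_cur, t_cur)
    return ((f_theta(x_next, t_next) - target) ** 2).mean()
```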

 

LCM

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

LCM

$\Psi$ is an ODE solver, e.g. DDIM, DPM-Solver, etc.

Since $t_n\to t_{n+1}$ is tiny, $z_{t_n}$ and $z_{t_{n+1}}$ are already close to each other, incurring small consistency loss and hence leading to slow convergence. Instead of ensuring consistency between adjacent time steps $t_{n+1}\to t_n$, LCMs aim to ensure consistency between the current time step and one $k$ steps away, $t_{n+k}\to t_n$. These $k$ steps are covered by a single ODE-solver step, which amounts to training a CM on a $T/k$-step ODE.

 

LCM-LoRA

LCM-LoRA: A Universal Stable-Diffusion Acceleration Module

StableDiffusion + LoRA as the Consistency Model.

 

RG-LCD

Reward Guided Latent Consistency Distillation

RG-LCD

 

CTM

Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion

  1. 和TCD本质上是一样的,但故事不一样,TCD是先有后两步,再引入第一步,CTM是先有前后两步,再引入中间一步。

  2. TCD抄袭CTM。

 

PCM

Phased Consistency Models

  1. The learning objectives of CTMs are redundant, including many trajectories that will never be applied for inference.

  2. 将diffusion trajectory分为N段,即[s0,s1,s2,,sN1,sN],其中s0=ϵsT=N,将每一段单独视为一个diffusion trajectory训练一个CM,如[sn,sn+1]段,随机采样一个tm+k,得到ztm+k,ODE解出ztm,保证tm也在[sn,sn+1]中,训练loss为d(fθ(ztm+k,tm+k,sn)fθ(ztm,tm,sn))

 

GCTM

Generalized Consistency Trajectory Models for Image Manipulation

  1. CTMs only allow translation from Gaussian noise to data. This work aims to unlock the full potential of CTMs by proposing generalized CTMs, which translate between arbitrary distributions via ODEs.

  2. Flow Matching is another technique for learning PFODEs between two distributions. 在Flow Matching学到的PFODEs上运用CTMs。

  3. 支持translation、editing等。

 

SFD

Simple and Fast Distillation of Diffusion Models

SFD

  1. 可以加快TCD的训练速度。

 

TCD

Trajectory Consistency Distillation

TCD-1

TCD-2

  1. 定义fθ(xt,t,s)xs,而不是直接到x0,相当于训练模型保持tn+ktntm​的consistency。

  2. 左边为CM的多步采样,右边为TCD的多步采样。CM的多步采样每一步都预测到x0​再加噪,误差大且会累积,TCD的多步采样每一步预测到某一中间步,误差小。

  3. 使用LoRA fine-tune SDXL。

 

TSCD

Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis

TSCD

  1. 将PD与TCD结合起来,称为Progressive Consistency Distillation。先将[0,T]分为k个segment,TCD训练时将TCD中的m限制在xtxt1(TCD中的xtk)所在的segment内,训练完成后将k减半接着训练,直到k=1k=1​时就和TCD等价。

  2. Consistency Distillation使用adversarial loss和MSE loss。Empirically, we observe that MSE Loss is more effective when the predictions and target values are proximate (e.g., for k=8,4), whereas adversarial loss proves more precise as the divergence between predictions and targets increases (e.g., for k=2,1).

  3. 训练完成后继续使用DMD进行enhancement。

  4. 使用LoRA fine-tune SDXL。

 

MCM

Multistep Consistency Models

  1. 与TCD的思想类似,但不同的是MCM不重新定义f,而是使用f的预测计算某一中间步的结果(DDIM中的x^0)。

 

SCott

SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation

SCott

  1. 使用SDE-Solver而非ODE-Solver。

  2. 使用多步SDE。

  3. Consistency Models被参数化为预测均值和方差的模型,此时输出就是一个分布,使用KL散度优化。

 

SCFlow

Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation

SCFlow

 

Shortcut

One Step Diffusion via Shortcut Models

  1. Train a single model that supports different sampling budgets, by conditioning the model not only on the timestep t but also on a desired step size d.

  2. x1是data,x0是noise,xt+d=xt+sθ(xt,t,d)dd0时,sθ(xt,t,d)=xt+dxtd=vθ(xt,t)即为Flow Matching。

  3. self-consistent性质:sθ(xt,t,2d)2d=sθ(xt,t,d)d+sθ(xt+d,t+d,d)dsθ(xt,t,2d)=sθ(xt,t,d)/2+sθ(xt+d,t+d,d)/2

  4. L=sθ(xt,t,0)(x1x0)2+sθ(xt,t,2d)(sθ(xt,t,d)/2+sθ(xt+d,t+d,d)/2)2

 

Consistency-FM

Consistency Flow Matching: Defining Straight Flows with Velocity Consistency

  1. CM在flow matching上的应用。

 

iCD

Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps

  1. 利用CD的思想学习一个用于inversion的模型fCD,CD是映射xtx0,fCD是映射xtxT,可以加速编辑。

  2. fCD有两个问题,一是要学习的xt没法直接通过对x0进行加噪得到,所以要先将x0加噪到xT或者直接从N(0,I)中采样一个xT,之后使用ODE采样得到xt进行训练。二是fCD无法进行多步采样,所以借鉴TCD和CTM的思想,给模型加一个额外的时间步输入,让模型预测指定时间步的latent,之后在trajectory上设置几个boundary时间步,每次只训练模型预测到最近的boundary时间步的latent。

  3. 同时训练CD和fCD,额外加两个preservation loss:采样某个boundary时间步的xt,让CD先预测出x0,再让fCD预测出xt,与ground truth xt计算loss,只优化fCD;另一个preservation loss反过来同理,只优化CD。

  4. We train fCD and CD separately from each other but initialize them with the same teacher model.

  5. For fCD, we consider the unguided model with a constant ω=1. The reason is that the guided encoding (ω>1) leads to the out-of distribution latent noise, which results in incorrect reconstruction.

 

SiD

Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation

 

SiDA

Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step

  1. SiD with Adversarial Loss

 

SiD-LSG

Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation

 

ODE-based

These can be divided into single-step and multi-step methods. Single-step solvers predict the next state only from the current state (e.g. DDIM, EDM, DPM-Solver); they are simple to implement and self-starting. Multi-step solvers additionally use past states to predict the next state (e.g. PNDM, DEIS); the estimate is more accurate and the results are better.

 

DDIM

Denoising Diffusion Implicit Models

 

PNDM

Pseudo Numerical Methods for Diffusion Models on Manifolds

 

DPM-Solver

DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

 

ODE-Distillation-based

D-ODE

Distilling ODE Solvers of Diffusion Models into Smaller Steps

D-ODE-1 D-ODE-2
  1. We observe that predictions from neighboring timesteps exhibit high correlations in both denoising networks, with cosine similarities close to one. This observation suggests that denoising outputs contain redundant and duplicated information, allowing us to skip the evaluation of denoising networks for most timesteps.

  2. We can combine the history of denoising outputs to better represent the next output, effectively reducing the number of steps required for accurate sampling. This idea is implemented in most ODE solvers, which are formulated based on the theoretical principles of solving differential equations. These solvers often adopt linear combinations or multi-step approaches, leveraging previous denoising outputs to precisely estimate the current prediction.

  3. 已有的ODE方法,如线性多步法,都有固定的组合历史预测的公式;D-ODE使用一组可学习的组合历史预测的系数,student利用第t步的预测和历史预测的组合估计第t1步的预测(大跨步),teacher使用已有的ODE方法从第t步采样C步得到第t1步的预测(小跨步),前者以后者为目标进行蒸馏,优化组合系数。

 

Caching

AdaptiveDiffusion

Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy

AdaptiveDiffusion

 

PFDiff

PFDiff: Training-free Acceleration of Diffusion Models through the Gradient Guidance of Past and Future

  1. Based on two key observations: a significant similarity in the model’s outputs at time step size that is not excessively large during the denoising process of existing ODE solvers, and a high resemblance between the denoising process and SGD.

  2. 直接使用之前timestep的预测(也可以使用ODE算法组合多个timestep的预测)作为当前timestep的预测,因此当前timestep不需要NFE。

 

DeepCache

DeepCache: Accelerating Diffusion Models for Free

DeepCache

  1. Every $N$ steps, run one full inference and cache the deep features; the following $N-1$ steps reuse the cached features, so only $\frac{T}{N}$ full inferences are needed in total (a toy sketch follows).
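
A toy sketch of the caching pattern; `shallow`, `deep`, and `head` stand in for the UNet's outer blocks, inner blocks, and output path, and the interval is an arbitrary choice:

```python
import torch

class CachedBackbone(torch.nn.Module):
    """Run the deep blocks only every `interval` sampling steps and reuse the cached
    deep features in between; the cheap shallow path still runs every step."""
    def __init__(self, shallow, deep, head, interval=3):
        super().__init__()
        self.shallow, self.deep, self.head = shallow, deep, head
        self.interval, self._cache = interval, None

    def forward(self, x, step):
        h = self.shallow(x)
        if step % self.interval == 0 or self._cache is None:
            self._cache = self.deep(h)            # full inference: refresh the cache
        return self.head(h, self._cache)          # cached steps skip the deep blocks
```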

 

Flexiffusion

Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule

  1. 类似DeepCache的思想,利用NAS技术 to search for potential inference schedules with non-uniform steps and structures.

 

Unraveling

Unraveling the Temporal Dynamics of the Unet in Diffusion Models

Unraveling

 

Faster-Diffusion

Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models

  1. The encoder features exhibit a subtle variation at adjacent time-steps, whereas the decoder features exhibit substantial variations across different timesteps,所以可以复用之前步数的UNet encoder的输出和feature,直接输入/skip-connect到下一步的UNet decoder。

  2. The encoder feature change is larger in the initial inference phase compared to the later phases throughout the inference process,所以在复用集中在采样的中后期阶段。

  3. 还可以连续多步复用,这样多步就可以并行计算。

 

BlockCaching

Cache Me if You Can: Accelerating Diffusion Models through Block Caching

BlockCaching

  1. UNet的block输出具有三个特点:smooth change over time, distinct patterns of change, small step-to-step difference. A lot of blocks are performing redundant computations during steps where their outputs change very little. Instead of computing new outputs at every step, we reuse the cached outputs from a previous step. Due to the nature of residual connections, we can perform caching at a per block level without interfering with the flow of information through the network otherwise.

  2. 重复利用之前时间步的某些block的输出,减少运算量。

 

SkipDiT

Accelerating Vision Diffusion Transformers with Skip Branches

SkipDiT

 

Δ-DiT

Delta-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

Delta-DiT

  1. 注意看剪刀,不同方法的区别在于省略的地方不同。

  2. 和之前的方法不同的是,Δ-Cache caches the difference between feature maps.

  3. Δ-Cache is applied to the back blocks in the DiT during the early outline generation stage of the diffusion model, and on front blocks during the detail generation stage.

 

DuCa

Accelerating Diffusion Transformers with Dual Feature Caching

DuCa

 

TGATE

Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models

  1. This study reveals that, in text-to-image diffusion models, cross-attention is crucial only in the early inference steps, allowing us to cache and reuse the cross-attention map in later steps.

  2. 节省了最耗计算量的cross-attention map的计算。

 

DiTFastAttn

DiTFastAttn: Attention Compression for Diffusion Transformer Models

DiTFastAttn

  1. Post-training compression for self-attention. 可以用在ImageNet的DiT上,也可用在text-to-image的PixArt上。

  2. self-attention values concentrate within a window along the diagonal region of the attention matrix. 计算采样的前两步的self-attention map的residual,之后只计算对角线附近的window self-attention,加上这个residual作为最后的self-attention map。

  3. self-attention sharing直接共享self-attention的结果。

 

L2C

Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

L2C

  1. 训练T×D个实数,T是时间步数,D是DiT的层数,MHSA和FeedForward各算一层。

  2. 训练时,采样相邻的两个时间步sm,采样得到xs,计算ϵθ(xs,s)并cache每一层的输出;之后使用ODE从xs中求解出xm,计算ϵθ(xm,m)作为ground truth;之后计算ϵ~θ(xm,m),其DiT的每一层计算公式为hi+1m=him+g(m)(βm,ifi(him)+(1βm,i)fi(his)),其中fi是MHSA或FeedForward,hi是当前层的输入,g(m)是DiT的scale系数;ϵθ(xm,m)ϵ~θ(xm,m)22优化βm

  3. 推理时,某一层的βt,i小于某个阈值时,就设βt,i=0,这样fi(hit)的系数就为0,因此可以跳过当前层的计算。

 

HarmoniCa

HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration

  1. 改进L2C。

 

LazyDiT

LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers

  1. 给MHSA和FeedForward各学习一个相似度估计器,相似度超过阈值时跳过,使用上一步的cache结果。

 

TOC

Task-Oriented Diffusion Model Compression

  1. 专门针对image to image translation任务的加速,如InstructPix2Pix image editing和StableSR image restoration。

  2. Depth-skip compression:和Unraveling的(b) Removing deconv blocks一样。

  3. Timestep optimization:biased timestep selection

 

Token Reduction

ToMe

Token Merging: Your ViT But Faster

Token Merging for Fast Stable Diffusion

ToMeSD

 

ImToMe

Importance-based Token Merging for Diffusion Models

  1. Use $\|\epsilon_\theta(x_t,t,c)-\epsilon_\theta(x_t,t)\|$ as each token's importance score, and select the top-k tokens by importance as the dst set.

 

AT-EDM

Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models

AT-EDM

  1. 根据attention map识别过剩的token进行merge。

 

ToDo

ToDo: Token Downsampling for Efficient Generation of High-Resolution Images

  1. 类似PixArt-Σ的KV compression,但是training-free的。

  2. Tokens in close spatial proximity exhibit higher similarity, thus providing a basis for merging without the extensive computation of pairwise similarities.

  3. We employ a downsampling function using the Nearest-Neighbor algorithm to the keys and values of the attention mechanism while preserving the original queries.

 

TokenCache

Token Caching for Diffusion Transformer Acceleration

TokenCache

 

ToCa

Accelerating Diffusion Transformers with Token-wise Feature Caching

ToCa

  1. 本质和UNet那一套feature cache是一样的,只是对象换成了token。

 

others

DG

Refining Generative Process with Discriminator Guidance in Score-Based Diffusion Models

  1. Generate a batch of samples with the pre-trained diffusion model, recording the intermediate $x_t$ during generation.

  2. Sample a batch from the real dataset and noise it to a random timestep; sample a batch from the generated dataset and take the $x_t$ at the same timestep; train a time-dependent discriminator $D(x_t,t)$ to tell them apart.

  3. At sampling time, use $\nabla_{x_t}\log\frac{D(x_t,t)}{1-D(x_t,t)}$ to guide the sampling.

 

DiffRS

Diffusion Rejection Sampling

DiffRS

  1. pθ(xt1|xt)q(xt1|xt)之间有gap,且q(xt1|xt)不一定高斯分布。

  2. 使用rejection sampling优化每一步采样,将pθ(xt1|xt)作为proposal distribution,从其中采样并以一定概率拒绝,直到接受即可完成采样。

  3. 概率的计算最终还是使用DG的time-dependent discriminator。

 

AMS

Score-based Generative Models with Adaptive Momentum

  1. 类似FDM但不需要重新训练,motivated by the Stochastic Gradient Descent (SGD) optimization methods and the high connection between the model sampling process with the SGD, we propose adaptive momentum sampling to accelerate the transforming process without introducing additional hyperparameters.

 

Skip-Tuning

The Surprising Effectiveness of Skip-Tuning in Diffusion Sampling

Skip-Tuning

  1. Let $d_i$ be the UNet encoder features and $u_i$ the decoder features; the decoder consumes their concatenation $(d_i,u_i)$. Comparing $\frac{\|d_i\|_2}{\|u_i\|_2}$ between the original model and a model distilled from it (e.g. a CM), the ratio drops for the distilled model. This suggests: when using the original model for accelerated sampling, introduce a coefficient $\rho_i<1$ and let the decoder use $(\rho_i d_i, u_i)$; would that improve sample quality?

  2. With the simplest strategy, define $\rho_{bottom}$ for the innermost level and $\rho_{top}$ for the outermost level with $\rho_{bottom}<\rho_{top}$, and interpolate between them for the intermediate levels. In EDM's 5-step Heun sampling, using $\rho_{bottom}=0.55$ and $\rho_{top}=1.0$ cuts FID to roughly half of the original, a huge improvement (a minimal sketch follows this list).

  3. Take some images, noise them and denoise in one step, and sum the diffusion loss over all steps: skip-tuning does not lower the diffusion loss, but it does make the one-step denoised images closer to the originals in feature space (features extracted by InceptionV3, CLIP, etc.), so skip-tuning improves FID by improving the features. One can therefore also take an existing model, add a trainable $\rho$, and train that $\rho$ with diffusion loss + feature loss to obtain the same improvement.
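
A minimal sketch of the interpolated skip-scaling coefficients from item 2; the number of resolution levels is an assumption:

```python
import numpy as np

def skip_scales(num_levels=4, rho_bottom=0.55, rho_top=1.0):
    """Per-level coefficients for the encoder skip features d_i: interpolate from
    rho_bottom at the innermost level to rho_top at the outermost level."""
    return np.linspace(rho_bottom, rho_top, num_levels)

# In the decoder one would then use concat(rho_i * d_i, u_i) instead of concat(d_i, u_i).
print(skip_scales())  # [0.55 0.7  0.85 1.  ]
```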

 

MASF

Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

MASF-1

MASF-2

  1. 把diffusion model生成的过程看成参数优化的过程,因此可以引入滑动平均提高稳定性和效果。

  2. The denoising process often prioritizes reconstructing low-frequency component (layout) in the earlier stage, and then focuses on the recovery of high-frequency component (detail) later. 因此IDWT时,给不同component乘一个系数,给low-frequency component乘一个单调递减的常数,给high-frequency component乘一个单调递增的常数。

  3. 相同步数下,FID比DDIM好。

 

TimeTuner

Towards More Accurate Diffusion Model Acceleration with A Timestep Tuner

  1. 类似TS-DDPM,at each denoising step, we replace the original parameterization by conditioning the network on a new timestep, enforcing the sampling distribution towards the real one.

 

Residual-Learning

Residual Learning in Diffusion Models

Residual-Learning-1 Residual-Learning-2

Residual-Learning-3

  1. score-based generative models存在两种误差,离散化导致的误差和score network无法完全拟合导致的误差,所以可以在pre-trained diffusion model的基础上学习一个矫正网络来拟合这种误差。

  2. 只在t=0附近使用。

 

DICE

DICE: Staleness-Centric Optimizations for Efficient Diffusion MoE Inference

  1. 针对MoE网络结构的diffusion model采样加速。

 

IC

Informed Correctors for Discrete Diffusion Models

  1. 针对discrete diffusion model的采样算法。

 

Exposure Bias

For a given timestep $t$, the distribution of the $x_t$ fed to the network during training differs from the distribution of the $x_t$ reached during sampling; in other words, there is a domain shift.

 

TS-DDPM

Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps

  1. We search for such a time step within a window surrounding the current time step to restrict the denoising progress.

 

IP

Input Perturbation Reduces Exposure Bias in Diffusion Models

  1. Gaussian分布建模训练时输入网络的xt和采样时采样得到的xt的分布的gap。

 

DREAM

DREAM: Diffusion Rectification and Estimation-Adaptive Models

DREAM

 

SS

Markup-to-Image Diffusion Models with Scheduled Sampling

  1. 训练diffusion model时,先从q(xt+m|x0)中采样一个xt+m,然后用diffusion model采样得到xt,训练ϵθ(xt,t)预测xtα¯tx01α¯t​,为了简便,忽略采样时产生的梯度。

  2. 该方法原本是用来解决自回归文本生成的exposure bias问题的。

 

E2EDiff

E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models

E2EDiff

 

MDSS

Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models

MDSS

 

 

Low-Density Sampling

LD

Generating High Fidelity Data from Low-density Regions using Diffusion Models

 

Minority-Guidance

Don’t Play Favorites: Minority Guidance for Diffusion Models

 

SG-Minority

Self-Guided Generation of Minority Samples Using Diffusion Models

 

HD

Diffusion Models as Cartoonists! The Curious Case of High Density Regions

  1. We propose a practical high probability sampler that consistently generates images of higher likelihood than usual samplers.

 

others

MC

Manifold-Guided Sampling in Diffusion Models for Unbiased Image Generation

encourage the generated images to be uniformly distributed on the data manifold, without changing the model architecture or requiring labels or retraining.

利用guidance。

 

BayesDiff

BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference

利用Last-layer Laplace Approximation (LLLA)技术估计diffusion model生成样本的不确定度,which can indicate the level of clutter and the degree of subject prominence in the image.不确定度高的样本背景较为混杂,可以过滤掉。

 

CADS

CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling

  1. Reasons for low diversity: the model itself was trained on a small dataset; the CFG scale is too large.

  2. Corrupt the condition $y$ fed to the model: $\hat{y}=\sqrt{\gamma(t)}\,y+s\sqrt{1-\gamma(t)}\,n$, $\hat{y}_{rescaled}=\frac{\hat{y}-\mathrm{mean}(\hat{y})}{\mathrm{std}(\hat{y})}\,\mathrm{std}(y)+\mathrm{mean}(y)$, $\hat{y}_{final}=\psi\,\hat{y}_{rescaled}+(1-\psi)\,\hat{y}$, where $\gamma(t)$ is piecewise: $0$ on $[t_2,T]$, linear from 0 to 1 on $[t_1,t_2]$, and $1$ on $[0,t_1]$. The diffusion model initially only follows the unconditional score and ignores the condition. As we reduce the noise, the influence of the conditional term increases. This progression ensures more exploration of the space in the early stages and results in high-quality samples with improved diversity.

  3. For a class-conditional diffusion model, $y$ is the class embedding; for StableDiffusion, $y$ is the text embedding; for an image-conditional diffusion model, $y$ is the image condition (a minimal sketch follows).
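
A minimal sketch of the annealing rule in item 2; treating $t$ as a scalar and using per-batch statistics (plus an epsilon in the denominator) are implementation assumptions:

```python
import torch

def cads_condition(y, t, t1, t2, s=0.1, psi=1.0):
    """Corrupt the condition embedding y: gamma(t) = 0 for t >= t2, 1 for t <= t1,
    linear in between; then rescale back to y's statistics and mix."""
    gamma = float(min(max((t2 - t) / (t2 - t1), 0.0), 1.0))
    n = torch.randn_like(y)
    y_hat = gamma ** 0.5 * y + s * (1 - gamma) ** 0.5 * n
    y_rescaled = (y_hat - y_hat.mean()) / (y_hat.std() + 1e-8) * y.std() + y.mean()
    return psi * y_rescaled + (1 - psi) * y_hat
```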

 

FPDM

Fixed Point Diffusion Models

FPDM

没什么新理论,只是将DiT block中间的较大的网络换成一个较小的求不动点的 x=ffpθ(x,xinput,t) 的implicit model,其中xinputfpre的输出,训练时使用 Jacobian-Free Backpropagation算法计算ffpθ的梯度。

可以根据精度要求或者计算时间需求动态调整不动点网络迭代次数。

 

DistriFusion

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

  1. 分布式推理。

  2. Patch Parallelism (PP), where a single image is divided into patches and distributed across multiple GPUs for individual and parallel computations.

  3. Using the fact that the difference between the images in adjacent diffusion steps is nearly zero, Patch Parallelism (PP) leverages multiple GPUs communicating asynchronously to compute patches of an image in multiple computing devices based on the entire image (all patches) in the previous diffusion step.

 

PCPP

Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference

  1. PCPP develops PP to reduce computation in inference by conditioning only on parts of the neighboring patches in each diffusion step, which also decreases communication among computing devices.

  2. PCPP decreases the communication cost by around 70% compared to DistriFusion.

 

LCSC

Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better

LCSC

  1. Evolutionary Search

  2. 类似集成学习的效果。

 

Hallucinations

Understanding Hallucinations in Diffusion Models through Mode Interpolation

Hallucinations

  1. Diffusion models smoothly “interpolate” between nearby data modes in the training set, to generate samples that are completely outside the support of the original training distribution; this phenomenon leads diffusion models to generate artifacts that never existed in real data (i.e., hallucinations).

 

Guidance

Classifier

Noisy-Classifier

Diffusion Models Beat GANs on Image Synthesis

  1. Must be trained on $x_t$ in order to obtain exact gradients on $x_t$.

 

DA

Training Diffusion Classifiers with Denoising Assistance

  1. 训练noisy classifier时,把pre-trained diffusion model预测的x^0也当作条件,guidance效果更好。

 

CFG

Classifier-Free Diffusion Guidance

 

GFCG

Gradient-Free Classifier Guidance for Diffusion Model Sampling

GFCG

 

ICG+TSG

No Training, No Problem: Rethinking Classifier-Free Guidance for Diffusion Models

  1. Training a conditional diffusion model no longer requires randomly dropping the condition (replacing it with a null condition).

  2. The unconditional distribution can in principle be computed from the conditional one, $p(x)=\sum_y p(x|y)p(y)$, but that would require multiple forward passes. ICG instead feeds a randomly sampled Gaussian noise as the null condition directly into the conditional diffusion model to obtain the unconditional score.

  3. TSG: $\epsilon_\theta(x_t,\tilde t)+\omega_{TSG}\big(\epsilon_\theta(x_t,t)-\epsilon_\theta(x_t,\tilde t)\big)$, where $\tilde t$ is obtained by adding Gaussian noise to the embedding of $t$.

 

PCG

Classifier-Free Guidance is a Predictor-Corrector

  1. CFG可以看成一种Score-based Generative Models中Predictor-Corrector采样的过程。

 

CG

Compress Guidance in Conditional Diffusion Sampling

CG

  1. 去噪过程可以看成是对KL散度梯度下降优化的过程。

  2. 分类器指导采样同样可以看成类似的过程。

 

DLSM

Denoising Likelihood Score Matching for Conditional Score-Based Data Generation

DLSM

 

CDM

Classification Diffusion Models Revitalizing Density Ratio Estimation

CDM-1 CDM-2

 

  1. 训练一个timestep分类器,根据xt分类t,we set two additional timesteps, 0 and T+1, corresponding to clean images and pure Gaussian noise, respectively,从{0,[1,T],T+1}中随机挑选一个训练分类器。

  2. 可以证明E[ϵ|xt=α¯tx0+1α¯tϵ]=1α¯t(xtFθ(xt,t)+xt),用它进行diffusion训练。

 

DDG

Simple Guidance Mechanisms for Discrete Diffusion Models

  1. 应用于discrete diffusion model的CFG方法。

 

Any Distance Estimator

The recent focus of the conditional diffusion researches is how to incorporate the conditioning gradient during the reverse sampling. This is because for a given loss function l(x), a direct injection of the gradient of the loss computed at xt produces inaccurate gradient guidance.

Use Tweedie's formula to compute $\hat{x}_0$ from $x_t$ and $\epsilon_\theta(x_t,t)$, plug it into $l(\hat{x}_0)$, and use the gradient with respect to $x_t$ as the guidance.

 

MCG

Improving Diffusion Models for Inverse Problems using Manifold Constraints

 

DPS

Diffusion Posterior Sampling for General Noisy Inverse Problems

 

FreeDoM

FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model

Training a distance function between noisy data and the condition and using its gradient as guidance is computationally expensive. Instead, the noise predicted at each step can be used to compute the predicted clean data, and an existing distance function between clean data and the condition can be reused, i.e.:

$D_\phi(c,x_t,t)\approx E_{p(x_0|x_t)}\big[D_\theta(c,x_0)\big]$

This practice is common, but its effect is unstable: it works well for small domains (e.g. faces) but poorly for large domains (ImageNet). The reason: the direction of the unconditional score generated by diffusion models in large data domains has more freedom, making it easier to deviate from the direction of conditional control.

Solution: use RePaint's resampling technique, looping $x_t \xrightarrow{\text{guidance}} x_{t-1} \xrightarrow{\text{diffuse}} x_t$, which amounts to applying guidance multiple times per sampling step; each new $x_t$ is more informative, aligned, and harmonized than the previous one.

 

UGD

Universal Guidance for Diffusion Models

 

LCD-MC

Loss-Guided Diffusion Models for Plug-and-Play Controllable Generation

 

MPGD

Manifold Preserving Guided Diffusion

 

FIGD

Fisher Information Improved Training-Free Conditional Diffusion Model

 

ADMM

Decoupling Training-Free Guided Diffusion by ADMM

 

Training-Free-Guidance

Understanding Training-free Diffusion Guidance: Mechanisms and Limitations

UTFG

  1. 两种改进方法。

 

TFG

TFG: Unified Training-Free Guidance for Diffusion Models

TFG-1

TFG-2

 

GeoGuide

GeoGuide: Geometric Guidance of Diffusion Models

GeoGuide

  1. For a random variable $\epsilon_i\sim\mathcal{N}(0,1)$, the mean of its square is its second central moment, i.e. the variance: $E[\epsilon_i^2]=E[(\epsilon_i-0)^2]=E[(\epsilon_i-E(\epsilon_i))^2]=\sigma_i^2=1$. By the strong law of large numbers, $\epsilon_1^2+\epsilon_2^2+\dots+\epsilon_n^2$ converges to $n$ with probability 1 as $n$ grows. Hence for $\epsilon\in\mathbb{R}^D$, $\epsilon\sim\mathcal{N}(0,I)$, the squared norm $\|\epsilon\|_2^2=\epsilon_1^2+\epsilon_2^2+\dots+\epsilon_D^2$ converges to $D$ with probability 1 as $D$ grows, and therefore $\|\epsilon\|_2$ converges to $\sqrt{D}$.

  2. With the data manifold $M\subset\mathbb{R}^D$ and $x_t=\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\epsilon$, we have $d(x_t,\sqrt{\bar\alpha_t}M)\approx\sqrt{1-\bar\alpha_t}\,\|\epsilon\|_2=\sqrt{1-\bar\alpha_t}\sqrt{D}$, so the guidance is scaled proportionally to $\sqrt{1-\bar\alpha_t}\sqrt{D}$.

 

EluCD

Elucidating The Design Space of Classifier-Guided Diffusion Generation

矫正,不过只能用于off-the-shelf的离散分类器上。

EluCD

 

PnP

Diffusion Models as Plug-and-Play Priors

Variational Inference

采样过程是对引入的variational distribution的点估计采样过程,也是对negtive ELBO最小化的过程,即对variational distribution和真实后验分布之间的KL散度的最小化。

Plug-and-Play

 

Steered-Diffusion

Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

pθ(xt1|xt,c)pθ(xt1|xt)pθ(c|xt1)pθ(c|xt)

xtlogpθ(xt1|xt,c)=xtlogpθ(xt1|xt)xtV1(xt,c)+xtV2(xt1,c)

xt1=1αt(xt1αt1α¯tϵθ(xt,t))xtV1(xt,c)+xtV2(xt1,c)+σtϵ

 

DSG

Guidance with Spherical Gaussian Constraint for Conditional Diffusion

DSG

  1. DSG enhanced DPS by normalizing gradients in the constraint guidance term and implementing a step size schedule inspired by Spherical Gaussians.

 

DreamGuider

DreamGuider: Improved Training free Diffusion-based Conditional Generation

DreamGuider

  1. 求梯度不需要过diffusion network,降低了计算量。

  2. 受SGD算法启发使用动态scale,不需要handcrafted parameter tuning on a case-by-case basis.

 

AutoGuidance

Guiding a Diffusion Model with a Bad Version of Itself

  1. Guiding a high-quality model with a poor model trained on the same task, conditioning, and data distribution, but suffering from certain additional degradations, such as low capacity and/or under-training.

  2. $D_0(x_t,t,c)+\omega\big[D_1(x_t,t,c)-D_0(x_t,t,c)\big]$, where $D_1(x_t,t,c)$ is the properly trained model and $D_0(x_t,t,c)$ is an under-trained or much smaller model.

 

SIMS

Self-Improving Diffusion Models with Synthetic Data

  1. Use self-synthesized data to provide negative guidance during the generation process to steer a model’s generative process away from the non-ideal synthetic data manifold and towards the real data distribution.

  2. 使用训练集训练一个diffusion model,训练完成后使用diffusion model生成一个数据集,用这个生成的数据集再训练一个diffusion model,类似AutoGuidance做CFG。

 

Asymmetric Reverse Process

In the DDIM reverse process, the $P_t$ and $D_t$ terms are made asymmetric.

Asyrp

Diffusion Models Already Have a Semantic Latent Space

  1. 根据l优化模型并输出一个特定的Pt

 

DDS

Fast Diffusion Sampler for Inverse Problems by Geometric Decomposition

  1. Use Tweedie's formula to compute $\hat{x}_0=\frac{x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}$ from $x_t$ and $\epsilon_\theta(x_t,t)$, then optimize a $\Delta x_0$ that makes $l(\hat{x}_0+\Delta x_0)$ as small as possible, and use $\hat{x}_0+\Delta x_0$ as $P_t$. This can be seen as equivalent to Asyrp, just realized differently. Note that $\hat{x}_0$ here is the unconditional prediction; the condition enters only through the optimization.

  2. This is equivalent to gradient descent on $\hat{x}_0$; the DDIM update becomes $x_{t-1}=\sqrt{\bar\alpha_{t-1}}\big(\hat{x}_0-\gamma_t\nabla_{\hat{x}_0}l(\hat{x}_0)\big)+\sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t,t)$.

 

CFG++

CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models

CFG++

  1. Conditional sampling with a diffusion model is viewed as an inverse problem in which the measurement is the condition.

  2. The inverse problem is solved with DDS: each step first computes the unconditional $\hat{x}_0$, then refines it with $l(\hat{x}_0)=\|\epsilon-\epsilon_\theta(x_t,t,c)\|^2$, where $x_t=\sqrt{\bar\alpha_t}\hat{x}_0+\sqrt{1-\bar\alpha_t}\epsilon$, i.e. $\hat{x}_0$ is re-noised at every step. This loss is exactly the SDS loss, similar to DreamSampler.

  3. $l(\hat{x}_0)=\|\epsilon-\epsilon_\theta(x_t,t,c)\|^2=\left\|\frac{x_t-\sqrt{\bar\alpha_t}\hat{x}_0}{\sqrt{1-\bar\alpha_t}}-\frac{x_t-\sqrt{\bar\alpha_t}\hat{x}_0^c}{\sqrt{1-\bar\alpha_t}}\right\|^2=\frac{\bar\alpha_t}{1-\bar\alpha_t}\|\hat{x}_0-\hat{x}_0^c\|^2$, where $\hat{x}_0^c=\frac{x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t,c)}{\sqrt{\bar\alpha_t}}$. The DDIM update then becomes $x_{t-1}=\sqrt{\bar\alpha_{t-1}}\big(\hat{x}_0+\lambda_t(\hat{x}_0^c-\hat{x}_0)\big)+\sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t,t)$, with $\lambda_t=\frac{2\bar\alpha_t}{1-\bar\alpha_t}\gamma_t$.

  4. Viewing $\hat{x}_0+\lambda_t(\hat{x}_0^c-\hat{x}_0)$ as the Tweedie estimate produced by some $\epsilon_\theta^{\lambda_t}$, one can solve $\epsilon_\theta^{\lambda_t}=\epsilon_\theta(x_t,t)+\lambda_t\big[\epsilon_\theta(x_t,t,c)-\epsilon_\theta(x_t,t)\big]$, which is exactly the CFG form.

  5. CFG++ is therefore an asymmetric reverse process: $P_t$ is computed with the CFG-form $\epsilon_\theta^{\lambda_t}$, while $D_t$ uses the unconditional $\epsilon_\theta(x_t,t)$.

  6. $\hat{x}_0+\lambda_t(\hat{x}_0^c-\hat{x}_0)=\lambda_t\hat{x}_0^c+(1-\lambda_t)\hat{x}_0$ is an interpolation, analogous to clamping $\hat{x}_0$ at every DDIM step; $\lambda_t\in[0,1]$ is chosen so that it stays an interpolation between $\hat{x}_0$ and $\hat{x}_0^c$. Compared with CFG: in CFG++, $\lambda_t\in[0,1]$ in $P_t$ prevents extrapolation and deviation from the piecewise linear data manifold, and the unconditional $\epsilon_\theta(x_t,t)$ in $D_t$ prevents manifold offset during renoising.

  7. CFG++ has the same cost as CFG: each step first runs the diffusion model unconditionally to get $\hat{x}_0$, re-noises it to $x_t$, and then runs the model conditionally to get $\hat{x}_0^c$.

  8. It can also be used for DDIM Inversion, fixing the inaccuracy of CFG-based DDIM Inversion when $\omega$ is large.

 

AsyGG

Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance

Guidance

 

Inversion

why

  1. 生成模型是先有隐变量(一般是随机采样的噪声)再有生成样本,Inversion是先有真实数据(非生成的)再找到能生成它本身的隐变量。

  2. 动机是对真实数据做编辑。

 

GAN Inversion

  1. 由于mode collapse,GAN Inversion效果相对较差,过程较为复杂。

 

DDIM Inversion

  1. The DDIM generation step can be written as $a_t=\sqrt{\frac{\bar\alpha_{t-1}}{\bar\alpha_t}}$, $b_t=\sqrt{1-\bar\alpha_{t-1}}-\sqrt{\frac{\bar\alpha_{t-1}(1-\bar\alpha_t)}{\bar\alpha_t}}$, $\epsilon(x_t,t)=\epsilon_\theta(x_t,t,\phi)+\omega\big[\epsilon_\theta(x_t,t,c)-\epsilon_\theta(x_t,t,\phi)\big]$, $x_{t-1}=a_t x_t+b_t\,\epsilon(x_t,t)$. This is not invertible: $x_t$ cannot be recovered analytically from $x_{t-1}$ alone. Under the approximation $\epsilon(x_t,t)\approx\epsilon(x_{t-1},t-1)$ it becomes approximately invertible: $x_{t-1}=a_t x_t+b_t\,\epsilon(x_{t-1},t-1)\Rightarrow x_t=\frac{x_{t-1}-b_t\,\epsilon(x_{t-1},t-1)}{a_t}$, which is the formula used in DDIM Inversion (the reverse ODE); see the sketch after this list.

  2. For unconditional ($\omega=0$) and plain conditional ($\omega=1$) models this approximation is fairly accurate: $x_0\xrightarrow{\text{DDIM Inversion}}x_t\xrightarrow{\text{DDIM}}\hat{x}_0$ reconstructs the image almost perfectly. For large-scale classifier guidance ($\omega>1$) the approximation error is large and reconstruction becomes poor, especially with few steps and large strides. With 50-step generation — upper half: first generate an image from a prompt with $\omega=7$; encoding and decoding with $\omega=0$ reconstructs it well, while encoding and decoding with $\omega=7$ reconstructs it poorly. Lower half: write a prompt for a real image; encoding and decoding with $\omega=0$ reconstructs well, with $\omega=7$ poorly. The right side plots the cosine similarity between $\epsilon(x_t,t)$ and $\epsilon(x_{t-1},t-1)$, showing how well the approximation holds.
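
A minimal sketch of one approximate inversion step $x_{t-1}\to x_t$ under the assumption $\epsilon(x_t,t)\approx\epsilon(x_{t-1},t-1)$; `eps_model` is a placeholder (CFG, if any, is assumed to be folded into it):

```python
import torch

@torch.no_grad()
def ddim_invert_step(eps_model, x_prev, t_prev, t, alphas_bar, cond):
    """x_t = (x_{t-1} - b_t * eps(x_{t-1}, t-1)) / a_t, with
    a_t = sqrt(ab_{t-1} / ab_t), b_t = sqrt(1 - ab_{t-1}) - sqrt(ab_{t-1} * (1 - ab_t) / ab_t)."""
    ab_t, ab_prev = alphas_bar[t], alphas_bar[t_prev]
    eps = eps_model(x_prev, t_prev, cond)          # approximation: evaluated at (x_{t-1}, t-1)
    a_t = (ab_prev / ab_t).sqrt()
    b_t = (1 - ab_prev).sqrt() - (ab_prev * (1 - ab_t) / ab_t).sqrt()
    return (x_prev - b_t * eps) / a_t
```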

EDICT

  1. 如果使用非对称的ω,即x0DDIM InversionxtxtDDIMx^0时使用不同的ω,那么DDIM Inversion时的近似误差会被放大:when ω of the sampling process is different from that of the forward process, the accumulated error would be amplifified, leading to unsatisfactory reconstruction quality. ωenc=0时使用不同ωdec重构的结果:

Prompt-Tuning

  1. 通过grid search(PSNR越大越好)可以看到:每一行中,只有ωdecωenc相同时才能做到该ωenc下最好的重构。如果一定要使用较大的ωdec,最好使用较小的ωenc

Prompt-Tuning-2

  1. In DiffusionAutoencoder, using the inferred $x_T$ that controls the stochastic variations does not reconstruct the original image exactly, precisely because DDIM is not invertible. Likewise, for the inferred $x_T$, encoding with 100 steps usually reconstructs worse than encoding with 1000 steps, because at 1000 steps the approximation $\epsilon(x_t,t)\approx\epsilon(x_{t-1},t-1)$ is more accurate, while at 100 steps its error is relatively large.

  2. Once inversion is accurate, editing works well. Several works target exact inversion, e.g. EDICT, Null-text Inversion, Prompt Tuning, AIDI; see the Image Editing part.

 

Regularized DDIM Inversion

Zero-shot Image-to-Image Translation

At each DDIM Inversion step, the prediction of $\epsilon_\theta$ is refined by gradient descent with two losses: one measures correlations between different spatial locations, the other the KL divergence between each location and a standard Gaussian.

 

Exact Inversion

EDICT

EDICT: Exact Diffusion Inversion via Coupled Transformations

 

AIDI

Effective Real Image Editing with Accelerated Iterative Diffusion Inversion

 

BDIA

Exact Diffusion Inversion via Bi-directional Integration Approximation

 

BELM

BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models

 

High-Order

On Exact Inversion of DPM-Solvers

  1. 高阶采样器的inversion

 

SPDInv

Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

 

EasyInv

EasyInv: Toward Fast and Better DDIM Inversion

 

Parameter-Efficient Fine-Tuning

These methods can be used for different tasks, e.g. data-driven fine-tuning, RLHF fine-tuning, TI fine-tuning, etc.

 

LoRA

LoRA: Low-rank adaptation of large language models

$W=W_0+BA$, $h=h_0+\Delta h=W_0x+BAx$
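
A minimal PyTorch sketch of a LoRA-wrapped linear layer; the rank, scaling, and initialization are common conventions, not prescribed by the formula above:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W0 x + B A x, with W0 frozen and only the low-rank factors A, B trained."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # B = 0 -> starts as W0
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```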

 

AttnLoRA

Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

AttnLoRA

  1. The standard U-Net architecture for diffusion models conditions convolutional layers in residual blocks with scale-and-shift but does not condition attention blocks. Simply adding LoRA conditioning on attention layers improves the image generation quality.

 

TriLoRA

TriLoRA: Integrating SVD for Advanced Style Personalization in Text-to-Image Generation

  1. Compact SVD: $A\in\mathbb{R}^{m\times n}=U_r\Sigma_r V^T$, where $U_r\in\mathbb{R}^{m\times r}$ and $V\in\mathbb{R}^{n\times r}$ have orthonormal columns and $\Sigma_r\in\mathbb{R}^{r\times r}$ is the diagonal matrix of the $r$ largest singular values.

  2. TriLoRA: $W=W_0+U_r\Sigma_r V^T$, $h=h_0+\Delta h=W_0x+U\Sigma V^Tx$; three matrices are learned.

 

PaRa

PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction

  1. $W=W_0-QQ^TW_0$, where $Q$ is the orthogonal factor from the QR decomposition of some learnable matrix $B$; this ensures that the column space of $W$ is a subset of the column space of $W_0$, effectively reducing the dimension of the output while maintaining the key features learned by the model.

 

Terra

Time-Varying LoRA Towards Effective Cross-Domain Fine-Tuning of Diffusion Models

Terra

 

SVDiff

SVDiff: Compact Parameter Space for Diffusion Fine-Tuning

  1. Similar to FSGAN: apply SVD to the convolution weights, $W=U\Sigma V^T$ with $\Sigma=\mathrm{diag}(\sigma)$. When fine-tuning, only an offset of the singular values is learned, $\Sigma_\delta=\mathrm{diag}(\mathrm{ReLU}(\sigma+\delta))$, and the final convolution weight is $W_\delta=U\Sigma_\delta V^T$. With so few trainable parameters, overfitting is less likely.

 

PET

A Closer Look at Parameter-Efficient Tuning in Diffusion Models

PET

  1. 为预训练StableDiffusion加小参数量的可训练的adapter进行transfer learning,探索了adapter位置和网络结构对训练的影响。

 

StyleInject

StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models

StyleInject

  1. Improves on LoRA ($W=W_0+BA$, $h=h_0+\Delta h=W_0x+BAx$).

 

LyCORIS

Navigating Text-To-Image Customization From LyCORIS Fine-Tuning to Model Evaluation

LyCORIS

 

OFT

Controlling Text-to-Image Diffusion by Orthogonal Finetuning

  1. $z=W^Tx=(RW_0)^Tx\ \ \text{s.t.}\ \ R^TR=RR^T=I,\ \|R-I\|\le\epsilon$; only $R$ is optimized.

  2. OFT is an alternative fine-tuning method; it performs better than LoRA, with fewer parameters and faster convergence.

  3. Not related to OrthoAdaptation.

 

BOFT

Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization

  1. 改进OFT。

 

SODA

Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models

PEFT-SODA

 

SCEdit

SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing

SCEdit

  1. 只用SC-Tuner。

 

DiffFit

DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning

DiffFit

  1. 针对DiT的PEFT方法。

  2. 还支持将低分辨率模型fine-tune到高分辨率,对positional embedding进行插值,比如提高一倍分辨率时,原来的(i,j)变为(i2,j2)

 

Diffscaler

Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers

Diffscaler

  1. DiT for incremental class-conditional generation.

  2. 为incremental class添加class embedding。

  3. Affiner: Wx+b^(1+a)Wx+b^+b+sWupReLU(Wdownx). For transformer models, we add our Affiner block for each key, query, value weights and bias parameters as well as the MLP block.

 

SaRA

SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation

SaRA

  1. 借助减枝理论,模型中有一些ineffective parameter,即绝对值小于某个阈值的那些参数。

  2. These currently ineffective parameters are caused by the training process and can become effective again in following training.

 

FINE

FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models

FINE

  1. 奇异值分解,只fine-tune奇异值。

 

Text-to-Image

Awesome

VQ-Diffusion

Vector Quantized Diffusion Model for Text-to-Image Synthesis

VQVAE + multinomial diffusion

transformer blocks: input $x_t$, cross-attention with text, NAR predicting $x_{t-1}$

VQ-Diffusion

 

GLIDE

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  1. First to propose the text → 64x64 → 256x256 generation pipeline.

text → 64x64: train a TransformerEncoder to encode the text (length K), producing K vectors (K×d); two conditioning methods are used together:

First: the last vector replaces the class embedding of AdaGN in ADM.

Second: the K vectors are also injected into the attention modules throughout the UNet. Concretely, each AttentionBlock trains an extra 1-D convolution ($d \to d_c$) mapping the K vectors to $2 \times d_c \times K$ (textual KV); the current feature map of $x_t$ ($d_c \times h \times h$) is mapped to $3 \times d_c \times h^2$ (visual QKV); $d_c$ is then split into n_heads groups, the visual and textual KV are concatenated along the spatial dimension, and multi-head attention is applied, giving an attention map of size $h^2 \times (h^2 + K)$. This is equivalent to a hybrid attention combining a self-attention ($h^2 \times h^2$) and a cross-attention ($h^2 \times K$).

attention

64x64 → 256x256: uses ADM's super-resolution method with the same conditioning methods as above, but a smaller TransformerEncoder encodes the text.

  1. classifier-free guidance

After the conditional model above is trained, it is fine-tuned with the text replaced by the empty string 20% of the time, yielding a classifier-free model.

  1. Text-Guided Inpainting Model

Using a pre-trained DDPM for inpainting means replacing the unmasked part of $x_t$ at every sampling step with a sample from $q(x_t|x_0)$; but then the model never sees the full unmasked content, only a noisy version of it, which produces artifacts at the mask boundary.

After the conditional model above is trained, randomly mask part of $x_0$ to get a masked $x_0$; concat the masked $x_0$ and the mask (RGB + mask, four extra channels) onto the unmasked $x_t$, feed it to the UNet, compute the loss only on the masked region, and fine-tune to obtain an inpainting model. Only the input channel count of the first Conv layer is increased; everything else stays the same.

At sampling time, as above, the unmasked part of $x_t$ is replaced at every step with a sample from $q(x_t|x_0)$.

 

DALLE-2 (unCLIP)

Hierarchical Text-Conditional Image Generation with CLIP Latents

  1. text → 64×64 → 256×256 → 1024×1024. Two diffusion upsampler models, neither conditioned on text; the low-resolution image is lightly augmented and concatenated onto $x_t$ (Gaussian blur for 64×64 → 256×256, diverse BSR degradations for 256×256 → 1024×1024). To reduce training compute and improve numerical stability, we train upsamplers on random crops of images that are one-fourth the target size. We use only spatial convolutions in the model (i.e., no attention layers) and at inference time directly apply the model at the target resolution, observing that it readily generalizes to the higher resolution.

  2. Prior: conditioned on text, a DDPM models the image CLIP embedding. The text is encoded with GLIDE's text encoder, and a pre-trained CLIP encodes the text and image. A TransformerDecoder takes, in order, the encoded text, the CLIP text embedding, the timestep embedding, the noised CLIP image embedding, and a placeholder embedding, with a causal attention mask (each position attends only to earlier ones); the output at the placeholder position predicts the unnoised CLIP image embedding. Instead of $\epsilon$-prediction it uses $z_i$-prediction, optimized with MSE. At sampling time, two $z_i$ are sampled and the one closer to $z_t$ is kept.

  1. Decoder: conditioned on the image CLIP embedding and the text, a DDPM models the image, using GLIDE's two conditioning methods: first, the CLIP image embedding is projected to the required dimension and replaces the class embedding of AdaGN in ADM; second, the CLIP image embedding is projected to a token sequence of length 4 and concatenated after the encoded text token sequence (K+4), after which hybrid attention is used as in GLIDE.

  2. CFG:Prior:randomly dropping text conditioning information 10% of the time during training. Decoder:randomly setting the CLIP embeddings to zero (or a learned embedding) 10% of the time, and randomly dropping the text caption 50% of the time during training

 

DALL·E-3

Improving Image Generation with Better Captions

  1. Existing text-to-image models struggle to follow detailed image descriptions and often ignore words or confuse the meaning of prompts. We hypothesize that this issue stems from noisy and inaccurate image captions in the training dataset. We address this by training a bespoke image captioner and use it to recaption the training dataset. We then train several text-to-image models and find that training on these synthetic captions reliably improves prompt following ability. The model is trained on the recaptioned dataset, using the synthetic captions 95% of the time and the original captions 5% of the time.

  2. A text-conditioned convolutional UNet latent diffusion model on top of the latent space learned by the VAE.

  3. Once trained, we used the consistency distillation process to bring it down to two denoising steps.

 

CogView3

CogView3: Finer and Faster Text-to-Image via Relay Diffusion

CogView3

  1. Like DALL·E-3, trained on a recaptioned dataset.

  2. The Base Stage is an EDM-style StableDiffusion at 512×512 with 8× latent compression.

  3. The SR Stage is an RDM in latent space (the original RDM works in pixel space), trained only on timesteps $[0, T_r]$, with the hand-off at $T_r$. Instead of blurring diffusion, the 1024×1024 image $x$ is downsampled to 512×512 and upsampled back to 1024×1024 to get $x_L$; both are encoded into the latent space, giving $z$ and $z_L$, and the forward process becomes $\frac{T_r - t}{T_r} z + \frac{t}{T_r} z_L + \sigma \epsilon$, i.e., interpolation replaces blurring, again to bridge the gap left by naive upsampling.

  4. At sampling time, the Base Stage output is upsampled to 1024×1024, encoded to the latent space, noised, and fed into the RDM for sampling.

 

SD

High-Resolution Image Synthesis with Latent Diffusion Models

 

SDXL

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

  1. Architecture: following SimpleDiffusion's third lesson, the blocks are distributed unevenly. SD uses [1,1,1,1], i.e., 4 levels with 1 block each and 3 downsamples; SDXL uses [0,2,10], i.e., 3 levels, where the first level only downsamples without further processing and the second and third levels have 2 and 10 blocks respectively, with 2 downsamples. Two text encoders are used and their outputs are concatenated. The parameter count is 3× that of SD.

  2. Micro-Conditioning on Image Size: dataset images come in varied sizes. SD simply drops small images, losing a large fraction of the data; upsampling small images to the target size instead makes them blurrier than genuine high-resolution images, so the model learns to output blurry images. SDXL feeds the original image size as a condition, added to the time embedding. The network still outputs images at the target size, but their sharpness is governed by this condition. The image quality clearly increases when conditioning on larger image sizes.

  3. Micro-Conditioning on Cropping Parameters: a major issue with SD is that objects are sometimes cut off in the output. This comes from data processing: the shorter side is resized to the target size and the longer side is cropped. SDXL feeds the crop position as a condition, added to the time embedding; at inference, passing (0,0) yields images with complete objects.

SDXL

  1. Multi-Aspect Training: SD uses a fixed output size. After pre-training at the target size, SDXL is fine-tuned on images of multiple aspect ratios: images are assigned to the nearest size bucket, resized to the bucket's size, and each training batch is sampled from one randomly chosen bucket. At inference, images of different sizes can be generated simply by feeding noise of the target size.

  2. Improved VAE autoencoder: batch size 256 (previously 9) + EMA.

  3. Train first at 256x256 (with micro-conditioning), then at 512x512 (with micro-conditioning), then do Multi-Aspect Training at 1024x1024 (buckets are built around 1024x1024 by adjusting width and height in steps of 64 while keeping the pixel count close to 1024x1024).

  4. Refinement Stage:We train a separate LDM in the same latent space, which is specialized on high-quality, high resolution data and employ a noising-denoising process as introduced by SDEdit on the samples from the base model. We follow and specialize this refinement model on the first 200 (discrete) noise scales. During inference, we render latents from the base SDXL, and directly diffuse and denoise them in latent space with the refinement model, using the same text input.

 

iSDXL

On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models

  1. Straightforward implementation of control conditions in DiT may cause interference between the time-step and class-level or control conditions (macro-conditioning) if their corresponding embeddings are additively combined in the adaptive layer norm conditioning.

  2. For class, we move the class embedding to be fed through the attention layers present in the DiT blocks.

  3. For control conditions, we zero out the control embedding in early denoising steps, and gradually increase its strength.

 

SD3

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

SD3

  1. In Conditional Flow Matching, $z_t = a_t z_0 + b_t \epsilon$ with $a_0 = 1, b_0 = 0, a_1 = 0, b_1 = 1$. Then $\frac{d}{dt}z_t = a_t' z_0 + b_t' \epsilon = \frac{a_t'}{a_t}(z_t - b_t\epsilon) + b_t'\epsilon = \frac{a_t'}{a_t}z_t - b_t\left(\frac{a_t'}{a_t} - \frac{b_t'}{b_t}\right)\epsilon$, so $L_{CFM} = E_{t,\epsilon}\left\|v_\theta(z_t,t) - \frac{a_t'}{a_t}z_t + b_t\left(\frac{a_t'}{a_t} - \frac{b_t'}{b_t}\right)\epsilon\right\|^2$. Switching to the $\epsilon$-prediction parameterization, $L_{CFM} = E_{t,\epsilon}\left(b_t\left(\frac{a_t'}{a_t} - \frac{b_t'}{b_t}\right)\right)^2\|\epsilon_\theta(z_t,t) - \epsilon\|^2$. With the SNR $\lambda_t = \log\frac{a_t^2}{b_t^2}$ we get $\lambda_t' = 2\left(\frac{a_t'}{a_t} - \frac{b_t'}{b_t}\right)$, so $L_{CFM} = E_{t,\epsilon}\left(\frac{b_t}{2}\lambda_t'\right)^2\|\epsilon_\theta(z_t,t) - \epsilon\|^2$. To unify different methods, use $L_\omega = -\frac{1}{2}E_{t,\epsilon}\left[\omega_t\lambda_t'\|\epsilon_\theta(z_t,t) - \epsilon\|^2\right]$; for CFM, $\omega_t = -\frac{1}{2}\lambda_t' b_t^2$. Different methods (Rectified Flow, DDPM, EDM, PD v-prediction, etc.) can all be seen as CFM with different $z_t$ and $\omega_t$.

  2. MMDiT architecture: a ViT in latent space with 16 latent channels; since text and image embeddings are conceptually quite different, two separate sets of weights are used for the two modalities.

  3. The distribution from which timesteps/SNRs are sampled during training matters a lot.

  4. Rectified Flow ($z_t = (1-t)z_0 + t\epsilon$, $\omega_t = \frac{t}{1-t}$) formulations generally perform well and, compared to other formulations, their performance degrades less when reducing the number of sampling steps.

 

PGv3

Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models

PGv3

  1. We believe that the continuity of information flow through every layer of the LLM is what enables its generative power and that the knowledge within the LLM spans across all its layers, rather than being encapsulated by the output of any single layer.

 

SANA

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

  1. A VAE with a 32× downsampling factor and patch size 1; the minimum resolution is 1024×1024.

 

Wuerstchen

StableCascade

Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

Three stages to reduce computational demands

  1. Stage A: train a VQGAN with 4× downsampling, 1024 → 256.

  2. Semantic Compressor: resize the image from 1024 to 768 and train a network that compresses it to 16×24×24.

  3. Stage B: a diffusion model over the pre-quantization embedding from Stage A, conditioned on the semantic compressor's output for the image (Wuerstchen also adds text as an extra condition); effectively a form of self-conditioning.

  4. Stage C: a diffusion model over the semantic compressor's output, conditioned on text.

  5. Generation runs C → B → A.

 

Imagen

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

  1. text → 64×64 → 256×256 → 1024×1024; all three models are conditioned on text.

text → 64×64: GLIDE's two conditioning methods.

64×64 → 256×256: Efficient UNet; GLIDE's two conditioning methods.

256×256 → 1024×1024: Efficient UNet; no self-attention, only cross-attention, to reduce compute.

Use noise conditioning augmentation for both super-resolution models.

  1. All three models use CFG.

  2. Dynamic thresholding (sampling only); see the sketch after this list.

With large guidance weights, the pixel values of the predicted $\hat{x}_0$ at each step easily go out of range; the usual fix is to clip to $(-1, 1)$. Imagen thresholds dynamically: at each step, take the absolute values of all pixels of $\hat{x}_0$ and set $s$ to a chosen percentile (e.g., the 80th); if $s > 1$, clip all pixels to $(-s, s)$ and divide by $s$, otherwise clip as usual.

  1. A pure text encoder works better than a text encoder jointly trained on image-text data.
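A minimal sketch of dynamic thresholding as described above (percentile value is illustrative):

```python
import torch

def dynamic_threshold(x0_pred: torch.Tensor, percentile: float = 0.8) -> torch.Tensor:
    """Clip the predicted x0 to (-s, s) and rescale by s, where s is a per-sample percentile of |x0|."""
    b = x0_pred.shape[0]
    flat = x0_pred.reshape(b, -1).abs()
    s = torch.quantile(flat, percentile, dim=1)                       # one threshold per sample
    s = torch.clamp(s, min=1.0).view(b, *([1] * (x0_pred.dim() - 1)))
    return x0_pred.clamp(-s, s) / s       # when s == 1 this is just plain (-1, 1) clipping
```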

 

YaART

YaART: Yet Another ART Rendering Technology

YaART

  1. text → 64×64 → 256×256 → 1024×1024; the first two models are conditioned on text, the last one is not.

text → 64×64: GLIDE's two conditioning methods.

64×64 → 256×256: only the AdaGN conditioning method.

256×256 → 1024×1024: Efficient UNet, no text conditioning.

  1. Fine-tune text → 64×64 and 64×64 → 256×256 with high-quality image-text pairs; fine-tune 256×256 → 1024×1024 with an SR dataset.

  2. RL alignment for text → 64×64.

 

eDiffi

eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

  1. text → 64×64 → 256×256 → 1024×1024, as in Imagen.

  2. Uses both the T5 text encoder and the CLIP text encoder.

  3. Observes that different timesteps rely on the text to different degrees.

Proposes splitting the model: each sub-model is trained only on a sub-range of noise levels and is called an expert; the final model is an Ensemble of Expert Denoisers.

 

RAPHAEL

RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

  1. eDiffi uses experts over different timesteps; RAPHAEL additionally uses experts over different spatial regions.

  2. Space MoE: a threshold on the cross-attention map determines a mask for each word; a routing network selects an expert per word, which generates that word's feature; the features of all words, multiplied by their masks, are averaged as the output.

 

PixArt-α

PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

PixArt-alpha

  1. DiT with cross-attention added to inject text.

  2. In the DiT architecture, AdaLN accounts for as much as 27% of the parameters, but text-to-image has no class condition, only the timestep, so far fewer parameters are needed. AdaLN-single: one MLP outside all blocks predicts the global AdaLN parameters (6 of them) from the timestep, and each block additionally trains a length-6 AdaLN parameter that is added to the global one to give the final AdaLN parameters, drastically reducing the parameter count (see the sketch after this list).

  3. Three-stage training: initialize from a pre-trained class-conditional ImageNet model, which saves text-to-image training time and is itself cheap to train; train on a dataset with highly aligned, information-dense captions to achieve text-image alignment; finally, like Emu, fine-tune on a small set of high-quality images.
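A minimal sketch of the AdaLN-single idea, assuming hypothetical dimensions and a per-block offset of the same width as the shared parameters (the note describes it as length 6):

```python
import torch
import torch.nn as nn

class AdaLNSingle(nn.Module):
    """One shared MLP predicts the 6 modulation params from the timestep;
    each block only stores a small learnable offset added on top."""
    def __init__(self, t_dim: int, hidden: int, n_blocks: int):
        super().__init__()
        self.shared = nn.Sequential(nn.SiLU(), nn.Linear(t_dim, 6 * hidden))   # computed once per step
        self.block_offsets = nn.ParameterList(
            [nn.Parameter(torch.zeros(6 * hidden)) for _ in range(n_blocks)]    # per-block, tiny
        )

    def forward(self, t_emb: torch.Tensor, block_idx: int):
        params = self.shared(t_emb) + self.block_offsets[block_idx]
        # shift / scale / gate for the attention and MLP sub-layers of this block
        return params.chunk(6, dim=-1)
```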

 

PixArt-Σ

PixArt-sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

PixArt-sigma

  1. Trained on higher-resolution images with fine-grained captions.

  2. To reduce computation, self-attention uses KV compression: features within a neighboring $R \times R$ patch are similar and redundant, so a convolution shrinks the KV; a 2×2 convolution is used, initialized to $\frac{1}{R^2}$ so that it starts out equivalent to average pooling. Q is kept unchanged to preserve information.

  3. Weak-to-Strong Training Strategy: PixArt-α serves as the weak model. The VAE is directly replaced by one for high-resolution images; when switching to higher resolution, DiffFit's positional-embedding interpolation is used; and KV compression can be enabled directly when training the strong model even though the weak model did not use it.

 

Flag-DiT

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

 

Next-DiT

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

 

GenTron

GenTron: Diffusion Transformers for Image and Video Generation

GenTron

  1. adaLN design yields superior results in terms of the FID, outperforming both cross-attention and in-context conditioning in efficiency for class-based scenarios. However, our observations reveal a limitation of adaLN in handling free-form text conditioning. Cross-attention uniformly excels over adaLN in all evaluated metrics.

 

PanGu-Draw

PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion

PanGu-Draw-1

PanGu-Draw-2

PanGu-Draw-3

  1. Cascaded Training: three models at different resolutions are trained separately. Resolution Boost Training: train at low resolution first, then at high resolution.

  2. Time-Decoupled Training: the timesteps are split into two phases; the earlier phase is mainly responsible for generating structure and the later phase for refinement. The earlier phase needs a large amount of text-image pairs so the model learns diverse concepts. Previous models filter out low-resolution data, but here that is unnecessary: low-resolution images are upsampled to high resolution for training, because the earlier phase generates $x_{T_{struct}}$, which is noisy anyway, so blurry upsampled images do not hurt. The later phase is trained at low resolution and sampled at high resolution.

  3. Coop Diffusion: diffusion models trained in different latent spaces and at different resolutions can be combined at sampling time, using the image space as the intermediary for conversion.

 

ParaDiffusion

Paragraph-to-Image Generation with Information-Enriched Diffusion Model

  1. Targets generation from long prompts describing complex scenes.

  2. Uses a decoder-only language model to train the T2I model. The upside is that GPT-style models are already strong, model long text well, and have abundant training data; the downside is that pre-trained decoder-only models are weaker at feature extraction, so adaptation is needed. Efficiently fine-tuning a more powerful decoder-only language model can yield stronger performance in long-text alignment (up to 512 tokens).

ParaDiffusion

 

KNN-Diffusion

KNN-Diffusion Image Generation via Large-Scale Retrieval

No text-image pairs are needed for training; images serve as the condition, with CLIP as the bridge. During training, kNN over the cosine distance of CLIP image embeddings selects N images similar to the training image as the condition. At sampling time, kNN over the cosine distance between the CLIP text embedding and image embeddings selects N images similar to the prompt as the condition.

 

RDM

Retrieval-Augmented Diffusion Models

Trained with the CLIP embeddings of the training sample's k-NN as the condition; at sampling time, the k-NN can be retrieved from the text, or the text's CLIP embedding can be used directly.

 

Enhancement

Re-Imagen

Re-Imagen: Retrieval-Augmented Text-to-Image Generator

Re-Imagen-1

Re-Imagen-2

  1. Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities. A generative model that uses retrieved information can produce high-fidelity and faithful images, even for rare or unseen entities.

  2. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. Re-Imagen is augmented with the knowledge of high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities’ visual appearances.

  3. A cross-attention layer is added after the UNet encoder to interact with the neighbors; the same UNet encoder (with t set to 0) encodes the neighbors as key-value, and all parameters are trained jointly.

  4. At sampling time, user-provided reference images can serve as neighbors, giving a Textual-Inversion-like effect.

 

Latent Transparency

Transparent Image Layer Diffusion using Latent Transparency

Latent-Transparency

  1. An encoder and a decoder are trained on transparent-image data: the encoder predicts an offset in the VAE latent space (the "latent transparency") from the RGB and alpha images; the offset is added to the RGB image's latent, i.e., the latent distribution is adjusted so the decoder can recover both the RGB and alpha images from the modified latent, while disturbing the VAE reconstruction as little as possible so StableDiffusion still works. The loss has two parts: the decoder's reconstruction loss for the RGB and alpha images, and a VAE reconstruction loss that keeps the predicted offset from shifting the latent distribution.

  2. Fine-tune StableDiffusion on the new latent distribution.

 

LayerDiff

LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

LayerDiff-1

LayerDiff-2

  1. One background layer and K foreground layers; each foreground layer has an image and a mask, foreground layers do not overlap, and the final image is all foreground layers composited together with the background layer filling the gaps.

  2. Training data is constructed with InstructBLIP, SAM, and the StableDiffusion inpainting model.

 

AFA

Ensembling Diffusion Models via Adaptive Feature Aggregation

AFA

  1. Ensemble learning.

  2. AFA dynamically adjusts the contributions of multiple models at the feature level according to various states (i.e., prompts, initial noises, denoising steps, and spatial locations), thereby keeping the advantages of multiple diffusion models, while suppressing their disadvantages.

 

Diffusion-Soup

Diffusion Soup: Model Merging for Text-to-Image Diffusion Models

  1. Diffusion Soup enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging.

  2. Diffusion Soup approximates ensembling, and involves fine-tuning n diffusion models on n data sources respectively, and then averaging the parameters.

 

NSO

Not All Noises Are Created Equally: Diffusion Noise Selection and Optimization

NSO

  1. A good noise should remain unchanged after generation followed by inversion; with this criterion one can either select a good noise for a given prompt or optimize one.

 

AsyVQGAN

Designing a Better Asymmetric VQGAN for StableDiffusion

  1. Improves the latent space that StableDiffusion models.

  2. The decoder gets a conditional branch that takes a task-specific prior, e.g., the unmasked image in inpainting.

  3. The decoder is much larger than the encoder, improving detail reconstruction.

 

CG

Counting Guidance for High Fidelity Text-to-Image Synthesis

  1. A pre-trained counting network takes $\hat{x}_0$ at each step; the difference between the predicted and the desired count gives a loss whose gradient is used as guidance.

 

IoCo

Iterative Object Count Optimization for Text-to-image Diffusion Models

IoCo

  1. Solves the counting problem with a Textual-Inversion-style approach.

 

QUOTA

QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain

QUOTA

  1. A Textual-Inversion-style approach to counting, with meta-learning.

 

MuLan

MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

  1. a training-free Multimodal-LLM agent that can progressively generate multi-object with planning and feedback control, like a human painter.

 

DIFFNAT

DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics

  1. We propose a generic "naturalness"-preserving loss function, the kurtosis concentration (KC) loss, trained jointly with the diffusion loss.

 

FreeU

FreeU: Free Lunch in Diffusion U-Net

  1. Training-free; improves generation with just two coefficients.

  2. The UNet decoder's features have two parts: the backbone feature produced by the decoder itself and the skip feature coming from the encoder's skip connection at the same resolution.

  3. Multiplying the backbone feature by a coefficient $b$: as $b$ grows, the UNet's denoising ability strengthens and image quality improves, but high-frequency information is suppressed.

  4. Experiments show the skip feature carries more high-frequency detail, so a spectral (FFT-based) scaling is applied to the skip feature to restore the suppressed high-frequency information.
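A simplified sketch of the two FreeU knobs, with illustrative default values for `b` and `s`; the spectral mask here scales the centered low-frequency band of the skip feature, which relatively restores its high-frequency detail (the exact mask shape is an assumption):

```python
import torch

def freeu_merge(backbone: torch.Tensor, skip: torch.Tensor, b: float = 1.2, s: float = 0.9):
    """Scale decoder backbone features by b and spectrally modulate the skip features."""
    backbone = backbone * b                                          # stronger denoising backbone
    fft = torch.fft.fftshift(torch.fft.fft2(skip.float()), dim=(-2, -1))
    h, w = skip.shape[-2:]
    mask = torch.ones(h, w, device=skip.device)
    mask[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = s                 # low freqs sit at the center after fftshift
    fft = fft * mask
    skip = torch.fft.ifft2(torch.fft.ifftshift(fft, dim=(-2, -1))).real.to(skip.dtype)
    return torch.cat([backbone, skip], dim=1)                        # then proceed as in the normal UNet decoder
```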

 

Omegance

Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis

  1. $z_{t-1} = a_t z_t + \omega\, b_t\, \epsilon_\theta(z_t, t)$: the direction term of the DDIM update is scaled by a single coefficient (see the sketch after this list).

  2. ω<1 enhances detail, making it well-suited for generating a busier crowd in a marketplace, intricate patterns in clothing design, or fine textures in elements like sand or waves.

  3. ω>1 produces smoother, simpler visuals, ideal for scenes with clear skies, calm waters, or minimalist designs, where a streamlined aesthetic is preferred.

  4. Different regions can be generated with different $\omega$.
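A minimal sketch of one DDIM step written in the $a_t z_t + \omega\, b_t\, \epsilon$ form above (coefficient names follow the note, not any particular library):

```python
import torch

@torch.no_grad()
def ddim_step_with_omega(z_t, eps, alpha_bar_t, alpha_bar_prev, omega=1.0):
    """One deterministic DDIM update, z_{t-1} = a_t z_t + omega * b_t * eps.
    omega < 1 pushes toward busier, more detailed outputs; omega > 1 toward smoother ones.
    omega may also be a spatial tensor so different regions use different granularity."""
    a_t = (alpha_bar_prev / alpha_bar_t).sqrt()
    b_t = (1 - alpha_bar_prev).sqrt() - (alpha_bar_prev * (1 - alpha_bar_t) / alpha_bar_t).sqrt()
    return a_t * z_t + omega * b_t * eps
```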

 

SR

Fine-grained Text-to-Image Synthesis with Semantic Refinement

  1. KNN-Diffusion (language-free training); at sampling time, reference images are selected according to the semantics of the text and noised, the dot product between the CLIP embeddings of the noised reference image and $x_t$ is computed, and its gradient serves as guidance.

 

ConceptSliders

Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models

ConceptSliders

  1. Correcting or editing a target concept.

  2. $\eta$ is a constant; "sliding" means scaling the LoRA coefficient, i.e., the $\alpha$ in $W + \alpha \Delta W$.

  3. The enhanced and suppressed attributes can be designed via prompt engineering, which helps fix problems such as hand generation.

 

PromptSliders

Prompt Sliders for Fine-Grained Control, Editing and Erasing of Concepts in Diffusion Models

PromptSliders

  1. Similar to ConceptSliders, but the LoRA is trained on the text encoder.

  2. $\alpha$ is the LoRA scaling coefficient; $\eta$ is the CFG scale.

 

LaVi-Bridge

Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation

  1. Trains an adapter to combine different pre-trained language models with pre-trained text-to-image models.

  2. Given any pre-trained text encoder $f$ and text-to-image generator $g$, for a prompt $y$ let $c = f(y)$ and generate with $g(h(c))$; the MLP adapter $h$ is trained with $g$'s loss while $f$ and $g$ are LoRA fine-tuned. Only a small number of text-image pairs are needed for adaptation.

 

Multi-LoRA

Multi-LoRA Composition for Image Generation

  1. training-free

  2. The noise predictions $\epsilon_{\theta,\theta_i}$ obtained with each LoRA are combined directly during sampling (instead of merging the LoRA parameters); see the sketch below.
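A minimal sketch of output-level composition; `enable`/`disable` stand in for whatever adapter-switching API the codebase provides (assumed, not a real library call):

```python
import torch

@torch.no_grad()
def compose_lora_eps(unet, z_t, t, cond, lora_adapters):
    """Training-free composition: average the noise predictions produced with each LoRA
    active in turn, rather than merging LoRA weights."""
    eps_sum = torch.zeros_like(z_t)
    for adapter in lora_adapters:
        adapter.enable(unet)                  # activate this LoRA only (hypothetical API)
        eps_sum += unet(z_t, t, cond)
        adapter.disable(unet)
    return eps_sum / len(lora_adapters)
```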

 

 

Position

CompFuser

Unlocking Spatial Comprehension in Text-to-Image Diffusion Models

  1. For a prompt with two objects in a left/right relation, first generate one object normally, then edit with an instruction like "place * on the left".

  2. The editing model resembles InstructPix2Pix: LLM-grounded diffusion produces layouts for the two objects; the source image is generated from only one layout and the target image from both; together with the instruction, InstructPix2Pix is LoRA fine-tuned.

 

CoMPaSS

CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

CoMPaSS

  1. Given a prompt with spatial relations, paraphrase it (same meaning) and also negate or swap it (changed meaning), then feed all variants to the text encoder and compare embedding similarities. In theory the paraphrase should be the most similar to the original, yet popular text encoders fail more than 90% of the time.

  2. The SCOP dataset is constructed to fine-tune the diffusion model; during fine-tuning, a RoPE-like positional embedding is added to Q and K to augment the conditioning text signals.

 

Noise

GoldenNoise

Golden Noise for Diffusion Models: A Learning Framework

GoldenNoise

  1. The figure is drawn incorrectly: NPNet consists of two networks, i.e., two ways of predicting the target noise, both taking the source noise $x_T$ as input; their predictions are weighted by trainable coefficients $\alpha$ and $\beta$ to form the final output, trained with an MSE loss against the target noise.

 

NoiseDiffusion

Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

NoiseDiffusion

  1. Back-propagates an LVLM's gradient through the diffusion chain to optimize the initial noise.

  2. Back-propagating through the chain directly is expensive, so every $\epsilon_\theta(z_t)$ is treated as a constant; then $\frac{\partial z_{t-1}}{\partial z_t} = \sqrt{\frac{\bar\alpha_{t-1}}{\bar\alpha_t}} + \left(\sqrt{1-\bar\alpha_{t-1}} - \sqrt{\frac{\bar\alpha_{t-1}(1-\bar\alpha_t)}{\bar\alpha_t}}\right)\frac{\partial \epsilon_\theta(z_t)}{\partial z_t}$ with $\frac{\partial \epsilon_\theta(z_t)}{\partial z_t} = 0$, so back-propagating through the whole diffusion chain amounts to multiplying by a constant factor.

 

Attention

ERNIE-ViLG 2.0

ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts

  1. A training scheme that improves image-text alignment by injecting prior knowledge.

  2. NLP tools tag the key words in the text, and their attention weights with image tokens are boosted in cross-attention.

  3. Object detection locates the regions of the objects mentioned in the text, and the diffusion-loss weight of those regions is increased.

 

TokenCompose

TokenCompose: Grounding Diffusion with Token-level Supervision

TokenCompose

  1. SAM extracts masks for the objects named in the prompt; StableDiffusion is fine-tuned with two auxiliary cross-attention-map losses in addition to the diffusion loss (see the sketch after this list).

  2. $L_{token} = \frac{1}{N}\sum_i^N\left(1 - \frac{\sum_{u \in M_i} CAM_{i,u}}{\sum_u CAM_{i,u}}\right)$, i.e., raise the fraction of attention mass inside the mask. This loss does not guarantee uniform responses, since concentrating high responses in a sub-region of $M_i$ also optimizes it, so a full-region cross-entropy loss is added: $L_{pixel} = -\frac{1}{NL}\sum_i\sum_u\left[M_{i,u}\log(CAM_{i,u}) + (1 - M_{i,u})\log(1 - CAM_{i,u})\right]$.
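A minimal sketch of the two auxiliary losses above, assuming cross-attention maps and SAM masks are already flattened to (N, HW):

```python
import torch

def tokencompose_losses(cam: torch.Tensor, masks: torch.Tensor, eps: float = 1e-8):
    """cam: (N, HW) cross-attention maps for N noun tokens; masks: (N, HW) binary masks.
    L_token raises the fraction of attention mass inside each mask; L_pixel is a per-pixel
    BCE that spreads the response over the whole masked region."""
    inside = (cam * masks).sum(dim=1)
    total = cam.sum(dim=1) + eps
    l_token = (1.0 - inside / total).mean()
    cam_clamped = cam.clamp(eps, 1 - eps)                       # BCE needs values in (0, 1)
    l_pixel = -(masks * cam_clamped.log()
                + (1 - masks) * (1 - cam_clamped).log()).mean()
    return l_token, l_pixel
```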

 

Local-Control

Local Conditional Controlling for Text-to-Image Diffusion Models

  1. StableDiffusion + ControlNet

  2. training-free

  3. If the ControlNet input contains control information for only one object — e.g., for the prompt "a dog and a cat" the ControlNet input only has the cat's bounding box — the prompt concept most related to the local control condition dominates the generation process while other prompt concepts are ignored. Consequently, the generated image cannot align with the input prompt; the dog tends to disappear.

  4. For an object with local control, a rough mask is estimated from the control signal, and the loss is the difference between the maximum of that object's cross-attention map inside the mask and outside it. For an object without local control, the region outside the mask is treated as its own region (and the inside as not its own), and the loss is computed the same way. The losses are summed and their gradient is used as guidance.

  5. The mask is also applied to ControlNet's skip-connection features so that ControlNet only affects features inside the mask.

 

SAG

Improving Sample Quality of Diffusion Models Using Self-Attention Guidance

SAG

  1. Whereas classifier-free guidance computes guidance from a conditional score, self-attention guidance uses the model's internal information; it is training-free and condition-free, hence general, and can be used to enhance any diffusion model.

  2. In guidance, the $u$ term is what to move away from. For an unconditional model, its own output can play the role of $c$ and a $u$ can be defined by hand; here the score of a noised, Gaussian-blurred version of each step's $\hat{x}_0$ serves as $u$, called Blur Guidance: Gaussian blur reduces the fine-scale details within the input signals and smooths them towards constant, resulting in locally indistinguishable ones. However, this introduces noise into the generated images; we assume that this is because global blur introduces structural ambiguity across entire regions. Hence blur is only applied at salient positions.

  3. Self-attention mask: the unnormalized self-attention map has shape $\mathbb{R}^{N \times (HW) \times (HW)}$ with $N$ heads; global average pooling (GAP) over the $N \times (HW)$ dimensions, followed by reshaping and upsampling to the image size, gives a saliency map. Thresholding at its mean yields a mask corresponding to the high-frequency parts of the image, and Gaussian blur is applied only inside this mask.

 

PAG

Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance

PAG

  1. Like SAG, but the self-attention map is replaced by the identity matrix $I$ to produce the "unconditional" score used in CFG-style sampling.

 

SEG

Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy Curvature of Attention

  1. Perturbs the self-attention of CFG's unconditional prediction; the same idea as PAG.

  2. By defining the energy of self-attention, we introduce a method to reduce the curvature of the energy landscape of attention and use the output as the unconditional prediction of CFG.

 

Self-Guidance

Guided Diffusion from Self-Supervised Diffusion Features

  1. Like SAG, guidance comes from the data's own UNet features.

  2. Our method leverages the inherent guidance capabilities of diffusion models during training by incorporating the optimal-transport loss. In the sampling phase, we can condition the generation on either the learned prototype or by an exemplar image.

  3. Requires full retraining.

 

Attention-Regulation

Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models

Attention-Regulation

  1. One token's dominance in cross-attention causes the loss of other tokens' semantics.

  2. Requires an extra set of token indices as input (all nouns can be extracted automatically). For each token in the set, take its cross-attention map and compute an MSE loss between its 90th-percentile response and a preset value, so that these tokens' cross-attention responses are large (similar to A&E); also compute an MSE loss between the sum of each map's responses and a preset value, so that the responses across these tokens stay balanced. A parameterized $S$ is added to the attention logits, giving $\mathrm{Softmax}\left(\frac{QK^T + S}{\sqrt{d}}\right)$, and $S$ is optimized with the two losses.

  3. We choose cross-attention layers in the last down-sampling layers and the first up-sampling layers in the U-Net for optimization.

  4. EMA updates are used for stability.

 

Attention-Modulation

Towards Better Text-to-Image Generation Alignment via Attention Modulation

  1. training-free

  2. Self-attention temperature control: a smaller temperature is used when computing attention so the softmax distribution is more concentrated; high attention values between patches with strong correlations are emphasized, while low attention values between unrelated patches are suppressed. After temperature control, each patch only corresponds with patches within a smaller surrounding area, leading to the correct outlines being constructed in the final generated image. The temperature operation is applied to the self-attention layers in the early generation stage of the diffusion model.

  3. Object-focused masking mechanism: the prompt is split into entity groups (objects with their adjectives, verbs, prepositions, etc.); each group's cross-attention map is the sum over its words (a group may span several words). Then, for every pixel, the group with the highest cross-attention response is selected, the pixel is assigned to that group, and the pixel is masked out (response set to 0) in the cross-attention maps of all words of the other groups. With this masking mechanism, for each patch, we retain semantic information for only the entity group with the highest probability, along with the global information related to the layout. This approach helps reduce occurrences of object dissolution and misalignment of attributes.

 

MaskDiffusion

MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask

  1. StableDiffusion

  2. training-free

  3. Operates only at the 16x16 resolution.

  4. Cross-attention maps show three kinds of bad cases:

cross-attention-bad-cases

  1. A region-selection algorithm picks the region for each text token, raises its cross-attention response there, and separates the regions of different tokens as much as possible: $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}} + M\right)$, with $M$ initialized to 0; for the $i$-th text token, if the $j$-th image token lies in its region, then $M_{ji} = M_{ji} + w_0$, where $w_0$ is a preset constant.

 

A&E

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

A&E

  1. StableDiffusion

  2. training-free

  3. Operates only at the 16x16 resolution.

  4. At each generation step, $z_t$ is optimized by gradient descent to encourage at least one patch in each subject token's cross-attention map to have a high response, ensuring that the object exists; the optimized $z_t$ is then used to generate $z_{t-1}$, and the loop repeats (see the sketch below).
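A minimal sketch of one A&E-style latent update; `cross_attn_fn` is an assumed hook that returns the differentiable 16x16 cross-attention maps, and the step size is illustrative:

```python
import torch

def attend_and_excite_step(z_t, cross_attn_fn, subject_token_ids, step_size: float = 20.0):
    """Encourage each subject token's cross-attention map to contain at least one
    high-response patch, then nudge z_t along the negative gradient."""
    z_t = z_t.detach().requires_grad_(True)
    attn = cross_attn_fn(z_t)                                   # (HW, L), rows sum to 1
    per_token_max = attn[:, subject_token_ids].max(dim=0).values
    loss = (1.0 - per_token_max).max()                          # focus on the most neglected subject
    grad = torch.autograd.grad(loss, z_t)[0]
    return (z_t - step_size * grad).detach()                    # optimized z_t then produces z_{t-1}
```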

 

D&B

Divide and Bind Your Attention for Improved Generative Semantic Nursing

  1. StableDiffusion

  2. training-free

  3. Replaces the loss above with a total-variation loss, so the signal is not confined to a single patch but encourages the whole region.

  4. Also adds a bind loss: adjectives modifying a subject token should have cross-attention maps aligned with the noun's, so the JS divergence between their (normalized) maps is used as a loss.

 

SynGen

Linguistic Binding in Diffusion Models Enhancing Attribute Correspondence through Attention Map Alignment

SynGen

  1. StableDiffusion

  2. training-free

  3. Computes a loss from cross-attention maps and uses its gradient as guidance.

 

EBAMA

Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

EBAMA

  1. StableDiffusion

  2. training-free

  3. Computes a loss from cross-attention maps and uses its gradient as guidance.

  4. Intensity loss: the negative maximum of the cross-attention map, similar to A&E.

  5. binding loss: maximize the cosine similarity between the given object and its syntactically-related modifier tokens, while enforcing the repulsion of grammatically unrelated ones in the feature space.

 

EBCA

Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

 

PAC-Bayes

Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

PAC-Bayes

  1. StableDiffusion

  2. training-free

  3. Computes a loss from cross-attention maps and uses its gradient as guidance.

 

ELA

Easing Concept Bleeding in Diffusion via Entity Localization and Anchoring

ELA

  1. Like DiffEdit, a mask is estimated from cross-attention maps and then used for self-reinforcement.

 

INITNO

INITNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

INITNO

  1. Optimizes the reparameterized distribution of the initial noise based on the cross-attention and self-attention maps of the first generation step, to guarantee object existence and resolve subject mixing.

  2. $S_{CrossAttn}$ is the cross-attention response score, the same as the loss in A&E, ensuring object existence.

  3. $S_{SelfAttn}$ is the self-attention conflict score; existing diffusion models suffer from self-attention map overlap, leading to a failure case of subject mixing. For any two different subject tokens, find the location of the maximum response in each one's cross-attention map and take the corresponding self-attention map at that location (each of shape $H \times W$); over the $HW$ positions, compute the minimum of the two maps divided by their sum at each position, so that at any position one map's response is high and the other's low, reducing overlap.

  4. If both scores fall below their thresholds, no further optimization is needed and sampling proceeds directly.

  5. $L_{joint}$ comprises the two scores plus an extra KL divergence that keeps the reparameterized distribution of the initial noise from drifting away from $N(0, I)$.

 

EnMMDiT

Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation

  1. A method for MMDiT that addresses subject neglect/mixing.

  2. Three losses are computed and their gradient is used as guidance.

  3. Block Alignment Loss: the blocks in the later layers gradually remove the ambiguities present in the earlier ones, so a similarity is computed between the deep and shallow cross-attention maps.

  4. Text Encoder Alignment Loss: T5 and CLIP may conflict, so the similarity between the cross-attention maps of the same token under the two encoders is computed.

  5. Overlap Loss: the overlap between the cross-attention maps of different tokens, computed within T5, within CLIP, and across every T5-CLIP pair.

 

A-STAR

A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis

  1. StableDiffusion

  2. training-free

  3. Attention overlap problem; addressed by computing the IoU between the cross-attention maps of different tokens.

  4. Attention decay problem: the cross-attention layout is fairly clear early in StableDiffusion's generation but blurs and fails to persist later, so a mask is estimated from the previous step's cross-attention map and the IoU between the current step's map and this mask is computed.

  5. The loss in 3 minus the loss in 4 is differentiated and used as guidance.

 

Prompt

VP (text -> bounding box -> image)

Visual Programming for Text-to-Image Generation and Evaluation

  1. Fine-tune an LLM on text-layout pairs so it converts text to a layout, which is fed to GLIGEN together with the text as conditions, supporting precise, controllable generation.

 

PCIG (text -> graph -> image)

Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models

PCIG

  1. Similar to VP.

 

GenArtist (text -> bounding box -> image)

GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

GenArtist

  1. Similar to VP.

 

CxD (text -> bounding box -> image)

Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

CxD

  1. Similar to VP.

 

LayoutLLM-T2I (text -> bounding box -> image)

LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation

  1. In-context learning: a batch of samples is drawn from the training set (COCO, with prompt and bounding-box annotations) as a candidate set; a policy network, given the query prompt, selects a few candidates as in-context examples; ChatGPT receives the in-context examples plus the query prompt and outputs the bounding boxes of the prompt's objects (as text). The policy network is trained with rewards such as mIoU and CLIP similarity.

  2. StableDiffusion is fine-tuned in the GLIGEN style.

 

DivCon (text -> bounding box -> image)

DivCon: Divide and Conquer for Progressive Text-to-Image Generation

DivCon

 

LLM-Blueprint (text -> bounding box -> image)

LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

LLM-Blueprint

 

RealCompo (text -> bounding box -> image)

RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models

RealCompo

  1. After ChatGPT produces the layout, an L2I model (e.g., GLIGEN) and a T2I model generate jointly: at each step their predicted noises are combined with coefficients to form the noise for the DDIM update, and a loss defined on the DDIM result updates the coefficients for the next step, dynamically balancing realism and compositionality.

 

ReasonLayout (text -> bounding box -> image)

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

  1. CoT reasoning: GPT-3.5 generates the layout from the prompt via in-context examples.

  2. A trainable Layout-Aware Cross-Attention is inserted between StableDiffusion's self-attention and cross-attention; masks generated from the layout are applied to the cross-attention map.

 

SimM (text -> bounding box -> image)

Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation

  1. StableDiffusion

  2. training-free

  3. A rough layout is parsed from the text's spatial relations (e.g., "middle" maps to a box at the image center, "left" to a box covering the left third, all of fixed size) and compared against the cross-attention maps of the first generation step; a threshold test checks for layout mismatch. If they match, generation proceeds untouched; if not, the system intervenes.

  4. Intervention: generate from $T$ to $T_{loc}$, average the cross-attention maps produced over $[T, T_{loc}]$, and slide the fixed-size boxes over them with a threshold test to locate each token's current layout. From $T_{loc}$ onward, the cross-attention maps are edited at every step: since the assigned box and the detected box have the same size, the responses inside the detected box are copied into the assigned box, with responses inside the box amplified and those outside suppressed.

 

SPDiffusion (text -> bounding box -> image)

SPDiffusion: Semantic Protection Diffusion for Multi-concept Text-to-image Generation

SPDiffusion

  1. StableDiffusion

  2. training-free

 

SceneGenie (text -> scene graph -> image)

SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis

SceneGenie

 

T2I-Salad (text -> scene graph -> image)

Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion

  1. Generating intricate visual content from simple abstract text prompts.

  2. A discrete diffusion model over scene graphs is trained in a self-supervised way to generate semantically richer scene graphs from simple abstract prompts.

  3. Scene-graph attention is inserted into StableDiffusion and trained.

 

FG-DM (text -> any -> image)

Adapting Diffusion Models for Improved Prompt Compliance and Controllable Image Synthesis

FG-DM-1 FG-DM-2

 

  1. Besides text, K other conditions (semantic, sketch, depth, or normal maps, etc.) are introduced as intermediates for image generation, and their joint distribution is modeled.

  2. Each condition is noised and denoised by a pre-trained StableDiffusion; a trainable T2I-Adapter injects all preceding conditions, and its output is passed on to the next T2I-Adapter.

  3. Trained with K+1 diffusion losses.

  4. Conditions are randomly dropped during training, so any sub-graph can be chosen at sampling time.

 

ITI-Gen

ITI-GEN: Inclusive Text-to-Image Generation

  1. make the pre-trained StableDiffusion to generate images which are uniformly distributed across attributes of interest.

  2. Somewhat like the model editing of TIME and UCE, but here only the prompt is modified (prompt tuning) with no change to the model; a reference image dataset is required to define the attributes of interest.

 

RPG

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

RPG

  1. A long caption is decomposed by ChatGPT into n sub-captions; each sub-caption is recaptioned and assigned a layout in the image.

  2. During generation, denoising is run separately with each of the n sub-captions; the per-sub-caption results are resized according to their layouts and reassembled to the original spatial size. To keep the concatenation boundaries consistent, the latent produced with the original caption is interpolated with the reassembled latent.

 

RAG

Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement

RAG

  1. Similar to RPG.

 

ContextCanvas

Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG

  1. Retrieval-augmented generation using a knowledge graph.

 

R2F

Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance

R2F

 

ConceptDiffusion

Semantic Guidance Tuning for Text-To-Image Diffusion Models

ConceptDiffusion

  1. The prompt's score is decomposed into a combination of concept scores: subject-concept scores are computed directly, abstract-concept scores via orthogonal projection, and the combination weights come from the similarity between each concept's score and the prompt's score.

 

RCN

Is Your Text-to-Image Model Robust to Caption Noise?

RCN

 

TweedieMix

TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation

TweedieMix

 

SGOOL

Saliency Guided Optimization of Diffusion Latents

SGOOL

  1. An upgraded version of TweedieMix.

 

DreamWalk

DreamWalk: Style Space Exploration using Diffusion Guidance

  1. The prompt is decomposed into sub-prompts, and generation uses a linear combination of the sub-prompts' CFG terms.

 

MCT2I

Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else

  1. StableDiffusion

  2. Training-free; only the text embeddings are manipulated.

  3. Concepts appearing earliest in the text tend to dominate generation and can crowd out other concepts; their token embeddings also tend to have larger norms, which scaling down alleviates.

  4. Some concepts may be generated not from their own embedding but from others'; the similarity between the current embedding and the other embeddings is computed, and the current embedding is re-expressed as a weighted sum of the others.

 

Magnet

Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function

Magnet-1

Magnet-2

  1. training-free

 

ToMe

Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis

ToMe

  1. training-free

  2. ToMe: $\hat{c}_k = n_k + \sum_i a_k^i$, where $n_k$ is the subject and $a_k^i$ are its attributes.

  3. ETS: As the semantic information contained in [EOT] can interfere with attribute expression, we mitigate this interference by replacing [EOT] to eliminate attribute information contained within them, retaining only the semantic information of each subject.

  4. Let the original prompt be $C$ and the $k$-th phrase be $C_k$; ToMe and ETS yield $C^*$. $L_{sem} = \sum_k \|\epsilon_\theta(z_t, t, \hat{c}_k) - \epsilon_\theta(z_t, t, C_k)\|_2^2$; $L_{ent}$ is the sum of the entropies of the cross-attention maps of every token in $C^*$; decreasing the entropy of the cross-attention maps can help ensure that tokens focus exclusively on their designated regions, thereby preventing the cross-attention map from becoming overly divergent.

  5. During generation, we compute these two losses to update the composite token at each timestep.

 

MoCE

Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models

MoCE

  1. For the prompt "a tea cup of iced coke", most existing models generate a glass cup instead of a tea cup, because in the training data iced coke usually co-occurs with glass cups. Mixture of Concept Experts is proposed: GPT plans to generate the tea cup first and then the iced coke.

 

FDR

On the Fairness, Diversity and Reliability of Text-to-Image Generative Models

 

Prompt LLM Encoding

LI-DiT

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

LI-DiT

  1. Different LLMs have different strengths: e.g., between encoder-decoder T5 and decoder-only GPT, the latter understands text better, yet text-to-image models trained with it align image and text far worse than those trained with the former.

  2. Ensemble of LLMs: the prompt is encoded by several LLMs separately, a refiner fuses their output features, and the fused features are used to train the text-to-image DiT.

 

LLMDiff

Decoder-Only LLMs are Better Controllers for Diffusion Models

LLMDiff

  1. Because decoder-only LLMs carry richer semantics, encoding text with them yields better text-to-image training.

  2. An MLP is trained to map the LLM text embedding into the CLIP text-embedding space for the pre-trained cross-attention, while a parallel cross-attention is trained in the style of IP-Adapter.

 

LLM4GEN

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

LLM4GEN

  1. The Cross-Adapter Module and the UNet are trained together with the diffusion loss.

  2. LLaVA recaptions the dataset images, replacing their prompts for training, similar to DALL·E-3.

 

SUR-adapter

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

SUR-adapter

  1. Complex prompts are collected, images are generated with a T2I model and captioned with BLIP to obtain simple prompts, yielding simple-complex prompt pairs.

  2. Train the SUR-adapter to transfer the semantic understanding and reasoning capabilities of large language models and achieve representation alignment between complex prompts and simple prompts, so that an LLM-encoded simple prompt reaches the generation quality of the complex prompt.

 

 

Prompt Rewrite

BeautifulPrompt

BeautifulPrompt: Towards Automatic Prompt Engineering for Text-to-Image Synthesis

  1. A dataset of low-quality/high-quality prompt pairs is collected to train a language model that rewrites low-quality prompts into high-quality ones, automating prompt engineering.

 

DiffChat

DiffChat: Learning to Chat with Text-to-Image Synthesis Models for Interactive Image Creation

  1. DiffChat can effectively make appropriate modifications and generate the target prompt, which can be leveraged to create the target image of high quality.

  2. The LLM chats with the user and, per the user's request, only modifies the prompt; no image understanding is involved.

 

ChatGen

ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

ChatGen

  1. Fine-tunes an MLLM that rewrites the prompt, selects which model to use, and produces the inference configuration.

 

Promptist

Optimizing Prompts for Text-to-Image Generation

Promptist

  1. An LLM is trained and optimized to become a prompt-rewriting model.

 

NeuroPrompts

NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation

  1. Similar to Promptist.

 

PRIP

Prompt Refinement with Image Pivot for Text-to-Image Generation

PRIP-1

PRIP-2 PRIP-3

 

 

  1. Models are trained on the HPSv2 dataset to refine the input prompt.

 

Patcher

Repairing Catastrophic-Neglect in Text-to-Image Diffusion Models via Attention-Guided Feature Enhancement

Patcher

  1. Automatically detects objects missing from the generated result and rewrites the prompt.

 

AP-Adapter

AP-Adapter: Improving Generalization of Automatic Prompts on Unseen Text-to-Image Diffusion Models

AP-Adapter

 

 

Negative Prompt

ContrastivePrompt

Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models

  1. Unlike negative prompts that use generic phrases such as "low quality" and "ugly", contrastive prompts are tailored to the prompt itself, e.g., removing some adjectives or using antonymic prompts such as changing "with" to "without".

 

DPO-Diff

On Discrete Prompt Optimization for Diffusion Models

  1. Uses prompt engineering to find suitable negative prompts.

  2. Our main insight is that prompt engineering can be formulated as a discrete optimization problem in the language space.

  3. To the best of our knowledge, this is the first exploratory work on automated negative prompt optimization.

 

DNP

Improving Image Synthesis with Diffusion-Negative Sampling

DNP

  1. DNP: a negative sample is drawn using $\epsilon_\theta(x_t, t, \phi) + s\,(\epsilon_\theta(x_t, t, \phi) - \epsilon_\theta(x_t, t, p))$.

 

 

Syntax

StructureDiffusion

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

StructureDiffusion

  1. Strengthens attribute binding.

  2. Using the syntax tree, the k noun phrases of the text are extracted; the CLIP text encoder embeds each noun phrase, the embeddings at the corresponding positions of the original text are replaced, and each variant is multiplied with the cross-attention map; the k+1 outputs (including the original text) are averaged as the output.

 

SG-Adapter

SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance

SG-Adapter

  1. Adapts the CLIP text embedding so that the generated image's semantics are more accurate.

  2. An NLP parser extracts the subject-relation-object triplets from the text (possibly several); each triplet forms a scene graph. For each scene graph, the CLIP text embeddings of the triplet's words are concatenated and passed through a linear layer to get the scene-graph embedding. The original CLIP text embedding serves as Q and the scene-graph embeddings as KV in a cross-attention that produces the refined text embedding; when computing the cross-attention map, a Q token only attends to a K scene graph it belongs to, and everything else is masked out.

 

Memorization

MemAttn

Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention

MemAttn-1

MemAttn-2

  1. The cross-attention map is $HW \times L$ with each row summing to 1. For non-memorized images, most of each row's cross-attention mass falls on the beginning token and becomes more concentrated as $t$ decreases; for memorized images, little mass goes to the beginning token, and it instead concentrates on some specific token.

  2. For the $HW \times L$ cross-attention map, take the column means, compute an entropy per column, and sum over columns to get the attention entropy; higher attention entropy indicates a more dispersed cross-attention score distribution. For non-memorized images, the attention entropy drops quickly as $t$ decreases; for memorized images, it stays higher than for non-memorized ones.

  3. These observations can be used for detection.

  4. Mitigating memorization: directly rescale the cross-attention logits, multiplying the beginning token's logits by a large factor so that most of the cross-attention mass concentrates on the beginning token.

 

AMG

Towards Memorization-Free Diffusion Models

  1. Anti-Memorization Guidance: three metrics that discourage generating memorized samples are designed, and their gradients are used as guidance.

 

NeMo

Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models

  1. We propose to localize memorization of individual data samples down to the level of neurons in DMs’ cross-attention layers.

  2. By deactivating these memorization neurons, we can avoid the replication of training data at inference time, increase the diversity in the generated outputs, and mitigate the leakage of private and copyrighted data.

 

IET-AGC

Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models

IET-AGC

 

MemBench

MemBench: Memorized Image Trigger Prompt Dataset for Diffusion Models

  1. In this work, we present MemBench, the first benchmark for evaluating image memorization mitigation methods.

 

ProCreate

ProCreate, Don’t Reproduce! Propulsive Energy Diffusion for Creative Generation

  1. ProCreate operates on a set of reference images and actively propels the generated image embedding away from the reference embeddings during the generation process.

 

BEA

Exploring Local Memorization in Diffusion Models via Bright Ending Attention

  1. In this paper, we identify and leverage a novel ‘bright ending’ (BE) anomaly in diffusion models prone to memorizing training images to address a new task: locating localized memorization regions within these models.

  2. BE refers to a distinct cross attention pattern observed in text-to-image generations using diffusion models.

  3. Memorized image patches exhibit significantly greater attention to the end token during the final inference step compared to non-memorized patches.

 

MFC

Memories of Forgotten Concepts

 

Guidance

GuidanceInterval

Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models

  1. We propose to only apply guidance in a continuous interval of noise levels in the middle of the sampling chain and disable it elsewhere: an interval $(\sigma_{lo}, \sigma_{hi})$ is defined on EDM, CFG is applied only inside it, and plain conditional sampling is used everywhere else.

 

DynamicGuidance

Analysis of Classifier-Free Guidance Weight Schedulers

DynamicGuidance

  1. Simple, monotonically increasing weight schedulers consistently lead to improved performances.

 

S-CFG

Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance

S-CFG-1

S-CFG-2

  1. We argue that the CFG scale should be spatially adaptive, allowing for balancing the inconsistency of semantic strengths for diverse semantic units in the image.

  2. The cross-attention map has shape $HW \times L$ with each row summing to 1. First normalize along columns so each column sums to 1, then assign each pixel to the token with the largest response in its row. This normalization matters; otherwise responses concentrate on the beginning token, as also shown in MemAttn.

  3. The segmentation from cross-attention maps is coarse, so the self-attention map refines it: the self-attention map is multiplied with the cross-attention map and the result goes through the assignment procedure above.

  4. A further refinement computes $\bar{C} = \frac{1}{R}\sum_{r=1}^{R} S^r C$ and then applies the same procedure, with $R = 4$.

  5. The CFG term $\epsilon_\theta(z_t,t,c) - \epsilon_\theta(z_t,t,\phi)$ is split into a sum over M semantic units, $\sum_{i=1}^{M} \gamma_{t,i}\, m_{t,i} \odot [\epsilon_\theta(z_t,t,c) - \epsilon_\theta(z_t,t,\phi)]$, where $m_{t,i}$ is the mask of the $i$-th unit and $\gamma_{t,i}$ is a rescaling coefficient. Letting $\eta_t = \|\epsilon_\theta(z_t,t,c) - \epsilon_\theta(z_t,t,\phi)\|^2 \in \mathbb{R}^{HW}$ (the per-pixel squared CFG magnitude), $\gamma_{t,i} = \gamma\,\frac{|m_{t,b}\odot\eta_t| / |m_{t,b}|}{|m_{t,i}\odot\eta_t| / |m_{t,i}|}$, where $m_{t,b}$ is the background mask estimated from the beginning token; this rescales each unit's mean CFG magnitude to the background's mean.

 

WorkingMechanism

Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model

  1. During the denoising process of the stable diffusion model, the overall shape and details of generated images are respectively reconstructed in the early and final stages of it.

  2. The special token [EOS] dominates the influence of text prompt in the early (overall shape reconstruction) stage of denoising process, when the information from text prompt is also conveyed. Subsequently, the model works on filling the details of generated images mainly depending on themselves.

  3. CFG is used in the early stage, while only the unconditional score is used in the final stage, halving the compute of the final stage.

 

GuideModel

Plug-and-Play Diffusion Distillation

GuideModel

  1. CFG requires two forward passes, which is expensive, so a guide model is learned as an adapter (mirroring ControlNet), taking the guidance scale as an input parameter, to distill CFG.

 

SFG

Segmentation-Free Guidance for Text-to-Image Diffusion Models

  1. For a prompt like "a dog on a couch in an office", if the negative prompt is the prompt with one object removed ("a dog in an office"), that object (the couch) becomes more salient in the final image.

  2. Exploiting this, at sampling time the cross-attention maps estimate which object each pixel belongs to, and that pixel's negative prompt is the prompt with its object removed.

  3. Since it relies on cross-attention maps to assign pixels to objects, SFG is only used in the later part of sampling, and $\omega = 2.5$ suffices.

 

FABRIC

FABRIC: Personalizing Diffusion Models with Iterative Feedback

  1. The reference image is noised and passed through the UNet, keeping the key-value of every self-attention layer; during generation these key-values are concatenated after the current pass's self-attention key-values.

  2. High-rated reference images feed the conditional branch of CFG and low-rated ones the unconditional branch, using the mechanism above.

 

CAD

Don’t Drop Your Samples! Coherence-Aware Training Benefits Conditional Diffusion

  1. CLIP scores the similarity of the text-image pairs in the dataset, which is post-processed into a coherence score in [0, 1].

  2. The coherence score is an extra condition when training the diffusion model.

  3. At generation time, CFG over the coherence score is used: $\epsilon_\theta(x_t, y, 1, t) + \omega\,(\epsilon_\theta(x_t, y, 1, t) - \epsilon_\theta(x_t, y, 0, t))$.

 

Character

TCO

The Chosen One Consistent Characters in Text-to-Image Diffusion Models

TCO

  1. Different prompts describing the same character should generate a character with consistent features.

  2. Generate, cluster, and LoRA fine-tune on the selected cluster (images of the character sharing the same features).

 

OneActor

OneActor: Consistent Character Generation via Cluster-Conditioned Guidance

OneActor-1

OneActor-2

  1. Create consistent images of the same character.

  2. Like Pix2Pix-Zero, a trainable network predicts a $\Delta c$ for the character word in the text embedding; all three losses are diffusion losses.

 

SFT

SFT fine-tunes the model on a fixed dataset of text-image pairs only (as in Emu), encouraging the model to generate the paired image for each text; typically a high-quality dataset is collected for this fine-tuning.

 

RLFT has the model generate an image from a text, scores the image with a reward model, and optimizes reward-weighted likelihood maximization, i.e., maximizing $r_\phi(x_0, c)\log p_\theta(x_0|c)$, encouraging the model to generate high-reward images for that text. Each RLFT round uses samples generated by the model optimized in the previous round (whereas SFT always uses samples produced before fine-tuning), i.e., it is online.

 

Some criteria have ready-made models, e.g., CLIP for text-image alignment, which can serve directly as the reward model. Others, such as human feedback, do not; then a reward model must be trained, usually by learning from rankings between samples (similar to CLIP), e.g., HPS below.

 

DPO avoids training a reward model: only the rank relation between two samples is needed, so it typically fine-tunes on a fixed dataset, like SFT.

 

RLFT is a family of methods; RLHF is the application of RLFT where the criterion is human feedback.

 

Emu

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

  1. LLMs can markedly improve output quality by fine-tuning on a small high-quality dataset without hurting generalization.

  2. Assuming StableDiffusion can already generate high-quality images but the ability is not reliably elicited, producing uneven quality, Emu fine-tunes it on roughly 2000 hand-curated, extremely high-quality images so that it keeps generating high-quality images without losing generalization to text.

  3. Early stopping (<15k iterations) avoids overfitting.

  4. The recipe is general and also applies to pixel-level diffusion models (Imagen) and masked generative models (Muse).

 

EvoGen

Progressive Compositionality In Text-to-Image Generative Models

EvoGen

  1. Constructs a contrastive dataset.

  2. Training uses the positive sample for the diffusion loss and additionally the negative sample for a contrastive loss $L = -\log\frac{\exp(\mathrm{sim}(h_t^+, f(c)))}{\exp(\mathrm{sim}(h_t^+, f(c))) + \exp(\mathrm{sim}(h_t^-, f(c)))}$, where $h_t$ is a feature from a UNet encoder layer mapped by a trainable MLP and $f(c)$ is the CLIP text embedding.

 

 

RLFT

DPOK

DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models

DPOK-1 DPOK-2

 

  1. $x_t$ is the state and the diffusion model the policy network; the score of the final image $x_0$ is the reward of the last action and all earlier actions get reward 0. Policy-gradient fine-tuning of the pre-trained diffusion model, with $z$ the text or other condition; the gradient is the sum of the per-step sampling gradients, not a single back-propagated chain.

  2. To avoid overfitting during fine-tuning, a KL regularizer between the $x_0$ generated by the fine-tuned model and by the original model is added.

 

DDPO

Training Diffusion Models with Reinforcement Learning

DDPO

  1. Policy-gradient fine-tuning of a pre-trained diffusion model; the formula is the same as DPOK's, and DDPO and DPOK were released at roughly the same time.

 

ImageReward

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

  1. Multiple images are generated for a prompt $p$ and rank-annotated; the reward model is trained on these with $L = -E_{p, x_i, x_j}\left[\log \sigma(f_\theta(p, x_i) - f_\theta(p, x_j))\right]$, where $x_i$ is ranked higher than $x_j$ (see the sketch after this list).

  2. Currently used only for model evaluation, not yet for RLFT.
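A minimal sketch of the pairwise ranking objective above (the scores are assumed to come from a reward model evaluated on k images of one prompt):

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(scores: torch.Tensor) -> torch.Tensor:
    """scores: (k,) reward-model outputs for k images, ordered best -> worst.
    Sums -log sigmoid(f(winner) - f(loser)) over all ordered pairs."""
    k = scores.shape[0]
    loss = scores.new_zeros(())
    for i in range(k):
        for j in range(i + 1, k):
            loss = loss + F.softplus(-(scores[i] - scores[j]))   # -log sigmoid(s_i - s_j)
    return loss / (k * (k - 1) / 2)
```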

 

LVLM-ImageReward

Improving Compositional Text-to-image Generation with Large Vision-Language Models

  1. A Large Vision-Language Model judges the alignment between the generated image and the text along four aspects: object number, attribute binding, spatial relationship, and aesthetic quality.

  2. RLFT of the model (online).

 

HPS

Human Preference Score: Better Aligning Text-to-Image Models with Human Preference

HPS

  1. Human Preference Dataset (HPD): multiple images are generated per prompt, one of which the user marks as preferred.

  2. Train a human preference classifier: like CLIP, image and text are encoded into the same embedding space and a similarity is computed.

  3. Human Preference Score (HPS): $HPS = 100 \cdot \cos(E_v(img), E_t(text))$.

  4. LoRA fine-tune StableDiffusion: not only high-HPS data but also low-HPS data are used, the latter with an identifier token prepended to the prompt; at sampling time the identifier-marked prompt serves as the negative prompt.

 

HPSv2

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

HPSv2

  1. Human Preference Dataset v2 (HPDv2): prompts from various datasets are filtered with ChatGPT into a decent prompt set; each prompt is fed to different text-to-image models to generate multiple images, and preferences are annotated by humans.

  2. Train the human preference classifier: same architecture as HPS (encode, then compute similarity), but training randomly picks only two images per prompt, labels the preferred one 1 and the other 0, and optimizes a KL divergence (cross-entropy).

  3. Human Preference Score v2 (HPSv2): same as HPS.

 

RAHF

Rich Human Feedback for Text-to-Image Generation

  1. RichHF-18K dataset includes two heatmaps (artifact/implausibility and misalignment), four fine-grained scores (plausibility, alignment, aesthetics, overall), and one text sequence (misaligned keywords).

 

RLHF

Aligning Text-to-Image Models using Human Feedback

  1. A method for fine-tuning pre-trained StableDiffusion that improves image-text alignment with human-annotated feedback.

  2. StableDiffusion is inconsistent on some concepts, such as count and color, so sentences are built around count and color (any other poorly aligned concept could be used; count and color are just the examples here); each text generates 60+ images, which labelers annotate 0/1, where 0 means misaligned (wrong count or color) and 1 means aligned.

  3. A reward function is trained to predict alignment (0-1) from the CLIP encodings of the image and text, using the labeled data with an MSE loss. A data-augmentation task (prompt classification) further improves it: for each aligned image-text pair, the count or color in the text is altered to produce N-1 misaligned texts; the image and the N texts are fed to the reward function, and the N outputs are trained as a softmax classification with cross-entropy.

  4. The model is then RLFT'ed with the reward function (online).

 

BoigSD

Behavior Optimized Image Generation

  1. Uses DDPO to align SD with a proposed BoigLLM-defined reward.

 

HRF

Avoiding Mode Collapse in Diffusion Models Fine-tuned with Reinforcement Learning

  1. Improves DDPO.

 

Diffusion-KTO

Aligning Diffusion Models by Optimizing Human Utility

Diffusion-KTO

 

PRDP

PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models

 

RLCM

RL for Consistency Models: Faster Reward Guided Text-to-Image Generation

RLCM

 

TexForce

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

  1. Existing T2I models mostly use pre-trained text encoders and still need prompt engineering at generation time, which suggests the text encoder is suboptimal; misalignment in T2I generation can thus be attributed to the suboptimal text encoder. TexForce therefore LoRA fine-tunes the text encoder with DDPO so the text features become more visual.

  2. It can be combined with DPOK fine-tuning of the UNet for even better results, and can be used to fix hands.

 

TextCraftor

TextCraftor: Your Text Encoder Can be Image Quality Controller

TextCraftor

  1. Similar to TexForce.

 

PAHI

Model-Agnostic Human Preference Inversion in Diffusion Models

  1. Using a distilled one-step generator and a scoring model, the mean and variance of the initial-noise Gaussian are optimized via the reparameterization trick.

  2. For a prompt, one noise is drawn from the standard Gaussian and one from the reparameterized Gaussian; the model generates a sample from each, the scorer scores both, and cross-entropy optimizes the mean and variance so that the latter scores higher.

  3. Optimization can target a single prompt or a whole prompt dataset.

 

SynArtifact

SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model

SynArtifact

 

DRaFT

Directly Fine-Tuning Diffusion Models on Differentiable Rewards

LoRA + gradient checkpointing; StableDiffusion is fine-tuned with a differentiable reward function.

 

AlignProp

Aligning Text-to-Image Diffusion Models with Reward Backpropagation

Gradient checkpointing; StableDiffusion is fine-tuned with a differentiable reward function.

 

DRTune

Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models

DRTune

  1. Different samplers can all be written as $x_{t-1} = a_t x_t + b_t \epsilon_\theta(x_t, t) + c_t \epsilon$.

  2. With DRaFT and AlignProp, gradient checkpointing is no longer needed: the gradient of $\epsilon_\theta(x_t, t)$ with respect to $x_t$ is blocked, i.e., $x_{t-1} = a_t x_t + b_t \epsilon_\theta(\mathrm{sg}(x_t), t) + c_t \epsilon$, so $\frac{\partial x_{t-1}}{\partial x_t} = a_t$ and the UNet's intermediate activations need not be stored (see the sketch below).
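A minimal sketch of one sampler step with the input gradient blocked, under the generic $a_t, b_t, c_t$ parameterization above (argument names are illustrative):

```python
import torch

def drtune_step(unet, x_t, t, cond, a_t, b_t, c_t):
    """eps is computed on sg(x_t), so d x_{t-1} / d x_t = a_t and no UNet activations
    need to be kept for backprop; gradients still reach the UNet parameters through eps."""
    eps = unet(x_t.detach(), t, cond)                       # stop-gradient on the UNet input only
    noise = torch.randn_like(x_t) if c_t != 0 else 0.0      # c_t assumed to be a float here
    return a_t * x_t + b_t * eps + c_t * noise
```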

 

Parrot

Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation

K metrics are predefined; at training time one is chosen at random, its reward-specific identifier is prepended to the prompt, and training uses DDPO.

At generation time, all K reward-specific identifiers are concatenated and prepended to the prompt.

 

VersaT2I

VersaT2I: Improving Text-to-Image Models with Versatile Reward

  1. ChatGPT generates N prompts; StableDiffusion generates K images per prompt; a reward model scores the images and the highest-scoring image per prompt is kept, giving N prompt-image pairs for LoRA fine-tuning StableDiffusion, with LoRA applied to all cross-attention layers.

  2. A separate LoRA $\Delta W_i$ is trained with each aspect's reward model, then a LoRA router is trained on all selected data: $o = W_0 x + \sum_{i=1}^{L} \mathrm{Softmax}(x W_g)_i\, \Delta W_i x$.

 

CoMat

CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

CoMat

  1. DPOK-style; like a fine-tuned version of TokenCompose.

  2. $L_{cap}$ is the sum of the AR teacher-forcing next-token prediction losses.

  3. $L_{token}$ forces the model to activate the attention of the object tokens only inside the region, i.e., the responses in the object token's column of the cross-attention map should concentrate on the rows belonging to the mask.

  4. $L_{pixel}$ forces every pixel in the region to attend only to the object tokens via a binary cross-entropy loss, i.e., the responses in the mask's rows of the cross-attention map should concentrate on the object token's column.

 

DPT

Discriminative Probing and Tuning for Text-to-Image Generation

DPT

  1. StableDiffusion features are fed to a Q-Former, trained with global matching (CLIP loss) and local grounding (classification, bounding box) tasks.

  2. Afterwards, LoRA is added to all of StableDiffusion's cross-attention layers, and the Q-Former and LoRA are trained jointly with the same losses.

  3. At generation time, self-correction uses the gradient of the global-matching CLIP loss as guidance.

 

SELMA

SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data

SELMA

 

IterComp

IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

  1. A framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation.

  2. We develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models.

  3. We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and reward models over multiple iterations.

 

LongAlign

Improving Long-Text Alignment for Text-to-Image Diffusion Models

  1. Uses DRTune for long-text alignment.

 

FaceScore

Fine-tuning Diffusion Models for Enhancing Face Quality in Text-to-image Generation

  1. A face-specific score model for fine-tuning.

 

F-Bench

F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration

 

AIG

Elucidating Optimal Reward-Diversity Tradeoffs in Text-to-Image Diffusion Models

  1. An inference-time regularization inspired by Annealed Importance Sampling, which retains the diversity of the base model while achieving Pareto-Optimal reward-diversity tradeoffs.

 

RID

Reward Incremental Learning in Text-to-Image Generation

  1. Continual learning for RLHF, addressing forgetting.

 

DPO

Diffusion-DPO

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Diffusion Model Alignment Using Direct Preference Optimization

DPO-1

DPO-2

Diffusion-DPO

  1. $w$ denotes the winning sample, $l$ the losing one, and ref the original model; no separate reward model needs to be trained.

  2. Extends DPO over the whole diffusion chain (see the sketch below).
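A minimal sketch of the Diffusion-DPO objective on a (win, lose) pair; shapes, argument names, and the beta value are illustrative, not the authors' API:

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(model, ref, x_t_w, x_t_l, t, cond, eps_w, eps_l, beta: float = 5000.0):
    """Compare how much the trainable model improves the denoising error over the frozen
    reference on the win and lose branches, then apply the DPO logistic loss."""
    err_w = (eps_w - model(x_t_w, t, cond)).pow(2).mean(dim=(1, 2, 3))
    err_l = (eps_l - model(x_t_l, t, cond)).pow(2).mean(dim=(1, 2, 3))
    with torch.no_grad():
        ref_w = (eps_w - ref(x_t_w, t, cond)).pow(2).mean(dim=(1, 2, 3))
        ref_l = (eps_l - ref(x_t_l, t, cond)).pow(2).mean(dim=(1, 2, 3))
    diff = (err_w - ref_w) - (err_l - ref_l)       # < 0 when the model favors the winner more than ref
    return F.softplus(beta * diff).mean()          # -log sigmoid(-beta * diff)
```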

 

D3PO

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

 

SPO

Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

SPO

  1. Existing works including Diffusion-DPO and D3PO measure the quality according to the final generated image x0 and assign the preference for x0 directly to the whole generation trajectory, in other words, to all the intermediate states. Contrary to them, our approach randomly samples one xt by denoising from random Gaussian noise, instead of selecting a pair of win-lose samples.

  2. We build the step-aware preference model by drawing inspiration from the training process of a noisy classifier, which is able to classify noisy intermediate images. We assume the preference order between a pair of images is kept when adding the same noise. After training, the step-aware preference model can be used to predict the preference order among k sampled denoised samples.

  3. Sample a random $x_T$, generate down to some $x_t$, then produce $k$ candidates $x_{t-1}$; the step-aware preference model selects the highest-scoring $x_{t-1}$ as the win and the lowest as the lose, and Diffusion-DPO is applied.

  4. Step-wise resampler: randomly pick one of the $k$ candidates $x_{t-1}$ to continue, so the procedure above can be looped efficiently.

 

SDPO

Aligning Few-Step Diffusion Models with Dense Reward Difference Learning

SDPO

  1. Standard alignment methods often struggle with step generalization when directly applied to few-step diffusion models, leading to inconsistent performance across different denoising step scenarios. SDPO is a novel alignment method tailored for few-step diffusion models.

 

PSO

Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization

  1. A DPO algorithm designed for timestep-distilled diffusion models. This enables the model to retain its few-step generation ability while allowing its output distribution to be fine-tuned.

 

LaSRO

Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

  1. A DPO-style algorithm designed for timestep-distilled diffusion models.

 

RankDPO

Scalable Ranked Preference Optimization for Text-to-Image Generation

RankDPO

  1. DPO can compute a score with the model itself: $s_i = s(x^i, c, t, \theta) = \lVert \epsilon^i - \epsilon_\theta(x^i_t, t, c)\rVert_2^2 - \lVert \epsilon^i - \epsilon_{\mathrm{ref}}(x^i_t, t, c)\rVert_2^2$. This score measures how much better or worse the model prediction is compared to the reference model for the given condition c; the smaller the score, the more the current model prefers the sample.

  2. Rank loss: $\mathbb{E}_{(c, x^1, x^2, \dots, x^k)\sim\mathcal{D},\, t\sim[0,T]}\big[-\sum_{i>j}\log\sigma\big(\beta(s_j - s_i)\big)\big]$, where $x^1, x^2, \dots, x^k$ are images generated for c, ranked from worst to best by the reward model. We want the denoising for image $x^i$ to be better than for $x^j$ for all $i>j$, i.e., $s_i$ should be smaller than $s_j$, so the model increasingly prefers the better images (see the sketch below).
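
A small sketch of the ranked pairwise loss above, assuming the per-sample scores $s_i$ have already been computed as in item 1; names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def rank_loss(scores, beta=1.0):
    """scores: tensor of shape (k,), s_i for images x^1..x^k ranked worst -> best.

    For every pair i > j (x^i ranked better than x^j) we want s_i < s_j,
    since a smaller score means the sample is more preferred.
    """
    k = scores.shape[0]
    loss, n_pairs = 0.0, 0
    for i in range(k):
        for j in range(i):                      # j < i, so x^i is the better image
            loss = loss - F.logsigmoid(beta * (scores[j] - scores[i]))
            n_pairs += 1
    return loss / max(n_pairs, 1)
```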

 

VisionReward

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

  1. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, linearly weighted and summed to an interpretable and accurate score.

  2. We develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors within preference data. Only pairs in which the positive image beats the negative image on every preference dimension are used for DPO training, preventing confounding.

 

PopAlign

PopAlign: Population-Level Alignment for Fair Text-to-Image Generation

PopAlign

  1. Previous preference methods compare two individual samples; PopAlign extends this to comparisons between two populations of samples.

 

NCPPO

Aligning Diffusion Models with Noise-Conditioned Perception

NCPPO

  1. A method that utilizes the U-Net encoder’s embedding space for preference optimization. Perform diffusion preference optimization in a more informative perceptual embedding space.

  2. Replaces the four diffusion losses in Diffusion-DPO with MSE losses between U-Net encoder features.

 

Curriculum-DPO

Curriculum Direct Preference Optimization for Diffusion and Consistency Models

Curriculum-DPO

  1. Diffusion-DPO combined with curriculum learning: easy pairs (large score gaps) are learned first, then hard pairs (small score gaps).

 

PatchDPO

PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation

PatchDPO

  1. We propose PatchDPO, an advanced model alignment method for personalized image generation by estimating patch quality instead of image quality for model training.

 

Self-Consuming

Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences

Self-Consuming

  1. If the data is curated according to a reward model, then the expected reward of the iterative retraining procedure is maximized.

 

Gene-DPO

Generalizing Alignment Paradigm of Text-to-Image Generation with Preferences through f-divergence Minimization

Gene-DPO

 

DDE

Prioritize Denoising Steps on Diffusion Model Preference Alignment via Explicit Denoised Distribution Estimation

DDE

 

SafetyDPO

SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation

  1. We enable the application of DPO for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2.

 

GFlowNets

GFlowNets are a class of probabilistic methods for training a sampling policy $P_F(s'|s)$, where generation starts from an initial state $s_0$, makes a series of stochastic transitions $s \to s'$ in a directed acyclic graph of states, and eventually reaches a terminal state with probability proportional to an unnormalized density (e.g., a non-negative reward).

 

-GFlowNet

Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets

 

Language

BDM

Bridge Diffusion Model: bridge non-English language-native text-to-image diffusion model with English communities

Uses ControlNet to add Chinese-language control to StableDiffusion.

The ControlNet input becomes x_t, and Chinese is introduced at its cross-attention layers via Chinese CLIP; during training, the StableDiffusion text input is set to the empty string, otherwise it would impede modeling of the Chinese text.

 

Taiyi-Diffusion-XL

Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support

 

PEA-Diffusion

PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation

PEA-Diffusion

Replaces the KD loss with an L2 loss between features.

 

AltDiffusion

AltDiffusion: A Multilingual Text-to-Image Diffusion Model

AltDiffusion

 

LLMDiffusion

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

LLMDiffusion

  1. The pre-trained CLIP model can only encode English with a maximum token length of 77. Moreover, the model capacity of the text encoder from CLIP is relatively limited compared to Large Language Models (LLMs), which offer multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve the language understanding in text-to-image generation.

  2. In stage 1, when the text length exceeds 77 tokens, the text is split into multiple segments, each encoded by CLIP, and the results are concatenated.

 

Resolution

MD

Mixture of Diffusers for Scene Composition and High Resolution Image Generation

Mixture

  1. Generates by regions, one prompt per region, and combines them via harmonization.

  2. The key to harmonization: fusion happens at every step, adjacent regions must overlap, and overlapping areas are combined with a weighted sum; harmonization is propagated through the overlaps.

 

MultiDiffusion

MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

  1. Similar to Mixture of Diffusers; the difference is that MultiDiffusion pads and fuses the denoised result z_{t-1}, whereas Mixture of Diffusers pads the predicted noise (see the sketch below).
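
A minimal sketch of the patch-fusion idea shared by these MultiDiffusion-style methods: denoise each overlapping view at the model's native size and average the overlaps. `denoise_step` and the view layout are assumed placeholders.

```python
import torch

@torch.no_grad()
def multidiffusion_step(z, denoise_step, views, t):
    """One fused denoising step over a large latent z of shape (B, C, H, W).

    views: list of (h0, h1, w0, w1) crops, each of the model's native size,
           overlapping so the whole latent is covered.
    denoise_step: runs one reverse step on a native-size latent and returns z_{t-1}.
    """
    out = torch.zeros_like(z)
    count = torch.zeros_like(z)
    for (h0, h1, w0, w1) in views:
        patch = z[:, :, h0:h1, w0:w1]
        out[:, :, h0:h1, w0:w1] += denoise_step(patch, t)
        count[:, :, h0:h1, w0:w1] += 1
    return out / count  # average the overlapping predictions
```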

 

DemoFusion

DemoFusion: Democratising High-Resolution Image Generation With No $$$

DemoFusion

  1. Generation proceeds from low to high resolution: the previous resolution's result is upsampled and diffused (re-noised), then generated with MultiDiffusion; during generation the latent is interpolated with the diffused latent using coefficient c1 to inject global information.

  2. Improves MultiDiffusion: inspired by ScaleCrafter, rather than dilating the convolutional kernel, the sampling within the latent representation is dilated directly, called dilated sampling. For example, if the diffusion model's native size is 3×3 and the dilation factor is 2, a 3×3 grid can be sampled uniformly from the 5×5 latent as the sampling result for that 5×5 region.

  3. The latent without dilated sampling is z_t^local and the one with dilated sampling is z_t^global; the two are interpolated with coefficient c2, similar in spirit to CFG.

  4. In the code, dilated sampling simply takes the four interleaved blocks, as in the bottom-right figure (see the sketch below).
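
A small sketch of dilated sampling on a latent, under the assumption that the global latent is split into interleaved sub-grids of the model's native size and each sub-grid is denoised independently before being written back.

```python
import torch

def dilated_views(z, dilation=2):
    """Split z (B, C, H, W) into dilation**2 interleaved sub-latents.

    Each sub-latent z[:, :, i::dilation, j::dilation] has the model's native
    size; denoising them and writing the results back recombines them into the
    full-resolution global latent.
    """
    return {(i, j): z[:, :, i::dilation, j::dilation]
            for i in range(dilation) for j in range(dilation)}

def merge_dilated(views, z_like, dilation=2):
    out = torch.empty_like(z_like)
    for (i, j), v in views.items():
        out[:, :, i::dilation, j::dilation] = v
    return out
```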

 

MegaFusion

MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

MegaFusion

  1. Based on the cascade idea.

  2. Repeated in a loop: perform diffusion sampling starting from x_{t_{i-1}}, truncate the generation process at t_i and compute the approximate clean latent, decode, upsample, encode, diffuse to t_i, and obtain x_{t_i}.

  3. Similar to ScaleCrafter, standard convolution layers are converted into dilated convolution layers during sampling to enlarge the receptive field.

  4. Similar to RDM, different resolutions use different noise schedules for diffusing, which makes relaying easier.

 

FAM

FAM: Diffusion Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

FAM

  1. Similar to DemoFusion.

  2. Uses the high-frequency part of the low-res result to preserve structure and the low-frequency part of the high-res result to refine detail.

 

FreeScale

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

FreeScale

  1. Follows the same paradigm as DemoFusion.

  2. Restrained dilated convolution: when denoising the high-resolution latent, dilated convolution is used as in ScaleCrafter, but only in the layers of the down-blocks and mid-blocks.

  3. Scale fusion: when denoising the high-resolution latent, the directly computed self-attention is called global attention, and the self-attention computed on UNet-native-resolution patches (as in MultiDiffusion) is called local attention; the high-frequency part of the global attention output is added to the low-frequency part of the local attention output. While local attention tends to produce better local results, it can bring unexpected small-object repetition globally. These artifacts mainly arise from dispersed high-frequency signals, which would originally be gathered to the right area through global sampling. Therefore, the high-frequency signals in the local representations are replaced with those from the global level.

 

AccDiffusion

AccDiffusion: An Accurate Method for Higher-Resolution Image Generation

  1. Patch-based sampling-and-stitching methods like MultiDiffusion easily suffer from object repetition, mainly because every patch is generated with the same prompt, forcing each patch to generate the object described in it. AccDiffusion decouples the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of an image patch.

  2. AccDiffusion introduces dilated sampling with window interaction for better global consistency in higher-resolution image generation.

 

AccDiffusion2

AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation

AccDiffusion2

  1. Generate at low resolution, upsample, extract a Canny map for each patch, and generate with ControlNet control combined with patch-content-aware prompts.

 

ResMaster

ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance

ResMaster

  1. First use the original model to generate a low-resolution image from the prompt, upsample it to the target high resolution, split it into N overlapping low-resolution patches, and encode each patch with the VAE to obtain $\{z_i^L\}_{i=1}^N$.

  2. The z_t of the high-resolution generation process is split into patches $\{z_{i,t}^H\}_{i=1}^N$ in the same way; during generation, each patch is sampled with two forms of guidance.

  3. Fine-grained guidance: re-caption $z_i^L$ and use the caption as the condition of $\epsilon_\theta(z_{i,t}^H, t, c)$; in addition, encode $z_i^L$ with the CLIP image encoder and inject the result through a parallel cross-attention in the style of IP-Adapter (the two parallel cross-attention layers are identical, both being the original model's text cross-attention layer). Experiments show this training-free inference works.

  4. Structural guidance: predict $\hat{z}_{i,0}^H$ from $\epsilon_\theta(z_{i,t}^H, t, c)$ and $z_{i,t}^H$, assign the low-frequency part of $z_i^L$ to $\hat{z}_{i,0}^H$, and use it as the predicted x0 of this DDIM step.

 

HiPrompt

HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts

SyncTweedies

  1. Similar to ResMaster.

 

StreamMultiDiffusion

StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control

  1. Accelerates MultiDiffusion.

 

SyncDiffusion

SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions

  1. MultiDiffusion only keeps the style of adjacent sub-regions consistent; it cannot guarantee global style consistency.

  2. Pick one sub-region as an anchor; before each denoising step, compute the LPIPS score between every sub-region's x_t and the anchor's x_t (computed on the $\hat{x}_0$ estimated from x_t), take its gradient as guidance to update all sub-regions' x_t, and then run the MultiDiffusion procedure (see the sketch below).
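
A rough sketch of the LPIPS-based synchronization guidance, using the `lpips` package; taking the gradient with respect to the $\hat{x}_0$ estimate and applying it to x_t is a simplification of the full chain rule, and the weight is illustrative.

```python
import torch
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance, inputs in [-1, 1]

def sync_gradients(x0_hats, anchor_idx=0, weight=20.0):
    """x0_hats: list of per-window predicted x0 images, shape (1, 3, H, W).

    Returns one gradient per window that pulls its x0_hat toward the anchor
    window in LPIPS space; the caller subtracts weight * grad from the
    corresponding window's x_t before the MultiDiffusion fusion step.
    """
    anchor = x0_hats[anchor_idx].detach()
    grads = []
    for x0 in x0_hats:
        x0 = x0.detach().requires_grad_(True)
        dist = lpips_fn(x0, anchor).sum()
        grads.append(weight * torch.autograd.grad(dist, x0)[0])
    return grads
```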

 

SyncTweedies

SyncTweedies: A General Generative Framework Based on Synchronized Diffusions

SyncTweedies

  1. $\mathcal{Z}$ is the canonical space (e.g., a panorama) and $\{\mathcal{W}_i\}_{i=1}^n$ are instance spaces (e.g., normal-sized images); the diffusion model is trained on $\mathcal{W}_i$, and the n instance spaces can be identical or different. $f$ is the projection (e.g., cropping to the normal size), $g$ is the unprojection (e.g., padding to the panorama size), $\phi$ is the Tweedie-formula prediction of $\hat{x}_0$, and $\psi$ is the posterior-mean formula.

  2. MultiDiffusion and SyncDiffusion correspond to case 3.

  3. The paper finds that case 2 works best.

 

SSL-Guidance

Learned representation-guided diffusion models for large-image generation

  1. Train a diffusion model on image patches, conditioned on features of each patch extracted by a pre-trained SSL model.

  2. At generation time, first generate the features, then generate patch by patch with overlap using the MultiDiffusion approach.

 

CutDiffusion

CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method

CutDiffusion

  1. Two phases, [T, T'] and [T', 0]; each patch has the original diffusion model's native generation size.

  2. The first phase [T, T'] generates the main structure: sample with non-overlapping patches, then randomly permute the pixels at the same position across all patches (e.g., if the first positions of 4 patches hold 1, 5, 9, 13 and the permutation is 9, 5, 1, 13, then the first patch's first position becomes 9, the second patch's becomes 5, and so on), enabling pixels to contribute to the denoising of the other patches and promoting similarity in content generation across patches (see the sketch below).

  3. The second phase [T', 0] refines: as in MultiDiffusion, sample with overlapping patches and average the overlapping areas.
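
A minimal sketch of the phase-1 permutation across non-overlapping patches; the tensor layout (patches stacked along the first dimension) is an assumption.

```python
import torch

def shuffle_across_patches(patches):
    """patches: (N, C, h, w) — the same latent laid out as N non-overlapping patches.

    For every spatial location, randomly permute the values across the N patches,
    so pixels contribute to the denoising of the other patches.
    """
    n, c, h, w = patches.shape
    perm = torch.argsort(torch.rand(n, h, w), dim=0)   # independent permutation per (x, y)
    perm = perm.unsqueeze(1).expand(n, c, h, w)        # same permutation for every channel
    return torch.gather(patches, 0, perm)
```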

 

ASGDiffusion

ASGDiffusion: Parallel High-Resolution Generation with Asynchronous Structure Guidance

  1. Accelerates CutDiffusion with multi-GPU parallelism.

 

InstantAS

InstantAS: Minimum Coverage Sampling for Arbitrary-Size Image Generation

  1. MultiDiffusion is slow because adjacent patches must overlap in order to propagate information.

  2. InstantAS generates with non-overlapping patches and re-partitions the patches after every step, which both propagates information and speeds things up. The idea is somewhat similar to GoodDrag: optimize while generating.

 

VSD

Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis

  1. Based on attention-entropy theory, simply modifying the attention scaling factor lets the model generate images of different sizes.

 

DiffuseHigh

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

DiffuseHigh

  1. First sample with StableDiffusion at the native resolution to obtain x0, upsample it to $\tilde{x}_0$, apply DWT to $\tilde{x}_0$ and keep only its low-frequency component; encode $\tilde{x}_0$ to get $\tilde{z}_0$, add noise up to step N to obtain $\hat{z}_N$, and denoise it with StableDiffusion. At every step, decode the estimated $\hat{z}_0$, apply DWT, replace its low-frequency component with the one saved above, apply iDWT, re-encode, and use the result as the predicted x0 in DDIM sampling (see the sketch below).

  2. The procedure can be repeated until the target resolution is reached.

  3. The low-frequency component represents the low-frequency details of the image, encompassing global structures, uniformly-colored regions, and smooth textures. Hence the method is also called DWT-based structure guidance; it avoids the structural incoherence that arises when StableDiffusion generates high-resolution images directly.
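
A small sketch of the DWT low-frequency swap using PyWavelets; single-channel arrays and the wavelet/level choices are assumptions (apply per channel for RGB).

```python
import pywt  # pip install PyWavelets

def swap_low_frequency(x_hat, x_ref, wavelet="db4", level=3):
    """Replace the low-frequency DWT band of x_hat with that of the reference x_ref.

    x_hat, x_ref: numpy arrays of shape (H, W).
    Returns a reconstruction that keeps x_ref's global structure and
    x_hat's high-frequency detail.
    """
    coeffs_hat = pywt.wavedec2(x_hat, wavelet, level=level)
    coeffs_ref = pywt.wavedec2(x_ref, wavelet, level=level)
    coeffs_hat[0] = coeffs_ref[0]   # cA: the low-frequency approximation band
    return pywt.waverec2(coeffs_hat, wavelet)
```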

 

ScaleCrafter

ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

  1. A pre-trained StableDiffusion cannot directly generate higher-resolution images because the convolution kernels' receptive field is limited.

  2. We propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference.

 

FouriScale

FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

FouriScale

  1. Similar to ScaleCrafter in attributing the problem to the convolution kernels: when generating higher-resolution images, the feature maps are low-pass filtered and the convolution kernels are dilated.

 

HiDiffusion

HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

HiDiffusion

  1. The generated image is highly correlated with the feature map of deep Blocks in structures and feature duplication happens in the deep Blocks. As the higher-resolution feature size of deep blocks is larger than the corresponding size in training, these blocks may fail to incorporate feature information globally to generate a reasonable structure. We contend that if the size of the higher-resolution features of deep blocks is reduced to the corresponding size in training, these blocks can generate reasonable structural information and alleviate feature duplication. Inspired by this motivation, we propose Resolution-aware U-Net (RAU-Net), a simple yet effective method to dynamically resize the features to match the deep blocks.

  2. RAD adjusts the dilation rate of the first conv layer according to the input resolution so that the output feature size matches the feature size seen during the original model's training.

  3. RAU adjusts the interpolation factor before the last conv layer according to the input resolution so that the output feature size matches the current resolution.

  4. Both RAD and RAU do not introduce additional trainable parameters. Therefore, RAD and RAU can be integrated into vanilla U-Net without further fine-tuning.

 

Any-Size-Diffusion

Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images

  1. Keep the VAE frozen and LoRA fine-tune StableDiffusion. Pre-define a set of aspect ratios, each mapped to an image size; during training, find the pre-defined aspect ratio closest to each image's, resize the image to the corresponding size, and train. The model can then generate an image from noise of any aspect ratio.

  2. Uses StableSR's tiled sampling for super-resolution, similar to MultiDiffusion, so the result can be upscaled to an arbitrary resolution.

 

Self-Cascade

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Self-Cascade

  1. Define an increasing sequence of resolutions; only a diffusion model trained at the lowest resolution is required.

  2. During training, pick an x0 at any resolution and add noise to obtain xt; when denoising xt, feed the previous (lower) resolution's x0 into the network, take some UNet features, pass them through an upsampler with very few parameters, and add the output to the corresponding features of xt, effectively conditioning on the previous resolution's x0.

  3. At sampling time, first sample at the lowest resolution, add noise up to some intermediate step, upsample to the next higher resolution, continue sampling, and repeat until the highest resolution is reached.

 

DiffCollage

DiffCollage: Parallel Generation of Large Content with Diffusion Models

  1. Consider a composite image $[x_1, x_2, x_3]$, where $[x_1, x_2]$ is the original image and $[x_3]$ is generated by outpainting conditioned on $[x_2]$.

  2. $p(x_1, x_2, x_3) = p(x_1, x_2)\, p(x_3 \mid x_2) = \dfrac{p(x_1, x_2)\, p(x_2, x_3)}{p(x_2)}$

  3. The corresponding score is $\nabla \log p(x_1, x_2, x_3) = \nabla \log p(x_1, x_2) + \nabla \log p(x_2, x_3) - \nabla \log p(x_2)$.

  4. Two models can therefore be trained separately, one fitting the original image $[x_1, x_2]$ and one fitting the partial image $[x_2]$, and then sampled jointly.

 

ElasticDiffusion

ElasticDiffusion: Training-free Arbitrary Size Image Generation

  1. The diffusion model is trained at H×W; images of arbitrary resolution $\bar{H}\times\bar{W}$ are sampled training-free.

  2. The CFG formula $\epsilon_\theta(x_t) + (1+\omega)\,(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t))$ can be viewed as two parts: the unconditional score $\epsilon_\theta(x_t)$ and the class direction score $\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t)$. We use two key insights. First, the class direction score primarily dictates the image's overall composition, while the unconditional score enhances detail at the pixel level in a more local manner. Second, the unconditional score requires pixel-specific precision, contributing to the image's fine-grained details, while the class direction score affects pixels collectively, defining the image's overall composition. Therefore the unconditional score must be computed precisely, while the class direction score only needs a rough estimate.

  3. For the unconditional score, previous methods such as MultiDiffusion sample overlapping patches (each patch H×W, all patches covering $\bar{H}\times\bar{W}$) and average the overlaps. ElasticDiffusion takes another route: split $\bar{H}\times\bar{W}$ into non-overlapping patches of size h×w with h<H and w<W; at sampling time, the pixels surrounding a patch are used as context to assemble an H×W input to $\epsilon_\theta$, and only the prediction for the current patch is kept. All patch predictions are finally stitched together as the unconditional score. Compared with overlapping sampling, this greatly reduces the number of network calls.

  4. For the class direction score, downsample $\bar{x}_t \in \mathbb{R}^{\bar{H}\times\bar{W}\times 3}$ to $x_t \in \mathbb{R}^{N\times M\times 3}$, where $\bar{H}/\bar{W} = N/M$ and N×M is as close as possible to H×W; then pad $x_t$ to H×W with a random solid-color background and feed it to $\epsilon_\theta$, remove the padded part from the prediction, and upsample back to $\bar{H}\times\bar{W}$. The network is run once with the condition and once without, and their difference gives the class direction score. To keep the latent-signal statistics unchanged, both the downsampling and the upsampling use nearest-neighbor mode.

  5. Because nearest-neighbor resampling is used for the class direction score, many nearby pixels in $\bar{H}\times\bar{W}$ share the same class direction score, which makes the generated image too smooth. Borrowing the resample technique, after the first class direction score is predicted, the prediction is repeated R more times, and each time 20% of the positions in the class direction score are randomly replaced with the new prediction.

  6. Reduced-resolution guidance: estimate an $\hat{x}_0^u$ from the unconditional score; in step 4, the conditional and unconditional predictions give two scores, from which CFG sampling estimates $\hat{x}_0^c \in \mathbb{R}^{N\times M\times C}$; upsample it to $\bar{H}\times\bar{W}$, subtract $\hat{x}_0^u$, and use the gradient of this difference as an additional guidance with strength s. Since the overall image structure is determined in the early diffusion steps, s starts at 200 and is linearly decreased until 60% of the diffusion steps are completed.

 

MagicScroll

MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising

MultiDiffusion

 

FiT

FiT: Flexible Vision Transformer for Diffusion Model

 

BeyondScene

BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

  1. Targets high-resolution human-centric generation given a pose and a prompt.

 

Personalization

direct: uses an existing token and optimizes or adapts its token embedding

transform: a network transforms visual information into a token embedding or a residual

attach: appended after an existing prompt

no pseudo word: no existing or newly added token is needed

 

Subject

TI (direct)

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

  1. StableDiffusion

  2. S*, a new token

  3. The diffusion loss optimizes only the token embedding (the embedding before the text encoder); see the sketch below.
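
A minimal sketch of embedding-only Textual Inversion with Hugging Face components; `tokenizer`, `text_encoder`, and `unet` are assumed to come from an already-loaded StableDiffusion pipeline, and the restore-other-rows trick is only described in the comments rather than implemented.

```python
import torch

# Only the embedding row of the new token "<sks>" should change.
tokenizer.add_tokens(["<sks>"])
text_encoder.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids("<sks>")

embeddings = text_encoder.get_input_embeddings()            # nn.Embedding
with torch.no_grad():                                        # init from a related word, e.g. "cat"
    init_id = tokenizer.convert_tokens_to_ids("cat")
    embeddings.weight[new_id] = embeddings.weight[init_id].clone()

for p in unet.parameters():          p.requires_grad_(False)
for p in text_encoder.parameters():  p.requires_grad_(False)
embeddings.weight.requires_grad_(True)                       # train only this matrix...

optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)
# ...and after each optimizer.step(), copy back every row except new_id so that
# effectively only the new token's embedding is updated by the diffusion loss.
```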

 

DreamBooth (direct)

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

  1. Imagen

  2. [V] class, where [V] is a rare token

  3. A token embedding alone has limited expressive power and underperforms, so the token embedding is optimized while the whole model (including the text encoder) is also fine-tuned.

  4. Fine-tuning brings overfitting and language drift, so a class-specific prior preservation loss is proposed: similar to replay in continual learning, samples generated by the original model are mixed with the new samples as the training set to prevent overfitting (see the sketch below).

  5. An improved variant fine-tunes the diffusion model with LoRA.
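
A minimal sketch of the DreamBooth objective with the class-specific prior preservation term; `eps_theta`, `add_noise`, and the batch keys are hypothetical placeholders, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(eps_theta, batch, lambda_prior=1.0):
    """batch holds a subject sample ("a photo of [V] dog") and a prior sample
    ("a photo of a dog") generated by the frozen model before fine-tuning."""
    noise_s = torch.randn_like(batch["subject_latent"])
    noise_p = torch.randn_like(batch["prior_latent"])
    t = batch["timestep"]

    x_t_s = add_noise(batch["subject_latent"], noise_s, t)   # standard forward diffusion (assumed helper)
    x_t_p = add_noise(batch["prior_latent"], noise_p, t)

    loss_subject = F.mse_loss(eps_theta(x_t_s, t, batch["subject_prompt_emb"]), noise_s)
    loss_prior   = F.mse_loss(eps_theta(x_t_p, t, batch["prior_prompt_emb"]), noise_p)
    return loss_subject + lambda_prior * loss_prior          # class-specific prior preservation
```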

 

CustomDiffusion (direct)

Multi-Concept Customization of Text-to-Image Diffusion

  1. StableDiffusion

  2. [V] class

  3. Trains the token embedding and the cross-attention K/V projection matrices jointly. Like DreamBooth, a regularization dataset is constructed to counter language drift. Essentially a StableDiffusion version of DreamBooth that fine-tunes only the cross-attention K/V projection matrices.

  4. Can be trained on multiple sets of reference images at once; at generation time, prompts can be composed with multiple pseudo words.

 

DreamBooth++ (direct)

DreamBooth++: Boosting Subject-Driven Generation via Region-Level References Packing

DreamBooth++

  1. DreamBooth

  2. Works on a packed composite image: the UNet computation is modified so that convolution and self-attention are restricted to their own regions; training optimizes the pseudo word embedding and fine-tunes the whole UNet.

  3. Besides DreamBooth's two losses, an MSE loss between cross-attention maps is added.

 

Improved (direct)

An Improved Method for Personalizing Diffusion Models

  1. StableDiffusion

  2. [V] class

  3. Adopts Imagic's two-stage training: stage one trains only the token embedding, stage two fine-tunes only the diffusion model.

 

ViCo (direct)

ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation

ViCo

  1. StableDiffusion

  2. S*

  3. Introduces the reference image into the network as a visual condition.

  4. Both z_t and the reference image go through text cross-attention; an image cross-attention is then added, with the output of z_t's text cross-attention as Q and the output of the reference image's text cross-attention as K/V.

  5. A mask is estimated from the reference image's text cross-attention map and used to filter K/V, keeping only the K/V inside the mask (the K/V sequence becomes shorter), so that Q only attends to K/V containing the object.

 

HyperDreamBooth (no pseudo word)

HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models

HyperDreamBooth

  1. On the CelebA-HQ dataset, a HyperNetwork is trained to predict the LoRA parameters of all attention layers of StableDiffusion so as to reconstruct the image. StableDiffusion always receives the prompt "a [V] face", where "[V]" is a rare token; the token embedding of "[V]" is not optimized, because the authors find that with the LoRA parameters alone, "[V]" can already be used freely in new prompts.

  2. At test time, the HyperNetwork first predicts LoRA parameters as an initialization, followed by LoRA fine-tuning; fine-tuning is 25× faster than DreamBooth.

  3. The HyperNetwork architecture is similar to Q-Former; it iteratively refines zero-initialized parameters into the final parameters, and the predicted LoRA parameters are applied to StableDiffusion to compute the diffusion loss that optimizes the HyperNetwork.

 

HyperNetFields (no pseudo word)

HyperNet Fields Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories

HyperNetFields

  1. T is the number of optimization steps, not the diffusion timestep.

  2. $\theta_t$ and $\theta_{t+1}$ are somewhat like the latents at two adjacent diffusion timesteps, and the trajectory $\theta_{t+1} \to \theta_t$ is analogous to the diffusion reverse process.

 

DiffLoRA (no pseudo word)

DiffLoRA: Generating Personalized Low-Rank Adaptation Weights with Diffusion

LoRA Diffusion: Zero-Shot LoRA Synthesis for Diffusion Model Personalization

DiffLoRA

  1. A diffusion model generates the LoRA parameters.

 

XTI (direct)

P+: Extended Textual Conditioning in Text-to-Image Generation

  1. StableDiffusion

  2. Defines the P+ space: the set of text embeddings used by the cross-attention layers of the UNet. Using different text embeddings at different layers has different effects.

  3. TI in P+ space: for a concept, a different token embedding is optimized for each cross-attention layer — 16 different token embeddings in StableDiffusion.

  4. Only the token embeddings are optimized; model parameters are not.

  5. Feeding different layers the TI-learned token embeddings of different concepts also achieves semantic composition.

 

CustomContrast (transform)

CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization

CustomContrast

  1. TI in P+ space.

  2. The reference image serves as the positive sample; other images of the same category are used as negative samples.

  3. A Textual Q-Former extracts the P+ token embedding sequence from a sample, and a Visual Q-Former extracts visual features from it; both Q-Formers have a cls token in their queries, and the outputs at the cls positions are used to compute a contrastive loss.

  4. The reference-image features extracted by the Visual Q-Former are injected into the diffusion model in IP-Adapter fashion.

  5. A contrastive loss is also applied to the P+ token embedding sequences: pull together the token embeddings of positive samples at the same cross-attention layer, and push apart those of positive and negative samples at the same layer.

  6. The model is trained with the diffusion loss and the two contrastive losses.

  7. Trained on a multi-view dataset: for each object, three views are used; the first is the reference image, the second the positive sample of the reference image, and the third the reconstruction target.

 

ProSpect (direct)

ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models

  1. StableDiffusion

  2. Defines the P* space: the 1000 timesteps are evenly divided into 10 stages, and a separate token embedding is trained for each stage.

  3. Using the token embeddings of different references at different stages enables transfer generation and editing of material, style, and layout.

 

NeTI (direct)

A Neural Space-Time Representation for Text-to-Image Personalization

  1. StableDiffusion

  2. P+ space is a spatial extension, but a diffusion model behaves differently at different timesteps, so the P+ space is further extended along the time dimension: a different token embedding for every timestep and every cross-attention layer. That would require far too many token embeddings, so a neural mapper is trained instead that takes the timestep t and the cross-attention layer index l and outputs a 768-dimensional vector as the token embedding.

  3. During optimization, the outputs of the neural mapper are unconstrained, resulting in representations that may reside far from the true distribution of token embeddings typically passed to the text encoder. The norm of the network output is set equal to the norm of the embedding of the concept's "supercategory" token: e.g., when learning a cat-related concept, the final output is $\frac{M(t, l)}{\lVert M(t, l)\rVert}\,\lVert v_{\text{cat}}\rVert$, where $v_{\text{cat}}$ is the word embedding of "cat".

  4. The neural mapper is an MLP whose hidden latent h before the last hidden layer has dimension $d_h$. The authors find that $d_h$ greatly affects the trade-off between reconstruction quality and editability of the inverted concept. In theory one could train multiple neural mappers with different representation sizes and pick the one that best balances reconstruction and editability for each concept, but that is cumbersome and impractical at scale. So during training, at each step a $t \sim U(0, d_h)$ is sampled and the components $h[i > t]$ are dropped out to 0, which encourages the network to be robust to different dimensionalities and to encode more information into the first output components, which have a lower truncation frequency. The dropout can also be controlled at sampling time: with a large dropout, the generated concept is coarser but more editable.

  5. Inverting a concept directly into the UNet's input space, without going through the text encoder, could potentially lead to much quicker convergence and more accurate reconstructions. So the neural mapper outputs two vectors: one is the token embedding, which is fed into the CLIP text encoder along with the other words; the other bypasses the CLIP text encoder and is added directly to the corresponding output text embedding, with the same normalization as above to prevent overfitting. This extra vector is only added to the values of the UNet's cross-attention layers, while the keys use the text-encoder output without the extra vector, following the same rationale as key-locking.

 

ED (direct)

Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization

ED

  1. The original image is decomposed into low- and high-frequency components, corresponding to three pseudo word embeddings in total; the original image's pseudo word embedding equals the sum of the low- and high-frequency embeddings. During training, one of the low-frequency component, the high-frequency component, and the original image is chosen at random. Decomposing and learning them separately works better.

  2. At generation time, the original image's pseudo word embedding is used and can be combined with style descriptions, outperforming other methods.

 

PerFusion (direct)

Key-Locked Rank One Editing for Text-to-Image Personalization

PerFusion

  1. StableDiffusion

  2. The two main goals of personalization are avoiding overfitting and preserving identity, but there is a natural trade-off between them. To improve both simultaneously, the key insight is that models need to disentangle what is generated from where it is generated.

  3. In cross-attention, the key determines where content is generated and the value determines what is generated, so when fine-tuning the diffusion model only $W_V$ is trained, while $W_K$ is edited with ROME so that the output of $W_K$ for the pseudo word embedding stays close to its output for the supercategory word embedding (teddy).

  4. A natural solution is then to edit the weights of the cross-attention layers, $W_V$ and $W_K$, using ROME. Specifically, given the target input $i_{\text{Hugsy}}$, the K activation is enforced to emit a specific target output $o^K_{\text{Hugsy}} = K_{\text{teddy}}$ (key-locking); similarly, given the target input $i_{\text{Hugsy}}$, the V activation emits a learned output $o^V_{\text{Hugsy}}$.

 

InFusion (direct)

Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting

InFusion

  1. Inherits PerFusion's idea: both training and generation run two generative trajectories (C and F); F's cross-attention maps replace C's, and only the pseudo word embedding and $W_V$ are trained.

 

CrossInitialization (direct)

Cross Initialization for Personalized Text-to-Image Generation

  1. TI initializes the token embedding v with the supercategory, but after optimization the learned v's scale grows by tens or hundreds of times. Such a large change shows that this initialization of v is not good enough, causing slow optimization, overfitting, and poor editability.

  2. Comparing the token embedding v with E(v), the embedding at its position after passing through the text encoder, shows that the text encoder keeps changing the magnitude and direction of the token embedding, and that after TI optimization v and E(v) become very similar in magnitude and direction.

  3. For a token embedding v, replacing it with E(v) and feeding that into the text encoder yields similar generations — somewhat like a fixed point.

CrossInitialization

  1. This suggests the TI optimization target is v = E(v), so v can be initialized with E(v_super), the text-encoder output at the position of the supercategory token embedding v_super, together with a regularizer that keeps v from drifting too far from this initialization.

 

DC (direct)

Learning to Customize Text-to-Image Diffusion In Diverse Context

DC

  1. Uses masked language modeling (MLM) to strengthen the linguistic properties of the pseudo word embedding.

 

DP (direct)

A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization

  1. Based on DreamBooth.

  2. Constructs a better regularization dataset.

Data-Oriented

 

UFC (direct)

User-Friendly Customized Generation with Multi-Modal Prompts

UFC

  1. Uses BLIP and ChatGPT to construct better regularization prompts, customized prompts, and generation prompts.

 

SID (direct)

Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization

SID-1

SID-2

  1. Based on DreamBooth.

  2. Similar to DP: descriptions during training are made as detailed as possible, which reduces the bias toward irrelevant content in the pseudo word.

  3. The authors summarize several common biases and use a VLM to generate sentences describing them.

 

SAG (direct)

Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance

  1. Improves the sampling process. c is the prompt containing the pseudo word ("A pencil sketch of S*"), and $c_0$ is c with the pseudo word replaced by a generic descriptor ("A pencil sketch of a dog"). First compute a weak CFG, $\bar{\epsilon} = \epsilon_\theta(x_t, c) + \omega_t\,(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, c_0))$, then the final CFG, $\epsilon = \bar{\epsilon} + \omega\,(\bar{\epsilon} - \epsilon_\theta(x_t, \phi))$, which is used for sampling and improves consistency.

 

AlignIT (direct)

AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models

AlignIT-1 AlignIT-2
  1. Different methods operate at different stages: e.g., TI trains the token embedding at stage 1, CatVersion trains the text encoder at stage 2, and CustomDiffusion trains the cross-attention K/V projection matrices at stage 3. All three kinds of methods ultimately modify the final K/V that is fed into cross-attention and thus affects the final image.

  2. For existing methods, comparing the model's cross-attention maps for "a cat playing with ball" and "a <sks> playing with ball" shows that, because of the pseudo word, the cross-attention maps of the unchanged words are also affected, which is why these methods underperform.

  3. At generation time, run the original model with "a cat playing with ball" and the customized model with "* <sks> * * *", and replace the pseudo word's cross-attention map in the former with the one from the latter. This can be applied on top of various TI methods.

 

PALP (direct)

PALP: Prompt Aligned Personalization of Text-to-Image Models

  1. A LoRA version of DreamBooth.

  2. At test-time fine-tuning, not only the reference images but also the prompt to be used at generation (e.g., "a sketch of [V]") must be provided, i.e., fine-tuning is done before every generation.

  3. One problem of personalization is overfitting: an overfit model can predict the subject's shape and features from pure noise in a single step. The key idea is to encourage the model's denoising prediction towards the target prompt.

PALP-1

  1. Besides the diffusion loss, an SDS loss is added so that the prediction conditioned on the prompt moves toward the prediction conditioned on the clean prompt.

PALP-2

 

CLiC (direct)

CLiC: Concept Learning in Context

CLiC

  1. StableDiffusion

  2. An RoI version of Custom-Diffusion: TI is applied to the object inside the RoI while the cross-attention K/V projection matrices are also optimized.

  3. $\mathcal{L}_{con}$ is the diffusion loss with the RoI region up-weighted, aiming to learn the RoI's pattern within its context; $\mathcal{L}_{attn}$ takes the token's cross-attention map and boosts the responses inside the RoI while suppressing those outside; $\mathcal{L}_{RoI}$ performs TI with a sentence containing only the token and an image containing only the RoI.

  4. Editing is done with SDEdit + Blended Diffusion.

 

MagicTailor (direct)

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

MagicTailor

 

EM-Optimization (direct)

Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

  1. Initialize the token by encoding the super-class name with the CLIP text encoder, then optimize with an EM algorithm.

  2. E-step: randomly pick 50 timesteps, add noise to the reference image, feed it together with the pseudo-word prompt into StableDiffusion, extract the pseudo word's cross-attention maps, average them, and threshold the result into a mask.

  3. M-step: with the mask above, optimize the pseudo word embedding using a masked diffusion loss + masked cross-attention loss.

 

RPO (direct)

Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning

RPO

  1. Fine-tunes the diffusion model with Diffusion-DPO so that, when conditioned on a prompt containing the pseudo word, the model prefers the reference image more.

  2. The similarity loss is simply the diffusion loss on the reference image.

 

CustomSketching (direct)

CustomSketching: Sketch Concept Extraction for Sketch-based Image Synthesis and Editing

CustomSketching

 

IP-Adapter (no pseudo word, no test-time fine-tuning)

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

 

MIP-Adapter (no pseudo word, no test-time fine-tuning)

Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation

MIP-Adapter

  1. An IP-Adapter that handles multiple reference images.

 

DreamTuner (no pseudo word, no test-time fine-tuning / direct)

DreamTuner: Single Image is Enough for Subject-Driven Generation

DreamTuner-1

  1. Similar in spirit to ViCo: injecting the reference image's features into StableDiffusion enables subject-driven generation.

  2. Subject-Encoder: to decouple content from background, salient object detection removes the background; to decouple content from position, a pre-trained ControlNet can inject positional information, so the encoder learns pure content features.

  3. Subject-Encoder-Attention: a trainable cross-attention layer (S-E Attention) is inserted between StableDiffusion's self-attention and cross-attention to reconstruct the reference image; the reference image's self-attention is appended to the generated image's self-attention to provide a reference.

DreamTuner-3

  1. Self-Subject-Attention: the features of the reference image extracted by the text-to-image U-Net are injected into the self-attention layers, which provides refined and detailed reference because they share the same resolution as the generated features. At each generation step, random noise is added directly to the reference image, which is fed into the UNet; the keys and values of its self-attention layers are extracted and interact with those of the generation pass as described above.

  2. With the above, personalized generation works even without training a pseudo word embedding, but training a pseudo word embedding plus fine-tuning the diffusion model, as in DreamBooth, works better.

 

FreeTuner (no pseudo word, no test-time fine-tuning)

FreeTuner: Any Subject in Any Style with Training-free Diffusion

FreeTuner

  1. Similar to DreamTuner.

  2. Three feature swap operations. 1) Cross-attention map swap: inject the reconstruction branch's subject-related cross-attention map into the personalized branch (e.g., "horse" here). 2) Self-attention map swap: inject the $M_{sub}$ part of the reconstruction branch's self-attention map into the same part of the personalized branch's self-attention map, i.e., $M_{sub}\, SA_t + (1 - M_{sub})\, \widetilde{SA}_t$. 3) Latent swap: inject the $M_{sub}$ part of the reconstruction branch's $z_t$ into the same part of the personalized branch's latent, i.e., $M_{sub}\, z_t + (1 - M_{sub})\, \tilde{z}_t$. In this way, personalized generation works even without training a pseudo word embedding.

  3. If a style image is also given, VGG-19 features are extracted to compute a similarity, whose gradient serves as guidance.

 

SSR-Encoder (no pseudo word, no test-time fine-tuning)

SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation

SSR-Encoder

  1. Similar in spirit to ViCo: injecting the reference image's features into StableDiffusion enables subject-driven generation.

  2. The query (text) is encoded with the CLIP text encoder and the reference image with the CLIP image encoder into a sequence feature; the two features yield a cross-attention map. Sequence features from K different layers of the CLIP image encoder serve as V and, combined with the cross-attention map, produce SSR features of length K, which are injected into StableDiffusion in IP-Adapter fashion; the newly introduced projection matrices are trained.

  3. Self-supervised training on text-image pairs: keywords are extracted as the query, and x_t is simply the noised reference image.

  4. This shows that the sequence features produced by the CLIP image encoder can also be used for similarity computation, not just the CLS-token feature.

 

Mask-ControlNet (no pseudo word, no test-time fine-tuning)

Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt

Mask-ControlNet

 

DiptychPrompting (no pseudo word, no test-time fine-tuning)

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

DiptychPrompting

  1. Diptych: a two-panel picture.

 

DreamMatcher (direct)

DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization

  1. An optimization of the TI generation process, applicable to different TI methods such as DreamBooth and CustomDiffusion.

  2. Self-attention plays two roles: the attention map computed from Q and K controls the image structure, and V controls visual attributes such as color and texture.

  3. Images generated by TI methods usually get the concept's structure right, but specific details such as color and texture deviate from the concept in the reference image, so this method preserves appearance by modifying the V of self-attention during TI generation.

  4. Concretely, a dual branch is used: the reference image goes through DDIM inversion and reconstruction, giving a reconstructive trajectory, while the other branch starts from random noise conditioned on the prompt with the pseudo word, giving a generative trajectory. Because the concept's position in the generated image is uncertain and does not match its position in the reference image, directly replacing the V of the generative trajectory with the corresponding V of the reconstructive trajectory causes misalignment. Therefore, some UNet decoder features of the two trajectories are used to establish a semantic correspondence, from which a dense displacement field is computed; the V of the reconstructive trajectory is warped by this field, and the warped V replaces the corresponding V in the generative trajectory.

 

DVAR (direct)

Is This Loss Informative? Speeding Up Textual Inversion with Deterministic Objective Evaluation

  1. Proposes an early-stopping criterion that accelerates TI by nearly 15× with no noticeable quality drop.

 

PACGen (direct)

Generate Anything Anywhere in Any Scene

  1. A word learned with DreamBooth can also be used in plug-and-play models like GLIGEN. However, one drawback of DreamBooth is that it cannot disentangle the object from its positional information: when generating with a model that takes extra layout information such as GLIGEN, once the position is changed, the object can no longer be generated well.

  2. Trains DreamBooth with data augmentation: by incorporating a data augmentation technique that involves aggressive random resizing and repositioning of training images, PACGen effectively disentangles object identity and spatial information in personalized image generation.

 

CI (direct)

Compositional Inversion for Stable Diffusion Models

  1. existing methods often suffer from overfitting issues, where the dominant presence of inverted concepts leads to the absence of other desired concepts. It stems from the fact that during inversion, the irrelevant semantics in the user images are also encoded, forcing the inverted concepts to occupy locations far from the core distribution in the embedding space.

  2. Textual Inversion will make the new (pseudo-)embeddings OOD and incompatible to other concepts in the embedding space, because it does not have enough interactions with others during the post-training learning。加入正则项,使得学到的embedding和一些已知的且相关的concept的embedding不要太远,比如给定和猫相关的reference images时,使得学到的embedding和cat, pet等的embedding靠近。这样学到的embedding更具一般性,和其他单词组合造句时就像用cat造句一样,模型可以识别,也可以和其他学到的embedding组合造句进行multi-concept generation。

 

CoRe (direct)

CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization

CoRe

  1. The same x_t is fed into the network with three different prompts to compute the losses.

 

SuDe (direct)

SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation

SuDe

  1. If a pseudo word learned by conventional methods is used in a sentence such as "[V] is running", the model cannot generate the running correctly, yet the same sentence with the base class generates correctly. This shows that the learned pseudo word does not inherit the base class's attributes.

  2. A regularizer is added to TI so that the learned pseudo word inherits the base class's attributes: minimize the discrepancy between $x_\theta(x_t, p_{[V]}, t)$ and $x_{\bar{\theta}}(x_t, p_{\text{base}}, t)$, where $\bar{\theta}$ means detached, i.e., no gradient flows through that term. It can be applied to different methods; the idea is similar to how PALP prevents overfitting.

 

ProFusion (direct)

Enhancing Detail Preservation for Customized Text-to-Image Generation A Regularization-Free Approach

  1. Obtains the token embedding with TI and no regularization at all.

  2. Previous works add regularization to prevent overfitting, but this also leads to insufficient information extraction; this paper proposes Fusion Sampling to address the issue.

 

DisenBooth (direct)

DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation

  1. DreamBooth

  2. Previous works such as TI and DreamBooth optimize a single token for the reference images; DisenBooth additionally encodes an independent subject-unrelated token for each reference image, which helps learn the subject features shared by all reference images while ignoring per-image details such as the background.

  3. Given $\{x^i\}_{i=1}^K$ and P = "a photo of [V] dog", let $f_s = E_T(P)$ and $f^i = \text{mask}\odot E_I(x^i) + \text{MLP}(\text{mask}\odot E_I(x^i))$, where mask is a learnable vector that filters out subject-related information and the skip-connected MLP maps the CLIP image embedding into a text embedding. $\mathcal{L}_1 = \sum_{i=1}^K \lVert \epsilon^i - \epsilon_\theta(z^i_{t_i}, t_i, f_s + f^i)\rVert_2^2$, $\mathcal{L}_2 = \sum_{i=1}^K \lVert \epsilon^i - \epsilon_\theta(z^i_{t_i}, t_i, f_s)\rVert_2^2$, and $\mathcal{L}_3 = \sum_{i=1}^K \cos\langle f_s, f^i\rangle$, which lowers the similarity between the subject-related and subject-unrelated parts.

  4. Fine-tuned with LoRA.

 

DreamArtist (direct)

DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Contrastive Prompt-Tuning

  1. Similar to DisenBooth.

  2. Borrowing from classifier-free guidance, two pseudo words are learned: a positive pseudo word extracts the main features (playing the role of y) and a negative pseudo word removes superfluous features (playing the role of $\phi$). Both words are placed in "a photo of []" and fed into StableDiffusion to obtain two outputs, from which $\epsilon_\theta$ is computed with the classifier-free guidance formula; Tweedie's formula then computes $\hat{z}_0$ from $z_t$ and $\epsilon_\theta$. An MSE loss on z and a pixel L1 loss after decoding with StableDiffusion's VAE decoder push the pseudo words to learn pixel-level detail.

  3. At generation time, the negative pseudo word's output is used as the unconditional branch for CFG.

 

StyO (direct)

StyO: Stylize Your Face in Only One-Shot

  1. StableDiffusion

  2. One-shot face stylization: applying the style of a single target image to the source image.

  3. Content and style words are constructed and TI is run on three datasets while StableDiffusion is also fine-tuned; both the target and the source consist of a single image.

StyO-1

Generation then uses the following prompt:

StyO-2

 

DreamDistribution (direct on prompt)

DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models

DreamDistribution

  1. Somewhat similar to De-Diffusion, but without an explicit caption.

  2. K prompts, each a trainable sequence of word embeddings; after encoding, the mean and variance are computed, a sample is drawn with the reparameterization trick and fed into the pre-trained StableDiffusion, and all prompts are optimized.

  3. At generation time, simply sample from the learned prompt distribution.

 

SingleInsert (transform)

SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing

SingleInsert

  1. StableDiffusion

  2. Following Break-A-Scene, DINO or SAM segments the intended concept, and training uses a masked diffusion loss.

  3. Two-stage training: stage one performs TI and trains only the image encoder; stage two fine-tunes the encoder plus the diffusion model.

  4. A text without the pseudo word is also fed in to compute $\mathcal{L}_{bg}$, to minimize the impact of the learned embedding on the background area.

 

ELITE (transform)

ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

ELITE

  1. StableDiffusion

  2. Two-stage training: global and local.

  3. Global: CLIP extracts features from the reference image, and a global mapping network maps the CLIP features of different layers to different token embeddings; the token embedding predicted from the deepest feature corresponds to subject-related information, while those from shallower features correspond to subject-unrelated information. The global mapping network and the cross-attention K/V projection matrices are trained jointly.

  4. Local: the reference image's background is removed, CLIP extracts its features, and a local mapping network maps them to token embeddings (only the deepest word is used here). An additional set of cross-attention K/V projection matrices is added and trained together with the local mapping network. The cross-attention output is now the sum of the global and local cross-attention outputs; the global cross-attention still takes the token embeddings produced in the global stage (only the deepest word), but is not trained. This stage is LoRA-like and encourages the model to bind more details to the word embedding produced in the global stage.

 

E4T (transform)

Designing an Encoder for Fast Personalization of Text-to-Image Models

  1. StableDiffusion

  2. Textual Inversion shows that the word embedding space exhibits a trade-off between reconstruction and editability: more accurate concept representations typically reside far from real word embeddings, leading to poorer performance when using them in novel prompts. StyleGAN inversion has the same problem; a two-step solution consists of approximate inversion followed by model tuning. The initial inversion can be constrained to an editable region of the latent space, at the cost of providing only an approximate match for the concept; the generator is then briefly tuned to shift the content in this region of the latent space so that the approximate reconstruction becomes more accurate.

  3. For each domain (face, cat, dog, etc.), an encoder E is trained that encodes an image concept $I_c$ as an offset in word-embedding space: $e_c = e_{\text{domain}} + s\,E(I_c)$, which constrains the predicted embeddings to reside near the fixed word embedding of the domain's coarse descriptor $e_{\text{domain}}$. This is fed into StableDiffusion while the cross-attention projection matrices are LoRA fine-tuned to reconstruct the image, similar to Custom-Diffusion.

  4. $\lVert E(I_c)\rVert_2^2$ is used as a regularizer.

  5. Each domain is first pre-trained on its own large dataset and then test-time fine-tuned on the few given images, using the same training procedure in both stages.

 

DA-E4T (transform)

Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

  1. Introduces a contrastive-based regularization technique so that the encoder can handle data from different domains.

 

Cones (direct, no test-time fine-tuning)

Cones: Concept Neurons in Diffusion Models for Customized Generation

Cones

  1. StableDiffusion

  2. training-free

  3. For each set of concepts, find the neurons (concept neurons) in the cross-attention K/V parameters whose masking reduces the DreamBooth loss (reconstruction loss + preservation loss). No training is needed: simply masking these neurons yields a text-to-image model that is sensitive to this set of concepts. Existing but rarely used words, such as AK47, serve as the pseudo words.

 

Cones2 (transform)

Cones 2: Customizable Image Synthesis with Multiple Subjects

Cones2

  1. For a subject of some class, a residual token embedding for that class's token is learned. This is done by training the text encoder with TI, but that alone would bias every word in the sentence toward the subject, so a regularizer is added: ChatGPT writes sentences about the class, each sentence is encoded with both the trained and the original text encoders, and the token embeddings of the non-class words are kept as close as possible before and after training. The final residual token embedding is the mean, over all sentences, of the difference at the class token's embedding. (Note that a word's standalone embedding differs from its embedding within a sentence.)

  2. Each residual token embedding is thus reusable and can be used together with other residual token embeddings; the cross-attention maps can also be manipulated to specify each concept's location.

 

HiPer (attach)

Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion

  1. StableDiffusion

  2. A sentence is written for the reference image without any pseudo word; instead, the empty positions after the text embedding are used and personalized embeddings are appended there. Only the personalized embeddings are optimized during training.

personalize

 

HiFi-Tuner (attach)

HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

  1. Similar to HiPer; optimizes the last 5 embeddings.

 

CatVersion (no pseudo word)

CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization

  1. StableDiffusion

  2. The base class word (e.g., dog) is fed into CLIP; at CLIP's last 3 self-attention layers, a trainable residual embedding is added to the key and value respectively, i.e., $K_f + \Delta K$ and $V_f + \Delta V$, and $\Delta K$ and $\Delta V$ are trained on the reference images.

CatVersion

 

DPG (no pseudo word)

Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

DPG

  1. Directly fine-tunes StableDiffusion on the reference images with RL.

  2. The reward is defined as the diffusion loss.

 

SuTI (no pseudo word, no test-time fine-tuning)

Subject-driven Text-to-Image Generation via Apprenticeship Learning

  1. Imagen

  2. For each concept, an Imagen model is fine-tuned on {3-10 images of the concept, the concept's text (e.g., "berry bowl")}, binding the text to the visual features of the given images. The concept text is then used to build a prompt (e.g., "berry bowl floating on the river"), and the fine-tuned model generates a target image from it. A large diffusion model is trained with apprenticeship learning, with the target image as x0 and {the 3-10 images of the concept, the concept's text, the prompt} as the condition.

  3. The trained large model can then, given 3-10 images of an unseen concept and that concept's text, take any prompt built with this text and generate images aligned with both the prompt and the 3-10 unseen-concept images.

 

Obeject-Encoder (no pseudo word, no test-time fine-tuning)

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

  1. Imagen

  2. Only works for personalization in specific domains such as faces and animals; it cannot do open-domain personalization.

  3. For each domain, training uses that domain's dataset: the background of each image is removed, an object encoder is trained to extract object features, a captioning model generates the image's text, and Imagen is fine-tuned on both the object features and the text, with some regularization to prevent overfitting.

  4. The trained model can then generate freely from a reference image's object features and a user-written prompt.

 

InstantBooth (transform, no test-time fine-tuning)

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

  1. StableDiffusion

  2. Similar to Object Encoder: only works for personalization in specific domains such as faces and animals, not open-domain.

  3. For each domain, training uses that domain's dataset: each image is treated as a concept; an encoder is trained to encode the image into two features, a concept feature and a visual feature. The concept feature replaces the embedding at the pseudo word's position in the text embedding, while the visual feature is injected into StableDiffusion via GLIGEN; the encoder and GLIGEN's adapter are trained together, with data augmentation and background removal to prevent overfitting. The pseudo word's token embedding itself is not optimized.

  4. At inference, a prompt can be built with the pseudo word; the concept features obtained by encoding the reference images are averaged and replace the embedding at the pseudo word's position in the text embedding.

 

Instruct-Imagen (no pseudo word, no test-time fine-tuning)

Instruct-Imagen: Image Generation with Multi-modal Instruction

Instruct-Imagen

Re-Imagen + Instruction Tuning

Re-Imagen serves to make the model condition on multi-modal input.

 

BootPIG (no pseudo word, no test-time fine-tuning)

BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models

BootPIG-1

BootPIG-2

  1. Similar to DreamTuner: the network is trained to read the reference image directly so it can generate without test-time fine-tuning. The whole Reference UNet and the four matrices in the Base UNet's self-attention layers are trained.

  2. Data is synthesized for self-supervised training.

 

JeDi (no pseudo word, no test-time fine-tuning)

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

JeDi

  1. An image-sequence generation approach similar to M2M.

 

Multi-Subject

Break-A-Scene (direct)

Break-A-Scene: Extracting Multiple Concepts from a Single Image

  1. Extracts multiple concepts from a single image.

  2. Given an image with segmentation annotations, the pseudo words of the different objects in the image are extracted in one go, trained with a masked diffusion loss + masked cross-attention loss.

 

UCD (direct)

Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models

UCD

  1. Learns K concepts from N different images, or K concepts from images of the same scene, which can then be composed for generation.

  2. use the combination of the score of different concepts (a learnable word embedding) to reconstruct images using diffusion loss.

 

ConceptExpress (direct)

ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

ConceptExpress

  1. An unsupervised version of Break-A-Scene.

  2. Clustering yields a rough segmentation of the multiple instances; each instance is assigned a learnable token embedding and learned with a masked diffusion loss.

  3. A contrastive loss and regularization terms are used as auxiliary enhancements.

 

DisenDiff (direct)

Attention Calibration for Disentangled Text-to-Image Personalization

DisenDiff

  1. CustomDiffusion

  2. Extracts multiple concepts from a single image.

  3. $\mathcal{L}_{bind}$ increases the IoU between the cross-attention maps of V1 and cat and of V2 and dog; $\mathcal{L}_{s\&s}$ decreases the IoU between the cross-attention maps of cat and dog.

  4. Suppress: squaring the cross-attention map (element-wise multiplication) suppresses low responses and strengthens high responses.

 

AttenCraft (direct)

AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization

AttenCraft

  1. CustomDiffusion

  2. Extracts multiple concepts from a single image.

  3. First, without pseudo words, use the class words of the concepts and, at some small timestep, extract each concept's mask with the DatasetDiffusion approach; then, while learning with the CustomDiffusion approach, optimize the KL divergence between each pseudo word's cross-attention map and its corresponding mask.

 

UnZipLoRA (direct)

UnZipLoRA: Separating Content and Style from a Single Image

UnZipLoRA

  1. Uses the ZipLoRA approach to learn the two LoRAs $L_c$ and $L_s$ directly from a single reference image.

 

Multi-Subject Composition

Mix-of-Show (direct)

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models

Mix-of-Show

  1. Methods that only optimize the token embedding, such as TI and P+, are sufficient when the reference image is in-domain, but they perform poorly on out-of-domain reference images.

  2. (c) For methods that optimize both the token embedding and the model parameters (such as DreamBooth and CustomDiffusion), generating with only the optimized token embedding and the original model parameters yields rather similar results, showing that the token embedding still captures in-domain information while the out-of-domain information lives in the updated model parameters.

  3. (d) To transfer more information into the tokens, P+'s layer-wise embeddings are adopted together with multi-word embeddings.

  4. After the single concepts are learned, fusing the multiple LoRA parameters $\Delta W_i$ into one model for multi-concept generation becomes a problem. A simple weighted sum of the $\Delta W_i$ generates poorly because the parameters interfere with one another; instead, a fused weight is obtained by optimization: $W = \arg\min_W \sum_{i}^{n} \lVert (W_0 + \Delta W_i) X_i - W X_i \rVert_F^2$ (see the sketch below).
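
A small sketch of the fusion objective above solved in closed form via the normal equations; shapes and names are assumptions, and $X_i$ stands for cached input activations of concept i.

```python
import torch

def fuse_loras(W0, delta_Ws, Xs):
    """Solve W = argmin_W sum_i ||(W0 + dW_i) X_i - W X_i||_F^2.

    W0: (d_out, d_in) base weight; delta_Ws: list of LoRA updates, same shape;
    Xs: list of input activation matrices X_i of shape (d_in, n_i).
    """
    lhs = torch.zeros(W0.shape[1], W0.shape[1])     # sum_i X_i X_i^T
    rhs = torch.zeros(W0.shape[1], W0.shape[0])     # sum_i X_i X_i^T (W0 + dW_i)^T
    for dW, X in zip(delta_Ws, Xs):
        lhs += X @ X.T
        rhs += X @ (X.T @ (W0 + dW).T)
    # Normal equations: W (sum X X^T) = sum (W0 + dW_i) X X^T  ->  solve for W^T.
    W_T = torch.linalg.lstsq(lhs, rhs).solution
    return W_T.T
```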

 

LoRA-Composer (direct)

LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models

  1. Addresses how the separately trained pseudo words and LoRAs of Mix-of-Show can participate in generation together; training an extra LoRA to fuse all LoRAs, as Mix-of-Show does, causes concept confusion and concept vanishing.

  2. A training-free method; bounding boxes for the different concepts must be provided.

  3. The prompt is split into local prompts, one concept (pseudo word) per local prompt. At each generation step, z_t is fed into the model with each set of LoRA parameters for one prediction (using the corresponding local prompt); several losses based on the bounding boxes are computed on the self-attention and cross-attention maps, and z_t is updated by gradient descent. The updated z_t is then fed into the original StableDiffusion for denoising, and the loop repeats. The goal is to make each concept appear inside its bounding box and keep the concepts from interfering with one another.

 

ZipLoRA (direct)

ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs

ZipLoRA

  1. D denotes the original diffusion model, $L_c$ the content LoRA, $L_s$ the style LoRA, and $L_m$ the merger of $L_c$ and $L_s$ as shown: $\lVert (D \oplus L_m)(x_c, p_c) - (D \oplus L_c)(x_c, p_c)\rVert_2 + \lVert (D \oplus L_m)(x_s, p_s) - (D \oplus L_s)(x_s, p_s)\rVert_2$.

  2. The regularizer $\sum_i \lvert m_c^i \cdot m_s^i \rvert$ enforces an orthogonality constraint between the columns of the individual LoRA weights (see the sketch below).
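
A minimal sketch of the ZipLoRA-style merge and its loss terms as summarized above; the prediction tensors are assumed to come from separate forward passes of the merged and individual models on their own (input, prompt) pairs.

```python
import torch

def zip_merge(delta_Wc, delta_Ws, mc, ms):
    """Merged update: column-wise reweighting of the content and style LoRA deltas.

    delta_Wc, delta_Ws: (d_out, d_in) LoRA weight deltas; mc, ms: (d_in,) merger vectors.
    """
    return delta_Wc * mc.unsqueeze(0) + delta_Ws * ms.unsqueeze(0)

def zip_losses(pred_m_c, pred_c, pred_m_s, pred_s, mc, ms, lam=0.01):
    """Match the merged model to each individual LoRA on its own data,
    plus the column-orthogonality regularizer sum_i |m_c^i * m_s^i|."""
    align = (pred_m_c - pred_c).pow(2).mean() + (pred_m_s - pred_s).pow(2).mean()
    ortho = (mc * ms).abs().sum()
    return align + lam * ortho
```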

 

LoRA.rar (direct)

LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation

LoRA.rar

  1. The loss is exactly the same as ZipLoRA's, but a HyperNetwork is trained that takes $L_c$ and $L_s$ as input and predicts the merger coefficients $m_c$ and $m_s$.

  2. Trained on a large number of different LoRAs, which enables zero-shot merging without retraining for every LoRA pair as ZipLoRA requires.

 

LoRACLR (direct)

LoRACLR: Contrastive Adaptation for Customization of Diffusion Models

LoRACLR

  1. Given a pre-trained LoRA Vi , we first create pairs of input and output features, denoted as Xi and Yi , respectively.

 

CP (no pseudo word)

Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

  1. Continual Learning setup: a user may wish to personalize a model on multiple concepts but one at a time, with no access to the data from previous concepts due to storage/privacy concerns.

 

MultiBooth (direct)

MultiBooth: Towards Generating All Your Concepts in an Image from Text

MultiBooth-1

MultiBooth-2

  1. The idea is similar to LoRA-Composer.

 

MC2 (direct)

MC2: Multi-concept Guidance for Customized Multi-concept Generation

MC2

  1. Similar to LoRA-Composer, it addresses how the separately trained pseudo words and LoRAs of Mix-of-Show can participate in generation together; training an extra LoRA to fuse all LoRAs, as Mix-of-Show does, causes concept confusion and concept vanishing.

  2. A training-free method, and bounding boxes for the different concepts are not required.

  3. The prompt is split into local prompts, one concept (pseudo word) per local prompt. At each generation step, z_t is fed into the model with each set of LoRA parameters for one prediction (using the corresponding local prompt) to obtain cross-attention maps; the pairwise IoUs ($\frac{n(n-1)}{2}$ in total) are averaged as a loss and z_t is updated by gradient descent. The updated z_t is then fed into the models with the different LoRA parameters to obtain $z_{t-1}^{c,i}$, and into StableDiffusion with the empty prompt to obtain $z_{t-1}^{u}$; the final $z_{t-1}$ is computed as $z_{t-1}^{u} + \sum_{i=1}^{n} \omega_i\,(z_{t-1}^{c,i} - z_{t-1}^{u})$.

 

OMG (direct)

OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models

OMG-1

OMG-2

  1. Similar to LoRA-Composer, it addresses how the separately trained pseudo words and LoRAs of Mix-of-Show can participate in generation together. It is a general method that works with different TI methods, and even pseudo words and LoRAs trained by different TI methods can generate together.

  2. Generation has two stages. In the first stage, the pseudo words are replaced with their general class words and the original StableDiffusion generates; all cross-attention maps of the general class words during generation are kept, and SAM produces masks for the general class words in the result. The second stage repeats the same generation process, but at every step each concept is generated with its pseudo word and its LoRA; the noises predicted for all concepts are blended with the first-stage masks, and the first-stage cross-attention maps also replace the pseudo words' cross-attention maps, to preserve the layout.

 

FreeCustom (no pseudo word, no test-time fine-tuning)

FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition

FreeCustom

  1. A training-free method; masks for the different concepts must be provided.

  2. MRSA: injects the self-attention K/V of the reference paths into the composition path.

 

OrthoAdaptation (direct)

Orthogonal Adaptation for Modular Customization of Diffusion Models

During LoRA fine-tuning, different concepts use mutually orthogonal B matrices; each B is fixed and only A is trained. The multiple learned concepts can then be generated together: orthogonality lets the LoRA parameters of different concepts simply be summed for joint use (see the sketch below).
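
One way (an assumption for illustration, not necessarily the paper's exact construction) to obtain frozen, mutually orthogonal B matrices is to slice disjoint columns from a random orthonormal basis:

```python
import torch

def orthogonal_B_matrices(n_concepts, d_out, rank, seed=0):
    """Draw n_concepts frozen down-projection matrices B_k (d_out x rank) with
    mutually orthogonal column spaces by slicing disjoint columns of a random
    orthonormal basis. Requires n_concepts * rank <= d_out."""
    g = torch.Generator().manual_seed(seed)
    Q, _ = torch.linalg.qr(torch.randn(d_out, n_concepts * rank, generator=g))
    return [Q[:, k * rank:(k + 1) * rank].clone() for k in range(n_concepts)]

# Each concept k then trains only A_k; since B_k^T B_j = 0 for k != j, the
# updates can simply be summed at inference:  W = W0 + sum_k B_k @ A_k.
```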

 

MLoE (direct)

Mixture of LoRA Experts

  1. Addresses how the separately trained pseudo words and LoRAs of Mix-of-Show can participate in generation together; training an extra LoRA to fuse all LoRAs, as Mix-of-Show does, causes concept confusion and concept vanishing.

  2. Similar to MoE: a gating function is trained that computes a gating value from each LoRA's output; the gating values linearly combine the outputs of the different LoRAs, and training uses the same data and loss used to train the LoRAs.

 

Break-for-Make (direct)

Break-for-Make: Modular Low-Rank Adaptations for Composable Content-Style Customization

Break-for-Make

  1. DreamBooth + LoRA (applied to the cross-attention layers); two pseudo words must be learned from different reference images, one for content and one for style. There are two baselines: sharing a single LoRA and training jointly, or learning the LoRAs separately and simply adding them together at use time.

  2. The weights are block-decomposed: $W_{up} = \begin{bmatrix} A \\ B \end{bmatrix}$ and $W_{down} = \begin{bmatrix} C & D \end{bmatrix}$ with $A \in \mathbb{R}^{d\times r}$, $B \in \mathbb{R}^{(m-d)\times r}$, $C \in \mathbb{R}^{r\times d}$, $D \in \mathbb{R}^{r\times(n-d)}$, so $\Delta W = \begin{bmatrix} AC & AD \\ BC & BD \end{bmatrix}$. A and C are initialized to be mutually orthogonal and kept frozen, which amounts to keeping the first d dimensions fixed, with $d < \min(m, n)$. The two pseudo words optimize B and D respectively, so AD is determined entirely by D and BC entirely by B; these learn the features of the respective concepts, while BD learns the features of their interaction.

 

Cones2 (transform)

Cones 2: Customizable Image Synthesis with Multiple Subjects

 

UMM (transform, no test-time fine-tuning)

Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation

UMM

  1. StableDiffusion

  2. Task: given a sentence and images corresponding to some of its words, generate the image described by the sentence such that the objects of those words resemble the given images, i.e., composition is possible. Self-supervised learning similar to PbE: on the LAION dataset, a pre-trained object detector localizes the object corresponding to a specific word of the sentence, building a new dataset.

  3. The model is not fine-tuned; only an MLP that converts a given image's CLIP image embedding into a token embedding is trained, using the TI objective, similar to FastComposer.

 

Subject-Diffusion (transform, no test-time fine-tuning)

Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning

  1. StableDiffusion

  2. Trains a model that is open-domain and needs no test-time fine-tuning.

  3. Dataset: BLIP generates a caption for each image, the subjects are extracted from the caption, DINO + SAM segment each subject's bounding box, and "[subject_0] is [placeholder_0], [subject_1] is [placeholder_1]..." is appended to the caption.

  4. During training, the CLIP image encoder encodes the content inside each subject's bounding box, and the result directly replaces the embedding of the corresponding [placeholder_i]; the text encoder is retrained so that image information is fused before the text is modeled (experiments show this beats fusing after the sentence is modeled). The cross-attention K/V projection matrices are also trained (since they transform the text features), and, as in GLIGEN, an adapter between self-attention and cross-attention injects the bounding-box information (helping to distinguish multiple objects).

  5. At inference, given a caption, "[subject_0] is [placeholder_0], [subject_1] is [placeholder_1]..." is appended to it; a reference image is provided for each [placeholder_i], and a bounding box can optionally be specified for each [placeholder_i].

 

InstantFamily (no pseudo word, no test-time fine-tuning)

InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation

 

SE-Guidance (no pseudo word, no test-time fine-tuning)

Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation

SE-Guidance

  1. Multi-subject compositional generation based on IP-Adapter.

  2. Image prompts are provided for certain subject tokens in the text prompt; a mask is estimated by thresholding each subject token's text cross-attention map and multiplied onto the corresponding image prompt's image cross-attention map, and the outputs of all image prompts' image cross-attentions are combined as a weighted sum.

  3. Attend-and-Excite (A&E) prevents object missing.

 

Concept Discovery

Focuses on discovering concepts that did not exist before.

 

Conceptor

The Hidden Language of Diffusion Models

  1. decomposing an input text prompt into a small set of interpretable elements.

  2. For a given concept, 100 images are generated from prompts containing it; a set of base words is collected and an MLP is learned to predict a weight for each base word, so that a linear combination of all base words reconstructs these 100 images. The goal is to learn which base words can explain the concept.

 

CusConcept

CusConcept: Customized Visual Concept Decomposition with Diffusion Models

CusConcept

  1. Similar to Conceptor.

 

CGCD

Exploiting Interpretable Capabilities with Concept-Enhanced Diffusion and Prototype Networks

 

ConceptLab (direct)

ConceptLab: Creative Generation using Diffusion Prior Constraints

LayoutAttn

  1. Uses DALLE-2 to generate concepts that have never existed, e.g., a pet that differs from all known pets.

 

PartCraft (direct)

PartCraft: Crafting Creative Objects by Parts

PartCraft

  1. StableDiffusion

  2. DINOv2 performs unsupervised part discovery on the dataset with three stages of k-means: the first stage (k=2) separates foreground from background, the second runs k-means on the foreground to split it into parts, and the third runs k-means on each part to obtain labels; parts clustered together share the same pseudo word. The different parts are learned jointly with TI.

  3. Arbitrary combinations of different parts become possible, generating new species.

 

Non-Subject Inversion

ReVersion (direct)

ReVersion: Diffusion-Based Relation Inversion from Images

  1. TI optimizes a relation token that captures the relation shared by the reference images rather than their object features, e.g., shaking hands; sentences built with the relation token then generate images with the same relation.

  2. Relation-Steering Contrastive Learning: the relation token should behave like a preposition, so a contrastive loss pulls the relation token toward existing prepositions and pushes it away from words of other parts of speech.

 

ReInter (direct)

Customizing Text-to-Image Generation with Inverted Interaction

  1. Similar to ReVersion: TI learns the interaction relations between objects.

 

Lego (direct)

Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models

Lego

  1. invert any concepts in exemplar images, such as "frozen in ice", "burnt and melted", and "closed eyes"

  2. Uses contrastive learning: synonyms of the concept serve as positives and antonyms as negatives, computing an InfoNCE loss.

 

ADI (direct)

Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

  1. TI optimizes an action token embedding that captures the action shared by the reference images rather than their object features, e.g., a handstand; sentences built with the action token then generate images with the same action.

  2. For an action to be learned, one token is optimized at every cross-attention layer, so the representation is not confined to a single token and the semantics are richer.

  3. To avoid learning action-irrelevant features: (a, c) is an anchor sample, where a denotes the specific action and c the action-agnostic context contained in the image, including human appearance and background. Another reference image serves as $(a, \bar{c})$; both are fed into the network to obtain gradients with respect to the token, and the per-channel differences are computed. Channels with small differences are action-related and those with large differences are action-unrelated; thresholding selects the action-related channels as a mask. Separately, c is inverted with another TI method, sentences with other actions are generated to obtain $(\bar{a}, c)$, and a second mask is derived in the same way. The intersection of the two masks is applied to the gradient computed from (a, c) to update the token.

 

ImPoster (no pseudo word)

ImPoster: Text and Frequency Guidance for Personalization in Diffusion Models

ImPoster

  1. The top-left is the source image and the bottom-left is the driving image; the diffusion model is first fine-tuned on these two images.

  2. $G_a$ is the L2 distance between the amplitude of the generated latents and the amplitude of the latents of the source image.

  3. $G_p$ is the L2 distance between the phase of the generated latents and the phase of the latents of the driving image.

 

ViewNeTI (direct)

Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models

ViewNeTI

 

FSViewFusion (direct)

FSViewFusion: Few-Shots View Generation of Novel Objects

FSViewFusion

 

CustomDiffusion360

Customizing Text-to-Image Diffusion with Camera Viewpoint Control

CustomDiffusion360

  1. Given multi-view images of a new object, we create a customized text-to-image diffusion model with camera pose control.

 

3D-words (direct)

Learning Continuous 3D Words for Text-to-Image Generatio

  1. Learn a continuous function that maps a set of attributes from some continuous domain to the token embedding domain.

 

CustomNet (transform)

CustomNet: Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models

  1. Constructs images of the same object from different views as data, encodes the object and the view as conditions, and trains with IP-Adapter.

 

Face

FastComposer (transform, no test-time fine-tuning)

FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

FastComposer

 

UniPortrait (transform, no test-time fine-tuning)

UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization

UniPortrait

 

DreamIdentity (transform, no test-time fine-tuning)

DreamIdentity: Improved Editability for Efficient Face-identity Preserved Image Generation

DreamIdentity

  1. A pre-trained ViT-based face recognition model extracts the CLS-token features of layers 3, 6, 9, 12 and the last layer; they are concatenated and converted by 2 MLPs into 2 token embeddings, trained with the diffusion loss and an L2 regularizer on the token embeddings.

  2. Features from different layers are used because the last layer's features carry mostly high-level semantics and lack detail.

 

Face2Diffusion (transform, no test-time fine-tuning)

Face2Diffusion

  1. Similar to DreamIdentity, it uses multi-scale features from a pre-trained face model; in addition, a pre-trained expression encoder extracts an expression feature, which is replaced with a learnable "no expression" vector with 20% probability. The two features are concatenated and converted by a mapping network into token embeddings, trained with the diffusion loss.

 

PhotoMaker (transform, no test-time fine-tuning)

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

PhotoMaker

  1. StableDiffusion

  2. The training set consists of multiple person IDs, each containing several image-text pairs of the same person, where the text uses a base class word (man or woman) to describe the image. During training, N images of one ID are encoded by the CLIP image encoder into N image embeddings. The text encoder encodes the prompt (length L) containing the base class word, and the token embedding at the base class word's position is extracted; a trainable MLP fuses this token embedding with each of the N image embeddings, giving N fused ID embeddings. These are stacked into a stacked ID embedding of length N, which replaces the token embedding at the base class word's position, yielding a text embedding of length $L-1+N$ that is fed into StableDiffusion's cross-attention. The MLP is trained for reconstruction; optionally the cross-attention layers are also LoRA fine-tuned. A sketch of the embedding splicing is given after this list.

  3. At inference no extra training is needed: given a few images of any person, write a prompt and generate.
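
A minimal sketch of the stacked-ID-embedding splice described in item 2; the dimensions, the fusion MLP, and the variable names are illustrative assumptions, not the released PhotoMaker code.

```python
import torch
import torch.nn as nn

D = 768                       # token / image embedding dim (assumed)
L, N = 20, 4                  # prompt length, number of reference images
cls_pos = 5                   # position of the base class word ("man"/"woman") in the prompt

text_emb = torch.randn(L, D)      # output of the text encoder
img_embs = torch.randn(N, D)      # CLIP image embeddings of the N reference images

fuse = nn.Sequential(nn.Linear(2 * D, D), nn.GELU(), nn.Linear(D, D))  # trainable fusion MLP

cls_tok = text_emb[cls_pos].expand(N, D)                     # class-word token, repeated N times
stacked_id = fuse(torch.cat([cls_tok, img_embs], dim=-1))    # N fused ID embeddings

# splice: replace the single class-word token with the stacked ID embedding -> length L - 1 + N
spliced = torch.cat([text_emb[:cls_pos], stacked_id, text_emb[cls_pos + 1:]], dim=0)
assert spliced.shape == (L - 1 + N, D)
```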

 

PortraitBooth (transform, no test-time fine-tuning)

PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization

PortraitBooth

  1. 类似PhotoMaker。

 

FreeCue (transform)

Foundation Cures Personalization: Recovering Facial Personalized Models Prompt Consistency

FreeCue

 

MegaPortrait (no pseudo word, no test-time fine-tuning)

MegaPortrait: Revisiting Diffusion Control for High-fidelity Portrait Generation

MegaPortrait

 

Arc2Face (transform, no test-time fine-tuning)

Arc2Face: A Foundation Model of Human Faces

Arc2Face

 

IDAdapter (transform, no test-time fine-tuning)

IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models

IDAdapter

  1. 图里没画出来,在prompt后加了"the woman is sks",并且at the first embedding layer of the text encoder, we replace the text embedding of the identifier word “sks” with the identity text embedding,但没有优化sks的token embedding,而是用学到的embedding取代。

 

Dense-Face (direct)

Dense-Face: Personalized Face Generation Model via Dense Annotation Prediction

Dense-Face

  1. We use internal features of the SD UNet to predict dense face annotations, enabling the proposed method to gain domain knowledge in face generation.

 

InstantFamily (no pseudo word, no test-time fine-tuning)

InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation

InstantFamily

  1. 使用多人脸图像自监督训练。

  2. 采样时只需要提供aligned faces。

 

DiffSFSR (no pseudo word, no test-time fine-tuning)

Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

DiffSFSR-1 DiffSFSR-2
  1. 给定一张人脸图像,和对场景和表情的描述,先使用StableDiffusion根据场景描述生成一张图像作为训练数据,再根据表情描述从数据库中选择一张具有该表情的图像作为表情条件,人脸图像作为id条件。

  2. 将训练数据的人脸部分mask掉(保留场景),concat在zt上,这样就可以让模型专注于人脸部分的建模。

  3. 使用diffusion loss + identity loss + expression loss 一起训练diffusion model,不需要自监督。

 

DemoCaricature (direct)

DemoCaricature: Democratising Caricature Generation with a Rough Sketch

ROME

 

Face Aging (direct)

Identity-Preserving Aging of Face Images via Latent Diffusion Models

  1. DreamBooth

  2. 计算Class-specific Prior Preservation Loss时,将人脸数据按age分组,每组一个组名,如child,old等,使用带有组名的prompt和图像作为数据集。

  3. 训练后,使用photo of a person as 进行生成,实现对某个给定人脸的aging与de-aging。

 

CelebBasis (direct)

Inserting Anybody in Diffusion Models via Celeb Basis

  1. StableDiffusion的text embedding是可以插值进行生成的,基于这一发现,可以收集一些CLIP text encoder能够识别的名人的人名,使用PCA算法计算出它们token embedding的一组基,这组基可以看成人脸特征在token embeeding space的表示。

  2. 训练时,给定任意一张人脸的图片,训练一个MLP去modulate这组基,组成该人脸对应的pseudo word的embedding,插入"a photo of _",使用Textual Inversion方法训练这个MLP。

 

CharacterFactory

CharacterFactory: Sampling Consistent Characters with GANs for Diffusion Models

CharacterFactory

  1. 人物工厂,不是TI,不需要reference image,直接生成随机的可用的pseudo word embedding。

  2. 使用GAN生成fake embedding,采样名人的人名作为real embedding,对抗训练。

  3. Lcon让生成的pseudo word embedding在不同template prompt的text embedding中表现一致。做法是最小化不同template prompt中这些word在text embedding中对应的embedding的pairwise distances。

 

 

StableIdentity (direct)

StableIdentity: Inserting Anybody into Anywhere at First Sight

受Celeb Basis启发,寻找一些名人的人名,得到他们的word embedding。通过一个MLP将输入人脸图像转化为两个word embedding,通过AdaIN转化到celeb word embedding空间(celeb word embedding的均值和方差分别充当shift和scale),TI训练这个MLP。学到的两个word embedding可以用于任何text-based generative model,比如ControlNet,text2video。

 

SeFi-IDE (direct)

SeFi-IDE: Semantic-Fidelity Identity Embedding for Personalized Diffusion-Based Generation

 

LCM-Lookahead (no pseudo word, no test-time fine-tuning)

LCM-Lookahead for Encoder-based Text-to-Image Personalization

LCM-Lookahead

  1. 专注人脸的IP-Adapter。

 

Inpainting

RealFill

RealFill: Reference-Driven Generation for Authentic Image Completion

有reference images的inpainting任务,借助TI技术提取reference images的信息,辅助inpainting。

RealFill

 

PVA

Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention

类似RealFill,有reference images的inpainting任务,借助TI技术提取reference images的信息,辅助inpainting。

 

Restoration

Personalized Restoration

Personalized Restoration via Dual-Pivot Tuning

  1. 有reference images的restoration任务。

 

Benchmark

DreamBench++

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

  1. Current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive.

  2. DreamBench++ is a human-aligned benchmark automated by advanced multimodal GPT models.

 

Efficiency

HollowedNet

Hollowed Net for On-Device Personalization of Text-to-Image Diffusion Models

HollowedNet

 

Lifelong

LFS-Diffusion

Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion

 

 

X-to-Image (more fine-grained than text-to-image)

Training-free methods fall roughly into two families: one is a nursing-style operation that defines a loss between x_t and the given condition, computes the loss during sampling, and uses its gradient to guide the sampling (a generic sketch follows); the other directly manipulates the attention maps so that they satisfy the constraint imposed by the condition.
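
A generic sketch of the first family (loss-gradient guidance during sampling); `eps_model`, `ddim_update`, and `loss_fn` are placeholders for the diffusion backbone, any standard sampler update, and the condition loss.

```python
import torch

def guided_step(x_t, t, eps_model, ddim_update, loss_fn, scale=1.0):
    """One sampling step with loss-gradient guidance.

    eps_model(x_t, t)        -> predicted noise   (placeholder backbone)
    ddim_update(x_t, t, eps) -> x_{t-1}           (placeholder sampler update)
    loss_fn(x_t, t)          -> scalar loss measuring how well x_t matches the condition
    """
    x_t = x_t.detach().requires_grad_(True)
    loss = loss_fn(x_t, t)
    grad = torch.autograd.grad(loss, x_t)[0]       # gradient of the condition loss w.r.t. x_t
    eps = eps_model(x_t, t)
    x_prev = ddim_update(x_t, t, eps)
    return x_prev - scale * grad                   # nudge the sample toward the condition
```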

 

Sketch

SKG (sketch + text)

Sketch-Guided Text-to-Image Diffusion Models

SKG

  1. Introduces sketch control into a pre-trained StableDiffusion.

  2. A pre-trained edge extractor generates training data (self-supervised), and an MLP is trained to predict edges from the U-Net's per-layer feature maps, similar to Label-Efficient Semantic Segmentation With Diffusion Models.

  3. At sampling time, the gradient of the MLP's loss is used as classifier guidance, applied only from T to 0.5T.

  4. A dynamic guidance scheme is used: $\alpha = \frac{\|z_t - z_{t-1}\|_2}{\|\nabla_{z_t}\mathcal{L}\|_2}\, s$, where $s$ is a constant and $z_{t-1}$ is the result of the original diffusion sampler. The motivation: if a step changes the latent a lot, that step produces more information, so the guidance should be stronger; if the guidance gradient itself is large, the scale is reduced to prevent over-guidance.

 

SKSG (sketch + text)

Sketch-Guided Scene Image Generation

SKSG-1 SKSG-2

 

  1. 先利用每个object的sketch和只含有object的prompt单独生成该object,之后对该object进行TI学习。

  2. 将所有object按sketch的位置拼在一起进行blended生成。

 

SketchAdapter (sketch + text)

It’s All About Your Sketch: Democratising Sketch Control in Diffusion Models

SketchAdapter

  1. Trained only on a small sketch-image pair dataset, without text.

  2. CLIP encodes the sketch and the last layer's feature sequence is taken; only a sketch adapter is trained to map it into CLIP text embeddings, which are fed into StableDiffusion's cross-attention. Besides the diffusion loss there are two extra losses: at every step the predicted ẑ₀ is decoded by the VAE decoder and passed through a sketch extractor, and the result is compared against the input sketch; and an image-captioning model generates a caption for the image, which is fed into StableDiffusion, and the predictions of the two StableDiffusion runs are pushed to be close.

 

ToddlerDiffusion

ToddlerDiffusion: Flash Interpretable Controllable Diffusion Model

  1. 模拟人类画图的思路,先生成sketch,再生成palette,最后生成图像。使用ShiftDDPMs的公式,以sketch或palette而不是pure noise为起点进行训练。

 

SKGLO (sketch + text)

Training-Free Sketch-Guided Diffusion with Latent Optimization

SKGLO

  1. training-free

 

KnobGen (sketch + text)

KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

KnobGen

  1. CGC将sketch和text feature进行融合,融合后当做text输入diffusion model。

  2. FGC是ControlNet或者T2I-Adapter,乘一个系数进行knob。

 

Layout/Segmentation

Locally-Conditioned-Diffusion

Compositional 3D Scene Generation using Locally Conditioned Diffusion

  1. Given $K$ object layout masks $\{m_k\}_{k=1}^{K}$ and the corresponding local prompts $\{y_k\}_{k=1}^{K}$, sampling uses $\sum_{k=1}^{K} m_k \odot \epsilon_\theta(x_t, t, y_k)$ as the noise prediction (see the sketch below).
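
A minimal sketch of the masked composition above; the mask normalization and the backbone call are assumptions.

```python
import torch

def locally_conditioned_eps(eps_model, x_t, t, masks, prompts):
    """Compose per-region noise predictions: sum_k m_k * eps_theta(x_t, t, y_k).

    masks[k]   : binary layout mask for object k (assumed to tile the image, one mask active per pixel)
    prompts[k] : local prompt y_k for object k
    eps_model  : placeholder for a text-conditional diffusion backbone
    """
    eps = torch.zeros_like(x_t)
    for m_k, y_k in zip(masks, prompts):
        eps = eps + m_k * eps_model(x_t, t, y_k)   # each region follows its own prompt
    return eps
```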

 

IIG (bounding box + text)

Semantic-Driven Initial Image Construction for Guided Image Synthesis in Diffusion Model

  1. 受Initial Image Editing的启发,只需要精心构建xT即可实现layout-to-image。

  2. 利用StableDiffusion,最深层cross-attention map的一个值对应zT中一个4×4的noise block,构造prompt,使用denoising第一步得到的cross-attention map的值对noise block进行标注,构建一个趋于生成某一类物体的noise block的数据库。

  3. 生成时,从物体对应的noise block数据库中采样,填在指定的bounding box内进行生成。

 

NoiseCollage (bounding box + text)

NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging

NoiseCollage

  1. masked cross-attention:layout之内的image feature与object prompt进行cross-attention,layout之外的image feature与global prompt进行cross-attention,两者结果相加。

 

TriggerPatch (bounding box + text)

The Crystal Ball Hypothesis in Diffusion Models: Anticipating Object Positions from Initial Noise

  1. A trigger patch is a patch in the noise space with the following properties: (1) Triggering Effect: When it presents in the initial noise, the trigger patch consistently induces object generation at its corresponding location; (2) Universality Across Prompts: The same trigger patch can trigger the generation of various objects, depending on the given prompt.

  2. We try to train a trigger patch detector, which functions similarly to an object detector but operates in the noise space. 随机噪声,生成图像,使用预训练好的object detector检测物体,检测得到的结果作为该噪声的ground truth,训练trigger patch detector。

  3. 生成时,随机噪声,检测trigger patch,移动trigger patch到目标位置。

 

LayoutDiffuse (bounding box + text)

LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation

  1. Adapts pre-trained unconditional or conditional diffusion models by adding, after every attention layer, a layout attention layer with a residual connection: $h' = \mathrm{LayoutAttn}(h) + h$.

  2. LayoutAttn(h)将layout分成每个instance单独的layout(即只标识了一个object),每个layout当成mask,提取h中该object的region feature map,然后为每个feature加上该object对应的class label或者caption的learnable embedding,然后做self-attention;对于h,使用空标签或者空字符串的learnable embedding加到每个feature上,做self-attention,作为背景;然后乘上mask加在一起,重叠部分取平均。类似ControlNet,参数初始化为0,LayoutAttn(h)一开始输出为0,训练开始前不影响原网络。

LayoutAttn

 

LayoutDiffusion (bounding box + text)

LayoutDiffusion Controllable Diffusion Model for Layout-to-image Generation

  1. 重新设计UNet,全部重新训练。

 

CreatiLayout (bounding box + text)

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

CreatiLayout

  1. MMDiT

 

RCL2I (bounding box + text)

Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation

RCL2I

  1. The proposed regional cross-attention layer is inserted into the original diffusion model right after each self-attention layer. The weights of the output linear layer are initialized to zero, ensuring that the model equals to the foundational model at the very beginning.

 

IFAdapter (bounding box + text)

IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation

IFAdapter

 

PLACE (bounding box + text)

PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis

PLACE

  1. N个word-layout pair组成一条数据。

  2. layout control map:将layout转换为semantic mask,让对应的word的cross-attention map只有semantic mask内的响应值,但由于StableDiffusion是在8倍下采样的latent上运行的(深层的feature map更小),对mask采取同样的下采样可能会导致一些小物体被忽略,所以这里通过感受野计算mask,对于feature map上每个image token,如果其在原图尺寸上的感受野与当前物体的semantic mask有交集,则设为1,否则设为0。使用原cross-attention map与乘上mask后的cross-attention map的插值。

  3. Semantic Alignment Loss:encourages image tokens to interact more with the same and related semantic regions in the self-attention module, thereby further improving the layout alignment of the generated images. 通过cross-attention控制self-attention,对于某个word,将其cross-attention map(HW)作为权重计算self-attention map(HW×HW)的加权和(HW​),优化这个加权和与cross-attention map靠近。

  4. Layout-Free Prior Preservation Loss:由于数据集较小,为了防止过拟合,使用一些文生图数据计算diffusion loss,此时把layout control map中的semantic mask cross-attention map的插值系数设为0即可。

 

MIGC (bounding box + text)

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis

MIGC-1

MIGC-2

  1. 在StableDiffusion原有的cross-attention output上乘上mask,再额外训练一个并行的cross-attention(enhancement attention),在output上乘上mask,两者相加作为当前instance的shading result;再额外训练一个并行的self-attention(layout-attention),在output上分别乘上前景和背景的mask,得到两个shading result;n+2个shading result按mask求和。

  2. 只在mid-layers (i.e., 8 × 8)和the lowest-resolution decoder layers (i.e., 16 × 16)上应用MIGC。

  3. 在COCO上使用diffusion loss训练,同时还优化cross-attention map上背景区域的响应值之和(类似TokenCompose)。

 

HiCo (bounding box + text)

HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation

HiCo

 

B2B (bounding box + text)

Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models

  1. StableDiffusion

  2. training-free

  3. box: 对text中有bounding box的object对应的cross-attention map,定义一些bounding box附近的sliding box,bounding box内的响应值减去bounding box外的响应值再加上这些sliding box内的响应值与bounding box内的响应值的IoU(保证均匀),作为object reward。

  4. bind: attribute的cross-attention map与对应的object的cross-attention map在bounding box内的响应值的KL散度的相反数,作为attribute reward。

  5. 两个reward加在一起求梯度作为guidance。

 

R&B (bounding box + text)

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

R&B

  1. StableDiffusion

  2. training-free

 

LAW-Diffusion (bounding box + text)

LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts

  1. 有点类似SpaText,每个object都对应一个region map,其大小和图像一致,并在bounding box内填上可训练的对应object的embedding,bounding box外填上可训练的background的embedding。所有region map分成patch,不同region map的同一个位置的patch组成一个序列,序列前再prepend一个agg embedding,送入一个ViT,不需要线性映射,不需要加positional embedding,取agg embedding的输出。所有位置都按此处理,按位置排列所有输出,组合成图像大小的一个layout embedding。训练一个diffusion model,将layout embedding与xt concat在一起输入网络。

 

SALT (bounding box + text)

Spatial-Aware Latent Initialization for Controllable Image Generation

SALT

 

Directed Diffusion (bounding box + text)

Directed Diffusion: Direct Control of Object Placement through Attention Guidance

  1. StableDiffusion

  2. training-free

  3. 在生成时,提高text token对应的cross-attention map的bounding box区域的权重。

 

Attention-Refocusing (bounding box + text)

Grounded Text-to-Image Synthesis with Attention Refocusing

  1. StableDiffusion

  2. training-free

  3. attention refocusing

  4. Cross-attention refocusing, similar to Attend-and-Excite. $\mathcal{L}_{fg}$: for each grounded text token's cross-attention map, take the maximum response inside its bounding box and sum over tokens (encouraging a strong peak inside the box). $\mathcal{L}_{bg}$: for each grounded text token's cross-attention map, take the maximum response outside its bounding box and sum over tokens (to be suppressed). $\mathcal{L}_{CAR} = \mathcal{L}_{fg} + \mathcal{L}_{bg}$.

  5. Self-attention refocusing, $\mathcal{L}_{SAR}$: for every bounding box and every image token inside it, take the maximum response of its self-attention map outside the union of all bounding boxes containing that image token, and sum.

  6. These losses are computed during sampling, and the gradient of $\mathcal{L}_{CAR} + \mathcal{L}_{SAR}$ w.r.t. $x_t$ is used as guidance (see the sketch after this list).
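
A sketch of the cross-attention refocusing loss in item 4; the exact form (e.g., using 1 − max inside the box) is a common formulation and may differ in detail from the paper.

```python
import torch

def car_loss(attn_maps, boxes):
    """Cross-attention refocusing loss L_CAR = L_fg + L_bg (one common formulation).

    attn_maps : (K, H, W) cross-attention maps, one per grounded text token
    boxes     : (K, H, W) binary masks, 1 inside the token's bounding box
    """
    inside = attn_maps * boxes
    outside = attn_maps * (1 - boxes)
    l_fg = (1 - inside.flatten(1).max(dim=1).values).sum()   # encourage a strong peak inside the box
    l_bg = outside.flatten(1).max(dim=1).values.sum()        # suppress peaks outside the box
    return l_fg + l_bg

# during sampling: grad = torch.autograd.grad(car_loss(A, B) + sar_loss, x_t)[0] serves as guidance
```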

 

BACON (bounding box/segmentation + text)

Boundary Attention Constrained Zero-Shot Layout-To-Image Generation

BACON

  1. StableDiffusion

  2. training-free

 

BoxDiff (bounding box/segmentation + text)

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

  1. StableDiffusion

  2. training-free

  3. 只在16x16的分辨率上进行操作

  4. 类似Attention Refocusing,生成时给定text中某些子字符串对应的bounding box,在对应的cross-attention map中分别使用Inner-Box Constraint (增强bounding box中的response,鼓励当前物体出现在bounding box内),Outer-Box Constraint (削弱bounding box外的response,防止当前物体出现在bounding box外),Corner Constraint (鼓励当前物体填满bounding box,而不是在bounding box生成一个很小的物体),多个loss的和对xt的梯度作为guidance。

 

CSG

Training-free Composite Scene Generation for Layout-to-Image Synthesis

  1. StableDiffusion

  2. training-free

  3. 只在16x16的分辨率上进行操作

  4. Similar to BoxDiff: the gradient of a sum of several constraint losses w.r.t. $x_t$ serves as guidance.

 

Zero-Painter (segmentation + text)

Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis

Zero-Painter-1

Zero-Painter-2 Zero-Painter-3
  1. StableDiffusion + StableInpainting

  2. training-free

  3. PACA:增大除了SOT之外所有token的cross-attention map中的mask区域内的响应值。对于SOT有一个很有意思的特点,其cross-attention map中的值哪里被增大了,最终输出的图像哪里就会变成背景,所以可以利用这一特点,对SOT的cross-attention map进行反向操作,增大mask区域外的响应值。

  4. ReGCA:inpainting的cross-attention,背景和前景使用不同的KV,只对背景使用global prompt。

 

CAC (bounding box/segmentation + text)

Localized Text-to-Image Generation for Free via Cross Attention Control

Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control

Enhancing Image Layout Control with Loss-Guided Diffusion Models

  1. StableDiffusion

  2. training-free

  3. Cross Attention Control

  4. 除了text之外,额外提供了m个instance prompt-bounding box/segmentation pairs,生成时,将text和所有instance prompt pad成同一长度,同时送入StableDiffusion,这样就生成了m+1个cross-attention map,乘上bounding box/segmentation的mask,相加得到最后的cross-attention map。相比于Attention Refocusing,不需要计算loss和梯度。

 

SpaText (segmentation + text)

SpaText: Spatio-Textual Representation for Controllable Image Generation

  1. 每个segment对应一个text,可以分区域生成,指定物体之间的空间关系。

  2. 自监督训练,使用预训练分割模型提取图像segments,用CLIP提取每个segment的CLIP image embedding,初始化一个全为0的segmentation map,大小和图像一样,通道数和CLIP image embedding维数一样,将每个segment的CLIP image embedding放到segmentation map中对应位置。

  3. 改造DALLE-2的Decoder,将segmentation map直接concat到xt上作为条件输入,fine-tune decoder,训练时不需要文本。

  4. 推理时用DALLE-2的Prior模型将每个segment对应的text的CLIP text embedding转换成CLIP image embedding,再组装成segmentation map,使用Decoder进行生成。

 

EOCNet (segmentation + text)

Enhancing Object Coherence in Layout-to-Image Synthesis

修改StableDiffusion网络结构,fine-tune。

 

FreestyleNet (segmentation + text)

Freestyle Layout-to-Image Synthesis

将StableDiffusion的cross-attention改为rectified cross-attention:将text token对应的cross-attention map中,在bounding box之内的保留原值,在bounding box之外的设为负无穷。By forcing each text token to affect only pixels in the region specified by the layout, the spatial alignment between the generated image and the given layout is guaranteed。再使用任何layout-based数据fine-tune StableDiffusion。

 

ALDM (segmentation + text)

Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive

ALDM

  1. 传统训练方法只是将layout作为条件输入模型优化diffusion loss,并没有对layout的显式监督,可能导致生成结果和layout不匹配。一个解决方法是使用预训练的segmentor对x^0​进行分割与给定layout对比计算loss进行优化,we observe that the diffusion model tends to learn a mean mode to meet the requirement of the segmenter, exhibiting little variation.

  2. 引入对抗训练,判别器:训练将ground truth每个pixel正确分类到N个real class,将x^0所有pixel分类到fake class;diffusion model作为生成器:除了diffusion loss,加入adversarial loss,让判别器指导训练。

  3. multistep unrolling:由于layout是diffusion生成早期阶段就决定的,但此时x^0都不太好,所以一次性生成之后K个x^0,计算K个adversarial loss求平均进行训练。

 

DenseDiffusion (segmentation + text)

Dense Text-to-Image Generation with Attention Modulation

  1. StableDiffusion

  2. training-free

  3. Same idea as rectified cross-attention, but training-free, so it can be applied directly at sampling time: "At cross-attention layers, we modulate the attention scores between paired image and text tokens to have higher values. At self-attention layers, the modulation is applied so that pairs of image tokens belonging to the same object exhibit higher values." Here "paired image and text tokens" means the image token lies inside the bounding box of the object described by the text token.

  4. The modulated attention is $\mathrm{softmax}\!\left(\frac{QK^T + M}{\sqrt{d}}\right)$ with $M = \lambda_t \cdot R \odot M_{pos} \odot (1-S) - \lambda_t \cdot (1-R) \odot M_{neg} \odot (1-S)$ and $\lambda_t = w\cdot\frac{t}{T}$. $R$ is a binary matrix: for cross-attention it is 1 when the text token and the image token belong to the same segment, otherwise 0; for self-attention it is 1 when the two image tokens belong to the same segment, otherwise 0. $M_{pos} = \max(QK^T) - QK^T$ and $M_{neg} = QK^T - \min(QK^T)$, where max and min are taken along the key axis only; this keeps $QK^T + M$ from drifting too far from the original $QK^T$ and makes the adjustment proportional to the gap between the original value and the extreme value. $S$ is a ratio matrix: if segments differ greatly in area, generation quality suffers, so for each image token the area ratio of its segment relative to the whole image is computed and used for regularization. A sketch follows.
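
A sketch of the modulated attention scores in item 4; tensor shapes and the exact schedule $\lambda_t$ are assumptions based on the notes above.

```python
import torch

def modulated_attention_scores(Q, K, R, S, w, t, T):
    """Training-free attention modulation in the spirit of DenseDiffusion.

    Q, K : (N_q, d), (N_k, d) query / key features
    R    : (N_q, N_k) binary matrix, 1 where the query-key pair belongs to the same segment
    S    : (N_q, N_k) area-ratio matrix regularizing small vs. large segments
    """
    d = Q.shape[-1]
    scores = Q @ K.t()
    m_pos = scores.max(dim=-1, keepdim=True).values - scores   # headroom toward the per-query max
    m_neg = scores - scores.min(dim=-1, keepdim=True).values   # headroom toward the per-query min
    lam = w * (t / T)                                           # assumed time-dependent strength
    M = lam * R * m_pos * (1 - S) - lam * (1 - R) * m_neg * (1 - S)
    return torch.softmax((scores + M) / d ** 0.5, dim=-1)
```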

 

SCDM (segmentation)

Stochastic Conditional Diffusion Models for Robust Semantic Image Synthesis

SCDM

In real-world applications, semantic image synthesis often encounters noisy user inputs. SCDM enhances robustness by stochastically

perturbing the semantic label maps through Label Diffusion, which diffuses the labels with discrete diffusion.

 

MagicMix (layout/style from image/text + text)

MagicMix: Semantic Mixing with Diffusion Models

MagicMix

  1. noisy latents linear combination版本的SDEdit,削弱原图的细节,只保留基本的结构和外观信息。

 

 

DiffFashion

DiffFashion: Reference-based Fashion Design with Structure-aware Transfer by Diffusion Models

  1. DiffEdit+MagicMix

 

GeoDiffusion (bounding box -> text -> image)

Integrating Geometric Control into Text-to-Image Diffusion Models for High-Quality Detection Data Generation via Text Prompt

  1. translate geometric conditions to text(包括object坐标等),fine-tune StableDiffusion。

 

GLoD (layout + text)

GLoD: Composing Global Contexts and Local Details in Image Generation

GLoD

  1. Masked SEGA.

 

Pose

StablePose

StablePose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

 

Camera Parameter

GenPhotography

Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis

GenPhotography

  1. 相机参数:bokeh blur, lens, shutter speed, temperature等。

  2. 对于每个相机参数α:收集一些base image,对每个base image打caption作为invariant scene description,再随机采样一些{αi}i=1N,对该base image做变换,得到N image并拼成一个视频,这些视频帧是对于相机参数α变换的反映。

  3. 用这些视频LoRA fine-tune T2V模型,同时训练一个contrastive camera encoder编码相机参数,编码结果拼在invariant scene description的编码结果之后。

  4. contrastive camera encoder: 因为前后帧之间只有某个相机参数不同,所以做差取feature。

  5. 推理时,既可以根据给定的相机参数和prompt生成图像(所有帧使用相同相机参数),也可以对已有图像进行相机参数的编辑(从原图相机参数平滑过渡到目标相机参数)。

  6. 用T2V的原因:即使固定随机种子,只要prompt稍有差异,T2I生成图像也会有很大差异,但T2V可以保持前后帧scene的一致性。

 

 

Scene Graph

DiffuseSG (scene graph)

Joint Generative Modeling of Scene Graphs and Images via Diffusion Models

  1. Train a DiffuseSG model (Graph Transformer) to produce layout and then utilize a pretrained layout-to-image model to generate images.

 

DisCo (scene graph)

Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation

DisCo

 

R3CD (scene graph)

R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion

R3CD

  1. 将scene graph分为多个三元组(object1-relation-object2),所有三元组拼在一起作为条件输入denoising model进行训练。

  2. Besides the diffusion loss, two contrastive losses are added: triplets with the same relation sampled from the same batch serve as positives, and the remaining triplets in the batch as negatives; one contrastive loss is computed from the cosine similarity between the relations' cross-attention maps, and another from the MSE between the triplets' diffusion losses.

 

SGDiff (scene graph)

Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training

SGDiff

 

LAION-SG

LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations

  1. A large-scale dataset with high-quality structural annotations of scene graphs (SG).

 

Trajectory

TraDiffusion (trajectory + text)

TraDiffusion: Trajectory-Based Training-Free Image Generation

  1. 定义cross-attention map和trajectory之间的energy function,求梯度作为guidance进行采样。

 

Blob

BlobGEN (blob + text)

Compositional Text-to-Image Generation with Dense Blob Representations

  1. GLIGEN with blob tokens

 

DiffUHaul (blob + layout)

DiffUHaul: A Training-Free Method for Object Dragging in Images

DiffUHaul

 

Image

IP-Adapter (image + text)

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

IP-Adapter

  1. StableDiffusion

  2. CLIP image encoder提取image embedding,训练一个线性层将其映射到长为4的sequence,类似StyleAdapter,加一个和text cross-attention layer并行的可训练的image cross-attention layer,使用原来的数据集,训练线性层和image cross-attention layer。

  3. 训练好的模型可以与ControlNet和T2IAdapter一起使用,无需额外训练。

  4. IP-Adapter+:在text cross-attention layer之后加可训练的image cross-attention layer。

 

IPAdapter-Instruct (image + text)

IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts

  1. 基于IP-Adapter+,在image cross-attention layer再加一个text cross-attention layer,与instruction进行交互,使用instruction editing数据进行训练。

  2. 使用prompt,ip image,instruction一起生成。

 

Semantica (image)

  1. 使用成对的图像数据集,其中一张作为condition,另一张作为target,重新训练一个U-ViT的diffusion model,we do not use any text inputs and only rely on image conditioning.

  2. 使用预训练的CLIP或者DINO编码图像得到的token sequence或者CLS token作为condition,当使用token sequence时使用cross-attention,当使用CLS token时使用FiLM。

 

FiVA (image)

FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models

FiVA

  1. We constructed the first fine-grained visual attributes dataset (FiVA) to the best of our knowledge. This FiVA dataset features a well-organized taxonomy for visual attributes and includes 1 M high quality generated images with visual attribute annotations.

 

PuLID (image + text)

PuLID: Pure and Lightning ID Customization via Contrastive Alignment

PuLID

  1. IP-Adapter在训练时使用从原图中提取的feature,这一定程度上会导致模型过拟合,除了diffusion loss,还引入了两个alignment loss和一个ID loss。

  2. 训练时构造两条contrastive paths,one path with ID:两个cross-attention都用;the other path without ID:只用text cross-attention。为了确保sementic alignment使用text作为Q,image feature作为KV,计算cross-attention map,优化两条paths的cross-attention map之间的MSE loss。The insight behind our semantic alignment loss is simple: if the embedding of ID does not affect the original model’s behavior, then the response of the UNet features to the prompt should be similar in both paths.

  3. 为了确保layout alignment,同时优化两条paths的image feature的MSE loss。

  4. 使用4步生成,使用生成的图像计算ID loss。

 

InstantID (image + text)

InstantID: Zero-shot Identity-Preserving Generation in Seconds

InstantID

  1. 上半部分类似IP-Adapter,只是将CLIP image embedding换成了face id embedding。但是作者认为这种方法不够好,因为image token和text token本身提供的信息就不同,控制的方式和力度也不同,但是IP-Adapter却把他们concat在一起,有互相dominate和impair的可能。

  2. 提出使用另一个IdentityNet(ControlNet架构)提供额外的空间信息,根据上述原因,这里的ControlNet去掉了text的cross-attention,只保留face id embedding的cross-attention。这里只提供双眼、鼻子、嘴巴的key points作为输入,一方面是因为数据集比较多样,更多的key points会导致检测困难,让数据变脏;另一方面是为了方便生成,也可以增加使用文本或者其他ControlNet的可编辑性。

  3. 在人脸数据集上自监督训练。

 

ID-Aligner (image + text)

ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning

ID-Aligner

  1. A general framework to achieve identity preservation via feedback learning.

 

PF-Diffusion (image)

Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models

  1. 类似ObjectStitch,训练一个SeeCoder将reference image转换为CLIP text embedding,然后使用其替换StableDiffusion的CLIP text encoder,实现只使用reference image生成图像。还可以使用ControlNet引入其它条件。

 

M2M (image sequence)

Many-to-many Image Generation with Auto-regressive Diffusion Models

M2M

  1. 构造一个image sequence数据集。

  2. Each training sample is an image sequence $\{z_0^i\}_{i=1}^{N}$; noising gives $\{z_t^i\}_{i=1}^{N}$, which is fed into the diffusion model. $\{z_t^i\}_{i=1}^{N}$ serves as Q and $\{z_0^i\}_{i=1}^{N}$ as KV for cross-attention with a causal mask, so pixels of $z_t^i$ may only attend to pixels of the images $z_0^{<i}$.

 

Manga

DiffSensei

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

 

MangaDiffusion

Manga Generation via Layout-controllable Diffusion

 

General

Late-Constraint (sketch/edge/segmentation + text)

Late-Constraint Diffusion Guidance for Controllable Image Synthesis

  1. 为预训练好的StableDiffusion引入各种条件,算是SKG的升级版。

  2. 使用预训练好的模型抽取image的各种conditions(如mask、edge等),训练一个可以根据UNet的各层feature maps预测conditions的condition adapter。

  3. 采样时,用当前的feature maps输入到condition adapter得到预测的conditions,与给定的conditions计算距离,求梯度作为guidance。

  4. 这类方法本质上还是训练一个noisy classifier,但使用的是diffusion model的feature。

 

Readout-Guidance (sketch/edge/pose/depth/drag + text)

Readout Guidance: Learning Control from Diffusion Features

Readout-Guidance-1

Readout-Guidance-2

  1. 和Late-Constraint类似,分为spatial和relative两种head。

  2. spatial包含pose,edge,depth等,训练模型根据diffusion feature预测ground truth,采样时根据预测和给定的label计算MSE loss,求梯度作为guidance。

  3. relative包含corresponce feature和appearance similarity,训练模型根据两个不同图像的diffusion feature进行预测。

  4. drag:corresponce feature head uses image pairs with labeled point correspondences and trains a network such that the feature distance between corresponding points is minimized, i.e., the target point feature is the nearest neighbor for a given source point feature. We compute pseudo-labels using a point tracking algorithm to track a grid of query points across the entire video. We randomly select two frames from the same video and a subset of the tracked points that are visible in both frames. 训练时,将输入的diffusion feature转化为一个feature map,image pairs的feature map之间的corresponding point feature之间计算loss;编辑时,先将原图输入UNet得到diffusion feature,再送入网络提取feature map,计算其staring point处的feature与生成图像的feature map的target point处的feature的距离,求梯度作为guidance。

 

MCM (segmentation/sketch + text)

Modulating Pretrained Diffusion Models for Multimodal Image

$x_t$, $\epsilon_\theta(x_t)$ and $y_1,\dots,y_n$ are fed together into the MCM network, which outputs modulation parameters $\gamma_t, v_t$. Using $\epsilon_t' = \epsilon_\theta(x_t)(1+\gamma_t) + v_t$, $\hat{x}_0$ is computed via Tweedie's formula, and the model is trained with an MSE loss (sketch below).
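
A minimal sketch of the modulation and the Tweedie step; `eps_model` and `mcm_net` are placeholders, and the DDPM parameterization is assumed.

```python
import torch

def mcm_x0(eps_model, mcm_net, x_t, t, conds, alpha_bar_t):
    """eps' = eps_theta(x_t) * (1 + gamma_t) + v_t, then Tweedie's formula for x0_hat."""
    eps = eps_model(x_t, t)
    gamma_t, v_t = mcm_net(x_t, eps, conds, t)          # modulation parameters
    eps_mod = eps * (1 + gamma_t) + v_t
    # Tweedie (DDPM parameterization): x0_hat = (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar)
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps_mod) / alpha_bar_t.sqrt()
    return x0_hat   # trained with an MSE loss against the ground-truth image
```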

 

Acceptable Swap-Sampling (concept from text)

Amazing Combinatorial Creation Acceptable Swap-Sampling for Text-to-Image Generation

给定两个object text,生成两个concept融合在一起的图像,类似MagicMix。

对于一个0-1的列交换向量,其长度和CLIP编码结果的维度相同,若向量某位置为0,则选取第二个object text的CLIP编码结果的该位置的列向量,若向量某位置为1,则选取第一个object text的CLIP编码结果的该位置的列向量,组合成一个新的CLIP编码结果,将其输入到StableDiffusion是可以生成两个concept融合在一起的图像的。

实践中,随机采样一堆列交换向量,每个列交换向量按上述流程生成图像,再使用一些选取策略从所有图像中选出最符合标准的图像。

 

SCEdit (keypoints/depth/edge/segmentation + text)

SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing

SCEdit

  1. 通过对skip conncetion的feature做editing实现fine-tune或者可控生成。

  2. SC-Tuner: $O^{SC}_j(x_{N-j}) = T_j(x_{N-j}) + x_{N-j}$, where $j$ is the decoder block index, $N$ is the number of U-Net layers, $x_{N-j}$ is the encoder output feature skipped to the $j$-th decoder block, and $T_j$ is a trainable tuner that down-projects with a matrix, applies GELU, and up-projects with another matrix, operating only along the channel dimension (see the sketch after this list). The method can be viewed as a counterpart of LoRA, i.e., a generic fine-tuning method, e.g., for adapting the model to a style domain.

  3. CSC-Tuner: $O^{CSC}_j(x_{N-j}) = \sum_{m=1}^{M} \alpha_m\,\bigl(T_j(x_{N-j} + c^m_j) + c^m_j\bigr) + x_{N-j}$, where $\{c^m\}_{m=1}^{M}$ are $M$ conditions such as depth; each condition is also passed through a trainable hint block to produce multi-scale features. The method can be viewed as a counterpart of ControlNet.
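
A minimal sketch of the SC-Tuner in item 2; channel sizes and the bottleneck rank are illustrative.

```python
import torch
import torch.nn as nn

class SCTuner(nn.Module):
    """O_j(x_{N-j}) = T_j(x_{N-j}) + x_{N-j}, where T_j is down-project -> GELU -> up-project,
    applied along the channel dimension of a U-Net skip-connection feature."""
    def __init__(self, channels: int, rank: int = 64):
        super().__init__()
        self.down = nn.Linear(channels, rank)
        self.act = nn.GELU()
        self.up = nn.Linear(rank, channels)

    def forward(self, x):                  # x: (B, H*W, C) skip feature
        return x + self.up(self.act(self.down(x)))
```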

 

GLIGEN (image entity/image style/bounding box/keypoints/depth/edge/segmentation + text)

GLIGEN: Open-Set Grounded Text-to-Image Generation

GLIGEN

  1. StableDiffusion

  2. 除了caption,额外给定一组entity和对应的grounding信息(比如layout),进行spatial control。

  3. A trainable gated self-attention layer is inserted between the self-attention and the cross-attention: grounding tokens and visual tokens are concatenated and passed through self-attention, only the outputs at the visual-token positions are kept, multiplied by a trainable gate scalar, and added back via a residual connection. The gate scalar is initialized to 0, like ControlNet's zero-conv, so the network initially behaves exactly like StableDiffusion.

  4. grounding token由entity和对应的grounding的feature同时输入一个可训练的MLP预测。entity可以是文本或者图像,为文本时就用预训练文本编码器提取其feature,为图像时就用预训练图像编码器提取其feature,grounding使用Fourier embedding提取其feature,如果是layout,就是左上右下两个坐标,如果是keypoint,就是一个坐标,如果是depth map,此时就没有entity了,直接使用一个网络将其转换为h×w个token,同时将depth map降采样后concat到输入上,训练StableDiffusion的第一个卷积层。

 

ReGround (image entity/image style/bounding box/keypoints/depth/edge/segmentation + text)

ReGround: Improving Textual and Spatial Grounding at No Cost

GLIGEN

  1. 把GLIGEN改成类似IP-Adapter的并行attention形式,不用重新训练,直接把训练好的GLIGEN改成ReGround的形式,效果也能变好。

 

InteractDiffusion (interaction + text)

InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models

定义interaction是一个三元组,分别是主体(subject)、动作(action)和客体(object),三者分别对应一个文本描述和一个bounding box,主体和客体使用同一个MLP,将文本(预训练文本编码)和bounding box(Fourier embedding)转化一个token,动作用另一个MLP也转化为一个token。

如果一张图中有多个interaction,那么不同interaction之间无法区分,所以为每个interaction加一个可训练的embedding,类似positional embedding。同样,一个interaction中三元组之间也无法区分,所以为三者各加一个可训练的embedding,所有interaction公用该embedding。

得到最终的embedding后,类似GLIGEN进行训练。

InteractDiffusion

 

InstDiff (box/mask/scribble/point + text)

InstanceDiffusion: Instance-level Control for Image Generation

InstDiff

 

ControlNet (edge/segmentation/keypoints + text)

Adding Conditional Control to Text-to-Image Diffusion Models

  1. Introduces a condition module, ControlNet, into a pre-trained StableDiffusion, similar in spirit to PDAE.

  2. ControlNet: StableDiffusion is frozen; each block of the U-Net encoder and middle block is copied and trained, and its outputs are added to the outputs of the corresponding U-Net decoder blocks. Zero convolutions are 1x1 convolutions with all parameters initialized to 0, so before training the whole trainable copy outputs 0 and does not affect the original network (see the sketch after this list).

  3. The condition usually has the same size as the original image. Since it must be added to the input of the original network, its size must match that input; StableDiffusion's input is a downsampled latent, so the condition also needs downsampling, which requires training an extra encoder to encode and downsample the condition.

  4. Multiple ControlNets can be combined.

  5. StableDiffusion generally needs classifier-free guidance to generate good images. ControlNet can then be applied to both the unconditional and conditional predictions, or only to the conditional one. When generating without a prompt, applying ControlNet to both degenerates CFG and gives poor results, while applying it only to the conditional prediction makes the guidance too strong; the fix is resolution weighting.
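
A minimal sketch of the zero-convolution wiring in item 2; the block structure is simplified and the module names are assumptions, not the official ControlNet implementation.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution with all parameters initialized to zero (the 'zero convolution' in item 2)."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlBranchBlock(nn.Module):
    """A frozen U-Net block plus its trainable copy, joined by zero convolutions,
    so the control branch has no effect before training starts."""
    def __init__(self, frozen_block: nn.Module, trainable_copy: nn.Module, channels: int):
        super().__init__()
        self.frozen = frozen_block.requires_grad_(False)
        self.copy = trainable_copy
        self.zin, self.zout = zero_conv(channels), zero_conv(channels)

    def forward(self, h, cond):
        # at initialization zout(...) == 0, so the output equals the frozen branch alone
        return self.frozen(h) + self.zout(self.copy(h + self.zin(cond)))
```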

ControlNet1

ControlNet2

 

ControlNet-XS

ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models

ControlNet-XS

  1. ControlNet存在information delay的问题,即某个时间步的去噪时,SD encoder不知道control信息,ControlNet encoder不知道generative的信息。

  2. ControlNet-XS让两个encoder之间同步information,一个的feature map过一个可训练的convolution后加在另一个上,反之亦然,这样ControlNet encoder就不需要复制SD encoder了,而是可以使用参数量更少的处理同维度feature map的网络,随机初始化进行训练即可,效果还比ControlNet要好。

 

CtrLoRA

CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation

CtrLoRA

 

FineControlNet

FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection

FineControlNet

  1. StableDiffusion + ControlNet

  2. training-free

  3. 将多实例输入进行分离,修改cross-attention,每个实例过一次cross-attention,所有实例的输出相加得到最后输出。在UNet feature上进行操作,所以在UNet encoder部分,只融合text信息,在UNet decoder部分,同时融合control信息和text信息。

 

SmartControl

SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions

SmartControl

  1. Relax the visual condition on the areas that are conflicted with text prompts. 如使用deer的depth map生成tiger时,鹿角部分需要舍去。

  2. ControlNet can control the condition strength with a scalar $\alpha$: $h^{i+1} = D^i\bigl(h^i + \alpha\, h^i_{cond}\bigr)$. As $\alpha$ decreases, the sample follows the condition less, but the behavior is unstable and each visual condition needs a carefully chosen $\alpha$; based on this observation a relaxed-alignment dataset is constructed, on which SmartControl is then trained.

 

ControlNet++

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

ControlNet++-1

ControlNet++-2

  1. 加噪后去噪一步,使用x^0进行fine-tune。

 

X-Adapter

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

X-Adapter

  1. train a universal compatible adapter so that plugins of the base stable diffusion model (such as ControlNet on SD) can be directly utilized in the upgraded diffusion model (such as SDXL).

  2. 训练一个mapper,将base model的decoder的feature映射到upgraded model的decoder的feature维度并加上去,使用upgraded model的diffusion loss训练mapper。注意训练时,upgraded model输入的是empty prompt。

 

CCM

CCM: Real-Time Controllable Visual Content Creation Using Text-to-Image Consistency Models

CCM

  1. The way to add new conditional controls to the pre-trained CMs.

  2. ControlNet can be successfully established through the consistency training technique.

 

CoDi

CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation

  1. 和CCM的目标一样,使用consistency distillation在预训练diffusion model的基础上训练一个类似ControlNet的网络进行快速的条件生成。

  2. 类似ControlNet,也是复制一个UNet encoder出来,但并不是skip connect到预训练diffusion model UNet decoder,而是将其每一层输出的feature与预训练diffusion model UNet encoder对应的每一层的输出进行线性插值插值,插值系数也是可学习的,初始化为0,使用consistency training technique进行训练。

 

Ctrl-Adapter

Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

Ctrl-Adapter

  1. 类似X-Adapter。Pretrained ControlNet cannot be directly plugged into new backbone models due to the mismatch of feature spaces, and the cost of training ControlNets for new backbones is a big burden for many users.

 

ControlNeXt

ControlNeXt: Powerful and Efficient Control for Image and Video Generation

ControlNeXt

  1. We remove the control branch and replace it with a lightweight convolution module composed solely of multiple ResNet blocks. We integrate the controls into the denoising branch at a single selected middle block by directly adding them to the denoising features after normalization through Cross Normalization.

  2. 不再是复制原模型,而是使用一个轻量级的模块处理条件,并且只将结果在原模型的某个中间的block引入。极大的降低了参数量和计算量。

 

MGPF

Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion

  1. 针对visual controls are misaligned with text prompts的问题,比如prompt中提到了某个object,但visual control中没有对应的edge,这样使用ControlNet生成出的图像会丢失这个object。

  2. 这本质上是ControlNet主导了生成的结果,所以提出了一种training-free的方法,根据每个object的edge提取mask,所有mask组合在一起,将ControlNet的feature乘上该mask再加到UNet decoder的feature上,目的是让ControlNet只负责生成有visual controls的objects,our experimental results show that the application of masks to ControlNet features substantially mitigates conflicts between mismatched textual and visual controls, effectively addressing the problem of object missing in generated images.

  3. 针对属性不绑定的问题,计算attibute和object的cross-attention map之间的overlap,梯度下降优化zt

 

CNC (depth/image/depth and image + text)

Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

CNC-1

CNC-2

  1. StableDiffusion + ControlNet

  2. 自监督训练,对于某张图像,提取salient object的mask,图像乘上mask即为foreground图像,图像乘上mask的补码再对salient object部分进行inapinting得到background图像。分别对foreground和background图像提取depth。

  3. 提取foreground和background图像的CLIP image embedding,经过一个网络后concat在text embedding后,在ControlNet的cross-attention层用上mask,让Q和foreground K只在mask区域有值,让Q和background K只在mask区域之外有值。

  4. foreground和background是不对等的,对调它们的输入会生成不同位置关系的图像,所以叫3D depth aware。

 

FreeControl (keypoints/depth/edge/segmentation/mesh + text)

FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition

  1. StableDiffusion

  2. training-free

  3. DDIM Inversion时,UNet decoder第一个self-attention之前的feature(query, key, value)为C×H×W,看成H×W个长度为C的向量,求PCA后,取前三个为基,求feature在这三个基上的坐标为3×H×W,画成图后具有分割属性。同一concept不同模态的图片进行DDIM Inversion有同样的效果。同一concept不同模态求得的基也是通用的。

FreeControl1

  1. 利用这一属性,先生成一些target concept的图片,得到N×C×H×W的feature,看成N×H×W个长度为C的向量,求PCA后取基。生成过程中,让生成图像的feature在这组基上的坐标与condition的feature在这组基上的坐标靠近,计算loss求梯度作为guidance。思想和Late-Constraint类似,只不过是training-free的。

FreeControl2

 

T2I-Adapter (edge/segmentation/keypoints + text)

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

T2I-Adapter

  1. 为预训练好的StableDiffusion的encoder输出的各分辨率的feature map加上由condition计算出的同尺寸的feature map,只优化T2I-Adapter。

 

BTC (sketch/depth/pose + text)

Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples

BTC

  1. 训练时不需要text,且只需要几十到几百个样本。

  2. 类似T2I-Adapter,训练一个prompt-free condition encoder,其输出的feature map加在StableDiffusion的encoder输出的各分辨率的feature map上。prompt-free condition encoder从StableDiffusion的encoder复制而来,去掉了cross-attention层,每个尺寸的feature map输入一个额外的zero convolution层。

 

DiffBlender (sketch/depth/edge/box/keypoints/color + text)

DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models

  1. StableDiffusion

  2. self-attention和cross-attention之间插入可训练的local self-attention和global self-attention进行多模态训练。

DiffBlender

 

Universal Guidance (segmentation/detection/face recognition/style + text)

Universal Guidance for Diffusion Models

  1. StableDiffusion

  2. Forward guidance: use Tweedie's formula to compute $\hat{x}_0$ from $x_t$ and $\epsilon_\theta(x_t, t)$, feed it into an off-the-shelf segmentation / detection / face recognition / style model to compute a loss, and use the loss gradient as guidance.

  3. Backward guidance: on top of the forward guidance, optimize a $\Delta x_0$ with Decomposed Diffusion Sampling for further guidance.

  4. At every sampling step the resample technique is used to repeat forward + backward guidance several times (a sketch of the forward pass follows).
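
A sketch of one forward-guidance step (item 2); `guide_loss` stands in for the frozen off-the-shelf model, and the DDPM parameterization of Tweedie's formula is assumed.

```python
import torch

def forward_guidance_grad(x_t, t, eps_model, alpha_bar_t, guide_loss, scale=1.0):
    """Predict x0_hat via Tweedie's formula, score it with an off-the-shelf model
    (segmentation / detection / face / style), and return the gradient w.r.t. x_t."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    loss = guide_loss(x0_hat)                           # placeholder for the frozen recognition model
    return scale * torch.autograd.grad(loss, x_t)[0]    # subtracted from the sampler update as guidance
```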

 

Multi-Condition

baseline:ControlNet、T2I-Adapter等模型,不同condition单独训练好后,可以通过feanture插值的方式进行组合使用,实现multi-condition控制。

 

Composer (shape/semantics/sketch/masking/style/content/intensity/pallete/text)

Composer: Creative and Controllable Image Synthesis with Composable Conditions

  1. 用各种预训练网络提取图像的各种结构、语义、特征信息,然后作为条件训练GLIDE。

  2. 训练技巧:以0.1的概率丢弃全部conditions,以0.7的概率包含全部conditions,每个condition独立以0.5概率丢弃。

 

MaxFusion

MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models

MaxFusion

  1. 一个ControlNet接收不同模态输入进行训练,图中的不同task使用的是相同的网络。

  2. 将不同模态在每层计算完成后得到的feature进行merge然后skip-connect到UNet decoder,merge后的feature再unmerge为原来的数量输入到下一层。

  3. merge策略:对于每个spatial位置,计算两个feature之间的相关性,如果大于某个预设的阈值,就取两个feature的平均;如果小于阈值,就分别计算它们相对于各自整个feature的标准差,选择标准差较大的那个feature。

  4. baseline是Multi-T2I Adapter和Multi-ControlNet,即每个task单独训练一个T2I Adapter或ControlNet,然后一起使用。

 

OmniControlNet

OmniControlNet: Dual-stage Integration for Conditional Image Generation

OmniControlNet

  1. 先为不同模态分别学习一个pseudo word,例如使用几张depth map images和"use <depth> as feature"利用TI学习"<depth>"的word embedding。

  2. 之后使用不同模态训练ControlNet,其中trainable copy的prompt之前加上对应条件的模态的"use <depth> as feature",这样一个ControlNet就可以处理不同模态的条件。

 

gControlNet

Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

  1. 多个模态的condition融合,输入到一个ControlNet进行训练,实现任意种模态的condition组合生成。

 

Uni-ControlNet

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

Uni-ControlNet-1

Uni-ControlNet-2

 

AnyControl

AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation

AnyControl

  1. ControlNet

  2. multi-control fusion block是cross-attention,让query token与visual token进行交互(text token不参与,直接输入下一层),visual token要加positional embedding以区分不同spatial control,multi-control alignment block就是self-attention,让query token获取信息。

  3. query token最终的输出送入ControlNet的cross-attention。

  4. 训练时随机drop不同spatial control,以让模型适用于不同数量的spatial control。

 

DynamicControl

DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation

DynamicControl

 

FaceComposer

FaceComposer: A Unified Model for Versatile Facial Content Creation

  1. 类似Composer,专做人脸,还支持talking face生成。

 

TASC

Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models

TASC

  1. 解决prompt中有但是depth map中没有的物体在生成时丢失的问题。

 

Any-to-Any

Versatile Diffusion

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model

VersatileDiffusion

 

Multi-Source

Multi-Source Diffusion Models for Simultaneous Music Generation and Separation

  1. 多个音乐源拼接在一起进行训练,训练时所有音乐源都使用相同的时间步,噪声不一样。

  2. total generation

  3. partial generation:blended inpainting,配乐。

  4. source separation:将某个要分离出来的音乐源视为所有音乐源的和减去其它音乐源的和。

 

UniDiffuser

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

  1. 使用预训练编码器将image和text都转换为token,额外训练两个decoder,可以根据token重构image和text。

  2. text-image联合训练,使用U-ViT架构,训练时两者采样不同的时间步和噪声,这样可以做到unconditional(另一个模态一直输入噪声),conditional(另一个模态一直输入条件),joint(同步生成) sampling。

 

OneDiffusion

One Diffusion to Generate Them All

OneDiffusion

  1. 类似UniDiffuser。

 

ONE-PIC

Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC

ONE-PIC

 

EasyGen

Making Multimodal Generation Easier When Diffusion Models Meet LLMS

  1. BiDiffuser: fine-tune UniDiffuser on image-to-text and text-to-image only, with the objective $\|\epsilon^x - \epsilon^x_\theta(x_{t_x}, y_0, t_x, 0)\|_2^2 + \|\epsilon^y - \epsilon^y_\theta(x_0, y_{t_y}, 0, t_y)\|_2^2$, i.e., the joint distribution is no longer trained.

  2. BiDiffuser is then coupled with an LLM.

EasyGen

 

CoDi

Any-to-Any Generation via Composable Diffusion

  1. 目标:generate any combination of output modalities from any combination of input modalities.

  2. We begin with a pretrained text-image paired encoder, i.e., CLIP. We then train audio and video prompt encoders on audio-text and video-text paired datasets using contrastive learning, with text and image encoder weights frozen。这样每个模态就能得到一个encoder,且编码结果共享一个common embedding space。每个模态以编码结果为条件训练一个diffusion model。

  3. 上面训练得到的是单模态的diffusion model,只能单对单自生成,还不能多对多生成。使用text-image数据,为text diffusion model和image diffusion model的UNet各自加入新的cross-attention层,训练时只训练这个cross-attention层,cross-attention的方式是为每个模态的noisy latent设计一个independent encoder,将不同模态的noisy latent嵌入到一个common embedding space,attend这个embedding token,除了diffusion loss同时也利用contrastive learning进行训练,这样text和image的noisy latent就可以通过它们的encoder对齐。之后固定住text的encoder和cross-attention weights,用text-audio数据,重复该方法,训练得到audio的encoder和cross-attention weights。之后固定audio的encoder和cross-attention weights,用audio-video数据,重复该方法,训练得到video的encoder和cross-attention weights。这样在cross-attention中,四种模态的noisy latent都被对齐了,之后可以interpolation不同noisy latent的encoder embedding进行joint sampling,即使这种combination可能没训练过。

 

CoDi-2

CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

对于多模态数据,利用Codi的multimodal encoder,将其它模态的编码结果(feature sequence)送入LLM进行训练,对输出(feature sequence)进行回归,同时将其输入对应模态的diffusion model计算diffusion loss,两个loss一起训练。

text还是token prediction loss进行训练。

本质还是feature-based而非token-based。

 

GlueGen

GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation

为不同模态语料(如语音、外文等)学习一个编码网络,使编码结果(分布)与现有的StableDiffusion的text encoder的编码结果(分布)对齐。

这样就可以无缝切换,使用训练好的编码网络为StableDiffusion提供cross-attention的kv,做不同模态的生成。

不用fine-tune StableDiffusion,而且fine-tune会导致对之前模态的遗忘。

 

In-Context/Prompt/Instruction

UniControl

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

  1. 模仿InstructGPT训练可根据instruction进行生成的StableDiffusion。

  2. 将不同任务整理成统一形式的task,每个task包含task instruction(如segmentation to image),prompt,visual conditon(segmentation)和target image,训练时使用ControlNet架构,prompt输入StableDiffusion,task instruction和visual condition输入ControlNet,多个task一起训练。可以泛化到zero-shot task和zero-shot task combination(如segmentation + skeleton to image)。

 

PromptDiffusion

In-Context Learning Unlocked for Diffusion Models

Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts

  1. prompt由一个example pair和一个text构成,example pair由query image(如segmentation、edge map等)和query image对应的real image组成,之后给定一个新的query image,模型需要根据example pair和text生成对齐的图像。

  2. 训练好的模型还可以适用于unseen example pair,即In-Context Learning(无需训练的学习框架)。

  3. 模型架构和ControlNet一致,只是输入的条件变成了example pair和新的query image的组合。

 

ContextDiffusion

Context Diffusion: In-Context Aware Image Generation

ContextDiffusion

 

ImageBrush

ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

ContextDiffusion

  1. 和PromptDiffusion一样的In-Context Learning,example pair + query image + target image组成一个2×2的grid作为数据进行训练,example pair和query image保持不变,diffusion训练生成target image。

 

InstructGIE

InstructGIE: Towards Generalizable Image Editing

InstructGIE

  1. 和ImageBrush类似。

 

Analogist

Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model

Analogist

  1. 和ImageBrush类似。

 

ReEdit

ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models

ReEdit

  1. 和ImageBrush类似。

 

Human/Hand

HumanSD

HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation

HumanSD-1

HumanSD-2

  1. The skeleton is also encoded by the VAE encoder and concatenated onto $z_t$.

  2. Feeding $\epsilon_\theta(z_t, t, c) - \epsilon$ into a pre-trained human pose heatmap estimator yields a heat map, which is used as a per-pixel weight on the diffusion loss. The idea is similar to PDAE's gradient estimator, with $\epsilon_\theta(z_t, t, c) - \epsilon$ playing the role of the estimator's output.

 

HairDiffusion

HairDiffusion: Vivid Multi-Colored Hair Editing via Latent Diffusion

HairDiffusion

 

Parts2Whole

From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation

Parts2Whole

  1. Appearance Encoder的输入不加噪,且每个part image独立输入提供reference feature,输入的text为该part image对应的类别,如face、hair等。

  2. Shared Self-Attention的思想类似GLIGEN,进行self-attention后只保留image feature。如果有part image的mask,attention时只attend unmask部分的pixel。

  3. Decoupled Cross-Attention是IP-Adapter,两个并行的cross-attention layer分别处理text和part image。

 

HandRefiner

HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting

HandRefiner

  1. hand depth map + ControlNet

 

Hand2Diffusion

Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation

Hand2Diffusion

  1. 先生成手再生成body。

 

HanDiffuser

HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances

HanDiffuser

  1. 以hand params为中介进行生成。

 

RHanDS

RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance

RHanDS-1

RHanDS-2 RHanDS-3

 

  1. 将畸形的手从原图中割下来,输入RHanDS进行修复,之后再粘贴回原图。

  2. RHanDS的训练包含两个阶段,第一阶段构造数据集(同一个人的两只手作为一对数据)训练保持style,第二阶段使用一个3D模型提取mesh训练根据structure重构。该3D模型也可以根据畸形的手提取出正常手的mesh。

 

HumanRefiner

HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

HumanRefiner

  1. AbHuman数据集:使用StableDiffusion生成human图像,人工标注了异常分数以及异常的区域,之后训练一个打分模型和一个异常目标检测模型。

  2. 在AbHuman上fine-tune一下StableDiffusion,不然StableDiffusion无法识别含有异常描述的prompt,之后CFG + score guidance进行生成。

  3. 之后的refine是可选项。

 

Hand1000

Hand1000: Generating Realistic Hands from Text with Only 1,000 Images

Hand1000

 

MoLE

MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

MoLE

  1. MoE的方法组合LoRA参数。

 

Text/Glyph

TextDiffuser

TextDiffuser: Diffusion Models as Text Painters

  1. 生成带文字的图片。

  2. 先训练一个Transformer生成文字的layout,再训练一个以layout的mask为条件的diffusion model生成图片。

 

TextDiffuser-2

TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

  1. 训练一个LLM对text rendering进行layout planning,之后训练一个diffusion model根据layout planning进行生成。

 

CustomText

CustomText: Customized Textual Image Generation using Diffusion Models

CustomText

 

RefDiffuser

Conditional Text-to-Image Generation with Reference Guidance

RefDiffuser

 

GlyphControl

GlyphControl: Glyph Conditional Control for Visual Text Generation

  1. 自监督训练,使用OCR模型识别带文字图像中的文字,并将其输入ControlNet训练重构原图。

 

GlyphDraw

GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently

  1. 所有条件输入UNet重新训练。

 

GlyphDraw2

GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models

  1. ControlNet

 

TextGen

How Control Information Influences Multilingual Text Image Generation and Editing?

TextGen

  1. ControlNet

 

AnyText

AnyText: Multilingual Visual Text Generation And Editing

AnyText

 

UDiffText

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

UDiffText

 

Brush Your Text

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

  1. ControlNet + cross-attention mask constraint

 

LTOS

LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions

LTOS

  1. object-layout control module由GLIGEN实现。

  2. visual-text rendering module由ControlNet实现(在GLIGEN的基础上),类似ControlNet-XS解决information delay问题一样,为了让layout与glyph信息有交互,让skip feature与backbone feature进行cross-attention后再进行skip-connection。

 

AMOSampler

AMOSampler: Enhancing Text Rendering with Overshooting

AMOSampler

  1. training-free

  2. The text-rendering tokens are used to compute a cross-attention map; "we then average the attention map over different layers and heads and rescale its values between 0 and 1."

  3. ODE overshooting from $z_t$ to $z_o$: "we apply the resulting attention map to give different image patches different amounts of overshooting." Where the cross-attention map is large, the text is already rendered well, so a larger step is taken; where it is small, the rendering is still poor, so a smaller step is taken. As a result, different image patches of $z_o$ end up at different timesteps.

  4. According to the overshooting step sizes, different amounts of noise are added to the different image patches so that all patches return to the same timestep, giving $z_s$.

 

TextCenGen

TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

TextCenGen

  1. training-free.

 

ARTIST

ARTIST: Improving the Generation of Text-rich Images by Disentanglement

ARTIST

  1. text module:先使用只有text的黑白图片训练一个diffusion model。

  2. visual module:固定text module,使用带text的真实图片训练一个diffusion model,for each intermediate feature from the mid-block and up-block layers of text module, we propose to use a trainable convolutional layer to project the feature and add it element-wisely onto the corresponding intermediate output feature of the visual module.

 

JoyType

JoyType: A Robust Design for Multilingual Visual Text Creation

JoyType

  1. ControlNet

  2. diffusion loss + x^0 OCR loss

 

MGI

Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training

MGI

 

TextMaster

TextMaster: Universal Controllable Text Edit

 

SIGIL

Towards Visual Text Design Transfer Across Languages

 

Image Composition

Collage-Diffusion

Collage Diffusion

  1. 将不同collage拼在一起并保证harmonization(无重叠)。

  2. 使用TI将每个collage编码进text embedding,同时修改StableDiffusion的cross-attention,类似MaskDiffusion引入mask信息,一起训练。

  3. 生成时为每个collage的pseudo word对应的cross-attenion map引入mask。

 

DiffHarmonization

Zero-Shot Image Harmonization with Generative Model Prior

  1. Given a composite image, our method can achieve its harmonized result, where the color space of the foreground is aligned with that of the background.

  2. To achieve image harmonization, we can leverage a word whose attention is mainly constrained to the foreground area of the composite image, and replace it with another word that can illustrate the background environment.

 

DiffHarmony

DiffHarmony: Latent Diffusion Model Meets Image Harmonization

DiffHarmony++: Enhancing Image Harmonization with Harmony-VAE and Inverse Harmonization Model

DiffHarmony

 

RecDiffusion

RecDiffusion: Rectangling for Image Stitching with Diffusion Models

RecDiffusion

  1. task:rectangling

 

PrimeComposer

PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering

PrimeComposer

  1. pixel composition:按照mask直接拼接在一起。

  2. correlation diffuser:object的inversion过程中的self-attention layer的KV取代pixel composition的self-attention layer的KV,注意只取代Mobj位置的KV。

  3. RCA:限制object对应的cross-attention在mask内,mask之外的响应值赋为负无穷。

  4. 每一步latent都要和background的inversion过程中的latent再做pixel composition,以保持背景。

 

FreeCompose

FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

FreeCompose-1

FreeCompose-2

  1. 利用DDS优化图像进行image composition。

  2. 对于object removal:Po是something in some place,Pt是some place,the KV of the self-attention layers masked by M are excluded during noise prediction.

  3. 对于image composition:带T2I-Adapter的DDS,PoPt相同,需要提取原图中object的sketch,以及期望该object在目标图像中的sketch(可以和原图中的object的sketch不同,比如变成举手),优化时replaces the optimized image KV of the self-attention layers with input image KV of the self-attention layers。

  4. 对于image harmonization:Po是空字符串,Pt是a harmonious scene,两张要拼的图像直接粘在一起作为输入图像。

 

Diffusion-in-Diffusion

Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation

 

TF-ICON

TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition

  1. 将reference image注入到main image中,并且符合为main image的风格。

  2. 使用exceptional inversion将两个image编码到噪声,然后将reference image的编码噪声resize并注入到main image的编码噪声中,再生成。

 

TALE

TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization

TALE

 

CD

Composite Diffusion

  1. scaffolding stage: 根据condition生成到某一中间步,只有大致的结构。

  2. harmonization:text-guided generation or blended(若有segmentation condition)

 

LRDiff

Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis

LRDiff

  1. Vision guidance: add $(2M - 1)\,\delta$ to $x_t$, where $M$ is an image-sized mask specifying the object's region and $\delta$ is a scalar. "For the region containing an object, we add δ to enhance the generation tendency of that object. Conversely, for areas outside the target region, we subtract δ to suppress the generation tendency of the object." The value of $\delta$ can be obtained by averaging $x_t$ over the high-response region of the cross-attention map.

 

MagicFace

MagicFace: Training-free Universal-Style Human Image Customized Synthesis

MagicFace

  1. 图中画错了,应该是t>αT

  2. 利用cross-attention和self-attention估计出每个concept的mask。

  3. RSA: self-attention时concat上所有concept的K和V,计算self-attention map时乘上一个mask(也是concat在一起),抬高不同concept对应区域的权重。

  4. RBA: 每个concept单独计算出一个self-attention map,只留下mask区域内的。

 

Make-A-Storyboard

Make-A-Storyboard: A General Framework for Storyboard with Disentangled and Merged Control

Make-A-Storyboard

  1. 使用TI分别学习concept和scene,如果直接用concept+scene造句,生成效果不佳。可以先用concept生成,然后提取mask。然后分别用concept和scene进行生成,到某一步λ时,使用mask融合两者的xλ,之后进行交替生成,一步使用只带concept的句子,一步使用只带scene的句子。

 

AnyScene

AnyScene: Customized Image Synthesis with Composited Foreground

AnyScene

  1. Foreground Injection Module是ControlNet架构自监督训练。

 

GP

Generative Photomontage

GP

  1. Note that we inject the initially generated self-attention features for all images except for QB, the query features of the base image. If we inject the initial QB​ features, we often observe suboptimal blending at the seams.

  2. Q influences the image structure, while K and V influence the appearance. Hence, injecting QB (and thus completely overwriting the Q features) eliminates the opportunity for the model to adapt the image structure near the seams.

 

Image Editing through Text

Summarization

MDP

MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path

  1. 通用框架,类似survey。

 

Mask-Based

除了提供text,还需要指定需要编辑的区域,编辑时使用text-guided inpainting方法,保持unmask部分不变,参考Inpainting部分。

 

IIE

Guided Image Synthesis via Initial Image Editing in Diffusion Model

For regions of a generated image that are unsatisfactory, the corresponding regions of $x_T$ can be re-randomized; the position of an object in the generated image can also be changed by moving the region of the initial noise that corresponds to that object.

 

MaSaFusion

Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion

MaSaFusion

 

ITIE

Image Inpainting Models are Effective Tools for Instruction-guided Image Editing

ITIE

  1. Grounded-SAM获取mask后使用inpainting方法进行编辑。

 

MagicQuill

MagicQuill: An Intelligent Interactive Image Editing System

MagicQuill

 

Mask-Free

难点在于如何保持图像除编辑外的背景和其它内容与原图一致。

 

Baseline1

  1. DDIM Inversion + Conditional Generation

 

Baseline2

  1. Text-Guided SDEdit

 

LASPA (real image editing, no fine-tune)

LASPA: Latent Spatial Alignment for Fast Training-free Single Image Editing

LASPA

  1. 为了保持原图的细节,最直接的做法就是将原图注入生成过程中,SDEdit相当于只是单步注入,LASPA在每一步都注入,使用最简单的插值法。

 

LaF (real image editing, no fine-tune)

Text Guided Image Editing with Automatic Concept Locating and Forgetting

LaF-1

LaF-2

  1. Text-Guided SDEdit

  2. Text-Guided SDEdit tends to constrain the edited concept by the original image (e.g., its shape), so a syntactic parser extracts the concept $c_n$ to forget, and sampling uses the CFG combination $\epsilon_\theta(x_t) + \omega\bigl(\epsilon_\theta(x_t, c_p) - \epsilon_\theta(x_t)\bigr) - \eta\bigl(\epsilon_\theta(x_t, c_n) - \epsilon_\theta(x_t)\bigr)$ (sketch below).
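
A minimal sketch of the combined CFG rule in item 2; `eps_model` is a placeholder for the text-conditional backbone, with `None` standing in for the unconditional branch.

```python
import torch

def laf_cfg_eps(eps_model, x_t, t, c_pos, c_neg, omega, eta):
    """CFG with concept forgetting: amplify the target prompt c_pos and push away from c_neg."""
    eps_uncond = eps_model(x_t, t, None)
    eps_pos = eps_model(x_t, t, c_pos)
    eps_neg = eps_model(x_t, t, c_neg)
    return eps_uncond + omega * (eps_pos - eps_uncond) - eta * (eps_neg - eps_uncond)
```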

 

RF-Inversion (real image editing, no fine-tune)

Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations

  1. Rectified Flows (RFs) offer a promising alternative to diffusion models, yet their inversion has been underexplored. We propose RF inversion using dynamic optimal control derived via a linear quadratic regulator.

 

RF-Solver-Edit (real image editing, no fine-tune)

Taming Rectified Flow for Inversion and Editing

  1. RF-Solver not only significantly enhances the accuracy of inversion and reconstruction, but also improves performance on fundamental tasks such as T2I generation.

  2. 用RF-Solver进行inversion后进行类似P2P的编辑。

 

FlowInversion (real image editing, no fine-tune)

Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing

FlowInversion

  1. RF inversion

 

P2P (generated image editing, real image editing, no fine-tune)

Prompt-to-Prompt Image Editing with Cross Attention Control

P2P

  1. Imagen

  2. The structure of an image generated by a text2img model is mainly determined by the random seed and the cross-attention; keeping the seed fixed (with DDIM, keeping the initial noise fixed) and manipulating the cross-attention preserves content.

  3. This method does not edit an existing image; it starts from Gaussian noise and generates two images in parallel, one from the source prompt and one from the target prompt (the source image is not known beforehand). In effect there is a reconstruction generative trajectory using the source prompt and an editing generative trajectory using the target prompt; the former provides cross-attention maps that overwrite those of the latter to realize the edit.

  4. The manipulation is applied to the cross-attention part of the 16x16-resolution hybrid attention in Imagen's 64x64 text model; the self-attention part is untouched and Imagen's original super-resolution models are kept.

  5. K and V become visual tokens + target-prompt tokens; the resulting cross-attention map (from the new Q and K) is manipulated in three ways: word swap (keep the original cross-attention maps for all words except the swapped one); adding a new phrase (keep the original maps for the old phrase); attention re-weighting (multiply the map of the word to strengthen/weaken by a constant).

  6. The above is generated-image editing. For real-image editing, DDIM Inversion is needed: first invert the source image with the source prompt, then start from the resulting x_T and apply the same editing operations with the target prompt. Because editing runs two parallel trajectories (reconstruction and editing) that must share the same ω, and editing only works well with a fairly large ω, reconstruction must also use a large ω; DDIM Inversion therefore has to use a small ω (see the Inversion section), yet reconstruction is still worse than when both inversion and reconstruction use a small ω, and the background of the edited image deviates from the original, which forces lowering ω on the editing trajectory. This is the distortion-editability tradeoff (small ω preserves the background but edits poorly; large ω edits well but loses the background). The paper offers a fine-grained remedy: threshold the attention map of the edited word in the user-provided source caption into a mask that protects the regions outside that word, and generate in a blended manner. A sketch of the attention manipulation follows below.
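
A minimal sketch of the P2P-style cross-attention manipulation, assuming access to the per-layer cross-attention maps of the two trajectories (e.g., via hooks); shapes and the injection schedule are illustrative, not the official implementation.

```python
import torch

def swap_cross_attention(attn_src, attn_tgt, shared_token_idx, t, tau=0.6, T=1000):
    """Word swap: attn_* are (heads, query_pixels, tokens) maps at the same
    layer/step for the reconstruction (src) and editing (tgt) trajectories.
    During the early, structure-forming steps, tokens shared by both prompts
    reuse the source maps; swapped/new tokens keep their own maps."""
    if t > tau * T:  # t counts down from T
        out = attn_tgt.clone()
        out[..., shared_token_idx] = attn_src[..., shared_token_idx]
        return out
    return attn_tgt

def reweight_cross_attention(attn, token_idx, scale):
    """Attention re-weighting: scale the map of one word by a constant factor."""
    attn = attn.clone()
    attn[..., token_idx] *= scale
    return attn
```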

 

NTI (real image editing)

Null-text Inversion for Editing Real Images using Guided Diffusion Models

  1. StableDiffusion

  2. Addresses the problem in P2P's real-image editing where DDIM Inversion with a small ω followed by editing with a large ω reconstructs poorly.

  3. First run DDIM Inversion with ω=1 (requires the source prompt) and record all z_t*. Assign each step a null-text embedding ϕ_t. Initialize ẑ_T = z_T*; for t = T down to 1, generate z_{t−1} from ẑ_t with the source prompt, ϕ_t and ω=7.5, optimizing only ϕ_t so that the result approaches z*_{t−1}; then use ẑ_t and the optimized ϕ_t with ω=7.5 to produce ẑ_{t−1} and continue. With ω=7.5, helped by the ϕ_t, the image can be reconstructed (a sketch follows below).

  4. For editing, start from ẑ_T = z_T* and generate with the target prompt, the trained {ϕ_t} and ω=7.5; this can be combined with P2P or used on its own.
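
A minimal sketch of the per-step null-text optimization, assuming a hypothetical helper `ddim_step(z, t, text_emb, null_emb, omega)` that performs one CFG DDIM step, and the recorded inversion latents `z_star[0..T]`; hyperparameters are illustrative.

```python
import torch

def null_text_inversion(z_star, src_emb, null_emb_init, ddim_step,
                        omega=7.5, n_inner=10, lr=1e-2):
    """Optimize a per-step null-text embedding so CFG sampling at omega=7.5
    retraces the DDIM-inversion latents (a sketch of the NTI idea)."""
    T = len(z_star) - 1
    null_embs, z_hat = [], z_star[T]
    for t in range(T, 0, -1):
        phi = null_emb_init.clone().requires_grad_(True)
        opt = torch.optim.Adam([phi], lr=lr)
        for _ in range(n_inner):
            z_prev = ddim_step(z_hat, t, src_emb, phi, omega)
            loss = torch.nn.functional.mse_loss(z_prev, z_star[t - 1])
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            z_hat = ddim_step(z_hat, t, src_emb, phi.detach(), omega)
        null_embs.append(phi.detach())
    return list(reversed(null_embs))  # ordered phi_1 ... phi_T
```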

 

PTI (real image editing)

Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models

  1. StableDiffusion

  2. No source prompt is needed, so DDIM Inversion can only use ω=0.

  3. Analogous to Null-text Inversion: run DDIM Inversion with ω=0 and record all z_t*; assign each z_t a conditional embedding c_t. Initialize ẑ_T = z_T*; for t = T down to 1, generate z_{t−1} from ẑ_t with c_t and ω=7.5, optimizing only c_t so that it approaches z*_{t−1}; then use ẑ_t, the optimized c_t and ω=7.5 to produce ẑ_{t−1} and continue. With ω=7.5 and the help of c_t the image can be reconstructed.

  4. Non-P2P editing: start from z_T* and sample with ω=7.5 and c̄ = η c + (1−η) c_t, where η ∈ [0,1]; η=0 reconstructs the image, η=1 is plain DDIM editing.

 

GEO (real image editing)

InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models

  1. Requires first performing manual pixel-level editing (brush strokes, image pasting, selective edits) to obtain a rough edit, which is then refined; this differs from NTI-style methods that edit directly from the source image.

  2. Refinement: DDIM-invert the rough edit with the edit prompt up to an intermediate step, then regenerate with CFG, similar to purification.

  3. During DDIM Inversion, after obtaining z_{t+1} from z_t, compute a predicted z_0 from each of z_t and z_{t+1}, take their MSE, and optimize ε_t = ε_θ(z_t, t, c) by gradient descent for N steps; the final z_{t+1} is then computed from z_t and the optimized prediction. This preserves the details of the original image.

 

BARET (real image editing)

BARET: Balanced Attention based Real image Editing driven by Target-text Inversion

  1. Similar to Prompt Tuning Inversion, but initializes c_t with the target prompt embedding before optimization.

 

NPI (real image editing)

Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models

Instead of optimizing ϕ as in Null-text Inversion, directly replace ϕ with the source prompt c.

During DDIM Inversion and reconstruction, for any ω, ε̃(x_t, t) = ε_θ(x_t, t, c) + ω[ε_θ(x_t, t, c) − ε_θ(x_t, t, c)] = ε_θ(x_t, t, c), which guarantees reconstruction quality.

During editing, the source prompt is used as the negative prompt.

NPI

 

ProxEdit (real image editing)

ProxEdit: Improving Tuning-Free Real Image Editing with Proximal Guidance

  1. improved NPI

 

StyleDiffusion (real image editing)

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

StyleDiffusion-2

  1. Similar to NTI: run DDIM Inversion with ω=1 (requires the source prompt) and record all ẑ_t. Initialize z̃_T = ẑ_T; for t = T down to 1, generate z_{t−1} from z̃_t with ω=7.5 and make it approach ẑ_{t−1}. In this process the CLIP encoding of the source prompt serves as K, while the CLIP encoding of the image, passed through a trainable network M_t, serves as V; only M_t is optimized. After optimization, use z̃_t, M_t and ω=7.5 to produce z̃_{t−1} and continue. With ω=7.5 and the help of M_t the image can be reconstructed.

  2. Besides the MSE loss between z_{t−1} and ẑ_{t−1}, an MSE loss between the cross-attention maps is also used.

  3. For editing, start from z̃_T = ẑ_T and generate with the target prompt, the trained M_t and ω=7.5.

 

FlexiEdit (real image editing, no fine-tune)

FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing

FlexiEdit

  1. Stage 1: apply an FFT to the z_t obtained by DDIM Inversion, filter out some high-frequency components (edges, layout) to make the edit easier while keeping low-frequency components (background), then run the editing generative trajectory.

  2. Stage 2: apply SDEdit to the stage-1 result, injecting the self-attention K and V of the reconstruction generative trajectory during generation to align with the features of the original image.

 

DirectInv (real image editing)

Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code

DirectInv-1

DirectInv-2

  1. Correct every step of P2P's reconstruction generative trajectory so that its latent matches the corresponding latent recorded during DDIM Inversion, guaranteeing that the reconstruction trajectory reproduces the original image.

  2. Training-free; no optimization needed.

 

SimInversion (real image editing)

SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing

SimInversion

  1. Improves P2P by disentangling the guidance scales of the source and target branches to reduce the error.

  2. ω_s = 0.5, ω_t = 7.5.

 

SYE (real image editing)

Schedule Your Edit: A Simple yet Effective Diffusion Noise Schedule for Image Editing

SYE

  1. We introduce the Logistic Schedule, a novel noise schedule designed to eliminate singularities, improve inversion stability, and provide a better noise space for image editing. This schedule reduces noise prediction errors, enabling more faithful editing that preserves the original content of the source image.

  2. ᾱ_t = 1 / (1 + exp(k (t − 0.6 T))); this logistic noise schedule avoids the singularity of dx_t/dt at t = 0.
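
A minimal sketch of the logistic ᾱ_t schedule above; k is a free steepness hyperparameter and is an assumption here.

```python
import numpy as np

def logistic_alpha_bar(T=1000, k=0.01):
    """alpha_bar(t) = 1 / (1 + exp(k * (t - 0.6 * T)))."""
    t = np.arange(T + 1)
    return 1.0 / (1.0 + np.exp(k * (t - 0.6 * T)))
```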

 

InfEdit (real image editing)

Inversion-Free Image Editing with Natural Language

  1. Choosing the DDIM σ_t as √(1 − ᾱ_{t−1}) makes DDIM sampling coincide with Consistency Model multistep sampling; this is called the Denoising Diffusion Consistent Model (DDCM).

  2. Exploiting this, the source image can be edited without DDIM Inversion.

 

IterInv (real image editing)

IterInv: Iterative Inversion for Pixel-Level T2I Models

  1. Inversion for pipelines that include a super-resolution stage.

 

KV-Inversion (real image editing)

KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing

  1. The contents (texture and identity) are mainly controled in the self-attention layer, we choose to learn the K and V embeddings in the self-attention layer.

  2. First run DDIM Inversion with ω=7.5 (the paper is ambiguous here) and record all z_t*; prepare, for each step, LoRA parameters ψ_t on the self-attention K/V projection matrices. Initialize ẑ_T = z_T*; for t = T down to 1, generate z_{t−1} from ẑ_t with ψ_t and ω=7.5, optimizing ψ_t so that it approaches z*_{t−1}; then use ẑ_t, ψ_t and ω=7.5 to produce ẑ_{t−1} and continue. With ω=7.5 and the help of ψ_t the image can be reconstructed.

  3. For editing, start from z_T* and generate with the target prompt, the trained {ψ_t} and ω=7.5.

 

EDICT (real image editing, no fine-tune)

EDICT: Exact Diffusion Inversion via Coupled Transformations

  1. StableDiffusion

  2. Not a P2P-style method: DDIM-invert directly with the source prompt, then generate with the target prompt, both with a large ω.

  3. Borrows the affine coupling layer idea from flow-based generative models to design an invertible denoising process, so that with a large ω the image can be noised and then exactly reconstructed.

 

BDIA (real image editing, no fine-tune)

Exact Diffusion Inversion via Bi-directional Integration Approximation

 

BELM (real image editing, no fine-tune)

BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models

 

AIDI (real image editing, no fine-tune)

Effective Real Image Editing with Accelerated Iterative Diffusion Inversion

AIDI

  1. The DDIM generation step x_{t−1} = a_t x_t + b_t ε(x_t, t) is not invertible; DDIM Inversion makes the approximation ε(x_t, t) ≈ ε(x_{t−1}, t−1). Without this approximation, given x_{t−1} one must solve x_t = (1/a_t) x_{t−1} − (b_t/a_t) ε(x_t, t), so each inversion step becomes finding a fixed point of f(x_t) = (1/a_t) x_{t−1} − (b_t/a_t) ε(x_t, t), where x_{t−1} is known and treated as a constant; the fixed point is found with Anderson acceleration.

  2. (Independent of the proposed algorithm) The paper observes that asymmetric ω works better in P2P: DDIM Inversion with ω=0 (pure image inversion, no CFG) and a large ω for the reconstruction and editing trajectories outperforms using the same ω everywhere, consistent with the conclusion in (Inversion - DDIM Inversion - 4); for EDICT, inversion with ω=0 and editing with a large ω likewise beats the original setting with a shared ω.

  3. The paper uses P2P with the fixed-point inversion above; DDIM Inversion and the reconstruction trajectory use a small ω, while the editing trajectory uses a blended ω: larger guidance scales for pixels relevant to the edit and lower ones elsewhere to keep them unedited. A mask computed from the cross-attention maps of the reconstruction trajectory decides which pixels are relevant to the edit. A sketch of the fixed-point step follows below.
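
A minimal sketch of one exact-inversion step solved as a fixed point, using plain iteration for clarity (AIDI uses Anderson acceleration; SPDInv instead minimizes ||f(x_t) − x_t||² by gradient descent). `eps(x, t)` and the DDIM coefficients a_t, b_t are assumed.

```python
import torch

@torch.no_grad()
def fixed_point_inversion_step(x_prev, t, a_t, b_t, eps, n_iter=5):
    """Solve x_t = f(x_t) = x_prev / a_t - (b_t / a_t) * eps(x_t, t)."""
    # usual DDIM-inversion approximation as the initial guess
    x_t = x_prev / a_t - (b_t / a_t) * eps(x_prev, t)
    for _ in range(n_iter):
        x_t = x_prev / a_t - (b_t / a_t) * eps(x_t, t)
    return x_t
```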

 

SPDInv (real image editing, no fine-tune)

Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

  1. Similar to AIDI, also a fixed-point formulation. The DDIM step x_{t−1} = a_t x_t + b_t ε(x_t, t) is not invertible, and DDIM Inversion approximates ε(x_t, t) ≈ ε(x_{t−1}, t−1), giving x_t = (1/a_t) x_{t−1} − (b_t/a_t) ε(x_{t−1}, t−1). After each approximate x_t, SPDInv further optimizes x_t toward the fixed point by gradient descent on ||f(x_t) − x_t||².

  2. Can be plugged into many editing methods, e.g., P2P, MasaCtrl, PNP, ELITE.

 

FPI (real image editing, no fine-tune)

Fixed-Point Inversion for Text-to-Image Diffusion Models

  1. Fixed-point iteration.

 

FPI (real image editing, no fine-tune)

Exploring Fixed Point in Image Editing Theoretical Support and Convergence Optimization

  1. Fixed-point iteration.

 

AdapEdit (real image editing)

AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing

  1. Soft editing built on P2P.

 

FPE (real image editing)

Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing

FPE-1

FPE-2

  1. P2P replaces cross-attention maps, which requires a prompt for the real image; that is feasible but works poorly. This paper finds that replacing self-attention maps also works.

  2. For real-image editing, neither DDIM Inversion nor reconstruction needs a prompt.

 

FateZero (real image editing, no fine-tune)

FateZero: Fusing Attentions for Zero-shot Text-based Video Editing

FateZero

  1. StableDiffusion

  2. P2P injects the cross-attention maps of the reconstruction trajectory into the editing trajectory; this paper instead injects the attention maps recorded during DDIM Inversion, removing the need for a reconstruction trajectory. Reconstruction quality remains good.

  3. Both stages use a large ω. Record the self- and cross-attention maps of every timestep during DDIM Inversion; during editing, as in P2P, replace the cross-attention maps of the unchanged part of the prompt with those from inversion, and replace all self-attention maps (to preserve the original structure and motion during style and attribute editing).

 

ALE-Edit

Addressing Attribute Leakages in Diffusion-based Image Editing without Training

ALE-Edit

  1. Addresses attribute leakage in editing, where other objects are unintentionally altered by the edited object.

  2. Each image has N objects, each with a source/target prompt pair forming an edit prompt list; joining the N source (resp. target) prompts with "and" yields the base source prompt and base target prompt.

  3. ORE: encode the base target prompt into e_base and each object's target prompt into e_i. For object i, replace the corresponding positions of e_base with e_i, zero out the embeddings at all other positions (to prevent leakage), and replace the EOS with e_i's EOS (also to prevent leakage), giving object i's final target prompt; repeating this gives N object-specific target prompts.

  4. RGB-CAM: run inference with the base target prompt and the N target prompts to obtain N+1 cross-attention maps, then blend them (weighted sum) according to the segmentation masks.

  5. BB: blend (weighted sum) the latents of the reconstruction and editing generative trajectories according to the background mask.

 

MasaCtrl (real image editing, no fine-tune)

MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

Tuning-Free Inversion-Enhanced Control for Consistent Image Editing

  1. StableDiffusion

  2. First DDIM-invert the source image with the source prompt; from the resulting x_T, with ω=7.5 and a P2P-like two-branch setup, run reconstruction with the source prompt and editing with the target prompt. At every step, in the last few layers of the UNet decoder of the editing trajectory, replace the self-attention K and V with those of the reconstruction trajectory at the same position (Q remains the editing trajectory's own). A sketch follows below.

  3. Only the self-attention of the last few decoder layers is modified: the query features in the shallow layers of the U-Net (e.g., the encoder part) cannot obtain a clear layout and structure corresponding to the modified prompt.

  4. The control is applied only in the middle steps: performing self-attention control in the early steps can disrupt the layout formation of the target image; in the premature steps the target layout has not yet formed.

  5. In addition, at every step both trajectories compute an object mask by thresholding the cross-attention maps, restricting the self-attention in the object region of the editing trajectory to attend only to the object region of the reconstruction trajectory.

MasaCtrl

  1. Compared with P2P, which manipulates only cross-attention, MasaCtrl manipulates only self-attention; cross-attention control suits object addition/removal, while self-attention control suits action/pose changes.
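
A minimal sketch of the mutual self-attention described above (Q from the editing branch, K/V from the reconstruction branch, optionally restricted by object masks); shapes and the masking rule are illustrative assumptions.

```python
import torch

def mutual_self_attention(q_edit, k_recon, v_recon, mask_edit=None, mask_recon=None):
    """q_edit: (heads, n, d) queries of the editing trajectory;
    k_recon, v_recon: keys/values taken from the reconstruction trajectory.
    Optional binary masks (length n / m) restrict object queries to attend
    only to object keys of the reconstruction branch."""
    scale = q_edit.shape[-1] ** -0.5
    logits = torch.einsum("hnd,hmd->hnm", q_edit, k_recon) * scale
    if mask_edit is not None and mask_recon is not None:
        block = mask_edit[None, :, None].bool() & (~mask_recon[None, None, :].bool())
        logits = logits.masked_fill(block, float("-inf"))
    attn = logits.softmax(dim=-1)
    return torch.einsum("hnm,hmd->hnd", attn, v_recon)
```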

 

DiT4Edit (real image editing, no fine-tune)

DiT4Edit: Diffusion Transformer for Image Editing

  1. A DiT version of MasaCtrl.

 

MRGD (real image editing, no fine-tune)

Multi-Region Text-Driven Manipulation of Diffusion Imagery

  1. A MultiDiffusion version of P2P that edits different regions.

 

ObjectVariations (generated image editing, no fine-tune)

Localizing Object-level Shape Variations with Text-to-Image Diffusion Models

ObjectVariations-2

ObjectVariations-2

  1. StableDiffusion

  2. Transform one object in the image while keeping everything else unchanged, e.g., turning a basket into a bowl. Two parallel generative trajectories; within a certain time interval the word in the prompt is swapped.

  3. Shape preservation: threshold the cross-attention map of the word whose object's shape should be preserved to localize the object, then inject the rows and columns of the earlier self-attention maps corresponding to all of that object's pixels into the new generative trajectory. Alternatively, localize the object to be edited, treat the pixels outside it as background, and apply shape preservation to those pixels instead.

  4. Real-image editing is possible with Null-text Inversion.

 

ViMAEdit (real image editing)

Vision-guided and Mask-enhanced Adaptive Denoising for Prompt-based Image Editing

ViMAEdit

 

DDS (real image editing)

Delta Denoising Score

DDS-1 DDS-2

 

  1. Treating the image itself as the parameters, SDS can be used for editing (feed the target prompt and update the image by gradient descent), but this blurs the image, as in the upper part of the figure.

  2. The cause is a bias term inside the SDS loss, so it is split into two parts: one that performs the edit and one that blurs the image. The proposed DDS loss is ∇_θ L_DDS = (ε_ϕ(z_t, t, y) − ε_ϕ(ẑ_t, t, ŷ)) ∂z/∂θ, where θ is the image being edited, z is just θ (the SDS notation is kept for generality), ẑ is the source image, y is the target prompt and ŷ is the source prompt (usually a small modification of y); the same t and ε are used for both terms. Clearly ∇_θ L_DDS = ∇_θ L_SDS(z, y) − ∇_θ L_SDS(ẑ, ŷ). For any matched image-text pair, updating the image with ∇_θ L_SDS also blurs it, so the paired ∇_θ L_SDS is nonzero even though ideally it should be zero (the pair is already matched and needs no optimization); that nonzero paired gradient is the bias term causing the blur. The paper further finds that pairs with similar images and texts have comparable ∇_θ L_SDS norms, so ∇_θ L_SDS(ẑ, ŷ) can be treated as an estimate of the bias in ∇_θ L_SDS(z, y); DDS is the objective with that bias removed (a sketch follows below).

  3. DDS requires back-propagation for every edit request, which is expensive; DDS can further be used to train an editing model, as shown in the figure.
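
A minimal sketch of one DDS update on the edited latent, assuming a frozen noise predictor `eps(z_t, t, prompt)`, a tensor `alphas_bar` of cumulative ᾱ, and prompt embeddings `y` / `y_src`; the timestep range and learning rate are illustrative.

```python
import torch

def dds_step(z, z_src, y, y_src, eps, alphas_bar, lr=0.1):
    """One Delta Denoising Score step: the same t and noise are shared by the
    two branches, and the source-branch prediction cancels the SDS bias."""
    t = torch.randint(50, 950, (1,)).item()
    noise = torch.randn_like(z)
    a = alphas_bar[t]
    z_t = a.sqrt() * z + (1 - a).sqrt() * noise
    z_src_t = a.sqrt() * z_src + (1 - a).sqrt() * noise
    with torch.no_grad():
        grad = eps(z_t, t, y) - eps(z_src_t, t, y_src)  # DDS gradient direction
    return z - lr * grad
```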

 

DreamSampler (real image editing)

DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation

DreamSampler

  1. Replaces DDS's randomly sampled timesteps and noise with score distillation performed along the DDIM sampling trajectory, so DDS-style editing works without a source prompt.

  2. Specifically, in contrast to the original DDS method that adds newly sampled Gaussian noise to z_0, DreamSampler adds the noise estimated by ε_θ at the previous timestep of reverse sampling. With the initial noise computed by DDIM inversion, reverse sampling does not deviate significantly from the reconstruction trajectory even though no source description is given.

 

SmoothDiffusion (real image editing)

Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

SmoothDiffusion

  1. Ensures latent-space smoothness of the diffusion model: smooth latent spaces ensure that a perturbation of an input latent (x_T) corresponds to a steady change in the output image (the DDIM-sampled x_0).

  2. Achieved by adding a Step-wise Variation Regularization term during training.

  3. This benefits DDIM Inversion and reconstruction at ω=7.5, and therefore also editing.

 

IP2P (real image editing, retrain)

InstructPix2Pix: Learning to Follow Image Editing Instructions

IP2P

  1. Uses GPT-3, StableDiffusion and P2P (generated-image editing) to build a dataset in which each sample contains a source image, its caption, a target caption and the target image; a new StableDiffusion is then trained to model the target image conditioned on the source image and the target caption, so no source caption is needed at inference.

 

Emu-Edit (real image editing, retrain)

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

  1. Like IP2P, builds a new dataset for training.

  2. As with Emu, after training the model is fine-tuned on a small amount of high-quality data.

 

UltraEdit (real image editing, retrain)

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale

  1. A large-scale (~4M editing samples), automatically generated dataset for instruction-based image editing.

 

SeedEdit (real image editing, retrain)

SeedEdit: Align Image Re-generation to Image Editing

  1. Bootstrapped iterative training.

 

InstructMove (real image editing, retrain)

Instruction-based Image Manipulation by Watching How Things Move

InstructMove

  1. Annotates an instruction-based editing dataset in a way similar to AnyDoor.

  2. The source image is concatenated to z_t for training.

 

PbI (real image editing, retrain)

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

PbI

  1. Unlike IP2P, which builds its dataset with P2P, PbI builds it with the PbE idea.

  2. The editing model is the same as IP2P.

 

Diffree (real image editing, retrain)

Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

Diffree-1

Diffree-2

  1. Builds data with a segmentation dataset plus an inpainting model.

  2. Training predicts the original image from the inpainted result and the text; besides the diffusion loss, a small model is trained to predict the mask, which is used to blend the final result.

 

RIE (real image editing, retrain)

Referring Image Editing: Object-level Image Editing via Referring Expressions

RIE

  1. Finer-grained than general image editing.

  2. Builds the training set with existing image-composition, region-based editing and image-inpainting models.

  3. The editing model is a conditional diffusion model; the source image and the referring expression are fed into cross-attention as conditions.

 

EditWorld (real image editing, retrain)

EditWorld: Simulating World Dynamics for Instruction-Following Image Editing

  1. Use GPT to generate an input text, an instruction and an output text; generate I_ori from the input text with SDXL; threshold the cross-attention maps during generation to get a mask per text token and take their union as I_ori's foreground mask; inpaint the foreground of I_ori according to the output text to obtain I_tar, using IP-Adapter and ControlNet to keep I_tar consistent with I_ori. (I_ori, I_tar, instruction) forms one training sample.

  2. The editing model is the same as IP2P.

 

LIME (real image editing, retrain)

LIME: Localized Image Editing via Attention Regularization in Diffusion Models

  1. Pre-trained InstructPix2Pix.

  2. Extract the UNet features of the source image, then resize, concat, normalize and cluster them to obtain a segmentation.

  3. Take the cross-attention maps of the related tokens in the target caption, find the highest-responding points, and merge the segments containing those points into the RoI.

  4. During IP2P generation, do blended editing and additionally modify the cross-attention maps with the RoI: for unrelated tokens, subtract a large constant inside the RoI so that unrelated tokens do not affect the edit.

 

FoI (real image editing, retrain)

Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation

  1. Pre-trained InstructPix2Pix.

  2. Extract keywords from the instruction and use their cross-attention maps; repeatedly square and normalize them to widen the gap between high and low responses, then estimate a mask by thresholding.

  3. For the cross-attention maps of all instruction tokens, amplify the responses inside the mask and replace the responses outside the mask with those of ε_θ(z_t, t, I, ϕ) (the null instruction).

  4. During sampling, multiply s_T (ε_θ(z_t, t, I, T) − ε_θ(z_t, t, I, ϕ)) by the mask.

 

WYS (real image editing, retrain)

Watch Your Steps: Local Image and Scene Editing by Text Instructions

  1. Pre-trained InstructPix2Pix.

  2. Like DiffEdit, compute a mask before editing and do blended editing during IP2P generation.

 

ZONE (real image editing, retrain)

ZONE: Zero-Shot Instruction-Guided Local Editing

ZONE

  1. Pre-trained InstructPix2Pix.

  2. The cross-attention maps of description-guided models like StableDiffusion are token-wise, whereas those of instruction-guided models like InstructPix2Pix are consistent across tokens, so a mask can be estimated by thresholding IP2P's cross-attention maps. This mask is too coarse, so the IP2P editing result is fed to SAM and the segment with the largest IoU overlap is chosen as the mask. The region outside the mask in the IP2P result is then replaced with that of the source image, followed by smoothing operations to remove artifacts.

 

VisII (real image editing, retrain)

Visual Instruction Inversion: Image Editing via Visual Prompting

  1. Textual Inversion of a visual instruction on top of IP2P.

  2. IP2P takes a source image and an instruction and outputs the edited image. Given one example pair of source and edited images, learn an instruction embedding on IP2P with the TI idea; the learned embedding can then be applied to other images to achieve a similar edit.

 

E4C (real image editing, fine-tune)

E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance

E4C

  1. DDIM Inversion with ω=0 (pure image inversion, no CFG), two-branch setup; as in DirectInv, the reconstruction generative trajectory feeds the UNet at each step with the z_t recorded during inversion rather than the z_t produced at the previous step, and provides K/V or Q to the editing trajectory.

  2. Queries encode structure and layout, whereas keys and values encode texture and appearance. For layout-preserving edits, replace Q (no optimization needed); for layout-changing edits, replace K/V (the optimization below is needed).

  3. As in DiffusionCLIP, two losses optimize the Q projection matrix: a CLIP directional loss L_CLIP and an MSE loss L_Reg between the final z_0 of the two trajectories.

 

Imagic (real image editing, fine-tune)

Imagic: Text-Based Real Image Editing with Diffusion Models

  1. Imagen

  2. Only the source image and the target prompt are given.

  3. Starting from the target prompt embedding, optimize a source prompt embedding via TI; then fix it and fine-tune Imagen; finally generate with a linear interpolation of the source and target prompt embeddings.

  4. Without fine-tuning Imagen the image cannot be preserved (similar to DragDiffusion), so the fine-tuning is essential.

 

FastEdit (real image editing, fine-tune)

FastEdit: Fast Text-Guided Single-Image Editing via Semantic-Aware Diffusion Fine-Tuning

FastEdit

  1. Optimizes Imagic so that each edit is 20+ times faster.

  2. Optimization 1: use an image-to-image model conditioned on CLIP image embeddings instead of a text-to-image model; the source image's CLIP image embedding then serves directly as the source embedding (exploiting CLIP's aligned embedding space), removing the TI optimization of the source prompt embedding.

  3. Optimization 2: fine-tune the diffusion model to reconstruct the source image from its CLIP image embedding, choosing the range of fine-tuned timesteps according to the discrepancy between the source image's CLIP image embedding and the target prompt's CLIP text embedding, which reduces the number of fine-tuning steps.

  4. Optimization 3: fine-tune with LoRA to reduce the number of trainable parameters.

  5. After fine-tuning, as in Imagic, generate the edit from an interpolation of the source image's CLIP image embedding and the target prompt's CLIP text embedding.

 

Forgedit (real image editing, fine-tune)

Forgedit: Text Guided Image Editing via Learning and Forgetting

  1. Same setting as Imagic, with a slightly different recipe.

  2. Vision-language joint learning: caption the source image with BLIP, encode the caption with CLIP into a source prompt embedding, then fine-tune Imagen together with this embedding on the source image (the embedding is also optimized). Only part of the UNet is updated; the paper finds that the encoder of the UNet learns the pose, angle and overall layout of the image while the decoder learns appearance and textures, so parameters can be "forgotten": if the target prompt edits pose/layout, forget the encoder parameters; if it edits appearance, forget the decoder parameters.

  3. At generation time, take the component of the target prompt embedding orthogonal to the optimized source prompt embedding as the editing embedding, and generate with a linear combination of the optimized source prompt embedding and the editing embedding, so as to preserve the details of the source image.

 

DBEST (real image editing, fine-tune)

On Manipulating Scene Text in the Wild with Diffusion Models

DBEST

The order is reversed relative to Imagic because the source prompt is provided.

First fine-tune the diffusion model, then optimize the target prompt embedding with the cross-entropy loss of a pre-trained text recognition model.

 

PNP (real image editing, no fine-tune)

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

  1. StableDiffusion

  2. Encode the source image to noise with unconditional DDIM Inversion (input ϕ); from this common starting point run two parallel generative trajectories, one with ϕ and one with the target prompt, performing feature injection and self-attention map injection into the editing trajectory at every step. Very similar in spirit to pix2pix-zero.

  3. Feature injection: reaches the same conclusion as MasaCtrl, namely that deeper UNet features carry better structure information. The deeper-layer feature maps of the reconstruction trajectory replace those of the editing trajectory; this preserves the structure of the source image well, but some texture information also leaks into the generated image.

  4. Self-attention map injection: replace the editing trajectory's self-attention maps (Softmax(QK^T)) with those of the reconstruction trajectory to keep textures consistent.

 

Self-Guidance (real image editing, no fine-tune)

Diffusion Self-Guidance for Controllable Image Generation

Self-Guidance-1 Self-Guidance-2
  1. Compute losses from cross-attention maps or UNet feature maps and use their gradients as guidance, enabling edits such as moving objects, resizing them or changing their appearance (a code sketch of these properties follows this list).

  2. Position: the centroid of the cross-attention map of the word corresponding to the object.

  3. Shape: a binary mask obtained by thresholding that cross-attention map.

  4. Appearance: the mean of the UNet feature map multiplied by the mask above.

  5. Editing uses two trajectories, one generative or reconstruction trajectory and one editing trajectory; compute MSE losses between the shape and appearance of all objects that should stay unchanged, plus a loss for the requested edit, and use the gradient to guide the editing trajectory.

  6. Moving an object: MSE between the object's position and the desired position.

  7. Resizing: MSE between the object's shape and the desired shape.

  8. Changing appearance: MSE between the object's appearance and the desired appearance.
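
A minimal sketch of the position/shape/appearance properties described above, computed from a word's cross-attention map and a UNet feature map; the threshold and normalization are illustrative assumptions.

```python
import torch

def object_position(attn_map):
    """Centroid (y, x) of a (H, W) cross-attention map of the object's word."""
    attn = attn_map / attn_map.sum().clamp_min(1e-8)
    ys = torch.arange(attn.shape[0], dtype=attn.dtype)
    xs = torch.arange(attn.shape[1], dtype=attn.dtype)
    return torch.stack([(attn.sum(1) * ys).sum(), (attn.sum(0) * xs).sum()])

def object_shape(attn_map, thresh=0.5):
    """Binary mask from the min-max normalized, thresholded attention map."""
    a = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-8)
    return (a > thresh).float()

def object_appearance(feat, shape_mask):
    """Masked spatial mean of a (C, H, W) UNet feature map."""
    w = shape_mask[None]
    return (feat * w).sum(dim=(1, 2)) / w.sum().clamp_min(1e-8)
```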

 

Guide-and-Rescale (real image editing, no fine-tune)

Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing

Guide-and-Rescale

  1. Feed the latents from DDIM Inversion and from the editing process into the UNet, compute MSE losses between their self-attention maps and features, and use the gradient as guidance.

  2. Both loss computations use the source prompt, in order to keep the layout consistent.

 

AGG (real image editing)

Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance

  1. Combines the guidance of MCG and DDS; an arbitrary loss can steer sampling.

 

Asyrp (real image editing)

Diffusion Models Already Have a Semantic Latent Space

  1. λ_CLIP L_direction(P_t^edit, y^target; P_t^source, y^source) + λ_recon |P_t^edit − P_t^source|, optimizing only f_t(h) = Δh.

  2. Training: first DDIM-invert the dataset images in S_for steps up to T and cache the latents (reusable); then generate two trajectories from those latents, both for S_edit steps but only down to t_edit rather than 0: one is the original trajectory, the other is continually shifted, optimizing the loss above at every step, similar to DiffusionCLIP's GPU-efficient scheme.

 

Interpretable h-space (real image edting)

Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models

  1. Still operates in h-space, but sampling is no longer asymmetric: the original DDIM formula is used and h is modified inside the UNet, which suggests the proof in Asyrp is not needed; each generation step then requires only one UNet pass, which is more efficient.

  2. Unsupervised global: generate samples, store h_t for all timesteps, and for each t run PCA on the samples' h_t to obtain n principal directions {v_t^j}. Generating with ĥ_t = h_t + γ v_t^j edits any image; the j-th principal direction has the same semantics across timesteps.

  3. Unsupervised image-specific: e.g., an eyes-open/closed direction is meaningless for a face wearing sunglasses. Using a differential-geometry-style analysis of h-space, find the directions that change the output of ε_θ(x_t) the most. Although image-specific, a direction found on one image can also be applied to other samples.

  4. Supervised: use labeled pairs where the positive example has an attribute and the negative does not; averaging h_t(positive) − h_t(negative) gives an editing direction, but this entangles attributes since pairs never differ in exactly one attribute. With an orthogonalization-style procedure, each new direction is computed after removing the influence of all previously found directions, yielding disentangled editing directions.

 

ChatFace (real image editing)

ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation

Use a CLIP directional loss to train a network that predicts Δz for Diff-AE's latent z.

 

ZIP (real image editing)

Zero-Shot Inversion Process for Image Attribute Editing with Diffusion Models

ZIP

 

Self-Discovering (real image editing)

Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation

self-discovering

 

GANTASTIC (real image editing)

GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models

GANTASTIC

  1. Transfers interpretable directions already learned by StyleGAN to StableDiffusion, learning a CLIP text embedding d with two losses.

  2. L_latent = E_{t,ε}[ ||ε_θ(x_t, t, d) − ε_θ(x̃_t, t, d)||_2^2 ] over the paired original image x and its GAN-edited counterpart x̃, maximized so that d captures where the two differ the most, in the same spirit as SDS.

  3. L_sem = 1 − cossim(E_I(x̃), d) + cossim(E_I(x), d) enforces the semantics of d.

 

NoiseCLR (real image editing)

NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models

identify interpretable directions in text embedding space of text-to-image diffusion models

In noisy space, for edits carried out by the same direction to be attracted towards each other, while edits conducted by different directions to repel one another, in line with the core principles of contrastive learning.

NoiseCLR

 

StyleDis (real image editing, no fine-tune)

Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models

  1. StableDiffusion

  2. c^(0) is a description without the style and c^(1) is one with it. Learn a schedule λ_t such that, starting from the same x_T, the image x_0^λ generated with c_t = λ_t c^(0) + (1−λ_t) c^(1) keeps essentially the same content as the image x_0^(0) generated with c^(0) while carrying the style of c^(1): L_clip(x_0^(0), c^(0); x_0^λ, c^(1)) + β L_perc(x_0^(0), x_0^λ). Optimized like DiffusionCLIP, but only λ_t is trained; StableDiffusion is not fine-tuned.

  3. Once trained it can also be used for image editing, but only on real images that match c^(0): DDIM-invert to x_T and then generate with c_t as the condition.

 

SINE (real image editing, fine-tune)

SINE: SINgle Image Editing with Text-to-Image Diffusion Models

  1. StableDiffusion

  2. Like DreamBooth, fine-tune the pseudo-word embedding and StableDiffusion with the source image and a prompt containing the pseudo word; the model must be fine-tuned once per edited image.

  3. Patch-Based Fine-Tuning: suppose the StableDiffusion latent is p×p and the autoencoder input is sp×sp with s = 4 or 8. During fine-tuning, randomly sample a patch of the source image, resize it to sp×sp, and feed the positional encoding of that patch into StableDiffusion; this improves generalization and lets the model output images of arbitrary size. At editing time the positional information of the full source image is used.

  4. Editing uses model-based classifier-free guidance, treating the fine-tuned model as an unconditional model specialized to this single image: the conditional term of standard CFG ω ε_θ(z_t, c) + (1−ω) ε_θ(z_t) is replaced by a mixture, giving ω [v ε_θ(z_t, c) + (1−v) ε̂_θ(z_t, ĉ)] + (1−ω) ε_θ(z_t), where ε̂_θ is the fine-tuned model, ĉ is "a photo/painting of a [*] [class noun]" and c is the target prompt.

  5. No DDIM Inversion is needed.

 

SEGA (generated image editing, no fine-tune)

SEGA: Instructing Diffusion using Semantic Dimensions

  1. A linear combination of CFG terms.

 

DiffEdit (real image editing, no fine-tune)

DiffEdit: Diffusion-based Semantic Image Editing with Mask Guidance

  1. Blended Diffusion with an automatically computed mask.

  2. Feed the text-to-image model the source prompt and ϕ and estimate a mask from the difference of their noise predictions; encode the source image to an intermediate step with unconditional DDIM Inversion (input ϕ); then generate with the target prompt, blending with the mask at every step. A sketch of the mask estimation follows below.

  3. The paper proves that unconditional DDIM Inversion reconstructs better than adding noise directly in one SDE step.
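
A minimal sketch of the mask estimation from noise-prediction differences, following the note above (source prompt vs. the null prompt). `eps(x_t, t, emb)` and `encode_noisy(x0)` (which adds noise at a moderate timestep and returns `(x_t, t)`) are assumed helpers.

```python
import torch

@torch.no_grad()
def diffedit_mask(x0, eps, encode_noisy, src_emb, null_emb, n=10, thresh=0.5):
    """Average |eps(x_t, src) - eps(x_t, null)| over several noisings,
    normalize, and threshold into a binary edit mask."""
    diffs = []
    for _ in range(n):
        x_t, t = encode_noisy(x0)
        d = eps(x_t, t, src_emb) - eps(x_t, t, null_emb)
        diffs.append(d.abs().mean(dim=1))   # average over channels
    m = torch.stack(diffs).mean(0)
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)
    return (m > thresh).float()
```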

 

LIPE (real image editing, no fine-tune)

LIPE: Learning Personalized Identity Prior for Non-rigid Image Editing

LIPE-1

LIPE-2

  1. Editing task where a reference image of the object to be edited is available.

  2. First learn a pseudo word for the object via TI and build an identity-aware prompt from it. DDIM-invert the source image with the source prompt and record all latents; reconstruct with the source prompt plus the identity-aware prompt and estimate a mask from the pseudo word's cross-attention map at the last reconstruction step. Then, from the inverted noise, edit with the target prompt plus the identity-aware prompt; at every editing step estimate another mask from the pseudo word's cross-attention map, take the union of the two masks, and blend the latents accordingly: inside the mask use the edited latent, outside it use the inversion latent, keeping the background unchanged.

 

DM-Align (real image editing, no fine-tune)

DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images

DM-Align

  1. Computes the mask automatically and turns the task into inpainting.

 

FISEdit (real image editing, no fine-tune)

FISEdit: Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference

Computes a mask automatically like DiffEdit: manipulate the cross-attention maps as in P2P and compute a difference mask from the feature maps of the two generative trajectories to mark the region to edit.

 

InstDiffEdit (real image editing, no fine-tune)

Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks

Computes a mask automatically like DiffEdit: the cross-attention map of the target prompt's start token carries global semantics, so compute the similarity of every other token's cross-attention map to it, take the most similar token's map, and post-process it into a mask.

 

Diff-AE & PDAE

Diffusion Autoencoders: Toward a Meaningful and Decodable Representation

Unsupervised Representation Learning from Pre-trained Probabilistic Diffusion Models

Train an autoencoder, train a linear classifier in its latent space, and use the normal vector of the attribute hyperplane as the editing direction.

 

DiffEx

Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models

 

DisControlFace (real image editing)

DisControlFace: Disentangled Control for Personalized Facial Image Editing

Uses a pre-trained Diff-AE and trains an extra ControlNet to inject control signals. Because the pre-trained Diff-AE backbone already allows near-exact reconstruction, only limited gradients flow during back-propagation, far from sufficient to train the ControlNet effectively; the masked-autoencoding idea is therefore introduced: the Diff-AE input during training is a masked x_0, so the ControlNet is effectively trained to inpaint.

At sampling time, first estimate the control signals of the source image, optionally edit them, then generate, masking the source image with a dynamic masking strategy of ratio 0.75 − 0.5 × (T − t)/T, i.e., the input z differs at every step.

 

UFIE

User-friendly Image Editing with Minimal Text Input: Leveraging Captioning and Injection Techniques

Traditional editing methods such as P2P require the user to provide both source and target prompts; this work generates the source prompt with an off-the-shelf captioning model, so the user only needs to point out which concepts in it to change.

 

HIVE

HIVE: Harnessing Human Feedback for Instructional Visual Editing

  1. Train a StableDiffusion that denoises the target image conditioned on the source image and the target prompt.

  2. Introduce human feedback and fine-tune the model above with a learned reward function.

 

DialogPaint

DialogPaint: A Dialog-based Image Editing Model

  1. StableDiffusion

  2. multi-turn editing

 

EMILIE

Iterative Multi-granular Image Editing using Diffusion Models

  1. StableDiffusion

  2. multi-turn editing,在StableDiffusion的latent space上进行多轮编辑。

 

MLLM-Plan

GenArtist

GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

  1. Uses GPT-4 to orchestrate various editing methods.

  2. GPT-4 is a text-only MLLM, so it can only plan; it cannot generate images directly.

 

Ground-A-Score

Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing

Ground-A-Score

  1. A DDS-based method for complex editing requests.

  2. Use GPT-4V to decompose the request into regions, obtaining source prompts {x_k}_{k=1..n}, target prompts {y_k}_{k=1..n} and masks {m_k}_{k=1..n}, and update the source image with region-wise DDS: ∇_z L_DDS = Σ_{k=1}^{n} m_k ⊙ (ε_ϕ(z_t, t, y_k) − ε_ϕ(ẑ_t, t, x_k)).

  3. GPT-4V is a text-only MLLM, so it can only plan; it cannot generate images directly.

 

SANE

Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing

SANE

  1. 利用LLM将ambiguous instruction改写为多个specific instructions,利用IP2P模型组合多个instructions进行编辑。

 

DVP

Image Translation as Diffusion Visual Programmers

DVP

  1. The CFG strength is very sensitive: a small change produces a very different image, and tuning it per image is impractical. Inspired by instance normalization in style transfer, the paper proposes Instance Normalization Guidance: ε* = σ(ε_u) · conv((ε_u − μ(ε_c)) / σ(ε_c)) + μ(ε_u), where conv(·) is a 1×1 convolution. The main goal is to reduce the influence of ε_u, whose degrees of freedom are too high.

 

TIE

TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing

TIE

  1. 构造CoT数据fine-tune MLLM。

 

EVLM

EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing

  1. fine-tune VLM,根据reference image和text instruction,generates much more precise editing instructions.

 

MLLM-Feature

MGIE

Guiding Instruction-based Image Editing via Multimodal Large Language Models

MGIE

  1. Using the InstructPix2Pix dataset, an MLLM generates a new instruction from the image and the old instruction; several trainable [IMG] tokens are appended to the new instruction.

  2. Feed the old instruction, the source image and the new instruction into LLaVA and train it to generate the text part of the new instruction; meanwhile the features of the [IMG] tokens serve as the editing command and are fed, together with the source image, into a diffusion model that generates the target image. All trainable modules are trained jointly.

  3. LLaVA is a text-only MLLM and cannot generate images directly; here its encoding ability is exploited by training a diffusion decoder for its features.

 

CAFE

Customization Assistant for Text-to-image Generation

CAFE

  1. 和MGIE类似。

 

EmoEdit

EmoEdit: Evoking Emotions through Image Manipulation

EmoEdit

  1. 根据emotion生成instruction,使用预训练IP2P进行编辑。

 

RP2P

ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing

  1. We introduce ReasonPix2Pix, a dataset specifically tailored for instruction-based image editing with a focus on reasoning capabilities. 构造数据集时生成具有联想能力的instruction,比如使用the owner of the castle is a vampire代替make the castles dark.

  2. 原图和instruction输入MLLM,使用MLLM输出的feature和原图作为条件fine-tune StableDiffusion,生成目标图像。

 

Video

Frame2Frame

Pathways on the Image Manifold: Image Editing via Video Generation

Frame2Frame

 

 

Image Editing through Reference Image

Can be viewed as image-guided inpainting; see text-guided inpainting in the Inpainting section, with the condition changed from text to image.

 

ILVR

ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models

ILVR

  1. Latent Variable Refinement: match the low-pass-filtered features of the noisy latents to those of the reference image.

 

FGD

Filter-Guided Diffusion for Controllable Image Generation

  1. 类似ILVR,设计filter让生成样本与reference image在特定属性上一致。

 

PbE

Paint by Example: Exemplar-based Image Editing with Diffusion Models

PbE

  1. StableDiffusion

  2. Input: source image, mask and reference image; output: the source image with the masked region replaced by the reference image and harmonized. The overall architecture resembles text-guided image inpainting, treating the reference image like text: it is fed to StableDiffusion as the condition, the masked image concatenated with z_t is the input, StableDiffusion is retrained, and the diffusion loss is computed on the whole image.

  3. Self-supervised learning: train on an image dataset with bounding boxes, using the box region as the mask and its content as the reference image. The model easily overfits to simple copy-paste, so two remedies are proposed. Information Bottleneck: since the reference must be transplanted into the masked region, the model tends to memorize spatial information rather than understand context, so the reference is compressed: it is cropped, encoded with the CLIP image encoder, and the result serves as K/V in StableDiffusion's cross-attention. Strong Augmentation: the constructed data has a train-test domain gap (training references are crops of the source image, test references are unrelated), so training references are augmented (flips, rotations, blur, etc.); and since bounding boxes hug the objects tightly, which hurts generalization, the mask region is also augmented by fitting a Bessel curve to the box, sampling 20 points uniformly on it, and randomly extending each by 1-5 pixels.

  4. Blended sampling as in inpainting.

  5. Classifier-free guidance: with probability 20% replace the CLIP image-encoder output with a learnable vector; the guidance scale at sampling time controls the degree of fusion.

 

PbS

Reference-based Image Composition with Sketch via Structure-aware Diffusion Model

PbS

  1. On top of PbE, a sketch of the masked region is additionally provided as a condition (by concatenation), further improving controllability.

 

 

ControlCom

ControlCom: Controllable Image Composition using Diffusion Model

ControlCom

  1. 挖掉图像前景做自监督训练。

  2. 一个额外的indicator决定是否改变被挖出来的前景的illumination和pose,indicator也作为条件输入diffusion model进行训练。

 

MADD

Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion

MADD

  1. SAM + inpainting挖掉图像前景做自监督训练。

  2. The model supports flexible prompts p, which include points, bounding boxes, masks, and even null prompts.

 

IMPRINT

IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation

IMPRINT

  1. 使用multi-view数据集训练一个image encoder(主体DINOv2 + 小adapter,两者都参与训练),输入一个view的图像生成embedding序列,送入StableDiffusion,重构另一个view的图像。训练image encoder和StableDiffusion的decoder。

  2. 固定image encoder的主体部分,重新训练一个diffusion model,自监督训练,image encoder的adapter也参与训练。

 

DreamInpainter

DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models

  1. 两个条件,一个reference image,一个text,这样不仅可以将reference image填入mask,还能通过text进行控制,比如动作等。

  2. 之前的方法使用CLIP编码reference image,缺少了对细节的提取,这里使用预训练扩散模型UNet encoder编码reference image,时间步为0,取32×32×768的feature map,但如果直接使用该feature map作为条件会造成copy-paste过拟合,将其看成1024×768的向量,计算各像素之间cosine similarity,得到1024×1024的相似度矩阵,沿任一维求和,与其他像素相似度低的像素得分低,取得分最低的K个像素,得到K×768作为条件。

 

PhD

Paste, Inpaint and Harmonize via Denoising Subject-Driven Image Editing with Pre-Trained Diffusion Model

  1. 将exemplar去除背景,直接paste在目标区域,作为条件输入ControlNet进行类似PbE的self-supervised learning。

 

RefPaint

Reference-based Painterly Inpainting via Diffusion Crossing the Wild Reference Domain Gap

  1. 在Versatile Diffusion基础加了一个mask branch,reference image(训练时是被mask掉的部分)做context flow,masked image做mask branch,进行self-supervised的inpainting训练。

 

ObjectStitch

ObjectStitch: Generative Object Compositing

  1. 用的是pre-trained text2img diffusion model,由于给的是object图片而不是text,所以需要一个模块将object图片转换为text embedding,即content adaptor,类似TI:使用训练好的CLIP和大规模image-caption数据训练一个content adaptor,content adaptor将CLIP的image embedding映射到text embedding空间,得到translated embedding,然后让它尽量靠近CLIP的text embedding。训练好之后再用pre-trained text2img diffusion model和textual inversion方法fine-tune content adaptor。

  2. 固定content adaptor,fine-tune pre-trained text2img diffusion model。

  3. 类似inpainting的blended采样,diffusion model只输入translated embedding。

 

LogoSticker

LogoSticker: Inserting Logos into Diffusion Models for Customized Generation

LogoSticker

  1. 基于TI。

 

AnyDoor

AnyDoor: Zero-shot Object-level Image Customization

AnyDoor-1

AnyDoor-2 AnyDoor-3
  1. Use DINOv2 to extract the object's ID tokens, both the global token (1×1536) and the patch tokens (256×1536); concatenate them and map them with a linear layer to 257×1024, which replaces the text embedding in cross-attention and represents the object's global features. Since these lose detail, high-pass filtering extracts the object's detail features, which are pasted at the target location in the source image and fed to the Detail Extractor (a ControlNet-style branch); the two are complementary. The UNet decoder is fine-tuned jointly.

  2. Earlier image-based self-supervised training, even with augmentation, lacks diversity, so video data is used instead: sample two frames of the same scene, extract the object from one frame as the target object and use the other frame as the target image.

 

Bifrost

BIFRÖST: 3D-Aware Image compositing with Language Instructions

  1. 类似AnyDoor,额外加上了language instruction作为条件。

 

AnyLogo: Symbiotic Subject-Driven Diffusion System with Gemini Status

 

LAR-Gen

Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance

LAR-Gen

  1. Locate:StableInpainting

  2. Assign:IP-Adapter

  3. 第一阶段:StableInpainting+IP-Adapter训练Diffusion UNet。

  4. 第二阶段:把第一阶段训练好的Diffusion UNet复制出一个RefineNet,RefineNet UNet decoder的self-attention前的feature送入Diffusion UNet,与对应的feature concat在一起进行self-attention,只训练RefineNet的image cross-attention。

  5. self-supervised learning,训练时subject image是从scene image中挖出来的,使用LLaVA生成subject image的caption作为text。

  6. blended采样。

 

PAIR-Diffusion

PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models

  1. 使用预训练模型提取图像的segmentation map作为图像的structure特征,再使用一个预训练的图像编码器编码图像,提取浅层feature map,取segmentation map中每个segment对应区域的feature map的spatial pool作为该segment的appearance特征,两者作为条件训练diffusion model。

  2. structure编辑:对分割图进行编辑(比如改变某个object的形状、去掉某个object)

  3. appearance编辑:提供一张reference image,用其全图的或者其中某个object的appearance特征替换某个segment对应的appearance特征,进行生成。

  4. 注意,编辑时不需要DDIM Inversion,直接根据条件从噪声开始生成即可。但毕竟structure和appearance不包含图像全部特征,所以未编辑部分会有一些变化。但编辑时可以对未编辑的segment进行mask,类似inpainting的blended采样。

 

CustomNet

CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models

  1. 使用SAM提取出原图中的object和background,估算出object的viewpoints,使用zero-1-to-3生成一个随机viewpoints的novel view object,训练一个diffusion model,novel view object、background和viewpoints作为条件,预测原图。

  2. 生成时,可以指定object的角度、在图像中的位置以及背景。

 

Custom-Edit

Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models

  1. Given an image and a few reference images, replace an object in the image with the concept from the references.

  2. Extract the concept from the reference images into a pseudo word with the Custom-Diffusion method.

  3. Real-image editing with Prompt2Prompt + Null-text Inversion, replacing the object's word in the prompt with the pseudo word.

 

DreamEdit

DreamEdit: Subject-driven Image Editing

  1. 和CustomEdit一样,但是基于mask的,DreamBooth做完TI后,做text-guided inpainting采样(blended)。

 

DreamCom

DreamCom: Finetuning Text-guided Inpainting Model for Image Composition

  1. self-supervised learning,给定3~5张reference images,每张都有bounding box (mask)标注其中物体,将mask和masked image concat在zt上,使用一个稀有的单词造句(如a sks cat)进行TI,同时fine-tune StableDiffusion。

  2. 生成时,给定背景图和想要object出现的位置的bounding box (mask),使用上述句子进行生成。

 

SpecRef

SpecRef: A Fast Training-free Baseline of Specific Reference-Condition Real Image Editing

  1. 有reference image的P2P。

SpecRef-1

SpecRef-2

 

VisCtrl

Tuning-Free Visual Customization via View Iterative Self-Attention Control

VisCtrl

  1. CLiC的无pseudo word版本,直接使用self-attention KV注入实现concept替换。

 

FreeEdit

FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

FreeEdit-1 FreeEdit-2

 

  1. 构造数据集训练。

 

Unconstrained

Thinking Outside the BBox: Unconstrained Generative Object Compositing

  1. 准备自监督训练数据,对背景图进行inpainting,训练时有50%概率不使用mask,这样既可以做mask-free也可以做mask-based object stitch。

  2. 提取mask时,还提取了object的shadow mask和reflection mask,使得模型在object stitch的同时可以生成影子。

 

Try-On

TryOnDiffusion

TryOnDiffusion: A Tale of Two UNets

TryOnDiffusion

  1. cascade模式

  2. 使用Parallel UNet是为了解决channel-wise concatenation效果不行的问题,所以改用cross-attention机制,绿线代表将feature当成KV送入主UNet。

 

StableVITON

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

StableVITON

  1. 使用PbE的StableDiffusion,其cross-attention是接收CLIP image embedding过一个MLP的结果。

  2. CLIP image embedding丢了很多信息,所以在decoder block之间再插入一个zero cross-attention block引入细节。

  3. 在text cross-attention里,某个word对应的cross-attention map是这个物体的大致轮廓,但是在zero cross-attention block里是image cross-attention,query里衣服上某个image token对应的cross-attention map应该是key中同样位置的image token,而非整个衣服区域,所以cross-attention map应该是尽量集中于一点的,所以额外使用了一个attention total variation loss, which is designed to enforce the center coordinates on the attention map uniformly distributed, thereby alleviating interference among attention scores located at dispersed positions. 即让query里不同image token对应的cross-attention map差异尽量大。

 

TED-VITON

TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On

TED-VITON

  1. MMDiT

 

MMTryon

MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation

MMTryon

  1. StableDiffusion的cross-attention换为Multi-Modal Attention block,self-attention换为Multi-Reference Attention block。

 

TryOn-Adapter

TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On

TryOn-Adapter

Try-On-Adapter

Try-On-Adapter: A Simple and Flexible Try-On Paradigm

Try-On-Adapter

 

PLTON

Product-Level Try-on: Characteristics-preserving Try-on with Realistic Clothes Shading and Wrinkles

PLTON

  1. 类似StableVITON,使用PbE的StableDiffusion,其cross-attention是接收CLIP image embedding过一个MLP的结果,Dynamic Extractor使用CLIP image encoder编码图像,但是之后的MLP是可训练的。

  2. HF-Map输入一个可训练的ControlNet。

 

StableGarment

StableGarment: Garment-Centric Generation via Stable Diffusion

StableGarment

 

GarDiff

Improving Virtual Try-On with Garment-focused Diffusion Models

GarDiff

 

DTC

Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All

DTC

  1. Paint by Example是重新训练整个conditional StableDiffusion,这里改用ControlNet架构。

 

IDM-VTON

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

IDM-VTON

 

Wear-Any-Way

Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment

Wear-Any-Way

  1. 利用semantic correspondecce,分别将穿着garment的person图像和garment图像输入同一个StableDiffusion,提取feature,计算相似性,可以得到correspondecce作为监督数据,这样生成时可以指定衣服的穿着方式,比如衣角扬起等。

 

FIA-VTON

Learning Implicit Features with Flow Infused Attention for Realistic Virtual Try-On

FIA-VTON

 

BootComp

Controllable Human Image Generation with Personalized Multi-Garments

BootComp

 

AnyFit

AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario

AnyFit

 

AnyDressing

AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

AnyDressing

 

TPD

Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On

TPD

 

FLDM-VTON

FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on

FLDM-VTON

  1. 使用额外的off-the-shelf clothes flattening network进行监督。

 

M&M-VTO

M&M VTO: Multi-Garment Virtual Try-On and Editing

M&M-VTO

 

ShoeModel

ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model

ShoeModel-1 ShoeModel-2
  1. 构造数据自监督训练。

 

BooW-VTON

BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training

BooW-VTON

 

Face

FaceStudio

FaceStudio: Put Your Face Everywhere in Seconds

  1. 人脸挖出来,自监督训练。

 

HS-Diffusion

HS-Diffusion: Semantic-Mixing Diffusion for Head Swapping

  1. 换头,预训练模型进行blended inpainting生成。

 

EmojiDiff

EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

EmojiDiff

 

Stable-Makeup

Stable-Makeup: When Real-World Makeup Transfer Meets Diffusion Model

Stable-Makeup

  1. 使用ChatGPT生成不同makeup style的prompt,使用LEDITS对没有makeup的人脸图像进行编辑,生成带makeup的人脸图像,监督训练。

  2. 类似IP-Adapter,将CLIP提取的global token加patch tokens送入cross-attention。

 

SHMT

SHMT: Self-supervised Hierarchical Makeup Transfer via Latent Diffusion Models

Makeup

 

Stable-Hair

Stable-Hair: Real-World Hair Transfer via Diffusion Model

Stable-Hair-1 Stable-Hair-2
  1. 造数据:要transfer什么就把什么留下,对其它部分进行inpainting。

 

MLLM

可以实现多种task,如text2img generation,personalization,editing等。

 

BLIP-Diffusion

BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

BLIP-Diffusion

  1. 利用BLIP的方法,先使用大规模image-text数据预训练一个multimodal image encoder,可以从image中提取text-aligned特征。

  2. 给定subject image和subject text,输入multimodal image encoder,得到subject image的特征,再训练一个MLP将其转化为text embedding。之后利用subject image构造训练image(如替换背景等)和对应的prompt,将subject image特征转化后的text embedding接在prompt之后,输入text encoder,输出再输入StableDiffuion进行训练。multimodal image encoder、MLP、text encoder和StableDiffuion一起训练。

  3. 给定subject image、subject text和prompt就能生成,不需要test-time fine-tune了。

 

UNIMO-G

UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

UNIMO-G

  1. 自监督训练:先用caption模型得到图像的caption,再用Grounding DINO和SAM得到caption中的object的图像,将caption中的object word替换为图像,得到interleaved数据集,输入预训练MLLM进行编码,编码结果(所有token的last hidden layer的输出)送入StableDiffusion重构图像,只训练StableDiffusion。

  2. 因为MLLM输入中可能包含image entity,为了让生成结果更好地保持image entity的细节,在StableDiffusion的cross-attention增加zt和image entity的cross-attention,和TokenCompose一样对cross-attention map使用segmentation map进行监督(自监督训练,所以segmentation map是已知的),一方面可以提升训练效果,另一方面可以在推理时指定位置。

 

Kosmos-G

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Kosmos-G

Kosmos-G-AlignerNet

  1. MLLM:使用CLIP提取image embedding,use attentive pooling mechanism to reduce the number of image embeddings,用interleaved text image数据进行next-token prediction训练MLLM和CLIP的最后一层,只在text token上算loss,类似Emu2的caption阶段。

  2. AlignerNet:为了直接使用StableDiffusion(不需要训练)进行生成,训练一个AlignerNet,将Kosmos-G的输出转换到CLIP text embedding的domain,训练时只给一个text,分别使用Kosmos-G(所有token的last hidden layer的输出)和CLIP text encoder编码,得到st,训练一个Q-Former M,生成结果M(s)t计算MSE loss;为了防止the reduction in feature discrimination,再训练一个Q-Former N,生成结果N(M(s))s计算MSE loss,两个Q-Former一起训练。

  3. We can also align MLLM with Kosmos-G through directly using diffusion loss with the help of AlignerNet. While it is more costly and leads to worse performance under the same GPU days.

 

Emu2

Generative Multimodal Models are In-Context Learners

Emu2

  1. caption:使用CLIP提取image embedding,use mean pooling mechanism to reduce the number of image embeddings,用interleaved text image数据进行next-token prediction训练CLIP(注意不是MLLM),只在text token上算loss,该阶段的目的是得到一个image encoder。

  2. caption+regression:固定image encoder,用interleaved text image数据进行next-token prediction训练MLLM,在text token上算分类loss,在image feature上算regression loss。

  3. StableDiffusion:训练StableDiffusion对image encoder的编码结果进行解码。

 

GILL

Generating images with multimodal language models

GILL-1

GILL-2

  1. caption:类似LLaVa,用image-caption数据进行next-token prediction训练一个projection layer,只在text token上算loss。

  2. producing image:给LLM的embedding层加r个可训练的image token embedding,把caption+image输入,image统一用这r个image token embedding代替,回归这r个image token embedding,这里只训练这r个image token embedding。没看懂,根本没有引入image。

  3. 类似Kosmos-G,训练一个Q-Former将这r个image token embedding转化为CLIP text embedding,与caption的CLIP text embedding计算MSE loss。

 

UniReal

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

UniReal

  1. Transfusion的多任务版本。

 

Image Editing through Point-based Supervision

Self-Guidance

Diffusion Self-Guidance for Controllable Image Generation

 

DragDiffusion

DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing

DragDiffusion-1

DragDiffusion-2 DragDiffusion-3

 

  1. StableDiffusion

  2. First LoRA-fine-tune StableDiffusion on the image to be edited, then DDIM-invert the image to z_t at some timestep t, use the output feature map of the third UNet decoder layer for motion supervision and point tracking, repeatedly optimize z_t by gradient descent, and finally denoise from the updated ẑ_t with DDIM to generate the edited sample. A sketch follows below.
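
A minimal sketch of one motion-supervision step and nearest-neighbor point tracking on a UNet feature map; `unet_features(z_t)` is an assumed helper returning a (C, H, W) feature map differentiable w.r.t. z_t, and `opt` is an optimizer over [z_t]. Hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def motion_supervision_step(z_t, handles, targets, unet_features, opt, r=3):
    """Pull the feature patch one unit along each handle-to-target direction
    toward the detached patch at the current handle, dragging the content."""
    feat = unet_features(z_t)
    loss = 0.0
    for (hy, hx), (ty, tx) in zip(handles, targets):
        d = torch.tensor([ty - hy, tx - hx], dtype=torch.float32)
        dy, dx = [int(v) for v in (d / d.norm().clamp_min(1e-8)).round()]
        cur = feat[:, hy - r:hy + r + 1, hx - r:hx + r + 1].detach()
        moved = feat[:, hy + dy - r:hy + dy + r + 1, hx + dx - r:hx + dx + r + 1]
        loss = loss + F.l1_loss(moved, cur)
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)

@torch.no_grad()
def track_point(feat_orig, feat_new, point, r=5):
    """Find, in a local window of the new feature map, the pixel whose feature
    is nearest to the original handle-point feature."""
    py, px = point
    ref = feat_orig[:, py, px]
    win = feat_new[:, py - r:py + r + 1, px - r:px + r + 1]
    dist = (win - ref[:, None, None]).pow(2).sum(0)
    idx = int(torch.argmin(dist))
    return py - r + idx // win.shape[-1], px - r + idx % win.shape[-1]
```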

 

CLIPDrag

Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing

  1. Similar to DragDiffusion, but adds an extra CLIP directional loss to the motion supervision of Eq. (4), using text to improve drag editing.

  2. The gradient of L_ms gives G_l and the gradient of L_clip gives G_g; compute their cosine similarity. If it is positive, the two are consistent, so add G_g to G_l; if negative, G_l is about to break the image semantics, so subtract from G_l the component of G_g along G_l.

 

GoodDrag

GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models

GoodDrag

  1. DragDiffusion performs all motion supervision and point tracking on z_t before denoising, which can push z_t out of distribution; GoodDrag alternates dragging with denoising so that denoising pulls z_t back into the domain.

 

DragText

DRAGTEXT: Rethinking Text Embedding in Point-based Image Editing

DragText

  1. Computing F_q(z_t) in DragDiffusion requires the text as input, so during motion supervision L_ms can optimize not only z_t but also the text embedding to improve results, somewhat like TI.

 

DragNoise

Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation

DragNoise

  1. Like DragDiffusion: LoRA-fine-tune StableDiffusion on the image, DDIM-invert to z_t at some timestep t, use the output feature map of the third UNet decoder layer for motion supervision and point tracking, repeatedly optimize z_t, and finally denoise from the updated ẑ_t with DDIM.

  2. We observe a forgetting issue where subsequent denoising steps tend to overlook the manipulation effect when diffusion semantic optimization is performed at only one timestep. Since propagating the bottleneck feature to later timesteps does not significantly change the overall semantics, the optimized bottleneck feature ŝ_t is copied and substituted at the subsequent timesteps.

 

AdaptiveDrag

AdaptiveDrag: Semantic-Driven Dragging on Diffusion-Based Image Editing

AdaptiveDrag

 

EasyDrag

EasyDrag: Efficient Point-based Manipulation on Diffusion Models

EasyDrag

  1. No LoRA fine-tuning: DDIM-invert the image to z_t, then use the output feature maps of the second and third UNet decoder layers, upsampled to the size of z_t and concatenated, for motion supervision.

  2. During motion supervision, EasyDrag always uses the features near the original points in the feature map of the original image's z_t as targets, whereas DragDiffusion uses the features near the dragged points in the feature map of the z_t from the previous optimization round.

  3. Reference guidance replaces the self-attention K/V with those computed from the z_t of DDIM Inversion.

 

StableDrag

StableDrag: Stable Dragging for Point-based Image Editing

StableDrag-1

StableDrag-2

  1. For point tracking, besides the traditional training-free difference-based scheme, a trainable tracking model is used: a learnable 1×1 convolution kernel trained only on a local patch of the source image centered at the user-specified starting point, and then used throughout motion supervision and point tracking. Convolving the kernel over the local patch yields a score map of the same size; the ground truth is a Gaussian-shaped score map centered at the starting point, and the kernel is optimized with the MSE between the two maps.

  2. In long-range drags the image content inevitably changes a lot and the point feature changes too, so forcing it to match the original starting-point feature is unreasonable: one should ensure high-quality and comprehensive supervision at each step while allowing suitable modifications to accommodate novel content for the updated states. A confidence score is therefore computed from the point-tracking result: when it is high, the previous step's point feature supervises the latent; when it is low, the original image's starting-point feature is used.

 

FreeDrag

FreeDrag: Feature Dragging for Reliable Point-based Image Editing

  1. Feature dragging: earlier methods' point dragging sums point-to-point losses over a region; feature dragging instead aggregates features over a region, F_r(h_i^k) = Σ_{q_i ∈ Ω(h_i^k, r)} F(q_i), and tracks the aggregate with L_drag = Σ_{i=1}^{n} ||F_r(h_i^k) − T_i^k||_1, where T_i^{k+1} = λ_i^k F_r(h_i^k) + (1 − λ_i^k) T_i^k is an adaptively updated template feature with T_i^0 = F_r(p_i^0) and λ_i^0 = 0. λ_i^k depends on how small L_drag is after optimization: if L_drag remains large, λ_i^k is set small to limit the change of T_i^{k+1}; if L_drag is small, λ_i^k is set large to allow more change, in the same spirit as StableDrag.

  2. Line search with backtracking: constrain h_i^k to the line extending from p_i^0 to t_i.

 

DragonDiffusion

DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models

  1. StableDiffusion

  2. 从DIFT获取灵感,模型输出的feature具有correspondence性质,相同物体对应区域的feature具有很高的相似性。

  3. 类似P2P+self-guidance,两条并行的generative trajectory,一条是reconstruction,一条是editing,用各自第2,3层的输出feature(self-guidance是用attention)计算loss(原区域和目标区域的feature的相似度),求梯度作为guidance。

  4. 将editing generative trajectory的UNet decoder的self-attention的key-value替换为reconstruction generative trajectory的。

 

DiffEditor

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

  1. An improved DragonDiffusion.

  2. First train an image prompt encoder on LAION: encode the image with the pre-trained CLIP image encoder into a length-257 embedding sequence, feed it as K/V into a Q-Former that outputs a length-64 sequence, which goes into StableDiffusion's cross-attention; only the Q-Former is trained. During editing, using the source image's image prompt on the editing trajectory improves results.

  3. The authors find that starting DragonDiffusion from a randomly initialized rather than DDIM-inverted z_T edits better but also changes unrelated details, again illustrating the consistency vs. editing-flexibility dilemma; so a small amount of stochasticity (σ_t > 0) is introduced during part of the DDIM generation.

  4. RePaint's resampling technique is used: after producing z_{t−1} from z_t, add noise back to z_t and repeat, avoiding disharmony caused by a single inaccurate step. Previous resampling adds random noise, introducing uncertainty; here DDIM inversion is used for deterministic re-noising.

 

LucidDrag

Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner

LucidDrag

  1. editing guidance就是DragonDiffusion的guidance。

 

Pixel-wise-Guidance

Fine-grained Image Editing by Pixel-wise Guidance Using Diffusion Models

  1. Similar to SKG and Late-Constraint.

  2. On a segmentation dataset {x_0, y}, feed x_t to the pre-trained diffusion model and train a semantic segmentation model on the UNet feature maps.

  3. Edit the image's segmentation map and compute a mask from the edit, so this counts as a mask-based method.

  4. At editing time, DDIM-invert to an intermediate step and regenerate, feeding the UNet feature maps during generation to the segmentation model; compute the loss between its output and the edited segmentation map and use the gradient as guidance.

 

RegionDrag

RegionDrag: Fast Region-Based Image Editing with Diffusion Models

RegionDrag

  1. Point-Based Drag有两个缺点,一是语义不明确,只给一个起点和一个终点,合理的但语义不同的编辑结果可能有很多,二是过程过于复杂,所以提出Region-Based Drag。

 

Readout-Guidance

Readout Guidance: Learning Control from Diffusion Features

 

SDE-Drag

The Blessing of Randomness: SDE beats ODE in General Diffusion-based Image Editing

  1. The method is essentially CycleDiffusion.

  2. unified framework

SDE-Drag-1

The first stage initially produces an intermediate latent variable xt0 through either a noise-adding process (SDEdit) or an inversion process (DiffEdit). Then, the latent variable xt0 is manipulated manuall or transferred to a different data domain by changing the condition in a task-specific manner, resulting in x^t0.

The second stage starts from x^t0 and produces the edited image x^0 following either an ODE solver, an SDE Solver, or a Cycle-SDE process.

We show that the additional noise in the SDE formulation (including both the original SDE and Cycle-SDE) provides a way to reduce the gap caused by mismatched prior distributions (between p(x^t0) and p(xt0)), while the gap remains invariant in the ODE formulation, suggesting the blessing of randomness in diffusion-based image editing.

The manipulated x̂_{t_0} has a distribution that deviates from that of x_{t_0}; with stochasticity this deviation shrinks during generation, whereas without stochasticity (i.e., the ODE) it remains unchanged.

  1. Drag

SDE-Drag-2

When the target point is far from the source point, it is challenging to drag the content in a single operation. To this end, we divide the process of Drag-SDE into m steps along the segment joining the two points and each one moves an equal distance sequentially.

 

RotationDrag

RotationDrag: Point-based Image Editing with Rotated Diffusion Features

the point-based editing method under rotation scenario

 

Motion-Guidance

Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

  1. Use a differentiable off-the-shelf optical-flow estimator to estimate the flow between the x̂_0 predicted at each step and the source image, compute a loss against the user-provided flow, and use the gradient as guidance.

  2. Estimate a mask from the user-provided flow and generate in a blended manner.

  3. Uses RePaint's resampling technique.

 

Magic-Fixup

Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos

Magic-Fixup-1

Magic-Fixup-2

  1. Both the detail extractor and the synthesizer are initialized from StableDiffusion with cross-attention removed and the input blocks extended; both are trained.

  2. Equivalent to adding a cross-attention after the synthesizer's self-attention, with Q from itself and K/V from the detail extractor's features before its self-attention.

  3. Like AnyDoor, training data is built from videos.

  4. Generation starts from √(ᾱ_t) I_coarse + √(1−ᾱ_t) ε; starting from a standard Gaussian works worse.

 

SceneDiffusion

Move Anything with Layered Scene Diffusion

  1. 类似Locally-Conditioned-Diffusion和MultiDiffusion,给定图像和其对应的layout,可以通过对layout进行移动从而实现对物体的移动。

  2. 除了移动,增删layout可以实现物体的增删,调整layout图层顺序可以实现物体的前后调整。

 

InstantDrag

InstantDrag: Improving Interactivity in Drag-based Image Editing

  1. 两个模型:a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion).

  2. InstantDrag learns motion dynamics for drag-based image editing in real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation.

  3. FlowGen:根据drag生成光流图。

  4. FlowDiffusion:使用视频数据集自监督训练一个根据当前帧和光流生成目标帧的模型。

 

Model Editing / Concept Removal / Unlearning

TIME

TIME: Editing Implicit Assumptions in Text-to-Image Diffusion Models

  1. When the prompt is underspecified, the model generates according to implicit assumptions, e.g., roses are red, doctors are male. This method edits such implicit assumptions (edits, not removes), e.g., changing "roses are red" to "roses are blue", so that future prompts containing roses default to blue roses.

  2. New K/V projection matrices are trained for all cross-attention layers so that the new matrix times "rose" is close to the original matrix times "blue rose"; the new matrices thus map "rose" by default onto the original model's projection of "blue rose". A sketch follows below.
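
A minimal sketch of a closed-form projection edit of this type (TIME/UCE use closed forms of this flavor; the exact regularization and preservation terms differ in the papers). `W` is one K or V projection, and `c_src` / `c_dst` are stacked token embeddings of the source and destination concepts; `lam` is an assumed regularization weight.

```python
import torch

def edit_projection(W, c_src, c_dst, lam=0.1):
    """Find W' minimizing ||W' C_src - W C_dst||^2 + lam ||W' - W||^2, so that
    source-concept embeddings project onto the original projections of the
    destination concept while staying close to the original weights.
    W: (d_out, d_text); c_src, c_dst: (d_text, n)."""
    d = c_src.shape[0]
    target = W @ c_dst                         # desired outputs for the source tokens
    A = c_src @ c_src.T + lam * torch.eye(d)
    B = target @ c_src.T + lam * W
    return B @ torch.linalg.inv(A)
```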

 

UCE

Unified Concept Editing in Diffusion Models

UCE

  1. Like TIME, modifies all cross-attention K/V projection matrices with a closed-form solution.

 

RECE

Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models

  1. 改进UCE的公式。

 

MACE

MACE: Mass Concept Erasure in Diffusion Models

MACE

  1. 最后的融合多个LoRA成一个LoRA的方法类似Mix-of-Show中的方法。

 

SLD

Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models

SLD

 

ESD

Erasing Concepts from Diffusion Models

  1. fine-tune StableDiffusion

  2. Reverse editing: erase from the image the content associated with the text.

  3. Uses classifier(-free) guidance in reverse: fine-tune the model so that its predicted noise matches the pre-trained model's prediction guided away from the concept (i.e., guidance with a negative scale).

ESD
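A minimal sketch of this reversed-guidance objective, assuming a frozen `teacher` and trainable `student` that share the signature `model(x_t, t, cond=...)` (an assumption, not the released code).

```python
import torch

def esd_style_loss(student, teacher, x_t, t, c_erase, eta=1.0):
    """Push the fine-tuned model's prediction for the erased concept toward the
    teacher's negatively-guided (reversed CFG) prediction."""
    with torch.no_grad():
        eps_uncond = teacher(x_t, t, cond=None)
        eps_cond = teacher(x_t, t, cond=c_erase)
        target = eps_uncond - eta * (eps_cond - eps_uncond)  # guidance with a negative sign
    return torch.nn.functional.mse_loss(student(x_t, t, cond=c_erase), target)
```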

 

AC

Ablating Concepts in Text-to-Image Diffusion Models

  1. Makes StableDiffusion forget a concept: e.g., for prompts containing "in the style of Van Gogh", the model ignores "Van Gogh" and generates an ordinary style.

  2. Construct prompts c* containing "in the style of Van Gogh"; removing that phrase gives the corresponding anchor prompt c. Images generated with c serve as training data for fine-tuning StableDiffusion: for the same input x_t, the output conditioned on c* is pushed toward the output conditioned on c, with gradients disabled on the c branch.

 

Unlearning

Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient

Unlearning

  1. Adversarial training: make the noise predicted for "grumpy cat" indistinguishable from that for "cat", so the modified model treats "grumpy cat" as "cat" and ignores "grumpy".

 

FMN

Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models

  1. For concepts StableDiffusion should forget, collect some reference images and construct prompts containing the concept, then fine-tune the whole model; the loss is the sum of squared responses of the concept's cross-attention maps at all cross-attention layers.

  2. Note that no diffusion loss is needed during fine-tuning.

 

PCE

Pruning for Robust Concept Erasing in Diffusion Models

  1. Our method selectively prunes critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons. Our method can be easily integrated with existing concept-erasing techniques, offering a robust improvement against adversarial inputs.

  2. stage 1: We use a numerical criterion to identify concept neurons.

  3. stage 2: We validate concept neurons are sensitive to adversarial prompts.

 

ConceptPrune

ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning

  1. We first identify critical regions within pre-trained models responsible for generating undesirable concepts, thereby facilitating straightforward concept unlearning via weight pruning.

 

EIUP

EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts

EIUP

  1. With an original prompt "a girl" and an erasure prompt "naked", the erasure prompt's cross-attention map is injected into the original prompt's cross-attention map and suppressed.

 

Prompt-Tuning-Erase

Removing Undesirable Concepts in Text-to-Image Generative Models with Learnable Prompts

  1. Learns a prompt embedding that can be concatenated after the CLIP text embedding and fed into cross-attention.

  2. EM-like alternation between the prompt embedding p_k and StableDiffusion ε_{θ_k}; let ε_θ be the original model and c_e a prompt containing the concept to erase. To update p_k, minimize ||ε_{θ_k}(c_e, p) − ε_θ(c_e)||, transferring the knowledge of c_e into p: since θ_k has been optimized to forget c_e, p must absorb that knowledge to lower the loss, and one update gives p_{k+1}. To update θ_k, use two losses: ||ε_{θ_k}(c_e) − ε_θ(∅)|| makes the model forget c_e, and ||ε_{θ_k}(c_e, p) − ε_θ(∅)|| acts as a regularizer; one update gives θ_{k+1}. Iterate until convergence.

 

SuppressEOT

Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models

SuppressEOT-1

SuppressEOT-2

  1. Targets the case where prompts of the form "... without xxx" still generate images containing "xxx".

  2. Zeroing the "xxx" embeddings or the EOT embeddings alone does not solve the problem; only zeroing both works. The EOT embeddings are also close to one another.

  3. Apply SVD to the matrix formed by the "xxx" and EOT embeddings (of size (N−|p|−1)×768); the main singular values correspond to the suppressed information (the negative target), so those singular values are suppressed, the matrix is reconstructed, and generation proceeds.

 

SepCE4MU

Separable Multi-Concept Erasure from Diffusion Models

 

AbO

All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models

 

Geom-Erasing

Geom-Erasing: Geometry-Driven Removal of Implicit Concept in Diffusion Models

Uses image-text pairs containing QR codes, watermarks, or text overlays; the locations of these artifacts are appended to the text, and StableDiffusion is fine-tuned on this data. At generation time, using only the original text avoids generating QR codes, watermarks, or text overlays.

Geom-Erasing

 

Ring-A-Bell

Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?

 

Diff-QuickFix

Localizing and Editing Knowledge in Text-to-Image Generative Models

Knowledge of different attributes (objects, style, color, action) resides in different UNet blocks; only the blocks corresponding to the attribute of the concept to be edited or ablated are fine-tuned.

 

EraseDiff

EraseDiff: Erasing Data Influence in Diffusion Models

During training, data to be forgotten are noised with a non-Gaussian distribution, so sampling never reproduces them.

 

TV

Robust Concept Erasure Using Task Vectors

 

Editioning

Training-free Editioning of Text-to-Image Models

  1. The opposite of erasing: restricts the model to generating a single concept.

 

Position

Position: Towards Implicit Prompt For Text-To-Image Models

  1. After erasing a concept, users can still generate it through implicit prompts: e.g., after erasing "Eiffel Tower", the prompt "Located in France, an iconic iron lattice tower, symbolizing the romance of Paris and French engineering prowess." still generates it.

  2. Proposes a benchmark for this problem but no solution.

 

Six-CD

Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models

  1. Existing research faces several challenges: (1) a lack of consistent comparisons on a comprehensive dataset, (2) ineffective prompts in harmful and nudity concepts, (3) overlooked evaluation of the ability to generate the benign part within prompts containing malicious concepts. To address these gaps, we propose to benchmark the concept removal methods by introducing a new dataset, Six-CD, along with a novel evaluation metric.

 

DUO

Direct Unlearning Optimization for Robust and Safe Text-to-Image Models

  1. Unlearning targeted at images rather than prompts: prompt unlearning only prevents specific prompts from triggering the content, while the model retains the ability to generate it; image unlearning removes that ability directly.

  2. For a given prompt, collect retain and forget samples and optimize the diffusion model with Diffusion-DPO.

 

Meta-Unlearnings

Meta-Unlearning on Diffusion Models Preventing Relearning Unlearned Concepts

 

RGD

Boosting Alignment for Post-Unlearning Text-to-Image Generative Models

 

CAD

Unveiling Concept Attribution in Diffusion Models

 

Image-to-Image Translation

SDEdit (no fine-tune)

SDEdit: Image Synthesis and Editing with Stochastic Differential Equations

  1. Requires a diffusion model trained on the target domain: the source image is noised to an intermediate step and then denoised with the target-domain model (see the sketch below).
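A minimal sketch of the SDEdit procedure, assuming a generic `sampler(model, x, t_start)` that runs the reverse process from step `t_start` (the sampler interface is an assumption).

```python
import torch

@torch.no_grad()
def sdedit(x_src, model, sampler, alpha_bar, t0):
    """Perturb the guide image to an intermediate step t0, then denoise it."""
    noise = torch.randn_like(x_src)
    x_t0 = alpha_bar[t0].sqrt() * x_src + (1 - alpha_bar[t0]).sqrt() * noise
    return sampler(model, x_t0, t_start=t0)
```

A smaller t0 preserves more of the guide's structure; a larger t0 gives more realism/faithfulness to the target domain.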

 

Inversion-by-Inversion (no fine-tune)

Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training

  1. two-stage SDEdit

 

UNIT-DDPM (retrain)

UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models

UNIT-inference

  1. unpaired

  2. A domain translation function extracts domain information.

 

LaDiffGAN (retrain)

LaDiffGAN: Training GANs with Diffusion Supervision in Latent Spaces

  1. Similar to Diff-Instruct: uses a diffusion model to supervise training of an image-to-image translation GAN.

 

CycleNet (retrain)

CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation

  1. unpaired

  2. Uses ControlNet to introduce the source-domain image x_0 as a condition, ε_θ(y_t, c_y, x_0); training relies on in-domain reconstruction and the cycle consistency of translations between the source and target domains.

CycleNet

 

Palette (retrain)

Palette: Image-to-Image diffusion models

  1. Paired; self-supervised learning automatically creates paired data, e.g., colorization and inpainting.

  2. Conditions on the source image through concatenation.

 

DDBM (retrain)

Denoising Diffusion Bridge Models

DDBM

  1. paired

  2. The diffusion process runs from a point of one distribution to its paired point in another distribution; the training formulas are modified accordingly, and the forward process q(x_t | x_0, x_T) has a closed form analogous to q(x_t | x_0).

  3. Similar to ShiftDDPMs.

 

DBIM (retrain)

Diffusion Bridge Implicit Models

  1. The DDIM counterpart of DDBM: DBIM is to DDBM what DDIM is to DDPM, and it can accelerate sampling with a pretrained DDBM.

 

LSB (no fine-tune)

Latent Schrodinger Bridge: Prompting Latent Diffusion for Fast Unpaired Image-to-Image Translation

LSB

 

EBDM (retrain)

EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models

EBDM

 

CDBM (retrain)

Consistency Diffusion Bridge Models

CBDM

 

A-Bridges (retrain)

Score-Based Image-to-Image Brownian Bridge

A-Bridges

 

FSBM (retrain)

Feedback Schrodinger Bridge Matching

 

ILVR (no fine-tune)

ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models

unpaired

 

DiffusionCLIP (fine-tune)

DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation

  1. unpaired

  2. One word corresponds to one domain and one fine-tuned model.

  3. Pre-trained unconditional DDPM + pre-trained CLIP.

  4. Each dataset image is DDIM-inverted for S_for steps to a return step t_0 and the resulting latents are saved (they can be reused). Generation then runs S_gen steps from these latents, and the final x_0 is used to compute a CLIP directional loss (see the sketch below) for one fine-tuning update of the DDPM, similar to a recurrent network.

  5. GPU-efficient variant: during the S_gen generation steps from the latents, each step's x̂_0 is used to compute the CLIP directional loss and fine-tune the DDPM once, so one batch of samples fine-tunes the network S_gen times.
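A minimal sketch of the CLIP directional loss, assuming `clip_img_enc` and `clip_txt_enc` return embedding tensors of matching shape (the encoder interfaces are assumptions).

```python
import torch
import torch.nn.functional as F

def clip_directional_loss(clip_img_enc, clip_txt_enc, x_src, x_gen, t_src, t_tgt):
    """The image-embedding shift (source image -> edited image) should align with
    the text-embedding shift (source prompt -> target prompt)."""
    d_img = clip_img_enc(x_gen) - clip_img_enc(x_src)
    d_txt = clip_txt_enc(t_tgt) - clip_txt_enc(t_src)
    return 1 - F.cosine_similarity(d_img, d_txt, dim=-1).mean()
```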

 

Rectifier (fine-tune)

High-Fidelity Diffusion-based Image Editing

  1. unpaired

  2. Builds on DiffusionCLIP.

  3. Trains a network to predict LoRA parameters for the convolution layers, avoiding DiffusionCLIP's recursive optimization.

rectifier-1

rectifier-2

 

EGSDE (no fine-tune)

EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations

  1. Only needs a diffusion model trained on the target domain: SDEdit is applied to the source-domain image, and two pretrained energy functions guide the sampling.

  2. Changing domain-specific features: train a domain classifier, drop its classification head to obtain an encoder, compute the cosine similarity between the features of the generated latent and the noisy latent of the source image, and use its gradient as guidance.

  3. Preserving domain-independent features: apply a low-pass filter, compute the L2 distance between the low-pass-filtered generated latent and the low-pass-filtered noisy latent of the source image, and use its gradient as guidance.

 

DDIB (no fine-tune)

Dual Diffusion Implicit Bridges for Image-to-Image Translation

  1. Requires diffusion models trained on both the source and target domains.

  2. The probability-flow ODEs form a Schrödinger bridge between the source and target domains.

  3. Cycle consistency: take a source-domain sample x_0, noise it to x_T with the source-domain model's probability-flow ODE, denoise it with the target-domain model's probability-flow ODE, then apply the same operations in reverse. If the discretization error of the probability-flow ODE is zero (DDIM is a low-error discretization of it), x_0 is recovered exactly.

  4. The first half of the cycle is the translation (see the sketch below).
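A minimal sketch of the DDIB translation step, assuming deterministic DDIM encode/decode helpers (`ddim_encode`, `ddim_decode` with eta = 0 are assumptions, not a specific library API).

```python
import torch

@torch.no_grad()
def ddib_translate(x0_src, model_src, model_tgt, ddim_encode, ddim_decode):
    """Encode with the source-domain PF-ODE, decode with the target-domain PF-ODE."""
    x_T = ddim_encode(model_src, x0_src)   # source domain -> shared Gaussian latent
    return ddim_decode(model_tgt, x_T)     # latent -> target domain
```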

 

DECDM (no fine-tune)

DECDM: Document Enhancement using Cycle-Consistent Diffusion Models

  1. An application of DDIB to document models.

 

CycleDiffusion (no fine-tune)

Unifying Diffusion Models' Latent Space, With Applications to Cyclediffusion and Guidance

  1. Requires diffusion models trained on both the source and target domains.

  2. The translation procedure is the same as DDIB, but uses the DPM-Encoder instead of the probability-flow ODE.

  3. With a single text-to-image model, two different text conditions can be viewed as DPMs trained on a source and a target domain, so this approach can do both image-to-image translation and image editing.

First encode with the source-domain model:

DPM-Encoder

Then decode with the target-domain model:

DPM-Decoder

Note that the DPM-Encoder is designed for stochastic diffusion models.

 

DDPM-Inversion (no fine-tune)

An Edit Friendly DDPM Noise Space

  1. The method is the same as the DPM-Encoder (the authors claim it differs, but no difference is apparent; possibly they refer to an earlier version of the DPM-Encoder).

 

LEDITS (no fine-tune)

LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance

  1. DDPM-Inversion + SEGA (a combination of multiple guidance signals).

 

LEDITS++ (no fine-tune)

LEDITS++: Limitless Image Editing using Text-to-Image Models

  1. Uses DPM-Solver for inversion, estimates a mask with both cross-attention maps and the DiffEdit approach, and performs mask-based editing.

 

TurboEdit (no fine-tune)

TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models

  1. SDXL-Turbo + DDPM-Inversion for fast editing.

 

Pix2Pix-Zero (no fine-tune)

Zero-shot Image-to-Image Translation

Pix2Pix-Zero

  1. Requires a pretrained StableDiffusion; performs zero-shot image-to-image translation such as cat → dog. Different text inputs to StableDiffusion can be viewed as diffusion models trained on different domains.

  2. BLIP generates a caption for the source image (the cat image), which CLIP encodes to c. GPT-3 generates many sentences about "cat" and about "dog"; CLIP encodes each sentence and the difference of the means gives Δc (see the sketch below).

  3. The image is inverted with regularized DDIM Inversion under c to obtain x_T. At each inversion step, the prediction of ε_θ is optimized by gradient descent with two losses (the code uses ε_θ(z_t, t, c) without CFG): one measures correlations between different positions, the other the KL divergence of each position to a standard Gaussian.

  4. Reconstruct the source image from x_T with c, caching each timestep's cross-attention maps M_t^ref. Then generate from x_T with c + Δc: at each timestep, first run the UNet once to get the cross-attention maps M_t^edit, take one optimization step on x_t using ||M_t^ref − M_t^edit||_2, then run the UNet again on the optimized x_t with c + Δc to predict x_{t−1}.

 

PIC (no fine-tune)

Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation

  1. The same task as Pix2Pix-Zero.

  2. DDIM-inverting the source image with the source prompt and then generating with the target prompt gives poor translations, due to the abrupt transition of the text embedding in the early denoising stage.

  3. We formulate a noise-prediction strategy for text-driven image-to-image translation by progressively updating the prompt embedding via time-dependent interpolations of the source and target prompt embeddings.

 

FBSDiff (no fine-tune)

FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation

FBSDiff

  1. Requires a pretrained StableDiffusion; performs zero-shot image-to-image translation such as man → girl. Different text inputs to StableDiffusion can be viewed as diffusion models trained on different domains.

 

CDM (retrain)

  1. unpaired

  2. Trains the diffusion model jointly with two encoders, one for content and one for style, exploiting inductive biases: the content code is a spatial layout mask, down/up-sampled to the feature-map size at use time; the style code is a vector carrying high-level semantics. Each UNet layer uses AdaGN: style applies a channel-wise affine transformation, and content is multiplied spatially with the AdaGN output.

  3. At sampling time, the image is first DDIM-inverted to noise using its own encodings, then generated with the target image's content or style.

 

DiffuseIT (no fine-tune)

Diffusion-based Image Translation using Disentangled Style and Content Representation

SDEdit + guidance + resample technique

DiffuseIT

 

Few-Shot Diffusion (fine-tune)

Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption

  1. unpaired

  2. Adapts a diffusion model pretrained on source domain A to a target domain B using only a few B samples. The model is initialized from the source-domain model and fine-tuned on translations x_0^{A→B} of arbitrary source images x^A (similar to DiffusionCLIP), with the objective: Directional Distribution Consistency Loss (between x^A and x_0^{A→B}) + Gram-matrix style loss (between x_0^{A→B} and x^B) + diffusion loss. The diffusion loss is computed on the target-domain samples, and x_0^{A→B} is the x̂_0 computed from the model output via Tweedie's formula.

  3. Directional Distribution Consistency Loss: first compute a cross-domain direction vector w = (1/m)·Σ_i E(x_i^B) − (1/n)·Σ_i E(x_i^A) with CLIP; the loss L_DDC is built from ⟨E(x^A) + w, E(x_0^{A→B})⟩, aligning E(x_0^{A→B}) with E(x^A) + w.

  4. Translation works like SDEdit, using only the target-domain diffusion model.

 

Fine-grained Appearance Transfer (no fine-tune)

Fine-grained Appearance Transfer with Diffusion Models

  1. unpaired

  2. Uses DIFT for semantic matching and feature transfer.

Fine-grained Appearance Transfer

 

S2ST

S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion

  1. unpaired

  2. DDIM-invert the source image to z_T, generate from z_T, compute a loss on the result, and optimize z_T. Then, starting from the optimized z_T, alternate optimization and generation, similar to Null-Text Inversion.

structure loss: MSE between the Sobel gradients of the generated image and the source image

appearance loss: encode a few target-domain images with the autoencoder, take the mean, and compute an MSE loss against the generated z_0

S2ST

 

FCDiffusion (fine-tune)

Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation

FCDiffusion-1

FCDiffusion-2

  1. Trains a ControlNet self-supervised: reconstruct the lossless image features z_0 from the lossy control signal c = FFM(z_0) and the paired text prompt y.

  2. For translation, the source image is first DDIM-inverted, then generation is conditioned on the target prompt and different frequency bands of the source image.

 

Style Transfer

Text

DiffStyler (training-free)

DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization

  1. 150 noising steps and 50 denoising steps; at each step, compute x̂_0 from x_t and ε_θ(x_t, t) via Tweedie's formula and optimize x̂_0 with various losses.

 

ZeCon (training-free)

Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer

ZeCon

  1. Noise the source image to an intermediate step and denoise from there; at each step, compute x̂_0 from x_t via Tweedie's formula, compute a CLIP loss + contrastive loss (for content preservation) on x̂_0, and use its gradient as guidance.

 

Specialist-Diffusion (training-based)

Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style

  1. StableDiffusion

  2. For each style (e.g., Flatten Design, Fantasy, Food doodle), collect a few dozen text-image pairs, apply data augmentation, and fine-tune StableDiffusion into a specialist model for that style; text input then generates images in that style.

 

Image

StyleAdapter (training-based)

StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation

  1. StableDiffusion

  2. Similar to IP-Adapter: encode all reference images with CLIP and feed them to a trainable StyEmb network to obtain a style feature; insert a trainable cross-attention layer into StableDiffusion where image tokens attend to the style feature, and add its output to the text cross-attention output before passing to the next layer (Two-Path Cross-Attention). StyEmb and the new cross-attention layers are trained to reconstruct the reference images.

  3. At sampling time, generation is conditioned on the prompt and the style image.

  4. For data augmentation, we apply random crop, resize, horizontal flipping, rotation, etc., to generate K = 3 style references for each input image during training.

 

DEADiff (training-based)

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

DEADiff

  1. StableDiffusion

  2. A frozen CLIP extracts the reference image's features; the Q-Former queries cross-attend to these features together with the word "content" or "style", and the Q-Former output is fed into StableDiffusion's text cross-attention. A new K/V projection matrix is trained (Q reuses the text cross-attention's), and the projected Q-Former output is concatenated with the text K/V, a variant of IP-Adapter.

  3. When training with "style", the image pairs share style but differ in content, and vice versa for "content". At inference only "style" is used; training with "content" merely makes the style representation more disentangled.

 

ColorizeDiffusion (training-based)

ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text

  1. Self-supervised training: extract the image's sketch, noise the image, add the noised result and the sketch together as the UNet input, replace the UNet's cross-attention with linear layers, and feed the CLIP image embedding of the image into those linear layers; all parameters are trained jointly for reconstruction.

  2. At sampling time, the sketch of the source image and the CLIP image embedding of the reference image are fed to the network, preserving the source structure while transferring the style of the reference.

  3. The reference image embedding can additionally be manipulated with text: since CLIP's text and image embeddings are aligned, the reference embedding can be shifted in CLIP space according to the given text and scale.

 

ArtFusion (training-based)

ArtFusion: Arbitrary Style Transfer using Dual Conditional Latent Diffusion Models

  1. Trains a diffusion model conditioned on content and style, performing self-reconstruction conditioned on the input's own content (extracted by the LDM VAE) and style (VGG features).

  2. At sampling time, different content and style images are used.

 

SGDiff (training-based)

SGDiff: A Style Guided Diffusion Model for Fashion Synthesis

  1. Similar to ArtFusion, but uses patches of the input as the style.

 

StyleDiffusion (training-based)

StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models

StyleDiffusion-2

  1. A pretrained style-removal model first strips the style from the source and reference images; then, similar to DiffusionCLIP, the model is fine-tuned with a CLIP directional loss (one model per style): in CLIP image-embedding space, the difference of the top pair in the figure should be similar to the difference of the bottom pair.

 

ControlStyle (training-based)

ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

ControlStyle

  1. ControlNet + DiffusionCLIP.

 

OSASIS (training-based)

One-Shot Structure-Aware Stylized Image Synthesis

OSASIS

  1. Given a style image I_B^style, a diffusion model ε_θ, and a Diff-AE ε_A trained on domain A, fine-tune to obtain ε_B for domain B.

  2. Using SDEdit with ε_θ, generate the domain-A counterpart I_A^style of I_B^style, and also generate arbitrary domain-A samples I_A^in. Noise I_A^style and I_A^in to t_0 with ε_A, copy ε_A to ε_B, run generation on both sides in parallel, and fine-tune ε_B with a CLIP directional loss, with a reconstruction loss on the style image as a regularizer.

  3. SPN: a structure-preserving network of 1×1 convolutions that preserves the spatial information and structural integrity of I_A^in: x_t^SPN = SPN(I_A^in), and x_t' = x_t + λ·x_t^SPN is used as the input to ε_B.

 

CSGO (training-based)

CSGO: Content-Style Composition in Text-to-Image Generation

CSGO

  1. Constructs a dataset to train a ControlNet.

 

VisualStylePrompt (training-free)

Visual Style Prompting with Swapping Self-Attention

VisualStylePrompt

  1. During generation, the keys and values of all self-attention layers after a certain decoder layer are replaced with the keys and values from the corresponding positions of the reference image's generation (see the sketch below).
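A minimal sketch of this key/value swapping inside self-attention; the tensor layout (B, heads, tokens, dim) and the caching of the reference K/V are assumptions.

```python
import torch

def swapped_self_attention(q_content, k_style, v_style, scale):
    """Queries come from the content/generation path; keys and values are the
    ones cached from the reference (style) image's generation at the same
    layer and timestep."""
    attn = torch.softmax(q_content @ k_style.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_style
```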

 

LAB (training-free)

Training-free Color-Style Disentanglement for Constrained Text-to-Image Synthesis

 

StyleID (training-free)

Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer

StyleID

  1. Similar to VisualStylePrompt: self-attention K/V injection.

 

Portrait-Diffuion (training-free)

Portrait Diffusion: Training-free Face Stylization with Chain-of-Painting

PortraitDiffusion

  1. Without O_t, this is simple self-attention K/V injection: DDIM-invert both the content and style images and reconstruct them, injecting the style image's self-attention K/V into the content image's self-attention during reconstruction.

  2. Style-guidance CFG is used to push the target image away from the content image.

 

ZePo (training-free)

ZePo: Zero-Shot Portrait Stylization with Faster Sampling

ZePo

  1. Compared against PortraitDiffusion.

 

Ditail (training-free)

Diffusion Cocktail: Fused Generation from Diffusion Models

  1. Typically each style has its own fine-tuned model; Ditail uses any pair of models for any-to-any style transfer, treating an image generated by one model as content and restyling it with the other.

  2. Similar to PnP, it injects features and self-attention maps; but since caching the source image's features and self-attention maps is storage-heavy, only the latents of the source generation are saved, and during style transfer the current model re-runs inference to recover features and self-attention maps, with results nearly identical to using the original model's.

 

DiffStyle (training-free)

Training-free Content Injection using h-space in Diffusion Models

DiffStyle

 

CartoonDiff (training-free)

Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models

  1. No training at all: a normalization applied to the predicted ε_θ achieves cartoonization; the normalization suppresses the generation of fine texture details.

 

FreeStyle (training-free)

FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models

FreeStyle

  1. Inspired by FreeU: the features from passing the source (content) image through the UNet encoder + decoder serve as the backbone features, carrying mostly low-frequency (content) information, and are scaled by a coefficient; the features from passing x_t through the UNet encoder serve as the skip features, carrying mostly high-frequency (style) information, and are scaled in the FFT domain before applying the inverse FFT.

 

ASI (training-free)

Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer

ASI

  1. With P2P, simply appending the style prompt to the original (content) prompt for text-guided style transfer destroys source details such as hair.

  2. Run cross-attention separately with the content prompt and style prompt to obtain features F_c and F_s. Compute one mask from the distribution difference between F_c and F_s (1 where the difference is large, 0 elsewhere) and another by thresholding F_c (0 where values are large, 1 elsewhere). Take their OR: positions marked 1 are changed and positions marked 0 are preserved, using an AdaIN-like blend: [σ(F_s)·(F_c − μ(F_c))/σ(F_c) + μ(F_s)]⊙M + F_c⊙(1−M) (see the sketch below).
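A minimal sketch of the masked AdaIN blend used above (the feature shapes are an assumption).

```python
import torch

def masked_adain(f_c, f_s, mask, eps=1e-5):
    """Re-normalize content features with the style features' channel statistics,
    but only inside the mask; keep content features elsewhere.
    f_c, f_s: (B, C, H, W); mask: (B, 1, H, W) in {0, 1}."""
    mu_c, std_c = f_c.mean((2, 3), keepdim=True), f_c.std((2, 3), keepdim=True) + eps
    mu_s, std_s = f_s.mean((2, 3), keepdim=True), f_s.std((2, 3), keepdim=True) + eps
    stylized = std_s * (f_c - mu_c) / std_c + mu_s
    return stylized * mask + f_c * (1 - mask)
```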

 

MagicStyle (training-free)

MagicStyle: Portrait Stylization Based on Reference Image

MagicStyle

  1. AdaIN technique.

 

STRDP (training-free)

Harnessing the Latent Diffusion Model for Training-Free Image Style Transfer

STRDP

  1. AdaIN technique.

 

TI

InST (training-based)

Inversion-Based Creativity Transfer with Diffusion Models

  1. StableDiffusion

  2. Encode the reference image with CLIP and train a network that maps the image embedding to a text token embedding (not a CLIP encoding), which is fed into the pretrained StableDiffusion (through CLIP); the network is trained with the TI objective.

 

StyleBooth (training-based)

StyleBooth: Image Style Editing with Multimodal Instruction

StyleBooth

  1. InstructPix2Pix: constructs a dataset to train W while fine-tuning InstructPix2Pix.

 

DomainGallery

DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning

DomainGallery

  1. Given a few-shot target dataset of a specific domain such as sketches painted by an artist, we expect to generate images that fall into the domain.

 

Pair-Customization (training-based)

Customizing Text-to-Image Models with a Single Image Pair

Pair-Customization

 

ArtBank (training-based)

ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank

ArtBank

  1. ISPB: each style has a learnable parameter matrix, converted by that style's dedicated SSAM into a token embedding and trained with TI on a few images of the style; only the ISPB is optimized.

  2. Stochastic Inversion: random noise is hard to predict, and incorrectly predicted noise causes a content mismatch between the stylized image and the content image. Therefore, first add random noise to the content image and let the denoising U-Net predict the noise in it; the predicted noise is used as the initial input noise at inference to preserve the content structure.

 

LSAST (training-based)

Towards Highly Realistic Artistic Style Transfer via Stable Diffusion with Step-aware and Layer-aware Prompt

LSAST

  1. Similar to ProSpect: the 1000 steps are divided evenly into 10 stages and the UNet into 3 parts; each stage-part combination has its own token embedding, trained with TI on a few style images.

  2. At generation time, besides DDIM Inversion, a pretrained edge ControlNet preserves the content image's structure.

 

Others

Stylebreeder

Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

  1. An art-generation dataset containing prompts, negative prompts, and the generated images.

 

HiCAST

HiCAST: Highly Customized Arbitrary Style Transfer with Adapter Enhanced Diffusion Models

 

RB-Modulation

RB-Modulation: Training-Free Personalization of Diffusion Models using Stochastic Optimal Control

 

AttnMod

AttnMod: Attention-Based New Art Styles

  1. modify attention for creating new unpromptable art styles out of existing diffusion models

 

Inverse Problem

MCG

Improving Diffusion Models for Inverse Problems using Manifold Constraints

  1. y = Hx + ε; guidance: ∇_{x_t} ||W(y − H·x̂_0)||²₂.

  2. The predecessor of DPS.

 

DPS

Diffusion Posterior Sampling for General Noisy Inverse Problems

Diffusion Posterior Proximal Sampling for Image Restoration

Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models trained on Corrupted Data

Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient Viewpoint

Improving Diffusion-Based Image Restoration with Error Contraction and Error Correction

Consistency Models Improve Diffusion Inverse Solvers

Deep Data Consistency: a Fast and Robust Diffusion Model-based Solver for Inverse Problems

Learning Diffusion Priors from Observations by Expectation Maximization

Zero-Shot Adaptation for Approximate Posterior Sampling of Diffusion Models in Inverse Problems

Prototype Clustered Diffusion Models for Versatile Inverse Problems

Reducing the cost of Posterior Sampling in Linear Inverse Problems via task-dependent Score Learning

Gaussian is All You Need: A Unified Framework for Solving Inverse Problems via Diffusion Posterior Sampling

Think Twice Before You Act: Improving Inverse Problem Solving With MCMC

Online Posterior Sampling with a Diffusion Prior

Variational Diffusion Posterior Sampling with Midpoint Guidance

Free Hunch: Denoiser Covariance Estimation for Diffusion Models Without Extra Costs

Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration

DPS

  1. y = Hx + ε; guidance: ∇_{x_t} ||y − H·x̂_0||²₂ (see the sketch below).
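A minimal sketch of this posterior-sampling correction, assuming `eps_model(x, t)` predicts noise and `H` is a (differentiable) measurement operator; the step-size handling in the actual DPS paper may differ.

```python
import torch

def dps_step_correction(x_t, t, y, H, eps_model, alpha_bar, zeta=1.0):
    """Push x_t down the gradient of ||y - H(x̂0)||, where x̂0 is the Tweedie
    estimate from the noise prediction."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    x0_hat = (x_t - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
    residual = torch.linalg.vector_norm(y - H(x0_hat))
    grad = torch.autograd.grad(residual, x_t)[0]
    return -zeta * grad   # add to the unconditional reverse-step output
```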

 

DEFT

DEFT: Efficient Finetuning of Conditional Diffusion Models by Learning the Generalised h-transform

  1. Similar to PDAE: trains a gradient estimator on inverse-problem pair data to guide sampling.

 

PGDM

Pseudoinverse-Guided Diffusion Models for Inverse Problems

 

MAP

Inverse Problems with Diffusion Models: A MAP Estimation Perspective

 

DreamGuider

DreamGuider: Improved Training free Diffusion-based Conditional Generation

 

CoSIGN

CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems

  1. Similar to CCM: trains a ControlNet for a consistency model and solves inverse problems in very few steps.

 

STSL

Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion

  1. Previous methods compute x̂_0 with the first-order Tweedie's formula; this work uses the second-order Tweedie's formula.

 

DMPlug

DMPlug: A Plug-in Method for Solving Inverse Problems with Diffusion Models

DMPlug

  1. The left side of the figure shows a DPS-like method that constrains each DDIM step via a loss between the predicted x̂_0 and the measurement; DMPlug instead treats DDIM as a function R and repeatedly generates x_0 = R(x_T), computing the loss against the measurement to optimize x_T.

 

CI2RM

Fast Samplers for Inverse Problems in Iterative Refinement Models

  1. Conditional Conjugate Integrators

 

Steered-Diffusion

Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

 

FDEM

Fast Diffusion EM: a diffusion model for blind inverse problems with application to deconvolution

  1. Real-world setting: y = Hx + ε, where only y and the noise level of ε are known, and both x and H must be recovered. H is an unobserved variable, so the EM algorithm applies.

  2. Variational inference might also be applicable.

 

LatentDEM

Blind Inversion using Latent Diffusion Priors

  1. Uses the EM algorithm.

 

EMDiffusion

An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations

  1. Uses the EM algorithm.

 

Adapt and Diffuse: Sample-adaptive Reconstruction via Latent Diffusion Models

  1. y = Hx + n. During training, a pretrained autoencoder's encoder maps x to z, and a new network is trained to map y to ẑ, with the objectives that ẑ stays close to z and that the decoder can recover x from ẑ.

  2. At inference only y is available; z is obtained by Langevin sampling and decoded to x. Concretely, ∇_{z_t} ||H(D(ẑ_0(z_t))) − y||² serves as the drift of the Langevin sampler, where D is the pretrained autoencoder's decoder.

 

BCDM

Bayesian Conditioned Diffusion Models for Inverse Problems

 

APS

Amortized Posterior Sampling with Diffusion Prior Distillation

Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems

  1. amortized variational inference

 

Restoration

Non-Blind

DDRM

Denoising Diffusion Restoration Models

 

DDNM

Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model

 

DDPG

Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance

 

IR-SDE

Image Restoration with Mean-Reverting Stochastic Differential Equations

IR-SDE

  1. The SDE form of PriorShift from ShiftDDPMs.

 

DeqIR

Deep Equilibrium Diffusion Restoration with Parallel Sampling

  1. DEQ-based

 

RCM

Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)

  1. We propose a zero-shot restoration scheme that uses CMs and operates well with as little as 4 NFEs.

 

Blind

BlindDPS

Parallel Diffusion Models of Operator and Image for Blind Inverse Problems

 

FAG-Diff

Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models

 

LADiBI

Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion

 

GDP

Generative Diffusion Prior for Unified Image Restoration and Enhancement

GDP

 

DiffusionVI

Diffusion Priors for Variational Likelihood Estimation and Image Denoising

 

BIRD

Blind Image Restoration via Fast Diffusion Inversion

BIRD

  1. Similar to DMPlug: we aim to find the initial noise sample that generates the image when run through DDIM.

  2. η is a vector of all parameters defining the degradation operator H; L_IR = ||y − H_η(x_0)||², and η and x_T are optimized jointly.

 

FlowIE

FlowIE: Efficient Image Enhancement via Rectified Flow

FlowIE

  1. Directly models the path between the two distributions with a flow, applicable to many tasks such as inpainting, colorization, and super-resolution.

 

AutoDIR

AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion

  1. Trains an image-restoration network that handles different degradations.

  2. Trains a network that identifies which predefined degradation the input image suffers from (e.g., blur) and fills it into a template to form a prompt (e.g., "a photo needs {blur} artifact reduction").

  3. Trains an LDM on data with multiple predefined degradations: the source image is concatenated to z_t, the prompt is the condition, and the model is trained to restore the clean data.

  4. At inference, the input image goes through step 2 to obtain the prompt, and both are fed to the LDM for restoration.

 

UIR-LoRA

UIR-LoRA: Achieving Universal Image Restoration through Multiple Low-Rank Adaptation

UIR-LoRA

 

DiffBIR

DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior

DiffBIR

 

PromptIR

PromptIR: Prompting for All-in-One Blind Image Restoration

PromptIR

 

ZeroAIR

Exploiting Diffusion Priors for All-in-One Image Restoration

 

Diff-Plugin

Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks

  1. Similar to AutoDIR.

 

TIP

TIP: Text-Driven Image Processing with Semantic and Restoration Instructions

  1. ControlNet: the ControlNet receives the degradation instruction while StableDiffusion receives the prompt; trained self-supervised.

 

Decorruptor

Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation

Decorruptor

  1. create pairs of (clean, corrupted) images and utilize them for fine-tuning to enable the recovery of corrupted images to their clean states.

 

PromptFix

PromptFix: You Prompt and We Fix the Photo

PromptFix

  1. We compile approximately two million raw data points across eight tasks: image inpainting, object creation, image dehazing, image colorization, super-resolution, low-light enhancement, snow removal, and watermark removal. For each low-level task, we utilized GPT-4 to generate diverse training instruction prompts Pinstruction. These prompts include task-specific and general instructions. The task-specific prompts, exceeding 250 entries, clearly define the task objectives. For example, "Improve the visibility of the image by reducing haze" for dehazing.

  2. For watermark removal, super-resolution, image dehazing, snow removal, low-light enhancement, and image colorization tasks, we also generate "auxiliary prompts" for each instance. These auxiliary prompts describe the quality issues for the input image and provide semantic captions.

 

SUPIR

Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

  1. Uses an MLLM to generate the prompt, a ControlNet to inject the LQ image, and SDXL to generate the HQ image.

 

Diff-Restorer

Diff-Restorer: Unleashing Visual Prompts for Diffusion-based Universal Image Restoration

  1. Similar to SUPIR.

 

InstantIR

InstantIR: Blind Image Restoration with Instant Generative Reference

InstantIR

 

ReFIR

ReFIR: Grounding Large Restoration Models with Retrieval Augmentation

InstantIR

 

DP-IR

A Modular Conditional Diffusion Framework for Image Reconstruction

DP-IR

 

BIR-D

Taming Generative Diffusion Prior for Universal Blind Image Restoration

  1. guidance

 

Face/Human

PGDiff

PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance

  1. Partial guidance, similar to GradPaint; organizes many tasks under a unified framework.

 

PFStorer

PFStorer: Personalized Face Restoration and Super-Resolution

PFStorer

  1. Restoration with a reference image: the LQ image is injected as in StableSR, and the reference image is injected in a ControlNet-like manner.

 

RestorerID

RestorerID: Towards Tuning-Free Face Restoration with ID Preservation

 

CLR-Face

CLR-Face: Conditional Latent Refinement for Blind Face Restoration Using Score-Based Diffusion Models

 

DiffBody

DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior

DiffBody

  1. ControlNet

 

DTBFR

Towards Unsupervised Blind Face Restoration using Diffusion Prior

 

AuthFace

AuthFace: Towards Authentic Blind Face Restoration with Face-oriented Generative Diffusion Prior

 

DR-BFR

DR-BFR: Degradation Representation with Diffusion Models for Blind Face Restoration

 

OSDFace

OSDFace: One-Step Diffusion Model for Face Restoration

 

Super Resolution

Upsampling the LR image to the HR resolution turns the problem into LQ-to-HQ restoration.

 

SRDiff

SRDiff: Single Image Super-Resolution with Diffusion Probabilistic Models

SRDiff

  1. A diffusion model conditioned on the LR image models the residual between the HR image and upsample(LR).

 

SR3

Image Super-Resolution via Iterative Refinement

  1. The low-resolution image is upsampled to the high resolution and concatenated to x_t for training, similar to GLIDE's inpainting model.

 

StableSR

Exploiting Diffusion Prior for Real-World Image Super-Resolution

StableSR

  1. The LR image is upsampled to the HR resolution, encoded by the VAE encoder, and fed to a trainable time-aware encoder producing multi-scale features; a small convolutional network (SFT) predicts scale and shift from these features to affine-transform the corresponding StableDiffusion features. Only the encoder and SFT are trained.

  2. Color correction: each channel of the prediction is normalized by its own mean and standard deviation, then rescaled by the LR image's per-channel standard deviation and shifted by its mean (see the sketch below).

  3. A CFW module is trained to modify the VAE decoder features F_d using the VAE encoder features F_e, with an MSE objective; only CFW is trained.
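A minimal sketch of the channel-wise color correction described in item 2 (tensor layout (B, C, H, W) is an assumption).

```python
import torch

def color_correction(pred, lr_ref, eps=1e-5):
    """Normalize each channel of the prediction, then re-scale/shift it with the
    LR reference's per-channel statistics."""
    mu_p, std_p = pred.mean((2, 3), keepdim=True), pred.std((2, 3), keepdim=True) + eps
    mu_r, std_r = lr_ref.mean((2, 3), keepdim=True), lr_ref.std((2, 3), keepdim=True) + eps
    return (pred - mu_p) / std_p * std_r + mu_r
```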

 

ResShift

ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting

  1. Super-resolution: the diffusion process starts from the HR image and ends at the LR image, gradually adding the LR-HR residual; similar to ShiftDDPMs, the posterior is derived and the reverse process is modeled accordingly.

 

SinSR

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

  1. Extends ResShift to deterministic DDIM-style sampling, then distills it into a single step.

 

DoSSR

Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs

DoSSR

 

TASR

TASR: Timestep-Aware Diffusion Model for Image Super-Resolution

TASR

 

PatchScaler

PatchScaler: An Efficient Patch-independent Diffusion Model for Super-Resolution

PatchScaler

  1. Confidence-driven loss: L_GRM = E_{y^LR}[ ||y^HR − x^HR||_1 + λ·(C·||y^HR − x^HR||²₂ − η·log C) ], where x^HR is the ground-truth HR feature.

  2. A DiT is trained on x^HR.

  3. After the GRM produces the coarse HR feature y^HR, it is patchified; the average confidence score of the pixels in each patch determines its difficulty, which selects a timestep t (harder patches get larger t). The patch is noised to y_t and denoised with the DiT, similar to SDEdit.

 

Treg

Regularization by Texts for Latent Diffusion Inverse Solvers

Text-guided super-resolution and deblurring.

 

PromptSR

Image Super-Resolution with Text Prompt Diffusion

The upsampled LR image is concatenated to x_t, and a diffusion model with cross-attention is trained from scratch; a pretrained text encoder encodes the prompt for cross-attention, where prompts are instructions such as deblur or resize.

Text-guided Explorable Image Super-resolution

 

CoSeR

CoSeR: Bridging Image and Language for Cognitive Super-Resolution

Similar to PromptSR: coarse HR reference images and a prompt are generated from the LR image, and both condition a diffusion model trained for super-resolution.

 

FaithDiff

FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution

FaithDiff

 

CasSR

CasSR: Activating Image Power for Real-World Image Super-Resolution

CasSR

Coarse HR reference images are generated from the LR image and used, together with the LR image, as conditions for a diffusion model trained for super-resolution.

 

SeeSR

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

SeeSR

 

PASD

Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization

PASD

 

XPSR

XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution

XPSR

Similar to SUPIR, uses an MLLM to generate the prompt.

 

SAM-DiffSR

SAM-DiffSR: Structure-Modulated Diffusion Model for Image Super-Resolution

  1. SAM-assisted.

 

SegSR

Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution

Seg

 

SkipDiff

SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution

SkipDiff

  1. action 0 is to perform the reverse diffusion process with the current state, while action 1 is to skip the diffusion process.

 

ECDP

Effcient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution

ECDP

  1. Retrains a score model with two losses.

  2. L_score is the diffusion loss.

  3. At each training iteration, the score model's PF-ODE samples a result from the LR image, and a perceptual loss against the HR image gives L_quality; the ODE can be backpropagated without storing intermediate states (Neural Ordinary Differential Equations).

 

FDDif

Frequency-Domain Refinement with Multiscale Diffusion for Super Resolution

 

BlindDiff

BlindDiff: Empowering Degradation Modelling in Diffusion Models for Blind Image Super-Resolution

  1. Most methods are tailored to non-blind inverse problems with fixed, known degradation settings, limiting their adaptability to real-world applications involving complex unknown degradations.

  2. Introduces estimation of the degradation level.

 

CDFormer

CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution

  1. Blind Image Super-Resolution

 

DiffFNO

DiffFNO: Diffusion Fourier Neural Operator

 

RFSR

RFSR: Improving ISR Diffusion Models via Reward Feedback Learning

 

Acceleration

OSEDiff

One-Step Effective Diffusion Network for Real-World Image Super-Resolution

OSEDiff

  1. Uses an idea similar to Diff-Instruct.

 

InvSR

Arbitrary-steps Image Super-resolution via Diffusion Inversion

InvSR

 

AdaDiffSR

AdaDiffSR: Adaptive Region-aware Dynamic Acceleration Diffusion Model for Real-World Image Super-Resolution

AdaDiffSR

 

S3Diff

Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors

S3Diff

 

TDDSR

TDDSR: Single-Step Diffusion with Two Discriminators for Super Resolution

TDDSR

 

AdcSR

Adversarial Diffusion Compression for Real-World Image Super-Resolution

AdcSR

 

HF-Diff

HF-Diff: High-Frequency Perceptual Loss and Distribution Matching for One-Step Diffusion-Based Image Super-Resolution

HF-Diff

 

TSD-SR

TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution

TSD-SR

 

OFTSR

OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs

  1. Step distillation for acceleration.

 

Inpainting

Blended Diffusion

Blended Diffusion for Text-driven Editing of Natural Images

Blended Latent Diffusion

  1. training-free, text-free + text-guided

  2. Pre-trained unconditional diffusion model + pre-trained CLIP as guidance.

  3. Inpainting-style blending: at each sampling step, the unmasked region of the result is replaced with a sample from q(x_t | x_0) (see the sketch after this list).

  4. extending augmentations
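A minimal sketch of the blending step in item 3 (the mask convention, 1 = editable region, is an assumption).

```python
import torch

@torch.no_grad()
def blended_step(x_t_gen, x0_src, mask, t, alpha_bar):
    """Keep the generated content inside the mask and overwrite the unmasked
    region with a freshly noised version of the original image, q(x_t | x_0)."""
    noise = torch.randn_like(x0_src)
    x_t_src = alpha_bar[t].sqrt() * x0_src + (1 - alpha_bar[t]).sqrt() * noise
    return mask * x_t_gen + (1 - mask) * x_t_src
```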

 

LatentPaint

LatentPaint: Image Inpainting in Latent Space with Diffusion Models

  1. training-free, text-free

  2. Applies blending to latent representations (e.g., h-space).

 

RePaint

RePaint: Inpainting using Denoising Diffusion Probabilistic Models

resample

  1. training-free, text-free

  2. resample technique (see the sketch below)
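A minimal sketch of the resampling loop, assuming helper callables `reverse_step(x, t)` (one denoising step, with blending of the known pixels) and `forward_one_step(x, t)` (one step of q(x_{t+1} | x_t)); these interfaces are assumptions.

```python
import torch

@torch.no_grad()
def repaint_resample(x_t, t, reverse_step, forward_one_step, n_resample=10):
    """Jump back and forth at each timestep so the inpainted region becomes
    consistent with the known region before moving on."""
    for _ in range(n_resample):
        x_prev = reverse_step(x_t, t)          # x_t -> x_{t-1}
        x_t = forward_one_step(x_prev, t - 1)  # x_{t-1} -> x_t via q(x_t | x_{t-1})
    return reverse_step(x_t, t)
```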

 

TD-Paint

TD-Paint: Faster Diffusion Inpainting Through Time Aware Pixel Conditioning

TD-Paint

  1. training-free, text-free

  2. Removes RePaint's resampling procedure, accelerating inpainting.

 

CoPaint

Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models

  1. training-free, text-free

  2. Each stochastic DDIM sampling step is refined by gradient updates, with an MSE loss between its x̂_0 and the unmasked region of the original image.

  3. Also uses the resample technique.

 

GradPaint

GradPaint: Gradient-Guided Inpainting with Diffusion Models

  1. training-free, text-free

  2. A gradient-guidance version of CoPaint: compute the MSE between each step's result and the unmasked region of the original image, and use its gradient as guidance, similar to posterior sampling.

 

Tiramisu

Image Inpainting via Tractable Steering of Diffusion Models

Tractable Probabilistic Models

 

GLIDE

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  1. training-based, text-guided

  2. See the Text-Guided Inpainting Model.

 

Imagenator

Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting

  1. training-based, text-guided

  2. The Imagen version of GLIDE's text-guided inpainting model; directly downsampling and concatenating causes artifacts at mask boundaries, so an encoder is trained for downsampling.

Imagenator

 

StableInpainting

High-Resolution Image Synthesis with Latent Diffusion Models

  1. training-based, text-guided

  2. The StableDiffusion version of GLIDE's text-guided inpainting model, trained on LAION with random masks; the masked image is also encoded by the VAE encoder, and the mask is downsampled to the size of z_t.

 

CAT-Diffusion

Improving Text-guided Object Inpainting with Semantic Pre-inpainting

CAT-Diffusion

  1. training-based, text-guided

  2. StableInpainting has two drawbacks: relying on a single U-Net to align the text prompt and the visual object across all denoising timesteps is insufficient to generate the desired objects, and the controllability of object generation is not guaranteed in the intricate sampling space of the diffusion model.

  3. pre-inpainting

 

SmartBrush

SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model

  1. training-based, text-guided

  2. In x_t, only the foreground is noised; the background remains the original image.

  3. Self-supervised learning using a panoptic segmentation dataset.

  4. Mask augmentation + background preservation with mask prediction.

  5. The mask can also specify the shape during editing.

SmartBrush

 

PowerPaint

A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

  1. training-based, text-guided

  2. Trained like StableInpainting, but with additional trainable prompt tokens in the text serving as task prompts.

PowerPaint

 

ControlNet-Inpainting

Adding Conditional Control to Text-to-Image Diffusion Models

  1. training-based, text-guided

  2. z_t + masked image + mask are fed to the ControlNet as the condition.

 

BrushNet

BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

  1. training-based, text-guided

  2. An improved ControlNet: the added network drops the cross-attention layers and processes only the image.

BrushNet

 

Brush2Prompt

Brush2Prompt: Contextual Prompt Generator for Object Inpainting

  1. Automatically generates an inpainting prompt from the unmasked content and the mask shape, then performs inpainting with a text-guided inpainting model.

 

LoMOE

LoMOE: Localized Multi-Object Editing via Multi-Diffusion

LoMOE

  1. training-free, text-guided

  2. Uses BLIP to generate a prompt for the image and regularized DDIM Inversion to obtain x_T.

  3. For multi-region editing, mask-based MultiDiffusion is used: each region is denoised separately with its own edit prompt, and the results are combined according to the masks.

  4. A classic two-branch approach: losses between the two branches optimize y_t by gradient descent; an MSE loss between cross-attention maps preserves the position and structure of the edited objects, and a background-pixel MSE loss preserves the background.

 

HD-Painter

HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models

  1. A training-free, text-guided method built on StableInpainting.

  2. All self-attention layers of the trained StableInpainting model are replaced by Prompt-Aware Introverted Attention (PAIntA) layers. They still compute self-attention, but each masked pixel's self-attention map is modified: the response to each unmasked pixel is multiplied by a coefficient equal to the sum of that unmasked pixel's cross-attention responses over all words, so masked pixels attend more to text-relevant unmasked pixels. Since every self-attention (PAIntA) layer in StableInpainting precedes a cross-attention layer, the computation borrows the parameters of the following cross-attention layer.

  3. Reweighting Attention Score Guidance: compute each word's cross-attention map and a cross-entropy against the mask, maximizing cross-attention scores in the masked region and minimizing them in the unmasked region; the per-word results are summed and the gradient serves as guidance. Ordinary guidance shifts samples off-distribution and degrades quality, so here the guidance replaces the noise term in the stochastic DDIM update: since that noise is standard normal, the guidance is divided by its standard deviation to match unit variance while its mean is kept, realizing guidance without drifting off-distribution.

  4. A super-resolution LDM is trained to upscale the inpainting results.

 

MagicRemover

MagicRemover: Tuning-free Text-guided Image inpainting with Diffusion Models

MagicRemover

  1. training-free, text-guided; specialized for object removal, where the text names the object to remove.

  2. Optimizing z_t toward the direction where the cross-attention response of the k-th word (e.g., "swan") goes to zero naturally erases the corresponding object. Cross-attention responses from high to low correspond to the object, its shadow, and the background. A function g(t, k, λ) is defined from CAM_{t,k} using the threshold min(CAM_{t,k}) + λ·(max(CAM_{t,k}) − min(CAM_{t,k})) and the L1 norm of the responses above it, and ∇_{z_t} g(t, k, λ) is used as guidance together with the h-space asymmetric sampling method.

  3. The self-attention K/V of the reconstructive generative trajectory is injected into the inpainting generative trajectory; following MasaCtrl's idea, an object mask can be estimated from the reconstructive trajectory's cross-attention so that, in the inpainting trajectory's self-attention, the object region only attends to the reconstructive K/V outside the mask.

 

AttentiveEraser

Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance

AttentiveEraser

  1. training-free; requires a mask of the object to remove.

  2. AAS: increase the self-attention weights between the mask region and the background so the removed area better matches the background; decrease the weights within the mask region, since the region to be inpainted has nothing of its own to reference; decrease the weights from the background to the mask region so the background is not affected.

  3. SARG: attention guidance similar to PAG: ε_θ(z_t) + s·(ε_θ^{AAS}(z_t) − ε_θ(z_t)).

 

MagicEraser

MagicEraser: Erasing Any Objects via Semantics-Aware Control

MagicEraser

  1. TI learns an adjective token placed before background-related words, as in "A photo of S⋆ sky."; the diffusion model is simultaneously LoRA fine-tuned on a constructed dataset.

  2. Self-attention weights of regions related to the mask region are increased, and those of unrelated regions are decreased.

  3. At zero-shot inference, a prompt containing S⋆ performs the erasure.

 

PILOT

Coherent and Multi-modality Image Inpainting via Latent Space Optimization

PILOT

  1. training-free, text-guided

  2. L_bg = ||(1−m)⊙z̃^u_{0,t} − (1−m)⊙z_in||²₂ and L_S = ||(1−m)⊙z̃^u_{0,t} − (1−m)⊙z̃^c_{0,t}||²₂ / ||z̃^u_{0,t} − z̃^c_{0,t}||²₂: z̃^u_{0,t} and z̃^c_{0,t} should differ as much as possible overall while agreeing outside the mask, which encourages text alignment inside the mask.

  3. We believe the early stage of the reverse process determines the semantics of the generated image, so optimization is applied only in the first γT steps; for speed, it is performed only every τ steps.

 

Uni-paint

Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model

UniPaint

  1. training-based, text-free + text-guided

  2. Blending alone is insufficient: since the known information is inserted externally rather than generated by the model itself, the model lacks full context awareness, potentially causing incoherent semantic transitions near the hole boundary. A brief masked fine-tuning of the model fixes this, and blended generation is then used as before.

  3. Masked attention is further adopted: in cross-attention, the text only attends to pixels inside the mask; in self-attention, only pixels inside the mask attend to each other.

 

MaGIC

Multi-modality Guided Image Completion

  1. training-based, text-based, built on the StableInpainting model.

  2. Each modality has its own encoder producing multi-scale features, injected into the corresponding scales of the UNet encoder features. Structure-form modalities (segmentation, edges, etc.) are added directly; context-form modalities (text, style, etc.) are pooled and injected into cross-attention as context vectors. The StableDiffusion Inpainting model is frozen and only the modality encoders are trained, each modality separately, somewhat like ControlNet.

  3. At sampling time, several modality encoders can be combined, but not via the injection above (features are not additive). Instead, an MSE loss is computed between the StableDiffusion Inpainting UNet's multi-scale features and the multi-scale features obtained with each single modality encoder, and its gradient is used as guidance; since gradients are additive, multimodal guidance is achieved without multimodal retraining.

 

InpaintAnything

Inpaint Anything: Segment Anything Meets Image Inpainting

SAM + any inpainting model

 

StrDiffusion

Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting

StrDiffusion

  1. training-based

  2. Uses the IR-SDE formulation, with the masked image as μ.

  3. Sparse structure: e.g., the grayscale map and the edge map.

 

ByteEdit

ByteEdit: Boost, Comply and Accelerate Generative Image Editing

ByteEdit

  1. StableInpainting with feedback learning.

  2. Suppose the diffusion model generates in 20 steps. During training, noise only up to step 15, randomly sample a timestep t in [1, 10], generate from 15 to t without recording gradients, then generate from t to 0 with gradients and optimize that part of the generation chain.

 

SketchInpainting

Sketch-guided Image Inpainting with Partial Discrete Diffusion Process

SketchInpainting

  1. Discrete diffusion is applied only to the tokens in the masked region; a dataset is constructed for self-supervised training.

 

LazyDiffusion

Lazy Diffusion Transformer for Interactive Image Editing

LazyDiffusion

  1. The masked image is patchified with PixArt-α's patch scheme and fed to a transformer context encoder; only the tokens covered by the mask are kept as the global context.

  2. Initialized from PixArt-α: the original image is noised and patchified, only the tokens covered by the mask are fed to the model, the global context tokens are concatenated to the input, the prompt goes to cross-attention, and all parameters are trained end-to-end for denoising.

  3. A dataset is constructed for self-supervised training, similar to SmartBrush.

 

AsyncDSB

AsyncDSB: Schedule-Asynchronous Diffusion Schrödinger Bridge for Image Inpainting

 

Outpainting

Ten

Generative Powers of Ten

zoom stack

 

PQDiff

Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach

PQDiff

  1. Randomly crop two views, an anchor view and a target view, resize them to the same shape, compute RPE from their top-left coordinates, and train to generate the target view conditioned on the anchor view.

 

PBG

Salient Object-Aware Background Generation using Text-Guided Diffusion Models

PBG

  1. We use Stable Inpainting as a base model and add the ControlNet model on top to adapt it to the salient object outpainting task.

 

Representation Learning

Diff-AE

Diffusion Autoencoders: Toward a Meaningful and Decodable Representation

 

SODA

SODA: Bottleneck Diffusion Models for Representation Learning

SODA

The UNet has 2m+1 layers; z is split into m parts {z_i}, i = 1..m. In Adaptive GroupNorm, encoder and decoder layers of the same resolution share the same z_i, and parts of {z_i} are randomly zeroed out during training, which enables classifier-free guidance at generation time and improves disentanglement among the z_i.

 

PDAE

Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models

 

DBAE

Diffusion Bridge AutoEncoders for Unsupervised Representation Learning

DBAE

  1. An encoder encodes x_0 into z, from which x_T is decoded; DDBM models the bridge between x_0 and x_T.

  2. In Diff-AE and PDAE the data's information is split between z and x_T, whereas in DBAE x_T depends entirely on z, so all information resides in z; with deterministic sampling there are no stochastic variations, and inferring x_T is also faster.

 

HDAE

Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation

 

DiffuseGAE

DiffuseGAE: Controllable and High-fidelity Image Manipulation from Disentangled Representation

Learns disentangled representations on Diff-AE's latent space.

 

DisDiff

DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models

 

EncDiff

Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement

 

CL-Dis

Closed-Loop Unsupervised Representation Disentanglement with beta-VAE Distillation and Diffusion Probabilistic Feedback

 

FDAE

Factorized Diffusion Autoencoder for Unsupervised Disentangled Representation Learning

 

DiTi

Exploring Diffusion Time-steps for Unsupervised Representation Learning

 

CausalDiffAE

Causal Diffusion Autoencoders: Toward Counterfactual Generation via Diffusion Probabilistic Models

 

Object-Centric Learning

Object-Centric Slot Diffusion

Learning to Compose: Improving Object Centric Learning by Injecting Compositionality

 

DDAESSL

Denoising Diffusion Autoencoders are Unified Self-supervised Learners

 

DiffMAE

Diffusion Models as Masked Autoencoders

 

UMD

Unified Auto-Encoding with Masked Diffusion

UMD

  1. The masked region uses an MAE loss and the noisy part uses a diffusion loss; both are trained jointly.

  2. An extra t = 0 timestep is introduced with a high mask ratio and no noise, using only the MAE loss.

 

MDM

Masked Diffusion as Self-supervised Representation Learner

MDM

MAE with a dynamic mask ratio.

 

StableRep

StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Generates images from text as training data.

Can Generative Models Improve Self-Supervised Representation Learning?

Generates images from source images as training data; instance-guided generation serves as an augmentation for SSL.

Unlike StableRep, we do not replace a real dataset with a synthetic one. Instead, we leverage conditional generative models to enrich augmentations for self-supervised learning. In addition, our method does not require text prompts and directly uses images as input to the generative model.

 

PersonalizedRep

Personalized Representation from Personalized Generation

PersonalizedRep

 

CLSP

Contrastive Learning with Synthetic Positives

CLSP

 

GenPoCCL

Multi Positive Contrastive Learning with Pose-Consistent Generated Images

GenPoCCL

 

GenView

GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning

GenView

 

SynCLR-SynCLIP

Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images

SynCLR-SynCLIP

 

DreamDA

DreamDA: Generative Data Augmentation with Diffusion Models

DreamDA

  1. Add Gaussian noise to the h-space feature used to predict x̂_0 while keeping the original feature for predicting the direction; with DDIM sampling this generates variations of the source image.

 

DALDA

DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling

 

l-DAE

Deconstructing Denoising Diffusion Models for Self-Supervised Learning

 

ADDP

ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process

 

InfoDiffusion

InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models

 

RepFusion

Diffusion Model as Representation Learner

Distill the intermediate representation from a pre-trained diffusion model to a recognition student.

After the distillation phase, the student is reapplied as a feature extractor and fine-tuned with the task label.

Reinforced Time Selection for Distillation.

 

De-Diffusion

De-Diffusion Makes Text a Strong Cross-Modal Interface

text as representation, encoder is a captioning model, decoder is a text2img model

gumbel softmax

 

DiffSSL

Do text-free diffusion models learn discriminative visual representations?

Uses UNet intermediate feature maps for discrimination.

 

DIVA

Diffusion Feedback Helps CLIP See Better

DIVA

  1. DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, with only images (without corresponding text).

  2. The CLIP image embedding lies in the same space as the text embedding, so it can be fed to StableDiffusion as a condition.

 

Free-ATM

Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images

Free-ATM

 

 

Other Tasks

hybrid

UniDiffuser

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

 

OneDiffusion

One Diffusion to Generate Them All

 

InstructDiffusion

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

  1. Different tasks are converted into different instructions; the source image, instruction, and target image form the training data, the instruction is the text input, and a StableDiffusion model is trained to generate the target image with the source image concatenated to z_t.

  2. Training on InstructPix2Pix data also enables editing.

 

DreamOmni

DreamOmni: Unified Image Generation and Editing

DreamOmni

 

InstructCV

InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

InstructCV

  1. The instruction is the text input to StableDiffusion, and the encoded source image x is concatenated to z_t for training.

 

PixWizard

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

PixWizard-1

PixWizard-2

 

DiffusionGeneralist

Toward a Diffusion-Based Generalist for Dense Vision Tasks

DiffusionGeneralist

 

DiffX

DiffX: Guide Your Layout to Cross-Modal Generative Modeling

 

Classification

RDC

Robust Classification via a Single Diffusion Model

min_y (1/T)·Σ_{t=1}^{T} E[ ||ε_θ(x_t, t, y) − ε||²₂ ]
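A minimal sketch of this diffusion-classifier scoring, assuming a conditional denoiser `eps_model(x_t, t, y)`; the timestep/noise sampling strategy here is illustrative.

```python
import torch

@torch.no_grad()
def diffusion_classify(x0, labels, eps_model, alpha_bar, timesteps, n_noise=4):
    """Pick the label whose conditional denoiser best predicts the injected noise,
    averaged over timesteps and noise draws."""
    scores = []
    for y in labels:
        errs = []
        for t in timesteps:
            for _ in range(n_noise):
                eps = torch.randn_like(x0)
                x_t = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
                errs.append(((eps_model(x_t, t, y) - eps) ** 2).mean())
        scores.append(torch.stack(errs).mean())
    return labels[int(torch.stack(scores).argmin())]
```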

 

NDC

Diffusion Models are Certifiably Robust Classifiers

 

TiF

Few-shot Learner Parameterization by Diffusion Time-steps

LoRA fine-tune StableDiffusion on the few-shot dataset with the prompt "a photo of [C]", then classify with a formula like RDC's above, but with a timestep weight added to the formula; the paper notes that this weight is important.

 

CiP

Image Captions are Natural Prompts for Text-to-Image Models

For datasets with only class labels, such as ImageNet: a pretrained captioning model generates a caption for each image, which is appended to "a photo of class" to form a prompt; a pretrained StableDiffusion then generates an image for that prompt, which replaces the original image. The synthesized dataset has the same size as the original, and classifiers trained on it perform better.

 

CDN

Classification-Denoising Networks

 

HDC

Just Leaf It: Accelerating Diffusion Classifiers with Hierarchical Class Pruning

 

GDC

A Simple and Efficient Baseline for Zero-Shot Generative Classification

 

Retrieval

DiffusionRet

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

 

TIGeR

Unified Text-to-Image Generation and Retrieval

 

Object Detection

DiffusionDet

DiffusionDet: Diffusion Model for Object Detection

 

CamoDiffusion

CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models

 

FocusDiffuser

FocusDiffuser: Perceiving Local Disparities for Camouflaged Object Detection

 

DiffRef3D

DiffRef3D: A Diffusion-based Proposal Refinement Framework for 3D Object

 

DiffuBox

DiffuBox: Refining 3D Object Detection with Point Diffusion

 

MonoDiff

Monocular: 3D Object Detection and Pose Estimation with Diffusion Models

 

SDDGR

SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection

 

CLIFF

CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection

 

Edge Detection

DiffusionEdge

DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection

 

Depth

D4RD

Digging into contrastive learning for robust depth estimation with diffusion models

 

BetterDepth

BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

 

PriorDiffusion

PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation

 

Lotus

Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction

 

FiffDepth

FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation

 

SharpDepth

SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation

 

SEDiff

SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models

 

DepathAnyVideo

Depth Any Video with Scalable Synthetic Data

 

Optical Flow

FlowDiffuser

FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models

 

Segmentation

DFormer

DFormer: Diffusion-guided Transformer for Universal Image Segmentation

 

OVDiff

Diffusion Models for Zero-Shot Open-Vocabulary Segmentation

 

GCDP

Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis

  1. The image and its segmentation are concatenated along the channel dimension as one data sample to train a text-guided diffusion model with a new Gaussian-Categorical formulation, enabling joint generation of image and segmentation from text as well as generation of either from the other.

 

SemFlow

SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow

  1. We train an ordinary differential equation (ODE) model to transport between the distributions of real images and semantic masks.

 

UniGS

UniGS: Unified Representation for Image Generation and Segmentation

 

DiffDASS

Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation

Domain-adaptive semantic segmentation: transfers a segmentation model via image translation.

Requires source-domain images with segmentation maps and target-domain images.

Train a segmentation model on the source-domain images and maps, and a diffusion model on the target-domain images. Apply SDEdit to a source-domain image, correcting the generation with the gradient of a loss between the segmentation model's prediction and the ground-truth map, to produce the target-domain image corresponding to that map. These pairs then fine-tune the source-domain segmentation model into a target-domain one.

 

DGInStyle

DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

 

LDMSeg

A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting

LDMSeg

 

pix2gestalt

pix2gestalt: Amodal Segmentation by Synthesizing Wholes

 

 

Correspondence

DiffMatch

Diffusion Model for Dense Matching

 

Caption

CLIP-Diffusion-LM

Apply Diffusion Model on Image Captioning

 

DiffCap

DiffCap: Exploring Continuous Diffusion on Image Captioning

 

Text-only Image Captioning

Text-Only Image Captioning with Multi-Context Data Generation

 

Prefix-Diffusion

Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning

 

LaDiC

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

LaDiC

 

Visual Grounding

PVD

Parallel Vertex Diffusion for Unified Visual Grounding

 

DiffusionVG

Language-Guided Diffusion Model for Visual Grounding

 

DiffusionVG

Exploring Iterative Refinement with Diffusion Models for Video Grounding

 

Visual Prediction

DDP

DDP: Diffusion Model for Dense Visual Prediction

 

Action Anticipation

DIFFANT

DIFFANT: Diffusion Models for Action Anticipation

 

 

Temporal Action Detection

DiffTAD

DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion

 

EffiDiffAct

Faster Diffusion Action Segmentation

 

ActFusion

ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

 

Object Tracking

DiffusionTrack

DiffusionTrack: Diffusion Model For Multi-Object Tracking

 

DINTR

DINTR: Tracking via Diffusion-based Interpolation

 

DiffusionTrack

DiffusionTrack: Point Set Diffusion Model for Visual Object Tracking

 

DeTrack

DeTrack: In-model Latent Denoising Learning for Visual Object Tracking

 

Video Moment Retrieval

MomentDiff

MomentDiff: Generative Video Moment Retrieval from Random to Real

 

Video Question Answering

DiffAns

Conditional Diffusion Model for Open-ended Video Question Answering

 

Sound Event Detection

DiffSED

DiffSED: Sound Event Detection with Denoising Diffusion

 

Knowledge Distillation

DM-KD

Is Synthetic Data From Diffusion Models Ready for Knowledge Distillation?

Images generated by a diffusion model are used as the training set: they are fed to a pre-trained teacher network, which is distilled into a student network. This removes the dependence on real datasets and works well; interestingly, low-fidelity generated images (e.g., produced with fewer sampling steps) work even better.

 

DiffKD

Knowledge Diffusion for Distillation

A diffusion model is trained on features extracted by the teacher network. The student's features are treated as noisy versions of the teacher's features and denoised by this diffusion model; a KL loss between the denoised student features and the teacher features is then used to optimize the student (see the sketch below).
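A small sketch of the feature-denoising idea above; the single-step denoising, the shared timestep t, and the use of MSE in place of the KL term are simplifying assumptions, and `feat_diffusion` is a placeholder noise-prediction network.

```python
import torch
import torch.nn.functional as F

def diffkd_losses(feat_teacher, feat_student, feat_diffusion, alphas_cumprod, t):
    """feat_diffusion: a small noise-prediction network trained on teacher features."""
    a_bar = alphas_cumprod[t]

    # (a) Diffusion loss: train the feature diffusion model on clean teacher features.
    noise = torch.randn_like(feat_teacher)
    noisy_teacher = a_bar.sqrt() * feat_teacher + (1 - a_bar).sqrt() * noise
    diff_loss = F.mse_loss(feat_diffusion(noisy_teacher, t), noise)

    # (b) Treat the student feature as a noisy teacher feature and denoise it once.
    eps = feat_diffusion(feat_student, t)
    denoised_student = (feat_student - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()

    # (c) Distillation loss between denoised student and teacher features
    #     (the note mentions a KL loss; plain MSE is used here for brevity).
    kd_loss = F.mse_loss(denoised_student, feat_teacher)
    return diff_loss, kd_loss
```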

 

 

Data Attribution

Data attribution: for a generated image, which training samples contributed most to it?

Evaluating Data Attribution for Text-to-Image Models

Intriguing Properties of Data Attribution on Diffusion Models

Detecting Image Attribution for Text-to-Image Diffusion Models in RGB and Beyond

Efficient Shapley Values for Attributing Global Properties of Diffusion Models to Data Group

SemGIR: Semantic-Guided Image Regeneration based method for AI-generated Image Detection and Attribution

MONTRAGE: Monitoring Training for Attribution of Generative Diffusion Models

Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Model

 

Dataset Distillation

LD3M

Latent Dataset Distillation with Diffusion Models

  1. Dataset distillation aims to generate a small set of representative synthetic samples from the original training set.

 

D4M

D4M: Dataset Distillation via Disentangled Diffusion Model

 

Image Quality Assessment

PFD-IQA

Feature Denoising Diffusion Model for Blind Image Quality Assessment

 

eDifFIQA

eDifFIQA: Towards Efficient Face Image Quality Assessment Based On Denoising Diffusion Probabilistic Models

 

DP-IQA

DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild

 

NR-IQA

Comparison of No-Reference Image Quality Models via MAP Estimation in Diffusion Latents

 

Generative Understanding

Using pre-trained generative networks to assist understanding models, or using generated data to improve them.

 

hybrid

DatasetDM

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

DatasetDM

 

CleanDIFT

CleanDIFT: Diffusion Features without Noise

CleanDIFT

 

AnySynth

AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks

AnySynth

 

DMaaPx

Upgrading VAE Training With Unlimited Data Plans Provided by Diffusion Models

 

DMP

Exploiting Diffusion Prior for Generalizable Pixel-Level Semantic Prediction

 

Syn-Rep-Learn

Scaling Laws of Synthetic Images for Model Training

 

Vermouth

Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors

Vermouth

  1. To effectively transfer learned features to discriminative tasks while ensuring compatibility, an intuitive approach is to introduce the prior knowledge of a recognition model: a pre-trained ResNet-18 provides a discriminative prior F_exp, and since ResNet naturally produces multi-resolution features, they can be concatenated with the UNet features.

  2. The U-head has two flows: a down-sampling flow that produces global features for tasks such as classification, and an up-sampling flow that produces spatial features for tasks such as segmentation.

 

SDP

Scaling Properties of Diffusion Models for Perceptual Tasks

 

Diff-2-in-1

Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

 

GDF

Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features

 

GATE

Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques

 

GenPercept

Diffusion Models Trained with Large Data Are Transferable Visual Models

  1. We show that, simply initializing image understanding models using a pre-trained UNet (or transformer) of diffusion models, it is possible to achieve remarkable transferable performance on fundamental vision perception tasks using a moderate amount of target data.

  2. In practice, the original image is fed into a pre-trained diffusion model with the timestep fixed to 1, and the diffusion model is fine-tuned to directly predict the target, e.g., depth (see the sketch below).
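A minimal sketch of this one-step fine-tuning recipe, assuming a latent-space diffusion model with a frozen VAE; the function names and the L1 objective are placeholders, not the GenPercept implementation.

```python
import torch
import torch.nn.functional as F

def one_step_perception_loss(unet, vae_encode, vae_decode, image, target, t_fixed=1):
    """Fine-tune a pre-trained diffusion UNet to regress a dense target (e.g. depth)
    from the clean image in a single pass at a fixed, near-zero timestep."""
    z = vae_encode(image)              # clean latent; no noise is added
    pred_latent = unet(z, t_fixed)     # one deterministic forward pass
    pred = vae_decode(pred_latent)     # decode the prediction back to pixel space
    return F.l1_loss(pred, target)
```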

 

AddSD

Add-SD: Rational Generation without Manual Reference

  1. A diffusion model is used to edit images by adding objects, alleviating long-tailed class distributions for downstream classification, segmentation, and detection tasks.

 

 

Data Mining

Diffusion Models as Data Mining Tools

 

Classification

Diffusion Classification

Diffusion Models Beat GANs on Image Classification

UNet feature + classification head

 

FGDS

Feedback-Guided Data Synthesis for Imbalanced Classification

FGDS

 

Analyzing and Explaining Image Classifiers via Diffusion Guidance

 

Diversify, Don’t Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

 

Active Generation for Image Classification

 

Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model

 

Efficient Exploration of Image Classifier Failures with Bayesian Optimization and Text-to-Image Models

 

Image Retrieval

Zero-Shot Sketch-based Image Retrieval

Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers

 

Object Detection

DiffusionEngine

DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection

  1. Uses StableDiffusion to manufacture detection data, similar to Attention as Annotation.

  2. First, existing detection images are noised by a single step and fed into StableDiffusion to train a Detection Adaptor that predicts bounding boxes from the UNet feature-map pyramid. The Detection Adaptor is then frozen; with simple generic prompts, the existing detection images are noised and regenerated (similar to SDEdit), the feature-map pyramid of the final step is fed to the Detection Adaptor, and its output serves as the bounding-box annotation for the generated image.

 

T2I-for-Detection

Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation

Uses StableDiffusion to create detection data by generating foreground and background separately and then compositing them.

 

Data Augmentation for Object Detection via Controllable Diffusion Models

 

Learning Compositional Language-based Object Detection with Diffusion-based Synthetic Data

 

3DiffTection

3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features

 

DetDiffusion

DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception

 

DDT

Diffusion Domain Teacher: Diffusion Guided Domain Adaptive Object Detector

 

NADA

No Annotations for Object Detection in Art through Stable Diffusion

 

Sketch

DiffSketch

Representative Feature Extraction During Diffusion Process for Sketch Extraction with One Example

 

Depth and Saliency

Diffusion Scene Representation

Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model

  1. A pre-trained network is used to annotate depth and saliency labels for images generated by StableDiffusion.

  2. Extract the intermediate output of some self-attention layer at some sampling step. Interpolate lower resolution predictions to the size of synthesized images. A linear classifier is trained on it to predict the pixel-level logits.

  3. StableDiffusion plus the linear classifier can then be used for prediction (see the probe sketch below).
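A sketch of the linear-probe step described above; how the intermediate self-attention output is hooked out of StableDiffusion is left abstract, and the bilinear upsampling plus per-pixel linear layer follow my reading of point 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelLinearProbe(nn.Module):
    """Per-pixel linear classifier over an upsampled diffusion feature map."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(feat_dim, num_classes)

    def forward(self, feats, out_hw):
        # feats: (B, C, h, w) intermediate self-attention output at some sampling step
        feats = F.interpolate(feats, size=out_hw, mode="bilinear", align_corners=False)
        feats = feats.permute(0, 2, 3, 1)   # (B, H, W, C): one feature vector per pixel
        return self.linear(feats)           # (B, H, W, num_classes) pixel-level logits
```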

 

JointNet

JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling

JointNet

  1. a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps).

 

ECoDepth

ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation

ECoDepth

 

PrimeDepth

PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage

 

Segmentation

DDPMSeg

Label-Efficient Semantic Segmentation With Diffusion Models

  1. The input image is noised and fed into the UNet of a pre-trained DDPM. Feature maps from several decoder layers are upsampled to the image size and concatenated, giving one feature vector per pixel, which is fed into an MLP trained to predict that pixel's label.

  2. Empirically, decoder blocks B = {5, 6, 7, 8, 12} and noised inputs at t = {50, 150, 250} are selected and all concatenated; several independent MLPs are trained, and prediction is made by majority vote (see the sketch below).
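A hedged sketch of the feature extraction and MLP-ensemble voting above; the block and timestep indices follow the note, while the feature-hook interface `unet_features` and the bilinear upsampling are assumptions.

```python
import torch
import torch.nn.functional as F

BLOCKS = (5, 6, 7, 8, 12)        # decoder blocks used in the note
TIMESTEPS = (50, 150, 250)       # noise levels used in the note

def pixel_features(unet_features, image, alphas_cumprod, out_hw):
    """unet_features(x_t, t) -> dict {block_idx: (B, C_b, h, w)} of decoder activations."""
    feats = []
    for t in TIMESTEPS:
        a_bar = alphas_cumprod[t]
        x_t = a_bar.sqrt() * image + (1 - a_bar).sqrt() * torch.randn_like(image)
        acts = unet_features(x_t, t)
        for b in BLOCKS:
            feats.append(F.interpolate(acts[b], size=out_hw, mode="bilinear", align_corners=False))
    return torch.cat(feats, dim=1)                            # (B, sum of channels, H, W)

def ensemble_predict(mlps, feats):
    """Majority vote over independently trained per-pixel MLPs."""
    B, C, H, W = feats.shape
    x = feats.permute(0, 2, 3, 1).reshape(-1, C)              # one vector per pixel
    votes = torch.stack([mlp(x).argmax(dim=-1) for mlp in mlps])
    return votes.mode(dim=0).values.view(B, H, W)             # per-pixel majority class
```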

 

EmerDiff

EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models

  1. We leverage the semantic knowledge extracted from Stable Diffusion (SD) and aim to develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.

 

MaskDiffusion

MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation

MaskDiffusion-seg

 

OVAM

Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models

OVAM-1

  1. Segmentation is estimated from the cross-attention maps of the attribution prompt, aggregated over multiple timesteps and layers.

OVAM-2

  1. The attribution prompt is not necessarily the best description; borrowing the idea of Textual Inversion, a small amount of data can be used for token optimization, i.e., optimizing the token embedding of the attribution prompt, which works noticeably better.

 

DiG

Diffusion-Guided Weakly Supervised Semantic Segmentation

 

FreeSeg-Diff

FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

FreeSeg-Diff

  1. training-free

 

DatasetDiffusion

Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation

  1. The refined map is A_s^τ · A_c, where τ is an exponent used to sharpen the self-attention map A_s, which in turn enhances the cross-attention map A_c. This works like a convolution: each pixel's self-attention map (of size H×W) acts as a kernel and the cross-attention map as the feature; when a pixel's self-attention map and the cross-attention map both respond strongly in the same region, that pixel obtains a high value in A_s^τ · A_c (see the sketch below).
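A small sketch of the A_s^τ · A_c refinement; τ is interpreted here as a matrix power of the (row-stochastic) self-attention map, which matches the convolution analogy in the note, but the exact normalization is an assumption.

```python
import torch

def refine_cross_attention(A_s, A_c, tau=4):
    """A_s: (HW, HW) self-attention (rows sum to 1).  A_c: (HW, K) cross-attention over K tokens."""
    A_pow = torch.linalg.matrix_power(A_s, tau)   # propagate pixel affinity over tau hops
    refined = A_pow @ A_c                         # each pixel pools cross-attention from similar pixels
    return refined / refined.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```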

 

MADM

Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation

 

DIFF

Diffusion Features to Bridge Domain Gap for Semantic Segmentation

DIFF

 

VPD

Unleashing Text-to-Image Diffusion Models for Visual Perception

  1. cross-attention map

 

EVP

EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment

  1. enhanced VPD

 

Meta-Prompt

Harnessing Diffusion Models for Visual Perception with Meta Prompts

 

ODISE

Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

ODISE

 

DiffSeg

Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

  1. DiffSeg utilizes a pre-trained StableDiffusion model and specifically its self-attention layers to produce high quality segmentation masks.

 

DiffSegmenter

Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter

  1. A prompt is designed and fed into StableDiffusion together with the image; the cross-attention map of a chosen word gives a rough segmentation of that object, which is then refined and completed using the self-attention maps.

 

AaA

Attention as Annotation: Generating Images and Pseudo-masks for Weakly Supervised Semantic Segmentation with Diffusion

  1. Instead of relying on manual annotation, StableDiffusion is used to generate a large number of images with pseudo-masks (from cross-attention maps) to train a segmentation model.

 

MaskFactory

MaskFactory: Towards High-quality Synthetic Data Generation for Dichotomous Image Segmentation

MaskFactory

 

SegGen

SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis

  1. Using an existing segmentation dataset (image, segmentation map), the segmentation map is encoded as a three-channel image, BLIP2 produces a caption for each image, and SDXL is fine-tuned on (caption, segmentation map) pairs, yielding a Text2Mask model.

  2. A ControlNet is trained on (image, segmentation map) pairs, yielding a Mask2Img model.

  3. The two networks can then generate new segmentation training data: take an image from the existing dataset, caption it with BLIP2, feed the caption to the Text2Mask model to obtain a set of segmentation maps, and feed those to the Mask2Img model to obtain corresponding images, forming new data pairs.

  4. For the same segmentation model, training on the existing dataset plus the generated data yields a clear improvement over training on the existing dataset alone.

 

FoBaDiffusion

Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models

 

SSSS

Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation

  1. A ControlNet is trained on scribbles and used to generate segmentation training data.

 

Outline

Outline-Guided Object Inpainting with Diffusion Models

  1. Starting from a small amount of instance segmentation data, StableInpainting is used to create object variations of these samples, augmenting the dataset.

 

LDM-Seg

Explore In-Context Segmentation via Latent Diffusion Models

 

ScribbleGen

ScribbleGen: Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation

ScribbleGen

 

Grounding

Peekaboo

Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

Given an image and a text description of an object in it, the pre-trained text-to-image model StableDiffusion is used to predict the mask of the target object.

 

Grounded Diffusion

Guiding Text-to-Image Diffusion Model Towards Grounded Generation

With the pre-trained text-to-image model StableDiffusion, the model outputs the image for a given text together with the corresponding segmentation mask.

Images are first generated with StableDiffusion, then a pre-trained object detector produces segmentation masks for them, building a dataset on which a grounding module is trained, in a manner similar to Label-Efficient Semantic Segmentation With Diffusion Models.

grounding

GenPromp

Generative Prompt Model for Weakly Supervised Object Localization

 

DiffPNG

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

 

Semantic Correspondence

SD-DINO

A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

  1. exploit Stable Diffusion features for semantic and dense correspondence

 

DIFT

Emergent Correspondence from Image Diffusion

  1. No training is needed; Stable Diffusion features can be used directly for matching.

 

SD4Match

SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching

SD4Match

  1. prompt tuning

 

Diffusion-Hyperfeatures

Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence

Diffusion-Hyperfeatures

  1. For a given feature map r we upsample it to a standard resolution, pass it through a bottleneck layer B to a standard channel count, and weight it with a mixing weight ω. The final descriptor map is Σ_{s=0}^{S} Σ_{l=1}^{L} ω_{s,l} B_l(r_{s,l}), where S is the number of DDIM generation or inversion steps, L the number of UNet layers, B_l a trainable bottleneck shared across timesteps, and ω_{s,l} a trainable weight (see the sketch after this list).

  2. DDIM generation and DDIM inversion give similar results, so the method applies to both synthetic and real images.

  3. For semantic correspondence, we flatten the descriptor maps for a pair of images and compute the cosine similarity between every possible pair of points. We then supervise with the labeled corresponding keypoints using a symmetric cross entropy loss in the same fashion as CLIP.
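A sketch of the descriptor aggregation Σ_{s,l} ω_{s,l} B_l(r_{s,l}) from point 1; 1×1-conv bottlenecks shared across timesteps and raw (unnormalized) mixing weights are assumptions about details the note does not pin down.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperfeatureAggregator(nn.Module):
    def __init__(self, in_channels_per_layer, out_channels, num_steps):
        super().__init__()
        # B_l: one bottleneck per UNet layer, shared across all timesteps
        self.bottlenecks = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_per_layer])
        # omega_{s,l}: one scalar mixing weight per (step, layer)
        self.mix = nn.Parameter(torch.zeros(num_steps, len(in_channels_per_layer)))

    def forward(self, feats, out_hw):
        # feats[s][l]: raw feature map r_{s,l} of shape (B, C_l, h_l, w_l)
        D = 0.0
        for s, per_layer in enumerate(feats):
            for l, r in enumerate(per_layer):
                r = F.interpolate(r, size=out_hw, mode="bilinear", align_corners=False)
                D = D + self.mix[s, l] * self.bottlenecks[l](r)
        return D  # descriptor map (B, out_channels, H, W); flatten and compare with cosine similarity
```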

 

DiffGlue

DiffGlue: Diffusion-Aided Image Feature Matching

 

VLM

SynthVLM

SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models

  1. SynthVLM is a novel data synthesis pipeline for VLLMs.

  2. Unlike existing methods that generate captions from images, SynthVLM employs advanced diffusion models and high-quality captions to automatically generate and select high-resolution images from captions, creating precisely aligned image-text pairs.

 

Multi-Object Tracking

TrackDiffusion

TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models

A layout-to-image model generates video sequences from tracklets to serve as training data for multi-object tracking.

 

DiffMOT

DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction

 

Human-Object-Interaction

CycleHOI

CycleHOI: Improving Human-Object Interaction Detection with Cycle Consistency of Detection and Generation

CycleHOI

 

Unifying Generative and Understanding

EGC

EGC: Image Generation and Classification via a Diffusion Energy-Based Model

Models an energy function; optimization requires second-order derivatives.

Similar to Denoising Likelihood Score Matching for Conditional Score-based Data Generation.

 

DiffDis

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

Unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.

 

Factorized Diffusion

Factorized Diffusion Architectures for Unsupervised Image Generation and Segmentation

A model unifying generation and understanding, similar to MAGE.

A copy of the UNet decoder serves as a Mask Generator that produces K masks at every step, each representing one segmentation region. Each mask is multiplied onto the encoder's skip-connection features; the original UNet decoder then outputs K predicted noises from the masked skip features, each re-multiplied by its mask, and the sum of the K masked predictions is the final predicted noise used for the diffusion loss. At generation time the image and its segmentation are therefore produced together (a schematic forward pass is sketched below).

Real-image segmentation is also possible: add noise for one step and denoise for one step.
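A schematic forward pass for the mask-factorized UNet described above; `encoder`, `mask_decoder`, and `noise_decoder` are placeholder modules, and the softmax over the K mask channels is an assumption about how the masks are normalized.

```python
import torch
import torch.nn.functional as F

def resize_to(m, feat):
    return F.interpolate(m, size=feat.shape[-2:], mode="bilinear", align_corners=False)

def factorized_denoise(encoder, mask_decoder, noise_decoder, x_t, t, K):
    skips = encoder(x_t, t)                          # list of skip-connection feature maps
    masks = mask_decoder(skips, t).softmax(dim=1)    # (B, K, H, W): one channel per region
    eps = 0.0
    for k in range(K):
        m = masks[:, k:k + 1]                        # (B, 1, H, W)
        masked_skips = [s * resize_to(m, s) for s in skips]
        eps_k = noise_decoder(masked_skips, t)       # noise predicted from region k only
        eps = eps + m * eps_k                        # re-masked and summed over the K regions
    return eps, masks                                # eps -> diffusion loss; masks -> segmentation
```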

 

 

Other Interesting Paper

UnseenDiffusion

Unseen Image Synthesis with Diffusion Models

Generates out-of-domain samples with a diffusion model pre-trained on some in-domain data.

About 2k OOD samples are DDIM-inverted to step 500, giving 2k latents x_500; their mean and variance are computed, a sample is drawn from this Gaussian, and generation proceeds from there, producing new OOD samples (see the sketch below).
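A sketch of the recipe above; `ddim_invert` and `ddim_generate` stand for deterministic DDIM inversion/generation helpers (assumed, not library calls), and the diagonal Gaussian fit follows the note.

```python
import torch

def fit_unseen_latent_gaussian(ood_images, ddim_invert, t_mid=500):
    """DDIM-invert out-of-domain images up to step t_mid and fit a diagonal Gaussian to x_{t_mid}."""
    xs = torch.stack([ddim_invert(x, t_mid) for x in ood_images])  # (N, C, H, W)
    return xs.mean(dim=0), xs.std(dim=0)

def sample_unseen(mean, std, ddim_generate, t_mid=500, n=4):
    x_mid = mean + std * torch.randn(n, *mean.shape)  # draw x_{t_mid} from the fitted Gaussian
    return ddim_generate(x_mid, t_mid)                # finish the reverse process from step t_mid
```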

 

IMPUS

IMPUS: Image Morphing with Perceptually-Uniform Sampling Using Diffusion Models

Goal: given two images, interpolate between them.

  1. Run Textual Inversion (TI) on SD separately for the two images to obtain their text embeddings.

  2. LoRA fine-tune SD with the two text embeddings above.

  3. LoRA fine-tune the SD parameters ϕ.

  4. Interpolate the text embeddings and generate with CFG.

 

DiffMorpher

DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing

DiffMorpher

 

DreamMover

DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion

  1. Generates an interpolation sequence (a video).

 

AID

AID: Attention Interpolation of Text-to-Image Diffusion

  1. The cross-attention keys and values from the two endpoint generation processes are interpolated and substituted for the cross-attention K/V in the generation pass of the current interpolation point (see the sketch below).
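A minimal illustration of the attention-level interpolation in the note; a plain linear blend of the keys and values is shown, while hooking these tensors into the cross-attention layers of the interpolated generation pass is left to surrounding code.

```python
def interpolate_kv(k_a, v_a, k_b, v_b, alpha):
    """Blend cross-attention keys/values captured from the two endpoint generations;
    the result replaces the K/V of the interpolated sample's generation pass."""
    k = (1.0 - alpha) * k_a + alpha * k_b
    v = (1.0 - alpha) * v_a + alpha * v_b
    return k, v
```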

 

NoiseDiffusion

NoiseDiffusion: Correcting Noise for Image Interpolation with Diffusion Models beyond Spherical Linear Interpolation

For images generated by a diffusion model, DDIM Inversion followed by slerp interpolation works well, but it performs poorly on real images; correcting the noise with a few techniques resolves this.

 

BlackScholesDiffusion

Prompt Mixing in Diffusion Models using the Black Scholes Algorithm

  1. Generation by prompt interpolation.

 

Concept-centric Personalization

Concept-centric Personalization with Large-scale Diffusion Priors

A new task: personalizing StableDiffusion into a model specialized in generating images of one concept. Unlike TI, the task targets a more abstract concept (e.g., human faces) rather than the concept in a few reference images, and emphasizes fidelity and diversity in the generative results, so at least a few thousand images of the concept are required.

The approach decouples the concept from other control conditions: StableDiffusion is fine-tuned on the provided concept dataset using only null text, giving a concept-centric diffusion model; generation uses CFG, and other controls such as text and ControlNet can also be introduced through CFG.

Concept-centric-Personalization

Neural Network Diffusion

Neural Network Diffusion

parameter autoencoder + latent diffusion model

Diffusion-based Neural Network Weights Generation

Conditional LoRA Parameter Generation

 

FineDiffusion

FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes

FineDiffusion

  1. Fine-tunes large pre-trained diffusion models, scaling them to large-scale fine-grained image generation with 10,000 categories.

  2. A new CFG variant: during training and sampling, the superclass label embedding replaces the null embedding (see the sketch below).
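A one-line sketch of the modified CFG: the superclass-conditioned prediction plays the role normally taken by the null-embedding (unconditional) branch; `model` and the embedding arguments are placeholders.

```python
def superclass_cfg(model, x_t, t, emb_fine, emb_super, w):
    eps_fine = model(x_t, t, emb_fine)    # conditioned on the fine-grained class embedding
    eps_super = model(x_t, t, emb_super)  # superclass embedding replaces the null embedding
    return eps_super + w * (eps_fine - eps_super)
```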

 

FactorizedDiffusion

Factorized Diffusion: Perceptual Illusions by Noise Decomposition

FactorizedDiffusion

  1. Generates perceptual illusions.

  2. Using different decomposition methods (e.g., high/low frequency, color, motion), the image x is decomposed into components f_i(x), i.e., x = Σ_i f_i(x). During sampling, x_t is predicted under different prompts, the corresponding component is extracted from each prediction, and the components wanted from the different prompts are summed to obtain the final combined noise prediction ε̃ (see the sketch below).
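A sketch of the hybrid-noise composition for a two-prompt low/high-frequency illusion; a Gaussian blur stands in for the paper's frequency decomposition, and the kernel size and sigma are arbitrary choices.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(x, k=9, sigma=3.0):
    """Depthwise Gaussian blur used as a crude low-pass decomposition."""
    coords = torch.arange(k, dtype=x.dtype, device=x.device) - (k - 1) / 2
    g = torch.exp(-coords**2 / (2 * sigma**2))
    g = g / g.sum()
    kernel = (g[:, None] * g[None, :]).view(1, 1, k, k).repeat(x.shape[1], 1, 1, 1)
    return F.conv2d(x, kernel, padding=k // 2, groups=x.shape[1])

def hybrid_epsilon(model, x_t, t, prompt_a, prompt_b):
    eps_a = model(x_t, t, prompt_a)        # prediction under prompt A
    eps_b = model(x_t, t, prompt_b)        # prediction under prompt B
    low = gaussian_blur(eps_a)             # keep the low-frequency component from prompt A
    high = eps_b - gaussian_blur(eps_b)    # keep the high-frequency component from prompt B
    return low + high                      # combined noise prediction used for the sampling step
```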