Oğuzhan Ercan
x.com/oguzhannercan
Diffusion and Flow Models
Control, Optimization, Quality Enhancement and more…
Topics
Topics discussed
- Diffusion Models
- Diffusion Architectures
- Swap-Inpainting Models
- Output Control Techniques
- Inference Time Optimization
- Quality Enhancement
- Video Generation
- Face Models
Prerequisites
Probability
Statistics
Linear Algebra
Calculus
Deep Learning
Differential Equations
Diffusion Models
Diffusion models consist of two interconnected processes: forward and backward. The forward diffusion process gradually corrupts the data by interpolating between a sampled data point x0 and Gaussian noise. (The information here is mostly taken from the Imagine Flash paper: https://arxiv.org/pdf/2405.05224.) The forward process is formulated below.
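In the Imagine Flash notation, a standard form of this stochastic interpolant is:

```latex
x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
```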
where α_t and σ_t define the signal-to-noise ratio (SNR) of the stochastic interpolant x_t. In the following, they opt for coefficients (α_t, σ_t) that result in a variance-preserving process. When viewed in the continuous-time limit, the forward process above can be expressed as a stochastic differential equation.
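In standard score-based notation, this forward SDE reads:

```latex
\mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t
```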
where f(x, t): R^d → R^d is a vector-valued drift coefficient, g(t): R → R is the diffusion coefficient, and w_t denotes Brownian motion at time t.
Inversely, the backward diffusion process is intended to undo the noising process and generate samples. According to Anderson's theorem, the forward SDE introduced above satisfies a reverse-time diffusion equation, which can be reformulated using the Fokker-Planck equations into a deterministic counterpart with equivalent marginal probability densities, known as the probability flow ODE.
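In the usual notation, the reverse-time SDE and its deterministic probability flow ODE counterpart are:

```latex
\mathrm{d}x_t = \left[ f(x_t, t) - g(t)^2 \nabla_x \log p_t(x_t) \right] \mathrm{d}t + g(t)\, \mathrm{d}\bar{w}_t \qquad \text{(reverse SDE)}
\mathrm{d}x_t = \left[ f(x_t, t) - \tfrac{1}{2} g(t)^2 \nabla_x \log p_t(x_t) \right] \mathrm{d}t \qquad \text{(probability flow ODE)}
```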
This allows estimating the score in the formulation above, usually parameterized by a time-conditioned neural network. Given these estimates, one can sample with an iterative numerical solver; first-order solvers such as DDIM update the sample using the data estimate x̂0 at time-step t.
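A standard deterministic DDIM-style update in this (α, σ) notation, together with the data estimate it relies on, is:

```latex
x_{t-1} = \alpha_{t-1}\, \hat{x}_0 + \sigma_{t-1}\, \hat{\epsilon}_\Theta(x_t, t),
\qquad
\hat{x}_0 = \frac{x_t - \sigma_t\, \hat{\epsilon}_\Theta(x_t, t)}{\alpha_t}
```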
Diffusion Architectures
The following slides include:
-Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack 27 September 2023
-Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding 14
May 2024
-Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion
Transformers 9 May 2024
- Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis 29 Dec 2023
- And more
Emu: Enhancing Image Generation Models Using Photogenic Needles in
a Haystack 27 September 2023 ( Image quality should always be prioritized over quantity.)
Their key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve generation quality. Effective fine-tuning of LLMs can be achieved with a relatively small but high-quality fine-tuning dataset, e.g., using 27K prompts.
They increase the number of channels of the autoencoder from 4 to a higher dimension. They use an additional adversarial loss for reconstruction, and they also apply a non-learnable preprocessing step to the RGB images, using a Fourier feature transform to lift the input channel dimension.
They use a large U-Net with 2.8B trainable parameters. They increase the channel size and the number of stacked residual blocks in each stage for larger model capacity. They use text embeddings from both CLIP ViT-L and T5-XXL as the text conditions.
They pre-train the model on 1.1B images, training with progressively increasing resolutions. This approach improves finer details at higher resolutions.
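A minimal sketch of such a non-learnable Fourier feature lift (the number of frequencies and their values are illustrative assumptions, not Emu's actual choices):

```python
import torch

def fourier_lift(rgb: torch.Tensor, num_freqs: int = 4) -> torch.Tensor:
    """Lift an RGB image (B, 3, H, W) to a higher channel dimension with fixed sinusoidal features."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=rgb.dtype, device=rgb.device)  # 1, 2, 4, 8
    feats = [rgb]
    for f in freqs:
        feats.append(torch.sin(2 * torch.pi * f * rgb))
        feats.append(torch.cos(2 * torch.pi * f * rgb))
    return torch.cat(feats, dim=1)  # (B, 3 + 3 * 2 * num_freqs, H, W)

x = torch.rand(1, 3, 64, 64)
print(fourier_lift(x).shape)  # torch.Size([1, 27, 64, 64])
```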
https://arxiv.org/pdf/2309.15807
DiT is based on the Vision Transformer (ViT) architecture, which operates on sequences of patches. The first layer of DiT is "patchify", which converts the spatial input into a sequence of T tokens, each of dimension d, by linearly embedding each patch of the input. Following patchify, frequency-based positional embeddings are applied. The number of tokens is determined by the patch size p; note that changing p has no meaningful impact on downstream parameter counts.
Following patchify, the input tokens are processed by a sequence of transformer blocks. In addition to noised image inputs, diffusion models sometimes process additional conditional information such as noise timesteps t, class labels c, or natural language. There are four variants of conditioning. In-context conditioning simply appends the vector embeddings of t and c as two additional tokens to the input sequence; this introduces negligible new Gflops to the model. The cross-attention block concatenates the embeddings of t and c into a length-two sequence, separate from the image token sequence, and the transformer block is modified to include an additional multi-head cross-attention layer following the multi-head self-attention block; cross-attention adds the most Gflops to the model, roughly a 15% overhead. Adaptive layer norm: they explore replacing standard layer norm layers in transformer blocks with adaptive layer norm (adaLN). Rather than directly learning dimension-wise scale and shift parameters γ and β, they regress them from the sum of the embedding vectors of t and c; adaLN adds the least Gflops and is thus the most compute-efficient. They also find that a modification of the adaLN DiT block that zero-initializes the block's residual contributions (adaLN-Zero) accelerates large-scale training.
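A minimal sketch of the adaLN-Zero modulation described above, assuming a simplified block (the internals are stand-ins; only the regressed scale/shift/gate pattern and the zero initialization follow the description):

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Simplified DiT block: scale, shift, and gate are regressed from the t + c embedding."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress (shift, scale, gate) x 2 from the conditioning; zero init => the block starts as identity.
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, cond):  # x: (B, T, d), cond: (B, d) = emb(t) + emb(c)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.mlp(h)

block = AdaLNZeroBlock(dim=64)
out = block(torch.randn(2, 16, 64), torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```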
Scalable Diffusion Models with Transformers 2 Mar 2023
https://arxiv.org/pdf/2212.09748
They used two text encoders, a bilingual T5 and CLIP. They perform post-training optimization in the inference stage to lower the deployment cost of Hunyuan-DiT. They used the VAE from SDXL, which was fine-tuned on 512x512 images starting from the SD1.5 VAE. They say the SDXL VAE improves clarity, alleviates over-saturation, and reduces distortions.
They found that the Adaptive Layer Norm used in class-conditional DiT performs unsatisfactorily at enforcing fine-grained text conditions, so they used cross-attention. Hunyuan-DiT has two types of transformer blocks, the encoder block and the decoder block. Both of them contain three modules: self-attention, cross-attention, and a feed-forward network (FFN). The text information is fused in the cross-attention module. The decoder block additionally contains a skip module, which adds the information from the encoder block in the decoding stage. The skip module is similar to the long skip-connections in U-Nets, but there are no upsampling or downsampling modules in Hunyuan-DiT due to the transformer structure. Finally, the tokens are reorganized to recover the two-dimensional spatial structure. For training, they find that v-prediction (predicting the velocity, i.e., the rate of change, instead of the noise) gives better empirical results.
They used two-dimensional Rotary Positional Embedding (RoPE) to encode both absolute position and relative position dependency. To generate images at multiple resolutions, they tried extended positional encoding and centralized interpolative positional encoding (CIPE). They find that CIPE converges faster and generalizes better.
To stabilize training, they used QK-Norm. They add layer normalization after the skip module in the decoder blocks to avoid loss explosion during training. They found that certain operations, e.g., layer normalization, tend to overflow with FP16, so they specifically switch them to FP32 to avoid numerical errors.
Due to the large number of model parameters in Hunyuan-DiT and the massive amount of image data required for training, they adopted ZeRO, flash-attention, multi-stream asynchronous execution, activation checkpointing, and kernel fusion to enhance training speed. Deploying Hunyuan-DiT for users is expensive, so they adopt multiple engineering optimization strategies to improve inference efficiency, including ONNX graph optimization, kernel optimization, operator fusion, precomputation, and GPU memory reuse.
They find that adversarial training tends to collapse, and the best way to accelerate the model at inference time is progressive distillation.
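For reference, the standard v-prediction target relates the velocity to the noise and the data as:

```latex
v_t = \alpha_t\, \epsilon - \sigma_t\, x_0
```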
Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with
Fine-Grained Chinese Understanding 14 May 2024
https://arxiv.org/pdf/2405.08748
Lumina-T2X: Transforming Text into Any Modality, Resolution, and
Duration via Flow-based Large Diffusion Transformers 9 May 2024
The Lumina-T2X family is a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, forming a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling Lumina-T2X models to scale up to 7 billion parameters and extend the context window to 128K tokens. Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational cost of a 600-million-parameter naive DiT (PixArt-α), indicating that increasing the number of parameters significantly accelerates convergence of generative models without compromising visual quality. Lumina-T2X tokenizes images, videos, multi-views of 3D objects, and spectrograms into one-dimensional sequences, similar to the way LLMs process natural language. By incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X can seamlessly encode any modality, regardless of resolution, aspect ratio, or temporal duration, into a unified 1-D token sequence. The empirical observations indicate that employing larger models, high-resolution images, and longer-duration video clips can significantly accelerate the convergence speed of diffusion transformers.
https://arxiv.org/pdf/2405.05945
PIXART-α: FAST TRAINING OF DIFFUSION TRANSFORMER FOR
PHOTOREALISTIC TEXT-TO-IMAGE SYNTHESIS 29 Dec 2023
They propose a way to train a text-to-image diffusion model at low computational cost (but still 753 A100 GPU days and $28,400). They decompose the intricate text-to-image generation task into three streamlined subtasks: (1) learning the pixel distribution of natural images (capturing pixel dependency), (2) learning text-image alignment, and (3) enhancing the aesthetic quality of images.
To generate captions with high information density, they leverage the state-of-the-art vision-language model LLaVA. Employing the prompt "Describe this image and its style in a very detailed manner", they significantly improve the quality of the captions.
Based on the Diffusion Transformer (DiT), they incorporate cross-attention modules to inject text conditions and streamline the computation-intensive class-condition branch to improve efficiency.
https://arxiv.org/pdf/2310.00426
PixArt-Σ is a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution, which evolves from a 'weaker' baseline to a 'stronger' model by incorporating higher-quality data, a process they term "weak-to-strong training". They also propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. To enhance PixArt-α, they expand its generation resolution from 1K to 4K. Generating images at high resolutions introduces a significant increase in the number of tokens, and thus in computational demand. They introduce a self-attention module with key and value token compression tailored to the DiT framework. Additionally, they employ a specialized weight-initialization scheme, allowing smooth adaptation from a pre-trained model without KV compression. This design effectively reduces training and inference time by 34% for high-resolution image generation. They utilize only 9% of the GPU days required by PixArt-α to achieve a strong 1K high-resolution image generation model. They replaced LLaVA with Share-Captioner to prevent hallucinations.
To mitigate the potential information loss caused by KV compression in the self-attention computation, they opt to retain all the query (Q) tokens. This strategy allows them to utilize KV compression effectively while mitigating the risk of losing crucial information. As the compression function, they utilize group convolutions with a stride of 2 for local aggregation of keys and values. They design a specialized convolution kernel initialization, "Conv Avg Init", that utilizes group convolution and initializes the weights as w = 1/R^2, equivalent to an average operator.
They replaced PixArt-α's VAE (the SD1.5 VAE) with the SDXL VAE and fine-tuned it; 2K training steps are enough for this fine-tuning. While fine-tuning the LR model to HR, they see a performance degradation caused by discrepancies in positional embeddings (PE) between different resolutions. To solve it, they initialized the HR model's PE by interpolating the LR model's PE; the fine-tuning then converges quickly, at around 1K steps. They can use KV compression directly when fine-tuning from LR pre-trained models without KV compression, and this reduces training and inference time by 34%.
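A minimal sketch of the KV token compression described above, assuming a compression ratio R = 2 and a depthwise group convolution with the "Conv Avg Init" averaging initialization (details of the grouping follow this description, not the released code):

```python
import torch
import torch.nn as nn

class KVCompressor(nn.Module):
    """Compress key/value tokens (B, H*W, C) by a factor R along each spatial axis
    using a depthwise (group) convolution initialized as an average operator."""
    def __init__(self, channels: int, ratio: int = 2):
        super().__init__()
        self.ratio = ratio
        self.conv = nn.Conv2d(channels, channels, kernel_size=ratio, stride=ratio, groups=channels)
        nn.init.constant_(self.conv.weight, 1.0 / ratio ** 2)  # "Conv Avg Init": acts as average pooling
        nn.init.zeros_(self.conv.bias)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)  # back to a 2-D grid
        x = self.conv(x)                                # (B, C, h/R, w/R)
        return x.flatten(2).transpose(1, 2)             # compressed KV tokens

kv = torch.randn(1, 32 * 32, 64)
print(KVCompressor(64).forward(kv, 32, 32).shape)  # torch.Size([1, 256, 64])
```

The query tokens are left untouched; only the key and value streams would pass through such a module before attention.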
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-
Image Generation 17 Mar 2024
https://arxiv.org/pdf/2403.04692
WÜRSTCHEN: AN EFFICIENT ARCHITECTURE FOR LARGE-SCALE
TEXT-TO-IMAGE DIFFUSION MODELS 29 SEP 2023
Würstchen is a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of their work is a latent diffusion technique in which they learn a detailed but extremely compact semantic image representation that is used to guide the diffusion process.
They first trained a VQGAN (Stage A), then a latent image decoder (Stage B), and then a text-conditional latent image generation model. For image generation, they first generate a latent image at a strong compression ratio using a text-conditional LDM (Stage C). Subsequently, this representation is transformed to a less-compressed latent space by means of a secondary model tasked with this reconstruction (Stage B). Finally, the tokens that comprise the latent image at this intermediate resolution are decoded to yield the output image (Stage A).
They initialized the Semantic Compressor with weights pre-trained on ImageNet, which, however, does not capture the broad distribution of images present in large text-image datasets and is not well-suited for semantic image projection, since it was trained with an objective to discriminate the ImageNet categories. So they update the weights of the Semantic Compressor during training, establishing a latent space with high-precision semantic information. During training of Stage B, they intermittently add noise to the Semantic Compressor's embeddings to teach the model to handle imperfect embeddings, which is likely the case when generating these embeddings with Stage C. For Stage C training, they follow a standard diffusion process, applied in the latent space of the fine-tuned Semantic Compressor.
https://arxiv.org/pdf/2306.00637
They introduce a novel architecture, Shallow-UViT, which allows pretraining the core layers of pixel-space diffusion models on huge datasets of text-image data, eliminating the need to train the entire model on high-resolution images. One can significantly improve different image-quality metrics by leveraging the representation pretrained at low resolution while growing the model resolution in a greedy fashion. They simplify the UNet's conventional hierarchical structure, which operates on multiple resolutions, and define the Shallow-UViT (SU), a simplified architecture comprising a shallow encoder and decoder operating on a fixed spatial grid.
Bad paper. Do not read it.
Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
28 May 2024
https://arxiv.org/pdf/2405.16759
They identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Their solution involves crafting a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder on a meticulously curated paired glyph-text dataset. They created high-quality text-glyph data by establishing a scalable pipeline capable of generating virtually unlimited paired data based on graphic rendering. They employed an innovative box-level contrastive loss to fine-tune ByT5 into a series of customized text encoders for glyph generation, named Glyph-ByT5, and then integrated it into SDXL using an efficient region-wise cross-attention mechanism. Besides single words, paragraph rendering is a challenging task, since a paragraph does not fit into a single line. They define a 'paragraph' as a block of text content that cannot be accommodated within a single line, typically consisting of more than 10 words or 100 characters. They empirically demonstrate that the diffusion model can effectively plan multi-line arrangements and adjust the line or word spacing according to the given text box, regardless of its size or aspect ratio. Unlike conventional CLIP, which applies contrastive loss to the entire image, they propose applying a box-level contrastive loss that treats each text box and its corresponding text prompt as an instance. Based on the number of characters or words within the text box, they categorize them into word, sentence, or paragraph text boxes.
For the box-level contrastive loss, they compute the box embedding and sub-text embedding of each image-text pair; the embeddings come from the text encoder and the visual encoder. They introduce a region-wise multi-head cross-attention mechanism to seamlessly fuse the glyph knowledge encoded in the customized text encoder within the target typography boxes with the prior knowledge carried by the original text encoders in the regions outside the typography boxes. In this region-wise multi-head cross-attention, they first partition the input pixel embeddings (Query) into multiple groups. These groups correspond to the target text boxes, which can be either specified by the user or automatically predicted by leveraging the planning capability of GPT-4. Simultaneously, they divide the text prompts (Key-Value) into corresponding sub-sections, which include a global prompt and several groups of glyph-specific prompts. They specifically direct the pixel embeddings within the target text boxes to attend only to the glyph text embeddings extracted with Glyph-ByT5. Similarly, pixel embeddings outside the text boxes attend exclusively to the global prompt embeddings extracted with the original two CLIP text encoders.
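A rough sketch of the region-wise attention routing described above, using a simple additive mask (box handling and multi-head details are simplified; the function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def region_wise_cross_attention(pixel_q, glyph_kv, global_kv, box_mask):
    """pixel_q: (B, N, C) pixel queries; box_mask: (B, N) bool, True inside text boxes.
    glyph_kv: (B, Mg, C) Glyph-ByT5 embeddings; global_kv: (B, Mc, C) CLIP embeddings."""
    kv = torch.cat([glyph_kv, global_kv], dim=1)                 # (B, Mg + Mc, C)
    mg, mc = glyph_kv.shape[1], global_kv.shape[1]
    # Queries inside boxes may only attend to glyph tokens, queries outside only to global tokens.
    allow = torch.cat([box_mask[:, :, None].expand(-1, -1, mg),
                       (~box_mask)[:, :, None].expand(-1, -1, mc)], dim=-1)
    scores = pixel_q @ kv.transpose(1, 2) / pixel_q.shape[-1] ** 0.5
    scores = scores.masked_fill(~allow, float("-inf"))
    return F.softmax(scores, dim=-1) @ kv

q = torch.randn(1, 64, 32); g = torch.randn(1, 8, 32); c = torch.randn(1, 16, 32)
mask = torch.zeros(1, 64, dtype=torch.bool); mask[:, :20] = True
print(region_wise_cross_attention(q, g, c, mask).shape)  # torch.Size([1, 64, 32])
```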
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text
Rendering 14 Mar 2024
https://arxiv.org/pdf/2403.09622
To address the dense-prompt understanding limitations of CLIP-based diffusion models, they propose ELLA, which incorporates a powerful LLM in a lightweight and efficient manner; the proposed module is the Timestep-Aware Semantic Connector (TSC), trained on text-image pair data rich in information density. Diffusion models predict low-frequency features in the first stages of the denoising process and high-frequency features later, so they expect the TSC to behave in the same way; for this reason it is timestep-aware. The architectural design of the TSC is based on the resampler, and it instills temporal dependency by integrating the timestep into Adaptive Layer Normalization. TSC can integrate community models and downstream tools such as LoRA and ControlNet.
The Timestep-Aware Semantic Connector (TSC) receives text features of arbitrary length as well as the timestep embedding, and outputs fixed-length semantic queries. These semantic queries are used to condition the noisy-latent prediction of the pre-trained U-Net through cross-attention. To improve compatibility and minimize the number of training parameters, they leave the text encoder of the Large Language Model as well as the U-Net and VAE components frozen. The only trainable component is consequently the lightweight TSC module.
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
8 Mar 2024
https://arxiv.org/pdf/2403.05135
Realistic data distributions are typically high-dimensional, complex, and often multimodal. Directly encoding such data into a single unimodal Gaussian distribution and learning a corresponding reverse noise-to-data mapping is challenging. The mapping, or generative ODE, necessarily needs to be highly complex, with strong curvature, and one may consider it unnatural to map an entire data distribution to a single Gaussian distribution. In practice, conditioning information, such as class labels or text prompts, often helps to simplify the complex mapping by offering the DM's denoiser additional cues for more accurate denoising. However, such conditioning information is typically of a semantic nature, and even given a class or text prompt, the mapping remains highly complex. They propose Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff): DMs augmented with additional discrete latent variables that encode high-level information about the data and can be used by the main DM to simplify its denoising task. These discrete latents are inferred through an encoder network and learnt end-to-end together with the DM. Thereby, the discrete latents directly learn to encode information that is beneficial for reducing the DM's score-matching objective, making the DM's hard task of mapping simple noise to complex data easier.
DisCo-Diff's training process is divided into two stages. In the first stage, the denoiser Dθ and the encoder Eϕ are co-optimized in an end-to-end fashion. This is achieved by extending the denoising score-matching objective to include learnable discrete latents z associated with each data point y. The denoiser network Dθ can better capture the time-dependent score (i.e., achieve a reduced loss) if the score for each sub-distribution p(x|z; σ(t)) is simplified. Therefore, the encoder Eϕ, which has access to clean input data, is encouraged to encode useful information into the discrete latents and help the denoiser reconstruct the data more accurately. Naively backpropagating gradients into the encoder through the sampling of the discrete latent variables z is not possible, so during training they rely on a continuous relaxation based on the Gumbel-Softmax distribution. In the second stage, they train an autoregressive model Aψ to capture the distribution pϕ(z) of the discrete latent variables, defined by pushing the clean data through the trained encoder.
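A small sketch of the Gumbel-Softmax relaxation used to backpropagate through the discrete latents (the temperature and the straight-through variant are illustrative choices, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0, hard: bool = True):
    """Differentiable sample of a categorical latent from encoder logits (B, K)."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-9) + 1e-9)
    soft = F.softmax((logits + gumbel) / tau, dim=-1)
    if hard:
        # Straight-through: forward pass uses the one-hot sample, backward uses the soft relaxation.
        index = soft.argmax(dim=-1, keepdim=True)
        one_hot = torch.zeros_like(soft).scatter_(-1, index, 1.0)
        return one_hot + (soft - soft.detach())
    return soft

logits = torch.randn(4, 16, requires_grad=True)
z = gumbel_softmax_sample(logits, tau=0.5)
z.sum().backward()  # gradients flow back into the encoder logits
print(z.shape, logits.grad is not None)
```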
DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents
3 Jul 2024
https://arxiv.org/pdf/2407.03300
OneDiffusion, a versatile, large-scale diffusion model that supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional
generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as
depth estimation and segmentation.
Read rest of it……
One Diffusion to Generate Them All 25 Nov 2024
https://arxiv.org/pdf/2411.16318
They introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token and thus achieves linear complexity. Their experiments indicate that, by fine-tuning the attention layers on merely 10K self-generated samples for 10K iterations, they can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. They find that while formulation-variation strategies have proven effective in attention-based UNets [38] and DiTs trained from scratch [62], they do not yield similar success with pre-trained DiTs. Key-value compression often leads to distorted details, and key-value sampling highlights the necessity of local tokens for each query to generate visually coherent results. They identify four elements crucial for linearizing pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Their proposal is that each query interacts only with tokens within a predefined distance r. Since the number of key-value tokens interacting with each query is fixed, the resulting DiT achieves linear complexity with respect to image resolution.
LinFusion has shown that linear-attention approaches achieve promising results in attention-based UNets. However, they find that this is not the case for pre-trained DiTs. They speculate that it is because attention layers are the only modules for token interactions in DiTs, unlike in U-Nets, so substituting all of them has a substantial impact on the final outputs. Other formulations, like Sigmoid Attention, fail to converge within a limited number of iterations. High-rank attention maps means that attention maps calculated by efficient attention alternatives should be sufficient to capture the intricate token-wise relationships. Extensive attention scores are concentrated along the diagonal, indicating that the attention maps do not exhibit the low-rank property assumed by many prior works; that is why methods like linear attention and Swin Transformer largely produce blocky patterns. Feature integrity implies that raw query, key, and value features are more favorable than compressed ones. Although PixArt-Σ has demonstrated that applying KV compression in deep layers does not hurt performance much, this approach is not suitable for completely linearizing pre-trained DiTs. Methods based on KV compression, such as PixArt-Σ and Agent Attention, tend to produce distorted textures compared to the results from Swin Transformer and Neighborhood Attention, which highlights the necessity of preserving the integrity of the raw query, key, and value tokens.
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers
Up 20 Dec 2024
https://arxiv.org/pdf/2412.16112
CLEAR adopts circular windows, where key-value tokens within a Euclidean distance less than a predefined radius r are considered for each query. Compared with the corresponding square windows, the computation introduced by this design is π/4 times as much. Although each query only has access to tokens within a local window, stacking multiple Transformer blocks enables each token to gradually capture holistic information, similar to the way convolutional neural networks operate. To promote functional consistency between the models before and after fine-tuning, they employ a knowledge-distillation objective during fine-tuning. Since attention is confined to a local window around each query, CLEAR offers greater efficiency for multi-GPU patch-wise parallel inference compared to the full attention in the original DiTs, which is particularly valuable for generating ultra-high-resolution images. Specifically, each GPU is responsible for processing an image patch, and GPU communication is only required in the boundary areas.
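A naive sketch of the circular-window constraint, built as an explicit attention mask for clarity (a real implementation would use an efficient local-attention kernel rather than a dense mask):

```python
import torch

def circular_window_mask(h: int, w: int, radius: float) -> torch.Tensor:
    """Boolean (h*w, h*w) mask: True where the key lies within Euclidean distance
    `radius` of the query on the 2-D token grid."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (h*w, 2)
    dist = torch.cdist(coords, coords)                                  # pairwise token distances
    return dist < radius

mask = circular_window_mask(16, 16, radius=4.0)
print(mask.shape, mask.float().mean())  # each query attends to a fixed local neighborhood
# In attention: scores.masked_fill(~mask, float("-inf")) before the softmax.
```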
The intrinsic distinction between AR and diffusion models lies in their approach to data-distribution factorization. AR models treat data as an ordered sequence, factorizing it along the sequential axis, where the probability of each token is conditioned on all preceding tokens. This factorization enables the AR paradigm to generalize effectively and efficiently across an arbitrary number of tokens, making it well-suited for long-sequence reasoning and in-context generation. In contrast, diffusion models factorize data along the noise-level axis, where the tokens at each step are a refined (denoised) version of themselves from the previous step. As a result, the diffusion paradigm generalizes to an arbitrary number of data-refinement steps, enabling iterative quality improvement with scaled inference compute. CausalFusion is designed to predict any number of tokens at any AR step, with any predefined sequence order and any level of inference compute, thereby minimizing the inductive biases present in existing generative models. As shown in Figure 1 of the paper, this approach provides a broad spectrum between the AR and diffusion paradigms, allowing smooth interpolation between the two endpoints during both training and inference. Starting from the DiT architecture, they gradually convert it into a decoder-only transformer compatible with existing AR models like GPT and LLaMA. Given a sample of training images X, AR models split X along the spatial dimensions into a sequence of tokens, X = {x1, . . . , xL}, where L is the number of tokens. Diffusion models gradually add random noise (typically Gaussian) to X in a so-called forward process; it is a Markov chain along the noise level, where each noisy version xt is conditioned on the previous state.
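In standard form, the two factorizations contrasted above are:

```latex
p(X) = \prod_{l=1}^{L} p(x_l \mid x_{<l}) \quad \text{(AR)}
\qquad\qquad
p(x_0) = \int p(x_T) \prod_{t=1}^{T} p(x_{t-1} \mid x_t)\, \mathrm{d}x_{1:T} \quad \text{(diffusion)}
```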
Causal Diffusion Transformers for Generative Modeling 17 Dec 2024
https://arxiv.org/pdf/2412.12095
Guidance Techniques
The following slides include:
- SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing 6 May 2024
No Training, No Problem: Rethinking Classifier-Free
Guidance for Diffusion Models 2 Jul 2024
Independent condition guidance (ICG) and time-step guidance (TSG). The main idea behind ICG is that by using a conditioning vector independent of the input data, the conditional score function becomes equivalent to the unconditional score. Time-step guidance aims to improve the accuracy of denoising at each sampling step by leveraging the time-step information learned by the diffusion model to steer sampling trajectories toward better noise-removal paths.
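A sketch of the resulting guidance rule, replacing the unconditional branch of classifier-free guidance with a prediction under an independently sampled condition c̃ (an illustrative formulation, not necessarily the paper's exact notation):

```latex
\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \tilde{c}) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \tilde{c}) \right)
```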
https://arxiv.org/pdf/2407.02687
Loss-Guided Diffusion Models for Plug-and-Play Controllable Generation
https://proceedings.mlr.press/v202/song23k/song23k.pdf
TraDiffusion: Trajectory-Based Training-Free Image Generation
19 August 2024
This method allows users to effortlessly guide image generation via mouse trajectories. To achieve precise control, they design a distance-awareness energy function to effectively guide latent variables, ensuring that the focus of generation stays within the areas defined by the trajectory. The energy function encompasses a control function, which draws the generation closer to the specified trajectory, and a movement function, which diminishes activity in areas distant from the trajectory. Due to the sparsity of the trajectories, it is difficult to directly combine them with backward guidance. A natural idea is to obtain the prior structure of an object through the attention maps of the cross-attention layers, rather than directly using the trajectories for backward guidance.
They propose to use a distance-awareness energy function for guidance. The control function, which guides the object to approach a given trajectory, is built on a distance matrix Dμi computed by the OpenCV (Bradski 2000) function "distanceTransform", in which each value denotes the distance from each location μ of the attention map to the given trajectory. However, this alone does not effectively inhibit the attention response of the object in irrelevant regions far from the trajectory, so they add a movement function that suppresses the attention response in regions far from the object's trajectory.
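A hedged sketch of the distance-map ingredient: computing the distance from every attention-map location to a user-drawn trajectory with OpenCV and using it to penalize attention far from the trajectory (the paper's actual control and movement functions are more specific than this):

```python
import cv2
import numpy as np

def trajectory_distance_map(points, h: int, w: int) -> np.ndarray:
    """Distance from every attention-map location to the user-drawn trajectory."""
    traj = np.full((h, w), 255, dtype=np.uint8)
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        cv2.line(traj, (x0, y0), (x1, y1), color=0, thickness=1)  # trajectory pixels set to 0
    # distanceTransform measures the distance to the nearest zero pixel.
    return cv2.distanceTransform(traj, distanceType=cv2.DIST_L2, maskSize=5)

dist = trajectory_distance_map([(4, 4), (20, 10), (30, 28)], h=32, w=32)
attn = np.random.rand(32, 32)
control_energy = (attn * dist).mean()  # large when attention mass sits far from the trajectory
print(dist.shape, control_energy)
```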
https://arxiv.org/pdf/2407.02687
Output Control Techniques
The following slides include:
-Adding Conditional Control to Text-to-Image Diffusion Models 26 Nov 2023
-ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback 11 Apr 2024
- CTRLororALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models 13 May 2024
- And more
ControlNet is a neural network architecture that adds spatial conditioning controls to large, pretrained text-to-image diffusion models. Zero-initialized layers are used by ControlNet for connecting network blocks. The reason for initializing weights as zero instead of Gaussian is to progressively grow the parameters from zero and ensure that no harmful noise affects the fine-tuning. To add a ControlNet to such a pre-trained neural block, they lock (freeze) the parameters Θ of the original block and simultaneously clone the block into a trainable copy with its own parameters. The trainable copy takes an external conditioning vector c as input and is connected to the locked model with zero-initialized 1x1 convolution layers.
In the training process, they randomly replace 50% of the text prompts with empty strings. This approach increases ControlNet's ability to directly recognize the semantics of the input conditioning images (e.g., edges, poses, depth, etc.) as a replacement for the prompt. They observe that the model converges suddenly rather than progressively.
When a conditioning image is added via ControlNet, it can be added to both ϵuc and ϵc, or only to ϵc. Their solution is to first add the conditioning image to ϵc and then multiply a weight wi into each connection between Stable Diffusion and ControlNet according to the resolution of each block.
To apply multiple conditioning images (e.g., Canny edges and pose) to a single instance of Stable Diffusion, they can directly add the outputs of the corresponding ControlNets to the Stable Diffusion model.
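A minimal sketch of the locked-block / trainable-copy wiring with zero 1x1 convolutions (the block internals are stand-ins; only the connection pattern follows the description above):

```python
import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    """y = F(x; Θ_frozen) + zero_conv( F_copy( x + zero_conv(c); Θ_trainable ) )"""
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.locked = block
        for p in self.locked.parameters():
            p.requires_grad_(False)                 # freeze the original block
        self.copy = copy.deepcopy(block)            # trainable clone
        self.zero_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.zero_out = nn.Conv2d(channels, channels, kernel_size=1)
        for conv in (self.zero_in, self.zero_out):  # zero init: no effect at the start of training
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, x, cond):
        return self.locked(x) + self.zero_out(self.copy(x + self.zero_in(cond)))

block = nn.Conv2d(8, 8, 3, padding=1)               # stand-in for a pretrained U-Net block
net = ControlledBlock(block, channels=8)
out = net(torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16))
print(out.shape)  # torch.Size([1, 8, 16, 16])
```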
Adding Conditional Control to Text-to-Image Diffusion Models
26 Nov 2023
https://arxiv.org/pdf/2302.05543
For an input conditional control, they use a pre-trained discriminative reward model to extract the corresponding condition from the generated images, and then optimize the consistency loss between the input conditional control and the extracted condition. They introduce an efficient reward strategy that deliberately disturbs the input images by adding noise and then uses the single-step denoised images for reward fine-tuning.
The model performs T denoising steps to generate the image x′0 from random noise xT. L is an abstract metric function that can take different concrete forms for different visual conditions: it compares the input condition with the output of the supervisor (reward) model applied to the diffusion model's output. For example, in the context of using a segmentation mask as the input conditional control, L could be the per-pixel cross-entropy loss.
Achieving the pixel-space consistency loss requires x0, the final denoised image, which would take 20-50 sampling steps and thus too much computation, so they propose a one-step efficient reward strategy: a simple algebraic manipulation yields a single-step prediction of x0 from the noise estimate, as shown below.
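In standard DDPM notation, the single-step estimate of x0 from the noise prediction is:

```latex
\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\; \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
```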
The timestep threshold is a hyper-parameter used to determine whether a noised image xt should be utilized for reward fine-tuning. They note that a small noise ϵ (i.e., a relatively small timestep t) can disturb the consistency and lead to effective reward fine-tuning. During the reward fine-tuning phase, they freeze the pre-trained discriminative reward model and the text-to-image model, and only update the ControlNet module following its original implementation, which ensures that its generative capabilities are not compromised.
ControlNet++: Improving Conditional Controls
with Efficient Consistency Feedback 11 Apr 2024
https://arxiv.org/pdf/2404.07987
CTRLororALTer: Conditional LoRAdapter for Efficient
0-Shot Control & Altering of T2I Models 13 May 2024
CTRLorALTer: formulating a unified approach for conditioning on global controls like style and on local controls like structure, in an efficient and generic manner, remains a key open problem. LoRAdapter is a novel approach to adding conditional information to LoRAs, enabling zero-shot generalization and making them applicable to both structure and style, and possibly many other conditioning types. They propose a LoRA-based conditioning mechanism whose behavior changes based on conditioning provided at inference time, enabling zero-shot generalization.
They add a condition to the LoRA matrix A: the static low-rank update is replaced by a conditioned one in which the output of A is modulated by an element-wise (Hadamard) scale γ and shift β predicted from the conditioning, before being passed to B.
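A hedged sketch of this conditional LoRA idea, assuming the scale and shift act on the intermediate low-rank activation (the exact placement in the paper may differ):

```python
import torch
import torch.nn as nn

class LoRAdapterLinear(nn.Module):
    """Frozen base linear layer plus a LoRA branch whose low-rank activation is
    modulated by condition-dependent scale (gamma) and shift (beta)."""
    def __init__(self, base: nn.Linear, rank: int, cond_dim: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)                       # LoRA convention: no effect at initialization
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * rank)  # predicts the modulation from the condition

    def forward(self, x, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        h = self.A(x)
        h = (1 + gamma) * h + beta                          # Hadamard scale and shift
        return self.base(x) + self.B(h)

layer = LoRAdapterLinear(nn.Linear(64, 64), rank=8, cond_dim=32)
y = layer(torch.randn(2, 64), torch.randn(2, 32))
print(y.shape)  # torch.Size([2, 64])
```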
https://arxiv.org/pdf/2405.07913
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion
Model with Any Condition 12 Dec 2023
FreeControl is a novel method for training-free controllable T2I generation that models the linear subspace of intermediate diffusion features and employs guidance in this subspace during the generation process. Given a text prompt c and a guidance image Ig of any modality, FreeControl directs a pre-trained T2I diffusion model ϵθ to comply with c while also respecting the semantic structure provided by Ig throughout the sampling process of an output image I. The key finding is that the leading principal components of self-attention block features inside a pre-trained diffusion model provide a strong and surprisingly consistent representation of semantic structure across a broad spectrum of image modalities.
FreeControl is a two-stage pipeline. It begins with an analysis stage, where diffusion features of seed images undergo principal component analysis (PCA), with the leading PCs forming the time-dependent bases Bt as the semantic structure representation. Ig subsequently undergoes DDIM inversion, with its diffusion features projected onto Bt, yielding their semantic coordinates Sg_t. In the synthesis stage, structure guidance encourages I to develop the same semantic structure as Ig by attracting St to Sg_t. In the meantime, appearance guidance promotes appearance similarity between I and ¯I by penalizing the difference in their feature statistics. Their key observation is that the leading PCs form a semantic basis; it exhibits a strong correlation with object pose, shape, and scene composition across diverse image modalities.
https://arxiv.org/pdf/2312.07536
ControlNeXt: Powerful and Efficient Control for Image and
Video Generation 15 Agust 2024
They design a more straightforward and efficient architecture, replacing heavy additional branches with minimal additional cost compared to the base model. Such a concise structure also allows the method to seamlessly integrate with other LoRA weights, enabling style alteration without the need for additional training. For training, they reduce the number of learnable parameters by up to 90% compared to the alternatives. They propose a method called Cross Normalization (CN) as a replacement for "zero-convolution" to achieve fast and stable training convergence. They say that zero convolution increases training challenges, slows convergence, and results in the "sudden convergence phenomenon". For training, they fine-tune the base model by freezing most of its modules and selectively training a much smaller subset of the pretrained parameters. They propose that the key reason for training collapse is that the newly initialized parameters have a different data distribution, in terms of mean and standard deviation, compared to the pre-trained parameters, and introduce cross normalization to align the data distributions.
They say that the controls typically have a simple form or maintain a high level of consistency with the denoising features, eliminating the need to insert controls at multiple stages. They integrate the controls into the denoising branch at a single selected middle block by directly adding them to the denoising features after normalization through Cross Normalization.
Cross Normalization: they calculate the mean and variance of the main branch (the pretrained diffusion model) and use them to normalize the control branch's features. Cross Normalization aligns the distributions of the denoising and control features, serving as a bridge connecting the diffusion and control branches. It accelerates the training process, ensures the effectiveness of the control on generation even at the beginning of training, and reduces sensitivity to the initialization of network weights.
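A minimal sketch of Cross Normalization as described above, assuming per-sample statistics over all channel and spatial dimensions (the exact normalization axes are not specified here):

```python
import torch

def cross_normalize(denoise_feat: torch.Tensor, control_feat: torch.Tensor, eps: float = 1e-6):
    """Normalize the control features with the mean/std of the main (denoising) branch,
    then add them to the denoising features. Shapes: (B, C, H, W)."""
    mu = denoise_feat.mean(dim=(1, 2, 3), keepdim=True)
    std = denoise_feat.std(dim=(1, 2, 3), keepdim=True)
    c_mu = control_feat.mean(dim=(1, 2, 3), keepdim=True)
    c_std = control_feat.std(dim=(1, 2, 3), keepdim=True)
    aligned = (control_feat - c_mu) / (c_std + eps) * std + mu  # match first and second moments
    return denoise_feat + aligned

x = torch.randn(1, 64, 32, 32)
c = 10 * torch.randn(1, 64, 32, 32) + 5                         # deliberately mismatched distribution
print(cross_normalize(x, c).std())
```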
https://arxiv.org/pdf/2408.06070
RB-Modulation: Training-Free Personalization of
Diffusion Models using Stochastic Optimal Control 27 May 2024
RB-Modulation is built on a stochastic optimal controller, where a style descriptor encodes the desired attributes through a terminal cost, and it eliminates the need for training or fine-tuning diffusion models. The model comes with a new Attention Feature Aggregation (AFA) module to maintain high fidelity to the reference image while adhering to the given prompt.
https://arxiv.org/pdf/2405.17401
They propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., 20-100 samples) instead of full-parameter tuning with large datasets.
https://arxiv.org/pdf/2410.23775
IN-CONTEXT LORA FOR DIFFUSION TRANSFORMERS 31 Oct 2024
They introduce an over-parameterized approach that accelerates training without increasing inference costs. This method reparameterizes low-rank adaptation by employing a separate MLP and learned embedding for each layer. The learned embedding is input to the MLP, which generates the adapter parameters. Such overparameterization has been shown to implicitly function as an adaptive learning rate and momentum, accelerating optimization. At inference time, the MLP can be discarded, leaving behind a standard low-rank adapter.
They avoid directly optimizing A and B by introducing a two-layer MLP which takes z as input and predicts the entries of A and B (the low-rank matrices). More formally, A, B = W2(ReLU(W1 z + b1)) + b2, where z is the learned input vector to the MLP, W and b correspond to learned weights and biases, and A ∈ R^{r×d} and B ∈ R^{d×r} are the generated low-rank matrices. Once fine-tuning is complete, the MLP can be discarded, retaining only the low-rank matrices A and B for inference. Although the MLP is compact in depth, it predicts a high-dimensional output the size of the LoRA parameters, which makes it overparameterized. For instance, an MLP with a hidden dimension of 32 scales the number of trainable parameters by approximately 32x. This makes OP-LoRA particularly advantageous in settings where inference resources are constrained but sufficient memory is available during training.
https://arxiv.org/pdf/2412.10362v1
OP-LoRA: The Blessing of Dimensionality 13 Dec 2024
OminiControl is a parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. OminiControl leverages a parameter-reuse mechanism, enabling the DiT to encode image conditions using itself (its VAE) as a powerful backbone and to process them with its flexible multi-modal attention processors. OminiControl effectively and efficiently incorporates injected image conditions with only 0.1% additional parameters, and can be used for control cases like subject-driven generation as well as spatially aligned conditions such as edges, depth, and more. These capabilities are achieved by training on images generated by the DiT itself.
Following the same token-processing pipeline as the noisy image tokens, they augment the encoded condition features with learnable position embeddings. These tokens are then added to the sequence alongside the noisy image tokens and text tokens; condition image tokens are processed uniformly with the text and noisy image tokens, integrating them into a unified sequence Z = [X, C_text, C_image], where Z represents the concatenated sequence of noisy image tokens X, text tokens C_text, and condition image tokens C_image. This unified approach enables direct participation in multi-modal attention without specialized processing pathways. The sequence design allows flexible integration of condition image tokens, but it requires incorporating positional information to ensure effective interaction with the noisy image tokens.
In FLUX.1's transformers, each token is assigned a corresponding position index to encode spatial information. For a 512x512 target image, the VAE encoder first projects it into the latent space, then the latent representation is divided into a 32x32 grid of tokens, where each token is assigned a unique two-dimensional position index (i, j) with i, j ∈ [0, 31]. This indexing scheme preserves the spatial structure of the original image in the latent space, while text tokens maintain a fixed position index of (0, 0). For spatially aligned tasks, their initial approach was to assign condition tokens the same position embeddings as their corresponding tokens in the noisy image. However, for non-spatially-aligned tasks such as subject-driven generation, their experiments revealed that shifting the position indices of the condition tokens leads to faster convergence. Specifically, they shift the condition image tokens to indices (i, j) with i ∈ [0, 31] and j ∈ [32, 64], ensuring no spatial overlap with the original image tokens X.
To achieve controllability, they introduce a bias term into the original MM-Attention operation. The bias, parameterized by a strength factor γ, is designed to adjust the attention weights between condition tokens and the other tokens. It is constructed as an (M + 2N) × (M + 2N) matrix, where M is the number of text tokens and N is the number of noisy image tokens and of condition image tokens each; a sketch of such a block structure follows.
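A hedged sketch of the block-structured bias (the exact layout in the paper is an assumption here; adding log γ only to the attention logits between condition tokens and the other tokens reduces to standard attention when γ = 1):

```python
import torch

def omini_attention_bias(m: int, n: int, gamma: float) -> torch.Tensor:
    """(M + 2N) x (M + 2N) additive bias for the multi-modal attention logits.
    Assumed token order: [text (M), noisy image (N), condition image (N)]."""
    size = m + 2 * n
    bias = torch.zeros(size, size)
    cond = slice(m + n, size)                          # condition-image token block
    log_gamma = torch.log(torch.tensor(float(gamma)))
    bias[cond, : m + n] = log_gamma                    # condition tokens attending to the others
    bias[: m + n, cond] = log_gamma                    # other tokens attending to condition tokens
    return bias  # added to the QK^T logits before the softmax; gamma = 1 leaves attention unchanged

print(omini_attention_bias(m=4, n=6, gamma=0.5).shape)  # torch.Size([16, 16])
```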
OminiControl: Minimal and Universal Control for Diffusion Transformer 25
Nov 2024
https://arxiv.org/pdf/2405.17401
Steering Rectified Flow Models in the Vector Field
for Controlled Image Generation 27 Nov 2024
FlowChef leverages the vector field to steer the denoising trajectory for controlled image generation tasks, facilitated by gradient skipping. They discover that in nonlinear ODEs with stochasticity or trajectory crossovers, error terms emerge that hinder convergence, due to inaccuracies in estimating denoised samples or improper gradient approximations. They say that rectified flow models (RFMs) can achieve higher convergence rates without additional computational overhead by capitalizing on this key property. They say that inversion is unnecessary, even for RF-Inversion, making RF-Inversion a special case of FlowChef in which the starting noise originates from an inverted target image rather than from random noise, as in FlowChef.
Rectified flow models inherently allow the error dynamics to converge even with gradient approximations, due to their straight-line trajectories and smooth vector fields, as discussed previously. Hence, the vector field uθ(xt, t) is trained to be smooth, and this smoothness implies that uθ changes gradually w.r.t. xt. A key feature of FlowChef is that it starts from any random noise xT ~ N(0, I) and still converges to the desired distribution or sample without inversion. At each timestep t, they first estimate x̂0, then calculate the loss L(x̂0, x0_ref), and finally directly optimize xt using the gradient of L with respect to x̂0.
https://arxiv.org/pdf/2412.00100
Let p1 ~ N(0, I) and p0 be distributions, with x1 ~ p1 as the initial noise, x0_ref as the target sample, x0 as the denoised sample obtained from x1, and x1_ref as the specific noise leading to x0_ref.
They introduce a single parameter, ω, to effectively control granularity in diffusion-based synthesis. The approach does not require model retraining, architectural modifications, or additional computational overhead during inference, yet it enables precise control over the level of detail in the generated outputs. Moreover, spatial masks or denoising schedules with varying ω values can be applied to achieve region-specific or timestep-specific granularity control.
A general form of a single denoising step with Omegance is sketched below.
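A hedged sketch of the idea: the predicted noise is scaled by ω and then plugged into an otherwise standard sampler step (an illustrative formulation; the paper's exact update may differ in detail):

```latex
\tilde{\epsilon}_\theta(z_t, t) = \omega\, \epsilon_\theta(z_t, t),
\qquad
z'_{t-1} = \mathrm{SamplerStep}\big(z_t, \tilde{\epsilon}_\theta(z_t, t)\big)
```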
When ω = 1, Omegance retains the standard denoising schedule, leaving the amount of noise removed from zt unchanged, and the SNR schedule aligns with the forward process. This setting produces a balanced output with standard levels of detail and texture across the entire image, aligning with the expected granularity of the original noise schedule. When ω < 1, SNR(t−1)′ < SNR(t−1): the noise prediction is scaled down, leading to less aggressive denoising towards z0. The latent state z′_{t−1} therefore retains additional high-frequency information. With the noise component dominating, the model "justifies" this residual noise by generating more intricate structures and richer textures, enhancing visual complexity in the output. When ω > 1, SNR(t−1)′ > SNR(t−1): the denoising schedule becomes more aggressive. This amplified noise reduction diminishes high-frequency information in the latent z′_{t−1}. With the signal now dominating, the model interprets the reduced residual noise as a cue to simplify textures and details, yielding smoother and less intricate visual outputs.
The omega mask ωi,j = M(i, j) introduces spatially varying control over granularity within a single image by allowing different regions to have distinct ω values during the denoising process. This spatial control leverages the locality of the denoising process, ensuring that adjustments to ω in one region do not affect the SNR or visual properties of neighboring areas. Such flexibility is valuable for applications requiring region-specific detail control within a single image, enabling fine-grained textures in focal regions while maintaining smoothness elsewhere.
The omega schedule ωt = S(t) provides a mechanism for controlling granularity across different stages of the denoising process by dynamically adjusting ω over time. By introducing ω at specific stages of the reverse diffusion process, the omega schedule allows targeted influence on both the broad layout and the fine-grained details of the generated image. This temporal control is aligned with the denoising dynamics: early denoising stages primarily reconstruct the general structure and layout, while later stages refine finer details.
Omegance: A Single Parameter for Various Granularities in Diffusion-
Based Synthesis 26 Nov 2024
https://arxiv.org/pdf/2411.17769
Training Time Optimization
The following slides include:
- Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment 18 Jun 2024
- Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget 22 Jul 2024
Immiscible Diffusion: Accelerating Diffusion Training with Noise
Assignment 18 Jun 2024
Diffusion models mimic the reverse thermodynamic diffusion phenomenon [34] to ease the denoising process. However, when the sources are miscible, they end up messily mixed. Predicting the reversal process from such a random mixture encounters significant difficulties, and unfortunately this is a problem diffusion models always face during denoising. They notice that the mixing can also be organized when the sources are immiscible. Under that circumstance, the sources occupy different continuous areas after the diffusion, while the whole diffusible area remains the same.
To achieve this, they minimize the total distance of image-noise pairs in a batch during the assignment. After assignment, the noise is still Gaussian, while each noise is assigned to nearer images, like what happens in the immiscible phenomenon, which significantly eases the denoising. For implementation, all that needs to be done is to perform a linear assignment (Hungarian matching) between the batch of images and noises according to their distances.
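A small sketch of the batch-wise noise assignment (using SciPy's Hungarian solver; plain L2 distance is an assumption about the metric):

```python
import torch
from scipy.optimize import linear_sum_assignment

def immiscible_noise_assignment(images: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Reorder a batch of Gaussian noise so that each image is paired with a nearby noise sample."""
    b = images.shape[0]
    cost = torch.cdist(images.reshape(b, -1), noise.reshape(b, -1))  # (B, B) pairwise L2 distances
    _, cols = linear_sum_assignment(cost.cpu().numpy())              # Hungarian matching
    return noise[torch.as_tensor(cols)]                              # permuted noise, still Gaussian overall

x = torch.randn(8, 3, 16, 16)
eps = torch.randn(8, 3, 16, 16)
assigned = immiscible_noise_assignment(x, eps)
# The assigned noise is then used as usual, e.g. x_t = alpha_t * x + sigma_t * assigned
print(assigned.shape)
```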
https://arxiv.org/pdf/2406.12303
Stretching Each Dollar: Diffusion Training from Scratch on
a Micro-Budget 22 Jul 2024
As the computational cost of transformers increases with the number of patches in each image, they propose to randomly mask up to 75% of the image patches during training. To mitigate the massive performance degradation caused by masking, they propose a deferred masking strategy where all patches are preprocessed by a lightweight patch-mixer before being passed to the diffusion transformer.
https://arxiv.org/pdf/2407.15811
Inference Time Optimization
The following slides include:
- SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions 25 Mar 2024
- Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation 18 Apr 2024
- PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator 13 May 2024
- Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation (Stability AI Turbo Solution) 18 March 2024
- Distilling Diffusion Models into Conditional GANs 9 May 2024
- Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models 3 April 2024
- EdgeFusion: On-Device Text-to-Image Generation 18 April 2024
- And more
SDXS: Real-Time One-Step Latent Diffusion Models
with Image Conditions 25 Mar 2024
They introduce a dual approach involving model miniaturization and a reduction in sampling steps. The methodology leverages knowledge distillation to streamline the U-Net and image decoder architectures, and introduces an innovative one-step DM training technique that utilizes feature matching and score distillation.
VAE decoder optimization: utilizing a pretrained diffusion model F to sample latent codes z and a pretrained VAE decoder to reconstruct images x, they introduce a VAE distillation (VD) loss for training a tiny image decoder G. They build G only with CNN blocks, eliminating complex components like attention and normalization (I do not know why they think norm layers are computationally overwhelming).
U-Net optimization: they selectively removed residual and transformer blocks from the U-Net, aiming to train a more compact model that can still reproduce the original model's intermediate feature maps and outputs effectively. Initializing noise and sampling images with an ODE to get noise-image pairs results in low-quality images, so they use Rectified Flow, which tackles this challenge by straightening the sampling trajectories. Using an MSE loss makes the model tend to output the average of multiple feasible solutions, so they use SSIM. They also straighten the model's trajectories to narrow the range of feasible outputs using existing fine-tuning methods like LCM.
One-step training: they first train the model for feature matching with an SSIM loss as a warm-up. At this stage, they sample noise-image pairs. As the trajectories of ODEs (for example, DDPM) are not straight, they use LCM-LoRA for rectifying the flow. They say that the warm-up training results are good in image quality but do not capture the data distribution; for this reason they use score distillation sampling with a learned manifold corrective.
https://arxiv.org/pdf/2403.16627
To achieve better quality at low step counts, they propose to distill along the student's backward path instead of the forward path. Put differently, rather than having the student mimic the teacher, they use the teacher to improve the student based on its current state of knowledge. They propose a Shifted Reconstruction Loss that dynamically adapts the knowledge transfer from the teacher model: the loss is designed to distill global, structural information from the teacher at high time steps, while focusing on fine-grained details and high-frequency components at lower time steps. They also propose noise correction, a training-free inference-time modification that enhances sample quality.
The coefficients are commonly chosen such that xT is not pure noise during training but still contains low-frequency information leaked from x0. With the interpolant xt = αt x0 + σt xT, any xt with t < T still contains information from the ground-truth sample via the first summand αt x0, which is the source of the leakage. Backward distillation eliminates this information leakage at all time steps t, preventing the model from relying on a ground-truth signal. This is achieved by simulating the inference process during training, which can also be interpreted as calibrating the student on its own upstream backward path.
They first perform backward iterations of the student model to obtain the intermediate latent xt, then use this latent as input for both the student and teacher models during training.
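A simplified sketch of this backward-distillation step (the sampler interface, the number of backward iterations, and the use of MSE as a stand-in for the shifted reconstruction loss are all assumptions):

```python
import torch

def backward_distillation_step(student, teacher, sampler, shape, t, T=1000):
    """Calibrate the student on its own backward path: x_t comes from the student, not from data."""
    x_T = torch.randn(shape)
    with torch.no_grad():
        x_t = sampler(student, x_T, t_start=T, t_end=t)   # student backward iterations, no gradient
    x0_student = student(x_t, t)                          # student prediction on its own latent
    with torch.no_grad():
        x0_teacher = teacher(x_t, t)                      # teacher target on the same latent
    # the paper's shifted reconstruction loss re-weights global vs. fine-detail terms by t;
    # a plain MSE is used here purely as a placeholder
    return torch.nn.functional.mse_loss(x0_student, x0_teacher)
```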
For the distillation loss, they define a shifted reconstruction loss designed such that for higher values of t the target produced by the teacher model shares global content with the student output but has improved semantic text alignment, while for lower values of t the target image features enhanced fine-grained details while maintaining the same overall structure as the student's prediction.
When t = T the input is pure noise, and predicting the noise at that time step is not informative. Existing works therefore propose predicting the velocity, i.e., the rate of change, but converting a model to velocity prediction requires extra training effort. They present a training-free alternative: by treating t = T as a special case and replacing ϵΘ with the true noise xT, the update rule f is corrected.
Shifted Reconstruction Loss
Imagine Flash: Accelerating Emu Diffusion Models with Backward
Distillation 18 Apr 2024
https://arxiv.org/pdf/2405.05224
Oğuzhan Ercan
x.com/oguzhannercan
PeRFlow: Piecewise Rectified Flow as Universal Plug-and-
Play Accelerator 13 May 2024
PeRFlow divides the sampling process of generative flows into several time windows
and straightens the trajectories in each interval via the reflow operation, thereby
approaching piecewise linear flows. Specifically, they attempt to straighten the
trajectories of the original PF-ODEs via a piecewise reflow operation. By solving the
ODEs in the shortened time intervals, PeRFlow avoids simulating the entire ODE
trajectory when preparing the training data. Through this divide-and-conquer strategy,
PeRFlow can straighten the sampling trajectories with large-scale real training data.
Diffusion models are usually trained with ϵ-prediction, whereas flow-based generative
models generate data by following a velocity field. They derive the correspondence
between ϵ-prediction and the velocity field of the flow, narrowing the gap between the
pretrained diffusion model and the student PeRFlow model.
They divide the ODE trajectories into multiple time windows and straighten the
trajectories in each time window via the reflow operation. The pretrained diffusion
models are usually trained by two parameterization tricks, namely ϵ-prediction and
velocity-prediction. To inherit knowledge from the pretrained network, they
parameterize the PeRFlow model as the same type of diffusion and initialize network θ
from the pretrained diffusion model ϕ.
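A small sketch of how a piecewise reflow target could be formed inside one time window (function names and the exact parameterization are assumptions; the point is that the teacher ODE is only integrated within the window, and the target velocity is the straight line between its endpoints, assuming t_start < t_end):

```python
import torch

def perflow_window_targets(teacher_ode_solve, x_start, t_start, t_end):
    """Form a straight-line (reflow) training pair inside one time window of the teacher PF-ODE."""
    with torch.no_grad():
        x_end = teacher_ode_solve(x_start, t_start, t_end)        # solve only within [t_start, t_end]
    v_target = (x_end - x_start) / (t_end - t_start)              # constant velocity inside the window
    tau = torch.empty(x_start.shape[0], device=x_start.device).uniform_(t_start, t_end)
    x_tau = x_start + (tau - t_start).view(-1, *([1] * (x_start.dim() - 1))) * v_target
    return x_tau, tau, v_target   # train with || v_theta(x_tau, tau) - v_target ||^2
```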
https://arxiv.org/pdf/2405.07510
Oğuzhan Ercan
x.com/oguzhannercan
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion
Distillation 18 March 2024
The authors say that Adversarial Diffusion Distillation was a big step, but the use of a fixed, pretrained DINOv2 network restricts the discriminator's training resolution to 518 × 518 pixels, and there is no straightforward way to control the feedback level of the discriminator. In addition, as Yann LeCun has noted, the need to decode back to RGB space is a problem. They also observe that smaller discriminator feature networks often offer better performance than their larger counterparts.
They distill generative features of a pretrained diffusion model instead of DINOv2. By targeted sampling of the noise levels during training, the discriminator features can be biased towards more global (high noise level) or more local (low noise level) behavior.
Many distillation techniques attempt to learn "simpler" differential equations that result in the same distribution at t = 0 but with "straighter", more linear trajectories, which allow larger step sizes and therefore fewer network evaluations.
LADD introduces two main modifications: unifying the discriminator and the teacher model, and adopting synthetic data for training. They first generate an image with the teacher model, add noise to it, and then denoise it with both the teacher and student networks, computing the loss on these latent-space representations. They also feed the student's output to the teacher model: after each layer of the teacher (a transformer), they attach a discriminator head and compute the adversarial loss with these heads.
In one-shot scenarios, CFG simply oversaturates samples rather than improving text alignment. This observation suggests that CFG works best in multi-step settings, which allow oversaturation to be corrected in most cases. They also find that while the distillation loss benefits training on real data, it offers no advantage for synthetic data; thus, training on synthetic data can be conducted effectively using only an adversarial loss.
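A condensed sketch of the generator-side objective (the interfaces student, teacher.features, and the hinge-style heads are assumed; the actual LADD setup also trains the heads with a separate discriminator loss):

```python
import torch

def ladd_generator_loss(student, teacher, disc_heads, latents, prompt_emb, sigma):
    """Adversarial loss on noised student samples, using frozen-teacher features per transformer block."""
    x_student = student(latents, prompt_emb)                      # few-step student sample in latent space
    x_noised = x_student + sigma * torch.randn_like(x_student)    # noise level biases global vs. local feedback
    feats = teacher.features(x_noised, sigma, prompt_emb)         # teacher is frozen, but gradients flow to the student
    # hinge-style generator term summed over discriminator heads attached to each teacher block
    return sum(-head(f).mean() for head, f in zip(disc_heads, feats))
```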
https://arxiv.org/pdf/2403.12015
Oğuzhan Ercan
x.com/oguzhannercan
Cross-Attention Makes Inference Cumbersome
in Text-to-Image Diffusion Models 3 April 2024
https://arxiv.org/pdf/2404.02747
Oğuzhan Ercan
x.com/oguzhannercan
Improved Distribution Matching Distillation for Fast Image Synthesis
23 May 2024
https://arxiv.org/pdf/2405.14867
Oğuzhan Ercan
x.com/oguzhannercan
EM Distillation for One-step Diffusion Models 27 May 2024
https://arxiv.org/pdf/2405.16852
Oğuzhan Ercan
x.com/oguzhannercan
Boosting Latent Diffusion with Flow Matching 28 Mar 2024
https://arxiv.org/pdf/2312.07360
Oğuzhan Ercan
x.com/oguzhannercan
SVDQUANT: ABSORBING OUTLIERS BY LOW-RANK
COMPONENTS FOR 4-BIT DIFFUSION MODELS 8 Nov 2024
https://arxiv.org/pdf/2411.05007
They aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and
activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become
insufficient. Different from smoothing which redistributes outliers between weights and activations, their approach absorbs these outliers
using a low-rank branch. They first consolidate the outliers by shifting them from the activations to the weights, then employ a high-precision low-rank branch to absorb the weight outliers via Singular Value Decomposition (SVD).
The core insight is to introduce a 16-bit low-rank branch and migrate the weight quantization difficulty to this branch. Compared to direct 4-bit quantization, i.e., Q(X̂)Q(W), their method first computes the low-rank branch X̂ L1 L2 in 16-bit precision, and then approximates the residual X̂ R with 4-bit quantization. The low-rank branch is negligible in terms of additional parameters and computation, but a naive implementation adds roughly 50% extra cost, mostly from memory access; they propose the Nunchaku fused kernel to eliminate this overhead.
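A toy illustration of the weight-side decomposition (naive per-tensor symmetric quantization stands in for the paper's smoothing and low-bit kernels; rank and qbits are illustrative values):

```python
import torch

def svd_absorb_outliers(W: torch.Tensor, rank: int = 32, qbits: int = 4):
    """Split W into a 16-bit low-rank branch L1 @ L2 plus a 4-bit residual R."""
    U, S, V = torch.svd_lowrank(W.float(), q=rank)
    L1, L2 = (U * S).half(), V.t().half()             # low-rank branch kept in 16-bit
    R = W.float() - (L1.float() @ L2.float())         # residual with far fewer outliers
    qmax = 2 ** (qbits - 1) - 1
    scale = R.abs().max() / qmax                      # naive per-tensor scale (stand-in)
    R_q = (R / scale).round().clamp(-qmax - 1, qmax).to(torch.int8)
    return L1, L2, R_q, scale                         # W ≈ L1 @ L2 + R_q * scale
```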
Oğuzhan Ercan
x.com/oguzhannercan
1.58-bit FLUX 24 Dec 2024
https://arxiv.org/pdf/2412.18653v1
The first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e.,
values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 × 1024 images. Notably, the quantization method
operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, they develop a custom
kernel optimized for 1.58- bit operations, achieving a 7.7× reduction in model storage, a 5.1× reduction in inference memory, and improved
inference latency.
The quantization reduces the weights of all linear layers in the FluxTransformerBlock and FluxSingleTransformerBlock of FLUX to 1.58 bits, covering 99.5% of the model's total parameters.
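For intuition, a common ternary ("1.58-bit") quantization recipe looks like the following absmean scheme; the paper's data-free calibration and custom kernel are not reproduced here, and its exact scheme may differ:

```python
import torch

def ternary_quantize(W: torch.Tensor, eps: float = 1e-8):
    """Map weights to {-1, 0, +1} with a single per-tensor scale (absmean-style; illustrative only)."""
    scale = W.abs().mean().clamp(min=eps)
    W_q = (W / scale).round().clamp(-1, 1)    # ternary codes
    return W_q.to(torch.int8), scale          # dequantize as W_q * scale
```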
Oğuzhan Ercan
x.com/oguzhannercan
Quality Enhancement
Following slides includes:
-Align Your Steps, 22 April 2024
- And more
Oğuzhan Ercan
x.com/oguzhannercan
Align Your Steps, 22 April 2024
Sampling from DMs can be seen as solving a differential equation through a discretized set of noise levels known as the sampling schedule. They
propose a general and principled approach to optimizing the sampling schedules of DMs for high-quality outputs.
SDE solvers excel in sampling from diffusion models due to their built-in error-correction, allowing them to outperform ODE solvers.
https://arxiv.org/pdf/2404.14507
Oğuzhan Ercan
x.com/oguzhannercan
Step-aware Preference Optimization: Aligning Preference with Denoising
Performance at Each Step 6 Jun 2024
https://arxiv.org/pdf/2406.04314
Oğuzhan Ercan
x.com/oguzhannercan
ReNO: Enhancing One-step Text-to-Image Models
through Reward-based Noise Optimization 6 Jun 2024
They explore optimizing the initial random noise during inference without adapting any of the model's parameters. The initial noise is updated based on the signal from a reward model evaluated on the generated image. Backpropagating the gradient through many denoising steps can lead to exploding/vanishing gradients, so they use a well-calibrated one-step diffusion model. Naively optimizing the initial latent for an arbitrary objective can lead to collapse due to reward hacking. To mitigate this, they propose using a combination of reward objectives so as not to overfit to any single reward, together with an optimization scheme with a limited number of steps, regularization of the noise to stay in-distribution, and gradient clipping.
Backpropagating through C(Gθ(ε, p)) (a criterion evaluated on a generation) is non-trivial, as current text-to-image models are based on the simulation of ODEs or SDEs. One important consideration is that ε (the noise) should stay within the proximity of the initial noise distribution N(0, I), as otherwise Gθ might produce unwanted generations. This can be realized by including a regularization term inside C: they maximize the log-likelihood of the norm of a noise sample.
They propose to use a weighted combination of a number n of pre-trained reward models as the criterion function. Employing a combination of
reward models can help prevent “reward-hacking” and allow capturing various aspects of image quality and prompt adherence, as different reward
models are trained on different prompt and preference sets.
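A compact sketch of the noise-optimization loop (the latent shape, learning rate, weights, and the simple norm-based regularizer are placeholders, not the paper's exact choices):

```python
import torch

def reno_optimize(one_step_gen, reward_fns, weights, prompt, n_iters=50, lr=5.0):
    """Optimize the initial noise against a weighted combination of reward models."""
    eps = torch.randn(1, 4, 64, 64, requires_grad=True)      # assumed latent shape
    opt = torch.optim.SGD([eps], lr=lr)
    for _ in range(n_iters):
        img = one_step_gen(eps, prompt)                       # one-step model keeps gradients stable
        reward = sum(w * r(img, prompt) for r, w in zip(reward_fns, weights))
        reg = (eps.pow(2).mean() - 1.0) ** 2                  # stand-in for the norm log-likelihood regularizer
        loss = -reward + 0.01 * reg
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_([eps], max_norm=1.0)   # gradient clipping, as described
        opt.step()
    return eps.detach()
```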
https://arxiv.org/pdf/2403.17377
Oğuzhan Ercan
x.com/oguzhannercan
Face Models
Following slides includes:
-InstantID: Zero-shot Identity-Preserving Generation in Seconds 2 Feb 2024
- And more
Oğuzhan Ercan
x.com/oguzhannercan
InstantID: Zero-shot Identity-Preserving Generation in Seconds
2 Feb 2024
They use a pre-trained face model to detect faces and extract a face ID embedding from the reference facial image, providing strong identity features to guide the image generation process.
Image Adapter: a lightweight adaptive module with decoupled cross-attention is introduced to support images as prompts. However, they diverge from prior work by employing the ID embedding as their image prompt, as opposed to the coarsely aligned CLIP embedding, aiming for a more nuanced and semantically rich prompt integration.
Directly adding the text and image tokens in cross-attention tends to weaken the control exerted by the text tokens, so they adapt a ControlNet, named IdentityNet. In this network, they use 5 facial landmarks instead of the 68 used by OpenPose, and instead of text embeddings they use ArcFace identity embeddings in the cross-attention layers.
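The decoupled cross-attention pattern referred to here can be sketched as two attentions whose outputs are summed (projection layers and scaling conventions are omitted; this is an illustration, not InstantID's exact module):

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, text_k, text_v, id_k, id_v, id_scale: float = 1.0):
    """Attend separately to text tokens and ID tokens, then add the results."""
    out_text = F.scaled_dot_product_attention(q, text_k, text_v)  # standard text conditioning
    out_id = F.scaled_dot_product_attention(q, id_k, id_v)        # ID-embedding conditioning
    return out_text + id_scale * out_id
```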
https://arxiv.org/pdf/2401.07519
Oğuzhan Ercan
x.com/oguzhannercan
RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance
23 May 2024
They build on the well-known classifier guidance methodology, which modifies an existing denoising process using the gradient from a pre-trained classifier. The rationale behind this choice is twofold: first, it directly harnesses the discriminator's domain knowledge for identity preservation,
which may be a cost-effective substitute for training on domain-specific datasets; secondly, keeping the diffusion model intact allows for plug-and-
play combination with different discriminators. This work builds on a recent framework named rectified flow featuring strong theoretical properties,
e.g. the straightness of its sampling trajectory. By approximating the rectified flow to be ideally straight, the original classifier guidance is
reformulated as a simple fixed-point problem concerning only the trajectory endpoints, thus naturally overcoming its reliance on a special noise-
aware classifier. This allows flexible reuse of image discriminators for identity preservation in personalization tasks.
Rectified flow recap: they aim to learn a velocity field v that maps random noise z0 ∼ π0 to samples from a complex distribution z1 ∼ πdata via an ordinary differential equation (ODE). Instead of directly solving the ODE (Chen et al., 2018), rectified flow (Liu et al., 2023a) simply learns a linear interpolation between the two distributions by minimizing the following objective:
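For reference, the standard rectified flow objective (reconstructed here in its usual form, since the slide's equation image is not reproduced) is:

```latex
\min_{v}\; \mathbb{E}_{z_0 \sim \pi_0,\; z_1 \sim \pi_{\mathrm{data}},\; t \sim \mathcal{U}[0,1]}
\big\| (z_1 - z_0) - v(z_t, t) \big\|^2,
\qquad z_t = t\, z_1 + (1 - t)\, z_0 .
```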
Classifier Guidance recap: a test-time mechanism to adjust the predicted noise ϵ(zt, t) based on the guidance from a classifier. Given condition c
and classifier output p(c|zt), the adjustment is formulated as:
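In its usual form (the guidance scale s and schedule conventions may differ from the paper), the adjustment is:

```latex
\hat{\epsilon}(z_t, t) \;=\; \epsilon(z_t, t) \;-\; s\,\sigma_t\, \nabla_{z_t} \log p(c \mid z_t).
```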
They combine rectified flow and classifier guidance at inference time, controlling the flow with classifier guidance to achieve the desired output. Since the theoretical foundation of this paper is heavy, the details are not included in these slides.
https://arxiv.org/pdf/2405.14677
Oğuzhan Ercan
x.com/oguzhannercan
UniPortrait: A Unified Framework for Identity-Preserving Single- and
Multi-Human Image Personalization 12 August 2024
UniPortrait, an innovative human image personalization framework that unifies single- and multi-ID customization with high face fidelity, extensive facial editability, free-
form input description, and diverse layout generation. UniPortrait consists of only two plug-and-play modules: an ID embedding module and an ID routing module. The
ID embedding module extracts versatile editable facial features with a decoupling strategy for each ID and embeds them into the context space of diffusion models. The
ID routing module then combines and distributes these embeddings adaptively to their respective regions within the synthesized image.
Unlike most preceding approaches that harness the final global features of a face recognition backbone for face ID representation, they utilize features from the
penultimate layer (prior to the fully connected layer). This adjustment aims to preserve an enhanced degree of spatial information pertaining to ID features. They say
that using CLIP features may couple the ID with ID-irrelevant facial information (lighting, occlusions, etc.). Given the typically scarce and non-diverse nature of personalization training data (in which the training reference and target faces often come from the same or similar images), these irrelevant features risk leading the model to overfit on non-essential facial details. To solve these problems, they first integrate shallow features from the face recognition model to augment the structural representation of the face, and subsequently apply a strong dropping regularization to the structure-feature branch to decouple it from the intrinsic ID branch. The shallow features of the backbone are empirically low-level, containing more texture details, and they are more ID-relevant, enabling higher-fidelity portraits.
https://arxiv.org/pdf/2408.05939
Oğuzhan Ercan
x.com/oguzhannercan
They first flatten and apply a Multilayer Perceptron (MLP) to the penultimate layer's features of the face recognition model to obtain the intrinsic ID features Fr ∈ R^(mr×dr). They then interpolate the shallow features, i.e., the 1/2, 1/4, and 1/8 feature maps from the face backbone, concatenate them with CLIP local features, and pass them through another MLP to derive the face structure features Fs ∈ R^(ms×ds). Next, they introduce an l-layer Q-Former with m learnable queries to aggregate Fr and Fs. Each layer of the Q-Former comprises two attention blocks and one feed-forward network (FFN), with the attention blocks respectively attending to the intrinsic ID information and the face structure representations. In the input and output of the second attention block, they further introduce DropToken and DropPath as means of decoupling face structure from intrinsic ID representation. The final output of the Q-Former, denoted Fid ∈ R^(m×d), is then used as the ID embedding and aligned into the context space of the U-Net. Here, they use decoupled cross-attention to inject the ID information into the U-Net.
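A rough sketch of one such Q-Former layer (the dimensions, the dropout stand-in for DropToken/DropPath, and the assumption that Fr and Fs are already projected to the query dimension are all simplifications):

```python
import torch
import torch.nn as nn

class IDQFormerLayer(nn.Module):
    """One layer: attend to intrinsic-ID features, then to face-structure features, then an FFN."""
    def __init__(self, d: int = 768, heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.attn_id = nn.MultiheadAttention(d, heads, batch_first=True)
        self.attn_struct = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.drop = nn.Dropout(p_drop)   # crude stand-in for DropToken / DropPath decoupling

    def forward(self, queries, f_r, f_s):
        q = queries + self.attn_id(queries, f_r, f_r)[0]       # intrinsic ID branch
        q = q + self.drop(self.attn_struct(q, f_s, f_s)[0])    # face-structure branch (heavily regularized)
        return q + self.ffn(q)
```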
In this work, they introduce a position-wise ID routing module integrated within each cross-attention layer to adaptively route and assign a unique ID to each potential face area in the latent features, thereby effectively mitigating identity blending. UniPortrait can work with N identities and is not limited to a specific number of inputs. The structure of the position-wise ID routing can be seen in the figure (stage 2) on the previous page. The idea behind ID routing is that each face in an image is associated with at most one ID feature; by confining each position to cross-attend to only one ID, blending between IDs is effectively avoided.
They use CurricularFace as the face recognition network and sample the training data from many different open datasets.
UniPortrait: A Unified Framework for Identity-Preserving Single- and
Multi-Human Image Personalization 12 August 2024 (Page 2)
https://arxiv.org/pdf/2408.05939
Oğuzhan Ercan
x.com/oguzhannercan
Image Editing
Following slides includes:
-D-Flow: Differentiating through Flows for Controlled Generation
- And more
Oğuzhan Ercan
x.com/oguzhannercan
POSTEDIT: POSTERIOR SAMPLING FOR EFFICIENT
ZERO-SHOT IMAGE EDITING 8 Oct 2024
PostEdit is a method that incorporates a posterior scheme to govern the diffusion sampling process. Specifically, a measurement term related to both the initial features and Langevin dynamics is introduced to optimize the estimated image generated by the given target prompt. To reconstruct and edit an image x0, they make use of a measurement term y which contains the features of the initial image, and supervise the editing process with the gradient of the posterior log-likelihood, ∇xt log p(xt|y).
Read the rest of it.
https://arxiv.org/pdf/2410.04844
Oğuzhan Ercan
x.com/oguzhannercan
In this paper, they propose merging these two fields by utilizing image-to-video models for image editing. They reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. They implement the proposed approach through a structured pipeline called Frame2Frame (F2F). First, they transform the edit instruction into a Temporal Editing Caption (a scenario describing how the edit should naturally evolve over time) using a pretrained Vision-Language Model (VLM). Next, a state-of-the-art image-to-video model generates a temporally coherent sequence guided by the temporal caption. Finally, they identify the frame that best realizes the desired edit with the assistance of a VLM. The framework shows promising results on more classical computer vision problems such as deblurring, denoising, and relighting.
Since video generation is a temporal process, the editing caption must be temporal as well, so they use ChatGPT-4o, instructed to produce a concise video scenario that highlights how elements within the image change or move over time. For video generation, they use CogVideoX. They observe that the optimal number of frames required for an edit can vary: small changes may be completed in fewer frames, while more extensive transformations often necessitate additional ones. Therefore, they aim to identify the optimal edited frame, denoted ft, which corresponds to the earliest timestep t that achieves the desired edit.
To automate the selection of t and avoid manual frame-by-frame review, they employ an automated approach: after generating the sequence V, they sample every fourth frame, imprint each with a unique identifier, and assemble them into an image collage alongside Is. Inspired by "An Image Grid Can Be Worth a Video", which introduces a novel approach to video comprehension by transforming videos into image grids, they use GPT-4o to assist in selecting t by providing it with the collage and the editing prompt c. The VLM is tasked with identifying the frame that best fulfills the editing intent, evaluating each frame's alignment with c and its fidelity to Is, and is instructed to select the frame with the lowest index that completes the edit.
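The collage construction can be sketched roughly as below (the frame stride, grid size, and identifier stamping are assumptions; the VLM query itself is not shown):

```python
from PIL import Image, ImageDraw

def build_frame_collage(frames, source, stride: int = 4, cols: int = 4):
    """Sample every `stride`-th frame, stamp an index on each, and tile them with the source image."""
    picked = [source] + list(frames[::stride])
    w, h = picked[0].size
    rows = (len(picked) + cols - 1) // cols
    collage = Image.new("RGB", (cols * w, rows * h))
    for i, frame in enumerate(picked):
        tile = frame.resize((w, h))
        ImageDraw.Draw(tile).text((10, 10), str(i), fill="red")   # unique identifier per tile
        collage.paste(tile, ((i % cols) * w, (i // cols) * h))
    return collage
```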
Within the manifold visualization (figure on the right), there is a clear semantic progression: images of people with 'AI' shirts (green cluster) are close to images of people with 'AI' shirts making a heart shape (purple cluster), which are adjacent to images of people only making a heart shape (red cluster). Thus, transitioning smoothly along the manifold allows a person with an 'AI' shirt to perform a heart shape with their hands while preserving the shirt's text.
Pathways on the Image Manifold: Image Editing via Video Generation 27
Nov 2024
https://arxiv.org/pdf/2411.16819
Oğuzhan Ercan
x.com/oguzhannercan
They leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, they propose an automatic method to identify "vital layers" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled, stable edits, from non-rigid modifications to object addition, using the same mechanism. These flow-based models are built on optimal-transport conditional probability paths, resulting in faster training and sampling compared to diffusion models; this is attributed to the fact that they follow straight-line trajectories rather than curved paths. One known consequence of this difference, however, is that these models exhibit lower diversity than previous diffusion models. While reduced diversity is generally considered undesirable, in this paper they suggest leveraging it for training-free image editing. Specifically, they explore image editing via parallel generation [4, 18, 91], where features from the generative trajectory of the source (reference) image are injected into the trajectory of the edited image. They find that there is no simple relationship between the vitality of a layer and its position in the architecture, i.e., the vital layers are spread across the transformer.
They bypass layers one by one and observe how the i-th layer affects the generation. To assess the impact of each layer, they measure the perceptual similarity between G_ref and G_ℓ using DINOv2. The results show that removing certain layers significantly affects the generated images, while others have minimal impact; importantly, the influential layers are distributed across the transformer rather than concentrated in specific regions. They adapt the self-attention injection mechanism, previously shown to be effective for image and video editing in UNet-based diffusion models, to the DiT-based FLUX architecture. Since each DiT layer processes a sequence of image and text embeddings, they propose generating both x and x̂ in parallel while selectively replacing the image embeddings of x̂ with those of x, but only within the set of vital layers V.
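The layer-ablation probe could look roughly like this (the generate and bypass_layers interfaces and the use of cosine similarity on DINOv2 embeddings are assumptions):

```python
import torch

@torch.no_grad()
def rank_vital_layers(dit, dino, prompts, k: int = 8):
    """Bypass each DiT layer in turn and score how much the generations change under DINOv2."""
    ref = dino(dit.generate(prompts))                               # reference images, all layers active
    scores = []
    for i in range(dit.num_layers):
        gen_i = dino(dit.generate(prompts, bypass_layers=[i]))      # skip layer i via its residual path
        scores.append(1.0 - torch.cosine_similarity(gen_i, ref, dim=-1).mean())
    return torch.topk(torch.stack(scores), k).indices               # the k most influential ("vital") layers
```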
To edit real images, they first invert them into the latent space, transforming samples from p1 to p0. They initially implemented an inverse Euler ODE solver for FLUX by reversing the vector field prediction.
Stable Flow: Vital Layers for Training-Free Image Editing 21 Nov 2024
https://arxiv.org/pdf/2411.14430
This approach proves insufficient for FLUX, resulting in corrupted image reconstructions and unintended modifications during editing. They hypothesize that the assumption u(zt) ≈ u(zt−1) does not hold, which causes the model to significantly alter the image during the forward process. To address this, they introduce latent nudging: multiplying the initial latent z0 by a small scalar λ = 1.15 to slightly offset it from the training distribution.
Oğuzhan Ercan
x.com/oguzhannercan
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
10 Dec 2024
This paper introduces FireFlow, a simple yet effective zero-shot approach that inherits the startling capacity of ReFlow-based models (such as FLUX) in generation while
extending its capabilities to accurate inversion and editing in 8 steps.
Read the rest of it,
https://arxiv.org/pdf/2412.07517
Oğuzhan Ercan
x.com/oguzhannercan
Guidance
Following slides includes:
-D-Flow: Differentiating through Flows for Controlled Generation
- And more
Oğuzhan Ercan
x.com/oguzhannercan
Derivative-Free Guidance in Continuous and Discrete
Diffusion Models with Soft Value-Based Decoding 1 Aug 2024
Rather than merely generating designs that are natural, the goal is often to optimize downstream reward functions while preserving the naturalness of these design spaces. The proposed algorithm is an iterative sampling method that integrates soft value functions, which look ahead to how intermediate noisy states lead to high rewards in the future, into the standard inference procedure of pre-trained diffusion models.
https://arxiv.org/pdf/2407.02398
Oğuzhan Ercan
x.com/oguzhannercan
Flow Control
Following slides includes:
-D-Flow: Differentiating through Flows for Controlled Generation
- And more
Oğuzhan Ercan
x.com/oguzhannercan
Consistency Flow Matching:
Defining Straight Flows with Velocity Consistency 2 Jul 2024
https://arxiv.org/pdf/2407.02398
Oğuzhan Ercan
x.com/oguzhannercan
Diffusion Solvers
Following slides includes:
-D-Flow: Differentiating through Flows for Controlled Generation
Oğuzhan Ercan
x.com/oguzhannercan
GENERALIZATION IN DIFFUSION MODELS ARISES FROM
GEOMETRY-ADAPTIVE HARMONIC REPRESENTATIONS 12 April
2024
Recent reports of memorization of the training set raise the question of whether these networks are learning the “true” continuous density of the
data. They show that two DNNs trained on non-overlapping subsets of a dataset learn nearly the same score function, and thus the same density,
when the number of training images is large enough. In this regime of strong generalization, diffusion-generated images are distinct from the
training set, and are of high visual quality, suggesting that the inductive biases of the DNNs are well-aligned with the data density.
They find that DNN denoisers trained on photographic images perform a shrinkage operation in an orthonormal basis consisting of harmonic functions that are adapted to the geometry of features in the underlying image. They refer to these as geometry-adaptive harmonic bases (GAHBs). This observation, taken together with the generalization performance of DNN denoisers, suggests that optimal bases for denoising photographic images are GAHBs and, moreover, that the inductive biases of DNN denoisers encourage such bases. They say that an optimal denoiser (for small noise) should project a noisy image onto the tangent space of the manifold.
DNNs are susceptible to overfitting, because the number of training examples is typically small relative to the model capacity. Since density
estimation, in particular, suffers from the curse of dimensionality, overfitting is of more concern in the context of generative models. An overfitted
denoiser performs well on training images but fails to generalize to test images, resulting in low-diversity generated images. Consistent with this,
several papers have reported that diffusion models can memorize their training data.
https://arxiv.org/pdf/2310.02557
Oğuzhan Ercan
x.com/oguzhannercan
DIFFUSION POSTERIOR SAMPLING FOR GENERAL
NOISY INVERSE PROBLEMS 20 May 2024
https://arxiv.org/pdf/2209.14687