Oğuzhan Ercan
x.com/oguzhannercan
Diffusion and Flow Models
Control, Optimization, Quality Enhancement and more…
Topics
Topics discussed
- Diffusion Models
- Diffusion Architectures
- Swap-Inpainting Models
- Output Control Techniques
- Inference Time Optimization
- Quality Enhancement
- Video Generation
- Face Models
Prerequisites
Probability
Statistics
Linear Algebra
Calculus
Deep Learning
Differential Equations
Diffusion Models
Diffusion models consist of two interconnected processes: forward and backward. The forward diffusion process gradually corrupts the data by interpolating between a sampled data point x0 and Gaussian noise. (The information here is mostly taken from the Imagine Flash paper: https://arxiv.org/pdf/2405.05224.) The forward process is formulated below.
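In the Imagine Flash notation, a standard form of this stochastic interpolant is:

```latex
x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
```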
where α_t and σ_t define the signal-to-noise ratio (SNR) of the stochastic interpolant x_t. In the following, they opt for coefficients (α_t, σ_t) that result in a variance-preserving process. When viewed in the continuous-time limit, the forward process above can be expressed as a stochastic differential equation.
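In standard score-based notation, this forward SDE reads:

```latex
\mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t
```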
where f(x, t): R^d → R^d is a vector-valued drift coefficient, g(t): R → R is the diffusion coefficient, and w_t denotes Brownian motion at time t.
Inversely, the backward diffusion process is intended to undo the noising process and generate samples. According to Anderson's theorem, the forward SDE introduced above satisfies a reverse-time diffusion equation, which can be reformulated using the Fokker-Planck equations into a deterministic counterpart with equivalent marginal probability densities, known as the probability flow ODE.
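In the usual notation, the reverse-time SDE and its deterministic probability flow ODE counterpart are:

```latex
\mathrm{d}x_t = \left[ f(x_t, t) - g(t)^2 \nabla_x \log p_t(x_t) \right] \mathrm{d}t + g(t)\, \mathrm{d}\bar{w}_t \qquad \text{(reverse SDE)}
\mathrm{d}x_t = \left[ f(x_t, t) - \tfrac{1}{2} g(t)^2 \nabla_x \log p_t(x_t) \right] \mathrm{d}t \qquad \text{(probability flow ODE)}
```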
This allows estimating the score in the formulation above, usually parameterized by a time-conditioned neural network. Given these estimates, one can sample with an iterative numerical solver; first-order solvers such as DDIM update the sample using the data estimate x̂0 at time-step t.
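A standard deterministic DDIM-style update in this (α, σ) notation, together with the data estimate it relies on, is:

```latex
x_{t-1} = \alpha_{t-1}\, \hat{x}_0 + \sigma_{t-1}\, \hat{\epsilon}_\Theta(x_t, t),
\qquad
\hat{x}_0 = \frac{x_t - \sigma_t\, \hat{\epsilon}_\Theta(x_t, t)}{\alpha_t}
```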
Diffusion Architectures
The following slides include:
-Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack 27 September 2023
-Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding 14
May 2024
-Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion
Transformers 9 May 2024
- Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis 29 Dec 2023
- And more
Emu: Enhancing Image Generation Models Using Photogenic Needles in
a Haystack 27 September 2023 ( Image quality should always be prioritized over quantity.)
Their key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve generation quality. Effective fine-tuning of LLMs can be achieved with a relatively small but high-quality fine-tuning dataset, e.g., using 27K prompts.
They increase the number of channels of the autoencoder from 4 to a higher dimension. They use an additional adversarial loss for reconstruction, and they also apply a non-learnable preprocessing step to the RGB images, using a Fourier feature transform to lift the input channel dimension.
They use a large U-Net with 2.8B trainable parameters. They increase the channel size and the number of stacked residual blocks in each stage for larger model capacity. They use text embeddings from both CLIP ViT-L and T5-XXL as the text conditions.
They pre-train the model on 1.1B images, training with progressively increasing resolutions. This approach improves finer details at higher resolutions.
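A minimal sketch of such a non-learnable Fourier feature lift (the number of frequencies and their values are illustrative assumptions, not Emu's actual choices):

```python
import torch

def fourier_lift(rgb: torch.Tensor, num_freqs: int = 4) -> torch.Tensor:
    """Lift an RGB image (B, 3, H, W) to a higher channel dimension with fixed sinusoidal features."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=rgb.dtype, device=rgb.device)  # 1, 2, 4, 8
    feats = [rgb]
    for f in freqs:
        feats.append(torch.sin(2 * torch.pi * f * rgb))
        feats.append(torch.cos(2 * torch.pi * f * rgb))
    return torch.cat(feats, dim=1)  # (B, 3 + 3 * 2 * num_freqs, H, W)

x = torch.rand(1, 3, 64, 64)
print(fourier_lift(x).shape)  # torch.Size([1, 27, 64, 64])
```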
https://arxiv.org/pdf/2309.15807
DiT is based on the Vision Transformer (ViT) architecture, which operates on sequences of patches. The first layer of DiT is "patchify", which converts the spatial input into a sequence of T tokens, each of dimension d, by linearly embedding each patch of the input. Following patchify, frequency-based positional embeddings are applied. The number of tokens is determined by the patch size p; note that changing p has no meaningful impact on downstream parameter counts.
Following patchify, the input tokens are processed by a sequence of transformer blocks. In addition to noised image inputs, diffusion models sometimes process additional conditional information such as noise timesteps t, class labels c, or natural language. There are four variants of conditioning. In-context conditioning simply appends the vector embeddings of t and c as two additional tokens to the input sequence; this introduces negligible new Gflops to the model. The cross-attention block concatenates the embeddings of t and c into a length-two sequence, separate from the image token sequence, and the transformer block is modified to include an additional multi-head cross-attention layer following the multi-head self-attention block; cross-attention adds the most Gflops to the model, roughly a 15% overhead. Adaptive layer norm: they explore replacing standard layer norm layers in transformer blocks with adaptive layer norm (adaLN). Rather than directly learning dimension-wise scale and shift parameters γ and β, they regress them from the sum of the embedding vectors of t and c; adaLN adds the least Gflops and is thus the most compute-efficient. They also find that a modification of the adaLN DiT block that zero-initializes the block's residual contributions (adaLN-Zero) accelerates large-scale training.
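A minimal sketch of the adaLN-Zero modulation described above, assuming a simplified block (the internals are stand-ins; only the regressed scale/shift/gate pattern and the zero initialization follow the description):

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Simplified DiT block: scale, shift, and gate are regressed from the t + c embedding."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress (shift, scale, gate) x 2 from the conditioning; zero init => the block starts as identity.
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, cond):  # x: (B, T, d), cond: (B, d) = emb(t) + emb(c)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.mlp(h)

block = AdaLNZeroBlock(dim=64)
out = block(torch.randn(2, 16, 64), torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```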
Scalable Diffusion Models with Transformers 2 Mar 2023
https://arxiv.org/pdf/2212.09748
They used two text encoders, a bilingual T5 and CLIP. They perform post-training optimization in the inference stage to lower the deployment cost of Hunyuan-DiT. They used the VAE from SDXL, which was fine-tuned on 512x512 images starting from the SD1.5 VAE. They say the SDXL VAE improves clarity, alleviates over-saturation, and reduces distortions.
They found that the Adaptive Layer Norm used in class-conditional DiT performs unsatisfactorily at enforcing fine-grained text conditions, so they used cross-attention. Hunyuan-DiT has two types of transformer blocks, the encoder block and the decoder block. Both of them contain three modules: self-attention, cross-attention, and a feed-forward network (FFN). The text information is fused in the cross-attention module. The decoder block additionally contains a skip module, which adds the information from the encoder block in the decoding stage. The skip module is similar to the long skip-connections in U-Nets, but there are no upsampling or downsampling modules in Hunyuan-DiT due to the transformer structure. Finally, the tokens are reorganized to recover the two-dimensional spatial structure. For training, they find that v-prediction (predicting the velocity, i.e., the rate of change, instead of the noise) gives better empirical results.
They used two-dimensional Rotary Positional Embedding (RoPE) to encode both absolute position and relative position dependency. To generate images at multiple resolutions, they tried extended positional encoding and centralized interpolative positional encoding (CIPE). They find that CIPE converges faster and generalizes better.
To stabilize training, they used QK-Norm. They add layer normalization after the skip module in the decoder blocks to avoid loss explosion during training. They found that certain operations, e.g., layer normalization, tend to overflow with FP16, so they specifically switch them to FP32 to avoid numerical errors.
Due to the large number of model parameters in Hunyuan-DiT and the massive amount of image data required for training, they adopted ZeRO, flash-attention, multi-stream asynchronous execution, activation checkpointing, and kernel fusion to enhance training speed. Deploying Hunyuan-DiT for users is expensive, so they adopt multiple engineering optimization strategies to improve inference efficiency, including ONNX graph optimization, kernel optimization, operator fusion, precomputation, and GPU memory reuse.
They find that adversarial training tends to collapse, and the best way to accelerate the model at inference time is progressive distillation.
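For reference, the standard v-prediction target relates the velocity to the noise and the data as:

```latex
v_t = \alpha_t\, \epsilon - \sigma_t\, x_0
```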
Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with
Fine-Grained Chinese Understanding 14 May 2024
https://arxiv.org/pdf/2405.08748
Lumina-T2X: Transforming Text into Any Modality, Resolution, and
Duration via Flow-based Large Diffusion Transformers 9 May 2024
The Lumina-T2X family is a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, forming a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling Lumina-T2X models to scale up to 7 billion parameters and extend the context window to 128K tokens. Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational cost of a 600-million-parameter naive DiT (PixArt-α), indicating that increasing the number of parameters significantly accelerates convergence of generative models without compromising visual quality. Lumina-T2X tokenizes images, videos, multi-views of 3D objects, and spectrograms into one-dimensional sequences, similar to the way LLMs process natural language. By incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X can seamlessly encode any modality, regardless of resolution, aspect ratio, or temporal duration, into a unified 1-D token sequence. The empirical observations indicate that employing larger models, high-resolution images, and longer-duration video clips can significantly accelerate the convergence speed of diffusion transformers.
https://arxiv.org/pdf/2405.05945
PIXART-α: FAST TRAINING OF DIFFUSION TRANSFORMER FOR
PHOTOREALISTIC TEXT-TO-IMAGE SYNTHESIS 29 Dec 2023
They propose a way to train a text-to-image diffusion model at low computational cost (but still 753 A100 GPU days and $28,400). They decompose the intricate text-to-image generation task into three streamlined subtasks: (1) learning the pixel distribution of natural images (capturing pixel dependency), (2) learning text-image alignment, and (3) enhancing the aesthetic quality of images.
To generate captions with high information density, they leverage the state-of-the-art vision-language model LLaVA. Employing the prompt "Describe this image and its style in a very detailed manner", they significantly improve the quality of the captions.
Based on the Diffusion Transformer (DiT), they incorporate cross-attention modules to inject text conditions and streamline the computation-intensive class-condition branch to improve efficiency.
https://arxiv.org/pdf/2310.00426
PixArt-Σ is a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution, which evolves from a 'weaker' baseline to a 'stronger' model by incorporating higher-quality data, a process they term "weak-to-strong training". They also propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. To enhance PixArt-α, they expand its generation resolution from 1K to 4K. Generating images at high resolutions introduces a significant increase in the number of tokens, and thus in computational demand. They introduce a self-attention module with key and value token compression tailored to the DiT framework. Additionally, they employ a specialized weight-initialization scheme, allowing smooth adaptation from a pre-trained model without KV compression. This design effectively reduces training and inference time by 34% for high-resolution image generation. They utilize only 9% of the GPU days required by PixArt-α to achieve a strong 1K high-resolution image generation model. They replaced LLaVA with Share-Captioner to prevent hallucinations.
To mitigate the potential information loss caused by KV compression in the self-attention computation, they opt to retain all the query (Q) tokens. This strategy allows them to utilize KV compression effectively while mitigating the risk of losing crucial information. As the compression function, they utilize group convolutions with a stride of 2 for local aggregation of keys and values. They design a specialized convolution kernel initialization, "Conv Avg Init", that utilizes group convolution and initializes the weights as w = 1/R^2, equivalent to an average operator.
They replaced PixArt-α's VAE (the SD1.5 VAE) with the SDXL VAE and fine-tuned it; 2K training steps are enough for this fine-tuning. While fine-tuning the LR model to HR, they see a performance degradation caused by discrepancies in positional embeddings (PE) between different resolutions. To solve it, they initialized the HR model's PE by interpolating the LR model's PE; the fine-tuning then converges quickly, at around 1K steps. They can use KV compression directly when fine-tuning from LR pre-trained models without KV compression, and this reduces training and inference time by 34%.
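A minimal sketch of the KV token compression described above, assuming a compression ratio R = 2 and a depthwise group convolution with the "Conv Avg Init" averaging initialization (details of the grouping follow this description, not the released code):

```python
import torch
import torch.nn as nn

class KVCompressor(nn.Module):
    """Compress key/value tokens (B, H*W, C) by a factor R along each spatial axis
    using a depthwise (group) convolution initialized as an average operator."""
    def __init__(self, channels: int, ratio: int = 2):
        super().__init__()
        self.ratio = ratio
        self.conv = nn.Conv2d(channels, channels, kernel_size=ratio, stride=ratio, groups=channels)
        nn.init.constant_(self.conv.weight, 1.0 / ratio ** 2)  # "Conv Avg Init": acts as average pooling
        nn.init.zeros_(self.conv.bias)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)  # back to a 2-D grid
        x = self.conv(x)                                # (B, C, h/R, w/R)
        return x.flatten(2).transpose(1, 2)             # compressed KV tokens

kv = torch.randn(1, 32 * 32, 64)
print(KVCompressor(64).forward(kv, 32, 32).shape)  # torch.Size([1, 256, 64])
```

The query tokens are left untouched; only the key and value streams would pass through such a module before attention.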
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-
Image Generation 17 Mar 2024
https://arxiv.org/pdf/2403.04692
WÜRSTCHEN: AN EFFICIENT ARCHITECTURE FOR LARGE-SCALE
TEXT-TO-IMAGE DIFFUSION MODELS 29 SEP 2023
Würstchen is a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of their work is a latent diffusion technique in which they learn a detailed but extremely compact semantic image representation that is used to guide the diffusion process.
They first trained a VQGAN (Stage A), then a latent image decoder (Stage B), and then a text-conditional latent image generation model. For image generation, they first generate a latent image at a strong compression ratio using a text-conditional LDM (Stage C). Subsequently, this representation is transformed to a less-compressed latent space by means of a secondary model tasked with this reconstruction (Stage B). Finally, the tokens that comprise the latent image at this intermediate resolution are decoded to yield the output image (Stage A).
They initialized the Semantic Compressor with weights pre-trained on ImageNet, which, however, does not capture the broad distribution of images present in large text-image datasets and is not well-suited for semantic image projection, since it was trained with an objective to discriminate the ImageNet categories. So they update the weights of the Semantic Compressor during training, establishing a latent space with high-precision semantic information. During training of Stage B, they intermittently add noise to the Semantic Compressor's embeddings to teach the model to handle imperfect embeddings, which is likely the case when generating these embeddings with Stage C. For Stage C training, they follow a standard diffusion process, applied in the latent space of the fine-tuned Semantic Compressor.
https://arxiv.org/pdf/2306.00637
They introduce a novel architecture, Shallow-UViT, which allows pretraining the core layers of pixel-space diffusion models on huge datasets of text-image data, eliminating the need to train the entire model on high-resolution images. One can significantly improve different image-quality metrics by leveraging the representation pretrained at low resolution while growing the model resolution in a greedy fashion. They simplify the UNet's conventional hierarchical structure, which operates on multiple resolutions, and define the Shallow-UViT (SU), a simplified architecture comprising a shallow encoder and decoder operating on a fixed spatial grid.
Bad paper. Do not read it.
Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
28 May 2024
https://arxiv.org/pdf/2405.16759
They identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Their solution involves crafting a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder on a meticulously curated paired glyph-text dataset. They created high-quality text-glyph data by establishing a scalable pipeline capable of generating virtually unlimited paired data based on graphic rendering. They employed an innovative box-level contrastive loss to fine-tune ByT5 into a series of customized text encoders for glyph generation, named Glyph-ByT5, and then integrated it into SDXL using an efficient region-wise cross-attention mechanism. Besides single words, paragraph rendering is a challenging task, since a paragraph does not fit into a single line. They define a 'paragraph' as a block of text content that cannot be accommodated within a single line, typically consisting of more than 10 words or 100 characters. They empirically demonstrate that the diffusion model can effectively plan multi-line arrangements and adjust the line or word spacing according to the given text box, regardless of its size or aspect ratio. Unlike conventional CLIP, which applies contrastive loss to the entire image, they propose applying a box-level contrastive loss that treats each text box and its corresponding text prompt as an instance. Based on the number of characters or words within the text box, they categorize them into word, sentence, or paragraph text boxes.
For the box-level contrastive loss, they compute the box embedding and sub-text embedding of each image-text pair; the embeddings come from the text encoder and the visual encoder. They introduce a region-wise multi-head cross-attention mechanism to seamlessly fuse the glyph knowledge encoded in the customized text encoder within the target typography boxes with the prior knowledge carried by the original text encoders in the regions outside the typography boxes. In this region-wise multi-head cross-attention, they first partition the input pixel embeddings (Query) into multiple groups. These groups correspond to the target text boxes, which can be either specified by the user or automatically predicted by leveraging the planning capability of GPT-4. Simultaneously, they divide the text prompts (Key-Value) into corresponding sub-sections, which include a global prompt and several groups of glyph-specific prompts. They specifically direct the pixel embeddings within the target text boxes to attend only to the glyph text embeddings extracted with Glyph-ByT5. Similarly, pixel embeddings outside the text boxes attend exclusively to the global prompt embeddings extracted with the original two CLIP text encoders.
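A rough sketch of the region-wise attention routing described above, using a simple additive mask (box handling and multi-head details are simplified; the function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def region_wise_cross_attention(pixel_q, glyph_kv, global_kv, box_mask):
    """pixel_q: (B, N, C) pixel queries; box_mask: (B, N) bool, True inside text boxes.
    glyph_kv: (B, Mg, C) Glyph-ByT5 embeddings; global_kv: (B, Mc, C) CLIP embeddings."""
    kv = torch.cat([glyph_kv, global_kv], dim=1)                 # (B, Mg + Mc, C)
    mg, mc = glyph_kv.shape[1], global_kv.shape[1]
    # Queries inside boxes may only attend to glyph tokens, queries outside only to global tokens.
    allow = torch.cat([box_mask[:, :, None].expand(-1, -1, mg),
                       (~box_mask)[:, :, None].expand(-1, -1, mc)], dim=-1)
    scores = pixel_q @ kv.transpose(1, 2) / pixel_q.shape[-1] ** 0.5
    scores = scores.masked_fill(~allow, float("-inf"))
    return F.softmax(scores, dim=-1) @ kv

q = torch.randn(1, 64, 32); g = torch.randn(1, 8, 32); c = torch.randn(1, 16, 32)
mask = torch.zeros(1, 64, dtype=torch.bool); mask[:, :20] = True
print(region_wise_cross_attention(q, g, c, mask).shape)  # torch.Size([1, 64, 32])
```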
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text
Rendering 14 Mar 2024
https://arxiv.org/pdf/2403.09622
To address the dense-prompt understanding limitations of CLIP-based diffusion models, they propose ELLA, which incorporates a powerful LLM in a lightweight and efficient manner; the proposed module is the Timestep-Aware Semantic Connector (TSC), trained on text-image pair data rich in information density. Diffusion models predict low-frequency features in the first stages of the denoising process and high-frequency features later, so they expect the TSC to behave in the same way; for this reason it is timestep-aware. The architectural design of the TSC is based on the resampler, and it instills temporal dependency by integrating the timestep into Adaptive Layer Normalization. TSC can integrate community models and downstream tools such as LoRA and ControlNet.
The Timestep-Aware Semantic Connector (TSC) receives text features of arbitrary length as well as the timestep embedding, and outputs fixed-length semantic queries. These semantic queries are used to condition the noisy-latent prediction of the pre-trained U-Net through cross-attention. To improve compatibility and minimize the number of training parameters, they leave the text encoder of the Large Language Model as well as the U-Net and VAE components frozen. The only trainable component is consequently the lightweight TSC module.
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
8 Mar 2024
https://arxiv.org/pdf/2403.05135
Realistic data distributions are typically high-dimensional, complex, and often multimodal. Directly encoding such data into a single unimodal Gaussian distribution and learning a corresponding reverse noise-to-data mapping is challenging. The mapping, or generative ODE, necessarily needs to be highly complex, with strong curvature, and one may consider it unnatural to map an entire data distribution to a single Gaussian distribution. In practice, conditioning information, such as class labels or text prompts, often helps to simplify the complex mapping by offering the DM's denoiser additional cues for more accurate denoising. However, such conditioning information is typically of a semantic nature, and even given a class or text prompt, the mapping remains highly complex. They propose Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff): DMs augmented with additional discrete latent variables that encode high-level information about the data and can be used by the main DM to simplify its denoising task. These discrete latents are inferred through an encoder network and learnt end-to-end together with the DM. Thereby, the discrete latents directly learn to encode information that is beneficial for reducing the DM's score-matching objective, making the DM's hard task of mapping simple noise to complex data easier.
DisCo-Diff's training process is divided into two stages. In the first stage, the denoiser Dθ and the encoder Eϕ are co-optimized in an end-to-end fashion. This is achieved by extending the denoising score-matching objective to include learnable discrete latents z associated with each data point y. The denoiser network Dθ can better capture the time-dependent score (i.e., achieve a reduced loss) if the score for each sub-distribution p(x|z; σ(t)) is simplified. Therefore, the encoder Eϕ, which has access to clean input data, is encouraged to encode useful information into the discrete latents and help the denoiser reconstruct the data more accurately. Naively backpropagating gradients into the encoder through the sampling of the discrete latent variables z is not possible, so during training they rely on a continuous relaxation based on the Gumbel-Softmax distribution. In the second stage, they train an autoregressive model Aψ to capture the distribution pϕ(z) of the discrete latent variables, defined by pushing the clean data through the trained encoder.
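A small sketch of the Gumbel-Softmax relaxation used to backpropagate through the discrete latents (the temperature and the straight-through variant are illustrative choices, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0, hard: bool = True):
    """Differentiable sample of a categorical latent from encoder logits (B, K)."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-9) + 1e-9)
    soft = F.softmax((logits + gumbel) / tau, dim=-1)
    if hard:
        # Straight-through: forward pass uses the one-hot sample, backward uses the soft relaxation.
        index = soft.argmax(dim=-1, keepdim=True)
        one_hot = torch.zeros_like(soft).scatter_(-1, index, 1.0)
        return one_hot + (soft - soft.detach())
    return soft

logits = torch.randn(4, 16, requires_grad=True)
z = gumbel_softmax_sample(logits, tau=0.5)
z.sum().backward()  # gradients flow back into the encoder logits
print(z.shape, logits.grad is not None)
```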
DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents
3 Jul 2024
https://arxiv.org/pdf/2407.03300
OneDiffusion, a versatile, large-scale diffusion model that supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional
generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as
depth estimation and segmentation.
Read rest of it……
One Diffusion to Generate Them All 25 Nov 2024
https://arxiv.org/pdf/2411.16318
They introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token and thus achieves linear complexity. Their experiments indicate that, by fine-tuning the attention layers on merely 10K self-generated samples for 10K iterations, they can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. They find that while formulation-variation strategies have proven effective in attention-based UNets [38] and DiTs trained from scratch [62], they do not yield similar success with pre-trained DiTs. Key-value compression often leads to distorted details, and key-value sampling highlights the necessity of local tokens for each query to generate visually coherent results. They identify four elements crucial for linearizing pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Their proposal is that each query interacts only with tokens within a predefined distance r. Since the number of key-value tokens interacting with each query is fixed, the resulting DiT achieves linear complexity with respect to image resolution.
LinFusion has shown that linear-attention approaches achieve promising results in attention-based UNets. However, they find that this is not the case for pre-trained DiTs. They speculate that it is because attention layers are the only modules for token interactions in DiTs, unlike in U-Nets, so substituting all of them has a substantial impact on the final outputs. Other formulations, like Sigmoid Attention, fail to converge within a limited number of iterations. High-rank attention maps means that attention maps calculated by efficient attention alternatives should be sufficient to capture the intricate token-wise relationships. Extensive attention scores are concentrated along the diagonal, indicating that the attention maps do not exhibit the low-rank property assumed by many prior works; that is why methods like linear attention and Swin Transformer largely produce blocky patterns. Feature integrity implies that raw query, key, and value features are more favorable than compressed ones. Although PixArt-Σ has demonstrated that applying KV compression in deep layers does not hurt performance much, this approach is not suitable for completely linearizing pre-trained DiTs. Methods based on KV compression, such as PixArt-Σ and Agent Attention, tend to produce distorted textures compared to the results from Swin Transformer and Neighborhood Attention, which highlights the necessity of preserving the integrity of the raw query, key, and value tokens.
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers
Up 20 Dec 2024
https://arxiv.org/pdf/2412.16112
CLEAR adopts circular windows, where key-value tokens within a Euclidean distance less than a predefined radius r are considered for each query. Compared with the corresponding square windows, the computation introduced by this design is π/4 times as much. Although each query only has access to tokens within a local window, stacking multiple Transformer blocks enables each token to gradually capture holistic information, similar to the way convolutional neural networks operate. To promote functional consistency between the models before and after fine-tuning, they employ a knowledge-distillation objective during fine-tuning. Since attention is confined to a local window around each query, CLEAR offers greater efficiency for multi-GPU patch-wise parallel inference compared to the full attention in the original DiTs, which is particularly valuable for generating ultra-high-resolution images. Specifically, each GPU is responsible for processing an image patch, and GPU communication is only required in the boundary areas.
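A naive sketch of the circular-window constraint, built as an explicit attention mask for clarity (a real implementation would use an efficient local-attention kernel rather than a dense mask):

```python
import torch

def circular_window_mask(h: int, w: int, radius: float) -> torch.Tensor:
    """Boolean (h*w, h*w) mask: True where the key lies within Euclidean distance
    `radius` of the query on the 2-D token grid."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (h*w, 2)
    dist = torch.cdist(coords, coords)                                  # pairwise token distances
    return dist < radius

mask = circular_window_mask(16, 16, radius=4.0)
print(mask.shape, mask.float().mean())  # each query attends to a fixed local neighborhood
# In attention: scores.masked_fill(~mask, float("-inf")) before the softmax.
```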
The intrinsic distinction between AR and diffusion models lies in their approach to data-distribution factorization. AR models treat data as an ordered sequence, factorizing it along the sequential axis, where the probability of each token is conditioned on all preceding tokens. This factorization enables the AR paradigm to generalize effectively and efficiently across an arbitrary number of tokens, making it well-suited for long-sequence reasoning and in-context generation. In contrast, diffusion models factorize data along the noise-level axis, where the tokens at each step are a refined (denoised) version of themselves from the previous step. As a result, the diffusion paradigm generalizes to an arbitrary number of data-refinement steps, enabling iterative quality improvement with scaled inference compute. CausalFusion is designed to predict any number of tokens at any AR step, with any predefined sequence order and any level of inference compute, thereby minimizing the inductive biases present in existing generative models. As shown in Figure 1 of the paper, this approach provides a broad spectrum between the AR and diffusion paradigms, allowing smooth interpolation between the two endpoints during both training and inference. Starting from the DiT architecture, they gradually convert it into a decoder-only transformer compatible with existing AR models like GPT and LLaMA. Given a sample of training images X, AR models split X along the spatial dimensions into a sequence of tokens, X = {x1, . . . , xL}, where L is the number of tokens. Diffusion models gradually add random noise (typically Gaussian) to X in a so-called forward process; it is a Markov chain along the noise level, where each noisy version xt is conditioned on the previous state.
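In standard form, the two factorizations contrasted above are:

```latex
p(X) = \prod_{l=1}^{L} p(x_l \mid x_{<l}) \quad \text{(AR)}
\qquad\qquad
p(x_0) = \int p(x_T) \prod_{t=1}^{T} p(x_{t-1} \mid x_t)\, \mathrm{d}x_{1:T} \quad \text{(diffusion)}
```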
Causal Diffusion Transformers for Generative Modeling 17 Dec 2024
https://arxiv.org/pdf/2412.12095
Guidance Techniques
The following slides include:
- SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing 6 May 2024
No Training, No Problem: Rethinking Classifier-Free
Guidance for Diffusion Models 2 Jul 2024
Independent condition guidance (ICG) and time-step guidance (TSG). The main idea behind ICG is that by using a conditioning vector independent of the input data, the conditional score function becomes equivalent to the unconditional score. Time-step guidance aims to improve the accuracy of denoising at each sampling step by leveraging the time-step information learned by the diffusion model to steer sampling trajectories toward better noise-removal paths.
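A sketch of the resulting guidance rule, replacing the unconditional branch of classifier-free guidance with a prediction under an independently sampled condition c̃ (an illustrative formulation, not necessarily the paper's exact notation):

```latex
\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \tilde{c}) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \tilde{c}) \right)
```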
https://arxiv.org/pdf/2407.02687
Loss-Guided Diffusion Models for Plug-and-Play Controllable Generation
https://proceedings.mlr.press/v202/song23k/song23k.pdf
TraDiffusion: Trajectory-Based Training-Free Image Generation
19 August 2024
This method allows users to effortlessly guide image generation via mouse trajectories. To achieve precise control, they design a distance-awareness energy function to effectively guide latent variables, ensuring that the focus of generation stays within the areas defined by the trajectory. The energy function encompasses a control function, which draws the generation closer to the specified trajectory, and a movement function, which diminishes activity in areas distant from the trajectory. Due to the sparsity of the trajectories, it is difficult to directly combine them with backward guidance. A natural idea is to obtain the prior structure of an object through the attention maps of the cross-attention layers, rather than directly using the trajectories for backward guidance.
They propose to use a distance-awareness energy function for guidance. The control function, which guides the object to approach a given trajectory, is built on a distance matrix Dμi computed by the OpenCV (Bradski 2000) function "distanceTransform", in which each value denotes the distance from each location μ of the attention map to the given trajectory. However, this alone does not effectively inhibit the attention response of the object in irrelevant regions far from the trajectory, so they add a movement function that suppresses the attention response in regions far from the object's trajectory.
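A hedged sketch of the distance-map ingredient: computing the distance from every attention-map location to a user-drawn trajectory with OpenCV and using it to penalize attention far from the trajectory (the paper's actual control and movement functions are more specific than this):

```python
import cv2
import numpy as np

def trajectory_distance_map(points, h: int, w: int) -> np.ndarray:
    """Distance from every attention-map location to the user-drawn trajectory."""
    traj = np.full((h, w), 255, dtype=np.uint8)
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        cv2.line(traj, (x0, y0), (x1, y1), color=0, thickness=1)  # trajectory pixels set to 0
    # distanceTransform measures the distance to the nearest zero pixel.
    return cv2.distanceTransform(traj, distanceType=cv2.DIST_L2, maskSize=5)

dist = trajectory_distance_map([(4, 4), (20, 10), (30, 28)], h=32, w=32)
attn = np.random.rand(32, 32)
control_energy = (attn * dist).mean()  # large when attention mass sits far from the trajectory
print(dist.shape, control_energy)
```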
https://arxiv.org/pdf/2407.02687
Output Control Techniques
The following slides include:
-Adding Conditional Control to Text-to-Image Diffusion Models 26 Nov 2023
-ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback 11 Apr 2024
- CTRLororALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models 13 May 2024
- And more
ControlNet is a neural network architecture that adds spatial conditioning controls to large, pretrained text-to-image diffusion models. Zero-initialized layers are used by ControlNet for connecting network blocks. The reason for initializing weights as zero instead of Gaussian is to progressively grow the parameters from zero and ensure that no harmful noise affects the fine-tuning. To add a ControlNet to such a pre-trained neural block, they lock (freeze) the parameters Θ of the original block and simultaneously clone the block into a trainable copy with its own parameters. The trainable copy takes an external conditioning vector c as input and is connected to the locked model with zero-initialized 1x1 convolution layers.
In the training process, they randomly replace 50% of the text prompts with empty strings. This approach increases ControlNet's ability to directly recognize the semantics of the input conditioning images (e.g., edges, poses, depth, etc.) as a replacement for the prompt. They observe that the model converges suddenly rather than progressively.
When a conditioning image is added via ControlNet, it can be added to both ϵuc and ϵc, or only to ϵc. Their solution is to first add the conditioning image to ϵc and then multiply a weight wi into each connection between Stable Diffusion and ControlNet according to the resolution of each block.
To apply multiple conditioning images (e.g., Canny edges and pose) to a single instance of Stable Diffusion, they can directly add the outputs of the corresponding ControlNets to the Stable Diffusion model.
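A minimal sketch of the locked-block / trainable-copy wiring with zero 1x1 convolutions (the block internals are stand-ins; only the connection pattern follows the description above):

```python
import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    """y = F(x; Θ_frozen) + zero_conv( F_copy( x + zero_conv(c); Θ_trainable ) )"""
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.locked = block
        for p in self.locked.parameters():
            p.requires_grad_(False)                 # freeze the original block
        self.copy = copy.deepcopy(block)            # trainable clone
        self.zero_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.zero_out = nn.Conv2d(channels, channels, kernel_size=1)
        for conv in (self.zero_in, self.zero_out):  # zero init: no effect at the start of training
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, x, cond):
        return self.locked(x) + self.zero_out(self.copy(x + self.zero_in(cond)))

block = nn.Conv2d(8, 8, 3, padding=1)               # stand-in for a pretrained U-Net block
net = ControlledBlock(block, channels=8)
out = net(torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16))
print(out.shape)  # torch.Size([1, 8, 16, 16])
```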
Adding Conditional Control to Text-to-Image Diffusion Models
26 Nov 2023
https://arxiv.org/pdf/2302.05543
For an input conditional control, they use a pre-trained discriminative reward model to extract the corresponding condition from the generated images, and then optimize the consistency loss between the input conditional control and the extracted condition. They introduce an efficient reward strategy that deliberately disturbs the input images by adding noise and then uses the single-step denoised images for reward fine-tuning.
The model performs T denoising steps to generate the image x′0 from random noise xT. L is an abstract metric function that can take different concrete forms for different visual conditions: it compares the input condition with the output of the supervisor (reward) model applied to the diffusion model's output. For example, in the context of using a segmentation mask as the input conditional control, L could be the per-pixel cross-entropy loss.
Achieving the pixel-space consistency loss requires x0, the final denoised image, which would take 20-50 sampling steps and thus too much computation, so they propose a one-step efficient reward strategy: a simple algebraic manipulation yields a single-step prediction of x0 from the noise estimate, as shown below.
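In standard DDPM notation, the single-step estimate of x0 from the noise prediction is:

```latex
\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\; \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
```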
The timestep threshold is a hyper-parameter used to determine whether a noised image xt should be utilized for reward fine-tuning. They note that a small noise ϵ (i.e., a relatively small timestep t) can disturb the consistency and lead to effective reward fine-tuning. During the reward fine-tuning phase, they freeze the pre-trained discriminative reward model and the text-to-image model, and only update the ControlNet module following its original implementation, which ensures that its generative capabilities are not compromised.
ControlNet++: Improving Conditional Controls
with Efficient Consistency Feedback 11 Apr 2024
https://arxiv.org/pdf/2404.07987
CTRLororALTer: Conditional LoRAdapter for Efficient
0-Shot Control & Altering of T2I Models 13 May 2024
CTRLorALTer: formulating a unified approach for conditioning on global controls like style and on local controls like structure, in an efficient and generic manner, remains a key open problem. LoRAdapter is a novel approach to adding conditional information to LoRAs, enabling zero-shot generalization and making them applicable to both structure and style, and possibly many other conditioning types. They propose a LoRA-based conditioning mechanism whose behavior changes based on conditioning provided at inference time, enabling zero-shot generalization.
They add a condition to the LoRA matrix A: the static low-rank update is replaced by a conditioned one in which the output of A is modulated by an element-wise (Hadamard) scale γ and shift β predicted from the conditioning, before being passed to B.
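A hedged sketch of this conditional LoRA idea, assuming the scale and shift act on the intermediate low-rank activation (the exact placement in the paper may differ):

```python
import torch
import torch.nn as nn

class LoRAdapterLinear(nn.Module):
    """Frozen base linear layer plus a LoRA branch whose low-rank activation is
    modulated by condition-dependent scale (gamma) and shift (beta)."""
    def __init__(self, base: nn.Linear, rank: int, cond_dim: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)                       # LoRA convention: no effect at initialization
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * rank)  # predicts the modulation from the condition

    def forward(self, x, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        h = self.A(x)
        h = (1 + gamma) * h + beta                          # Hadamard scale and shift
        return self.base(x) + self.B(h)

layer = LoRAdapterLinear(nn.Linear(64, 64), rank=8, cond_dim=32)
y = layer(torch.randn(2, 64), torch.randn(2, 32))
print(y.shape)  # torch.Size([2, 64])
```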
https://arxiv.org/pdf/2405.07913
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion
Model with Any Condition 12 Dec 2023
FreeControl is a novel method for training-free controllable T2I generation that models the linear subspace of intermediate diffusion features and employs guidance in this subspace during the generation process. Given a text prompt c and a guidance image Ig of any modality, FreeControl directs a pre-trained T2I diffusion model ϵθ to comply with c while also respecting the semantic structure provided by Ig throughout the sampling process of an output image I. The key finding is that the leading principal components of self-attention block features inside a pre-trained diffusion model provide a strong and surprisingly consistent representation of semantic structure across a broad spectrum of image modalities.
FreeControl is a two-stage pipeline. It begins with an analysis stage, where diffusion features of seed images undergo principal component analysis (PCA), with the leading PCs forming the time-dependent bases Bt as the semantic structure representation. Ig subsequently undergoes DDIM inversion, with its diffusion features projected onto Bt, yielding their semantic coordinates Sg_t. In the synthesis stage, structure guidance encourages I to develop the same semantic structure as Ig by attracting St to Sg_t. In the meantime, appearance guidance promotes appearance similarity between I and ¯I by penalizing the difference in their feature statistics. Their key observation is that the leading PCs form a semantic basis; it exhibits a strong correlation with object pose, shape, and scene composition across diverse image modalities.
https://arxiv.org/pdf/2312.07536
ControlNeXt: Powerful and Efficient Control for Image and
Video Generation 15 Agust 2024
They design a more straightforward and efficient architecture, replacing heavy additional branches with minimal additional cost compared to the base model. Such a concise structure also allows the method to seamlessly integrate with other LoRA weights, enabling style alteration without the need for additional training. For training, they reduce the number of learnable parameters by up to 90% compared to the alternatives. They propose a method called Cross Normalization (CN) as a replacement for "zero-convolution" to achieve fast and stable training convergence. They say that zero convolution increases training challenges, slows convergence, and results in the "sudden convergence phenomenon". For training, they fine-tune the base model by freezing most of its modules and selectively training a much smaller subset of the pretrained parameters. They propose that the key reason for training collapse is that the newly initialized parameters have a different data distribution, in terms of mean and standard deviation, compared to the pre-trained parameters, and introduce cross normalization to align the data distributions.
They say that the controls typically have a simple form or maintain a high level of consistency with the denoising features, eliminating the need to insert controls at multiple stages. They integrate the controls into the denoising branch at a single selected middle block by directly adding them to the denoising features after normalization through Cross Normalization.
Cross Normalization: they calculate the mean and variance of the main branch (the pretrained diffusion model) and use them to normalize the control branch's features. Cross Normalization aligns the distributions of the denoising and control features, serving as a bridge connecting the diffusion and control branches. It accelerates the training process, ensures the effectiveness of the control on generation even at the beginning of training, and reduces sensitivity to the initialization of network weights.
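A minimal sketch of Cross Normalization as described above, assuming per-sample statistics over all channel and spatial dimensions (the exact normalization axes are not specified here):

```python
import torch

def cross_normalize(denoise_feat: torch.Tensor, control_feat: torch.Tensor, eps: float = 1e-6):
    """Normalize the control features with the mean/std of the main (denoising) branch,
    then add them to the denoising features. Shapes: (B, C, H, W)."""
    mu = denoise_feat.mean(dim=(1, 2, 3), keepdim=True)
    std = denoise_feat.std(dim=(1, 2, 3), keepdim=True)
    c_mu = control_feat.mean(dim=(1, 2, 3), keepdim=True)
    c_std = control_feat.std(dim=(1, 2, 3), keepdim=True)
    aligned = (control_feat - c_mu) / (c_std + eps) * std + mu  # match first and second moments
    return denoise_feat + aligned

x = torch.randn(1, 64, 32, 32)
c = 10 * torch.randn(1, 64, 32, 32) + 5                         # deliberately mismatched distribution
print(cross_normalize(x, c).std())
```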
https://arxiv.org/pdf/2408.06070
RB-Modulation: Training-Free Personalization of
Diffusion Models using Stochastic Optimal Control 27 May 2024
RB-Modulation is built on a stochastic optimal controller, where a style descriptor encodes the desired attributes through a terminal cost, and it eliminates the need for training or fine-tuning diffusion models. The model comes with a new Attention Feature Aggregation (AFA) module to maintain high fidelity to the reference image while adhering to the given prompt.
https://arxiv.org/pdf/2405.17401
They propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., 20-100 samples) instead of full-parameter tuning with large datasets.
https://arxiv.org/pdf/2410.23775
IN-CONTEXT LORA FOR DIFFUSION TRANSFORMERS 31 Oct 2024
They introduce an over-parameterized approach that accelerates training without increasing inference costs. This method reparameterizes low-rank adaptation by employing a separate MLP and learned embedding for each layer. The learned embedding is input to the MLP, which generates the adapter parameters. Such overparameterization has been shown to implicitly function as an adaptive learning rate and momentum, accelerating optimization. At inference time, the MLP can be discarded, leaving behind a standard low-rank adapter.
They avoid directly optimizing A and B by introducing a two-layer MLP which takes z as input and predicts the entries of A and B (the low-rank matrices). More formally, A, B = W2(ReLU(W1 z + b1)) + b2, where z is the learned input vector to the MLP, W and b correspond to learned weights and biases, and A ∈ R^{r×d} and B ∈ R^{d×r} are the generated low-rank matrices. Once fine-tuning is complete, the MLP can be discarded, retaining only the low-rank matrices A and B for inference. Although the MLP is compact in depth, it predicts a high-dimensional output the size of the LoRA parameters, which makes it overparameterized. For instance, an MLP with a hidden dimension of 32 scales the number of trainable parameters by approximately 32x. This makes OP-LoRA particularly advantageous in settings where inference resources are constrained but sufficient memory is available during training.
https://arxiv.org/pdf/2412.10362v1
OP-LoRA: The Blessing of Dimensionality 13 Dec 2024
OminiControl is a parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. OminiControl leverages a parameter-reuse mechanism, enabling the DiT to encode image conditions using itself (its VAE) as a powerful backbone and to process them with its flexible multi-modal attention processors. OminiControl effectively and efficiently incorporates injected image conditions with only 0.1% additional parameters, and can be used for control cases like subject-driven generation as well as spatially aligned conditions such as edges, depth, and more. These capabilities are achieved by training on images generated by the DiT itself.
Following the same token-processing pipeline as the noisy image tokens, they augment the encoded condition features with learnable position embeddings. These tokens are then added to the sequence alongside the noisy image tokens and text tokens; condition image tokens are processed uniformly with the text and noisy image tokens, integrating them into a unified sequence Z = [X, C_text, C_image], where Z represents the concatenated sequence of noisy image tokens X, text tokens C_text, and condition image tokens C_image. This unified approach enables direct participation in multi-modal attention without specialized processing pathways. The sequence design allows flexible integration of condition image tokens, but it requires incorporating positional information to ensure effective interaction with the noisy image tokens.
In FLUX.1's transformers, each token is assigned a corresponding position index to encode spatial information. For a 512x512 target image, the VAE encoder first projects it into the latent space, then the latent representation is divided into a 32x32 grid of tokens, where each token is assigned a unique two-dimensional position index (i, j) with i, j ∈ [0, 31]. This indexing scheme preserves the spatial structure of the original image in the latent space, while text tokens maintain a fixed position index of (0, 0). For spatially aligned tasks, their initial approach was to assign condition tokens the same position embeddings as their corresponding tokens in the noisy image. However, for non-spatially-aligned tasks such as subject-driven generation, their experiments revealed that shifting the position indices of the condition tokens leads to faster convergence. Specifically, they shift the condition image tokens to indices (i, j) with i ∈ [0, 31] and j ∈ [32, 64], ensuring no spatial overlap with the original image tokens X.
To achieve controllability, they introduce a bias term into the original MM-Attention operation. The bias, parameterized by a strength factor γ, is designed to adjust the attention weights between condition tokens and the other tokens. It is constructed as an (M + 2N) × (M + 2N) matrix, where M is the number of text tokens and N is the number of noisy image tokens and of condition image tokens each; a sketch of such a block structure follows.
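A hedged sketch of the block-structured bias (the exact layout in the paper is an assumption here; adding log γ only to the attention logits between condition tokens and the other tokens reduces to standard attention when γ = 1):

```python
import torch

def omini_attention_bias(m: int, n: int, gamma: float) -> torch.Tensor:
    """(M + 2N) x (M + 2N) additive bias for the multi-modal attention logits.
    Assumed token order: [text (M), noisy image (N), condition image (N)]."""
    size = m + 2 * n
    bias = torch.zeros(size, size)
    cond = slice(m + n, size)                          # condition-image token block
    log_gamma = torch.log(torch.tensor(float(gamma)))
    bias[cond, : m + n] = log_gamma                    # condition tokens attending to the others
    bias[: m + n, cond] = log_gamma                    # other tokens attending to condition tokens
    return bias  # added to the QK^T logits before the softmax; gamma = 1 leaves attention unchanged

print(omini_attention_bias(m=4, n=6, gamma=0.5).shape)  # torch.Size([16, 16])
```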
OminiControl: Minimal and Universal Control for Diffusion Transformer 25
Nov 2024
https://arxiv.org/pdf/2405.17401
Steering Rectified Flow Models in the Vector Field
for Controlled Image Generation 27 Nov 2024
FlowChef leverages the vector field to steer the denoising trajectory for controlled image generation tasks, facilitated by gradient skipping. They discover that in nonlinear ODEs with stochasticity or trajectory crossovers, error terms emerge that hinder convergence, due to inaccuracies in estimating denoised samples or improper gradient approximations. They say that rectified flow models (RFMs) can achieve higher convergence rates without additional computational overhead by capitalizing on this key property. They say that inversion is unnecessary, even for RF-Inversion, making RF-Inversion a special case of FlowChef in which the starting noise originates from an inverted target image rather than from random noise, as in FlowChef.
Rectified flow models inherently allow the error dynamics to converge even with gradient approximations, due to their straight-line trajectories and smooth vector fields, as discussed previously. Hence, the vector field uθ(xt, t) is trained to be smooth, and this smoothness implies that uθ changes gradually w.r.t. xt. A key feature of FlowChef is that it starts from any random noise xT ~ N(0, I) and still converges to the desired distribution or sample without inversion. At each timestep t, they first estimate x̂0, then calculate the loss L(x̂0, x0_ref), and finally directly optimize xt using the gradient of L with respect to x̂0.
https://arxiv.org/pdf/2412.00100
Let p1 ~ N(0, I) and p0 be distributions, with x1 ~ p1 as the initial noise, x0_ref as the target sample, x0 as the denoised sample obtained from x1, and x1_ref as the specific noise leading to x0_ref.
They introduce a single parameter, ω, to effectively control granularity in diffusion-based synthesis. The approach does not require model retraining, architectural modifications, or additional computational overhead during inference, yet it enables precise control over the level of detail in the generated outputs. Moreover, spatial masks or denoising schedules with varying ω values can be applied to achieve region-specific or timestep-specific granularity control.
A general form of a single denoising step with Omegance is sketched below.
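A hedged sketch of the idea: the predicted noise is scaled by ω and then plugged into an otherwise standard sampler step (an illustrative formulation; the paper's exact update may differ in detail):

```latex
\tilde{\epsilon}_\theta(z_t, t) = \omega\, \epsilon_\theta(z_t, t),
\qquad
z'_{t-1} = \mathrm{SamplerStep}\big(z_t, \tilde{\epsilon}_\theta(z_t, t)\big)
```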
When ω = 1, Omegance retains the standard denoising schedule, leaving the amount of noise removed from zt unchanged, and the SNR schedule aligns with the forward process. This setting produces a balanced output with standard levels of detail and texture across the entire image, aligning with the expected granularity of the original noise schedule. When ω < 1, SNR(t−1)′ < SNR(t−1): the noise prediction is scaled down, leading to less aggressive denoising towards z0. The latent state z′_{t−1} therefore retains additional high-frequency information. With the noise component dominating, the model "justifies" this residual noise by generating more intricate structures and richer textures, enhancing visual complexity in the output. When ω > 1, SNR(t−1)′ > SNR(t−1): the denoising schedule becomes more aggressive. This amplified noise reduction diminishes high-frequency information in the latent z′_{t−1}. With the signal now dominating, the model interprets the reduced residual noise as a cue to simplify textures and details, yielding smoother and less intricate visual outputs.
The omega mask ωi,j = M(i, j) introduces spatially varying control over granularity within a single image by allowing different regions to have distinct ω values during the denoising process. This spatial control leverages the locality of the denoising process, ensuring that adjustments to ω in one region do not affect the SNR or visual properties of neighboring areas. Such flexibility is valuable for applications requiring region-specific detail control within a single image, enabling fine-grained textures in focal regions while maintaining smoothness elsewhere.
The omega schedule ωt = S(t) provides a mechanism for controlling granularity across different stages of the denoising process by dynamically adjusting ω over time. By introducing ω at specific stages of the reverse diffusion process, the omega schedule allows targeted influence on both the broad layout and the fine-grained details of the generated image. This temporal control is aligned with the denoising dynamics: early denoising stages primarily reconstruct the general structure and layout, while later stages refine finer details.
Omegance: A Single Parameter for Various Granularities in Diffusion-
Based Synthesis 26 Nov 2024
https://arxiv.org/pdf/2411.17769
Training Time Optimization
The following slides include:
- Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment 18 Jun 2024
- Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget 22 Jul 2024
Immiscible Diffusion: Accelerating Diffusion Training with Noise
Assignment 18 Jun 2024
Diffusion models mimic the reverse thermodynamic diffusion phenomenon [34] to ease the denoising process. However, when the sources are miscible, they end up messily mixed. Predicting the reversal process from such a random mixture encounters significant difficulties, and unfortunately this is a problem diffusion models always face during denoising. They notice that the mixing can also be organized when the sources are immiscible. Under that circumstance, the sources occupy different continuous areas after the diffusion, while the whole diffusible area remains the same.
To achieve this, they minimize the total distance of image-noise pairs in a batch during the assignment. After assignment, the noise is still Gaussian, while each noise is assigned to nearer images, like what happens in the immiscible phenomenon, which significantly eases the denoising. For implementation, all that needs to be done is to perform a linear assignment (Hungarian matching) between the batch of images and noises according to their distances.
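A small sketch of the batch-wise noise assignment (using SciPy's Hungarian solver; plain L2 distance is an assumption about the metric):

```python
import torch
from scipy.optimize import linear_sum_assignment

def immiscible_noise_assignment(images: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Reorder a batch of Gaussian noise so that each image is paired with a nearby noise sample."""
    b = images.shape[0]
    cost = torch.cdist(images.reshape(b, -1), noise.reshape(b, -1))  # (B, B) pairwise L2 distances
    _, cols = linear_sum_assignment(cost.cpu().numpy())              # Hungarian matching
    return noise[torch.as_tensor(cols)]                              # permuted noise, still Gaussian overall

x = torch.randn(8, 3, 16, 16)
eps = torch.randn(8, 3, 16, 16)
assigned = immiscible_noise_assignment(x, eps)
# The assigned noise is then used as usual, e.g. x_t = alpha_t * x + sigma_t * assigned
print(assigned.shape)
```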
https://arxiv.org/pdf/2406.12303
Stretching Each Dollar: Diffusion Training from Scratch on
a Micro-Budget 22 Jul 2024
As the computational cost of transformers increases with the number of patches in each image, they propose to randomly mask up to 75% of the image patches during training. To mitigate the massive performance degradation caused by masking, they propose a deferred masking strategy where all patches are preprocessed by a lightweight patch-mixer before being passed to the diffusion transformer.
https://arxiv.org/pdf/2407.15811
Inference Time Optimization
The following slides include:
- SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions 25 Mar 2024
- Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation 18 Apr 2024
- PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator 13 May 2024
- Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation (Stability AI Turbo Solution) 18 March 2024
- Distilling Diffusion Models into Conditional GANs 9 May 2024
- Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models 3 April 2024
- EdgeFusion: On-Device Text-to-Image Generation 18 April 2024
- And more
SDXS: Real-Time One-Step Latent Diffusion Models
with Image Conditions 25 Mar 2024
They introduce a dual approach involving model miniaturization and a reduction in sampling steps. The methodology leverages knowledge distillation to streamline the U-Net and image decoder architectures, and introduces an innovative one-step DM training technique that utilizes feature matching and score distillation.
VAE decoder optimization: utilizing a pretrained diffusion model F to sample latent codes z and a pretrained VAE decoder to reconstruct images x, they introduce a VAE distillation (VD) loss for training a tiny image decoder G. They build G only with CNN blocks, eliminating complex components like attention and normalization (I do not know why they think norm layers are computationally overwhelming).
U-Net optimization: they selectively removed residual and transformer blocks from the U-Net, aiming to train a more compact model that can still reproduce the original model's intermediate feature maps and outputs effectively. Initializing noise and sampling images with an ODE to get noise-image pairs results in low-quality images, so they use Rectified Flow, which tackles this challenge by straightening the sampling trajectories. Using an MSE loss makes the model tend to output the average of multiple feasible solutions, so they use SSIM. They also straighten the model's trajectories to narrow the range of feasible outputs using existing fine-tuning methods like LCM.
One-step training: they first train the model for feature matching with an SSIM loss as a warm-up. At this stage, they sample noise-image pairs. As the trajectories of ODEs (for example, DDPM) are not straight, they use LCM-LoRA for rectifying the flow. They say that the warm-up training results are good in image quality but do not capture the data distribution; for this reason they use score distillation sampling with a learned manifold corrective.
https://arxiv.org/pdf/2403.16627
To achieve better quality at low step counts, they propose to distill along the student's backward path instead of the forward path. Put differently, rather than having the student mimic the teacher, they use the teacher to improve the student based on its current state of knowledge. They propose a Shifted Reconstruction Loss that dynamically adapts the knowledge transfer from the teacher model: the loss is designed to distill global, structural information from the teacher at high time steps, while focusing on fine-grained details and high-frequency components at lower time steps. They also propose noise correction, a training-free inference-time modification that enhances sample quality.
The coefficients are commonly chosen such that xT is not pure noise during training but still contains low-frequency information leaked from x0. With the interpolant xt = αt x0 + σt xT, any xt with t < T still contains information from the ground-truth sample via the first summand αt x0, which is the source of the leakage. Backward distillation eliminates this information leakage at all time steps t, preventing the model from relying on a ground-truth signal. This is achieved by simulating the inference process during training, which can also be interpreted as calibrating the student on its own upstream backward path.
They first perform backward iterations of the student model to obtain the intermediate latent xt, then use this latent as input for both the student and teacher models during training.
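A simplified sketch of this backward-distillation step (the sampler interface, the number of backward iterations, and the use of MSE as a stand-in for the shifted reconstruction loss are all assumptions):

```python
import torch

def backward_distillation_step(student, teacher, sampler, shape, t, T=1000):
    """Calibrate the student on its own backward path: x_t comes from the student, not from data."""
    x_T = torch.randn(shape)
    with torch.no_grad():
        x_t = sampler(student, x_T, t_start=T, t_end=t)   # student backward iterations, no gradient
    x0_student = student(x_t, t)                          # student prediction on its own latent
    with torch.no_grad():
        x0_teacher = teacher(x_t, t)                      # teacher target on the same latent
    # the paper's shifted reconstruction loss re-weights global vs. fine-detail terms by t;
    # a plain MSE is used here purely as a placeholder
    return torch.nn.functional.mse_loss(x0_student, x0_teacher)
```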
For the distillation loss, they define a shifted reconstruction loss designed such that for higher values of t the target produced by the teacher model shares global content with the student output but has improved semantic text alignment, while for lower values of t the target image features enhanced fine-grained details while maintaining the same overall structure as the student's prediction.
When t = T the input is pure noise, and predicting the noise at that time step is not informative. Existing works therefore propose predicting the velocity, i.e., the rate of change, but converting a model to velocity prediction requires extra training effort. They present a training-free alternative: by treating t = T as a special case and replacing ϵΘ with the true noise xT, the update rule f is corrected.
Shifted Reconstruction Loss
Imagine Flash: Accelerating Emu Diffusion Models with Backward
Distillation 18 Apr 2024
https://arxiv.org/pdf/2405.05224
Oğuzhan Ercan
x.com/oguzhannercan
PeRFlow: Piecewise Rectified Flow as Universal Plug-and-
Play Accelerator 13 May 2024
PeRFlow divides the sampling process of generative flows into several time windows
and straightens the trajectories in each interval via the reflow operation, thereby
approaching piecewise linear flows. Specifically, they attempt to straighten the
trajectories of the original PF-ODEs via a piecewise reflow operation. By solving the
ODEs in the shortened time intervals, PeRFlow avoids simulating the entire ODE
trajectory when preparing the training data. Through this divide-and-conquer strategy,
PeRFlow can straighten the sampling trajectories with large-scale real training data.
Diffusion models are usually trained with ϵ-prediction, whereas flow-based generative
models generate data by following a velocity field. They derive the correspondence
between ϵ-prediction and the velocity field of the flow, narrowing the gap between the
pretrained diffusion model and the student PeRFlow model.
They divide the ODE trajectories into multiple time windows and straighten the
trajectories in each time window via the reflow operation. The pretrained diffusion
models are usually trained by two parameterization tricks, namely ϵ-prediction and
velocity-prediction. To inherit knowledge from the pretrained network, they
parameterize the PeRFlow model as the same type of diffusion and initialize network θ
from the pretrained diffusion model ϕ.
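A small sketch of how a piecewise reflow target could be formed inside one time window (function names and the exact parameterization are assumptions; the point is that the teacher ODE is only integrated within the window, and the target velocity is the straight line between its endpoints, assuming t_start < t_end):

```python
import torch

def perflow_window_targets(teacher_ode_solve, x_start, t_start, t_end):
    """Form a straight-line (reflow) training pair inside one time window of the teacher PF-ODE."""
    with torch.no_grad():
        x_end = teacher_ode_solve(x_start, t_start, t_end)        # solve only within [t_start, t_end]
    v_target = (x_end - x_start) / (t_end - t_start)              # constant velocity inside the window
    tau = torch.empty(x_start.shape[0], device=x_start.device).uniform_(t_start, t_end)
    x_tau = x_start + (tau - t_start).view(-1, *([1] * (x_start.dim() - 1))) * v_target
    return x_tau, tau, v_target   # train with || v_theta(x_tau, tau) - v_target ||^2
```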
https://arxiv.org/pdf/2405.07510
Oğuzhan Ercan
x.com/oguzhannercan
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion
Distillation 18 March 2024
The authors say that Adversarial Diffusion Distillation was a big step, but the use of a fixed, pretrained DINOv2 network restricts the discriminator's training resolution to 518 × 518 pixels, and there is no straightforward way to control the feedback level of the discriminator. In addition, as Yann LeCun has noted, the need to decode back to RGB space is a problem. They also observe that smaller discriminator feature networks often offer better performance than their larger counterparts.
They distill generative features of a pretrained diffusion model instead of DINOv2. By targeted sampling of the noise levels during training, the discriminator features can be biased towards more global (high noise level) or more local (low noise level) behavior.
Many distillation techniques attempt to learn "simpler" differential equations that result in the same distribution at t = 0 but with "straighter", more linear trajectories, which allow larger step sizes and therefore fewer network evaluations.
LADD introduces two main modifications: unifying the discriminator and the teacher model, and adopting synthetic data for training. They first generate an image with the teacher model, add noise to it, and then denoise it with both the teacher and student networks, computing the loss on these latent-space representations. They also feed the student's output to the teacher model: after each layer of the teacher (a transformer), they attach a discriminator head and compute the adversarial loss with these heads.
In one-shot scenarios, CFG simply oversaturates samples rather than improving text alignment. This observation suggests that CFG works best in multi-step settings, which allow oversaturation to be corrected in most cases. They also find that while the distillation loss benefits training on real data, it offers no advantage for synthetic data; thus, training on synthetic data can be conducted effectively using only an adversarial loss.
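A condensed sketch of the generator-side objective (the interfaces student, teacher.features, and the hinge-style heads are assumed; the actual LADD setup also trains the heads with a separate discriminator loss):

```python
import torch

def ladd_generator_loss(student, teacher, disc_heads, latents, prompt_emb, sigma):
    """Adversarial loss on noised student samples, using frozen-teacher features per transformer block."""
    x_student = student(latents, prompt_emb)                      # few-step student sample in latent space
    x_noised = x_student + sigma * torch.randn_like(x_student)    # noise level biases global vs. local feedback
    feats = teacher.features(x_noised, sigma, prompt_emb)         # teacher is frozen, but gradients flow to the student
    # hinge-style generator term summed over discriminator heads attached to each teacher block
    return sum(-head(f).mean() for head, f in zip(disc_heads, feats))
```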
https://arxiv.org/pdf/2403.12015
Oğuzhan Ercan
x.com/oguzhannercan
Cross-Attention Makes Inference Cumbersome
in Text-to-Image Diffusion Models 3 April 2024
https://arxiv.org/pdf/2404.02747
Oğuzhan Ercan
x.com/oguzhannercan
Improved Distribution Matching Distillation for Fast Image Synthesis
23 May 2024
https://arxiv.org/pdf/2405.14867
Oğuzhan Ercan
x.com/oguzhannercan
EM Distillation for One-step Diffusion Models 27 May 2024
https://arxiv.org/pdf/2405.16852
Oğuzhan Ercan
x.com/oguzhannercan
Boosting Latent Diffusion with Flow Matching 28 Mar 2024
https://arxiv.org/pdf/2312.07360
Oğuzhan Ercan
x.com/oguzhannercan
SVDQUANT: ABSORBING OUTLIERS BY LOW-RANK
COMPONENTS FOR 4-BIT DIFFUSION MODELS 8 Nov 2024
https://arxiv.org/pdf/2411.05007
They aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and
activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become
insufficient. Different from smoothing which redistributes outliers between weights and activations, their approach absorbs these outliers
using a low-rank branch. They first consolidate the outliers by shifting them from the activations to the weights, then employ a high-precision low-rank branch to absorb the weight outliers via Singular Value Decomposition (SVD).
The core insight is to introduce a 16-bit low-rank branch and migrate the weight quantization difficulty to this branch. Compared to direct 4-bit quantization, i.e., Q(X̂)Q(W), their method first computes the low-rank branch X̂ L1 L2 in 16-bit precision, and then approximates the residual X̂ R with 4-bit quantization. The low-rank branch is negligible in terms of additional parameters and computation, but a naive implementation adds roughly 50% extra cost, mostly from memory access; they propose the Nunchaku fused kernel to eliminate this overhead.
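A toy illustration of the weight-side decomposition (naive per-tensor symmetric quantization stands in for the paper's smoothing and low-bit kernels; rank and qbits are illustrative values):

```python
import torch

def svd_absorb_outliers(W: torch.Tensor, rank: int = 32, qbits: int = 4):
    """Split W into a 16-bit low-rank branch L1 @ L2 plus a 4-bit residual R."""
    U, S, V = torch.svd_lowrank(W.float(), q=rank)
    L1, L2 = (U * S).half(), V.t().half()             # low-rank branch kept in 16-bit
    R = W.float() - (L1.float() @ L2.float())         # residual with far fewer outliers
    qmax = 2 ** (qbits - 1) - 1
    scale = R.abs().max() / qmax                      # naive per-tensor scale (stand-in)
    R_q = (R / scale).round().clamp(-qmax - 1, qmax).to(torch.int8)
    return L1, L2, R_q, scale                         # W ≈ L1 @ L2 + R_q * scale
```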
Oğuzhan Ercan
x.com/oguzhannercan
1.58-bit FLUX 24 Dec 2024
https://arxiv.org/pdf/2412.18653v1
The first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e.,
values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 × 1024 images. Notably, the quantization method
operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, they develop a custom
kernel optimized for 1.58- bit operations, achieving a 7.7× reduction in model storage, a 5.1× reduction in inference memory, and improved
inference latency.
The quantization reduces the weights of all linear layers in the FluxTransformerBlock and FluxSingleTransformerBlock of FLUX to 1.58 bits, covering 99.5% of the model's total parameters.
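For intuition, a common ternary ("1.58-bit") quantization recipe looks like the following absmean scheme; the paper's data-free calibration and custom kernel are not reproduced here, and its exact scheme may differ:

```python
import torch

def ternary_quantize(W: torch.Tensor, eps: float = 1e-8):
    """Map weights to {-1, 0, +1} with a single per-tensor scale (absmean-style; illustrative only)."""
    scale = W.abs().mean().clamp(min=eps)
    W_q = (W / scale).round().clamp(-1, 1)    # ternary codes
    return W_q.to(torch.int8), scale          # dequantize as W_q * scale
```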
Oğuzhan Ercan
x.com/oguzhannercan
Quality Enhancement
Following slides includes:
-Align Your Steps, 22 April 2024
- And more
Oğuzhan Ercan
x.com/oguzhannercan
Align Your Steps, 22 April 2024
Sampling from DMs can be seen as solving a differential equation through a discretized set of noise levels known as the sampling schedule. They
propose a general and principled approach to optimizing the sampling schedules of DMs for high-quality outputs.
SDE solvers excel in sampling from diffusion models due to their built-in error-correction, allowing them to outperform ODE solvers.
https://arxiv.org/pdf/2404.14507
Oğuzhan Ercan
x.com/oguzhannercan
Step-aware Preference Optimization: Aligning Preference with Denoising
Performance at Each Step 6 Jun 2024
https://arxiv.org/pdf/2406.04314
Oğuzhan Ercan
x.com/oguzhannercan
ReNO: Enhancing One-step Text-to-Image Models
through Reward-based Noise Optimization 6 Jun 2024
They explore optimizing the initial random noise during inference without adapting any of the model's parameters. The initial noise is updated based on the signal from a reward model evaluated on the generated image. Backpropagating the gradient through many denoising steps can lead to exploding/vanishing gradients, so they use a well-calibrated one-step diffusion model. Naively optimizing the initial latent for an arbitrary objective can lead to collapse due to reward hacking. To mitigate this, they propose using a combination of reward objectives so as not to overfit to any single reward, together with an optimization scheme with a limited number of steps, regularization of the noise to stay in-distribution, and gradient clipping.
Backpropagating through C(Gθ(ε, p)) (a criterion evaluated on a generation) is non-trivial, as current text-to-image models are based on the simulation of ODEs or SDEs. One important consideration is that ε (the noise) should stay within the proximity of the initial noise distribution N(0, I), as otherwise Gθ might produce unwanted generations. This can be realized by including a regularization term inside C: they maximize the log-likelihood of the norm of a noise sample.
They propose to use a weighted combination of a number n of pre-trained reward models as the criterion function. Employing a combination of
reward models can help prevent “reward-hacking” and allow capturing various aspects of image quality and prompt adherence, as different reward
models are trained on different prompt and preference sets.
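A compact sketch of the noise-optimization loop (the latent shape, learning rate, weights, and the simple norm-based regularizer are placeholders, not the paper's exact choices):

```python
import torch

def reno_optimize(one_step_gen, reward_fns, weights, prompt, n_iters=50, lr=5.0):
    """Optimize the initial noise against a weighted combination of reward models."""
    eps = torch.randn(1, 4, 64, 64, requires_grad=True)      # assumed latent shape
    opt = torch.optim.SGD([eps], lr=lr)
    for _ in range(n_iters):
        img = one_step_gen(eps, prompt)                       # one-step model keeps gradients stable
        reward = sum(w * r(img, prompt) for r, w in zip(reward_fns, weights))
        reg = (eps.pow(2).mean() - 1.0) ** 2                  # stand-in for the norm log-likelihood regularizer
        loss = -reward + 0.01 * reg
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_([eps], max_norm=1.0)   # gradient clipping, as described
        opt.step()
    return eps.detach()
```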
https://arxiv.org/pdf/2403.17377
Oğuzhan Ercan
x.com/oguzhannercan
Face Models
Following slides includes:
-InstantID: Zero-shot Identity-Preserving Generation in Seconds 2 Feb 2024
- And more
Oğuzhan Ercan
x.com/oguzhannercan
InstantID: Zero-shot Identity-Preserving Generation in Seconds
2 Feb 2024
They use a pre-trained face model to detect faces and extract a face ID embedding from the reference facial image, providing strong identity features to guide the image generation process.
Image Adapter: a lightweight adaptive module with decoupled cross-attention is introduced to support images as prompts. However, they diverge from prior work by employing the ID embedding as their image prompt, as opposed to the coarsely aligned CLIP embedding, aiming for a more nuanced and semantically rich prompt integration.
Directly adding the text and image tokens in cross-attention tends to weaken the control exerted by the text tokens, so they adapt a ControlNet, named IdentityNet. In this network, they use 5 facial landmarks instead of the 68 used by OpenPose, and instead of text embeddings they use ArcFace identity embeddings in the cross-attention layers.
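The decoupled cross-attention pattern referred to here can be sketched as two attentions whose outputs are summed (projection layers and scaling conventions are omitted; this is an illustration, not InstantID's exact module):

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, text_k, text_v, id_k, id_v, id_scale: float = 1.0):
    """Attend separately to text tokens and ID tokens, then add the results."""
    out_text = F.scaled_dot_product_attention(q, text_k, text_v)  # standard text conditioning
    out_id = F.scaled_dot_product_attention(q, id_k, id_v)        # ID-embedding conditioning
    return out_text + id_scale * out_id
```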
https://arxiv.org/pdf/2401.07519
Oğuzhan Ercan
x.com/oguzhannercan
RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance
23 May 2024
They build on the well-known classifier guidance methodology, which modifies an existing denoising process using the gradient from a pre-trained classifier. The rationale behind this choice is twofold: first, it directly harnesses the discriminator's domain knowledge for identity preservation,
which may be a cost-effective substitute for training on domain-specific datasets; secondly, keeping the diffusion model intact allows for plug-and-
play combination with different discriminators. This work builds on a recent framework named rectified flow featuring strong theoretical properties,
e.g. the straightness of its sampling trajectory. By approximating the rectified flow to be ideally straight, the original classifier guidance is
reformulated as a simple fixed-point problem concerning only the trajectory endpoints, thus naturally overcoming its reliance on a special noise-
aware classifier. This allows flexible reuse of image discriminators for identity preservation in personalization tasks.
Rectified flow recap: they aim to learn a velocity field v that maps random noise z0 ∼ π0 to samples from a complex distribution z1 ∼ πdata via an ordinary differential equation (ODE). Instead of directly solving the ODE (Chen et al., 2018), rectified flow (Liu et al., 2023a) simply learns a linear interpolation between the two distributions by minimizing the following objective:
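For reference, the standard rectified flow objective (reconstructed here in its usual form, since the slide's equation image is not reproduced) is:

```latex
\min_{v}\; \mathbb{E}_{z_0 \sim \pi_0,\; z_1 \sim \pi_{\mathrm{data}},\; t \sim \mathcal{U}[0,1]}
\big\| (z_1 - z_0) - v(z_t, t) \big\|^2,
\qquad z_t = t\, z_1 + (1 - t)\, z_0 .
```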
Classifier Guidance recap: a test-time mechanism to adjust the predicted noise ϵ(zt, t) based on the guidance from a classifier. Given condition c
and classifier output p(c|zt), the adjustment is formulated as:
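In its usual form (the guidance scale s and schedule conventions may differ from the paper), the adjustment is:

```latex
\hat{\epsilon}(z_t, t) \;=\; \epsilon(z_t, t) \;-\; s\,\sigma_t\, \nabla_{z_t} \log p(c \mid z_t).
```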
They combine rectified flow and classifier guidance at inference time, controlling the flow with classifier guidance to achieve the desired output. Since the theoretical foundation of this paper is heavy, the details are not included in these slides.
https://arxiv.org/pdf/2405.14677
Oğuzhan Ercan
x.com/oguzhannercan
UniPortrait: A Unified Framework for Identity-Preserving Single- and
Multi-Human Image Personalization 12 August 2024
UniPortrait, an innovative human image personalization framework that unifies single- and multi-ID customization with high face fidelity, extensive facial editability, free-
form input description, and diverse layout generation. UniPortrait consists of only two plug-and-play modules: an ID embedding module and an ID routing module. The
ID embedding module extracts versatile editable facial features with a decoupling strategy for each ID and embeds them into the context space of diffusion models. The
ID routing module then combines and distributes these embeddings adaptively to their respective regions within the synthesized image.
Unlike most preceding approaches that harness the final global features of a face recognition backbone for face ID representation, they utilize features from the
penultimate layer (prior to the fully connected layer). This adjustment aims to preserve an enhanced degree of spatial information pertaining to ID features. They say
that using CLIP features may couple the ID with ID-irrelevant facial information (lighting, occlusions, etc.). Given the typically scarce and non-diverse nature of personalization training data (in which the training reference and target faces often come from the same or similar images), these irrelevant features risk leading the model to overfit on non-essential facial details. To solve these problems, they first integrate shallow features from the face recognition model to augment the structural representation of the face, and subsequently apply a strong dropping regularization to the structure-feature branch to decouple it from the intrinsic ID branch. The shallow features of the backbone are empirically low-level, containing more texture details, and they are more ID-relevant, enabling higher-fidelity portraits.
https://arxiv.org/pdf/2408.05939
Oğuzhan Ercan
x.com/oguzhannercan
They first flatten and apply a Multilayer Perceptron (MLP) to the penultimate layer's features of the face recognition model to obtain the intrinsic ID features Fr ∈ R^(mr×dr). They then interpolate the shallow features, i.e., the 1/2, 1/4, and 1/8 feature maps from the face backbone, concatenate them with CLIP local features, and pass them through another MLP to derive the face structure features Fs ∈ R^(ms×ds). Next, they introduce an l-layer Q-Former with m learnable queries to aggregate Fr and Fs. Each layer of the Q-Former comprises two attention blocks and one feed-forward network (FFN), with the attention blocks respectively attending to the intrinsic ID information and the face structure representations. In the input and output of the second attention block, they further introduce DropToken and DropPath as means of decoupling face structure from intrinsic ID representation. The final output of the Q-Former, denoted Fid ∈ R^(m×d), is then used as the ID embedding and aligned into the context space of the U-Net. Here, they use decoupled cross-attention to inject the ID information into the U-Net.
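A rough sketch of one such Q-Former layer (the dimensions, the dropout stand-in for DropToken/DropPath, and the assumption that Fr and Fs are already projected to the query dimension are all simplifications):

```python
import torch
import torch.nn as nn

class IDQFormerLayer(nn.Module):
    """One layer: attend to intrinsic-ID features, then to face-structure features, then an FFN."""
    def __init__(self, d: int = 768, heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.attn_id = nn.MultiheadAttention(d, heads, batch_first=True)
        self.attn_struct = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.drop = nn.Dropout(p_drop)   # crude stand-in for DropToken / DropPath decoupling

    def forward(self, queries, f_r, f_s):
        q = queries + self.attn_id(queries, f_r, f_r)[0]       # intrinsic ID branch
        q = q + self.drop(self.attn_struct(q, f_s, f_s)[0])    # face-structure branch (heavily regularized)
        return q + self.ffn(q)
```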
In this work, they introduce a position-wise ID routing module integrated within each cross-attention layer to adaptively route and assign a unique ID to each potential face area in the latent features, thereby effectively mitigating identity blending. UniPortrait can work with N identities and is not limited to a specific number of inputs. The structure of the position-wise ID routing can be seen in the figure (stage 2) on the previous page. The idea behind ID routing is that each face in an image is associated with at most one ID feature; by confining each position to cross-attend to only one ID, blending between IDs is effectively avoided.
They use CurricularFace as the face recognition network and sample the training data from many different open datasets.
UniPortrait: A Unified Framework for Identity-Preserving Single- and
Multi-Human Image Personalization 12 August 2024 (Page 2)
https://arxiv.org/pdf/2408.05939
Oğuzhan Ercan
x.com/oguzhannercan
Image Editing
Following slides includes:
-D-Flow: Differentiating through Flows for Controlled Generation
- And more
Oğuzhan Ercan
x.com/oguzhannercan
POSTEDIT: POSTERIOR SAMPLING FOR EFFICIENT
ZERO-SHOT IMAGE EDITING 8 Oct 2024
PostEdit is a method that incorporates a posterior scheme to govern the diffusion sampling process. Specifically, a measurement term related to both the initial features and Langevin dynamics is introduced to optimize the estimated image generated by the given target prompt. To reconstruct and edit an image x0, they make use of a measurement term y which contains the features of the initial image, and supervise the editing process with the gradient of the posterior log-likelihood, ∇xt log p(xt|y).
Read the rest of it.
https://arxiv.org/pdf/2410.04844
Oğuzhan Ercan
x.com/oguzhannercan
In this paper, they propose merging these two fields by utilizing image-to-video models for image editing. They reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. They implement the proposed approach through a structured pipeline called Frame2Frame (F2F). First, they transform the edit instruction into a Temporal Editing Caption (a scenario describing how the edit should naturally evolve over time) using a pretrained Vision-Language Model (VLM). Next, a state-of-the-art image-to-video model generates a temporally coherent sequence guided by the temporal caption. Finally, they identify the frame that best realizes the desired edit with the assistance of a VLM. The framework shows promising results on more classical computer vision problems such as deblurring, denoising, and relighting.
Since video generation is a temporal process, the editing caption must be temporal as well, so they use ChatGPT-4o, instructed to produce a concise video scenario that highlights how elements within the image change or move over time. For video generation, they use CogVideoX. They observe that the optimal number of frames required for an edit can vary: small changes may be completed in fewer frames, while more extensive transformations often necessitate additional ones. Therefore, they aim to identify the optimal edited frame, denoted ft, which corresponds to the earliest timestep t that achieves the desired edit.
To automate the selection of t and avoid manual frame-by-frame review, they employ an automated approach: after generating the sequence V, they sample every fourth frame, imprint each with a unique identifier, and assemble them into an image collage alongside Is. Inspired by "An Image Grid Can Be Worth a Video", which introduces a novel approach to video comprehension by transforming videos into image grids, they use GPT-4o to assist in selecting t by providing it with the collage and the editing prompt c. The VLM is tasked with identifying the frame that best fulfills the editing intent, evaluating each frame's alignment with c and its fidelity to Is, and is instructed to select the frame with the lowest index that completes the edit.
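The collage construction can be sketched roughly as below (the frame stride, grid size, and identifier stamping are assumptions; the VLM query itself is not shown):

```python
from PIL import Image, ImageDraw

def build_frame_collage(frames, source, stride: int = 4, cols: int = 4):
    """Sample every `stride`-th frame, stamp an index on each, and tile them with the source image."""
    picked = [source] + list(frames[::stride])
    w, h = picked[0].size
    rows = (len(picked) + cols - 1) // cols
    collage = Image.new("RGB", (cols * w, rows * h))
    for i, frame in enumerate(picked):
        tile = frame.resize((w, h))
        ImageDraw.Draw(tile).text((10, 10), str(i), fill="red")   # unique identifier per tile
        collage.paste(tile, ((i % cols) * w, (i // cols) * h))
    return collage
```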
Within the manifold visualization (figure on the right), there is a clear semantic progression: images of people with 'AI' shirts (green cluster) are close to images of people with 'AI' shirts making a heart shape (purple cluster), which are adjacent to images of people only making a heart shape (red cluster). Thus, transitioning smoothly along the manifold allows a person with an 'AI' shirt to perform a heart shape with their hands while preserving the shirt's text.
Pathways on the Image Manifold: Image Editing via Video Generation 27
Nov 2024
https://arxiv.org/pdf/2411.16819
Oğuzhan Ercan
x.com/oguzhannercan
They leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, they propose an automatic method to identify "vital layers" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled, stable edits, from non-rigid modifications to object addition, using the same mechanism. These flow-based models are built on optimal-transport conditional probability paths, resulting in faster training and sampling compared to diffusion models; this is attributed to the fact that they follow straight-line trajectories rather than curved paths. One known consequence of this difference, however, is that these models exhibit lower diversity than previous diffusion models. While reduced diversity is generally considered undesirable, in this paper they suggest leveraging it for training-free image editing. Specifically, they explore image editing via parallel generation [4, 18, 91], where features from the generative trajectory of the source (reference) image are injected into the trajectory of the edited image. They find that there is no simple relationship between the vitality of a layer and its position in the architecture, i.e., the vital layers are spread across the transformer.
They bypass layers one by one and observe how the i-th layer affects the generation. To assess the impact of each layer, they measure the perceptual similarity between G_ref and G_ℓ using DINOv2. The results show that removing certain layers significantly affects the generated images, while others have minimal impact; importantly, the influential layers are distributed across the transformer rather than concentrated in specific regions. They adapt the self-attention injection mechanism, previously shown to be effective for image and video editing in UNet-based diffusion models, to the DiT-based FLUX architecture. Since each DiT layer processes a sequence of image and text embeddings, they propose generating both x and x̂ in parallel while selectively replacing the image embeddings of x̂ with those of x, but only within the set of vital layers V.
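The layer-ablation probe could look roughly like this (the generate and bypass_layers interfaces and the use of cosine similarity on DINOv2 embeddings are assumptions):

```python
import torch

@torch.no_grad()
def rank_vital_layers(dit, dino, prompts, k: int = 8):
    """Bypass each DiT layer in turn and score how much the generations change under DINOv2."""
    ref = dino(dit.generate(prompts))                               # reference images, all layers active
    scores = []
    for i in range(dit.num_layers):
        gen_i = dino(dit.generate(prompts, bypass_layers=[i]))      # skip layer i via its residual path
        scores.append(1.0 - torch.cosine_similarity(gen_i, ref, dim=-1).mean())
    return torch.topk(torch.stack(scores), k).indices               # the k most influential ("vital") layers
```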
To edit real images, they first invert them into the latent space, transforming samples from p1 to p0. They initially implemented an inverse Euler ODE solver for FLUX by reversing the vector field prediction.
Stable Flow: Vital Layers for Training-Free Image Editing 21 Nov 2024
https://arxiv.org/pdf/2411.14430
This approach proves insufficient for FLUX, resulting in corrupted image reconstructions and unintended modifications during editing. They hypothesize that the assumption u(zt) ≈ u(zt−1) does not hold, which causes the model to significantly alter the image during the forward process. To address this, they introduce latent nudging: multiplying the initial latent z0 by a small scalar λ = 1.15 to slightly offset it from the training distribution.
Oğuzhan Ercan
x.com/oguzhannercan
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
10 Dec 2024
This paper introduces FireFlow, a simple yet effective zero-shot approach that inherits the startling capacity of ReFlow-based models (such as FLUX) in generation while
extending its capabilities to accurate inversion and editing in 8 steps.
Read the rest of it,
https://arxiv.org/pdf/2412.07517
Oğuzhan Ercan
x.com/oguzhannercan
Guidance
Following slides includes:
-D-Flow: Differentiating through Flows for Controlled Generation
- And more
Oğuzhan Ercan
x.com/oguzhannercan
Derivative-Free Guidance in Continuous and Discrete
Diffusion Models with Soft Value-Based Decoding 1 Aug 2024
Rather than merely generating designs that are natural, the goal is often to optimize downstream reward functions while preserving the naturalness of these design spaces. The proposed algorithm is an iterative sampling method that integrates soft value functions, which look ahead to how intermediate noisy states lead to high rewards in the future, into the standard inference procedure of pre-trained diffusion models.
https://arxiv.org/pdf/2407.02398
Oğuzhan Ercan
x.com/oguzhannercan
Flow Control
Following slides includes:
-D-Flow: Differentiating through Flows for Controlled Generation
- And more
Oğuzhan Ercan
x.com/oguzhannercan
Consistency Flow Matching:
Defining Straight Flows with Velocity Consistency 2 Jul 2024
https://arxiv.org/pdf/2407.02398
Oğuzhan Ercan
x.com/oguzhannercan
Diffusion Solvers
Following slides includes:
-D-Flow: Differentiating through Flows for Controlled Generation
Oğuzhan Ercan
x.com/oguzhannercan
GENERALIZATION IN DIFFUSION MODELS ARISES FROM
GEOMETRY-ADAPTIVE HARMONIC REPRESENTATIONS 12 April
2024
Recent reports of memorization of the training set raise the question of whether these networks are learning the “true” continuous density of the
data. They show that two DNNs trained on non-overlapping subsets of a dataset learn nearly the same score function, and thus the same density,
when the number of training images is large enough. In this regime of strong generalization, diffusion-generated images are distinct from the
training set, and are of high visual quality, suggesting that the inductive biases of the DNNs are well-aligned with the data density.
They find that DNN denoisers trained on photographic images perform a shrinkage operation in an orthonormal basis consisting of harmonic functions that are adapted to the geometry of features in the underlying image. They refer to these as geometry-adaptive harmonic bases (GAHBs). This observation, taken together with the generalization performance of DNN denoisers, suggests that optimal bases for denoising photographic images are GAHBs and, moreover, that the inductive biases of DNN denoisers encourage such bases. They say that an optimal denoiser (for small noise) should project a noisy image onto the tangent space of the manifold.
DNNs are susceptible to overfitting, because the number of training examples is typically small relative to the model capacity. Since density
estimation, in particular, suffers from the curse of dimensionality, overfitting is of more concern in the context of generative models. An overfitted
denoiser performs well on training images but fails to generalize to test images, resulting in low-diversity generated images. Consistent with this,
several papers have reported that diffusion models can memorize their training data.
https://arxiv.org/pdf/2310.02557
Oğuzhan Ercan
x.com/oguzhannercan
DIFFUSION POSTERIOR SAMPLING FOR GENERAL
NOISY INVERSE PROBLEMS 20 May 2024
https://arxiv.org/pdf/2209.14687