Oğuzhan Ercan
x.com/oguzhannercan
Diffusion Models
Topics discussed:
- Diffusion Models
- Diffusion Architectures
- Swap-Inpainting Models
- Output Control Techniques
- Inference Time Optimization
- Quality Enhancement
- Video Generation
- Face Models
Prerequisites
Probability
Statistics
Linear Algebra
Calculus
Deep Learning
Differential Equations
Diffusion Models
Diffusion models consist of two interconnected processes: forward and backward. The forward diffusion process gradually corrupts the data by
interpolating between a sampled data point x_0 and Gaussian noise x_T ∼ N(0, I): x_t = α_t x_0 + σ_t x_T,
The information here is mostly taken from the Imagine Flash paper: https://arxiv.org/pdf/2405.05224
where α_t and σ_t define the signal-to-noise ratio (SNR) of the stochastic interpolant x_t. In the following, the coefficients (α_t, σ_t) are chosen to
result in a variance-preserving process. When viewed in the continuous-time limit, the forward process above can be expressed as the
stochastic differential equation dx = f(x, t) dt + g(t) dw_t,
where f(x, t): R^d → R^d is a vector-valued drift coefficient, g(t): R → R is the diffusion coefficient, and w_t denotes Brownian motion at time t.
Inversely, the backward diffusion process is intended to undo the noising process and generate samples. According to Anderson's theorem, the
forward SDE introduced above satisfies a reverse-time diffusion equation, which can be reformulated using the Fokker-Planck equations into a
deterministic counterpart with equivalent marginal probability densities, known as the probability flow ODE: dx = [f(x, t) − ½ g(t)² ∇_x log p_t(x)] dt.
The score term in the formulation above is estimated by a time-conditioned neural network, typically via a noise prediction ϵ_Θ(x_t, t). Given these
estimates, we can sample using an iterative numerical solver f, which for first-order solvers like DDIM uses the sample-data estimate x̂_0 at
time-step t computed from the noise prediction, as summarized below.
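For reference, the standard formulas referenced above can be restated compactly in the variance-preserving notation used by Imagine Flash (a sketch in standard notation, not copied verbatim from the paper):

```latex
% Forward process: stochastic interpolant between data x_0 and noise x_T ~ N(0, I)
x_t = \alpha_t x_0 + \sigma_t x_T

% Continuous-time forward SDE
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t

% Probability flow ODE with the same marginal densities
\mathrm{d}x = \Big[ f(x, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x) \Big]\,\mathrm{d}t

% First-order (DDIM-style) update from step t to step s < t via the data estimate
\hat{x}_0 = \frac{x_t - \sigma_t\, \epsilon_\Theta(x_t, t)}{\alpha_t},
\qquad
x_s = \alpha_s\, \hat{x}_0 + \sigma_s\, \epsilon_\Theta(x_t, t)
```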
Diffusion Architectures
The following slides include:
- High-Resolution Image Synthesis with Latent Diffusion Models 13 Apr 2022
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack 27 September 2023
- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding 14 May 2024
- Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers 9 May 2024
- PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis 29 Dec 2023
High-Resolution Image Synthesis with Latent Diffusion Models 13 Apr
2022
https://arxiv.org/pdf/2112.10752
Emu: Enhancing Image Generation Models Using Photogenic Needles in
a Haystack 27 September 2023 (image quality should always be prioritized over quantity)
Their key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve
generation quality, analogous to how effective fine-tuning of LLMs can be achieved with a relatively small but high-quality fine-tuning dataset, e.g.,
27K prompts.
They increase the number of autoencoder channels from 4 to a higher dimension. They use an additional adversarial loss for reconstruction and also
apply a non-learnable preprocessing step to RGB images, using a Fourier feature transform to lift the input channel dimension.
They use a large U-Net with 2.8B trainable parameters. They increase the channel size and number of stacked residual blocks in each stage for
larger model capacity. They use text embeddings from both CLIP ViT-L and T5-XXL as the text conditions.
They pre-train the model with 1.1B images and train with progressively increasing resolutions. This approach improves finer details at
higher resolutions.
https://arxiv.org/pdf/2309.15807
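As a rough sketch of the non-learnable Fourier-feature preprocessing idea mentioned above (the exact frequencies and resulting channel count used by Emu are assumptions here, chosen for illustration):

```python
import torch

def fourier_lift(x: torch.Tensor, num_freqs: int = 4) -> torch.Tensor:
    """Lift an RGB image (B, 3, H, W) to a higher channel dimension with
    fixed, non-learnable sine/cosine features. Output has 3 + 3 * 2 * num_freqs channels."""
    feats = [x]
    for k in range(num_freqs):
        freq = (2.0 ** k) * torch.pi
        feats.append(torch.sin(freq * x))
        feats.append(torch.cos(freq * x))
    return torch.cat(feats, dim=1)

# Example: lift a 3-channel image to 3 + 24 = 27 channels before the AE encoder.
img = torch.rand(1, 3, 256, 256)
print(fourier_lift(img).shape)  # torch.Size([1, 27, 256, 256])
```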
DiT is based on the Vision Transformer (ViT) architecture, which operates on sequences of patches. The first layer of DiT is "patchify", which converts the spatial
input into a sequence of T tokens, each of dimension d, by linearly embedding each patch of the input; following patchify, frequency-based
positional embeddings are applied. The number of tokens is determined by the patch size p; note that changing p has no meaningful impact on
downstream parameter counts.
Following patchify, the input tokens are processed by a sequence of transformer blocks. In addition to noised image inputs, diffusion models
sometimes process additional conditional information such as the noise timestep t, a class label c, or natural language. There are four different versions of
conditioning. In-context conditioning: simply append the vector embeddings of t and c as two additional tokens in the input sequence; this approach
introduces negligible new Gflops to the model. Cross-attention block: concatenate the embeddings of t and c into a length-two sequence,
separate from the image token sequence, and modify the transformer block to include an additional multi-head cross-attention layer following the
multi-head self-attention block; cross-attention adds the most Gflops, roughly a 15% overhead. Adaptive layer norm (adaLN): replace standard layer
norm layers in transformer blocks with adaptive layer norm, where, rather than directly learning dimension-wise scale and
shift parameters γ and β, they are regressed from the sum of the embedding vectors of t and c. adaLN adds the least Gflops and is thus the most
compute-efficient. They also find that a modification of the adaLN DiT block that zero-initializes the block so it starts as the identity (adaLN-Zero) accelerates large-scale training.
Scalable Diffusion Models with Transformers 2 Mar 2023
https://arxiv.org/pdf/2212.09748
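A minimal sketch of the adaLN(-Zero) conditioning described above: per-channel shift, scale, and a gate are regressed from the sum of the timestep and class embeddings, and the regression layer is zero-initialized so each block starts as the identity. Layer sizes and the block layout are illustrative, not the exact DiT implementation.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Simplified transformer block with adaLN-Zero conditioning (sketch)."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress shift/scale/gate for the attention and MLP branches from the conditioning.
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)  # zero init: gates start at 0, block starts as identity
        nn.init.zeros_(self.ada.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # cond: sum of timestep and class embeddings, shape (B, dim)
        sa, ca, ga, sm, cm, gm = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + ca.unsqueeze(1)) + sa.unsqueeze(1)
        x = x + ga.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + cm.unsqueeze(1)) + sm.unsqueeze(1)
        return x + gm.unsqueeze(1) * self.mlp(h)

block = AdaLNZeroBlock(dim=64)
print(block(torch.randn(2, 16, 64), torch.randn(2, 64)).shape)  # torch.Size([2, 16, 64])
```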
They use two text encoders: a bilingual T5 and CLIP. They perform post-training optimization in the inference stage to lower the
deployment cost of Hunyuan-DiT. They use the SDXL VAE, which was fine-tuned on 512×512 images starting from the SD 1.5 VAE; they report
that the SDXL VAE improves clarity, alleviates over-saturation, and reduces distortions.
They found the adaptive layer norm used in class-conditional DiT unsatisfactory for enforcing fine-grained text conditions, so
they use cross-attention instead. Hunyuan-DiT has two types of transformer blocks, the encoder block and the decoder block. Both
contain three modules: self-attention, cross-attention, and a feed-forward network (FFN). The text information is fused in the cross-
attention module. The decoder block additionally contains a skip module, which adds the information from the corresponding encoder block in the
decoding stage. The skip module is similar to the long skip-connections in U-Nets, but there are no upsampling or downsampling
modules in Hunyuan-DiT due to its transformer structure. Finally, the tokens are reorganized to recover the two-dimensional spatial
structure. For training, they find that v-prediction (predicting the velocity, i.e. the rate of change, instead of the noise) gives better empirical
results.
They use two-dimensional Rotary Positional Embedding (RoPE) to encode both absolute position and relative position dependency. To be able to
generate images at multiple resolutions, they tried extended positional encoding and centralized interpolative positional encoding (CIPE),
and find that CIPE converges faster and generalizes better.
To stabilize the training, they used QK-Norm. They add layer normalization after the skip module in the decoder blocks to avoid loss
explosion during training.
They found certain operations, e.g., layer normalization, tend to overflow with FP16 so they
specifically switch them to FP32 to avoid numerical errors.
Due to the large number of model parameters in Hunyuan-DiT and the massive amount of image data required for training, they
adopt ZeRO, flash-attention, multi-stream asynchronous execution, activation checkpointing, and kernel fusion to enhance training
speed. Since deploying Hunyuan-DiT for users is expensive, they adopt multiple engineering optimization strategies to improve
inference efficiency, including ONNX graph optimization, kernel optimization, operator fusion, precomputation, and GPU memory
reuse.
They find that adversarial training tends to collapse, and that the best way to accelerate the model at inference time is progressive
distillation.
Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with
Fine-Grained Chinese Understanding 14 May 2024
https://arxiv.org/pdf/2405.08748
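A rough sketch of the decoder skip path described above: the encoder-block output and the decoder input are fused and then layer-normalized, which is the post-skip normalization Hunyuan-DiT adds to avoid loss explosion. The concatenation-plus-linear fusion and the dimensions below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SkipFuse(nn.Module):
    """Sketch of a decoder-block skip module: concatenate the long skip from the
    matching encoder block with the decoder input, project back to the model
    dimension, then apply LayerNorm to stabilize training."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        return self.norm(self.proj(torch.cat([x, skip], dim=-1)))

fuse = SkipFuse(dim=64)
print(fuse(torch.randn(1, 16, 64), torch.randn(1, 16, 64)).shape)  # torch.Size([1, 16, 64])
```

QK-Norm, mentioned above, simply normalizes the query and key projections before the attention dot product, which is another guard against numerical blow-ups at scale.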
Lumina-T2X: Transforming Text into Any Modality, Resolution, and
Duration via Flow-based Large Diffusion Transformers 9 May 2024
The Lumina-T2X family is a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, a unified framework
designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. RoPE, RMSNorm, and flow
matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling Lumina-T2X models to scale up to 7 billion parameters and extend
the context window to 128K tokens. Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational cost
of a 600-million-parameter naive DiT (PixArt-α), indicating that increasing the number of parameters significantly accelerates the convergence of
generative models without compromising visual quality. Lumina-T2X tokenizes images, videos, multi-views of 3D objects, and spectrograms into
one-dimensional sequences, similar to the way LLMs process natural language. By incorporating learnable placeholders such as [nextline] and
[nextframe] tokens, Lumina-T2X can seamlessly encode any modality, regardless of resolution, aspect ratio, or even temporal duration, into a
unified 1-D token sequence. The empirical observations indicate that employing larger models, higher-resolution images, and longer-duration video
clips can significantly accelerate the convergence speed of diffusion transformers.
https://arxiv.org/pdf/2405.05945
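A toy illustration of the unified 1-D tokenization with learnable placeholder tokens: any grid of patch tokens, with any number of rows or frames, flattens to a single sequence. The sentinel IDs and nested-list format below are purely illustrative, not the actual Lumina-T2X tokenizer.

```python
from typing import List

NEXTLINE = -1   # stand-in for the learnable [nextline] token
NEXTFRAME = -2  # stand-in for the learnable [nextframe] token

def flatten_video(frames: List[List[List[int]]]) -> List[int]:
    """Flatten frames[frame][row][patch_token] into one 1-D sequence,
    inserting [nextline] after every row and [nextframe] after every frame,
    so any resolution, aspect ratio, or duration maps to a single sequence."""
    seq: List[int] = []
    for frame in frames:
        for row in frame:
            seq.extend(row)
            seq.append(NEXTLINE)
        seq.append(NEXTFRAME)
    return seq

# A 2-frame "video" with 2x3 patch tokens per frame.
video = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
print(flatten_video(video))
```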
PIXART-α: FAST TRAINING OF DIFFUSION TRANSFORMER FOR
PHOTOREALISTIC TEXT-TO-IMAGE SYNTHESIS 29 Dec 2023
They propose a way to train a text-to-image diffusion model at low computational cost (still 753 A100 GPU days and
$28,400). They decompose the intricate text-to-image generation task into three streamlined subtasks: (1) learning the
pixel distribution of natural images, (2) learning text-image alignment, and (3) enhancing the aesthetic quality of images.
To generate captions with high information density, they leverage the state-of-the-art vision-language model LLaVA.
Employing the prompt "Describe this image and its style in a very detailed manner", they significantly improve the
quality of captions.
Based on the Diffusion Transformer (DiT), they incorporate cross-attention modules to inject text conditions and streamline
the computation-intensive class-condition branch to improve efficiency.
https://arxiv.org/pdf/2310.00426
PixArt-Σ is a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution, which evolves from a 'weaker' baseline to a 'stronger' model by
incorporating higher-quality data, a process they term "weak-to-strong training". They also propose a novel attention module within the DiT framework that
compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. To enhance PixArt-α, they expand its
generation resolution from 1K to 4K. Generating images at high resolutions introduces a significant increase in the number of tokens, and thus in computational demand, so they
introduce a self-attention module with key and value token compression tailored to the DiT framework. Additionally, they employ a specialized weight-initialization
scheme, allowing for a smooth adaptation from a pre-trained model without KV compression. This design reduces training and inference time by 34% for
high-resolution image generation. They utilize only 9% of the GPU days required by PixArt-α to achieve a strong 1K high-resolution image generation model. They
replace LLaVA with Share-Captioner to prevent hallucinations.
To mitigate the potential information loss caused by KV compression in the self-attention computation, they opt to retain all query (Q) tokens. This strategy allows them
to utilize KV compression effectively while mitigating the risk of losing crucial information. They use group convolutions with a stride of 2 for local aggregation of
keys and values as the compression function, and design a specialized convolution kernel initialization, "Conv Avg Init", that uses group convolution and initializes the
weights as w = 1/R², equivalent to an average operator.
They replace PixArt-α's VAE (the SD 1.5 VAE) with the SDXL VAE and fine-tune it; 2K training steps are enough for this fine-tuning. While fine-tuning the LR model into an HR
model, they see a performance degradation caused by discrepancies in positional embeddings (PE) between different resolutions. To solve this, they initialize the HR
model's PE by interpolating the LR model's PE; the fine-tuning quickly converges within 1K steps. KV compression can be used directly when fine-tuning from LR pre-
trained models without KV compression, and this reduces training and inference time by 34%.
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-
Image Generation 17 Mar 2024
https://arxiv.org/pdf/2403.04692
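A sketch of the key/value token compression with "Conv Avg Init" described above: a stride-R convolution over the 2-D token grid whose weights start at 1/R², so at initialization it behaves like average pooling and the pretrained attention sees (approximately) averaged tokens. The depthwise (per-channel group) convolution and the shapes below are illustrative choices.

```python
import torch
import torch.nn as nn

class KVCompressor(nn.Module):
    """Compress key/value tokens with a stride-R group convolution,
    initialized as an average operator ("Conv Avg Init"): w = 1 / R**2.
    Queries keep all tokens; only K and V pass through this module."""
    def __init__(self, dim: int, r: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=r, stride=r, groups=dim)
        nn.init.constant_(self.conv.weight, 1.0 / (r * r))
        nn.init.zeros_(self.conv.bias)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = tokens.shape                     # tokens: (B, N, C), N = h * w
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.conv(x)                           # (B, C, h/R, w/R)
        return x.flatten(2).transpose(1, 2)        # (B, N/R**2, C)

kv = KVCompressor(dim=64, r=2)
print(kv(torch.randn(1, 32 * 32, 64), h=32, w=32).shape)  # torch.Size([1, 256, 64])
```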
WÜRSTCHEN: AN EFFICIENT ARCHITECTURE FOR LARGE-SCALE
TEXT-TO-IMAGE DIFFUSION MODELS 29 SEP 2023
Würstchen is a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for
large-scale text-to-image diffusion models. A key contribution of the work is a latent diffusion technique in which a detailed but
extremely compact semantic image representation is learned and used to guide the diffusion process.
They first train a VQGAN (Stage A), then a latent image decoder (Stage B), and then a text-conditional latent image
generation model (Stage C). For image generation, they first generate a latent image at a strong compression ratio using the
text-conditional LDM (Stage C). Subsequently, this representation is transformed to a less-compressed latent
space by means of a secondary model tasked with this reconstruction (Stage B). Finally, the tokens
that comprise the latent image at this intermediate resolution are decoded to yield the output image (Stage A).
They initialize the Semantic Compressor with weights pre-trained on ImageNet, which, however, do
not capture the broad distribution of images present in large text-image datasets and are not well-suited
for semantic image projection, since they were trained with an objective to discriminate the ImageNet
categories. So they update the weights of the Semantic Compressor during training, establishing a
latent space with high-precision semantic information. While training Stage B, they intermittently add noise to
the Semantic Compressor's embeddings, to teach the model to understand non-perfect embeddings, which is likely
to be the case when these embeddings are generated by Stage C. For Stage C training, they follow a standard
diffusion process, applied in the latent space of the fine-tuned Semantic Compressor.
https://arxiv.org/pdf/2405.16759
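A high-level sketch of the three-stage sampling cascade described above; the callables are placeholders standing in for the trained Würstchen stages, not an actual API.

```python
from typing import Any, Callable

def wuerstchen_generate(prompt_emb: Any,
                        stage_c_sample: Callable[[Any], Any],
                        stage_b_sample: Callable[[Any], Any],
                        stage_a_decode: Callable[[Any], Any]) -> Any:
    """Stage C: text-conditional diffusion in the strongly compressed semantic latent space.
    Stage B: secondary model that expands the semantic latent to the less-compressed latent space.
    Stage A: VQGAN decoder from latent tokens to pixels."""
    semantic_latent = stage_c_sample(prompt_emb)
    vq_latent = stage_b_sample(semantic_latent)
    return stage_a_decode(vq_latent)
```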
They introduce a novel architecture, Shallow-UViT, which allows one to pretrain the pixel-space diffusion model's core layers on huge text-image
datasets, eliminating the need to train the entire model on high-resolution images. One can significantly improve different image quality
metrics by leveraging the representation pretrained at low resolution while growing the model resolution in a greedy fashion. They simplify
the U-Net's conventional hierarchical structure, which operates on multiple resolutions, and define the Shallow-UViT (SU), a simplified
architecture comprising a shallow encoder and decoder operating on a fixed spatial grid.
Bad paper. Do not read it.
Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
28 May 2024
https://arxiv.org/pdf/2405.16759
They identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Their solution involves crafting a series of customized text
encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder on a meticulously curated paired glyph-text dataset. They create high-quality text-glyph data by
establishing a scalable pipeline capable of generating virtually unlimited paired data based on graphic rendering. They employ an innovative box-level contrastive loss
to fine-tune ByT5 into a series of customized text encoders for glyph generation, named Glyph-ByT5, then integrate it into SDXL using an efficient region-wise cross-
attention mechanism. Beyond single words, paragraphs are challenging since they do not fit into a single line, so they define a 'paragraph' as a block of text content that cannot be
accommodated within a single line, typically consisting of more than 10 words or 100 characters. They empirically demonstrate that the diffusion model can effectively
plan multi-line arrangements and adjust the line or word spacing according to the given text box, regardless of its size or aspect ratio. Unlike conventional CLIP,
which applies a contrastive loss to the entire image, they propose a box-level contrastive loss that treats each text box and its corresponding text prompt as an
instance. Based on the number of characters or words within the text box, they categorize them into word, sentence, or paragraph text
boxes.
For the box-level contrastive loss, they compute the box embedding and sub-text embedding of each image-text pair; the embeddings come from the text encoder and a
visual encoder. They introduce a region-wise multi-head cross-attention mechanism to seamlessly fuse the glyph knowledge encoded in the customized text encoder
within the target typography boxes and the prior knowledge carried by the original text encoders in the regions outside the typography boxes. In this region-wise multi-head
cross-attention, they first partition the input pixel embeddings (queries) into multiple groups corresponding to the target text boxes, which can be
either specified by the user or automatically predicted by leveraging the planning capability of GPT-4. Simultaneously, they divide the text prompts (keys and values) into
corresponding sub-sections, consisting of a global prompt and several groups of glyph-specific prompts. They direct the pixel embeddings within the target
text boxes to attend only to the glyph text embeddings extracted with Glyph-ByT5. Similarly, pixel embeddings outside the text boxes attend exclusively to
the global prompt embeddings extracted with the original two CLIP text encoders.
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text
Rendering 14 Mar 2024
https://arxiv.org/pdf/2403.09622
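A simplified sketch of the region-wise cross-attention routing: pixel queries inside a typography box attend only to the Glyph-ByT5 embeddings, while pixels outside all boxes attend only to the global CLIP prompt embeddings. This toy version handles a single glyph group with a simple attention mask; the actual implementation partitions queries into one group per box.

```python
import torch
import torch.nn.functional as F

def regionwise_cross_attention(q, glyph_kv, global_kv, box_mask):
    """q: (B, N, D) pixel queries; glyph_kv: (B, M, D) glyph text embeddings;
    global_kv: (B, G, D) global prompt embeddings;
    box_mask: (B, N) bool, True where a pixel lies inside a typography box."""
    kv = torch.cat([glyph_kv, global_kv], dim=1)              # (B, M+G, D)
    scores = q @ kv.transpose(1, 2) / q.shape[-1] ** 0.5      # (B, N, M+G)
    m = glyph_kv.shape[1]
    neg = torch.finfo(scores.dtype).min
    inside = box_mask.unsqueeze(-1)
    scores[..., m:] = scores[..., m:].masked_fill(inside, neg)    # inside boxes: mask global tokens
    scores[..., :m] = scores[..., :m].masked_fill(~inside, neg)   # outside boxes: mask glyph tokens
    return F.softmax(scores, dim=-1) @ kv                     # (B, N, D)

out = regionwise_cross_attention(
    torch.randn(1, 64, 32), torch.randn(1, 8, 32), torch.randn(1, 16, 32),
    torch.zeros(1, 64, dtype=torch.bool))
print(out.shape)  # torch.Size([1, 64, 32])
```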
EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-
Modal Prompts 13 Jun 2024
https://arxiv.org/pdf/2403.09622
No Training, No Problem: Rethinking Classifier-Free
Guidance for Diffusion Models 2 Jul 2024
https://arxiv.org/pdf/2407.02687
Swap-Inpainting Models
The following slides include:
- SWAPANYTHING: Enabling Arbitrary Object Swapping in Personalized Visual Editing 6 May 2024
SWAPANYTHING: Enabling Arbitrary Object Swapping in Personalized
Visual Editing 6 May 2024
https://arxiv.org/pdf/2404.05717
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
7 Jul 2024
https://arxiv.org/pdf/2407.05282
Output Control Techniques
The following slides include:
- Adding Conditional Control to Text-to-Image Diffusion Models 26 Nov 2023
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback 11 Apr 2024
- CTRLororALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models 13 May 2024
ControlNet is a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion
models. Zero-initialized layers are used by ControlNet for connecting network blocks. The reason for initializing weights as
zero instead of Gaussian is to progressively grow the parameters from zero and ensure that no harmful noise affects
the fine-tuning. To add a ControlNet to a pre-trained neural block, they lock (freeze) the parameters Θ of the original
block and simultaneously clone the block into a trainable copy with its own parameters. The trainable copy takes an external
conditioning vector c as input and is connected to the locked model with zero-initialized 1×1 convolution layers.
In the training process, they randomly replace 50% text prompts with empty strings. This approach increases ControlNet’s
ability to directly recognize semantics in the input conditioning images (e.g., edges, poses, depth, etc.) as a replacement for
the prompt. They see that the model converges suddenly, not progressively.
When a conditioning image is added via ControlNet, it can be added to both ϵ_uc and ϵ_c, or only to ϵ_c. Their solution is
to first add the conditioning image to ϵ_c and then multiply each connection between Stable Diffusion and ControlNet by a
weight w_i according to the resolution of the corresponding block.
To apply multiple conditioning images (e.g., Canny edges and pose) to a single instance of Stable Diffusion, the outputs of the
corresponding ControlNets can be directly added to the Stable Diffusion model.
Adding Conditional Control to Text-to-Image Diffusion Models
26 Nov 2023
https://arxiv.org/pdf/2302.05543
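A minimal sketch of the ControlNet wiring for a single pretrained block: the original block is frozen, a trainable copy receives the conditioning, and the copy is connected through zero-initialized 1×1 convolutions so the combined model initially behaves exactly like the original.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlNetBlock(nn.Module):
    """Wrap a pretrained block F: y = F(x) becomes y = F(x) + Z_out(F_copy(x + Z_in(c)))."""
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.copy = copy.deepcopy(pretrained_block)   # trainable copy, cloned before freezing
        self.locked = pretrained_block
        for p in self.locked.parameters():
            p.requires_grad_(False)                   # freeze the original weights
        self.zero_in = zero_conv(channels)
        self.zero_out = zero_conv(channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return self.locked(x) + self.zero_out(self.copy(x + self.zero_in(cond)))

block = ControlNetBlock(nn.Conv2d(8, 8, 3, padding=1), channels=8)
print(block(torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)).shape)  # torch.Size([1, 8, 16, 16])
```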
IP-Adapter: Text Compatible Image Prompt Adapter for
Text-to-Image Diffusion Models 13 Aug 2023
https://arxiv.org/pdf/2406.09413
For an input conditional control, they use a pre-trained discriminative reward model to extract the corresponding condition of the generated images,
and then optimize the consistency loss between the input conditional control and extracted condition. They introduce an efficient reward strategy
that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning.
The model performs T denoising steps to generate the image x′_0 from random noise x_T. L is an abstract metric function that can take different concrete
forms for different visual conditions; it compares the input condition with the output of a supervisor (reward) model applied to the diffusion model's output. For example,
when using a segmentation mask as the input conditional control, L could be the per-pixel cross-entropy loss.
Since the pixel-space consistency loss requires x_0, the final denoised image, which normally takes 20-50 sampling steps and therefore too much
computation, they propose a one-step efficient reward strategy: by algebraic manipulation of the diffusion equations, the denoised image can be
approximated from a single-step noise prediction.
A timestep threshold, a hyper-parameter, determines whether a noised image x_t should be utilized for reward fine-tuning. They
note that a small noise ϵ (i.e., a relatively small timestep t) can disturb the consistency and lead to effective reward fine-tuning. During the reward
fine-tuning phase, they freeze the pre-trained discriminative reward model and the text-to-image model, and only update the ControlNet module
following its original implementation, which ensures its generative capabilities are not compromised.
ControlNet++: Improving Conditional Controls
with Efficient Consistency Feedback 11 Apr 2024
https://arxiv.org/pdf/2404.07987
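A sketch of the single-step reward strategy: noise the image at a (small) timestep t, predict the noise once, form the one-step estimate x̂_0, and compute a consistency loss between the input condition and the condition re-extracted by a frozen reward model. All names, the α/σ values, and the stub models are illustrative.

```python
import torch
import torch.nn.functional as F

def one_step_reward_loss(unet, reward_model, x0, cond_mask, alpha, sigma, t):
    """x0: clean images/latents (B, C, H, W); cond_mask: e.g. segmentation labels (B, H, W);
    alpha, sigma: diffusion schedule coefficients at timestep t."""
    noise = torch.randn_like(x0)
    xt = alpha * x0 + sigma * noise             # disturb the input by adding noise
    eps = unet(xt, t, cond_mask)                # single noise prediction
    x0_hat = (xt - sigma * eps) / alpha         # one-step denoised estimate
    logits = reward_model(x0_hat)               # frozen discriminative reward model, (B, K, H, W)
    return F.cross_entropy(logits, cond_mask)   # per-pixel consistency loss

# Toy invocation with stub models (shapes only).
unet = lambda xt, t, c: torch.zeros_like(xt)
reward = lambda x: torch.randn(x.shape[0], 5, *x.shape[2:])
loss = one_step_reward_loss(unet, reward, torch.randn(2, 3, 8, 8),
                            torch.randint(0, 5, (2, 8, 8)), alpha=0.9, sigma=0.44, t=100)
print(loss.item())
```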
CTRLororALTer: Conditional LoRAdapter for Efficient
0-Shot Control & Altering of T2I Models 13 May 2024
CTRLororALTer: formulating a unified approach for conditioning on global controls like style and on local controls like structure, in an efficient and
generic manner remains a key open problem. LoRAdapter is a novel approach to adding conditional information to LoRAs, enabling zero-shot
generalization and making them applicable for both structure and style and possibly many other conditioning types. They propose a LoRA-based
conditioning mechanism whose behavior changes based on conditioning provided at inference time, enabling zero-shot generalization.
They add the condition to matrix A: the LoRA down-projection A x is modulated as γ(c) ⊙ A x + β(c) before the up-projection B, where ⊙ denotes
an elementwise (Hadamard) product and γ(c) and β(c) are the scale and shift factors regressed from the condition c.
https://arxiv.org/pdf/2405.07913
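A sketch of a conditional LoRA in the spirit of LoRAdapter: the low-rank branch's intermediate activation A x is modulated by a scale γ(c) and shift β(c) regressed from the conditioning embedding. The exact placement of the modulation and the layer sizes are assumptions based on the description above.

```python
import torch
import torch.nn as nn

class ConditionalLoRALinear(nn.Module):
    """Frozen base linear layer plus a conditional LoRA branch:
    y = W x + B(gamma(c) * (A x) + beta(c))."""
    def __init__(self, base: nn.Linear, rank: int, cond_dim: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)                       # branch starts as a no-op
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * rank)  # regress scale/shift from the condition

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return self.base(x) + self.B(gamma * self.A(x) + beta)

layer = ConditionalLoRALinear(nn.Linear(64, 64), rank=8, cond_dim=32)
print(layer(torch.randn(2, 64), torch.randn(2, 32)).shape)  # torch.Size([2, 64])
```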
pOps: Photo-Inspired Diffusion Operators 3 Jun 2024
https://arxiv.org/pdf/2406.01300
Training Time Optimization
The following slides include:
- Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment 18 Jun 2024
- Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget 22 Jul 2024
Immiscible Diffusion: Accelerating Diffusion Training with Noise
Assignment 18 Jun 2024
https://arxiv.org/pdf/2406.12303
Stretching Each Dollar: Diffusion Training from Scratch on
a Micro-Budget 22 Jul 2024
https://arxiv.org/pdf/2407.15811
Inference Time Optimization
The following slides include:
- SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions 25 Mar 2024
- Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation 18 Apr 2024
- PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator 13 May 2024
- Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation (Stability AI Turbo Solution) 18 March 2024
- Distilling Diffusion Models into Conditional GANs 9 May 2024
- Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models 3 April 2024
- EdgeFusion: On-Device Text-to-Image Generation 18 April 2024
SDXS: Real-Time One-Step Latent Diffusion Models
with Image Conditions 25 Mar 2024
They introduce a dual approach involving model miniaturization and a reduction in sampling steps. The methodology leverages knowledge distillation
to streamline the U-Net and image decoder architectures, and introduces an innovative one-step DM training technique that utilizes feature
matching and score distillation.
VAE decoder optimization: utilizing a pretrained diffusion model F to sample latent codes z and a pretrained VAE decoder to reconstruct images x,
they introduce a VAE Distillation (VD) loss for training a tiny image decoder G. They build G only from CNN blocks to eliminate complex
components like attention and normalization (I do not know why they think that norm layers are computationally overwhelming).
U-Net optimization: they selectively remove residual and transformer blocks from the U-Net, aiming to train a more compact model that can still
reproduce the original model's intermediate feature maps and outputs effectively. Initializing noise and sampling images with an ODE to get noise-
image pairs results in low-quality images, so they use Rectified Flow, which tackles this challenge by straightening the sampling trajectories. Using an MSE
loss causes the model to output the average of multiple feasible solutions, so they use SSIM. They also straighten the model's trajectories to
narrow the range of feasible outputs using existing fine-tuning methods like LCM.
One-step training: they first train the model for feature matching with an SSIM loss as warm-up; at this stage, they sample noise-image pairs.
As the trajectories of ODEs (for example, DDPM) are not straight, they use LCM-LoRA to rectify the flow. They note that the warm-up training results are
good in image quality but do not capture the data distribution, so they add score distillation sampling with a learned manifold corrective.
https://arxiv.org/pdf/2403.16627
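A sketch of the VAE-decoder distillation idea: latents from a frozen pretrained pipeline are decoded by both the original (frozen) VAE decoder and a tiny conv-only decoder G, which is trained to match the teacher's reconstruction. The plain L1 loss, the architecture of G, and the function names are assumptions, not the paper's exact VD loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

tiny_decoder = nn.Sequential(            # G: CNN blocks only, no attention or normalization
    nn.Conv2d(4, 64, 3, padding=1), nn.SiLU(),
    nn.Upsample(scale_factor=8, mode="nearest"),
    nn.Conv2d(64, 3, 3, padding=1),
)

def vae_distillation_step(latents, frozen_vae_decoder, optimizer):
    with torch.no_grad():
        target = frozen_vae_decoder(latents)   # teacher reconstruction
    loss = F.l1_loss(tiny_decoder(latents), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```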
To achieve better quality at low step counts, they propose to distill along the student's backward path instead of the forward path. Put differently,
rather than having the student mimic the teacher, they use the teacher to improve the student based on its current state of knowledge. They
propose a Shifted Reconstruction Loss that dynamically adapts the knowledge transfer from the teacher model: the loss is designed to
distill global, structural information from the teacher at high time steps, while focusing on rendering fine-grained details and high-frequency
components at lower time steps. They also propose noise correction, a training-free inference-time modification that enhances sample quality.
The coefficients are commonly chosen such that x_T is not pure noise during training, but rather contains low-frequency information leaked from x_0.
Since x_t = α_t x_0 + σ_t x_T, any stochastic interpolant x_t with t < T still contains information from the ground-truth sample via the first
summand α_t x_0, which is the source of the leakage.
Backward distillation eliminates information leakage at all time steps t, preventing the model from relying on a ground-truth signal. This is achieved
by simulating the inference process during training, which can also be interpreted as calibrating the student on its own upstream backward path:
they first perform backward iterations of the student model to obtain an intermediate sample x_t, then use it as input for
both the student and teacher models during training.
For distillation loss, they define shifted reconstruction loss which is designed such that for higher values of t, the target produced by the teacher
model displays global content similarity with the student output but with improved semantic text alignment; and for lower values of t, the target
image features enhanced fine-grained details while maintaining the same overall structure as the student’s prediction.
At t = T, where the input is pure noise, predicting the noise is not informative, so existing works propose predicting the velocity, i.e. the rate of
change. Unfortunately, converting a model to velocity prediction requires extra training effort. They present a training-free alternative: by
treating t = T as a special case and replacing ϵ_Θ with the true noise x_T, the solver update f is corrected.
Shifted Reconstruction Loss
Imagine Flash: Accelerating Emu Diffusion Models with Backward
Distillation 18 Apr 2024
https://arxiv.org/pdf/2405.05224
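A schematic of backward distillation as described above: run the student's own few-step backward path from pure noise (without gradients) to reach an intermediate sample, then compute the distillation loss between the student and teacher predictions at that point. The step()/predict_x0() methods and the plain MSE loss are placeholders standing in for the paper's solver updates and shifted reconstruction loss.

```python
import torch
import torch.nn.functional as F

def backward_distill_step(student, teacher, schedule, x_T):
    """schedule: decreasing list of timesteps [T, ..., t_k]."""
    x = x_T
    with torch.no_grad():
        for t_cur, t_next in zip(schedule[:-1], schedule[1:]):
            x = student.step(x, t_cur, t_next)     # student's own backward path (no leakage from x_0)
    t = schedule[-1]
    x0_student = student.predict_x0(x, t)
    with torch.no_grad():
        x0_teacher = teacher.predict_x0(x, t)      # target (shifted reconstruction in the paper)
    return F.mse_loss(x0_student, x0_teacher)
```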
PeRFlow: Piecewise Rectified Flow as Universal Plug-and-
Play Accelerator 13 May 2024
PeRFlow divides the sampling process of generative flows into several time windows
and straightens the trajectories in each interval via the reflow operation, thereby
approaching piecewise linear flows. Specifically, they attempt to straighten the
trajectories of the original PF-ODEs via a piecewise reflow operation. By solving the
ODEs in the shortened time interval, PeRFlow avoids simulating the entire ODE
trajectory for preparing the training data. Through such a divide-and-conquer strategy,
PeRFlow can straighten the sampling trajectories with large-scale real training data.
Diffusion models are usually trained with ϵ-prediction, but flow-based generative
models generate data by following the velocity field. They derive the correspondence
between ϵ-prediction and the velocity field of flow, thus narrowing the gap between the
pretrained diffusion models and the student PeRFlow model.
They divide the ODE trajectories into multiple time windows and straighten the
trajectories in each time window via the reflow operation. The pretrained diffusion
models are usually trained by two parameterization tricks, namely ϵ-prediction and
velocity-prediction. To inherit knowledge from the pretrained network, they
parameterize the PeRFlow model as the same type of diffusion and initialize network θ
from the pretrained diffusion model ϕ.
https://arxiv.org/pdf/2405.07510
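For reference, the generic bridge between ϵ-prediction and the velocity field of the interpolant x_t = α_t x_0 + σ_t ϵ, which is the kind of correspondence the paper uses to initialize PeRFlow from a pretrained ϵ-prediction model (a standard identity, stated here as a sketch):

```latex
% Interpolant and its time derivative (the velocity of the flow)
x_t = \alpha_t x_0 + \sigma_t \epsilon
\quad\Rightarrow\quad
v(x_t, t) = \dot{\alpha}_t x_0 + \dot{\sigma}_t \epsilon

% Substituting x_0 = (x_t - \sigma_t \epsilon) / \alpha_t expresses the velocity
% through the epsilon-prediction:
v(x_t, t) = \frac{\dot{\alpha}_t}{\alpha_t}\, x_t
          + \Big( \dot{\sigma}_t - \frac{\dot{\alpha}_t\, \sigma_t}{\alpha_t} \Big)\, \epsilon_\Theta(x_t, t)
```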
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion
Distillation 18 March 2024
The authors say that Adversarial Diffusion Distillation (ADD) was a big step, but the use of the fixed, pretrained DINOv2 network restricts the discriminator's
training resolution to 518 × 518 pixels, and there is no straightforward way to control the feedback level of the discriminator. Additionally, as Yann
LeCun has noted, the need to decode to RGB space is a problem. They also observe that smaller discriminator
feature networks often offer better performance than their larger counterparts.
They distill generative features of a pretrained diffusion model instead of DINOv2. By targeted sampling of the noise levels during training, the
discriminator features can be biased towards more global (high noise level) or local (low noise level) behavior.
Many distillation techniques attempt to learn "simpler" differential equations that result in the same distribution at t = 0 but with
"straighter", more linear trajectories, which allows larger step sizes and therefore fewer evaluations of the network.
LADD introduces two main modifications: unification of the discriminator and the teacher model, and the adoption of synthetic data for training.
They first generate an image with the teacher model, then add noise to the generated image, and then denoise it with both the
teacher and student networks. They calculate the loss over these latent-space representations.
They also feed the student's output to the teacher model; after each layer of the teacher model (a transformer), they attach a discriminator
head and calculate the adversarial loss with these heads.
In one-shot scenarios, CFG simply oversaturates samples rather than improving text alignment. This observation suggests that CFG works
best in settings with multiple steps, allowing corrections of oversaturation issues in most cases. They also see that while the distillation loss
benefits training with real data, it offers no advantage for synthetic data; thus, training on synthetic data can be effectively conducted using
only an adversarial loss.
https://arxiv.org/pdf/2403.12015
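A high-level sketch of a LADD-style generator step on synthetic data: the teacher generates a latent, noise is added at a targeted level t, the student denoises it in one step, and adversarial feedback comes from lightweight discriminator heads attached to the frozen teacher's intermediate features. All module methods here are placeholders, not an actual API.

```python
import torch
import torch.nn.functional as F

def ladd_generator_step(student, teacher, disc_heads, prompt_emb, t):
    with torch.no_grad():
        z_real = teacher.generate(prompt_emb)          # synthetic training latent
    noise = torch.randn_like(z_real)
    z_t = teacher.add_noise(z_real, noise, t)          # renoise at sampled level t
    z_fake = student.denoise(z_t, t, prompt_emb)       # one-step student prediction
    feats = teacher.features(z_fake, t, prompt_emb)    # frozen teacher as feature network
    logits = [head(f) for head, f in zip(disc_heads, feats)]
    return sum(F.softplus(-l).mean() for l in logits)  # non-saturating generator loss
```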
Distilling Diffusion Models into Conditional GANs 9 May 2024
https://arxiv.org/pdf/2405.05967
Cross-Attention Makes Inference Cumbersome
in Text-to-Image Diffusion Models 3 April 2024
https://arxiv.org/pdf/2404.02747
EdgeFusion: On-Device Text-to-Image Generation 18 April 2024
https://arxiv.org/pdf/2404.11925
Improved Distribution Matching Distillation for Fast Image Synthesis
23 May 2024
https://arxiv.org/pdf/2405.14867
EM Distillation for One-step Diffusion Models 27 May 2024
https://arxiv.org/pdf/2405.16852
DiTFastAttn: Attention Compression for Diffusion Transformer Models
12 Jun 2024
https://arxiv.org/pdf/2406.08552
Boosting Latent Diffusion with Flow Matching 28 Mar 2024
https://arxiv.org/pdf/2312.07360
Quality Enhancement
The following slides include:
- Align Your Steps 22 April 2024
Align Your Steps 22 April 2024
Sampling from DMs can be seen as solving a differential equation through a discretized set of noise levels known as the sampling schedule. They
propose a general and principled approach to optimizing the sampling schedules of DMs for high-quality outputs.
SDE solvers excel in sampling from diffusion models due to their built-in error-correction, allowing them to outperform ODE solvers.
https://arxiv.org/pdf/2404.14507
Step-aware Preference Optimization: Aligning Preference with Denoising
Performance at Each Step 6 Jun 2024
https://arxiv.org/pdf/2406.04314
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance
26 Mar 2024
https://arxiv.org/pdf/2403.17377
Face Models
The following slides include:
- InstantID: Zero-shot Identity-Preserving Generation in Seconds 2 Feb 2024
InstantID: Zero-shot Identity-Preserving Generation in Seconds
2 Feb 2024
They use a pre-trained face model to detect faces and extract a face ID embedding from the reference facial image, providing strong identity
features to guide the image generation process.
Image Adapter: a lightweight adaptive module with decoupled cross-attention is introduced to support images as prompts. However, they diverge
by employing the ID embedding as the image prompt, as opposed to the coarsely aligned CLIP embedding. This choice is aimed at achieving a more
nuanced and semantically rich prompt integration.
Directly adding the text and image tokens in cross-attention tends to weaken the control exerted by the text tokens, so they adapt a ControlNet, named
IdentityNet. In this net, they use 5 facial landmarks instead of the 68 OpenPose keypoints, and instead of text embeddings they use ArcFace identity
embeddings in the cross-attention layers.
https://arxiv.org/pdf/2401.07519
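A sketch of decoupled cross-attention with an ID embedding as the image prompt: the ID tokens get their own key/value attention path, and its output is added to the text cross-attention output (in the spirit of IP-Adapter-style adapters). Dimensions, the number of ID tokens, and the projection from ArcFace features to tokens are illustrative.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.id_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x, text_tokens, id_tokens, id_scale: float = 1.0):
        # x: (B, N, D) latent tokens; text_tokens: (B, L, D);
        # id_tokens: (B, K, D) face-ID embeddings (e.g. ArcFace features projected to D).
        out_text, _ = self.text_attn(x, text_tokens, text_tokens, need_weights=False)
        out_id, _ = self.id_attn(x, id_tokens, id_tokens, need_weights=False)
        return out_text + id_scale * out_id

attn = DecoupledCrossAttention(dim=64)
print(attn(torch.randn(1, 16, 64), torch.randn(1, 8, 64), torch.randn(1, 4, 64)).shape)
# torch.Size([1, 16, 64])
```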
RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance
23 May 2024
They exploit the well-known classifier guidance methodology, which modifies an existing denoising process using the gradient from a pre-trained
classifier. The rationale behind this choice is twofold: first, it directly harnesses the discriminator's domain knowledge for identity preservation,
which may be a cost-effective substitute for training on domain-specific datasets; second, keeping the diffusion model intact allows for plug-and-
play combination with different discriminators. This work builds on a recent framework named rectified flow, featuring strong theoretical properties,
e.g. the straightness of its sampling trajectory. By approximating the rectified flow to be ideally straight, the original classifier guidance is
reformulated as a simple fixed-point problem concerning only the trajectory endpoints, thus naturally overcoming its reliance on a special noise-
aware classifier. This allows flexible reuse of image discriminators for identity preservation in personalization tasks.
Rectified flow recap: the aim is to learn a velocity field v that maps random noise z_0 ∼ π_0 to samples from a complex distribution z_1 ∼ π_data
via an ordinary differential equation (ODE). Instead of directly solving the ODE (Chen et al., 2018), rectified flow (Liu et al., 2023a) simply learns
a linear interpolation between the two distributions by minimizing an interpolation-matching objective (recapped below).
Classifier guidance recap: a test-time mechanism to adjust the predicted noise ϵ(z_t, t) based on the guidance from a classifier. Given condition c
and classifier output p(c|z_t), the adjustment adds the scaled classifier gradient to the predicted noise, as shown below.
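For reference, standard statements of the two recaps (compact restatements in common notation, not copied from the paper):

```latex
% Rectified flow: match the velocity field to the straight interpolation between
% noise z_0 ~ pi_0 and data z_1 ~ pi_data
\min_{v}\; \mathbb{E}_{z_0 \sim \pi_0,\; z_1 \sim \pi_{\text{data}},\; t \sim \mathcal{U}[0,1]}
\big\| (z_1 - z_0) - v(z_t, t) \big\|^2,
\qquad z_t = t\, z_1 + (1 - t)\, z_0

% Classifier guidance: adjust the predicted noise with the classifier gradient
% (guidance scale s)
\hat{\epsilon}(z_t, t) = \epsilon(z_t, t) - s\, \sigma_t\, \nabla_{z_t} \log p(c \mid z_t)
```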
They combine rectified flow and classifier guidance at inference time, controlling the flow with classifier guidance to achieve the desired
output. Since the theoretical foundation of this paper is heavy, the details are not included in these slides.
https://arxiv.org/pdf/2405.14677
Video Generation
The following slides include:
- STORYDIFFUSION: CONSISTENT SELF-ATTENTION FOR LONG-RANGE IMAGE AND VIDEO GENERATION
2 May 2024
STORYDIFFUSION: CONSISTENT SELF-ATTENTION
FOR LONG-RANGE IMAGE AND VIDEO GENERATION 2 May 2024
https://arxiv.org/pdf/2405.01434
SF-V: Single Forward Video Generation Model 6 Jun 2024
https://arxiv.org/pdf/2406.04324
Image Editing
The following slides include:
- Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps 20 Jun 2024
Invertible Consistency Distillation for Text-Guided Image Editing in Around
7 Steps 20 Jun 2024
https://arxiv.org/pdf/2406.04324
Flow Control
The following slides include:
- Consistency Flow Matching: Defining Straight Flows with Velocity Consistency 2 Jul 2024
Consistency Flow Matching:
Defining Straight Flows with Velocity Consistency 2 Jul 2024
https://arxiv.org/pdf/2407.02398
Research topics
Flow, velocity
Flow matching,
rectified flow.
Training logs, techniques and strategies to be added about:
InstantID
IpAdapter
Controlnet
Lora
Dreambooth
Adapter Adapter (basic training strategy that I find)