Oğuzhan Ercan
x.com/oguzhannercan
They used two text encoders, a bilingual T5 and CLIP. They perform post-training optimization at the inference stage to lower the
deployment cost of Hunyuan-DiT. They used the VAE from SDXL, which was fine-tuned from the SD 1.5 VAE on 512x512 images. They say
the SDXL VAE improves clarity, alleviates over-saturation, and reduces distortions.
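A minimal sketch of the dual-encoder setup, assuming Hugging Face transformers with openai/clip-vit-large-patch14 and google/mt5-base as stand-ins for the actual bilingual encoders; the project-and-concatenate fusion is an assumption for illustration, not necessarily the paper's exact scheme:

```python
import torch
from transformers import AutoTokenizer, CLIPTextModel, MT5EncoderModel

clip_tok = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
t5_tok = AutoTokenizer.from_pretrained("google/mt5-base")   # stand-in for the bilingual T5
t5_enc = MT5EncoderModel.from_pretrained("google/mt5-base")

prompt = "一只在雪地里玩耍的柴犬"  # "a Shiba Inu playing in the snow" — bilingual prompts motivate the dual setup

with torch.no_grad():
    clip_ids = clip_tok(prompt, padding="max_length", max_length=77,
                        truncation=True, return_tensors="pt").input_ids
    clip_states = clip_enc(clip_ids).last_hidden_state       # (1, 77, 768)
    t5_ids = t5_tok(prompt, padding="max_length", max_length=256,
                    truncation=True, return_tensors="pt").input_ids
    t5_states = t5_enc(t5_ids).last_hidden_state             # (1, 256, 768)

# Project both streams to a shared width and concatenate along the token
# axis so cross-attention sees one text sequence (fusion scheme assumed).
proj_clip = torch.nn.Linear(clip_states.shape[-1], 1024)
proj_t5 = torch.nn.Linear(t5_states.shape[-1], 1024)
text_context = torch.cat([proj_clip(clip_states), proj_t5(t5_states)], dim=1)  # (1, 333, 1024)
```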
They found that the Adaptive Layer Norm used in class-conditional DiT performs unsatisfactorily at enforcing fine-grained text conditions, so
they used cross-attention instead. Hunyuan-DiT has two types of transformer blocks, the encoder block and the decoder block. Both
contain three modules: self-attention, cross-attention, and a feed-forward network (FFN). The text information is fused in the cross-
attention module. The decoder block additionally contains a skip module, which adds the information from the corresponding encoder block in the
decoding stage. The skip module is similar to the long skip-connections in U-Nets, but there are no upsampling or downsampling
modules in Hunyuan-DiT due to the transformer structure (see the block sketch below). Finally, the tokens are reorganized to recover the two-dimensional spatial
structure. For training, they find that v-prediction (predicting the velocity v = α_t·ε − σ_t·x0 for the noisy sample x_t = α_t·x0 + σ_t·ε, i.e., the rate of change, instead of the noise ε) gives better empirical
results.
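A minimal PyTorch sketch of this encoder/decoder block layout, assuming pre-norm residual wiring and a linear skip projection; the class name and dimensions are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn

class HunyuanStyleBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int, is_decoder: bool = False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.is_decoder = is_decoder
        if is_decoder:
            # Skip module: fuse the matching encoder block's output, followed
            # by the LayerNorm the paper adds to avoid loss explosion.
            self.skip_proj = nn.Linear(2 * dim, dim)
            self.skip_norm = nn.LayerNorm(dim)

    def forward(self, x, text_ctx, enc_skip=None):
        if self.is_decoder and enc_skip is not None:
            x = self.skip_norm(self.skip_proj(torch.cat([x, enc_skip], dim=-1)))
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        # Text information is fused here, in the cross-attention module.
        x = x + self.cross_attn(h, text_ctx, text_ctx, need_weights=False)[0]
        return x + self.ffn(self.norm3(x))
```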
They used two-dimensional Rotary Positional Embedding (RoPE) for positional encoding, which captures both absolute position and relative
position dependency. To generate images at multiple resolutions, they tried extended positional encoding and
centralized interpolative positional encoding (CIPE). They find that CIPE converges faster and generalizes better.
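A minimal sketch of 2D RoPE under the common construction: split the head dimension in half, rotate one half by the row index and the other by the column index. The base frequency and the half-split are assumptions, not the paper's exact parameterization:

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, theta: float = 10000.0):
    """Apply 1D rotary embedding. x: (..., n_tokens, d), pos: (n_tokens,)."""
    d = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[:, None].float() * freqs[None, :]   # (n_tokens, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin             # rotate each 2D pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor):
    """x: (..., n_tokens, d) with d divisible by 4; rows/cols: (n_tokens,)."""
    d_half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :d_half], rows),
                      rope_1d(x[..., d_half:], cols)], dim=-1)

# Row/column index per flattened token of a 2x3 latent grid.
rows, cols = torch.meshgrid(torch.arange(2), torch.arange(3), indexing="ij")
q = torch.randn(1, 8, 6, 64)  # (batch, heads, tokens, head_dim)
q_rot = rope_2d(q, rows.flatten(), cols.flatten())
```

Because positions are grid coordinates rather than learned embeddings of a fixed length, the same scheme extends to unseen resolutions, which is what multi-resolution generation needs.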
To stabilize training, they used QK-Norm. They also add layer normalization after the skip module in the decoder blocks to avoid loss
explosion during training.
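A minimal sketch of QK-Norm, assuming per-head LayerNorm on queries and keys before the dot product; this shows the general technique, and the paper's exact norm variant may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Normalizing q and k bounds the attention logits, which is what
        # stabilizes large-scale training.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, n, self.n_heads, self.head_dim)
        q = self.q_norm(q.reshape(shape)).transpose(1, 2)
        k = self.k_norm(k.reshape(shape)).transpose(1, 2)
        v = v.reshape(shape).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # (b, heads, n, head_dim)
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))
```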
They found that certain operations, e.g., layer normalization, tend to overflow with FP16, so they
specifically switch them to FP32 to avoid numerical errors.
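A minimal sketch of that workaround, assuming autocast-style mixed precision where parameters stay in FP32: compute the norm in FP32 and cast the result back (the variance accumulation is where FP16 LayerNorm typically overflows):

```python
import torch
import torch.nn as nn

class FP32LayerNorm(nn.LayerNorm):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Upcast to FP32 for the normalization, then return to the input dtype.
        return super().forward(x.float()).to(x.dtype)
```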
Due to the large number of model parameters in Hunyuan-DiT and the massive amount of image data required for training, they
adopted ZeRO, flash attention, multi-stream asynchronous execution, activation checkpointing, and kernel fusion to improve training
speed. Deploying Hunyuan-DiT for users is expensive, so they adopt multiple engineering optimization strategies to improve the
inference efficiency, including ONNX graph optimization, kernel optimization, operator fusion, precomputation, and GPU memory
reuse.
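Of these, activation checkpointing and flash attention are easy to show in plain PyTorch (the snippet assumes torch >= 2.3 and a CUDA device); ZeRO, multi-stream execution, and kernel fusion live in the distributed/compiler stack (e.g., DeepSpeed, torch.compile) and are not shown:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint
from torch.nn.attention import sdpa_kernel, SDPBackend

def run_block(block, x, text_ctx):
    # Activation checkpointing: recompute this block's activations during
    # the backward pass instead of storing them, trading compute for memory.
    return checkpoint(block, x, text_ctx, use_reentrant=False)

# Flash attention via PyTorch's SDPA backend selection (needs fp16/bf16 on CUDA).
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    q = k = v = torch.randn(1, 8, 256, 64, device="cuda", dtype=torch.float16)
    out = F.scaled_dot_product_attention(q, k, v)
```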
They find that adversarial training tends to collapse, and that the best way to accelerate the model at inference time is progressive
distillation.
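A minimal sketch of one progressive-distillation training step; ddim_step(model, x, t, t_next) is a hypothetical deterministic-sampler helper, not a function from the paper. The student is trained so that one of its steps matches two teacher steps, halving the sampling step count each distillation round:

```python
import torch

def distill_step(student, teacher, ddim_step, x_t, t, t_mid, t_next):
    # Two teacher steps produce the regression target (no gradients needed).
    with torch.no_grad():
        x_mid = ddim_step(teacher, x_t, t, t_mid)
        x_target = ddim_step(teacher, x_mid, t_mid, t_next)
    # One student step should land where the teacher lands after two.
    x_student = ddim_step(student, x_t, t, t_next)
    return torch.mean((x_student - x_target) ** 2)
```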
Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with
Fine-Grained Chinese Understanding 14 May 2024
https://arxiv.org/pdf/2405.08748