Oğuzhan Ercan
x.com/oguzhannercan
They introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token and thus achieves linear complexity. Their experiments indicate that, by fine-tuning the attention layers on merely 10K self-generated samples for 10K iterations, they can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. They find that while formulation variation strategies have proven effective in attention-based UNets [38] and DiTs trained from scratch [62], they do not yield similar success with pre-trained DiTs. Key-value compression often leads to distorted details, and key-value sampling highlights the necessity of local tokens for each query to generate visually coherent results. They identify four elements as crucial for linearizing pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity.
Their proposal is that each query interacts only with tokens within a predefined distance r. Since the number of key-value tokens interacting with each query is fixed, the resulting DiT achieves linear complexity with respect to image resolution.
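A minimal sketch of this windowing idea (my own illustration assuming an h × w latent grid, not the authors' code). The dense Boolean mask below is written only for clarity and is itself O(N²); the linear-complexity claim relies on a kernel that gathers just the fixed set of in-window keys per query, e.g. a neighborhood-attention-style implementation.

```python
# Minimal sketch (not the authors' implementation) of CLEAR-style local attention:
# each query on an h x w latent grid attends only to key/value tokens within
# Euclidean distance r. Names, shapes, and the dense mask are illustrative.
import torch
import torch.nn.functional as F

def circular_local_mask(h: int, w: int, r: float) -> torch.Tensor:
    """Boolean (h*w, h*w) mask: True where a key lies inside the query's window."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (h*w, 2) positions
    return torch.cdist(coords, coords) < r                              # pairwise Euclidean test

def local_attention(q, k, v, mask):
    """q, k, v: (batch, heads, h*w, dim); mask: (h*w, h*w) boolean window mask."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))  # keys outside the window are ignored
    return F.softmax(scores, dim=-1) @ v

h = w = 16
mask = circular_local_mask(h, w, r=4.0)
q = k = v = torch.randn(1, 8, h * w, 64)
out = local_attention(q, k, v, mask)
print(out.shape, "avg keys per query:", mask.sum(dim=1).float().mean().item())
```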
LinFusion has shown that approaches such as linear attention achieve promising results in attention-based UNets. However, they find that this is not the case for pre-trained DiTs. They speculate that this is because attention layers are the only modules for token interactions in DiTs, unlike in U-Nets, so substituting all of them has a substantial impact on the final outputs. Other formulations, such as Sigmoid Attention, fail to converge within a limited number of iterations. High-rank attention maps means that the attention maps calculated by efficient attention alternatives should retain enough rank to capture the intricate token-wise relationships. Attention scores are largely concentrated along the diagonal, indicating that the attention maps do not exhibit the low-rank property assumed by many prior works (a small numerical illustration follows below); that is why methods like linear attention and Swin Transformer largely produce blocky patterns. Feature integrity implies that raw query, key, and value features are more favorable than compressed ones. Although PixArt-Sigma has demonstrated that applying KV compression on deep layers does not hurt performance much, this approach is not suitable for completely linearizing pre-trained DiTs. Methods based on KV compression, such as PixArt-Sigma and Agent Attention, tend to produce distorted textures compared to the results from Swin Transformer and Neighborhood Attention, which highlights the necessity of preserving the integrity of the raw query, key, and value tokens.
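As a rough, synthetic illustration of the high-rank observation mentioned above (not the paper's measurement; dimensions and tolerances are arbitrary), one can compute the diagonal mass and numerical rank of a softmax attention map whose tokens are most similar to themselves:

```python
# Synthetic illustration (not the paper's measurement): when tokens are most
# similar to themselves, the softmax attention map concentrates on the diagonal
# and keeps a high numerical rank, which a low-rank (linear-attention)
# factorization cannot represent well.
import torch

torch.manual_seed(0)
n, d = 256, 64
base = torch.randn(n, d)                 # shared per-token component -> strong self-similarity
q = base + 0.3 * torch.randn(n, d)
k = base + 0.3 * torch.randn(n, d)

attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)    # (n, n) attention map
diag_mass = attn.diagonal().mean().item()           # average weight a token assigns to itself
svals = torch.linalg.svdvals(attn)
num_rank = int((svals > 1e-3 * svals[0]).sum())     # numerical rank at a 1e-3 relative tolerance

print(f"mean diagonal weight: {diag_mass:.3f}, numerical rank: {num_rank}/{n}")
```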
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
20 Dec 2024
https://arxiv.org/pdf/2412.16112
CLEAR adopts circular windows, where key-value tokens within a Euclidean distance less than a predefined radius r are considered for each query. Compared with the corresponding square windows, the computational cost introduced by this design is ~π/4 times as much, as checked numerically below. Although each query only has access to tokens within a local window, stacking multiple Transformer blocks enables each token to gradually capture holistic information, similar to the way convolutional neural networks operate.
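A quick grid-count check of that ~π/4 figure (illustrative only, not from the paper): counting the tokens within Euclidean distance r of a query versus those in the enclosing square window, the ratio approaches π/4 ≈ 0.785 as the radius grows.

```python
# Quick check of the ~pi/4 cost ratio: count grid tokens whose Euclidean
# distance to the query is below r, versus the enclosing square window.
import math

def window_token_counts(r: int):
    circle = square = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            square += 1
            if dy * dy + dx * dx < r * r:   # "distance less than r", as in CLEAR
                circle += 1
    return circle, square

for r in (4, 8, 16, 32):
    c, s = window_token_counts(r)
    print(f"r={r:>2}  circle={c:>4}  square={s:>4}  ratio={c/s:.3f}  (pi/4 = {math.pi/4:.3f})")
```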
To promote functional consistency between models before and after fine-tuning, they employ a knowledge distillation objective during the fine-tuning process. Since attention is confined to a local
window around each query, CLEAR offers greater efficiency for multi-GPU patch-wise parallel
inference compared to the full attention in the original DiTs, which is particularly valuable for
generating ultra-high-resolution images. Specifically, each GPU is responsible for processing an image
patch, and inter-GPU communication is required only in the boundary areas.
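A conceptual single-process sketch of that boundary exchange (tensor slices stand in for devices; names and shapes are my own, not the paper's multi-GPU implementation): with window radius r, a query next to the split can only reach ⌈r⌉ − 1 rows on the other side, so only a thin halo of rows has to be communicated before each device runs local attention on its padded patch.

```python
# Conceptual single-process sketch (not the paper's multi-GPU code) of the
# boundary exchange: the h x w latent grid is split row-wise across two
# "devices"; with window radius r and strict distance < r, a boundary query
# reaches at most ceil(r) - 1 rows beyond the split, so only those rows need
# to be communicated before each device attends within its own patch.
import math
import torch

h, w, c = 64, 64, 16
r = 8.0
halo = math.ceil(r) - 1                 # rows reachable across the split
feat = torch.randn(h, w, c)             # latent tokens on the full grid

split = h // 2
top_local, bottom_local = feat[:split], feat[split:]        # what each device holds
halo_for_top = bottom_local[:halo]                          # rows the bottom device sends up
halo_for_bottom = top_local[-halo:]                         # rows the top device sends down

top_padded = torch.cat([top_local, halo_for_top], dim=0)
bottom_padded = torch.cat([halo_for_bottom, bottom_local], dim=0)

exchanged = 2 * halo * w
print(f"tokens exchanged: {exchanged} ({exchanged / (h * w):.1%} of the grid);"
      " full attention would instead require gathering all tokens on every device")
```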