LongAlign: Improving Long-Text Alignment for Text-to-Image Diffusion Models

Luping Liu*1,2, Chao Du†2, Tianyu Pang2, Zehan Wang*2,4, Chongxuan Li3,5, Dong Xu†1
1The University of Hong Kong; 2Sea AI Lab, Singapore; 3Renmin University of China;
4Zhejiang University; 5Beijing Key Laboratory of Big Data Management and Analysis Methods
luping.liu@connect.hku.hk

*Work done during Luping Liu and Zehan Wang's associate memberships at Sea AI Lab.
†Corresponding authors.


Abstract

The rapid advancement of text-to-image (T2I) diffusion models has enabled them to generate unprecedented results from given texts. However, as text inputs become longer, existing encoding methods like CLIP face limitations, and aligning the generated images with long texts becomes challenging. To tackle these issues, we propose LongAlign, which includes a segment-level encoding method for processing long texts and a decomposed preference optimization method for effective alignment training. For segment-level encoding, long texts are divided into multiple segments and processed separately. This method overcomes the maximum input length limits of pretrained encoding models. For preference optimization, we provide decomposed CLIP-based preference models to fine-tune diffusion models. Specifically, to utilize CLIP-based preference models for T2I alignment, we delve into their scoring mechanisms and find that the preference scores can be decomposed into two components: a text-relevant part that measures T2I alignment and a text-irrelevant part that assesses other visual aspects of human preference. Additionally, we find that the text-irrelevant part contributes to a common overfitting problem during fine-tuning. To address this, we propose a reweighting strategy that assigns different weights to these two components, thereby reducing overfitting and enhancing alignment. After fine-tuning $512\times 512$ Stable Diffusion (SD) v1.5 for about 20 hours using our method, the fine-tuned SD outperforms stronger foundation models in T2I alignment, such as PixArt-$\alpha$ and Kandinsky v2.2.

Method

Segment-Level Text Encoding

Problem: CLIP-like models are commonly used for representation encoding, result evaluation, and reward fine-tuning, but existing CLIP-based models cap the input text length (CLIP's text encoder accepts at most 77 tokens).

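Solution: We divide long texts into multiple segments and encode each segment separately, which overcomes the maximum input length limits of pretrained encoding models.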
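A minimal sketch of segment-level encoding as described in the abstract may help make the idea concrete. It assumes sentence-boundary splitting and plain concatenation of per-segment embeddings; neither detail is specified on this page. `toy_encode` is a hypothetical stand-in for a pretrained text encoder such as CLIP, and `EMBED_DIM`, `split_into_segments`, and `encode_long_text` are illustrative names, not part of the released code.

```python
import re
import numpy as np

MAX_TOKENS = 77  # context length of CLIP's text encoder
EMBED_DIM = 8    # toy embedding width, chosen only for this sketch


def split_into_segments(text):
    """Split a long prompt into sentence-level segments (assumed splitting rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_encode(segment):
    """Hypothetical stand-in for a pretrained text encoder.

    Returns one embedding row per whitespace token and, like a real
    length-limited encoder, silently drops tokens beyond MAX_TOKENS.
    """
    tokens = segment.split()[:MAX_TOKENS]
    rows = []
    for tok in tokens:
        # Deterministic pseudo-embedding derived from the token's characters.
        rng = np.random.default_rng(sum(ord(c) for c in tok))
        rows.append(rng.standard_normal(EMBED_DIM))
    return np.stack(rows)


def encode_long_text(text):
    """Encode each segment separately, then concatenate along the token axis."""
    segments = split_into_segments(text)
    if not segments:
        return np.zeros((0, EMBED_DIM))
    return np.concatenate([toy_encode(s) for s in segments], axis=0)
```

Encoding a long prompt in one pass with `toy_encode` keeps only the first `MAX_TOKENS` tokens, whereas `encode_long_text` keeps every token as long as each individual segment fits within the limit.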

Preference Decomposition and Reweighting

Problem: Preference optimization can effectively enhance T2I diffusion models, but the fine-tuning process is prone to significant overfitting.

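Solution: We decompose the CLIP-based preference score into a text-relevant part that measures T2I alignment and a text-irrelevant part that assesses other visual aspects of human preference, then assign different weights to the two parts to reduce overfitting and enhance alignment.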
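The decomposition and reweighting described in the abstract can be illustrated with a small sketch. It assumes the preference score is a dot product between a text embedding and an image embedding, and that the text-irrelevant part is the contribution of a direction shared across text embeddings (`common_dir`, for example their mean). That construction, the function names, and the default weight `w=0.3` are assumptions made for illustration; this page does not give the exact formulation.

```python
import numpy as np


def decompose_score(text_emb, image_emb, common_dir):
    """Split the score text_emb @ image_emb into two additive parts.

    common_dir is an assumed direction shared by all text embeddings.
    The part of the score carried by that direction is treated as
    text-irrelevant; the remainder is treated as text-relevant.
    """
    e = common_dir / np.linalg.norm(common_dir)
    coeff = text_emb @ e                      # projection of the text onto e
    text_irrelevant = coeff * (e @ image_emb)
    text_relevant = (text_emb - coeff * e) @ image_emb
    return text_relevant, text_irrelevant


def reweighted_score(text_emb, image_emb, common_dir, w=0.3):
    """Down-weight the text-irrelevant part by a hypothetical weight w."""
    relevant, irrelevant = decompose_score(text_emb, image_emb, common_dir)
    return relevant + w * irrelevant
```

With `w=1.0` the reweighted score equals the original dot product, and with `w=0.0` only the text-relevant part remains, so `w` interpolates between the two regimes.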

Result

The results compare the original SDXL with our LongAlign fine-tuned version of SDXL.
Image 1 (w/o LongAlign) Image 2 (w/ LongAlign) Prompt
Baseline Result LongAlign Result
The image captures the exterior of an HSBC bank branch. Dominating the scene is a gray building, its facade punctuated by a large window on the right side. This window, framed in black, is divided into six panes, each reflecting the world outside. Above this window, a red and white sign proudly displays the HSBC logo. The letters "HSBC" are written in black, standing out against the white background of the sign. To the right of the logo, a red diamond-shaped symbol adds a splash of color to the scene. The image is a blend of urban architecture and corporate branding, a snapshot of a moment in the life of the city.
Baseline Result LongAlign Result
The image depicts a fantastical scene featuring a large ship with a distinctive design. The ship is predominantly dark in color, with intricate carvings and decorations that suggest a historical or mythical inspiration. The most striking feature of the ship is the presence of a large, elephant-like head at the bow, which is facing towards the viewer. This head is detailed and realistic, with tusks and a trunk that are prominently displayed. The ship is sailing on a body of water, with a clear sky above and a calm, reflective surface below. In the background, there is a rocky outcrop that adds to the sense of a natural, outdoor setting. On the deck of the ship, there are several people visible, although they are too small to discern any specific details about them. The ship is also adorned with multiple flags, which are attached to the mast and the bow. These flags are not detailed enough to identify any specific symbols or insignias. The overall style of the image is realistic with a touch of fantasy, as evidenced by the elephant head and the elaborate carvings on the ship. The lighting and shadows suggest that the image is set during the daytime, with the sun casting a warm glow on the scene. There are no visible texts or brands in the image.
Baseline Result LongAlign Result
The image presents a 3D rendering of a futuristic car, which is the central focus of the composition. The car is predominantly black, with a glossy finish that reflects the surrounding environment. It's equipped with a large, transparent bubble-like dome on the roof, through which a small astronaut can be seen. The astronaut, dressed in a black suit with a helmet, is floating in space, surrounded by stars and planets. The car is not just a vehicle but a spacecraft, as indicated by the presence of the astronaut and the celestial backdrop. The car's design is sleek and modern, with a curved front and a pointed rear. The wheels are large and silver, adding to the futuristic aesthetic. The background of the image is a dark blue, providing a stark contrast to the black car and the astronaut. This contrast further emphasizes the car and the astronaut, making them the focal points of the image. Overall, the image is a blend of science fiction and modern design, creating a visually striking and imaginative scene.
Baseline Result LongAlign Result
The image is a close-up portrait of a man with a serious expression. His hair is styled in a slicked-back manner, and his facial features are highlighted by the lighting, which casts a dramatic shadow on his face. The man's eyes are directed towards the camera, and his eyebrows are slightly furrowed, adding to the intensity of his expression. The most striking element of the image is the fire that appears to be emanating from the man's neck and shoulders. The fire is depicted in a realistic style, with orange and yellow hues that suggest a bright, intense flame. The fire is not contained within the image; it seems to be flowing outward, creating a sense of movement and energy. The background of the image is dark, which serves to highlight the man and the fire. The darkness also helps to emphasize the contrast between the man's skin and the fiery elements in the image. Overall, the image is a powerful and dramatic portrait that combines realistic elements with artistic flair. The use of fire as a visual motif adds a layer of intrigue and mystery to the image, inviting the viewer to wonder about the story behind the scene.
Baseline Result LongAlign Result
In the center of the image, a daring motorcyclist is captured in mid-air, performing a thrilling stunt on an orange and black dirt bike. The rider, clad in a black and white helmet, grips the handlebars tightly, demonstrating control and precision. The bike is tilted slightly to the left, adding to the sense of motion and excitement. The setting is a large indoor stadium, filled with a crowd of spectators who are watching the spectacle unfold. Their faces are a blur of anticipation and awe. The background is adorned with various advertisements, adding a splash of color and life to the scene. Despite the action-packed nature of the image, there's a certain harmony to it. The motorcyclist, the bike, the crowd, and the stadium all come together to create a snapshot of a moment filled with adrenaline and excitement. It's a testament to the skill and courage of the rider, and the thrill of the sport.
Baseline Result LongAlign Result
The image is a digital artwork of an animated female character. She has long, flowing blonde hair and is wearing a white armor with gold accents. The armor features a high collar and a chest plate with a cross symbol in the center. The character is holding a large, blue sword with a glowing blade. The background is a dark blue with red and blue sparks or particles floating around her. The character's expression is intense, with her eyes focused and her mouth slightly open. The overall style of the image is reminiscent of Japanese anime or manga.
Baseline Result LongAlign Result
The image is a digital artwork depicting a nighttime scene. In the foreground, there are three silhouetted trees with curved branches, suggesting a tranquil setting. The trees are set against a dark sky, which is filled with numerous stars and a faint, milky-way-like nebula. The colors in the sky transition from deep blues at the top to lighter purples and pinks near the horizon, indicating either dawn or dusk. The overall atmosphere of the image is serene and somewhat mystical, with a sense of depth and vastness conveyed by the starry sky. There are no visible texts or distinguishing marks that provide additional context or information about the image. The style of the artwork is realistic with a focus on creating a peaceful and somewhat ethereal nighttime landscape.
Baseline Result LongAlign Result
In the image, there's a lively scene unfolding in a room with a wooden floor and a white wall in the background. Two individuals are engaged in a discussion, standing in front of a bulletin board. The bulletin board, made of wood, is adorned with two posters. One poster is white with black text, while the other is black with white text. The person on the left, clad in a green shirt, is gesturing with their hands, perhaps emphasizing a point or explaining something. On the right, the other person, wearing a black jacket, is attentively listening, their gaze fixed on the person in green. The interaction between the two individuals suggests a discussion or presentation of some sort. The bulletin board, with its two posters, serves as the backdrop for this exchange.
Baseline Result LongAlign Result
The image portrays a female character with a fantasy-inspired design. She has long, dark hair that cascades down her shoulders. Her skin is pale, and her eyes are a striking shade of blue. The character's face is adorned with intricate gold and pink makeup, which includes elaborate patterns and designs around her eyes and on her cheeks. Atop her head, she wears a crown made of gold and pink roses, with the roses arranged in a circular pattern. The crown is detailed, with each rose appearing to have a glossy finish. The background of the image is dark, which contrasts with the character's pale skin and the bright colors of her makeup and attire. The lighting in the image highlights the character's features and the details of her makeup and attire, creating a dramatic and captivating effect. There are no visible texts or brands in the image. The style of the image is highly stylized and artistic, with a focus on the character's beauty and the intricate details of her makeup and attire. The image is likely a digital artwork or a concept illustration, given the level of detail and the fantastical elements present.
Baseline Result LongAlign Result
The image presents a dramatic and surreal scene set against a dark, cloudy sky. Dominating the center of the image is a large, black ring, which appears to be a portal or vortex. This ring is encircled by a halo of light, creating a stark contrast against the dark sky. The light seems to emanate from the center of the ring, suggesting a source of power or energy. Beyond the ring, the sky is filled with clouds that are illuminated by the light from the ring. The clouds are dense and appear to be in motion, adding a sense of dynamism to the scene. The colors in the image are predominantly dark and black, with the light from the ring providing a stark contrast. The overall composition of the image is balanced, with the ring centrally located and the clouds filling the rest of the frame. The use of light and shadow creates a sense of depth and dimension, drawing the viewer's eye towards the center of the image. Despite the fantastical elements, the image is grounded in a realistic aesthetic. The clouds and sky are rendered with a high level of detail, and the colors are naturalistic. This combination of realism and fantasy creates a visually striking image that invites the viewer to imagine what lies beyond the ring.
Baseline Result LongAlign Result
The image is a digital illustration featuring a stylized female character in the center. She has a short, blue, curly hairstyle and is wearing a bright orange jacket with a blue collar and a white logo on the left chest area. The jacket is unbuttoned, revealing a blue top underneath. She is also wearing black pants with a white stripe down the side, and her shoes are white with black accents. The character is standing against a vibrant orange background. To her right, there are various objects and illustrations that seem to be related to technology and gaming. These include a white game controller, a gray electronic device with buttons and a screen, a white keyboard, and a white monitor displaying a colorful graphic. In the top left corner of the image, there is a small, blue, curly hairstyle that matches the character's hair, suggesting a connection or theme between the character and the hairstyle. The overall style of the image is modern and graphic, with a focus on bold colors and clean lines. The illustration has a dynamic and energetic feel, with a sense of movement and activity suggested by the character's pose and the surrounding objects.

Citation

Please cite our paper if you find this work useful:

@article{liu2024improving,
         title={Improving Long-Text Alignment for Text-to-Image Diffusion Models},
         author={Liu, Luping and Du, Chao and Pang, Tianyu and Wang, Zehan and Li, Chongxuan and Xu, Dong},
         journal={arXiv preprint arXiv:2410.11817},
         year={2024}
}