"Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.05415
The original Show-o model, while versatile for both image and text generation, suffers from slow inference speeds due to its iterative generation processes for both modalities. This paper addresses this inefficiency by introducing Show-o Turbo, a method to accelerate both image and text generation in a unified multimodal model.
This paper proposes to shorten the generation process in Show-o by applying consistency distillation to its multimodal denoising trajectories. This approach treats both image and text generation as denoising tasks, enabling a unified acceleration strategy.
-----
📌 Show-o Turbo cleverly unifies text and image generation as denoising. Jacobi decoding enables parallel text token refinement, mirroring image token processing for consistent acceleration.
📌 This paper demonstrates effective generalization of consistency distillation to discrete multimodal generation. Trajectory segmentation stabilizes training and achieves few-step, high-quality multimodal output.
📌 The key innovation is practical acceleration: Show-o Turbo delivers a 1.5x inference speedup on multimodal tasks while maintaining strong generation quality with significantly fewer sampling steps.
----------
Methods Explored in this Paper 🔧:
→ This paper introduces a unified perspective by viewing text generation as a denoising process, similar to image generation in Show-o. This is achieved by applying Jacobi decoding, a parallel text decoding algorithm, to Show-o. Jacobi decoding refines multiple text tokens simultaneously, mimicking the parallel denoising of image tokens.
→ Show-o Turbo employs consistency distillation to shorten the denoising trajectories for both images and text. Consistency distillation trains Show-o Turbo to map any point on the original Show-o's generation trajectory to the final output in fewer steps. This is done by minimizing the difference between the predictions of the student model (Show-o Turbo) and the teacher model (Show-o) at different points in the generation process.
→ Trajectory segmentation and curriculum learning are used to improve training. The long denoising trajectory is divided into segments. Consistency distillation is applied within each segment. Training proceeds in stages with decreasing segment lengths, making learning more manageable and improving convergence.
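To make the Jacobi-decoding idea above concrete, here is a minimal runnable sketch: every token position is refined in parallel from the current guess until the sequence stops changing, at which point it matches greedy autoregressive decoding. The `toy_next_token` "model" is a hypothetical deterministic stand-in, not Show-o's actual transformer.

```python
def toy_next_token(prefix):
    """Hypothetical deterministic 'language model': next token from a prefix."""
    return (sum(prefix) * 3 + 7) % 11

def autoregressive_decode(prompt, n):
    """Standard one-token-at-a-time greedy decoding (the baseline)."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n):
    """Refine all n positions in parallel until a fixed point is reached."""
    y = [0] * n  # arbitrary initial guess for every position
    iterations = 0
    while True:
        iterations += 1
        # Each position is updated simultaneously from the current guess.
        new_y = [toy_next_token(list(prompt) + y[:i]) for i in range(n)]
        if new_y == y:  # fixed point: no token changed this round
            return y, iterations
        y = new_y

tokens, iters = jacobi_decode([2], 5)
print(tokens, iters)  # matches autoregressive_decode([2], 5); at most n+1 iterations
```

Because a position becomes correct (and stays correct) once its whole prefix is correct, the fixed point is guaranteed within n+1 iterations, and in practice many tokens stabilize early, which is what makes the parallel refinement view of text generation amenable to trajectory shortening.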
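The consistency-distillation objective described above can be sketched on a toy 1-D denoising trajectory. All names here (`teacher_step`, `student`, the linear dynamics) are illustrative assumptions, not Show-o Turbo's actual networks, which operate on discrete image/text token trajectories.

```python
def teacher_step(x, t):
    """Hypothetical frozen teacher: one denoising step along its trajectory."""
    return 0.9 * x  # toy dynamics decaying toward the clean sample at 0

def student(x, t, w):
    """Hypothetical student: predicts the trajectory endpoint from (x, t)."""
    return w * x

def consistency_loss(x_t, t, w):
    # The student's endpoint predictions at adjacent points on the teacher's
    # trajectory should agree; the earlier-point prediction serves as the
    # target and is treated as a constant (stop-gradient) in a real framework.
    x_prev = teacher_step(x_t, t)       # one teacher denoising step
    pred = student(x_t, t, w)
    target = student(x_prev, t - 1, w)  # detach/stop-grad in practice
    return (pred - target) ** 2

print(consistency_loss(1.0, 10, w=0.5))
```

Minimizing this mismatch over points sampled along the teacher's trajectory is what lets the student jump from any intermediate state toward the final output in far fewer steps; the paper's trajectory segmentation simply restricts which pairs of points the loss is applied between at each curriculum stage.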
-----
Key Insights 💡:
→ Treating text generation as denoising, through parallel decoding methods like Jacobi decoding, allows for a unified acceleration approach for multimodal models. This perspective bridges the gap between image and text generation processes in Show-o.
→ Consistency distillation, originally developed for continuous diffusion models, can be effectively generalized and extended to accelerate discrete multimodal models like Show-o. This extension enables significant speedups without sacrificing generation quality.
→ Trajectory segmentation and curriculum learning are crucial for the successful application of consistency distillation to complex models like Show-o. These techniques stabilize training and enhance the effectiveness of distillation by breaking down the learning task into manageable stages.
-----
Results 📊:
→ Show-o Turbo achieves a GenEval score of 0.625 in text-to-image generation with only 4 sampling steps and no classifier-free guidance (CFG), outperforming the original Show-o with 8 steps and CFG, which scores 0.580 (Table 1).
→ In image-to-text generation tasks, Show-o Turbo demonstrates a 1.5x speedup in inference time compared to the original Show-o, while maintaining comparable performance on image description benchmarks like Flickr30K and NoCaps (Table 2).
→ On multimodal understanding tasks, Show-o Turbo maintains competitive performance on benchmarks like POPE, MME, and MMMU, demonstrating its ability to accelerate without significantly degrading understanding capabilities (Table 2).