
RecA: Reconstruction Alignment Improves Unified Multimodal Models

Unlocking the Massive Zero-shot Potential in Unified Multimodal Models through Self-supervised Learning

Ji Xie¹, Trevor Darrell¹, Luke Zettlemoyer², XuDong Wang¹*

¹UC Berkeley  ²University of Washington

Interactive Demo

Experience RecA's capabilities firsthand with our interactive demonstration


Try the Live Demo

Abstract

Unified multimodal models (UMMs) are designed to perform both vision understanding and image generation within a single architecture. While they have achieved strong performance on image-text understanding tasks, their generation capabilities often lag behind, revealing a misalignment between what the model understands and what it can produce. We identify this disconnect as a consequence of sparse and biased text supervision in conventional training.

We propose RecA, a self-supervised training framework that aligns understanding and generation through image reconstruction at the semantic level. By reconstructing images from their own vision encoder embeddings, UMMs receive dense, semantically grounded supervision—free of captions or paired image-text data. This alignment mechanism effectively bridges the modality gap and improves generation fidelity.

Despite its simplicity, our approach delivers strong gains for unified multimodal models across generation and editing tasks. Applied to a 1.5B parameter UMM, RecA achieves state-of-the-art results on GenEval (0.90) and DPGBench (88.15), outperforming models with significantly larger scale. More impressively, RecA achieves this with modest compute, requiring just 8,000 unlabeled images and 6×A100 GPUs for 4.5 hours (27 GPU-hours).

Teaser

RecA delivers substantial zero-shot gains in image generation and editing while preserving multimodal understanding.

RecA Teaser

RecA: Semantic-Level Image Reconstruction

The Challenge

As shown in the figure, longer captions capture more detail but still cannot fully represent the original image: they miss the overall layout, object shapes, instance-level attributes, and other fine-grained visual cues.

Vision embeddings from the understanding vision encoder, by contrast, are already mapped into the UMM's representation space and retain much richer visual information. Can we prompt the UMM with embeddings from its visual understanding model to close this information gap?

Our Solution

RecA implements a self-supervised training paradigm: an understanding vision encoder extracts features from the input image, these features are fused with template text embeddings, and the result is fed into the Unified Multimodal Model (UMM) to regenerate the image.

The self-supervised loss between the original and regenerated images is used to optimize the UMM, providing dense supervision that captures the fine-grained details captions omit.

Figure 1: RecA Pipeline Overview
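To make the pipeline concrete, below is a minimal PyTorch sketch of one reconstruction-alignment step. It is illustrative only: `vision_encoder`, `template_text_embeds`, and `umm.generate_from_embeddings` are hypothetical placeholders for whatever hooks a given UMM exposes, and the pixel-space MSE stands in for each model's native generation objective (e.g., a diffusion or token-prediction loss).

```python
import torch
import torch.nn.functional as F

def reca_step(umm, vision_encoder, template_text_embeds, images, optimizer):
    """One illustrative RecA training step (hypothetical API, not the authors' exact code).

    1. Encode the input images with the understanding vision encoder.
    2. Fuse the resulting embeddings with a fixed template text prompt.
    3. Ask the UMM to regenerate the image from this fused conditioning.
    4. Optimize a self-supervised reconstruction loss against the original image.
    """
    with torch.no_grad():
        # Dense semantic embeddings; the encoder is kept frozen in this sketch.
        vis_embeds = vision_encoder(images)                          # (B, N, D)

    # Prepend the template text embeddings to form the conditioning sequence.
    batch = images.size(0)
    cond = torch.cat([template_text_embeds.expand(batch, -1, -1), vis_embeds], dim=1)

    # Assumed hook: differentiable generation conditioned on embeddings.
    recon = umm.generate_from_embeddings(cond)                       # (B, C, H, W)

    # Pixel-space MSE as a stand-in for the UMM's native generation loss.
    loss = F.mse_loss(recon, images)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that no captions or paired image-text data are required: the conditioning comes entirely from the image itself, which is what makes the supervision dense and caption-free.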

State-of-the-Art Performance

After only a few training steps, all models post large zero-shot gains in generation capability with no loss in vision-understanding accuracy. Our fine-tuned Harmon model, with just 1.5B parameters, reaches 0.86 on GenEval and 87.21 on DPGBench, significantly outperforming previous state-of-the-art models without any GPT-4o-Image distillation data or reinforcement learning.

The most effective recipe is a two-stage strategy, supervised fine-tuning (SFT) followed by reconstruction tuning, which reaches 0.90 on GenEval and 88.15 on DPGBench (see the training-schedule sketch after the table below).

Table 1: Benchmark Comparison
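As a rough illustration of the two-stage recipe, the sketch below runs a standard SFT phase and then a short reconstruction-alignment phase that reuses `reca_step` from the pipeline sketch above. The `umm.sft_loss` hook and the step counts are assumptions for illustration, not the authors' exact configuration.

```python
def train_two_stage(umm, vision_encoder, template_text_embeds, optimizer,
                    sft_loader, unlabeled_loader,
                    sft_steps=10_000, reca_steps=1_000):
    """Illustrative two-stage schedule: SFT first, then reconstruction alignment."""
    # Stage 1: conventional supervised fine-tuning on paired image-caption data.
    for _, (images, captions) in zip(range(sft_steps), sft_loader):
        loss = umm.sft_loss(images, captions)  # assumed per-model SFT objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Stage 2: caption-free reconstruction alignment on unlabeled images only.
    for _, images in zip(range(reca_steps), unlabeled_loader):
        reca_step(umm, vision_encoder, template_text_embeds, images, optimizer)
```

Only the second stage is RecA-specific; it consumes unlabeled images and needs no captions.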

Enhanced Editing Capabilities

Surprisingly, we find that our method also significantly improves the editing performance of models that already have image editing capabilities. RecA delivers consistent improvements across all editing categories, raising the ImgEdit score from 3.38 to 3.75 and the GEdit score from 6.94 to 7.25, using only 1,000 training steps and 8,000 unlabeled images.

Our method unlocks the model's inherent editing potential without expensive annotations, improving tasks such as addition, replacement, stylization, and color modification.

Image Editing Results

Enhanced Generalizability

Across Different Architectures and Tasks

RecA achieves consistent performance gains across different UMM frameworks, showcasing its generalizability. We apply RecA to various unified multimodal models including Show-o (AR), Harmon (AR+MAR), OpenUni (AR+Diffusion), and BAGEL (AR+Diffusion).

All models show significant improvements with RecA; the largest gain is on Harmon-1.5B, whose GenEval score rises to 85.7 (+12.8). The gains are most pronounced on the Position and Color Attribution tasks, and the fine-tuned models maintain correct subjects, attribute bindings, and positions in cases with multiple objects, complex attributes, and explicit spatial layouts.

Text-to-Image Generation Results
