Unifying Visual Understanding and Generation via Text-Aligned Representations


Abstract

This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency.

Method

Framework. Tar is a unified multimodal LLM for both visual understanding and generation, consisting of an autoregressive LLM, a visual tokenizer (TA-Tok), and a visual de-tokenizer. Unlike previous works, Tar operates on fully discrete, text-aligned visual tokens, eliminating the need for modality-specific designs such as visual projectors. As a result, Tar can be trained with the standard next-token prediction objective; a minimal sketch of this idea follows the figure below.

[Figure: Overview of the Tar framework]
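
Because vision and text share one discrete vocabulary, training reduces to ordinary language modeling. The snippet below is a minimal sketch of this idea, not the released training code: the base checkpoint name and the codebook size are illustrative assumptions, and the <I...> token format follows the inference code further below.

import torch
from transformers import AutoTokenizer, Qwen2ForCausalLM

# Illustrative checkpoint and codebook size; the released Tar models may differ.
BASE_MODEL = "Qwen/Qwen2-1.5B-Instruct"
CODEBOOK_SIZE = 8192  # assumed TA-Tok codebook size, for illustration only

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = Qwen2ForCausalLM.from_pretrained(BASE_MODEL)

# Expand the LLM vocabulary with visual tokens <I0> ... <I{N-1}>.
tokenizer.add_tokens([f"<I{i}>" for i in range(CODEBOOK_SIZE)], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# A mixed text-and-image sequence is trained with plain next-token prediction;
# the model shifts the labels internally and computes cross-entropy over the
# shared (text + visual) vocabulary.
batch = tokenizer("Describe the image: <I12><I345><I678>", return_tensors="pt")
loss = model(**batch, labels=batch.input_ids).loss

Because the loss is identical for text and visual targets, understanding (image tokens in, text out) and generation (text in, image tokens out) are handled through the same interface.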

Visual Tokenizer. The key design of Tar is its text-aligned visual tokenizer, TA-Tok. It adds a vector quantization module to a pretrained SigLIP2 encoder, converting input images into semantic, discrete tokens. Unlike other discrete tokenizers (e.g., VQ-VAE), TA-Tok directly uses the LLM's token embeddings as its codebook, so each visual token can be read as a transformed LLM token. Training a unified MLLM with TA-Tok is therefore akin to teaching the LLM a foreign language; a minimal sketch of the quantization step follows the figure below.

[Figure: The TA-Tok visual tokenizer]
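
For intuition, the following is a minimal sketch of the text-aligned quantization step, assuming SigLIP2 patch features are projected into the LLM embedding space and snapped to their nearest codebook entry by cosine similarity. The class name, dimensions, and similarity metric are assumptions rather than the released TA-Tok implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAlignedQuantizer(nn.Module):
    """Illustrative sketch: quantize projected vision features against a
    codebook built from the LLM's token embeddings."""

    def __init__(self, vision_dim: int, llm_embeddings: torch.Tensor):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_embeddings.shape[1])  # vision -> LLM space
        self.register_buffer("codebook", llm_embeddings)            # frozen LLM embeddings

    def forward(self, vision_feats: torch.Tensor):
        # vision_feats: (B, N, vision_dim) patch features from a SigLIP2 encoder
        z_e = self.proj(vision_feats)
        sim = F.normalize(z_e, dim=-1) @ F.normalize(self.codebook, dim=-1).t()
        indices = sim.argmax(dim=-1)          # discrete, text-aligned token ids
        z_q = self.codebook[indices]          # embeddings of the matched LLM tokens
        # Straight-through estimator so gradients still reach the projector.
        z_q = z_e + (z_q - z_e).detach()
        return indices, z_q

In a setup like this, the codebook could be initialized from the LLM's input embedding matrix (e.g., model.get_input_embeddings().weight), which is why each visual token can be interpreted as a transformed LLM token.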

De-Tokenizer. Since TA-Tok is fully text-aligned, it cannot reconstruct pixels directly the way a VQ-VAE can. Instead, we propose visual de-tokenizers that decode visual tokens back into images, in two variants: an autoregressive model and a diffusion-based model. The AR de-tokenizer works well with the discrete visual tokens from TA-Tok, while the diffusion-based de-tokenizer can leverage pretrained models for fast adaptation.
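
As a rough sketch of the decoding interface, the code below assumes the AR variant is a small causal transformer that takes TA-Tok's semantic tokens as a prefix and autoregressively predicts pixel-level VQ codes, which a separate VQ decoder (not shown) would render into an image; the diffusion variant would instead condition a pretrained diffusion model on the same tokens. Class names, shapes, and the greedy decoding loop are illustrative assumptions, not the released architecture.

import torch
import torch.nn as nn

class ARDeTokenizer(nn.Module):
    """Illustrative sketch of an autoregressive de-tokenizer: conditioned on
    TA-Tok's semantic tokens, it generates pixel-level VQ codes one at a time."""

    def __init__(self, semantic_vocab: int, pixel_vocab: int,
                 dim: int = 512, num_pixel_tokens: int = 256):
        super().__init__()
        self.sem_embed = nn.Embedding(semantic_vocab, dim)   # prefix condition
        self.pix_embed = nn.Embedding(pixel_vocab, dim)      # generated codes
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, pixel_vocab)
        self.num_pixel_tokens = num_pixel_tokens

    @torch.no_grad()
    def decode(self, semantic_tokens: torch.LongTensor) -> torch.LongTensor:
        # semantic_tokens: (B, N_sem) discrete indices from TA-Tok
        seq = self.sem_embed(semantic_tokens)
        codes = []
        for _ in range(self.num_pixel_tokens):
            mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
            h = self.backbone(seq, mask=mask)
            next_code = self.head(h[:, -1]).argmax(dim=-1)   # greedy, for brevity
            codes.append(next_code)
            seq = torch.cat([seq, self.pix_embed(next_code).unsqueeze(1)], dim=1)
        # A pixel VQ decoder (not shown) maps the returned codes to an image.
        return torch.stack(codes, dim=1)                     # (B, num_pixel_tokens)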

Implementation

The trained Tar model is a standard LLM with an expanded visual vocabulary. As shown in the code below, the finetuned Qwen2 model can understand and generate TA-Tok's discrete tokens. We do not need to modify Qwen2's architecture; we only feed TA-Tok's discrete tokens into Qwen2 and, for generation, decode them with the de-tokenizer.



from dataclasses import dataclass

from PIL import Image
from torchvision.transforms.functional import to_tensor
from transformers import AutoTokenizer, Qwen2ForCausalLM

from tok.ta_tok import TextAlignedTokenizer


@dataclass
class I2TConfig:
    # Minimal config assumed for this snippet.
    model_path: str   # path to the finetuned Qwen2 checkpoint
    ta_tok_path: str  # path to the TA-Tok checkpoint


class ImageToTextInference:
    def __init__(self, config: I2TConfig):
        self.config = config
        self.model = Qwen2ForCausalLM.from_pretrained(config.model_path)
        self.text_tokenizer = AutoTokenizer.from_pretrained(config.model_path)
        self.visual_tokenizer = TextAlignedTokenizer.from_checkpoint(
            config.ta_tok_path, load_teacher=False, input_type='indices')

    def generate(self, image_path: str, prompt: str) -> str:
        image = Image.open(image_path).convert('RGB')
        image = to_tensor(image).unsqueeze(0)

        # Encode the image into TA-Tok's discrete token indices.
        image_code = self.visual_tokenizer(image)['encoded']
        image_text = "".join(f"<I{x}>" for x in image_code[0].cpu().tolist())

        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"{image_text}\n{prompt}"}]

        input_text = self.text_tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True)
        inputs = self.text_tokenizer(input_text, return_tensors="pt")

        gen_ids = self.model.generate(
            inputs.input_ids, max_new_tokens=256, do_sample=True)
        # Strip the prompt tokens and decode only the newly generated answer.
        return self.text_tokenizer.batch_decode(
            gen_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
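
The reverse, text-to-image direction follows the same pattern. The sketch below is illustrative rather than the released inference script: the de-tokenizer's decode interface is an assumption, and only the <I...> token format is taken from the code above.

import re
import torch
from transformers import AutoTokenizer, Qwen2ForCausalLM

class TextToImageInference:
    def __init__(self, model_path: str, detokenizer):
        self.model = Qwen2ForCausalLM.from_pretrained(model_path)
        self.text_tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.detokenizer = detokenizer  # AR or diffusion de-tokenizer (assumed interface)

    def generate(self, prompt: str) -> torch.Tensor:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}]
        input_text = self.text_tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True)
        inputs = self.text_tokenizer(input_text, return_tensors="pt")

        gen_ids = self.model.generate(
            inputs.input_ids, max_new_tokens=1024, do_sample=True)
        output = self.text_tokenizer.batch_decode(
            gen_ids[:, inputs.input_ids.shape[1]:])[0]

        # Parse the generated visual tokens <I123><I456>... back into indices.
        codes = torch.tensor([[int(i) for i in re.findall(r"<I(\d+)>", output)]])
        # Map the discrete codes back to pixels with the de-tokenizer (assumed API).
        return self.detokenizer.decode(codes)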
          

Experiment

Results on Visual Understanding Benchmarks

* Token: Token type, including Continuous (C), Discrete (D), Semantic (S), Pixel (P) and Hybrid (H).

| Model | LLM Size | Token | POPE↑ | MME-P↑ | MME-C↑ | MMB↑ | SEED↑ | GQA↑ | MMMU↑ |
|---|---|---|---|---|---|---|---|---|---|
| Show-o | 1.3B | D,P | 80.0 | 1097 | 248 | - | - | 58.0 | 26.7 |
| Harmon | 1.5B | C,H | 87.6 | 1155 | 321 | 65.5 | 67.1 | 58.9 | 38.9 |
| Janus | 1.5B | C,S | 87.0 | 1338 | 222 | 69.4 | 63.7 | 59.1 | 30.5 |
| Janus-Pro | 1.5B | C,S | 86.2 | 1444 | 268 | 75.5 | 68.3 | 59.3 | 36.3 |
| D-DiT | 2.0B | C,P | 84.0 | 1125 | - | - | - | 59.2 | - |
| Tar (Ours) | 1.5B | D,S | 88.4 | 1390 | 342 | 65.6 | 70.4 | 61.1 | 36.0 |
| ILLUME | 7B | C,S | 88.5 | 1445 | - | 65.1 | 72.9 | - | 38.2 |
| Chameleon | 7B | D,P | - | - | - | - | - | - | 22.4 |
| LWM | 7B | D,P | 75.2 | - | - | - | - | 44.8 | - |
| Liquid | 7B | D,P | 81.1 | 1119 | - | - | - | 58.4 | - |
| UniTok | 7B | D,H | 83.2 | 1448 | - | - | - | 61.1 | - |
| VILA-U | 7B | D,H | 85.8 | 1402 | - | - | 59.0 | 60.8 | - |
| Janus-Pro | 7B | C,S | 87.4 | 1567 | 260 | 79.2 | 72.1 | 62.0 | 41.0 |
| MetaMorph | 8B | C,S | - | - | - | 75.2 | 71.8 | - | 41.8 |
| Tar (Ours) | 7B | D,S | 87.8 | 1571 | 355 | 74.4 | 73.0 | 61.3 | 39.0 |

Results on Visual Generation Benchmarks

| Method | GenEval Two Obj. | GenEval Counting | GenEval Color Attri. | GenEval Overall↑ | DPG-Bench Entity | DPG-Bench Attribute | DPG-Bench Relation | DPG-Bench Overall↑ |
|---|---|---|---|---|---|---|---|---|
| LWM-7B | 0.41 | 0.46 | 0.15 | 0.47 | - | - | - | - |
| SEED-X-13B | 0.58 | 0.26 | 0.14 | 0.49 | - | - | - | - |
| Show-o-1.3B | 0.52 | 0.49 | 0.28 | 0.53 | - | - | - | - |
| Transfusion-7B | - | - | - | 0.63 | - | - | - | - |
| D-DiT-2B | 0.80 | 0.54 | 0.50 | 0.65 | - | - | - | - |
| ILLUME-7B | 0.86 | 0.45 | 0.28 | 0.61 | - | - | - | - |
| Janus-1.3B | 0.68 | 0.30 | 0.42 | 0.61 | 87.38 | 87.70 | 85.46 | 79.68 |
| Janus-Pro-1B | 0.82 | 0.51 | 0.56 | 0.73 | 88.63 | 88.17 | 88.98 | 82.63 |
| Harmon-1.5B | 0.86 | 0.57 | 0.48 | 0.76 | - | - | - | - |
| Janus-Pro-7B | 0.89 | 0.59 | 0.66 | 0.80 | 88.90 | 89.40 | 89.32 | 84.19 |
| Tar-1.5B | 0.91 | 0.76 | 0.51 | 0.76 | 89.35 | 86.91 | 93.50 | 82.96 |
| Tar-1.5B + Self Reflect | 0.92 | 0.77 | 0.55 | 0.78 | 88.48 | 87.83 | 93.38 | 84.10 |
| Tar-7B | 0.92 | 0.83 | 0.65 | 0.84 | 88.62 | 88.05 | 93.98 | 84.19 |
| Tar-7B + Self Reflect | 0.93 | 0.86 | 0.70 | 0.85 | 88.60 | 88.78 | 93.59 | 84.65 |


BibTeX

If you find our work useful, please cite our paper using the BibTeX entry below:
@article{han2025tar,
  title={Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations}, 
  author={Han, Jiaming and Chen, Hao and Zhao, Yang and Wang, Hanyu and Zhao, Qi and Yang, Ziyan and He, Hao and Yue, Xiangyu and Jiang, Lu},
  journal={arXiv preprint arXiv:2506.18898},
  year={2025},
}