Unifying Visual Understanding and Generation via Text-Aligned Representations


Abstract

This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency.

Method

Framework. Tar is a unified multimodal LLM for both visual understanding and generation, consisting of an autoregressive LLM, a visual tokenizer (TA-Tok), and a visual de-tokenizer. Unlike previous works, Tar operates on fully discrete, text-aligned visual tokens, eliminating the need for modality-specific designs such as visual projectors. As a result, Tar can be trained with the standard next-token prediction objective; a minimal sketch of this idea follows the figure below.

[Figure: Overview of the Tar framework]
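
Because vision and text share one discrete vocabulary, training reduces to ordinary language modeling. The snippet below is a minimal sketch of this idea, not the released training code: the base checkpoint name and the codebook size are illustrative assumptions, and the <I...> token format follows the inference code further below.

import torch
from transformers import AutoTokenizer, Qwen2ForCausalLM

# Illustrative checkpoint and codebook size; the released Tar models may differ.
BASE_MODEL = "Qwen/Qwen2-1.5B-Instruct"
CODEBOOK_SIZE = 8192  # assumed TA-Tok codebook size, for illustration only

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = Qwen2ForCausalLM.from_pretrained(BASE_MODEL)

# Expand the LLM vocabulary with visual tokens <I0> ... <I{N-1}>.
tokenizer.add_tokens([f"<I{i}>" for i in range(CODEBOOK_SIZE)], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# A mixed text-and-image sequence is trained with plain next-token prediction;
# the model shifts the labels internally and computes cross-entropy over the
# shared (text + visual) vocabulary.
batch = tokenizer("Describe the image: <I12><I345><I678>", return_tensors="pt")
loss = model(**batch, labels=batch.input_ids).loss

Because the loss is identical for text and visual targets, understanding (image tokens in, text out) and generation (text in, image tokens out) are handled through the same interface.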

Visual Tokenizer. The key design of Tar is its text-aligned visual tokenizer, TA-Tok. It adds a vector quantization module to a pretrained SigLIP2 encoder, converting input images into semantic, discrete tokens. Unlike other discrete tokenizers (e.g., VQ-VAE), TA-Tok directly uses the LLM's token embeddings as its codebook, so each visual token can be read as a transformed LLM token. Training a unified MLLM with TA-Tok is therefore akin to teaching the LLM a foreign language; a minimal sketch of the quantization step follows the figure below.

[Figure: The TA-Tok visual tokenizer]
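
For intuition, the following is a minimal sketch of the text-aligned quantization step, assuming SigLIP2 patch features are projected into the LLM embedding space and snapped to their nearest codebook entry by cosine similarity. The class name, dimensions, and similarity metric are assumptions rather than the released TA-Tok implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAlignedQuantizer(nn.Module):
    """Illustrative sketch: quantize projected vision features against a
    codebook built from the LLM's token embeddings."""

    def __init__(self, vision_dim: int, llm_embeddings: torch.Tensor):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_embeddings.shape[1])  # vision -> LLM space
        self.register_buffer("codebook", llm_embeddings)            # frozen LLM embeddings

    def forward(self, vision_feats: torch.Tensor):
        # vision_feats: (B, N, vision_dim) patch features from a SigLIP2 encoder
        z_e = self.proj(vision_feats)
        sim = F.normalize(z_e, dim=-1) @ F.normalize(self.codebook, dim=-1).t()
        indices = sim.argmax(dim=-1)          # discrete, text-aligned token ids
        z_q = self.codebook[indices]          # embeddings of the matched LLM tokens
        # Straight-through estimator so gradients still reach the projector.
        z_q = z_e + (z_q - z_e).detach()
        return indices, z_q

In a setup like this, the codebook could be initialized from the LLM's input embedding matrix (e.g., model.get_input_embeddings().weight), which is why each visual token can be interpreted as a transformed LLM token.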

De-Tokenizer. Since TA-Tok is fully text-aligned, it cannot reconstruct pixels directly the way a VQ-VAE can. Instead, we propose visual de-tokenizers that decode visual tokens back into images, in two variants: an autoregressive model and a diffusion-based model. The AR de-tokenizer works well with the discrete visual tokens from TA-Tok, while the diffusion-based de-tokenizer can leverage pretrained models for fast adaptation.
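
As a rough sketch of the decoding interface, the code below assumes the AR variant is a small causal transformer that takes TA-Tok's semantic tokens as a prefix and autoregressively predicts pixel-level VQ codes, which a separate VQ decoder (not shown) would render into an image; the diffusion variant would instead condition a pretrained diffusion model on the same tokens. Class names, shapes, and the greedy decoding loop are illustrative assumptions, not the released architecture.

import torch
import torch.nn as nn

class ARDeTokenizer(nn.Module):
    """Illustrative sketch of an autoregressive de-tokenizer: conditioned on
    TA-Tok's semantic tokens, it generates pixel-level VQ codes one at a time."""

    def __init__(self, semantic_vocab: int, pixel_vocab: int,
                 dim: int = 512, num_pixel_tokens: int = 256):
        super().__init__()
        self.sem_embed = nn.Embedding(semantic_vocab, dim)   # prefix condition
        self.pix_embed = nn.Embedding(pixel_vocab, dim)      # generated codes
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, pixel_vocab)
        self.num_pixel_tokens = num_pixel_tokens

    @torch.no_grad()
    def decode(self, semantic_tokens: torch.LongTensor) -> torch.LongTensor:
        # semantic_tokens: (B, N_sem) discrete indices from TA-Tok
        seq = self.sem_embed(semantic_tokens)
        codes = []
        for _ in range(self.num_pixel_tokens):
            mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
            h = self.backbone(seq, mask=mask)
            next_code = self.head(h[:, -1]).argmax(dim=-1)   # greedy, for brevity
            codes.append(next_code)
            seq = torch.cat([seq, self.pix_embed(next_code).unsqueeze(1)], dim=1)
        # A pixel VQ decoder (not shown) maps the returned codes to an image.
        return torch.stack(codes, dim=1)                     # (B, num_pixel_tokens)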

Implementation

The trained Tar model is a standard LLM with an expanded visual vocabulary. As shown in the code below, the finetuned Qwen2 model can understand and generate TA-Tok's discrete tokens. We do not need to modify Qwen2's architecture; we only feed TA-Tok's discrete tokens into Qwen2 and, for generation, decode them with the de-tokenizer.



from dataclasses import dataclass

from PIL import Image
from torchvision.transforms.functional import to_tensor
from transformers import AutoTokenizer, Qwen2ForCausalLM

from tok.ta_tok import TextAlignedTokenizer


@dataclass
class I2TConfig:
    # Minimal config assumed for this snippet.
    model_path: str   # path to the finetuned Qwen2 checkpoint
    ta_tok_path: str  # path to the TA-Tok checkpoint


class ImageToTextInference:
    def __init__(self, config: I2TConfig):
        self.config = config
        self.model = Qwen2ForCausalLM.from_pretrained(config.model_path)
        self.text_tokenizer = AutoTokenizer.from_pretrained(config.model_path)
        self.visual_tokenizer = TextAlignedTokenizer.from_checkpoint(
            config.ta_tok_path, load_teacher=False, input_type='indices')

    def generate(self, image_path: str, prompt: str) -> str:
        image = Image.open(image_path).convert('RGB')
        image = to_tensor(image).unsqueeze(0)

        # Encode the image into TA-Tok's discrete token indices.
        image_code = self.visual_tokenizer(image)['encoded']
        image_text = "".join(f"<I{x}>" for x in image_code[0].cpu().tolist())

        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"{image_text}\n{prompt}"}]

        input_text = self.text_tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True)
        inputs = self.text_tokenizer(input_text, return_tensors="pt")

        gen_ids = self.model.generate(
            inputs.input_ids, max_new_tokens=256, do_sample=True)
        # Strip the prompt tokens and decode only the newly generated answer.
        return self.text_tokenizer.batch_decode(
            gen_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
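
The reverse, text-to-image direction follows the same pattern. The sketch below is illustrative rather than the released inference script: the de-tokenizer's decode interface is an assumption, and only the <I...> token format is taken from the code above.

import re
import torch
from transformers import AutoTokenizer, Qwen2ForCausalLM

class TextToImageInference:
    def __init__(self, model_path: str, detokenizer):
        self.model = Qwen2ForCausalLM.from_pretrained(model_path)
        self.text_tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.detokenizer = detokenizer  # AR or diffusion de-tokenizer (assumed interface)

    def generate(self, prompt: str) -> torch.Tensor:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}]
        input_text = self.text_tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True)
        inputs = self.text_tokenizer(input_text, return_tensors="pt")

        gen_ids = self.model.generate(
            inputs.input_ids, max_new_tokens=1024, do_sample=True)
        output = self.text_tokenizer.batch_decode(
            gen_ids[:, inputs.input_ids.shape[1]:])[0]

        # Parse the generated visual tokens <I123><I456>... back into indices.
        codes = torch.tensor([[int(i) for i in re.findall(r"<I(\d+)>", output)]])
        # Map the discrete codes back to pixels with the de-tokenizer (assumed API).
        return self.detokenizer.decode(codes)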
          

Experiment

Results on Visual Understanding Benchmarks

* Token: Token type, including Continuous (C), Discrete (D), Semantic (S), Pixel (P) and Hybrid (H).

| Model | LLM Size | Token | POPE↑ | MME-P↑ | MME-C↑ | MMB↑ | SEED↑ | GQA↑ | MMMU↑ |
|---|---|---|---|---|---|---|---|---|---|
| Show-o | 1.3B | D,P | 80.0 | 1097 | 248 | - | - | 58.0 | 26.7 |
| Harmon | 1.5B | C,H | 87.6 | 1155 | 321 | 65.5 | 67.1 | 58.9 | 38.9 |
| Janus | 1.5B | C,S | 87.0 | 1338 | 222 | 69.4 | 63.7 | 59.1 | 30.5 |
| Janus-Pro | 1.5B | C,S | 86.2 | 1444 | 268 | 75.5 | 68.3 | 59.3 | 36.3 |
| D-DiT | 2.0B | C,P | 84.0 | 1125 | - | - | - | 59.2 | - |
| Tar (Ours) | 1.5B | D,S | 88.4 | 1390 | 342 | 65.6 | 70.4 | 61.1 | 36.0 |
| ILLUME | 7B | C,S | 88.5 | 1445 | - | 65.1 | 72.9 | - | 38.2 |
| Chameleon | 7B | D,P | - | - | - | - | - | - | 22.4 |
| LWM | 7B | D,P | 75.2 | - | - | - | - | 44.8 | - |
| Liquid | 7B | D,P | 81.1 | 1119 | - | - | - | 58.4 | - |
| UniTok | 7B | D,H | 83.2 | 1448 | - | - | - | 61.1 | - |
| VILA-U | 7B | D,H | 85.8 | 1402 | - | - | 59.0 | 60.8 | - |
| Janus-Pro | 7B | C,S | 87.4 | 1567 | 260 | 79.2 | 72.1 | 62.0 | 41.0 |
| MetaMorph | 8B | C,S | - | - | - | 75.2 | 71.8 | - | 41.8 |
| Tar (Ours) | 7B | D,S | 87.8 | 1571 | 355 | 74.4 | 73.0 | 61.3 | 39.0 |

Results on Visual Generation Benchmarks

| Method | GenEval Two Obj. | GenEval Counting | GenEval Color Attri. | GenEval Overall↑ | DPG-Bench Entity | DPG-Bench Attribute | DPG-Bench Relation | DPG-Bench Overall↑ |
|---|---|---|---|---|---|---|---|---|
| LWM-7B | 0.41 | 0.46 | 0.15 | 0.47 | - | - | - | - |
| SEED-X-13B | 0.58 | 0.26 | 0.14 | 0.49 | - | - | - | - |
| Show-o-1.3B | 0.52 | 0.49 | 0.28 | 0.53 | - | - | - | - |
| Transfusion-7B | - | - | - | 0.63 | - | - | - | - |
| D-DiT-2B | 0.80 | 0.54 | 0.50 | 0.65 | - | - | - | - |
| ILLUME-7B | 0.86 | 0.45 | 0.28 | 0.61 | - | - | - | - |
| Janus-1.3B | 0.68 | 0.30 | 0.42 | 0.61 | 87.38 | 87.70 | 85.46 | 79.68 |
| Janus-Pro-1B | 0.82 | 0.51 | 0.56 | 0.73 | 88.63 | 88.17 | 88.98 | 82.63 |
| Harmon-1.5B | 0.86 | 0.57 | 0.48 | 0.76 | - | - | - | - |
| Janus-Pro-7B | 0.89 | 0.59 | 0.66 | 0.80 | 88.90 | 89.40 | 89.32 | 84.19 |
| Tar-1.5B | 0.91 | 0.76 | 0.51 | 0.76 | 89.35 | 86.91 | 93.50 | 82.96 |
| Tar-1.5B + Self Reflect | 0.92 | 0.77 | 0.55 | 0.78 | 88.48 | 87.83 | 93.38 | 84.10 |
| Tar-7B | 0.92 | 0.83 | 0.65 | 0.84 | 88.62 | 88.05 | 93.98 | 84.19 |
| Tar-7B + Self Reflect | 0.93 | 0.86 | 0.70 | 0.85 | 88.60 | 88.78 | 93.59 | 84.65 |


BibTeX

If you find our work useful, please cite our paper using the BibTeX entry below:
@article{han2025tar,
  title={Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations}, 
  author={Han, Jiaming and Chen, Hao and Zhao, Yang and Wang, Hanyu and Zhao, Qi and Yang, Ziyan and He, Hao and Yue, Xiangyu and Jiang, Lu},
  journal={arXiv preprint arXiv:2506.18898},
  year={2025},
}