Generalized Decoding for Pixel, Image and Language

University of Wisconsin-Madison; UCLA; Microsoft Research, Redmond; Microsoft Cloud & AI
*Equal Technical Contribution, Equal Advisory Contribution, Project Lead

In CVPR 2023

Figure 1. X-Decoder is a single model trained to support a wide range of vision and vision-language tasks.


We present X-Decoder, a generalized decoding pipeline that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes two types of queries as input: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs. It uses them to decode different pixel-level and token-level outputs in the same semantic space. With this design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. After pretraining on a mixture of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art open-vocabulary segmentation and referring segmentation on 10 settings of 7 datasets; (2) finetuned performance that is better than or competitive with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning = referring segmentation + captioning).



X-GPT: Connecting generalist X-Decoder with GPT-3

Instruct-X-Decoder: Object-centric instructional image editing


X-Decoder is distinguished by three key design choices:

  • It has two types of queries (latent queries and text queries) and two types of outputs (semantic outputs and pixel-level outputs).
  • It uses a single text encoder for all text corpora, ranging from class concepts and referring phrases to image captions.
  • It decouples the image and text encoders to accommodate cross-image tasks (e.g., image-text retrieval) and within-image tasks (e.g., segmentation and captioning).
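
The first design choice can be illustrated with a toy sketch. It assumes, as is common in query-based mask decoders, that a pixel-level output is obtained by taking the dot product between a decoded query embedding and per-pixel image features, and that a semantic output is matched against text embeddings by dot product in the shared semantic space. The function names, shapes and numbers below are hypothetical and are not taken from the X-Decoder code.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pixel_output(query: Vector,
                 pixel_features: List[List[Vector]],
                 threshold: float = 0.0) -> List[List[int]]:
    """Binary mask: 1 where the query's score for a pixel exceeds the threshold."""
    return [[1 if dot(query, feat) > threshold else 0 for feat in row]
            for row in pixel_features]


def semantic_output(query: Vector, text_embeddings: Dict[str, Vector]) -> str:
    """Name of the text embedding with the highest dot product to the query."""
    return max(text_embeddings, key=lambda name: dot(query, text_embeddings[name]))


# Toy 2x2 feature map with 2-dimensional features (assumed values).
pixel_features = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
# Toy text embeddings for two class concepts (assumed values).
text_embeddings = {"sky": [1.0, 0.0], "grass": [0.0, 1.0]}
# A decoded query embedding, standing in for one decoder output.
decoded_query = [0.9, -0.2]

mask = pixel_output(decoded_query, pixel_features)
label = semantic_output(decoded_query, text_embeddings)
print(mask, label)
```

In this toy setting the same decoded query yields both a mask (top row selected) and a label, which mirrors the idea of decoding pixel-level and semantic outputs from one set of queries.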

Figure 2. X-Decoder overall architecture.

X-Decoder can be used to unify a variety of vision and vision-language tasks:

  • Generic Segmentation: Instance, semantic and panoptic segmentation, all supporting open-vocabulary and zero-shot settings.
  • Referring Segmentation: Segment a specific image region given arbitrary textual queries encoded by the text encoder.
  • Image-Text Retrieval: Image and text representations are extracted separately, and their similarities are computed by dot product.
  • Image Captioning: Decode textual tokens using exactly the same decoder in an autoregressive manner.
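
The retrieval item above can be sketched as follows. Because the image and text representations are extracted independently, retrieval reduces to a similarity matrix of dot products. The L2 normalization step and all names and values here are assumptions made for illustration, not details confirmed by this page.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs: List[Vector], text_embs: List[Vector]) -> List[List[float]]:
    """sims[i][j] is the dot product of normalized image i and normalized text j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def retrieve(image_embs: List[Vector], text_embs: List[Vector]) -> List[int]:
    """For each image, the index of the most similar text."""
    sims = similarity_matrix(image_embs, text_embs)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]


# Toy embeddings, as if produced by separate image and text encoders.
image_embs = [[2.0, 0.1], [0.2, 3.0]]
text_embs = [[0.0, 1.0], [1.0, 0.0]]

best_text_per_image = retrieve(image_embs, text_embs)
print(best_text_per_image)
```

Since neither encoder needs to see the other modality, text and image embeddings can be computed once and reused across all pairs, which is what makes cross-image retrieval practical.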

Figure 3. Unify different types of vision and vision-language tasks with a single X-Decoder.


Zero-Shot Segmentation

Chart 1. Left: Zero-shot segmentation results compared with SoTA; Right: Finetuned segmentation results compared with SoTA.

Segmentation in the Wild

Chart 2. Zero-shot instance segmentation performance on SeginW benchmark.


Zero-Shot Segmentation

Figure 4. Zero-shot semantic segmentation with pretrained X-Decoder on 10 settings of 7 datasets.

Zero-shot Segmentation for Videos

Figure 5. Zero-shot panoptic segmentation on the YouTubeVOS video dataset.

Referring Segmentation for Videos

Figure 6. Zero-shot referring segmentation on the YouTubeVOS video dataset.

Image Captioning

Figure 7. Zero-shot image captioning on YouTubeVOS video frames.
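
Captions are decoded token by token, each step conditioned on the tokens produced so far. The loop below is a minimal sketch of that autoregressive pattern. Greedy selection, the special tokens and the toy bigram table are assumptions for illustration; a real model would score the next token from image features and the decoded prefix.

```python
from typing import Callable, Dict, List


def greedy_decode(next_token: Callable[[List[str]], str],
                  bos: str = "<bos>",
                  eos: str = "<eos>",
                  max_len: int = 16) -> List[str]:
    """Append one token at a time until the end token or the length limit."""
    tokens = [bos]
    while len(tokens) < max_len:
        tok = next_token(tokens)  # prediction depends on the whole prefix
        if tok == eos:
            break
        tokens.append(tok)
    return tokens[1:]  # drop the start token


# Hypothetical stand-in for a captioning model: a fixed bigram lookup.
TOY_BIGRAMS: Dict[str, str] = {
    "<bos>": "a",
    "a": "dog",
    "dog": "running",
    "running": "<eos>",
}


def toy_next_token(prefix: List[str]) -> str:
    """Pick the next token from the last token only (toy behaviour)."""
    return TOY_BIGRAMS.get(prefix[-1], "<eos>")


caption = greedy_decode(toy_next_token)
print(" ".join(caption))
```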

Referring Image Captioning

Figure 8. Zero-shot referring image captioning on COCO 2017 val images (pink regions are referred).

Referring Image Editing

Figure 9. Zero-shot referring image editing by combining X-Decoder with Stable Diffusion inpainting.
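
The composition in Figure 9 can be read as a two-stage pipeline: a referring-segmentation step turns a phrase into a mask, and an inpainting step regenerates only the masked pixels. The sketch below shows that wiring with hypothetical stand-in callables for both stages. Neither stand-in is the X-Decoder or the Stable Diffusion API; images are small integer grids so that the example runs on its own.

```python
from typing import Callable, List

Image = List[List[int]]
Mask = List[List[int]]


def edit_referred_region(image: Image,
                         phrase: str,
                         prompt: str,
                         segment: Callable[[Image, str], Mask],
                         inpaint: Callable[[Image, Mask, str], Image]) -> Image:
    """Compose the two stages: phrase -> mask, then (image, mask, prompt) -> edited image."""
    mask = segment(image, phrase)
    return inpaint(image, mask, prompt)


def toy_segment(image: Image, phrase: str) -> Mask:
    """Stand-in segmenter: pretends every phrase refers to pixels brighter than 5."""
    return [[1 if px > 5 else 0 for px in row] for row in image]


def toy_inpaint(image: Image, mask: Mask, prompt: str) -> Image:
    """Stand-in inpainter: overwrites masked pixels with 0 and ignores the prompt.

    A real inpainting model would synthesize new content from the prompt.
    """
    return [[0 if m else px for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


image = [[1, 9], [8, 2]]
edited = edit_referred_region(image, "the bright object", "a dark patch",
                              toy_segment, toy_inpaint)
print(edited)
```

Pixels outside the mask are left untouched, which is the property that makes the edit object-centric.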

Segmentation in the Wild

Figure 10. Ground-truth visualization of the Segmentation in the Wild datasets from Roboflow, used for a wider evaluation.


  author      = {Zou*, Xueyan and Dou*, Zi-Yi and Yang*, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee^, Yong Jae and Gao^, Jianfeng},
  title       = {Generalized Decoding for Pixel, Image and Language},
  publisher   = {arXiv:2212.11270v1},
  year        = {2022},

