Generalized Decoding for Pixel, Image and Language

University of Wisconsin-Madison; UCLA; Microsoft Research, Redmond; Microsoft Cloud & AI
*Equal Technical Contribution, Equal Advisory Contribution, Project Lead

Figure 1. X-Decoder is a single model trained to support a wide range of vision and vision-language tasks.


We present X-Decoder, a generalized decoding pipeline that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic, non-semantic queries and (ii) semantic queries induced from text inputs, and decodes different pixel-level and token-level outputs in the same semantic space. With this novel design, X-Decoder is the first work to provide a unified way of supporting all types of image segmentation and a variety of vision-language (VL) tasks. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation across 10 settings of 7 datasets; (2) better or competitive finetuned performance compared with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning = referring segmentation + captioning).



Referring Image Editing. Left: Object Transfer, Middle: Object/Stuff Removal, Right: Stuff Transfer


X-Decoder is distinguished by three critical designs (see the minimal code sketch after this list):

  • It has two types of queries (latent queries and text queries) and outputs (semantic outputs and pixel-level outputs).
  • It uses a single text encoder for all text corpora, ranging from class concepts and referring phrases to image captions.
  • It decouples the image and text encoders to accommodate cross-image tasks (e.g., image-text retrieval) and within-image tasks (e.g., segmentation and captioning).
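To make these designs concrete, here is a minimal PyTorch sketch of the query/output interface: latent and text queries go through one decoder, which emits both pixel-level masks and token-level logits. All module names, shapes, and hyperparameters are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of X-Decoder's two query types and two output types.
import torch
import torch.nn as nn

class XDecoderSketch(nn.Module):
    def __init__(self, dim=512, num_latent_queries=101, vocab_size=30522):
        super().__init__()
        # (i) generic, non-semantic latent queries (learned embeddings)
        self.latent_queries = nn.Parameter(torch.randn(num_latent_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.mask_head = nn.Linear(dim, dim)          # pixel-level outputs
        self.token_head = nn.Linear(dim, vocab_size)  # semantic (token) outputs

    def forward(self, image_feats, text_queries=None):
        # image_feats: (B, HW, dim) from the image encoder
        # text_queries: (B, T, dim) semantic queries from the single text encoder
        B = image_feats.size(0)
        queries = self.latent_queries.unsqueeze(0).expand(B, -1, -1)
        if text_queries is not None:
            # (ii) semantic queries induced from text join the latent ones
            queries = torch.cat([queries, text_queries], dim=1)
        hidden = self.decoder(queries, image_feats)
        # pixel-level outputs: mask embeddings dotted with per-pixel features
        masks = torch.einsum('bqd,bpd->bqp', self.mask_head(hidden), image_feats)
        # token-level outputs: vocabulary logits, usable for captioning
        logits = self.token_head(hidden)
        return masks, logits
```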

Figure 2. X-Decoder overall architecture.

X-Decoder can be used to unify a variety of vision and vision-language tasks:

  • Generic Segmentation: Instance, semantic, and panoptic segmentation, all supporting open-vocabulary and zero-shot settings.
  • Referring Segmentation: Segments the image region referred to by an arbitrary textual query from the text encoder.
  • Image-Text Retrieval: Extracts decoupled image and text representations and computes similarities with a dot product (see the sketch after this list).
  • Image Captioning: Decodes text tokens autoregressively with exactly the same decoder.
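The retrieval path reduces to a dot product between the decoupled embeddings, as in this small self-contained sketch (the encoders producing the embeddings are assumed, not shown):

```python
# Sketch of the dot-product retrieval setup implied above.
import torch
import torch.nn.functional as F

def retrieval_scores(image_embs: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """image_embs: (N_img, D), text_embs: (N_txt, D) -> (N_img, N_txt) similarities."""
    image_embs = F.normalize(image_embs, dim=-1)  # unit norm: dot product = cosine
    text_embs = F.normalize(text_embs, dim=-1)
    return image_embs @ text_embs.t()

# Usage: rank captions for each image by score.
scores = retrieval_scores(torch.randn(4, 512), torch.randn(8, 512))
best_caption = scores.argmax(dim=1)  # best-matching caption index per image
```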

Figure 3. Unify different types of vision and vision-language tasks with a single X-Decoder.


Zero-Shot Segmentation

Chart 1. Left: zero-shot segmentation results compared with SoTA; Right: finetuned segmentation results compared with SoTA.

Segmentation in the Wild

Chart 2. Zero-shot instance segmentation performance on SeginW benchmark.


Zero-Shot Segmentation

Figure 4. Zero-shot semantic segmentation with pretrained X-Decoder on 10 settings of 7 datasets.

Zero-Shot Segmentation for Videos

Figure 5. Zero-shot panoptic segmentation on the YouTubeVOS video dataset.

Referring Segmentation for Videos

Figure 6. Zero-shot referring segmentation on the YouTubeVOS video dataset.

Image Captioning

Figure 7. Zero-shot image captioning on YouTubeVOS video frames.

Referring Image Captioning

Figure 8. Zero-shot referring image captioning on COCO 2017 val images (pink regions are referred).

Referring Image Editing

Figure 9. Zero-shot referring image editing by combining X-Decoder with Stable Diffusion inpainting.
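A hedged sketch of the composition Figure 9 illustrates: a referring mask from X-Decoder gates a Stable Diffusion inpainting pass. The diffusers API calls are real; `xdecoder_refer_segment` is a hypothetical stand-in for X-Decoder's referring segmentation, and the checkpoint name is an assumption.

```python
# Sketch: referring image editing = referring segmentation + inpainting.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.jpg").convert("RGB").resize((512, 512))
# 1) Referring segmentation: binary PIL mask of the referred region (hypothetical helper).
mask = xdecoder_refer_segment(image, phrase="the dog on the left")
# 2) Inpaint the masked region with a new prompt (object/stuff transfer or removal).
edited = pipe(prompt="a corgi", image=image, mask_image=mask).images[0]
edited.save("edited.jpg")
```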

Segmentation in the Wild

Figure 10. Ground-truth visualizations of the Segmentation in the Wild (SeginW) datasets, sourced from Roboflow, for a wider evaluation.


@article{zou2022xdecoder,
  author      = {Zou, Xueyan and Dou, Zi-Yi and Yang, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee, Yong Jae and Gao, Jianfeng},
  title       = {Generalized Decoding for Pixel, Image and Language},
  journal     = {arXiv preprint arXiv:2212.11270},
  year        = {2022},
}


This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.