We present X-Decoder, a generalized decoding pipeline that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes two types of queries as input: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art open-vocabulary segmentation and referring segmentation on 10 settings of 7 datasets; (2) better or competitive finetuned performance compared to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning = referring segmentation + captioning).
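The two-query interface above can be illustrated with a minimal numpy sketch. All dimensions, array names, and the dot-product decoding below are illustrative placeholders, not the actual X-Decoder implementation: the point is only that latent (non-semantic) queries and text-induced (semantic) queries live in one shared embedding space, from which pixel-level mask logits and token-level logits are both read out.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64            # shared semantic-space dimension (assumed for illustration)
num_latent = 100  # generic non-semantic queries
num_text = 5      # semantic queries induced from the text input

# Both query types are embedded in the same D-dimensional semantic space.
latent_q = rng.normal(size=(num_latent, D))
text_q = rng.normal(size=(num_text, D))

# Image features from an (assumed) image encoder, flattened over pixels.
H, W = 16, 16
img_feat = rng.normal(size=(H * W, D))

# Pixel-level output: latent queries dotted with pixel features -> mask logits.
mask_logits = latent_q @ img_feat.T        # shape (num_latent, H*W)

# Token-level output: semantic queries matched against a vocabulary embedded
# in the same space -> token logits for language decoding.
vocab_size = 1000
vocab_emb = rng.normal(size=(vocab_size, D))
token_logits = text_q @ vocab_emb.T        # shape (num_text, vocab_size)

print(mask_logits.shape, token_logits.shape)
```

Because both outputs are read from one semantic space, segmentation and language tasks can share a single decoder, which is what enables the task compositions described above.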
Referring Image Editing. Left: Object Transfer, Middle: Object/Stuff Removal, Right: Stuff Transfer
X-Decoder is distinguished by three critical designs:
Figure 2. X-Decoder overall architecture.
X-Decoder can be used to unify a variety of vision and vision-language tasks:
Figure 3. Unify different types of vision and vision-language tasks with a single X-Decoder.
Chart 1. Left: Zero-shot Segmentation Result Compared with SoTA; Right: Finetuned Segmentation Result Compared with SoTA.
Chart 2. Zero-shot instance segmentation performance on SeginW benchmark.
Figure 4. Zero-shot semantic segmentation with pretrained X-Decoder on 10 settings of 7 datasets.
Figure 5. Zero-shot panoptic segmentation on the YouTubeVOS video dataset.
Figure 6. Zero-shot referring segmentation on the YouTubeVOS video dataset.
Figure 7. Zero-shot image captioning on YouTubeVOS video frames.
Figure 8. Zero-shot referring image captioning on COCO 2017 val images (pink regions are referred).
Figure 9. Zero-shot referring image editing by combining X-Decoder with Stable Diffusion inpainting.
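The editing composition in Figure 9 comes down to gating generated pixels with the referring mask. Below is a minimal numpy sketch of that compositing step; the `image`, `mask`, and `inpainted` arrays are placeholders standing in for the source photo, X-Decoder's referring-segmentation output, and the Stable Diffusion inpainting result respectively (the diffusion call itself is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8

# Placeholder inputs (assumed): in the real pipeline, `image` is the source
# photo, `mask` is produced by referring segmentation from a text phrase,
# and `inpainted` is produced by a Stable Diffusion inpainting model.
image = rng.uniform(size=(H, W, 3))
inpainted = rng.uniform(size=(H, W, 3))
mask = np.zeros((H, W, 1))
mask[2:6, 2:6] = 1.0  # region selected by the referring phrase

# Composite: edited pixels inside the mask, original pixels outside.
edited = mask * inpainted + (1.0 - mask) * image
```

This mask-gated blend is why the composition works zero-shot: neither model needs to know about the other, since the referring mask alone decides which pixels the inpainting result replaces.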
Figure 10. Ground-truth visualization of the Segmentation in the Wild (SeginW) datasets from Roboflow, used for broader evaluation.
@article{zou2022xdecoder,
  author  = {Zou, Xueyan and Dou, Zi-Yi and Yang, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee, Yong Jae and Gao, Jianfeng},
  title   = {Generalized Decoding for Pixel, Image and Language},
  journal = {arXiv preprint arXiv:2212.11270},
  year    = {2022},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.