[SEG] token for segmentation. Our project investigates whether and how segmentation ability can be unlocked from the MLLM itself with a single segmentation embedding (SELF1E) while achieving competitive performance, thus eliminating the need for external decoders.
First, we retain image features at their original, uncompressed resolution and refill them with residual features extracted from the MLLM-processed compressed features, thereby improving feature precision.
Subsequently, we apply pixel-unshuffle operations to the image features both with and without LLM processing, unleashing the details of the compressed features and amplifying the residual features at the uncompressed resolution, which further enhances the resolution of the refilled features.
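For readers unfamiliar with the operation, here is a minimal pure-Python sketch of pixel-unshuffle as used above: it trades spatial resolution for channel depth, rearranging a (C, H, W) feature map into (C·r², H/r, W/r) without discarding any values. The function name and arguments are illustrative, not the project's API.

```python
def pixel_unshuffle(x, r):
    """x: nested list of shape (C, H, W); returns shape (C*r*r, H//r, W//r)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    assert h % r == 0 and w % r == 0, "spatial dims must be divisible by r"
    out_h, out_w = h // r, w // r
    out = []
    for ch in range(c):
        # Each (dy, dx) offset within an r x r block becomes its own channel.
        for dy in range(r):
            for dx in range(r):
                plane = [[x[ch][oy * r + dy][ox * r + dx] for ox in range(out_w)]
                         for oy in range(out_h)]
                out.append(plane)
    return out

# A 1-channel 4x4 map becomes 4 channels of 2x2; every value is preserved.
feat = [[[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]]
out = pixel_unshuffle(feat, 2)
print(len(out), len(out[0]), len(out[0][0]))  # 4 2 2
```

Because the rearrangement is lossless, applying it to both the compressed and uncompressed feature streams keeps fine spatial detail available in channel form while aligning the two resolutions.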
Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. SELF1E serves as a step forward in integrating segmentation ability into MLLMs.
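To illustrate the dual-pathway idea, the sketch below builds a boolean attention mask over image tokens, text tokens, and a single [SEG] token, starting from a standard causal mask and then opening the two extra pathways named above. The token ordering, function name, and exact masking rules are assumptions for illustration, not SELF1E's actual implementation.

```python
def dual_pathway_mask(n_img, n_txt):
    """Boolean mask of shape (n, n); True means attention is allowed.
    Assumed token order: n_img image tokens, n_txt text tokens, one [SEG]."""
    n = n_img + n_txt + 1
    seg = n - 1  # index of the [SEG] token
    # Start from a standard causal mask: each token sees itself and the past.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    for q in range(n_img):
        # Pathway 1 (image-to-image): image tokens attend bidirectionally.
        for k in range(n_img):
            mask[q][k] = True
        # Pathway 2 (image-to-segmentation): image tokens also see [SEG].
        mask[q][seg] = True
    return mask

m = dual_pathway_mask(3, 2)
print(m[0][2], m[0][5], m[3][4])  # True True False
```

Text tokens remain causal, while every image token can exchange information with every other image token and with the segmentation token, which is what lets a single [SEG] embedding gather pixel-level evidence.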
SELF1E produces competitive and interpretable segmentation results across semantic, referring, and reasoning segmentation benchmarks.
On the RefCOCO family of benchmarks, SELF1E achieves competitive performance compared with methods that rely on external decoders.
On ReasonSeg, SELF1E demonstrates strong reasoning segmentation ability using only a single segmentation token.
@inproceedings{zhang2026self1e,
author = {Zhang, Anqi and Ji, Xiaokang and Gao, Guangyu and Jiao, Jianbo and Liu, Chi Harold and Wei, Yunchao},
title = {SELF1E: Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2026},
}