SELF1E: Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

Anqi Zhang1,2, Xiaokang Ji1, Guangyu Gao1*, Jianbo Jiao2, Chi Harold Liu1, Yunchao Wei3,4
1School of Computer Science, Beijing Institute of Technology 2The MIx group, School of Computer Science, University of Birmingham 3WEI Lab, Institute of Information Science, Beijing Jiaotong University 4Beijing Academy of Artificial Intelligence

Highlights

  • ✅️ No external expert decoder for text-guided referring segmentation.
  • ✅️ Only a single [SEG] token for segmentation.
  • ✅️ The first method to combine the two characteristics above while achieving solid, competitive performance.
  • 🚀 A step forward in integrating segmentation ability inside the MLLM.
SELF1E comparison with existing methods

SELF1E unlocks segmentation ability directly from MLLM with a single segmentation token, eliminating the need for external decoders.

Abstract

Our project investigates whether, and how, segmentation ability can be unlocked from the MLLM itself with one segmentation embedding (SELF1E) while achieving competitive performance, thus eliminating the need for external decoders.

First, we retain image features at their original, uncompressed resolution and refill them with residual features extracted from the MLLM-processed compressed features, thereby improving feature precision.
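One plausible reading of this refilling step can be sketched as follows. The exact residual definition and upsampling operator are assumptions for illustration (here the residual is the difference between the LLM output and its compressed input, nearest-neighbor upsampled back to the uncompressed grid):

```python
import numpy as np

def refill(uncompressed, llm_out, compressed):
    """Refill full-resolution features with an LLM residual (sketch).

    uncompressed: (C, H, W) original high-resolution image features
    llm_out:      (C, h, w) compressed features after LLM processing
    compressed:   (C, h, w) compressed features before LLM processing
    """
    residual = llm_out - compressed               # what the LLM added
    C, H, W = uncompressed.shape
    _, h, w = residual.shape
    # nearest-neighbor upsample the residual to the uncompressed grid
    up = residual.repeat(H // h, axis=1).repeat(W // w, axis=2)
    return uncompressed + up
```

With zero residual the uncompressed features pass through unchanged, so the high-resolution detail is preserved while LLM-derived information is injected on top.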

Subsequently, we apply pixel-unshuffle operations to the image features both with and without LLM processing, unleashing the details of the compressed features and amplifying the residual features at the uncompressed resolution, which further enhances the resolution of the refilled features.
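Pixel-unshuffle itself is a standard, lossless rearrangement (the inverse of pixel-shuffle / depth-to-space): spatial resolution is traded for channel depth, so no detail is discarded. A minimal numpy sketch:

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Rearrange a (C, H, W) feature map into (C*r*r, H/r, W/r).

    Each r x r spatial block is folded into the channel dimension,
    so the operation is information-preserving.
    """
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)            # (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)

x = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
y = pixel_unshuffle(x, 2)
print(y.shape)  # -> (4, 2, 2)
```

In PyTorch the same operation is available as `nn.PixelUnshuffle`; how SELF1E combines the two unshuffled streams is described in the paper, not reproduced here.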

Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. SELF1E serves as a step forward in integrating segmentation ability inside MLLM.
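One plausible construction of such a mask is sketched below; the token layout and the direction of each pathway are assumptions for illustration (image tokens first, then text, then one [SEG] token at the end of the sequence):

```python
import numpy as np

def dual_pathway_mask(n_img, n_txt):
    """Boolean attention mask (True = may attend), sketch only.

    Layout: [image tokens | text tokens | one [SEG] token].
    Starts from a causal mask, then opens two extra pathways:
      * image-to-image: bidirectional attention among image tokens;
      * image-to-segmentation: image tokens may attend to [SEG].
    """
    n = n_img + n_txt + 1
    seg = n - 1
    mask = np.tril(np.ones((n, n), dtype=bool))   # causal baseline
    mask[:n_img, :n_img] = True                   # image-to-image
    mask[:n_img, seg] = True                      # image-to-segmentation
    return mask

m = dual_pathway_mask(4, 3)
```

Under this layout every image token can read the [SEG] token while text tokens remain strictly causal, which is one way the pixels and the segmentation token could interact richly without disturbing language modeling.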

RFR & RFA

Resolution refinement pipeline (RFR & RFA)

Self-Replication

SELF1E segmentation token interaction

Visualization

SELF1E produces competitive and interpretable segmentation results across semantic, referring, and reasoning segmentation benchmarks.

SELF1E visualizations on multiple datasets

Performance

On RefCOCO family benchmarks, SELF1E achieves solid and competitive performance compared with methods that rely on external decoders.

Performance of SELF1E on RefCOCO benchmarks

On ReasonSeg, SELF1E demonstrates strong reasoning segmentation ability using only a single segmentation token.

Performance of SELF1E on ReasonSeg benchmark

BibTeX

@inproceedings{zhang2026self1e,
  author    = {Zhang, Anqi and Ji, Xiaokang and Gao, Guangyu and Jiao, Jianbo and Liu, Chi Harold and Wei, Yunchao},
  title     = {SELF1E: Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026},
}