ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference

CVPR 2026 Main Track Poster

1 Kiel University, Germany 2 Hamburg University of Technology (TUHH), Germany 3 UNU-INWEH, Germany
{ali.hojjat, olaf.landsiedel}@tuhh.de {janek.haberer, soeren.pirk}@cs.uni-kiel.de

A nested Vision Transformer that spends computation adaptively: easy inputs exit early, difficult inputs activate more attention heads and reuse earlier token embeddings.

ThinkingViT progressive thinking stages and token recycling animation

Nested progressive inference with Token Recycling in ThinkingViT. After embedding the input, ThinkingViT first activates a subset of the model, including the first attention heads, to produce an initial prediction. The training procedure encourages these heads to capture the most important features. If the certainty of this prediction exceeds a threshold, inference terminates early, which saves computation on easy inputs. Otherwise, the resulting tokens are fused back into the input through a projection and a learnable scaling factor α, which controls how much prior knowledge is recycled. The model then thinks more by reprocessing the fused tokens with a larger subset of attention heads to produce a refined prediction. ThinkingViT enables elastic inference across hardware budgets by adjusting the confidence threshold; the number of stages and the proportion of attention heads per stage can also be configured to trade off efficiency against accuracy.

Abstract

Vision Transformers deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent Matryoshka-style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies.

ThinkingViT introduces progressive thinking stages that dynamically adjust inference computation based on input difficulty. It first activates a small subset of the most important attention heads and exits early when predictions are sufficiently certain. Otherwise, it activates additional attention heads and re-evaluates the input. Token Recycling conditions each subsequent stage on embeddings from the previous stage, enabling progressive improvement while preserving the backbone.

Method

Progressive Thinking Stages

Inference starts with a smaller attention-head budget. If confidence is high enough, the sample exits early; otherwise, ThinkingViT activates a larger head subset and performs another stage.
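
The staged early-exit loop described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: it assumes certainty is measured by the entropy of the softmax output (the figures on this page refer to first-round entropy), and the names `progressive_inference`, `small_stage`, `large_stage`, and the threshold value 0.5 are hypothetical placeholders. The toy stages simply return precomputed logits in place of a real forward pass.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means a more certain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def progressive_inference(x, stages, max_entropy):
    """Run stages in order and stop at the first sufficiently certain one.

    stages: callables (x, prev_tokens) -> (logits, tokens), ordered from the
            smallest head subset to the largest (hypothetical interface).
    Returns (predicted_class, number_of_stages_used).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    prev_tokens = None
    for i, stage in enumerate(stages, start=1):
        logits, prev_tokens = stage(x, prev_tokens)
        probs = softmax(logits)
        # Exit early when certain enough; the last stage always answers.
        if entropy(probs) <= max_entropy or i == len(stages):
            return probs.index(max(probs)), i

# Toy stand-ins for the small and large head subsets (assumptions).
def small_stage(x, prev_tokens):
    return x["small_logits"], "tokens-from-stage-1"

def large_stage(x, prev_tokens):
    # The later stage receives the earlier stage's tokens.
    assert prev_tokens is not None
    return x["large_logits"], "tokens-from-stage-2"

easy = {"small_logits": [8.0, 0.0, 0.0], "large_logits": [9.0, 0.0, 0.0]}
hard = {"small_logits": [1.0, 0.9, 0.8], "large_logits": [0.5, 4.0, 0.2]}

easy_result = progressive_inference(easy, [small_stage, large_stage], 0.5)
hard_result = progressive_inference(hard, [small_stage, large_stage], 0.5)
```

In this toy setup the easy input exits after the first stage, while the hard input, whose first-stage distribution is nearly uniform, proceeds to the second stage.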

Token Recycling

Later stages are not independent reruns. They are conditioned on embeddings from previous stages, helping the model refine predictions while preserving the ViT backbone.
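
One way to picture Token Recycling is as an additive fusion of the projected previous-stage tokens into the next stage's input tokens, scaled by α. The sketch below is written under that assumption. The additive form, the function name `recycle_tokens`, the assumption that the projection maps from the earlier stage's width to the later stage's width, and all numeric values are illustrative and are not taken from the paper.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def recycle_tokens(input_tokens, prev_tokens, proj, alpha):
    """Return input_tokens + alpha * (prev_tokens @ proj), token by token.

    input_tokens: (num_tokens x d_next) embeddings fed to the next stage
    prev_tokens:  (num_tokens x d_prev) tokens produced by the previous stage
    proj:         (d_prev x d_next) projection matrix (learned in practice)
    alpha:        scalar controlling how much prior knowledge is recycled
    """
    projected = matmul(prev_tokens, proj)
    return [[x + alpha * p for x, p in zip(x_row, p_row)]
            for x_row, p_row in zip(input_tokens, projected)]

# Two tokens; previous stage width 2, next stage width 3 (made-up sizes).
input_tokens = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
prev_tokens = [[1.0, 2.0], [3.0, 4.0]]
proj = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

fused = recycle_tokens(input_tokens, prev_tokens, proj, alpha=0.5)
no_recycling = recycle_tokens(input_tokens, prev_tokens, proj, alpha=0.0)
```

With α = 0 the later stage sees only the original input tokens, so under this formulation α interpolates between an independent rerun and a rerun conditioned on the earlier stage.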

Backbone-Preserving Upgrade

The design works as a plugin-style upgrade for vanilla ViT and extends to Swin Transformers, giving one model multiple accuracy-compute operating points.

ImageNet-1K Results

ThinkingViT ImageNet validation accuracy versus throughput
ThinkingViT improves the accuracy-throughput trade-off over nested baselines.
ThinkingViT ImageNet validation accuracy versus GMACs
Threshold-controlled early exits expose multiple compute budgets from the same model.
ThinkingViT-Swin accuracy versus GMACs
The progressive thinking mechanism transfers to Swin-S.
Visualization of images sorted by first-round entropy. ThinkingViT confidently classifies simple, clear images in one round, while complex cases with occlusion or clutter show higher entropy and trigger a second round.
ThinkingViT segmentation after one and two rounds of thinking. Example outputs from the ADE20K dataset. The second round refines object boundaries and improves segmentation quality compared to the first round.
Entropy distribution after the first inference round across ImageNet validation sets. Simpler datasets like ImageNet-V2 show confident early predictions, while harder ones like ImageNet-A and ImageNet-R show greater uncertainty and more often trigger a second round of thinking.
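
To make the link between the exit threshold and the compute budget concrete, the following sketch estimates average per-image compute for a two-stage model from a set of first-round entropies. It assumes every image pays for the first stage and only non-exiting images pay for the second. The function name `expected_gmacs`, the entropy values, and the per-stage GMAC costs are made-up placeholders, not measurements from the paper.

```python
def expected_gmacs(first_round_entropies, max_entropy,
                   gmacs_stage1, gmacs_stage2):
    """Average compute per image for a two-stage early-exit model.

    Images with first-round entropy <= max_entropy exit after stage 1;
    all other images additionally pay for stage 2.
    """
    if not first_round_entropies:
        raise ValueError("need at least one entropy value")
    exited = sum(1 for h in first_round_entropies if h <= max_entropy)
    exit_rate = exited / len(first_round_entropies)
    return gmacs_stage1 + (1.0 - exit_rate) * gmacs_stage2

# Placeholder values for illustration only.
entropies = [0.1, 0.2, 0.9, 1.5]
stage1_cost, stage2_cost = 1.0, 4.0

budget_strict = expected_gmacs(entropies, 0.0, stage1_cost, stage2_cost)
budget_mid = expected_gmacs(entropies, 0.5, stage1_cost, stage2_cost)
budget_loose = expected_gmacs(entropies, 2.0, stage1_cost, stage2_cost)
```

Sweeping the threshold moves the same model between the all-images-think-twice budget and the everything-exits-early budget, which is how a single set of weights can expose several operating points.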

BibTeX

@inproceedings{hojjat2026thinkingvit,
  title={ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference},
  author={Hojjat, Ali and Haberer, Janek and Pirk, Soren and Landsiedel, Olaf},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
  url={https://arxiv.org/abs/2507.10800}
}

Acknowledgements

The code builds on pytorch-image-models (timm) and draws inspiration from the HydraViT repository. We thank the timm maintainers, DeiT authors, and HydraViT authors for their open-source implementations.