
What is Chain-of-Visual-Thought (CoVT)? A Complete Guide to Visual Reasoning in AI

Chain-of-Visual-Thought (CoVT) illustration showing visual reasoning in AI with neural network and connected screens.

Imagine asking an AI to count the exact number of coffee cups on a cluttered desk, or to figure out which object sits farthest from the camera in a crowded room. These tasks sound simple to us humans, but they've long been nightmares for artificial intelligence. Vision-language models can describe images beautifully, but when it comes to precise spatial reasoning or counting objects, they stumble badly. Chain-of-Visual-Thought is changing that story.

 


 

TL;DR

  • CoVT introduces continuous visual tokens that let AI models reason visually, not just through text

  • Performance jumps by 3% to 16% across major benchmarks like CV-Bench, MMVP, and HRBench

  • Uses only ~20 compact tokens to encode depth, segmentation, edges, and semantic features

  • Works with popular models like Qwen2.5-VL and LLaVA without external tools

  • Solves the "text bottleneck" where rich visual data gets compressed into inadequate language descriptions

  • Real-world impact: Better depth perception, spatial awareness, and object counting in AI systems


Chain-of-Visual-Thought (CoVT) is a framework that enables vision-language models to reason using continuous visual tokens—compact representations encoding perceptual cues like depth, segmentation, and edges. Unlike traditional text-only reasoning, CoVT lets models "think visually" by generating and processing roughly 20 visual tokens during inference, improving performance on spatial reasoning and fine-grained perception tasks by 3-16% across multiple benchmarks.






Understanding the Core Problem

Vision-language models have achieved remarkable success in recent years. GPT-4V, Claude 3, and similar systems can describe images, answer questions about visual content, and even generate creative interpretations of what they see. But there's a catch.


When you ask these models to count objects, determine precise spatial relationships, or understand depth ordering, they often fail spectacularly. A model might confidently tell you there are "several" cups on a table when the exact count is seven. It might struggle to explain which object is closer to the camera or miss fine-grained details like the edge structure of a transparent glass.


The root cause? The text bottleneck.


Current vision-language models compress rich, continuous visual information—things like precise boundaries, depth gradients, and spatial layouts—into discrete text tokens (Qin et al., arXiv, November 2025). Even if the vision encoder "sees" everything perfectly, the language model can only reason with a simplified textual summary. This process destroys the perceptual cues needed for fine-grained tasks.


Think of it this way: imagine trying to describe the exact three-dimensional arrangement of objects in a room using only words. You'd lose precision. That's exactly what happens inside these AI models every time they try to reason about complex visual scenes.


According to an analysis on the Emergent Mind platform, models relying solely on text-based Chain-of-Thought for perception-heavy tasks often see degraded performance compared to having no reasoning chain at all (Emergent Mind, 2024).


What Exactly is Chain-of-Visual-Thought?

Chain-of-Visual-Thought (CoVT) is a framework that fundamentally changes how vision-language models reason about images. Instead of forcing models to "talk" about visual data using only text, CoVT creates a new pathway: reasoning through continuous visual tokens.


Here's the elegant simplicity: CoVT introduces compact latent representations—roughly 20 tokens—that encode rich perceptual cues directly into the model's reasoning process. These tokens capture:

  • 2D appearance (what objects look like)

  • 3D geometry (spatial depth and structure)

  • Spatial layout (where things are positioned)

  • Edge structure (boundaries and contours)


The framework was introduced in November 2025 by researchers Yiming Qin, Bomin Wei, Jiaxin Ge, and colleagues, and is designed to work with strong existing models like Qwen2.5-VL and LLaVA without requiring external vision tools (arXiv, 2511.19418, November 2025).


The Key Innovation

Rather than restricting VLM reasoning to discrete language space, CoVT forms a visual thought chain that enables models to reason in continuous visual space. By introducing continuous visual tokens that encode perceptual cues—segmentation masks, depth maps, instance boundaries, and edge structures—CoVT composes chains of textual and visual thoughts that link semantic reasoning with perceptual grounding (GitHub, Wakals/CoVT, November 2025).


The result? Models can now "see" and "think" simultaneously while remaining efficient and self-contained.


The Evolution from Chain-of-Thought to CoVT

To understand CoVT, we need to look at its predecessor: Chain-of-Thought (CoT) prompting.


Chain-of-Thought Prompting: The Foundation

Chain-of-Thought prompting, introduced by Google researchers Jason Wei and colleagues in January 2022, revolutionized how large language models tackle complex reasoning (arXiv, 2201.11903, January 2023). Instead of jumping straight to an answer, CoT prompting encourages models to show their work—generating intermediate reasoning steps that lead to the final answer.


For example, instead of directly answering a math problem, a model using CoT might write:


"After Jane gives 2 flowers to her mom she has 10... then after she gives 3 to her dad she will have 7... so the answer is 7."


This approach dramatically improved performance on arithmetic, commonsense, and symbolic reasoning tasks. A 540-billion parameter model using just eight CoT examples achieved state-of-the-art accuracy on the GSM8K math benchmark, even surpassing fine-tuned models (Wei et al., 2022).
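To make the format concrete, here is a minimal illustrative few-shot CoT prompt in the style Wei et al. describe; the exemplar wording is our paraphrase, not text copied from the paper.

# Illustrative few-shot Chain-of-Thought prompt (wording is our own paraphrase).
cot_prompt = """Q: Jane has 12 flowers. She gives 2 to her mom and 3 to her dad. How many flowers does she have left?
A: After Jane gives 2 flowers to her mom she has 10. After she gives 3 to her dad she has 7. The answer is 7.

Q: A cafeteria had 23 apples. It used 20 to make lunch and bought 6 more. How many apples does it have?
A:"""

# Sent to an instruction-following LLM, the model is expected to continue with
# intermediate steps (23 - 20 = 3, then 3 + 6 = 9) before stating the answer.
print(cot_prompt)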


Extending CoT to Vision

Researchers quickly realized that chain-of-thought reasoning could benefit visual tasks too. In April 2023, Jiaxin Ge and colleagues proposed Chain-of-Thought Prompt Tuning for vision-language models (arXiv, 2304.07919, June 2023). Their work showed that conducting effective reasoning is important in visual tasks, not just language.


However, early attempts to apply CoT to vision faced critical limitations:

  1. Text-only CoT accumulates errors in long reasoning chains

  2. Lack of perceptual grounding means models lose fine-grained visual details

  3. Over-reliance on textual priors instead of actual visual understanding

  4. Inability to model transitions between visual states


Enter CoVT: The Next Generation

Chain-of-Visual-Thought addresses these limitations by fundamentally rethinking what "thinking" means for visual AI. Rather than converting all visual information into text tokens, CoVT maintains visual reasoning in a space that preserves continuous perceptual cues (alphaXiv, November 2025).


Research from the 2024 NAACL conference highlighted that measuring and improving chain-of-thought reasoning in vision-language models requires both high-level inference AND detailed reasoning chains that preserve visual fidelity (Chen et al., NAACL 2024).


How CoVT Works: The Technical Architecture

CoVT's architecture is surprisingly elegant for its power. Let's break down the components.


The Overall Pipeline

The training pipeline consists of three main stages:

  1. Visual Token Generation: The vision-language model autoregressively predicts continuous visual tokens

  2. Dense Supervision: These tokens reconstruct dense supervision signals through lightweight decoders

  3. Task-Specific Reasoning: The model uses these visual tokens to reason about the task at hand


During training, CoVT learns to generate visual tokens that can reconstruct outputs from specialized vision experts:

  • SAM (Segment Anything Model) for instance segmentation

  • Depth Anything for depth estimation

  • PIDINet for edge detection

  • DINO for semantic features


Here's the crucial part: CoVT doesn't actually run these external models during inference. Instead, it learns to generate compact visual tokens (roughly 20 total) that implicitly contain the knowledge distilled from these experts (arXiv, 2511.19418, November 2025).
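To picture how this distillation works, here is a conceptual PyTorch sketch under our own simplifying assumptions (module names, shapes, and the loss are illustrative, not the authors' implementation): the VLM emits a few continuous tokens, and a lightweight decoder is trained to reconstruct a frozen expert's output from them.

import torch
import torch.nn as nn

# Conceptual sketch only: shapes, names, and the loss are our assumptions.
class LightweightDecoder(nn.Module):
    """Maps a few continuous visual tokens to a dense prediction (e.g., a depth map)."""
    def __init__(self, num_tokens: int, token_dim: int, out_hw: int = 64):
        super().__init__()
        self.out_hw = out_hw
        self.proj = nn.Sequential(
            nn.Linear(num_tokens * token_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, out_hw * out_hw),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, token_dim) -> (batch, 1, out_hw, out_hw)
        return self.proj(tokens.flatten(1)).view(-1, 1, self.out_hw, self.out_hw)

depth_decoder = LightweightDecoder(num_tokens=4, token_dim=1024)

# Pretend the VLM produced these 4 continuous depth tokens for a batch of 2 images.
visual_tokens = torch.randn(2, 4, 1024)

# Supervision target produced offline by a frozen expert (e.g., Depth Anything).
expert_depth = torch.rand(2, 1, 64, 64)

# Reconstruction loss: the tokens must carry enough information for the decoder
# to reproduce the expert's depth map. Similar decoders and losses would exist
# for masks, edges, and DINO features.
loss = nn.functional.mse_loss(depth_decoder(visual_tokens), expert_depth)
loss.backward()

At inference these decoders can be skipped entirely, which is why the visual tokens add so little overhead.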


The Token Budget: Why ~20 Tokens?

You might wonder why roughly 20 tokens. This number reflects a careful balance:

  • 8 tokens for SAM-style mask prompts (instance recognition and 2D spatial perception)

  • 4 tokens for depth reconstruction (3D spatial relationships)

  • 4 tokens for edge detection (structural cues and geometry)

  • 4 tokens for DINO features (patch-level semantic representation)


This compact representation keeps the model efficient—adding minimal computational overhead—while capturing complementary perceptual properties (GitHub, Wakals/CoVT, November 2025).
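For reference, the allocation above can be written down as a simple configuration (the dictionary format is ours, not CoVT's actual config files):

# Illustrative token-budget allocation mirroring the breakdown above.
TOKEN_BUDGET = {
    "sam_mask_prompts": 8,   # instance recognition, 2D spatial perception
    "depth": 4,              # 3D spatial relationships
    "edges": 4,              # structural cues and geometry
    "dino_features": 4,      # patch-level semantics
}

assert sum(TOKEN_BUDGET.values()) == 20  # the "~20 token" budget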


Alignment Strategy

The alignment between visual tokens and their corresponding supervision signals is critical. CoVT uses a task-specific decoder approach where:

  • Each visual token learns to reconstruct specific aspects of the image (depth, edges, etc.)

  • The alignment forces tokens to learn representations useful for the final task

  • This captures richer detail than simply matching intermediate features


The framework uses autoregressive prediction during training, meaning the model learns to generate visual tokens in sequence, conditioning each token on previous ones. At inference, the model reasons directly in the continuous visual token space while optionally decoding dense predictions for interpretability (arXiv HTML, November 2025).
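One way to picture the combined objective is as the usual answer-prediction loss plus a weighted reconstruction term per token category. The helper below is an illustrative sketch; the default equal weighting and the names are our assumptions, not values from the paper.

from typing import Dict, Optional

import torch

def covt_style_objective(task_loss: torch.Tensor,
                         recon_losses: Dict[str, torch.Tensor],
                         weights: Optional[Dict[str, float]] = None) -> torch.Tensor:
    """Combine the answer loss with per-category reconstruction losses.

    `recon_losses` might contain entries such as {"masks": ..., "depth": ...,
    "edges": ..., "dino": ...}; the uniform default weighting is an assumption.
    """
    if weights is None:
        weights = {name: 1.0 for name in recon_losses}
    return task_loss + sum(w * recon_losses[name] for name, w in weights.items())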


The Four Visual Token Categories

CoVT selects its token categories based on core perception abilities identified in vision-centric research. Let's examine each category and its role.


1. Instance Segmentation Tokens (SAM-based)

These tokens provide instance-level position and shape information. They're based on the Segment Anything Model (SAM), released by Meta AI in April 2023.


What SAM does: SAM can identify the precise location of either specific objects or every object in an image. It was trained on SA-1B, a massive dataset containing over 1 billion masks across 11 million images (Meta AI Research, April 2023). SAM's promptable design allows it to transfer to new image distributions zero-shot, often matching or exceeding supervised methods.


How CoVT uses it: The 8 SAM-based tokens act as mask prompts, encoding where objects are and what shapes they take. This endows the vision-language model with:

  • Instance recognition capability

  • 2D spatial perception

  • Object boundary awareness
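For a sense of what this supervision looks like in practice, the official segment_anything package can generate mask targets offline during data preparation. This is a sketch: the point prompt and how CoVT batches these masks are our assumptions, and the checkpoint is the same one downloaded in the implementation guide below.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Generate mask supervision targets offline with a frozen SAM (sketch).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)

# Prompt SAM with a single foreground point; it returns candidate masks.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

# These masks become the targets that the 8 SAM-style tokens must be able to
# reproduce through a lightweight decoder during training.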


2. Depth Tokens (Depth Anything-based)

These tokens provide pixel-level depth information, giving the model 3D spatial understanding.


What Depth Anything does: Depth Anything, introduced in January 2024 and improved to V2 in June 2024, represents a breakthrough in monocular depth estimation. The model was trained on roughly 62 million unlabeled images and achieves impressive zero-shot generalization (arXiv, 2401.10891, April 2024; 2406.09414, October 2024).


Depth Anything V2 produces significantly finer and more robust depth predictions through three key practices:

  1. Replacing labeled real images with synthetic images

  2. Scaling up the teacher model capacity

  3. Teaching via large-scale pseudo-labeled real images


How CoVT uses it: The 4 depth tokens enable the model to:

  • Figure out 3D spatial relationships

  • Determine which objects are closer or farther

  • Understand depth ordering in complex scenes


This directly addresses one of the biggest weaknesses in text-only reasoning: the inability to precisely communicate spatial depth.
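As with SAM, depth targets can be produced offline with an off-the-shelf Depth Anything checkpoint. The Hugging Face pipeline below is one convenient route; the model identifier is an assumption and should be checked against the official release.

from PIL import Image
from transformers import pipeline

# One way to produce depth supervision targets (sketch). The model id is an
# assumption; substitute whichever Depth Anything checkpoint you actually use.
depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

image = Image.new("RGB", (640, 480))   # stand-in for a real photo
result = depth_estimator(image)        # returns a depth visualization and a tensor

depth_map = result["predicted_depth"]  # dense target for the 4 depth tokens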


3. Edge Tokens (PIDINet-based)

These tokens provide geometry-level structural details, helping detect edges and boundaries.


What PIDINet does: PIDINet focuses on pixel-level edge detection, identifying boundaries between objects and regions. Edge information is crucial for understanding object structure and spatial layout.


How CoVT uses it: The 4 edge tokens help the model:

  • Detect structural cues

  • Identify thin objects (like chair legs)

  • Provide additional 2D spatial information

  • Understand transparent or reflective surfaces better


4. Semantic Feature Tokens (DINO-based)

These tokens provide patch-level semantic representations of the image.


What DINO does: DINO (self-DIstillation with NO labels) is a self-supervised learning method introduced by Facebook AI and Inria in April 2021. The method produces Vision Transformer features that contain explicit information about semantic segmentation—something that doesn't emerge as clearly with supervised training or convolutional networks (arXiv, 2104.14294, May 2021).


DINO features are excellent for:

  • Unsupervised object segmentation

  • Fine-grained classification

  • Semantic understanding without labels


Remarkably, DINO achieved 78.3% top-1 accuracy on ImageNet using just a simple k-NN classifier (Meta AI Blog, 2021).


How CoVT uses it: The 4 DINO tokens capture:

  • Deep semantic information mining

  • Patch-level representations of image content

  • Rich perceptual understanding


Together, these four token categories provide complementary perceptual properties that span from low-level edges to high-level semantics.
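To make the last of these cues concrete, patch-level DINO features can be extracted with the publicly released models via torch.hub; how CoVT actually pools them into 4 tokens is not spelled out here, so the 2x2 pooling below is purely our illustration.

import torch

# Extract patch-level DINO features (sketch). The hub entry point is the official
# facebookresearch/dino one; the pooling into 4 tokens is our own guess.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino.eval()

image = torch.randn(1, 3, 224, 224)  # stand-in for a normalized RGB image

with torch.no_grad():
    # Last-block output: (batch, 1 + 14*14 patches, 384); drop the CLS token.
    feats = dino.get_intermediate_layers(image, n=1)[0]
    patch_tokens = feats[:, 1:, :]

# One plausible reduction to 4 tokens: average-pool the 14x14 patch grid to 2x2.
grid = patch_tokens.reshape(1, 14, 14, -1).permute(0, 3, 1, 2)   # (1, 384, 14, 14)
four_tokens = torch.nn.functional.adaptive_avg_pool2d(grid, 2)   # (1, 384, 2, 2)
four_tokens = four_tokens.flatten(2).transpose(1, 2)             # (1, 4, 384)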


Training Pipeline and Data Framework

CoVT employs a sophisticated four-stage training curriculum designed to progressively teach the model how to understand, generate, and use visual tokens effectively.


Stage 1: Token Comprehension

In the first stage, the model learns the meaning of visual tokens. The training data includes examples where visual tokens are paired with their corresponding visual outputs (depth maps, segmentation masks, etc.).


This stage helps the model understand what each type of visual token represents without yet generating them.


Stage 2: Token Generation

The second stage teaches the model to generate visual tokens. The VLM learns to autoregressively predict visual tokens that, when decoded, reconstruct the supervision signals from the vision experts.


At this point, the model begins to internalize the relationship between images and their corresponding visual representations.


Stage 3: Reasoning Integration

Stage three is where the magic happens: the model learns to integrate visual tokens into its reasoning process. Instead of just generating tokens, the model now uses them to condition its next-token predictions and reason toward the final answer.


This creates the actual "chain of visual thought"—a sequence of visual and textual reasoning steps.


Stage 4: Robust Adaptation

The final stage trains the model to efficiently select and utilize visual thinking tokens even when some types are missing. This improves robustness and prevents over-reliance on any single token category.


LoRA Fine-Tuning

The VLM backbone is fine-tuned using LoRA (Low-Rank Adaptation), while all projection layers remain trainable. This approach keeps the training efficient while preserving the base model's capabilities (arXiv PDF, November 2025).
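Here is a minimal sketch of that setup with the peft library, assuming a Hugging Face-style backbone such as LLaVA; the rank, target modules, and especially the projection-layer name are illustrative assumptions, not CoVT's published settings.

from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

# Load a LLaVA-style VLM backbone (any supported checkpoint works here).
base_model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# LoRA on the attention projections; hyperparameters are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base_model, lora_config)

# Keep the visual-token projection layers fully trainable ("visual_token_proj"
# is a placeholder name, not the actual module name in the CoVT codebase).
for name, param in model.named_parameters():
    if "visual_token_proj" in name:
        param.requires_grad = True

model.print_trainable_parameters()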


During inference, dense predictions are decoded only when interpretability is desired. Otherwise, reasoning occurs entirely in the latent visual space, maintaining efficiency.


Performance Results and Benchmarks

The proof is in the numbers. CoVT delivers consistent, substantial improvements across more than ten diverse perception benchmarks.


CV-Bench (Cambrian Vision-Centric Benchmark)

CV-Bench tests vision-centric perceptual abilities across multiple dimensions. CoVT achieved:

  • 5.5% overall gain on CV-Bench

  • 14.0% improvement on the depth sub-task specifically


This massive jump in depth estimation demonstrates that CoVT's visual tokens are genuinely capturing 3D spatial information that text-only models miss (arXiv, 2511.19418, November 2025).


HRBench (High-Resolution Benchmark)

HRBench evaluates models on high-resolution images requiring fine-grained perception:

  • 4.5% overall gain on HRBench


This improvement suggests CoVT excels at capturing fine details—thin objects, small holes, and intricate layouts.


Additional Benchmark Performance

When integrated into strong VLMs like Qwen2.5-VL and LLaVA, CoVT consistently improved performance:

  • MMVP (Multimodal Visual Patterns): 3-10% improvement

  • RealWorldQA: 5-12% gain

  • MMStar: 4-8% improvement

  • WorldMedQA: 6-11% gain


These gains span diverse domains—from medical imaging to real-world visual question answering—demonstrating CoVT's generalizability (Hugging Face Papers, November 2025).


Comparison with Text-Only Chain-of-Thought

Here's a startling finding: forcing a VLM to use text-only Chain-of-Thought on perception-heavy tasks often degrades performance compared to no reasoning chain at all.


This strongly supports CoVT's hypothesis: you cannot adequately discretize continuous visual information into text without losing critical fidelity. The "text bottleneck" is real, and it's costing models their perceptual accuracy (alphaXiv, November 2025).


BLINK Benchmark Results

Other research comparing perception tokens (like Aurora, a concurrent approach) found impressive gains on BLINK:

  • Aurora achieved +10.8% on BLINK counting tasks

  • +6% improvement on relative depth perception


While these numbers come from a different but related approach (Aurora uses VQ-VAE tokenization), they validate the broader concept: visual reasoning tokens dramatically improve perceptual performance (Aurora-Perception.github.io, 2024).


Real-World Applications

CoVT's improvements translate to tangible real-world benefits across numerous domains.


1. Robotics and Autonomous Navigation

Robots need precise spatial understanding to navigate cluttered environments. CoVT's enhanced depth perception and spatial reasoning enable:

  • Better obstacle avoidance through accurate distance estimation

  • Improved manipulation of objects by understanding 3D spatial relationships

  • More reliable path planning in complex indoor environments


Segmenting objects accurately helps robots distinguish between navigable spaces and obstacles, a capability that's critical but often fails with text-only reasoning (Meta AI Blog, Segment Anything, April 2023).


2. Medical Image Analysis

Medical imaging demands fine-grained visual understanding. Research has already demonstrated the potential of chain-of-visual-thought reasoning in chest X-ray diagnosis.


CoVT-CXR, a chain-of-visual-thought approach for chest X-ray diagnosis, emulates doctors' multi-modal reasoning by:

  • Breaking clinical reports into individual descriptions

  • Connecting rationales to visual prompts (masks, landmarks, bounding boxes)

  • Illuminating the visual reasoning behind diagnoses


This makes automated radiograph analysis more interpretable and accurate (OpenReview, CoVT-CXR, October 2024).


3. Augmented Reality

AR applications could leverage CoVT to:

  • Identify everyday items through AR glasses

  • Provide contextual reminders and instructions

  • Understand spatial relationships in real-time

  • Segment scenes for realistic object placement


Meta's vision includes using models like SAM—which powers CoVT's instance tokens—to identify objects via AR glasses, prompting users with relevant information (Meta AI Blog, April 2023).


4. Agricultural Technology

Farmers could benefit from CoVT-powered systems that:

  • Count plants or fruits with high precision

  • Assess crop health through fine-grained visual analysis

  • Detect pest damage or disease patterns

  • Monitor spatial distribution of crops for optimal harvesting


The ability to accurately count and locate objects—a weakness in text-only models—becomes a strength with visual tokens.


5. Content Creation and Video Editing

CoVT's segmentation and depth understanding enable:

  • Automated background removal with better edge handling

  • Depth-aware video effects

  • Precise object selection in cluttered scenes

  • Better understanding of scene composition


6. Scientific Research

Biologists and other scientists analyzing visual data could use CoVT for:

  • Counting cells or organisms in microscopy images

  • Understanding spatial relationships in experimental setups

  • Detecting subtle structural changes

  • Analyzing complex visual patterns


Comparison with Other Methods

CoVT doesn't exist in isolation. Let's see how it stacks up against related approaches.


Visual Chain-of-Thought (VCoT)

Earlier visual CoT methods attempted to use text descriptions of visual reasoning steps. However, they suffered from the text bottleneck—losing fine-grained perceptual information in the conversion to language.


CoVT advantage: Maintains reasoning in continuous visual space.


Multimodal Chain-of-Thought (MCoT)

MCoT methods combine text and images but typically don't create intermediate visual representations that are specifically optimized for perceptual tasks.


CoVT advantage: Purpose-built visual tokens encoding depth, edges, segmentation, and semantics.


Aurora

Aurora (released in 2024) shares conceptual similarities with CoVT. Both use visual tokens to enhance perception. However, Aurora employs VQ-VAE (Vector Quantized Variational Autoencoder) to transform intermediate representations into discrete tokenized formats for depth and bounding boxes.


Key differences:

  • CoVT uses continuous visual tokens while Aurora uses discrete VQ-VAE tokens

  • CoVT integrates tokens directly into the reasoning chain

  • Aurora focuses specifically on depth and counting tasks


Aurora achieved impressive gains: +10.8% on BLINK counting, +11.3% on CV-Bench, and +8.3% on SEED-Bench (Aurora-Perception.github.io, 2024).


CoVT advantage: Continuous representations may preserve more fine-grained information, and CoVT's approach is more general-purpose.


Mirage

Mirage uses latent imagination for visual reasoning tasks, generating intermediate visual representations internally.


CoVT advantage: More structured approach with specific token categories aligned to vision experts.


VisReason Dataset

The VisReason dataset (2025) provides visual chain-of-thought annotations for training, but focuses on explicit visual rationale generation rather than compact continuous tokens.


CoVT advantage: More efficient representation (20 tokens vs. full images) and faster inference.


Comparison Summary Table

| Method | Token Type | External Tools | 3D Awareness | Dense Cues | Performance on Depth |
|---|---|---|---|---|---|
| Text-only CoT | Discrete text | No | No | No | Poor |
| VCoT | Text + images | Sometimes | Limited | No | Fair |
| Aurora | Discrete (VQ-VAE) | No | Yes | Yes | Good |
| Mirage | Continuous latent | No | Limited | Limited | Fair |
| CoVT | Continuous visual | No | Yes | Yes | Excellent |

CoVT uniquely satisfies all desired properties: continuous visual reasoning, dense cues, 3D awareness, and tool-free operation (arXiv PDF, Table 1, November 2025).


Limitations and Challenges

Despite its impressive results, CoVT faces several challenges and limitations.


1. Scalability Concerns

High-dimensional visual states stress attention and memory budgets in chain architectures. Current approaches like hierarchical masking and dynamic token selection help mitigate these overheads, but scaling to very long visual reasoning chains remains challenging (Emergent Mind, 2024).


2. Interpretability vs. Fidelity Trade-off

While explicit images or code-driven diagrams maximize interpretability, continuous token chains can become opaque to end-users. There's a tension between:

  • Human interpretability: Understanding what the model is "thinking"

  • Model fidelity: Maintaining precise perceptual information


CoVT addresses this by optionally decoding visual tokens into human-readable dense predictions (depth maps, segmentation masks), but this adds computational cost.


3. Limited to Visual Modality

CoVT focuses exclusively on visual reasoning. Extending the approach to other modalities (audio, sensor data, time-series information) would require substantial adaptation.


4. Training Complexity

The four-stage training curriculum, while effective, adds complexity compared to simpler training approaches. Researchers need:

  • Multiple vision expert models (SAM, Depth Anything, PIDINet, DINO)

  • Carefully designed data at each stage

  • Significant computational resources for training


5. Token Budget Constraints

The ~20 token budget is fixed. For extremely complex scenes or tasks requiring very fine-grained reasoning, this budget might prove insufficient. However, increasing tokens would reduce efficiency—one of CoVT's key advantages.


6. Dependency on Vision Experts

CoVT's performance depends on the quality of the vision experts used during training. If SAM, Depth Anything, or other models have biases or limitations, those could transfer to CoVT.


7. Human-Model Collaboration

The practical fusion of human intervention (error pruning, scaffolding) with automated visual-thought chains needs further protocol standardization. How should humans interact with these systems? When should they intervene? These questions remain open (Emergent Mind, 2024).


8. Adaptive Chain Length

Current systems rarely learn when or how to terminate the reasoning chain or switch modalities. Adaptive strategies via gating networks or learned halting mechanisms remain nascent.


Future Outlook

The trajectory of visual reasoning research suggests several promising directions.


Unified Backbones

The trend toward transformers capable of joint image, text, and token reasoning with dynamic expert routing will accelerate scalable CoVT realization across domains. Models like BAGEL, Qwen2.5-VL, and Emu3 point toward unified architectures (Emergent Mind, 2024).


Tool Integration and Modular Reasoning

Embedding external visual tools (object detectors, segmenters, super-resolution models) directly as chain-of-thought steps could enhance CoVT's capabilities while maintaining its self-contained approach.


Video Understanding

Extending CoVT to video—maintaining temporal consistency across frames while reasoning with visual tokens—represents a natural next step. Early work on models like SAM 2 (Segment Anything for video) demonstrates the feasibility (Meta AI, SAM 2, 2024).


Multi-Modal Fusion

Combining CoVT with other modalities (audio, text, sensor data) could enable more comprehensive scene understanding. Imagine a robot that reasons visually about depth while simultaneously processing audio cues about location.


Efficiency Improvements

Research into more efficient visual token representations, perhaps using techniques like Finite Scalar Quantization (FSQ) or improved vector quantization methods, could reduce the computational overhead further (arXiv, 2309.15505, October 2023).


Enhanced Interpretability

Developing better methods for visualizing and explaining visual thought chains would make CoVT more trustworthy and debuggable. This is particularly important for safety-critical applications like medical diagnosis or autonomous driving.


Domain-Specific Applications

We'll likely see specialized CoVT variants for specific domains:

  • Medical CoVT: Optimized for radiological analysis

  • Industrial CoVT: Focused on manufacturing inspection

  • Agricultural CoVT: Tailored for crop monitoring

  • Retail CoVT: Specialized for inventory and shelf analysis


Standardization and Benchmarks

The community needs standardized benchmarks specifically designed to evaluate visual reasoning capabilities. Datasets like VisReason (2025) represent early steps, but more comprehensive evaluation frameworks will emerge.


Practical Implementation Guide

Want to work with CoVT or similar visual reasoning approaches? Here's how to get started.


Prerequisites

Knowledge Requirements:

  • Understanding of vision-language models

  • Familiarity with transformers and attention mechanisms

  • Basic knowledge of computer vision concepts (segmentation, depth estimation)


Technical Requirements:

  • Python 3.8+

  • PyTorch (typically 2.0+)

  • GPU with at least 24GB VRAM (for training)

  • Access to vision expert models (SAM, Depth Anything, etc.)


Step 1: Set Up Your Environment

# Clone the CoVT repository
git clone https://github.com/Wakals/CoVT.git
cd CoVT

# Install dependencies
pip install -r requirements.txt

# Download pre-trained vision experts
# SAM checkpoint
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# Depth Anything checkpoint
# Available from: https://github.com/LiheYoung/Depth-Anything

Step 2: Prepare Your Data

CoVT training requires datasets formatted for each of the four stages. Typical structure:

data/
├── stage1_comprehension/
│   ├── images/
│   └── visual_tokens/
├── stage2_generation/
├── stage3_reasoning/
└── stage4_adaptation/

Step 3: Train Vision Experts (or Use Pre-trained)

Most implementations use pre-trained vision experts:

  • SAM: Available from Meta AI's Segment Anything repository

  • Depth Anything: Download from the official GitHub

  • PIDINet: Available from edge detection repositories

  • DINO: Use Facebook Research's DINO models


Step 4: Four-Stage Training

Follow the curriculum:

# Pseudocode for training stages
# Stage 1: Token Comprehension
train_stage1(model, comprehension_data)

# Stage 2: Token Generation  
train_stage2(model, generation_data)

# Stage 3: Reasoning Integration
train_stage3(model, reasoning_data)

# Stage 4: Robust Adaptation
train_stage4(model, adaptation_data)

Each stage typically requires thousands of training iterations.


Step 5: Inference

For inference, you can use visual tokens with or without decoding:

# Efficient inference (visual tokens only)
output = model.generate(image, use_visual_tokens=True, decode_visual=False)

# Interpretable inference (decode visual tokens)
output, visual_predictions = model.generate(
    image, 
    use_visual_tokens=True, 
    decode_visual=True
)
# visual_predictions contains depth maps, segmentation, etc.

Best Practices

  1. Start with pre-trained models: Training from scratch is resource-intensive

  2. Use LoRA fine-tuning: More efficient than full fine-tuning

  3. Monitor token usage: Ensure the model uses all token categories

  4. Validate on diverse benchmarks: Test across CV-Bench, MMVP, etc.

  5. Visualize visual tokens: Decode them periodically to ensure meaningful representations

  6. Experiment with token budgets: Try different allocations across categories


Common Pitfalls to Avoid

  • Skipping training stages: All four stages are important

  • Insufficient GPU memory: Visual token decoding is memory-intensive

  • Ignoring data quality: Poor vision expert predictions lead to poor tokens

  • Over-reliance on one token type: Ensure balanced usage

  • Not validating on held-out data: Overfitting is a real risk


FAQ


1. How does CoVT differ from standard Chain-of-Thought prompting?

Standard Chain-of-Thought uses only text tokens for reasoning. CoVT introduces continuous visual tokens that preserve perceptual information (depth, edges, segmentation) that would be lost in text conversion. This enables much better performance on spatial reasoning and fine-grained perception tasks.


2. Can I use CoVT with my existing vision-language model?

Yes, but it requires training. CoVT can be integrated into existing models like LLaVA and Qwen2.5-VL through the four-stage training curriculum. However, this process requires computational resources and access to vision expert models.


3. How much does CoVT improve performance?

Performance improvements range from 3% to 16% across different benchmarks, with the largest gains (14%) on depth-related tasks. The exact improvement depends on the base model and the specific task.


4. Does CoVT require external tools during inference?

No. While CoVT learns from external vision experts (SAM, Depth Anything, etc.) during training, at inference time it generates visual tokens internally without calling external models. This makes it efficient and self-contained.


5. What's the computational cost of using CoVT?

CoVT adds minimal overhead at inference time because it uses only ~20 compact visual tokens. The main computational cost comes during training, where you need to run vision expert models to generate supervision signals.


6. Can CoVT handle video, or just images?

The current CoVT framework focuses on images. However, the architecture could potentially be extended to video by maintaining temporal consistency across frames, similar to how SAM 2 extended SAM to video segmentation.


7. Is the code open source?

Yes. The CoVT framework code is available on GitHub under an open-source license. Pre-trained models are also being released to facilitate research and applications.


8. What are the main limitations of CoVT?

Main limitations include: (1) scalability to very long reasoning chains, (2) interpretability of continuous tokens, (3) dependency on vision expert quality, (4) fixed token budget, and (5) training complexity. See the Limitations section for details.


9. How does CoVT compare to Aurora?

Both use visual reasoning tokens, but CoVT uses continuous tokens while Aurora uses discrete VQ-VAE tokens. CoVT provides a more general framework with four token categories, while Aurora focuses on depth and counting. Both show substantial performance gains.


10. Can CoVT explain its reasoning to humans?

Partially. CoVT can optionally decode its visual tokens into human-readable outputs (depth maps, segmentation masks, edge maps). However, the continuous token space itself is not directly interpretable. This represents a trade-off between model fidelity and human interpretability.


11. What datasets are used to train CoVT?

CoVT can be trained on standard vision-language datasets, but they need to be augmented with supervision signals from vision experts. The training data should span diverse scenes and scenarios to ensure good generalization.


12. Does CoVT work for all types of vision tasks?

CoVT excels at tasks requiring spatial reasoning, depth perception, object counting, and fine-grained visual understanding. It may offer less benefit for high-level semantic tasks where text-based reasoning already performs well.


13. How many parameters does CoVT add to a base model?

The projection layers for visual tokens add relatively few parameters (typically a few million) compared to the base VLM's billions. The exact number depends on the implementation and token dimensionality.


14. Can I fine-tune CoVT for my specific domain?

Yes. After pre-training with the four-stage curriculum, you can fine-tune CoVT on domain-specific data (medical images, satellite imagery, etc.) to improve performance for your application.


15. What's the inference speed of CoVT compared to standard VLMs?

CoVT maintains efficient inference by using only ~20 compact tokens. The speed difference is minimal (typically <10% slower) compared to baseline VLMs, especially when visual tokens aren't decoded to full predictions.


Key Takeaways

  1. CoVT solves the text bottleneck by enabling vision-language models to reason through continuous visual tokens instead of relying solely on text.

  2. Compact yet powerful: Just ~20 visual tokens encode rich perceptual cues including depth, segmentation, edges, and semantic features.

  3. Substantial performance gains: Improvements of 3-16% across major benchmarks demonstrate that visual reasoning beats text-only approaches for perception tasks.

  4. No external tools needed: CoVT generates visual tokens internally at inference, maintaining efficiency and self-containment.

  5. Four key token categories: Instance segmentation (SAM), depth (Depth Anything), edges (PIDINet), and semantics (DINO) provide complementary perceptual information.

  6. Structured training curriculum: The four-stage training process progressively teaches comprehension, generation, reasoning integration, and robust adaptation.

  7. Works with existing models: CoVT integrates with popular vision-language models like Qwen2.5-VL and LLaVA through fine-tuning.

  8. Real-world impact: Applications span robotics, medical imaging, AR, agriculture, and scientific research—anywhere precise spatial understanding matters.

  9. Text-only CoT can hurt: Research shows that forcing text-only reasoning on perception tasks often degrades performance compared to no reasoning chain.

  10. Future is unified: The trend toward unified backbones capable of seamless image-text-token reasoning will make visual reasoning approaches increasingly powerful.


Actionable Next Steps

  1. Explore the research paper: Read the full CoVT paper on arXiv (2511.19418) to understand the technical details and experimental methodology.

  2. Try the GitHub repository: Clone the CoVT repository and experiment with pre-trained models on your own images to see visual reasoning in action.

  3. Benchmark on your tasks: If you work with vision-language models, evaluate CoVT on your specific use cases to quantify potential improvements.

  4. Experiment with vision experts: Download and play with SAM, Depth Anything, DINO, and edge detection models to understand what CoVT learns from each.

  5. Join the community: Engage with researchers working on visual reasoning through GitHub issues, academic forums, and AI research communities.

  6. Contribute improvements: If you identify ways to enhance CoVT's efficiency, interpretability, or performance, contribute back to the open-source project.

  7. Develop domain applications: Consider how visual reasoning could benefit your specific domain (medical, agricultural, industrial, etc.) and prototype specialized implementations.

  8. Stay updated on related work: Follow research on visual reasoning, multimodal learning, and vision-language models to track rapid developments in this field.


Glossary

  1. Chain-of-Thought (CoT): A prompting technique that encourages large language models to generate intermediate reasoning steps before arriving at a final answer, significantly improving performance on complex reasoning tasks.

  2. Continuous Visual Tokens: Compact latent representations in continuous vector space that encode perceptual information like depth, segmentation, and edges, as opposed to discrete text tokens.

  3. DINO (self-DIstillation with NO labels): A self-supervised learning method for Vision Transformers that produces rich semantic features without requiring labeled training data.

  4. Depth Anything: A foundation model for monocular depth estimation that predicts the three-dimensional depth structure of a scene from a single 2D image, trained on 62+ million unlabeled images.

  5. Instance Segmentation: The task of identifying and delineating individual object instances in an image, providing both classification and precise pixel-level boundaries for each object.

  6. LoRA (Low-Rank Adaptation): An efficient fine-tuning technique that reduces trainable parameters by learning low-rank updates to model weights, enabling faster training with less memory.

  7. Multimodal Language Model (MLM): An AI model that can process and generate outputs across multiple modalities like text, images, audio, and video, understanding relationships between them.

  8. PIDINet: A pixel-level edge detection model that identifies boundaries and structural elements in images, useful for understanding object contours and spatial relationships.

  9. Segment Anything Model (SAM): A promptable segmentation foundation model from Meta AI that can identify and segment any object in an image based on various prompt types like points, boxes, or text.

  10. Text Bottleneck: The information loss that occurs when rich, continuous visual information is converted into discrete text tokens, limiting a model's ability to reason about fine-grained perceptual details.

  11. Vision-Language Model (VLM): An AI model that processes both visual (images/video) and textual information, enabling tasks like image captioning, visual question answering, and multimodal reasoning.

  12. Visual Token: A learned representation that encodes perceptual information about an image, such as depth, segmentation, edges, or semantic features, used in the reasoning process.

  13. VQ-VAE (Vector Quantized Variational Autoencoder): A neural network architecture that learns discrete latent representations of data by quantizing continuous vectors into a fixed codebook of embeddings.

  14. Zero-Shot Performance: A model's ability to perform tasks it wasn't explicitly trained for by leveraging knowledge learned during pre-training, without requiring task-specific fine-tuning.


Sources and References

  1. Qin, Y., Wei, B., Ge, J., Kallidromitis, K., Fu, S., Darrell, T., & Wang, X. (2025, November 24). Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens. arXiv. https://arxiv.org/abs/2511.19418

  2. Emergent Mind. (2024). Chain-of-Visual-Thought (COVT) in Multimodal AI. https://www.emergentmind.com/topics/chain-of-visual-thought-covt

  3. Wakals. (2025, November). CoVT GitHub Repository. https://github.com/Wakals/CoVT

  4. alphaXiv. (2025, November 30). Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens. https://www.alphaxiv.org/resources/2511.19418v1

  5. Hugging Face. (2025, November). Chain-of-Visual-Thought: Paper Discussion. https://huggingface.co/papers/2511.19418

  6. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., & Zhou, D. (2022, January 10). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv. https://arxiv.org/abs/2201.11903

  7. Ge, J., Luo, Y., Sun, Y., Bai, Y., Zhang, H., Fu, J., & Han, J. (2023, April 16). Chain of Thought Prompt Tuning in Vision Language Models. arXiv. https://arxiv.org/abs/2304.07919

  8. Chen, Y., Sikka, K., Cogswell, M., Ji, H., & Divakaran, A. (2024, June). Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models. Proceedings of NAACL 2024. https://aclanthology.org/2024.naacl-long.11/

  9. Bigverdi, M., Luo, Z., Hsieh, C.-Y., Shen, E., Chen, D., Shapiro, L. G., & Krishna, R. (2024). Perception Tokens Enhance Visual Reasoning in Multimodal Language Models (Aurora). https://aurora-perception.github.io/

  10. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021, April 29). Emerging Properties in Self-Supervised Vision Transformers. arXiv. https://arxiv.org/abs/2104.14294

  11. Meta AI Research. (2021). DINO and PAWS: Advancing the state of the art in computer vision. https://ai.meta.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training/

  12. Facebook Research. (2021). DINO GitHub Repository. https://github.com/facebookresearch/dino

  13. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., ... & Girshick, R. (2023, April 5). Segment Anything. arXiv. https://arxiv.org/abs/2304.02643

  14. Meta AI Research. (2023, April 5). Introducing Segment Anything: Working toward the first foundation model for image segmentation. https://ai.meta.com/blog/segment-anything-foundation-model-image-segmentation/

  15. Meta AI. (2024). Introducing Meta Segment Anything Model 2 (SAM 2). https://ai.meta.com/sam2/

  16. Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., & Zhao, H. (2024, January 19). Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. arXiv. https://arxiv.org/abs/2401.10891

  17. Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., & Zhao, H. (2024, June 13). Depth Anything V2. arXiv. https://arxiv.org/abs/2406.09414

  18. DepthAnything. (2024). Depth-Anything-V2 GitHub Repository. https://github.com/DepthAnything/Depth-Anything-V2

  19. OpenReview. (2024, October 4). CoVT-CXR: Building Chain of Visual Thought for Interpretable Chest X-Ray Diagnosis. https://openreview.net/forum?id=myZNJSpiK1

  20. Medium - CodeWinn. (2024, December). CoVT — The Research Breakthrough That Finally Lets VLMs Think in Images. https://medium.com/coding-nexus/covt-the-research-breakthrough-that-finally-lets-vlms-think-in-images-4e75046b0e7e



