Probing the 3D Awareness of Visual Foundation Models
As visual foundation models increasingly find their way into SLAM and 3D reconstruction pipelines, an important question arises: are these representations actually 3D-aware, or merely strong 2D image features? This post builds on insights from the paper “Probing the 3D Awareness of Visual Foundation Models” (El Banani et al., CVPR 2024; Google Research and the University of Michigan), which provides a systematic analysis of surface understanding and multi-view consistency in popular pretrained models.
Visual foundation models are amazing at recognizing, segmenting, and describing images.
But here’s the uncomfortable question:
Do these models actually understand the 3D world — or are they just very good at 2D pattern matching?
For anyone working on SLAM, SfM, multi-view stereo, pose estimation, or 3D reconstruction, this is not philosophical. If a representation is truly 3D-aware, you should be able to:
– infer surface geometry from a single view, and
– maintain consistency across viewpoints, which is the backbone of correspondence, triangulation, and mapping.
This paper sets out to test exactly that.
What do the authors mean by “3D awareness”?
They propose a clean, practical definition with two requirements:
1. Single-view surface understanding: From one image, the representation should encode:
• Depth
• Surface orientation (normals)
This is essentially: “Is there something like a 2.5D surface embedded in the features?”
2. Multi-view consistency: Across multiple views of the same scene or object:
• The same 3D point should map to consistent features
• Dense correspondence should work across viewpoint changes
This is the heart of SLAM, SfM, and reconstruction.
How they test it (and why the setup is important)
Instead of fine-tuning (which would muddy the waters), they:
• Freeze pretrained models
• Probe their internal representations using:
◦ Trainable dense decoders (for depth & normals)
◦ Zero-shot dense feature matching (for correspondence)
This matters because they’re testing what the representation already contains, not what it can be trained to do later.
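The probing setup above can be sketched in a few lines. The snippet below is a minimal, hypothetical version (not the authors' code): a stand-in frozen backbone plays the role of a pretrained ViT such as DINOv2, and only a lightweight dense decoder is trained to read depth off its patch features.

```python
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Stand-in for a pretrained ViT: maps an image to a grid of patch features.
    In a real probe you would load DINOv2/CLIP/etc. and freeze it."""
    def __init__(self, dim=64, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        for p in self.parameters():          # freeze: we probe, we don't fine-tune
            p.requires_grad = False

    def forward(self, x):
        return self.proj(x)                  # (B, dim, H/16, W/16)

class DenseDepthProbe(nn.Module):
    """Trainable decoder that reads depth off frozen features: a per-patch
    linear head, upsampled back to image resolution."""
    def __init__(self, dim=64, patch=16):
        super().__init__()
        self.head = nn.Conv2d(dim, 1, kernel_size=1)
        self.up = nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False)

    def forward(self, feats):
        return self.up(self.head(feats))     # (B, 1, H, W) predicted depth

backbone, probe = FrozenBackbone(), DenseDepthProbe()
img = torch.randn(2, 3, 224, 224)
gt_depth = torch.rand(2, 1, 224, 224)        # dummy supervision for the sketch

# Only the probe's parameters receive gradients; the backbone stays fixed.
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
pred = probe(backbone(img))
loss = nn.functional.l1_loss(pred, gt_depth)
loss.backward()
opt.step()
print(pred.shape)                            # torch.Size([2, 1, 224, 224])
```

The same pattern applies to the normals probe (a 3-channel head instead of 1); the key design choice is that the backbone's gradients are switched off, so any geometric signal the probe recovers must already be present in the frozen features.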
Models span very different training regimes:
• Self-supervised: DINO, DINOv2, iBOT, MAE
• Vision–language: CLIP, SigLIP
• Generative: Stable Diffusion
• Dense supervision: MiDaS (depth), SAM (segmentation)
Both scenes and objects are evaluated, which is key: objects remove many “scene priors” that SLAM systems often lean on.
Result 1: Single-view geometry — better than you might expect
Here’s the first surprise: Many foundation models do encode meaningful surface geometry, despite never being trained on 3D data.
• DINOv2 stands out:
◦ Sharp, detailed depth
◦ Accurate surface normals
◦ Competitive with dedicated monocular depth models
• Stable Diffusion is a close second (consistent with recent findings)
• CLIP and MAE largely fail:
◦ They rely on coarse priors (“floors are flat”, “walls are vertical”)
◦ Especially bad on object-centric data


Key insight for SLAM: Self-supervised discriminative learning produces features that are more geometrically meaningful than:
• classification training
• vision-language supervision
• even explicit depth supervision in some cases
That’s… non-obvious and important.
Result 2: Multi-view consistency — and this is where things break
Now the critical part. When asked to do dense correspondence across views:
• All models work reasonably well for small viewpoint changes
• Performance collapses rapidly as viewpoint differences grow
• This happens for:
◦ Objects
◦ Indoor scenes
◦ Even simple in-plane rotations

Some patterns:
• Stable Diffusion and SAM degrade very sharply
• DINOv2 and DeiT are more robust — but still far from reliable
• Absolute performance for wide-baseline matching is low across the board
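The zero-shot matching protocol behind these numbers is essentially nearest-neighbour search over frozen patch features. Here is a minimal sketch (a hypothetical cosine-similarity matcher; feature extraction is stubbed out with a random tensor so the snippet is self-contained):

```python
import torch
import torch.nn.functional as F

def dense_match(feat_a, feat_b):
    """Zero-shot dense correspondence: for every patch in view A, find the
    most similar patch in view B under cosine similarity.
    feat_a, feat_b: (C, H, W) feature maps from a frozen backbone."""
    C, H, W = feat_a.shape
    a = F.normalize(feat_a.reshape(C, -1), dim=0)   # (C, H*W), unit-norm columns
    b = F.normalize(feat_b.reshape(C, -1), dim=0)
    sim = a.T @ b                                   # (H*W, H*W) cosine similarities
    nn_idx = sim.argmax(dim=1)                      # best match in B per A patch
    ys, xs = nn_idx // W, nn_idx % W                # back to grid coordinates
    return torch.stack([ys, xs], dim=1)             # (H*W, 2) matched positions

# Sanity check: matching a feature map against itself recovers the identity.
feat = torch.randn(64, 14, 14)
matches = dense_match(feat, feat)
grid = torch.stack(
    torch.meshgrid(torch.arange(14), torch.arange(14), indexing="ij"), -1
).reshape(-1, 2)
print((matches == grid).all())  # tensor(True)
```

With no trainable parameters anywhere, this protocol measures exactly what the paper cares about: whether the same 3D point yields the nearest feature across views. The collapse under wide baselines means the argmax increasingly lands on semantically similar but geometrically wrong patches.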
This is the core message of the paper:
Current foundation models are view-consistent, not 3D-consistent.
They encode surfaces, but not a stable 3D world model.
Semantic vs geometric correspondence (a crucial distinction)
You might object: “But diffusion and DINO features work great for semantic correspondence!”
The authors agree — and explain why this is misleading.
• Semantic correspondence matches parts, not points
• It’s biased by:
◦ semantics
◦ symmetry
◦ dataset priors
• Under large viewpoint changes, models make systematic errors:
◦ confusing symmetric parts
◦ sticking to “the right leg” rather than the same leg
This explains:
• Why semantic matching benchmarks look good
• Why 3D reconstruction pipelines still break
For SLAM / SfM
Semantic consistency is not enough. You need geometric consistency, and current representations don’t reliably provide it.
Cross-task analysis: what correlates — and what doesn’t
The authors quantify this intuition:
• Depth ↔ surface normals: strongly correlated
• Single-view geometry ↔ semantic correspondence: moderately correlated
• Single-view geometry ↔ multi-view correspondence: weakly correlated
In other words:
• Knowing “what the surface looks like here”
does not imply
• knowing “where this surface is in 3D across views”
This mirrors what many of us see in practice.
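At its core, this cross-task comparison is a rank correlation over per-model scores: do the models that rank highest on one task also rank highest on another? A sketch with purely hypothetical, illustrative numbers (not the paper's actual results):

```python
def rank(xs):
    # Ranks without tie handling (fine for distinct illustrative scores).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rnk, i in enumerate(order):
        r[i] = rnk
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-model probe scores (NOT the paper's numbers), one entry
# per model, e.g. [DINOv2, StableDiffusion, CLIP, MAE, SAM]:
depth     = [0.80, 0.72, 0.40, 0.35, 0.55]
normals   = [0.78, 0.70, 0.42, 0.33, 0.50]
multiview = [0.45, 0.20, 0.25, 0.30, 0.15]

print(round(spearman(depth, normals), 2))    # 1.0  (same model ranking)
print(round(spearman(depth, multiview), 2))  # 0.1  (geometry ≠ multi-view)
```

In this toy example the depth and normals rankings agree perfectly while the multi-view ranking barely correlates, which is the qualitative shape of the paper's finding.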
What this means for SLAM and 3D reconstruction
This paper quietly explains a lot of real-world pain:
• Why monocular depth priors help tracking but not loop closure
• Why diffusion-based 3D reconstruction suffers from Janus artifacts
• Why learned features often fail under wide baselines
• Why we still need geometry-heavy pipelines despite powerful vision models
The takeaway
Foundation models today are excellent 2.5D image models, not 3D world models.
They’re great at:
• local surface reasoning
• semantics
• short-baseline matching
They struggle with:
• global pose
• wide-baseline correspondence
• consistent 3D aggregation
Final thought
This is not a negative paper — it’s a clarifying one. It tells us:
• What foundation models already give us for free
• Where classical geometry is still indispensable
• Where future representation learning needs to go if we want truly SLAM-ready features
If you’re building perception systems today, this paper is basically saying:
“Use these models — but don’t assume they solved geometry for you.”
References
📌 Self-Supervised Vision Models
These are models trained without labels and often form the backbone of dense features.
- Caron, M. et al. — Emerging Properties in Self-Supervised Vision Transformers (DINO). arXiv (2021)
- Oquab, M. et al. — DINOv2: Learning Robust Visual Features without Supervision. arXiv (2023)
- He, K. et al. — Masked Autoencoders Are Scalable Vision Learners (MAE). CVPR (2022)
- Zhou, J. et al. — iBOT: Image BERT Pre-Training with Online Tokenizer. ICLR (2022)
📌 Vision–Language Models
Representations grounded in both images and language.
- Radford, A. et al. — Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML (2021)
- Zhai, X. et al. — Sigmoid Loss for Language Image Pre-Training (SigLIP). ICCV (2023)
📌 Generative Models with Vision Features
These models are trained to generate images and implicitly capture visual structure.
- Rombach, R. et al. — High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion / LDM). CVPR (2022)
- Sinha, R. et al. — Unsupervised Dense Correspondence via Co-registered Diffusion Features. arXiv (2024)
📌 Dense Supervision Models
Explicit supervision for geometric outputs—important baselines for probing.
- Ranftl, R. et al. — Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer (MiDaS). TPAMI (2022)
- Kirillov, A. et al. — Segment Anything (SAM). ICCV (2023)
