
Probing the 3D Awareness of Visual Foundation Models

As visual foundation models increasingly find their way into SLAM and 3D reconstruction pipelines, an important question arises: are these representations actually 3D-aware, or merely strong 2D image features? This post builds on insights from the paper “Probing the 3D Awareness of Visual Foundation Models” (El Banani et al., 2024), which provides a systematic analysis of surface understanding and multi-view consistency in popular pretrained models.

Paper: Probing the 3D Awareness of Visual Foundation Models
Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun,
Leonidas Guibas, Justin Johnson, Varun Jampani

Visual foundation models are amazing at recognizing, segmenting, and describing images.
But here’s the uncomfortable question:

Do these models actually understand the 3D world — or are they just very good at 2D pattern matching?

For anyone working on SLAM, SfM, multi-view stereo, pose estimation, or 3D reconstruction, this is not philosophical. If a representation is truly 3D-aware, you should be able to:
– infer surface geometry from a single view, and
– maintain consistency across viewpoints, which is the backbone of correspondence, triangulation, and mapping.
This paper sets out to test exactly that.

What do the authors mean by “3D awareness”?

They propose a clean, practical definition with two requirements:
1. Single-view surface understanding: from one image, the representation should encode:
    • Depth
    • Surface orientation (normals)
    This is essentially asking: “Is there something like a 2.5D surface embedded in the features?”
2. Multi-view consistency: across multiple views of the same scene or object:
    • The same 3D point should map to consistent features
    • Dense correspondence should work across viewpoint changes
    This is the heart of SLAM, SfM, and reconstruction.

How they test it (and why the setup is important)

Instead of fine-tuning (which would muddy the waters), they:
• Freeze the pretrained models
• Probe their internal representations using:
    ◦ Trainable dense decoders (for depth & normals)
    ◦ Zero-shot dense feature matching (for correspondence)
This matters because they’re testing what the representation already contains, not what it can be trained to do later.
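
To make the setup concrete, here is a minimal sketch of the probing idea, assuming a PyTorch-style workflow. The backbone below is a stand-in for a real frozen foundation model, and the DepthProbe head and plain L1 loss are simplified illustrations rather than the paper's exact probe or objective.

```python
# Minimal sketch of the probing setup (not the authors' exact code).
# A frozen backbone produces dense features; only a small decoder
# ("probe") is trained to regress per-pixel depth.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthProbe(nn.Module):
    """Small trainable head on frozen dense features (simplified; no multi-scale fusion)."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1),  # one channel for depth (three for normals)
        )

    def forward(self, feats, out_hw):
        depth = self.head(feats)  # prediction at feature resolution
        return F.interpolate(depth, size=out_hw, mode="bilinear", align_corners=False)

# Stand-in for a frozen foundation model; in practice these would be
# e.g. DINOv2 or CLIP patch features reshaped to (B, C, H/14, W/14).
backbone = nn.Conv2d(3, 384, kernel_size=14, stride=14)
for p in backbone.parameters():
    p.requires_grad_(False)  # the representation itself is never updated

probe = DepthProbe(feat_dim=384)
opt = torch.optim.AdamW(probe.parameters(), lr=1e-4)

img = torch.randn(2, 3, 224, 224)      # dummy batch; real probing uses RGB-D data
gt_depth = torch.rand(2, 1, 224, 224)

feats = backbone(img)                   # frozen dense features, (2, 384, 16, 16)
pred = probe(feats, out_hw=(224, 224))
loss = F.l1_loss(pred, gt_depth)        # simple regression loss, for illustration only
loss.backward()
opt.step()
```

Only the probe's parameters receive gradients, so whatever depth accuracy comes out reflects what the frozen features already encode.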

Models span very different training regimes:
• Self-supervised: DINO, DINOv2, iBOT, MAE
• Vision–language: CLIP, SigLIP
• Generative: Stable Diffusion
• Dense supervision: MiDaS (depth), SAM (segmentation)
Both scenes and objects are evaluated, which is key: objects remove many “scene priors” that SLAM systems often lean on.

Result 1: Single-view geometry — better than you might expect

Here’s the first surprise: Many foundation models do encode meaningful surface geometry, despite never being trained on 3D data.
• DINOv2 stands out:
    ◦ Sharp, detailed depth
    ◦ Accurate surface normals
    ◦ Competitive with dedicated monocular depth models
• Stable Diffusion is a close second (consistent with recent findings)
• CLIP and MAE largely fail:
    ◦ They rely on coarse priors (“floors are flat”, “walls are vertical”)
    ◦ Especially bad on object-centric data

[Figures: depth estimation and surface normal comparisons across models]

Key insight for SLAM: Self-supervised discriminative learning produces features that are more geometrically meaningful than:
• classification training
• vision–language supervision
• even explicit depth supervision, in some cases
That’s… non-obvious and important.

Result 2: Multi-view consistency — and this is where things break

Now the critical part. When asked to do dense correspondence across views:
• All models work reasonably well for small viewpoint changes
• Performance collapses rapidly as viewpoint differences grow
• This happens for:
    ◦ Objects
    ◦ Indoor scenes
    ◦ Even simple in-plane rotations

[Figure: correspondence under viewpoint changes]

Some patterns:
• Stable Diffusion and SAM degrade very sharply
• DINOv2 and DeiT are more robust, but still far from reliable
• Absolute performance for wide-baseline matching is low across the board
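
To picture what this evaluation measures, here is a rough sketch of zero-shot dense matching between two views, assuming patch features from some frozen model. The helper dense_nn_matches and the cosine nearest-neighbour lookup are illustrative choices, not a claim about the paper's exact protocol.

```python
# Zero-shot dense correspondence sketch: for each location in view A,
# find its nearest neighbour in view B by cosine similarity.
# No weights are trained; this probes what the frozen features already encode.
import torch
import torch.nn.functional as F

def dense_nn_matches(feats_a, feats_b):
    """feats_*: (C, H, W) dense features from the same frozen model.
    Returns, for every location in A, the (y, x) of its best match in B."""
    C, H, W = feats_a.shape
    a = F.normalize(feats_a.reshape(C, -1), dim=0)   # (C, H*W), unit-norm per location
    b = F.normalize(feats_b.reshape(C, -1), dim=0)
    sim = a.t() @ b                                   # (H*W, H*W) cosine similarities
    idx = sim.argmax(dim=1)                           # best match in B for each A location
    ys, xs = idx // W, idx % W
    return torch.stack([ys, xs], dim=1).reshape(H, W, 2)

# Dummy features standing in for e.g. DINOv2 patch tokens of two views.
fa, fb = torch.randn(384, 16, 16), torch.randn(384, 16, 16)
matches = dense_nn_matches(fa, fb)  # compared against ground-truth correspondences
                                    # to measure recall as the viewpoint gap grows
```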

This is the core message of the paper:

Current foundation models are view-consistent, not 3D-consistent.

They encode surfaces, but not a stable 3D world model.

Semantic vs geometric correspondence (a crucial distinction)

You might object: “But diffusion and DINO features work great for semantic correspondence!”
The authors agree, and they explain why this is misleading:
• Semantic correspondence matches parts, not points
• It’s biased by:
    ◦ semantics
    ◦ symmetry
    ◦ dataset priors
• Under large viewpoint changes, models make systematic errors:
    ◦ confusing symmetric parts
    ◦ sticking to “the right leg” rather than the same leg

This explains:
• Why semantic matching benchmarks look good
• Why 3D reconstruction pipelines still break

For SLAM / SfM

Semantic consistency is not enough. You need geometric consistency, and current representations don’t reliably provide it.
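
As a concrete illustration of the distinction: a match is geometrically consistent only if it agrees with the reprojection of the same 3D point under known depth and relative pose. The helper reprojection_error, the intrinsics, the pose, and the pixel threshold below are made-up values for this sketch, not part of the paper.

```python
# Geometric-consistency check: does a predicted match land near where the
# 3D point actually reprojects in the second view?
import numpy as np

def reprojection_error(uv_a, match_uv_b, depth_a, K, R, t):
    """uv_a: pixel (u, v) in view A; match_uv_b: predicted match in view B.
    depth_a: metric depth at uv_a; K: 3x3 intrinsics; R, t: pose of B w.r.t. A."""
    u, v = uv_a
    p_a = depth_a * np.linalg.inv(K) @ np.array([u, v, 1.0])  # back-project to 3D in A
    p_b = R @ p_a + t                                          # transform into B's frame
    uvw = K @ p_b
    uv_proj = uvw[:2] / uvw[2]                                 # perspective projection
    return np.linalg.norm(uv_proj - np.asarray(match_uv_b))   # pixel distance

K = np.array([[500.0, 0, 112], [0, 500.0, 112], [0, 0, 1]])   # toy intrinsics
err = reprojection_error((120, 80), (130, 85), depth_a=2.0,
                         K=K, R=np.eye(3), t=np.array([0.1, 0.0, 0.0]))
is_geometric_inlier = err < 5.0   # a semantically plausible "part" match can still fail this
```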

Cross-task analysis: what correlates — and what doesn’t

The authors quantify this intuition:
• Depth ↔ surface normals: strongly correlated
• Single-view geometry ↔ semantic correspondence: moderately correlated
• Single-view geometry ↔ multi-view correspondence: weakly correlated

In other words:
Knowing “what the surface looks like here”
    does not imply
knowing “where this surface is in 3D across views”.

This mirrors what many of us see in practice.
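
If you want to run this kind of cross-task comparison for your own models, a minimal sketch is to score each model on each probing task and compute a rank correlation between tasks. The scores below are placeholders for illustration, not numbers from the paper.

```python
# Toy cross-task analysis: how well do model rankings on one probing task
# predict rankings on another? (Placeholder scores, not results from the paper.)
from scipy.stats import spearmanr

models = ["DINOv2", "StableDiffusion", "CLIP", "MAE", "SAM"]
depth_score     = [0.82, 0.78, 0.41, 0.38, 0.55]   # single-view depth probe (placeholder)
multiview_score = [0.35, 0.22, 0.15, 0.14, 0.18]   # wide-baseline correspondence (placeholder)

rho, p = spearmanr(depth_score, multiview_score)
print(f"Rank correlation between single-view and multi-view tasks: {rho:.2f} (p={p:.2f})")
```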

What this means for SLAM and 3D reconstruction

This paper quietly explains a lot of real-world pain:
• Why monocular depth priors help tracking but not loop closure
• Why diffusion-based 3D reconstruction suffers from Janus artifacts
• Why learned features often fail under wide baselines
• Why we still need geometry-heavy pipelines despite powerful vision models

The takeaway

Foundation models today are excellent 2.5D image models, not 3D world models.

They’re great at:
• local surface reasoning
• semantics
• short-baseline matching

They struggle with:
• global pose
• wide-baseline correspondence
• consistent 3D aggregation

Final thought

This is not a negative paper; it’s a clarifying one. It tells us:
• What foundation models already give us for free
• Where classical geometry is still indispensable
• Where future representation learning needs to go if we want truly SLAM-ready features

If you’re building perception systems today, this paper is basically saying:

“Use these models — but don’t assume they solved geometry for you.”


References

• Self-supervised models (DINO, DINOv2, iBOT, MAE): trained without labels; these often form the backbone of dense features.
• Vision–language models (CLIP, SigLIP): representations grounded in both images and language.
• Generative models (Stable Diffusion): trained to generate images; they implicitly capture visual structure.
• Densely supervised models (MiDaS for depth, SAM for segmentation): explicit supervision for dense geometric or mask outputs; important baselines for probing.
