[Paper Review] VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Key Summary

VGGT-SLAM is a dense RGB SLAM system that stitches the local submaps produced by VGGT across long monocular RGB sequences by optimizing SL(4) homographies, rather than Sim(3) transforms, in a factor graph.

Problem: VGGT memory limits on long RGB sequences

Solution: SL(4) submap factor graph

Evidence: pose, dense map, loop closure, ablation

One-Sentence Summary

The core idea of VGGT-SLAM is to treat the mismatch between VGGT submaps not as a simple scale/rotation/translation problem but as the projective ambiguity that arises with uncalibrated cameras, and to resolve it with an SL(4) factor graph.

Contribution 01

VGGT Submaps

Because of GPU memory limits, a long sequence is split into multiple VGGT submaps.

Contribution 02

Projective Ambiguity

Connects the fact that an uncalibrated reconstruction is, in general, ambiguous up to a 15-DOF projective transform to the SLAM alignment problem.

Contribution 03

SL(4) Factor Graph

Relative homographies and loop closures are globally optimized on the SL(4) manifold.

Contribution 04

Uncalibrated RGB SLAM

Performs dense mapping from monocular RGB without camera intrinsics or a consistent calibration.

Insight I Gained

This paper is not just engineering that bolts VGGT onto SLAM. Its key contribution is interpreting the failure mode of feed-forward reconstruction in the language of classical projective geometry and designing a manifold optimization that matches that interpretation.

Processing Pipeline

01 RGB Frames: uncalibrated monocular input

02 Keyframes: disparity-based selection

03 VGGT Submap: depth + pose + confidence

04 Relative H: 5-point RANSAC homography

05 Loop Closure: SALAD retrieval

06 SL(4) Graph: globally aligned dense map

Approach Comparison

VGGT

Strong on short batches

Dense reconstruction quality is high, but a long video is hard to process in one pass because of GPU memory limits.

Sim(3) Alignment

Sometimes insufficient

Matching only translation, rotation, and scale can leave projective distortion.

SL(4) Alignment

Corrects up to projective

A 15-DOF homography also handles shear, stretch, and perspective ambiguity.

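Detailed Paper Notes

What follows is a detailed reading that covers as much of the paper as possible. Background, notation, and supplementary material that fall outside the main thread are collapsed.

Problem: why extend VGGT into long RGB SLAM?

The abstract presents VGGT-SLAM as a system that performs dense SLAM with only an uncalibrated monocular RGB camera. It aligns the submaps produced by VGGT incrementally and globally, using an SL(4) optimization that accounts for projective ambiguity rather than plain Sim(3).

Fig. 1. Sim(3) vs SL(4) submap alignment: a case where Sim(3) alignment fails and SL(4) alignment corrects it. Because a VGGT submap is not always guaranteed to be a metric reconstruction, the figure shows a situation that needs SL(4) alignment, which is broader than Sim(3).

Abstract Core Claims

To scale VGGT's strong dense prior to long sequences, the geometry of submap fusion has to be reconsidered.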

Problem	Paper's interpretation	Solution
VGGT memory limit	Long videos are hard to process in one pass.	Split into submaps and align them globally.
Uncalibrated input	The reconstruction may not be metric; projective ambiguity can remain.	SL(4) homography alignment.
Long-sequence drift	Sequential factors alone accumulate error.	Add SALAD-based loop closures.

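Context: where feed-forward submaps hit length limits

The introduction starts from a practical limit: VGGT struggles to process many views at once. Building submaps that share overlapping frames and joining them with Sim(3) looks sufficient, but the paper points out that with an uncalibrated camera the reconstruction can be ambiguous up to a projective transform.

Problem Development

The key point is that the story does not end at “VGGT cannot be run on everything at once, so run it in pieces.”

Step	Naive intuition	Paper's correction
Submap split	Run VGGT over multiple windows.	Each submap may carry a different coordinate ambiguity.
Overlap frame	Shared frames provide correspondences.	Even with enough correspondences, the choice of transform group is the issue.
Sim(3)	Matching scale, rotation, and translation is enough.	Uncalibrated reconstruction may also need shear and perspective correction.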

The related work is organized into classical scene reconstruction, feed-forward scene reconstruction, MASt3R-SLAM, and SL group optimization. VGGT-SLAM uses a feed-forward dense prior, but designs its backend around classical projective geometry and Lie-group optimization.

Related Work Details

Literature Position

When comparing with similar SLAM systems, the most important difference is which transform group is used.

A. Classical SLAM

Centered on features, matching, BA, and SE(3)/Sim(3) backends.

B. Feed-forward reconstruction

Directly predicts dense points/depth, as in DUSt3R, MASt3R, and VGGT.

C. MASt3R-SLAM

Uncalibrated dense monocular SLAM, but centered on Sim(3) alignment.

D. SL group optimization

SL(3) is common for image homographies, but SL(4) factor-graph SLAM is new.

VGGT-SLAM Design Choice

The literature shows that even with a dense prior, the key question is how to choose the backend coordinate system and transformation group.

Axis	Prior approach	VGGT-SLAM choice
Geometry source	Feature matching or dense pair prediction.	Uses the VGGT dense prior as frame-level constraints.
Alignment group	SE(3), Sim(3), or homography-based alignment.	Aligns dense predictions with SL(4) projective transforms.
Backend	Pose graph / BA / dense alignment.	Combines the prior and the SLAM state in a projective factor graph.
Difference	Feed-forward results are often used as post-processing.	Places the prediction itself inside the factor-graph objective.

Mechanism: how SL(4) submaps are aligned in a graph

The method runs through keyframe selection, local submap generation, relative homography estimation, loop closure, and the SL(4) backend. From VGGT it uses dense depth and confidence, and builds the point cloud by inverse-projecting the depth with the camera head output.

VGGT review / dense outputs: a supporting note on the Kronecker-product linearization for reading the SL(4) backend. These auxiliary formulas are used to build the residual/Jacobian of the SL(4) factor graph and connect to the backend optimization from an implementation standpoint.

System Thread Summary

The stage that builds submaps and the stage that joins them are separate, and SL(4) enters at the joining stage.

Component	Role	Reading point
Keyframe selection	Selects a keyframe when the Lucas-Kanade disparity exceeds a threshold.	Sufficient parallax matters for stable depth and reconstruction.
Submap input	I_latest = {M_prior} ∪ I_latest ∪ I_loop	Frames from the previous submap and loop frames are fed in together.
Relative homography	Estimates Hⁱ_j ∈ SL(4) from dense correspondences on the overlap frame.	No separate correspondence estimation is needed.
Loop closure	Retrieves earlier keyframes with SALAD descriptors.	Adds non-sequential homography factors.
Backend	MAP optimization of absolute homographies on the SL(4) manifold.	Aligns the submaps into a global reconstruction.
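The keyframe rule in the table can be sketched as a mean-disparity test. This is a minimal illustration, assuming the matched 2D points already come from an optical-flow tracker such as Lucas-Kanade; the function names and the 25-pixel threshold are hypothetical and not taken from the paper.

```python
import math


def mean_disparity(prev_pts, curr_pts):
    """Mean pixel displacement between matched 2D points of two frames."""
    if not prev_pts or len(prev_pts) != len(curr_pts):
        raise ValueError("need two non-empty, equally sized point lists")
    total = sum(
        math.hypot(cx - px, cy - py)
        for (px, py), (cx, cy) in zip(prev_pts, curr_pts)
    )
    return total / len(prev_pts)


def is_keyframe(prev_pts, curr_pts, min_disparity_px=25.0):
    """Accept the current frame once tracked points have moved far enough.

    min_disparity_px is an assumed value for illustration only.
    """
    return mean_disparity(prev_pts, curr_pts) > min_disparity_px
```

A frame that barely moved relative to the last keyframe is skipped, which keeps enough parallax between the frames that are handed to VGGT.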

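Mechanism: how the SL(4) objective and homography are used

The equations can be read in order: the relative homography relation, closed-form homography estimation, the SL(4) factor-graph objective, and tangent-space linearization. The core equations are shown first, and the auxiliary numbered equations continue in the supporting-formulas section.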

$$\mathbf{X}_a^{\mathcal{S}_i}=\mathbf{H}_j^i\mathbf{X}_b^{\mathcal{S}_j}\tag{1}$$

Eq. (1). Overlapping submap alignment. Expresses the 3D point alignment relation between overlapping submaps.

$$\hat{\mathcal{H}}=\arg\min_{\mathbf{H}\in\mathrm{SL}(4)}\sum_{(i,j)\in\mathcal{L}}\Big\| \operatorname{Log}(\mathbf{H}_i^{-1}\mathbf{H}_j(\mathbf{H}_j^{i})^{-1}) \Big\|^2_{\Omega^{\mathbf{H}}_{ij}}\tag{3}$$

Eq. (3). SL(4) factor graph objective. Includes the relative homography constraints and loop closures.
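As a sanity check on the residual in Eq. (3), assume for illustration that each absolute homography $\mathbf{H}_i$ maps submap $\mathcal{S}_i$ into a common world frame $\mathcal{W}$ (the symbol $\mathcal{W}$ is introduced here and is not from the source). In the noise-free case, combining this with Eq. (1) gives:

$$\mathbf{X}^{\mathcal{W}}=\mathbf{H}_i\mathbf{X}_a^{\mathcal{S}_i}=\mathbf{H}_i\mathbf{H}_j^i\mathbf{X}_b^{\mathcal{S}_j},\qquad \mathbf{X}^{\mathcal{W}}=\mathbf{H}_j\mathbf{X}_b^{\mathcal{S}_j}\ \Longrightarrow\ \mathbf{H}_j=\mathbf{H}_i\mathbf{H}_j^i$$

$$\mathbf{H}_i^{-1}\mathbf{H}_j(\mathbf{H}_j^i)^{-1}=\mathbf{I}\ \Longrightarrow\ \operatorname{Log}(\mathbf{I})=\mathbf{0}$$

So a perfectly consistent graph has zero cost, and each factor penalizes how far the measured relative homography is from the one implied by the two absolute homographies.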

$$\hat{\mathcal{D}}=\arg\min_{\boldsymbol{\delta}\in\mathcal{D}}\sum_{(i,j)\in\mathcal{L}}\| e_{ij}+\mathbf{J}_i\boldsymbol{\delta}_i+\mathbf{J}_j\boldsymbol{\delta}_j \|^2_{\Omega_{ij}^{\mathbf{H}}},\quad e_{ij}=\operatorname{Log}(\mathbf{H}_i^{-1}\mathbf{H}_j(\mathbf{H}^i_j)^{-1}).\tag{5}$$

Eq. (5). LM update on the Lie group. Solves the linearized residual with LM and applies the update on the Lie group.

Transform Group Comparison

To understand why SL(4) is used, compare the range of ambiguity each group can express.

Group	DOF	Transform covered	Limit/role
SE(3)	6	Rotation + translation	Rigid alignment when metric scale already agrees
Sim(3)	7	SE(3) + scale	Suited to correcting monocular scale drift
SL(4)	15	3D projective homography	Also corrects uncalibrated projective ambiguity
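To make the SL(4) row concrete, the sketch below applies a 4×4 homography to a 3D point and rescales a homography to unit determinant. It is a minimal pure-Python illustration; the helper names and the example matrix are assumptions, not code from the paper.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4x4)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for c in range(n):
        minor = [row[:c] + row[c + 1:] for row in m[1:]]
        total += (-1) ** c * m[0][c] * det(minor)
    return total


def normalize_to_sl4(h):
    """Scale a 4x4 homography so that its determinant is 1 (needs det > 0)."""
    d = det(h)
    if d <= 0:
        raise ValueError("expected a positive determinant")
    s = d ** -0.25
    return [[s * v for v in row] for row in h]


def apply_homography(h, point):
    """Map a 3D point through a 4x4 homography and dehomogenize."""
    x = list(point) + [1.0]
    y = [sum(h[r][c] * x[c] for c in range(4)) for r in range(4)]
    if abs(y[3]) < 1e-12:
        raise ZeroDivisionError("point maps to infinity")
    return tuple(v / y[3] for v in y[:3])


# Shear in the top-left block and a non-trivial last row:
# neither effect can be expressed by a Sim(3) transform.
H = [[1.0, 0.2, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.1, 1.0]]
H_sl4 = normalize_to_sl4([[2.0 * v for v in row] for row in H])
```

Scaling a homography does not change where points land, which is why the 16 matrix entries leave only 15 degrees of freedom once the determinant is fixed to 1.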

Homography / tangent update support formulas

This section continues, in numbering order, the homography estimation and tangent-space update that were grouped together under the core equations.

$$\mathbf{A}_k\mathbf{h}=0\tag{2}$$

Eq. (2). Closed-form relative homography system. Estimates the relative 4×4 homography from overlapping point correspondences.
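One way to see where the linear system in Eq. (2) and the 5-point minimal sample come from is a standard DLT-style derivation. The layout below is an assumption for illustration (the paper's exact $\mathbf{A}_k$ may be arranged differently). Write $\mathbf{X}_a=(x_a,y_a,z_a,1)^\top$, let $\mathbf{X}_b$ be homogeneous, and let $\mathbf{h}_r^\top$ be the $r$-th row of $\mathbf{H}$, stacked as $\mathbf{h}=(\mathbf{h}_1^\top,\mathbf{h}_2^\top,\mathbf{h}_3^\top,\mathbf{h}_4^\top)^\top\in\mathbb{R}^{16}$. Eliminating the unknown homogeneous scale gives three equations per correspondence:

$$x_a\,\mathbf{h}_4^\top\mathbf{X}_b-\mathbf{h}_1^\top\mathbf{X}_b=0,\quad y_a\,\mathbf{h}_4^\top\mathbf{X}_b-\mathbf{h}_2^\top\mathbf{X}_b=0,\quad z_a\,\mathbf{h}_4^\top\mathbf{X}_b-\mathbf{h}_3^\top\mathbf{X}_b=0$$

$$\mathbf{A}_k=\begin{bmatrix}-\mathbf{X}_b^\top&\mathbf{0}^\top&\mathbf{0}^\top&x_a\mathbf{X}_b^\top\\ \mathbf{0}^\top&-\mathbf{X}_b^\top&\mathbf{0}^\top&y_a\mathbf{X}_b^\top\\ \mathbf{0}^\top&\mathbf{0}^\top&-\mathbf{X}_b^\top&z_a\mathbf{X}_b^\top\end{bmatrix}\in\mathbb{R}^{3\times16}$$

Since $\mathbf{H}$ has 16 entries defined up to scale, it has 15 degrees of freedom, and $5\times3=15$ equations are the minimum needed, which matches the 5-point RANSAC step in the pipeline.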

$$h(\boldsymbol{\xi}_i\oplus\boldsymbol{\delta}_i,\boldsymbol{\xi}_j\oplus\boldsymbol{\delta}_j)\simeq h(\boldsymbol{\xi}_i,\boldsymbol{\xi}_j)\oplus\{\mathbf{J}_i\boldsymbol{\delta}_i+\mathbf{J}_j\boldsymbol{\delta}_j\},\quad \boldsymbol{\xi}\oplus\boldsymbol{\delta}=\mathbf{H}\operatorname{Exp}(\boldsymbol{\delta})\tag{4}$$

Eq. (4). SL(4) tangent-space update. Defines the tangent-space increment and the manifold update.
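The update $\mathbf{H}\operatorname{Exp}(\boldsymbol{\delta})$ stays on the manifold because of a standard Lie-group fact. The tangent space of SL(4) is the set of traceless 4×4 matrices, and the determinant of a matrix exponential is the exponential of the trace:

$$\mathfrak{sl}(4)=\{\mathbf{A}\in\mathbb{R}^{4\times4}:\operatorname{tr}(\mathbf{A})=0\},\qquad \dim\mathfrak{sl}(4)=16-1=15,\qquad \det(\exp\mathbf{A})=e^{\operatorname{tr}(\mathbf{A})}=1$$

$$\det\big(\mathbf{H}\operatorname{Exp}(\boldsymbol{\delta})\big)=\det(\mathbf{H})\cdot1=1$$

Here $\operatorname{Exp}(\boldsymbol{\delta})$ is read as the matrix exponential of a traceless matrix built from the 15 entries of $\boldsymbol{\delta}$; which generator basis is used is an implementation choice not specified in these notes.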

Evidence: how pose, dense map, and loop closure are tested

The experiments consist of pose estimation, dense reconstruction, qualitative maps, and ablations. The main result is that SL(4) is advantageous over Sim(3) in some difficult scenes, and that overall it is competitive with, or better than, MASt3R-SLAM in the uncalibrated setting.

Evaluation Brief

Tables 1/2 show pose, Table 3 shows dense map quality, and Fig. 3 shows the loop closure and confidence threshold ablations.

7-Scenes Pose

Average ATE similar to MASt3R-SLAM*, confirming competitiveness as an uncalibrated dense SLAM system.

TUM RGB-D Pose

SL(4), w=32 gives the best uncalibrated average error.

Dense Reconstruction

Strong accuracy and Chamfer on 7-Scenes.

Qualitative Loop

Joins multiple submaps into a global map on the office/corridor loops.

Loop Closure Ablation

The error reduction from loop closure grows as the number of submaps increases.

Limitations

Planar scene degeneracy, outlier homographies, and possible 15-DOF drift.

Table 1. 7-Scenes ATE RMSE. Gray rows are calibrated baselines, * marks uncalibrated evaluation, green is best and light green is second best. On 7-Scenes, the table checks whether VGGT-SLAM reaches MASt3R-SLAM-level ATE even in the uncalibrated setting.

Fig. 2. Qualitative reconstruction and pose estimates: dense reconstruction and pose estimates for the 7-Scenes office and a custom 55 m corridor loop. Submap color indicates which submap each frame belongs to, showing whether multiple submaps are aligned into one dense map through loop closure.

Table 2. TUM RGB-D ATE RMSE. SL(4), w=32 has the lowest uncalibrated average error. On TUM, SL(4) with w=32 is the strongest on average, but cases where the homography is unstable, such as planar/floor scenes, remain a limitation.
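For reference, the ATE RMSE reported in Tables 1 and 2 can be sketched as the root-mean-square of per-frame position errors. This minimal version assumes the estimated and ground-truth trajectories are already time-associated and aligned to a common frame; the alignment step itself is omitted, and the function name is hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position error over already aligned, matched trajectories."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("need two non-empty, equally sized trajectories")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the metric squares each error before averaging, a few badly aligned submaps can dominate the score, which is one reason loop closure matters on long sequences.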

Table 3. Dense reconstruction evaluation on 7-Scenes. VGGT-SLAM shows strong accuracy and Chamfer results. The table checks dense point-cloud reconstruction quality in addition to the trajectory.
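The dense-map metrics can be illustrated with a brute-force sketch. It assumes one common convention: accuracy is the mean distance from predicted points to their nearest ground-truth point, completeness is the reverse, and Chamfer is their average. The paper's exact definitions may differ, and the function names are hypothetical.

```python
import math


def _mean_nn_distance(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def reconstruction_metrics(pred, gt):
    """Accuracy, completeness, and Chamfer distance for small point sets.

    Brute force is O(len(pred) * len(gt)); real point clouds need a KD-tree.
    """
    if not pred or not gt:
        raise ValueError("both point sets must be non-empty")
    accuracy = _mean_nn_distance(pred, gt)
    completeness = _mean_nn_distance(gt, pred)
    return {
        "accuracy": accuracy,
        "completeness": completeness,
        "chamfer": 0.5 * (accuracy + completeness),
    }
```

Dropping low-confidence points tends to improve accuracy while hurting completeness, which is the trade-off the confidence threshold controls.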

Fig. 3. Ablation studies: loop closure, submap size, and confidence threshold. The ATE reduction from loop closure grows as the number of submaps increases, and the confidence threshold creates a trade-off between reconstruction accuracy and completeness.

Usage / Limits: what to watch under planar degeneracy and outliers

The paper's limitations section stresses that 15-DOF homography estimation can become degenerate on planar points and that it is vulnerable to outliers in the VGGT points. Remaining work therefore includes ray-based matching that is more robust to depth error, a way to tell when Sim(3) suffices and when SL(4) is needed, and a unified real-time system that uses both optimizations together.

Thoughts

(In progress...)

Problem: why extend VGGT into long RGB SLAM?

VGGT-SLAM is presented as a dense RGB SLAM system for uncalibrated monocular cameras. It incrementally and globally aligns VGGT submaps, but uses SL(4) optimization rather than only Sim(3) because projective ambiguity can remain.

Abstract Core Claims

To scale VGGT's dense prior to long sequences, submap fusion geometry must be reconsidered.

Problem	Interpretation	Solution
VGGT memory limit	Long videos are hard to process in one inference.	Split into submaps and globally align them.
Uncalibrated input	Projective ambiguity may remain beyond metric reconstruction.	SL(4) homography alignment.
Long-sequence drift	Sequential factors alone accumulate error.	Add SALAD-based loop closures.

Context: where feed-forward submaps hit length limits

The introduction starts from a practical limit: VGGT cannot process very long videos in one shot. Splitting a sequence into overlapping submaps sounds natural, but the paper argues that uncalibrated cameras can leave projective ambiguity that Sim(3) cannot fully resolve.

Problem Development

The paper does not stop at “run VGGT in windows”; the alignment group is the real question.

Step	Naive intuition	Paper's correction
Submap split	Run VGGT over multiple windows.	Each submap may carry a different coordinate ambiguity.
Overlap frame	Shared frames give correspondences.	Correspondences are not enough; the transform group matters.
Sim(3)	Scale, rotation, and translation should suffice.	Uncalibrated reconstruction may need shear and perspective correction.

The related work connects classical scene reconstruction, feed-forward scene reconstruction, MASt3R-SLAM, and SL group optimization. VGGT-SLAM uses a feed-forward dense prior, but its backend is driven by classical projective geometry and Lie-group optimization.

Related Work Details

Literature position

Compared with similar SLAM systems, the transform group is the key difference.

A. Classical SLAM

Feature, matching, BA, and SE(3)/Sim(3)-style backends.

B. Feed-forward reconstruction

DUSt3R, MASt3R, and VGGT directly predict dense point/depth outputs.

C. MASt3R-SLAM

Uncalibrated dense monocular SLAM with Sim(3)-centered alignment.

D. SL group optimization

SL(3) appears in image homography, but SL(4) factor-graph SLAM is new here.

VGGT-SLAM design choice

The literature shows that once a dense prior is available, the central backend question becomes which coordinate and transformation model should align it.

Axis	Prior approach	VGGT-SLAM choice
Geometry source	Feature matching or dense pair prediction.	Uses VGGT dense priors as frame-level constraints.
Alignment group	SE(3), Sim(3), or homography-style alignment.	Uses SL(4) projective transforms for dense predictions.
Backend	Pose graph, BA, or dense alignment.	Combines priors and SLAM states in a projective factor graph.
Difference	Feed-forward output is often treated as post-processing.	Places the prediction directly inside the factor-graph objective.

Mechanism: how SL(4) submaps are aligned in a graph

The method proceeds through keyframe selection, local submap generation, relative homography estimation, loop closure, and an SL(4) backend. It uses VGGT depth and confidence, then inverse-projects depth with camera outputs to form dense point clouds.

System Thread Summary

Submap creation and submap alignment are separate; SL(4) enters at the alignment stage.

Component	Role	Reading point
Keyframe selection	Selects keyframes when Lucas-Kanade disparity passes a threshold.	Parallax stabilizes depth and reconstruction.
Submap input	I_latest = {M_prior} ∪ I_latest ∪ I_loop	Includes previous and loop-closure frames.
Relative homography	Estimates Hⁱ_j ∈ SL(4) from dense overlap correspondences.	No separate correspondence estimation is needed.
Loop closure	Retrieves prior keyframes with SALAD descriptors.	Adds non-sequential homography factors.
Backend	Optimizes absolute homographies on the SL(4) manifold.	Aligns submaps into a global reconstruction.
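The submap input row can be sketched as a windowing step in which each new submap reuses the last keyframe of the previous one as its overlap frame. This is a simplified illustration: loop-closure frames are left out, and the function name and window size are hypothetical.

```python
def split_into_submaps(keyframe_ids, window=32):
    """Group keyframes into submaps that share one overlap frame.

    Each submap after the first starts with the last frame of the previous
    submap, so consecutive submaps have correspondences on a shared frame.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    submaps = []
    for start in range(0, len(keyframe_ids), window):
        new_frames = list(keyframe_ids[start:start + window])
        overlap = [submaps[-1][-1]] if submaps else []
        submaps.append(overlap + new_frames)
    return submaps
```

For example, `split_into_submaps(list(range(7)), window=3)` yields `[[0, 1, 2], [2, 3, 4, 5], [5, 6]]`, where frames 2 and 5 are the shared overlap frames.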

Mechanism: how the SL(4) objective and homography are used

The equations cover relative homography alignment, closed-form homography estimation, the SL(4) factor-graph objective, and tangent-space linearization. Core equations are kept visible first, and the auxiliary numbered equations are kept in the supporting-equation section.

$$\mathbf{X}_a^{\mathcal{S}_i}=\mathbf{H}_j^i\mathbf{X}_b^{\mathcal{S}_j}\tag{1}$$

Eq. (1). Overlapping submap alignment. 3D point alignment relationship between overlapping submaps.

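$$\hat{\mathcal{H}}=\arg\min_{\mathbf{H}\in\mathrm{SL}(4)}\sum_{(i,j)\in\mathcal{L}}\Big\| \operatorname{Log}(\mathbf{H}_i^{-1}\mathbf{H}_j(\mathbf{H}_j^{i})^{-1}) \Big\|^2_{\Omega^{\mathbf{H}}_{ij}}\tag{3}$$

Eq. (3). SL(4) factor graph objective. Covers relative homography and loop constraints.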

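$$\hat{\mathcal{D}}=\arg\min_{\boldsymbol{\delta}\in\mathcal{D}}\sum_{(i,j)\in\mathcal{L}}\| e_{ij}+\mathbf{J}_i\boldsymbol{\delta}_i+\mathbf{J}_j\boldsymbol{\delta}_j \|^2_{\Omega_{ij}^{\mathbf{H}}},\quad e_{ij}=\operatorname{Log}(\mathbf{H}_i^{-1}\mathbf{H}_j(\mathbf{H}^i_j)^{-1}).\tag{5}$$

Eq. (5). LM update on the Lie group. Solves the linearized residual with LM and updates on the Lie group.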

Transform group comparison

Compare the ambiguity each group can express to understand why SL(4) appears.

Group	DOF	Transform	Role
SE(3)	6	Rotation + translation.	Rigid alignment when metric scale is fixed.
Sim(3)	7	SE(3) + scale.	Good for monocular scale drift.
SL(4)	15	3D projective homography.	Can correct uncalibrated projective ambiguity.

Homography / tangent update support formulas

This section separates the homography-estimation and tangent-space update equations that were summarized in the Core Equations section.

$$\mathbf{A}_k\mathbf{h}=0\tag{2}$$

Eq. (2). Closed-form relative homography system. Linear system for relative 4×4 homography estimation.

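$$h(\boldsymbol{\xi}_i\oplus\boldsymbol{\delta}_i,\boldsymbol{\xi}_j\oplus\boldsymbol{\delta}_j)\simeq h(\boldsymbol{\xi}_i,\boldsymbol{\xi}_j)\oplus\{\mathbf{J}_i\boldsymbol{\delta}_i+\mathbf{J}_j\boldsymbol{\delta}_j\},\quad \boldsymbol{\xi}\oplus\boldsymbol{\delta}=\mathbf{H}\operatorname{Exp}(\boldsymbol{\delta})\tag{4}$$

Eq. (4). SL(4) tangent-space update. Defines the tangent-space increment and manifold update.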

Evidence: how pose, dense map, and loop closure are tested

The experiments cover pose estimation, dense reconstruction, qualitative maps, and ablations. The main result is that SL(4) is helpful in difficult cases while remaining competitive overall in uncalibrated SLAM.

Evaluation Brief

Tables 1/2 evaluate pose, Table 3 evaluates dense map quality, and Fig. 3 studies loop closure and confidence threshold effects.

7-Scenes Pose

Competitive with MASt3R-SLAM* in uncalibrated average ATE.

TUM RGB-D Pose

SL(4), w=32 obtains the best uncalibrated average error.

Dense Reconstruction

Strong accuracy and Chamfer on 7-Scenes.

Qualitative Loop

Joins multiple submaps into global office/corridor reconstructions.

Loop Closure Ablation

Loop closure becomes more useful as the number of submaps grows.
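The retrieval side of loop closure can be sketched as a nearest-descriptor search over earlier keyframes. This assumes each keyframe already has a global descriptor (for example from SALAD, whose extraction is not shown); cosine similarity, the 0.8 threshold, and the function names are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_loop_candidate(query, database, min_similarity=0.8, exclude_last=1):
    """Return the id of the best earlier keyframe, or None if none is close.

    database is a list of (keyframe_id, descriptor) in temporal order; the
    most recent exclude_last entries are skipped so that trivially similar
    neighbouring frames are not reported as loops.
    """
    candidates = database[:max(0, len(database) - exclude_last)]
    best_id, best_sim = None, min_similarity
    for keyframe_id, descriptor in candidates:
        sim = cosine_similarity(query, descriptor)
        if sim > best_sim:
            best_id, best_sim = keyframe_id, sim
    return best_id
```

A retrieved keyframe would then be added to the submap input so that a non-sequential homography factor can be estimated between the two submaps.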

Limitations

Planar degeneracy, outlier homographies, and possible 15-DOF drift remain.

Usage / Limits: what to watch under planar degeneracy and outliers

The paper highlights planar degeneracy in 15-DOF homography estimation and sensitivity to VGGT point outliers. Remaining directions include ray-based matching for depth-error-robust homography estimation, automatic criteria for when Sim(3) is sufficient versus when SL(4) is needed, and a unified real-time system that can use both.
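One simple way to guard against the planar case is to reject a minimal sample whose points are nearly coplanar before estimating a homography from it. The check below is a heuristic sketch and not the paper's method; the tolerance is hypothetical and would have to be chosen relative to the scale of the reconstruction.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def are_coplanar(points, tol=1e-6):
    """True if all 3D points lie within tol of a single plane."""
    if len(points) < 4:
        return True
    p0 = points[0]
    normal = None
    # Find two directions from p0 that are not collinear to define a plane.
    for i in range(1, len(points)):
        for j in range(i + 1, len(points)):
            n = _cross(_sub(points[i], p0), _sub(points[j], p0))
            if _dot(n, n) > tol ** 2:
                normal = n
                break
        if normal is not None:
            break
    if normal is None:
        return True  # all points are (nearly) collinear
    length = math.sqrt(_dot(normal, normal))
    return all(abs(_dot(normal, _sub(p, p0))) / length <= tol for p in points)
```

A RANSAC loop could skip samples for which `are_coplanar` returns true, since such samples cannot constrain all 15 degrees of freedom.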

Takeaway

(In progress...)
