[Paper Review] DROID-SLAM in the Wild

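Key Summary

DROID-W is a SLAM system that extends DROID-SLAM to real-world dynamic RGB environments by estimating pixel-wise uncertainty from multi-view feature inconsistency and injecting it into differentiable BA.

Problem: dynamic observations → unstable BA | Solution: residual trust weighting | Evidence: tracking + ablation

One-sentence summary

Rather than explicitly segmenting and discarding dynamic regions, DROID-W updates an uncertainty that decides how much BA should trust each pixel, jointly inside the optimization loop.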

Contribution 01

Uncertainty-aware BA

Stabilizes pose and geometry updates by lowering the BA influence of pixels with large dynamic/static inconsistency.

Contribution 02

Feature Inconsistency

Estimates per-pixel dynamic uncertainty from multi-view visual feature similarity.

Contribution 03

DROID-W Dataset

Covers in-the-wild conditions such as downtown outdoor scenes, YouTube videos, reflections, shadows, and small dynamic objects.

Evidence 04

Runtime Evidence

Reports the measured runtime of adding DINOv2, Metric3D, and uncertainty optimization on top of the DROID-SLAM backbone.

My takeaway

DROID-W asks how much BA should trust each observation, rather than ‘what should be treated as a dynamic object and removed’. It therefore makes dynamic SLAM read as a weighted optimization problem rather than a segmentation problem.

Processing flow

01 RGB Video: in-the-wild dynamic scenes

02 DROID Features: multi-view feature alignment

03 Inconsistency: dynamic/static mismatch cue

04 Uncertainty: pixel-wise confidence weight

05 Uncertainty-aware BA: pose + depth update

06 Tracking / Mapping: robust RGB SLAM

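Comparison of approaches

DROID-SLAM

A strong baseline for static scenes

DBA-based pose-depth updates are powerful, but can become unstable in scenes with dynamic objects and large feature inconsistency.

Mask-based SLAM

Hard-filtering approach

Removes or masks observations judged to be dynamic, but can be affected by predefined classes, segmentation quality, and geometry thresholds.

DROID-W

Uncertainty-based trust control

Does not stop at a hard dynamic mask; it controls per-pixel influence inside the BA objective.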

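Detailed paper notes

What follows is a detailed reading that keeps as much of the paper's content as possible. Background, notation, and supplementary material that stray from the main thread are folded away.

Problem: what breaks in dynamic scenes?

DROID-W starts from a simple failure mode: standard SLAM assumes correspondences observe the same rigid scene, while in-the-wild video keeps violating that assumption with people, vehicles, reflections, shadows, and small moving objects.

Fig. 1. DROID-W overview / dynamic in-the-wild RGB SLAM setting. The input the paper targets is not a clean benchmark but in-the-wild RGB video that mixes crowds, reflections, shadows, and small dynamic objects.

Problem Flow

The paper reframes dynamic SLAM from “what should be removed?” to “which observations should be trusted, and how much?”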

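01 Static-scene assumption

DROID-SLAM's DBA is strong when rigid correspondence is stable.

02 Dynamic objects and reflections

Residuals and feature alignment diverge from the true camera motion.

03 Limits of masking

Sensitive to class priors, segmentation quality, and object boundaries.

04 DROID-W reframing

Optimizes the trust of each pixel residual as uncertainty.

Problem / Proposal

The abstract, introduction, and related work all converge on the claim that observation trust should be controlled inside BA.

Problem axis	Bottleneck in prior approaches	DROID-W view
Unknown dynamics	Sensitive to predefined dynamic classes or segmentation failures	Estimate uncertainty from multi-view feature inconsistency
Optimization	Dynamic residuals destabilize pose/depth updates	Control residual influence with uncertainty-aware BA
Real-world RGB	Downtown scenes, YouTube videos, reflections, shadows, and small objects appear together	Stress-tested with the DROID-W dataset and web videos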

Mechanism: how uncertainty-aware BA solves it

The problem defined above is that dynamic objects distort correspondence residuals. DROID-W treats this not as object removal but as a problem of controlling how much BA should trust each observation, and injects uncertainty into pose-depth optimization.

Fig. 2. DROID-W system overview. The overall flow in which the Metric3D depth prior, DROID features, DINO features, ConvGRU, uncertainty optimization, and uncertainty-aware BA lead to pose/point-cloud updates.

Mechanism Thread Summary

The paper starts from static-scene DROID-SLAM and alternates dynamic uncertainty optimization with BA to stabilize tracking and geometry in in-the-wild videos.

Stage	What it solves	Key mechanism
Preliminaries	Uses DROID-SLAM pose, inverse depth, frame graph, and DBA as the baseline	Rigid correspondence and Gauss-Newton pose-depth updates
Uncertainty-aware BA	Mitigates unreliable dynamic-object residuals that destabilize BA	Injects pixel-wise uncertainty into the Mahalanobis weight
Uncertainty optimization	Compensates for the difficulty of judging dynamic regions from reprojection residuals alone	FiT3D/DINOv2 feature similarity and a logarithmic prior
SLAM system	Runs initialization, tracking, and local/global BA stably on a real-time video stream	Metric3D depth prior, local uncertainty update, global pose-depth BA

Design Choice

The important design choice in this paper is to adjust residual weights continuously instead of removing dynamic objects with a binary mask.

Rejected direction

Rely only on class priors or segmentation masks.

Chosen direction

Build uncertainty from feature inconsistency and insert it into BA.

Effect

Unknown dynamic objects are also softly downweighted.

Why DROID-SLAM remains the baseline

The paper first keeps DROID-SLAM's basic state formulation. It builds rigid correspondences from each frame's pose and inverse depth, then feeds the residual against the predicted correspondences into BA weighted by a confidence map. This starting point makes it clear that DROID-W is not “a new SLAM system built from scratch” but an adaptation of DROID-SLAM's DBA to dynamic scenes.

\mathbf{p}_{ij}=\Pi_c(\mathbf{G}^{\prime}_{ij}\cdot\Pi_c^{-1}(\mathbf{p}_i,\mathbf{d}^{\prime}_i))

(1)

Eq. (1). Rigid correspondence projection. Projects a pixel/depth from frame i into frame j using the relative pose.
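As a concrete reading of Eq. (1), the following minimal sketch assumes a pinhole camera and a 4×4 relative pose matrix. The helper names (`backproject`, `project`, `reproject`) and the intrinsics tuple `K` are hypothetical and chosen only for illustration; they are not taken from the paper or from the DROID-SLAM code.

```python
import numpy as np

def backproject(p, d, K):
    """Pi_c^{-1}: lift pixel p=(x, y) with inverse depth d to a 3D point."""
    fx, fy, cx, cy = K
    ray = np.array([(p[0] - cx) / fx, (p[1] - cy) / fy, 1.0])
    return ray / d  # depth is 1 / inverse depth

def project(X, K):
    """Pi_c: pinhole projection of a 3D point to pixel coordinates."""
    fx, fy, cx, cy = K
    return np.array([fx * X[0] / X[2] + cx, fy * X[1] / X[2] + cy])

def reproject(p_i, d_i, G_ij, K):
    """Eq. (1): p_ij = Pi_c(G_ij . Pi_c^{-1}(p_i, d_i))."""
    X_i = backproject(p_i, d_i, K)
    X_j = G_ij[:3, :3] @ X_i + G_ij[:3, 3]  # rigid transform into frame j
    return project(X_j, K)

# Toy intrinsics (fx, fy, cx, cy) and a pure x-translation as relative pose.
K = (500.0, 500.0, 320.0, 240.0)
G_ij = np.eye(4)
G_ij[0, 3] = 0.1

# Principal-point pixel at inverse depth 0.5 (depth 2.0).
p_ij = reproject(np.array([320.0, 240.0]), 0.5, G_ij, K)
print(p_ij)
```

With an identity pose the function returns the input pixel unchanged, which is a quick sanity check that the lift and the projection are inverses of each other.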

\mathbf{E}(\mathbf{G}^{\prime},\mathbf{d}^{\prime})=\sum_{(i,j)\in\mathcal{E}}\|\mathbf{p}_{ij}^*-\mathbf{p}_{ij}\|^2_{\Sigma_{ij}},\\ \Sigma_{ij}=\operatorname{diag}(w_{ij})

DROID-SLAM BA. The original DROID-SLAM optimizes correspondence residuals with network confidence weights.

\begin{align} &\begin{aligned} \begin{bmatrix} \mathbf{B} & \mathbf{E} \\ \mathbf{E}^{\top} & \mathbf{C}\end{bmatrix} \begin{bmatrix}\Delta\xi \\ \Delta \mathbf{d} \end{bmatrix} = \begin{bmatrix}\mathbf{v} \\ \mathbf{w}\end{bmatrix}, \end{aligned} \\ &\begin{aligned} &\Delta\xi = [\mathbf{B}-\mathbf{E}\mathbf{C}^{-1}\mathbf{E}^{\top}]^{-1}(\mathbf{v}-\mathbf{E}\mathbf{C}^{-1}\mathbf{w}),\\ &\Delta \mathbf{d} = \mathbf{C}^{-1}(\mathbf{w}-\mathbf{E}^{\top}\Delta\xi) \end{aligned} \end{align}

(2)-(3)

Eq. (2)-(3). Differentiable BA update. Pose-depth normal equation and Schur complement update.

Eq. (1) defines where the current pose/depth projects into another frame. The DROID-SLAM BA objective and Eq. (2)-(3) show how this residual is used to solve for the pose update $\Delta\xi$ and the disparity update $\Delta\mathbf{d}$. DROID-W keeps this structure and instead changes the residual covariance into an uncertainty-aware form.
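The Schur complement step in Eq. (2)-(3) can be checked numerically on a toy system. The sketch below assumes the depth block C is diagonal (so its inverse is elementwise) and uses made-up sizes and random values; `schur_solve` is a hypothetical helper name, not a function from any SLAM library.

```python
import numpy as np

def schur_solve(B, E, C_diag, v, w):
    """Solve [[B, E], [E^T, C]] [dxi, dd] = [v, w] with diagonal C, as in Eq. (2)-(3)."""
    C_inv = 1.0 / C_diag                 # C is diagonal, so inversion is elementwise
    EC_inv = E * C_inv                   # equals E @ diag(C)^-1
    S = B - EC_inv @ E.T                 # Schur complement (pose-only system)
    dxi = np.linalg.solve(S, v - EC_inv @ w)
    dd = C_inv * (w - E.T @ dxi)         # back-substitute for the depth update
    return dxi, dd

rng = np.random.default_rng(0)
n_pose, n_depth = 6, 20                  # toy sizes, not the paper's
B = 5.0 * np.eye(n_pose)
E = 0.1 * rng.normal(size=(n_pose, n_depth))
C_diag = rng.uniform(1.0, 2.0, size=n_depth)
v = rng.normal(size=n_pose)
w = rng.normal(size=n_depth)

dxi, dd = schur_solve(B, E, C_diag, v, w)

# Reference: solve the full dense system directly and compare.
H = np.block([[B, E], [E.T, np.diag(C_diag)]])
full = np.linalg.solve(H, np.concatenate([v, w]))
matches = np.allclose(np.concatenate([dxi, dd]), full)
```

The point of the elimination is cost: only the small pose system `S` needs a dense solve, while the per-pixel depth block is handled elementwise.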

Reducing dynamic residual influence with uncertainty

Dynamic objects violate the rigid-motion assumption, so BA can move in the wrong direction if their residuals are trusted as much as static-background residuals. DROID-W introduces a per-pixel dynamic uncertainty $\mathbf{u}^{\prime}$ and uses a weighted Mahalanobis term that lowers the influence of observations with high uncertainty.

\Sigma_{ij}^{\mathrm{uncer}}=\operatorname{diag}(\mathbf{w}_{ij}\cdot\frac{1}{\mathbf{u}^{\prime}_i})

\hat{\mathbf{E}}(\mathbf{G}^{\prime},\mathbf{d}^{\prime})=\sum_{(i,j)\in\mathcal{E}}\|\mathbf{p}_{ij}^*-\mathbf{p}_{ij}\|^2_{\Sigma_{ij}^\mathrm{uncer}}.

(4)(5)

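Eq. (4)-(5). Uncertainty-aware BA energy. Lowers the influence of correspondence residuals from high-uncertainty pixels.

Eq. (4) redefines the covariance by scaling the confidence $\mathbf{w}_{ij}$ with the inverse uncertainty $1/\mathbf{u}^{\prime}_i$, and Eq. (5) is the objective that optimizes pose and depth with that weight. The key point is that the object is not removed with a hard mask; its residual is trusted less, softly, inside BA.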
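A small numeric example makes the soft downweighting in Eq. (4)-(5) concrete. The sketch assumes one scalar confidence and one scalar uncertainty per pixel and treats the diagonal of $\Sigma^{\mathrm{uncer}}_{ij}$ as a per-pixel weight on the squared residual; the `eps` clamp and the function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def uncertainty_weights(w_ij, u_i, eps=1e-6):
    """Diagonal of Eq. (4): confidence scaled by inverse uncertainty."""
    return w_ij / np.maximum(u_i, eps)   # eps only guards against division by zero

def weighted_energy(residual, weight):
    """Eq. (5) for one edge: sum of weighted squared residuals; residual is (N, 2)."""
    return float(np.sum(weight[:, None] * residual ** 2))

# Two pixels with the same 2-pixel residual and the same network confidence.
residual = np.array([[2.0, 0.0], [2.0, 0.0]])
w_ij = np.array([1.0, 1.0])
u_i = np.array([1.0, 10.0])              # second pixel is likely dynamic

weights = uncertainty_weights(w_ij, u_i)
energy = weighted_energy(residual, weights)
print(weights, energy)
```

The second pixel still contributes to the objective, but with one tenth of the weight, which is the difference from a binary mask that would drop it entirely.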

Building uncertainty evidence from feature inconsistency

With large dynamic motion, the reprojection error itself can be unstable. The paper therefore uses FiT3D-refined DINOv2 features to measure multi-view feature similarity and treats points with low cross-view feature consistency as evidence for dynamic uncertainty.

\mathbf{E}_{\mathrm{sim}}(\mathbf{u}^{\prime})=\sum_{(i,j)\in\mathcal{E}}\frac{1-\frac{\mathbf{F}_i\cdot\mathbf{F}_{ij}}{\|\mathbf{F}_i\|_2\|\mathbf{F}_{ij}\|_2}}{\mathbf{u}_i'\cdot\mathbf{u}_{ij}'}.

(6)

Eq. (6). Feature-similarity uncertainty. Pushes multi-view correspondences with low DINOv2/FiT3D feature similarity toward higher uncertainty.

\mathbf{E}_{\mathrm{prior}}(\mathbf{u}^{\prime})=\sum_i\log(\mathbf{u}_i'+1.0).

\mathbf{E}_\mathrm{uncer}(\mathbf{u}^{\prime})=\mathbf{E}_{\mathrm{sim}}(\mathbf{u}^{\prime})+\gamma_\text{prior}\mathbf{E}_{\mathrm{prior}}(\mathbf{u}^{\prime}).

(7)(8)

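Eq. (7)-(8). Uncertainty prior. Controls the trivial solution of infinitely growing uncertainty with a log prior.

In Eq. (6), a low cosine similarity between corresponding features makes the numerator large, so minimizing the energy pushes the uncertainty in the denominator up. Eq. (7)-(8) prevent the trivial solution of inflating every uncertainty to ignore all residuals. In other words, uncertainty is both a variable that grows in likely dynamic regions and an optimization variable held in check by a prior.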
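The balance between Eq. (6) and the log prior in Eq. (7)-(8) can be seen on a single correspondence. The sketch below assumes one pixel, sets the uncertainty of the warped pixel equal to that of the source pixel, and uses a made-up `gamma_prior = 0.5`; the paper's actual value and optimizer are not reproduced here, and a grid scan stands in for the real optimization.

```python
import numpy as np

def cosine_similarity(F_i, F_ij):
    """Row-wise cosine similarity between two (N, C) feature arrays."""
    num = np.sum(F_i * F_ij, axis=-1)
    den = np.linalg.norm(F_i, axis=-1) * np.linalg.norm(F_ij, axis=-1)
    return num / den

def uncertainty_energy(F_i, F_ij, u_i, u_ij, gamma_prior):
    """Eq. (8): similarity term of Eq. (6) plus the weighted log prior of Eq. (7)."""
    e_sim = np.sum((1.0 - cosine_similarity(F_i, F_ij)) / (u_i * u_ij))
    e_prior = np.sum(np.log(u_i + 1.0))
    return e_sim + gamma_prior * e_prior

def best_u(F_i, F_ij, gamma_prior, grid):
    """Scan candidate uncertainties and return the one with the lowest energy."""
    energies = [
        uncertainty_energy(F_i, F_ij, np.full(1, u), np.full(1, u), gamma_prior)
        for u in grid
    ]
    return grid[int(np.argmin(energies))]

grid = np.linspace(0.1, 10.0, 200)
gamma_prior = 0.5                        # hypothetical value for illustration
F_i = np.array([[1.0, 0.0]])
F_static = np.array([[1.0, 0.0]])        # same feature in the other view
F_dynamic = np.array([[0.0, 1.0]])       # orthogonal feature in the other view

u_static = best_u(F_i, F_static, gamma_prior, grid)
u_dynamic = best_u(F_i, F_dynamic, gamma_prior, grid)
print(u_static, u_dynamic)
```

For the consistent pair only the prior is active, so the smallest candidate wins; for the inconsistent pair the similarity term pulls the minimizer to a clearly larger but finite uncertainty.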

Stabilizing local uncertainty with decoupling and affine mapping

Solving uncertainty jointly with pose/depth through Gauss-Newton increases both cost and instability. DROID-W alternates pose-depth refinement with uncertainty optimization, and learns a local affine mapping from DINOv2 features to uncertainty so that uncertainty does not fluctuate excessively within a small window.

\begin{align} \boldsymbol{g}_t&=\sum_{i=0}^{N}\frac{\partial\mathbf{E}_\mathrm{uncer}}{\partial\mathbf{u}^{\prime}_i}\cdot\frac{\partial\mathbf{u}^{\prime}_i}{\partial\theta_{t-1}}\notag \\ &=\sum_{i=0}^{N}\frac{\partial\mathbf{E}_\mathrm{uncer}}{\partial\mathbf{u}^{\prime}_i}\cdot\frac{1}{1+\exp(-\theta_{t-1}\cdot\mathbf{F}_i)}\cdot\mathbf{F}_i,\notag \\ \theta_t&=\theta_{t-1}-\lambda\cdot\boldsymbol{g}_t-\eta\cdot\theta_{t-1}.\notag \end{align}

(9)

Eq. (9). Affine uncertainty mapping. Updates the affine mapping parameter θ with gradient descent and weight decay.

Eq. (9) shows how the affine mapping parameter $\theta$ is updated with gradient descent and weight decay. With this design, uncertainty is refined iteratively like a dense SLAM state, but it is frozen during global BA so that it does not go beyond the role of a local regularizer.
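The sigmoid factor in Eq. (9) is what the chain rule gives if uncertainty is a softplus of the affine response, $\mathbf{u}^{\prime}_i=\log(1+\exp(\theta\cdot\mathbf{F}_i))$. That activation is an inference from the derivative shown in Eq. (9), not something stated in this review, so the sketch below should be read under that assumption. It also uses a simplified toy energy (the warped-pixel uncertainty is set equal to the source one) and made-up values for `lr`, `weight_decay`, and `gamma`.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)          # log(1 + exp(x)), numerically stable

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def theta_gradient(theta, F, dE_du):
    """g_t in Eq. (9): sum_i dE/du_i * sigmoid(theta . F_i) * F_i."""
    return (dE_du * sigmoid(F @ theta)) @ F

def update_theta(theta, F, dE_du, lr, weight_decay):
    """theta_t = theta_{t-1} - lr * g_t - weight_decay * theta_{t-1}."""
    return theta - lr * theta_gradient(theta, F, dE_du) - weight_decay * theta

def toy_energy(theta, F, c, gamma):
    """Simplified Eq. (8) with u_ij = u_i; c is the per-pixel 1 - cosine similarity."""
    u = softplus(F @ theta)
    return np.sum(c / u ** 2 + gamma * np.log(u + 1.0))

def toy_dE_du(theta, F, c, gamma):
    u = softplus(F @ theta)
    return -2.0 * c / u ** 3 + gamma / (u + 1.0)

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 4))              # 8 pixels, 4-dim toy features
c = rng.uniform(0.0, 1.0, size=8)
theta = np.full(4, 0.1)
gamma = 0.5                              # hypothetical prior weight

g = theta_gradient(theta, F, toy_dE_du(theta, F, c, gamma))
theta_new = update_theta(theta, F, toy_dE_du(theta, F, c, gamma), lr=1e-3, weight_decay=1e-4)
```

Because only the low-dimensional `theta` is optimized rather than every pixel's uncertainty, pixels with similar features receive similar uncertainty, which is one way to read the stabilizing effect described above.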

How the SLAM system runs pose-depth updates

Like DROID-SLAM, the system initializes with 12 keyframes that contain enough motion. Because a constant-disparity initialization can destabilize tracking in dynamic scenes, it strengthens the early pose-depth optimization by using Metric3D metric monodepth as a disparity regularizer.

\mathbf{E}^+(\mathbf{G}^{\prime},\mathbf{d}^{\prime})=\sum_{(i,j)\in\mathcal{E}}\|\mathbf{p}_{ij}^*-\mathbf{p}_{ij}\|^2_{\Sigma_{ij}^\mathrm{uncer}}+\gamma_d\sum_i\|\mathbf{d}_i-\mathbf{D}_i\|^2.

Depth-regularized BA. The SLAM system stabilizes depth by adding a Metric3D depth prior.

When a new keyframe arrives, local BA and the uncertainty update run inside a sliding window. Frontend tracking updates pose, disparity, and uncertainty together. After tracking, global BA refines the poses and disparities of all keyframes while keeping the dynamic uncertainty parameters fixed. Overall, the method can be read as a flow that stabilizes initialization with a depth prior, downweights dynamic residuals with uncertainty-aware BA, and refines the whole trajectory with global BA.

Mechanism Brief

The key methodological choice is not to discard dynamic objects with a binary mask, but to keep their correspondences in BA with a continuous, uncertainty-based trust.

Problem

Dynamic objects distort rigid correspondence residuals and destabilize camera-pose and depth updates.

Solution

Use feature-inconsistency-based uncertainty as residual weights to reduce the influence of unreliable observations.

Operation

Alternate pose-depth refinement and uncertainty optimization, then refine only pose/depth in global BA.

Evidence: which claims are tested?

The evaluation is easiest to read by following which evidence supports each core claim of DROID-W. Tracking and qualitative reconstruction are the core evaluation tasks, while runtime, ablation, and dataset construction are additional evidence for the soundness of the design.

Evaluation Evidence

Separating core evaluation from supporting evidence makes it quicker to see which claim each table and figure supports.

Core evaluation

Evaluation axis	Evidence	What to check
Tracking robustness	Table 1-4: Bonn, TUM, DyCheck, DROID-W	Best among all baselines on Bonn; best/second-best on average on TUM; stable in complex dynamic scenes on DyCheck and DROID-W outdoor
Qualitative geometry	Fig. 3-4	Fig. 3: coherent uncertainty maps; Fig. 4: reconstruction with less scale drift and fewer noisy distractors

Supporting evidence

Evidence axis	Evidence	What to check
Runtime	Table 5	About 10 FPS on an RTX 3090 / 16-core CPU; 40× speedup over WildGS-SLAM; slower than DROID-SLAM because of the DINOv2/Metric3D cost
Ablation	Table 6	The full system outperforms all variants; removing uncertainty-aware BA causes the largest drop, and decoupling, affine mapping, and weight decay also contribute to stability
Custom Dataset	Table 7-9	Table 7: Downtown dataset overview; Table 8: FAST-LIVO2 reference check; Table 9: YouTube qualitative stress-test video overview

Representative Evidence

The paper treats quantitative tracking, uncertainty maps, reconstruction, runtime, and ablation as evidence for different claims.

Quantitative tracking

Bonn/TUM/DyCheck/DROID-W results check dynamic-scene robustness relative to DROID-SLAM.

Qualitative reconstruction

Shows where the uncertainty map assigns low trust to dynamic distractors.

Runtime

Checks the real-time claim of running faster than WildGS-SLAM.

Ablation

Isolates the effect of each design component on tracking performance.

Tracking / Runtime / Ablation Evidence

Table 1. Bonn RGB-D Dynamic Dataset tracking performance (ATE RMSE ↓, cm). Bonn is the core quantitative tracking benchmark; the table checks whether DROID-W attains the lowest average error among baselines on dynamic indoor RGB sequences.

Table 2. TUM RGB-D Dataset tracking performance (ATE RMSE ↓, cm). TUM is a classic dynamic RGB-D benchmark; the table checks whether DROID-W maintains average robustness against prior DROID-SLAM-style methods.

Table 3. DyCheck Dataset tracking performance (ATE RMSE ↓). DyCheck is a supplementary benchmark for tracking robustness in real-world dynamic scenes, covering conditions where camera and object motion are mixed.

Table 4. DROID-W Dataset tracking performance (ATE RMSE ↓, m). An outdoor benchmark built by the authors; it checks whether uncertainty-aware weighting also stabilizes tracking on downtown dynamic RGB video.

Table 5. Runtime comparisons (average FPS ↑). Runtime is the practical evidence. DROID-W shows a large speed gain over WildGS-SLAM, and its extra cost comes mainly from DINOv2 and Metric3D.

Table 6. Ablation studies on Bonn RGB-D Dataset. The full model is best, and removing uncertainty-aware BA causes the largest loss. Decoupling, affine mapping, and weight decay add stability.

Fig. 3. Uncertainty estimation compared with MonST3R and WildGS-SLAM. DROID-W separates dynamic regions into a more coherent uncertainty map and keeps confidence in static regions stable.

Fig. 4. 3D reconstruction comparisons on YouTube sequences. In contrast to the scale drift and noisy distractors of DROID-SLAM, DROID-W produces a more consistent point cloud on dynamic outdoor video.

Dataset Contribution Evidence

The DROID-W dataset separates outdoor sequences for quantitative tracking from YouTube videos for qualitative verification. The main text first covers only the role each plays in the evaluation; sensor and GT details are in the toggle.

Quantitative benchmark: DROID-W Downtown

Outdoor RGB sequences Downtown 1-7. ATE is evaluated against RTK for some sequences and against LiDAR-inertial reference trajectories for others.

Role: checks tracking robustness in real downtown dynamic RGB SLAM.

Qualitative stress test: YouTube videos

Web video sequences mixing crowds, reflections, moving objects, and moving cameras.

Role: checks uncertainty-map and reconstruction quality rather than ground-truth trajectories.

Dataset / GT details

Dataset composition

Keeping the quantitative benchmark and the qualitative stress test separate makes it clear which table serves which evaluation role.

Group	Included content	Evaluation role
DROID-W Downtown	RGB 1200×1600, 20 FPS; synchronized LiDAR/IMU; RTK or FAST-LIVO2 reference trajectory	ATE comparison for outdoor dynamic RGB SLAM
YouTube videos	Six web videos with different FPS and duration; crowds, reflections, and moving objects; moving-camera conditions	Qualitative comparison of uncertainty maps and reconstruction

Reference trajectory check

For Downtown 1-2, where RTK is unavailable, the FAST-LIVO2 trajectory serves as the reference, so the paper separately reports the FAST-LIVO2 error on sequences that do have RTK.

Table 7. DROID-W dataset overview. Lists the sensor, FPS, and GT source of Downtown 1-7 to establish the conditions of the quantitative benchmark.

Table 8. FAST-LIVO2 ground-truth quality check. Evidence, measured on sequences with RTK, that the FAST-LIVO2 trajectory can serve as the reference on sequences without RTK.

Table 9. Downloaded YouTube video overview. The YouTube videos are not a quantitative benchmark; they serve as a qualitative stress test under crowd/reflection/moving-camera conditions.

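Usage / Limits: when is it useful?

DROID-W fits RGB SLAM settings where observation trust should be adjusted rather than dynamic regions removed outright. Conversely, when the initial pose is unstable or frame-to-frame alignment itself is weak, uncertainty estimation can become unstable as well.

When to Use / Avoid

The results and limitations translate into the following conditions of use.

Category	Summary	Reason
Good fit	Monocular/RGB video SLAM with many unknown dynamic objects	Feature-inconsistency-based uncertainty is more useful than class priors
Required assumptions	DROID-SLAM backbone, feature correspondence, Metric3D depth prior	Pose-depth updates and local uncertainty updates operate on top of these
Weak conditions	Segments where the initial pose is unstable or alignment fluctuates heavily	Uncertainty optimization also depends on frame-to-frame alignment

Thoughts

(Work in progress...)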

Problem: what breaks in dynamic scenes?

DROID-W starts from a simple failure mode: standard SLAM assumes correspondences observe the same rigid scene, while in-the-wild video violates that assumption with people, vehicles, reflections, shadows, and small moving objects.

Problem Flow

The paper reframes dynamic SLAM from “what should be removed?” to “which observations should BA trust?”

01 Static-scene assumption

DROID-SLAM DBA is strong when rigid correspondence is stable.

02 Dynamic objects and reflections

Residuals and feature alignment diverge from the true camera motion.

03 Limits of masking

Sensitive to class priors, segmentation quality, and object boundaries.

04 DROID-W reframing

Optimizes the trust of each pixel residual as uncertainty.

Problem / Proposal

The abstract, introduction, and related work all point to the claim that observation trust should be controlled inside BA.

Problem axis	Bottleneck in prior approaches	DROID-W view
Unknown dynamics	Sensitive to predefined dynamic classes or segmentation failures	Estimate uncertainty from multi-view feature inconsistency
Optimization	Dynamic residuals destabilize pose/depth updates	Control residual influence with uncertainty-aware BA
Real-world RGB	Downtown scenes, YouTube videos, reflections, shadows, and small objects appear together	Stress-tested with the DROID-W dataset and web videos

Related-work families

The paper is easier to position by separating prior work into how dynamics are detected and how they are reflected in the SLAM objective.

A. Residual-based dynamic SLAM

Estimate dynamic regions from large residuals or frame-to-model alignment.

B. Object segmentation

Remove dynamic objects with detector or segmentation networks, as in DynaSLAM/DS-SLAM.

C. Object-level dynamic mapping

Track or reconstruct independent objects, as in Co-Fusion and MaskFusion.

D. Uncertainty optimization

Reflect dynamic/static ambiguity in BA through weights or uncertainty.

Mechanism: how uncertainty-aware BA solves it

The problem above is that dynamic objects distort correspondence residuals. DROID-W treats this not as object removal, but as a problem of controlling how much BA should trust each observation, and injects uncertainty into pose-depth optimization.

Mechanism Thread Summary

The paper starts from static-scene DROID-SLAM and alternates dynamic uncertainty optimization with BA to stabilize tracking and geometry in in-the-wild videos.

Stage	What it solves	Key mechanism
Preliminaries	Uses DROID-SLAM pose, inverse depth, frame graph, and DBA as the baseline	Rigid correspondence and Gauss-Newton pose-depth updates
Uncertainty-aware BA	Mitigates unreliable dynamic-object residuals that destabilize BA	Injects pixel-wise uncertainty into the Mahalanobis weight
Uncertainty optimization	Compensates for the difficulty of judging dynamics from reprojection residuals alone	FiT3D/DINOv2 feature similarity and a logarithmic prior
SLAM system	Runs initialization, tracking, and local/global BA stably in a real-time video stream	Metric3D depth prior, local uncertainty update, global pose-depth BA

Design Choice

The important design choice is not to remove dynamic objects with a binary mask, but to continuously control residual weights.

Rejected direction

Rely only on class priors or segmentation masks.

Chosen direction

Build uncertainty from feature inconsistency and insert it into BA.

Effect

Softly downweight unknown dynamic objects.

Why DROID-SLAM remains the baseline

The paper first keeps DROID-SLAM’s basic state formulation. It builds rigid correspondences from each frame’s pose and inverse depth, then feeds the residual against predicted correspondences into weighted BA with a confidence map. This starting point makes it clear that DROID-W is not a new SLAM system from scratch, but a dynamic-scene adaptation of DROID-SLAM DBA.

\mathbf{p}_{ij}=\Pi_c(\mathbf{G}^{\prime}_{ij}\cdot\Pi_c^{-1}(\mathbf{p}_i,\mathbf{d}^{\prime}_i))

(1)

Eq. (1). Rigid correspondence projection. Projects a pixel/depth from frame i into frame j using the relative pose.

\mathbf{E}(\mathbf{G}^{\prime},\mathbf{d}^{\prime})=\sum_{(i,j)\in\mathcal{E}}\|\mathbf{p}_{ij}^*-\mathbf{p}_{ij}\|^2_{\Sigma_{ij}},\\ \Sigma_{ij}=\operatorname{diag}(w_{ij})

DROID-SLAM BA. The original DROID-SLAM optimizes correspondence residuals with network confidence weights.

\begin{align} &\begin{aligned} \begin{bmatrix} \mathbf{B} & \mathbf{E} \\ \mathbf{E}^{\top} & \mathbf{C}\end{bmatrix} \begin{bmatrix}\Delta\xi \\ \Delta \mathbf{d} \end{bmatrix} = \begin{bmatrix}\mathbf{v} \\ \mathbf{w}\end{bmatrix}, \end{aligned} \\ &\begin{aligned} &\Delta\xi = [\mathbf{B}-\mathbf{E}\mathbf{C}^{-1}\mathbf{E}^{\top}]^{-1}(\mathbf{v}-\mathbf{E}\mathbf{C}^{-1}\mathbf{w}),\\ &\Delta \mathbf{d} = \mathbf{C}^{-1}(\mathbf{w}-\mathbf{E}^{\top}\Delta\xi) \end{aligned} \end{align}

(2)-(3)

Eq. (2)-(3). Differentiable BA update. Pose-depth normal equation and Schur complement update in differentiable BA.

Eq. (1) defines where the current pose/depth projects into another frame. The DROID-SLAM BA objective and Eq. (2)-(3) show how this residual is used to solve for the pose update $\Delta\xi$ and the disparity update $\Delta\mathbf{d}$. DROID-W keeps this structure and changes the residual covariance into an uncertainty-aware form.

Reducing dynamic residual influence with uncertainty

Dynamic objects violate the rigid-motion assumption, so BA can move in the wrong direction if their residuals are trusted like static-background residuals. DROID-W introduces a per-pixel dynamic uncertainty $\mathbf{u}^{\prime}$ and uses a weighted Mahalanobis term that lowers the influence of observations with high uncertainty.

\Sigma_{ij}^{\mathrm{uncer}}=\operatorname{diag}(\mathbf{w}_{ij}\cdot\frac{1}{\mathbf{u}^{\prime}_i})

\hat{\mathbf{E}}(\mathbf{G}^{\prime},\mathbf{d}^{\prime})=\sum_{(i,j)\in\mathcal{E}}\|\mathbf{p}_{ij}^*-\mathbf{p}_{ij}\|^2_{\Sigma_{ij}^\mathrm{uncer}}.

(4)(5)

Eq. (4)-(5). Uncertainty-aware BA energy. Lowers the influence of correspondence residuals from high-uncertainty pixels.

Eq. (4) redefines the covariance by scaling the confidence $\mathbf{w}_{ij}$ with the inverse uncertainty $1/\mathbf{u}^{\prime}_i$, and Eq. (5) optimizes pose and depth with that weight. The key point is that the object is not removed with a hard mask; its residual is trusted less, softly, inside BA.

Building uncertainty evidence from feature inconsistency

With large dynamic motion, reprojection error itself can be unstable. The paper therefore uses FiT3D-refined DINOv2 features to measure multi-view feature similarity and treats low cross-view feature consistency as evidence for dynamic uncertainty.

\mathbf{E}_{\mathrm{sim}}(\mathbf{u}^{\prime})=\sum_{(i,j)\in\mathcal{E}}\frac{1-\frac{\mathbf{F}_i\cdot\mathbf{F}_{ij}}{\|\mathbf{F}_i\|_2\|\mathbf{F}_{ij}\|_2}}{\mathbf{u}_i'\cdot\mathbf{u}_{ij}'}.

(6)

Eq. (6). Feature-similarity uncertainty. Pushes low DINOv2/FiT3D feature-similarity correspondences toward higher uncertainty.

\mathbf{E}_{\mathrm{prior}}(\mathbf{u}^{\prime})=\sum_i\log(\mathbf{u}_i'+1.0).

\mathbf{E}_\mathrm{uncer}(\mathbf{u}^{\prime})=\mathbf{E}_{\mathrm{sim}}(\mathbf{u}^{\prime})+\gamma_\text{prior}\mathbf{E}_{\mathrm{prior}}(\mathbf{u}^{\prime}).

(7)(8)

Eq. (7)-(8). Uncertainty prior. Controls the trivial solution of infinitely increasing uncertainty with a log prior.

In Eq. (6), a low cosine similarity between corresponding features makes the numerator large, so minimizing the energy pushes the uncertainty in the denominator up. Eq. (7)-(8) prevent the trivial solution of increasing all uncertainty values to ignore residuals. In other words, uncertainty is both a variable that grows in likely dynamic regions and an optimization variable controlled by a prior.

Stabilizing local uncertainty with decoupling and affine mapping

Directly solving uncertainty with pose/depth through Gauss-Newton increases cost and instability. DROID-W alternates pose-depth refinement and uncertainty optimization, then learns a local affine mapping from DINOv2 features to uncertainty so uncertainty does not fluctuate excessively in a small window.

\begin{align} \boldsymbol{g}_t&=\sum_{i=0}^{N}\frac{\partial\mathbf{E}_\mathrm{uncer}}{\partial\mathbf{u}^{\prime}_i}\cdot\frac{\partial\mathbf{u}^{\prime}_i}{\partial\theta_{t-1}}\notag \\ &=\sum_{i=0}^{N}\frac{\partial\mathbf{E}_\mathrm{uncer}}{\partial\mathbf{u}^{\prime}_i}\cdot\frac{1}{1+\exp(-\theta_{t-1}\cdot\mathbf{F}_i)}\cdot\mathbf{F}_i,\notag \\ \theta_t&=\theta_{t-1}-\lambda\cdot\boldsymbol{g}_t-\eta\cdot\theta_{t-1}.\notag \end{align}

(9)

Eq. (9). Affine uncertainty mapping. Updates the affine mapping parameter θ with gradient descent and weight decay.

Eq. (9) shows how the affine mapping parameter $\theta$ is updated with gradient descent and weight decay. This lets uncertainty improve iteratively like a dense SLAM state, while it is frozen during global BA so that it remains a local regularizer.

How the SLAM system runs pose-depth updates

Like DROID-SLAM, the system initializes with 12 keyframes that contain enough motion. Because a constant-disparity initialization can destabilize tracking in dynamic scenes, it strengthens early pose-depth optimization by using Metric3D metric monodepth as disparity regularization.

\mathbf{E}^+(\mathbf{G}^{\prime},\mathbf{d}^{\prime})=\sum_{(i,j)\in\mathcal{E}}\|\mathbf{p}_{ij}^*-\mathbf{p}_{ij}\|^2_{\Sigma_{ij}^\mathrm{uncer}}+\gamma_d\sum_i\|\mathbf{d}_i-\mathbf{D}_i\|^2.

Depth-regularized BA. The SLAM system stabilizes depth by adding a Metric3D depth prior.

When a new keyframe arrives, local BA and uncertainty updates run inside a sliding window. Frontend tracking updates pose, disparity, and uncertainty together. After tracking, global BA refines all keyframe poses and disparities while keeping dynamic uncertainty parameters fixed. Overall, DROID-W should be read as a flow that stabilizes initialization with a depth prior, reduces dynamic residuals with uncertainty-aware BA, and refines the whole trajectory with global BA.

Mechanism Brief

The key methodological choice is not to discard dynamic objects with a binary mask, but to leave correspondences in BA with continuous uncertainty-based trust.

Problem

Dynamic objects distort rigid correspondence residuals and destabilize camera-pose and depth updates.

Solution

Use feature-inconsistency-based uncertainty as residual weights to reduce unreliable observation influence.

Operation

Alternate pose-depth refinement and uncertainty optimization, then refine only pose/depth in global BA.

Evidence: which claims are tested?

The evaluation is clearest when each result is tied to a claim. Tracking and qualitative reconstruction are the core evaluation tasks, while runtime, ablation, and dataset construction support the design argument.

Evaluation Evidence

Separating core evaluation from supporting evidence makes each table and figure easier to tie back to the paper's claims.

Core evaluation

Evaluation axis	Evidence	What to check
Tracking robustness	Table 1-4: Bonn, TUM, DyCheck, DROID-W	Table 1 reports the best Bonn result; Table 2 is best/second-best on average; Tables 3-4 support stability in diverse dynamic scenes.
Qualitative geometry	Fig. 3-4	Fig. 3: coherent uncertainty maps; Fig. 4: less scale drift, fewer geometry errors, and fewer noisy distractors.

Supporting evidence

Evidence axis	Evidence	What to check
Runtime	Table 5	About 10 FPS on an RTX 3090 / 16-core CPU; 40× speedup over WildGS-SLAM, with extra cost from DINOv2 and Metric3D.
Ablation	Table 6	The full system outperforms all variants; removing uncertainty-aware BA causes the largest drop, while decoupling, affine mapping, and weight decay add stability.
Custom Dataset	Table 7-9	Table 7: Downtown dataset overview; Table 8: FAST-LIVO2 reference check; Table 9: YouTube qualitative stress-test video overview

Representative Evidence

The paper uses tracking, uncertainty maps, reconstruction, runtime, and ablation as evidence for different claims.

Quantitative tracking

Bonn/TUM/DyCheck/DROID-W results verify robustness over prior baselines in dynamic scenes.

Qualitative reconstruction

Uncertainty maps reveal where dynamic distractors should be trusted less.

Runtime

Runtime results support the real-time claim relative to WildGS-SLAM.

Ablation

Ablations separate the effect of each design component on tracking accuracy.

Tracking / Runtime / Ablation Evidence

Dataset Contribution Evidence

The DROID-W dataset separates outdoor sequences for quantitative tracking from YouTube videos for qualitative stress tests. The main text keeps the evaluation role visible first, while sensor and GT details are folded into the toggle.

Quantitative benchmark: DROID-W Downtown

Outdoor RGB sequences Downtown 1-7. Some sequences use RTK and others use LiDAR-inertial reference trajectories for ATE evaluation.

Role: tracking robustness in real outdoor dynamic RGB SLAM.

Qualitative stress test: YouTube videos

Web video sequences containing crowds, reflections, moving objects, and moving cameras.

Role: qualitative inspection of uncertainty maps and reconstruction quality.

Dataset / GT details

Dataset composition

Keeping the quantitative benchmark and qualitative stress test separate makes the role of each table clear.

Group	Included content	Evaluation role
DROID-W Downtown	RGB 1200×1600, 20 FPS; synchronized LiDAR/IMU; RTK or FAST-LIVO2 reference trajectory	ATE comparison for outdoor dynamic RGB SLAM.
YouTube videos	Six web videos with different FPS/duration; crowds, reflections, and moving objects; moving-camera conditions	Qualitative comparison of uncertainty maps and reconstruction.

Reference trajectory check

For Downtown 1-2, where RTK is unavailable, the paper uses FAST-LIVO2 as the reference and reports a separate check on sequences with RTK.

Usage / Limits: when is it useful?

DROID-W is most useful when RGB SLAM should keep operating in dynamic scenes without trusting every correspondence equally. It is weaker when early pose alignment is unreliable, because uncertainty estimation also depends on frame-to-frame alignment.