[Paper Review] WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments

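Key Summary

WildGS-SLAM is a SLAM system that takes only monocular RGB input and reduces the influence of dynamic distractors by feeding a per-pixel uncertainty into both DROID-style tracking and 3D Gaussian mapping.

Problem: dynamic distractors break static SLAM
Solution: uncertainty-weighted tracking/mapping
Evidence: tracking / rendering / ablation

One-Sentence Summary

The paper recasts dynamic-scene SLAM from "what should be removed" to "how much should each pixel be trusted", and lowers the optimization influence of pixels that are likely to be distractors.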

Contribution 01

Monocular 3DGS SLAM

Builds a static 3D Gaussian map from monocular RGB video of a dynamic environment.

Contribution 02

Uncertainty MLP

Predicts per-pixel uncertainty from 3D-aware DINOv2 features and adapts online to each sequence.

Contribution 03

Tracking + Mapping Weight

Uses the same uncertainty in DBA and in the rendering loss to reduce the influence of dynamic distractors.

Contribution 04

Wild-SLAM Dataset

Provides in-the-wild dynamic evaluation with MoCap RGB-D sequences and iPhone RGB videos.

Insights I Took Away

WildGS-SLAM is easiest to understand as a paper that extends DROID-W-style uncertainty-weighted tracking to 3D Gaussian map optimization. The same uncertainty signal becomes a common language linking pose estimation and rendering.

Processing Flow

01 RGB Sequence: monocular dynamic video

02 DINOv2 Feature: 3D-aware image feature

03 Uncertainty MLP: per-pixel β map

04 DBA Tracking: β-weighted pose update

05 3DGS Mapping: β-weighted render loss

06 Static Map: artifact-reduced rendering

Comparison of Approaches

Semantic / Mask SLAM

Removal based on class priors

Strong on known movable categories, but weak against unseen distractors, shadows, and complex motion patterns.

Static 3DGS SLAM

Dense, but assumes a static scene

Strong reconstruction and view synthesis in static scenes, but dynamic objects cause drift and artifacts.

WildGS-SLAM

geometric uncertainty

Uses learned uncertainty as a soft geometric weight without relying directly on semantic labels or RGB-D.

Detailed Paper Notes

What follows is a detailed reading that keeps as much of the paper's content as possible. Background, related work, dataset details, and notes on baseline sources that stray from the main thread are set apart as asides.

Problem: dynamic distractors shake tracking and rendering at the same time

WildGS-SLAM's problem statement targets the static-world assumption. Existing monocular SLAM and 3DGS SLAM assume that the camera observes a consistent static scene, but real video contains dynamic distractors such as people, shadows, occlusions, and lighting changes that contaminate pose updates and map optimization together.

Figure 1. WildGS-SLAM overview. Shows the goal of tracking the camera trajectory in monocular video with dynamic distractors and reconstructing only the static elements as a 3D Gaussian map.

Problem Flow

The paper redefines the dynamic scene not as a segmentation problem but as a problem of lowering observation confidence in both tracking and mapping.

01 Static-scene assumption

Feature matching and photometric consistency presuppose a static scene.

02 Dynamic distractor

Moving objects, shadows, and occlusion disturb the pose and render losses.

03 Limits of mask/semantic approaches

Relying on predefined classes or RGB-D cues limits generalization.

04 WildGS-SLAM's reframing

Uses per-pixel uncertainty as a shared weight for tracking and mapping.

Problem / Proposal

Both the Introduction and the Abstract lead to the claim that dynamic distractors should be downweighted by uncertainty rather than hard-removed.

Problem axis	Bottleneck of existing approaches	WildGS-SLAM's view
Tracking	Moving pixels are mistaken for camera motion	Downweight DBA residuals with $ \beta_i $
Mapping	Dynamic objects remain in the Gaussian map as artifacts	Apply an uncertainty weight to the rendering loss
Generalization	Depends on semantic classes, RGB-D depth, or optical-flow masks	Use an online uncertainty MLP on DINOv2 features

Related Work context in detail

Axes for reading prior work

The paper's position becomes clear when Related Work is split into "how are dynamic regions identified" and "how are 3DGS/NeRF representations used for SLAM".

Research line	What it offers	Remaining limit
Traditional Visual SLAM	Feature/geometry-based pose estimation	Often needs semantic/RGB-D cues to remove dynamic objects
Dynamic SLAM	Handles distractors with masks, optical flow, and object motion	Depends on predefined classes or motion patterns
Neural / 3DGS SLAM	Strong at dense reconstruction and view synthesis	Strong static-scene assumption causes artifacts in dynamic scenes
Uncertainty NeRF/GS	Models ambiguity as uncertainty	Often presupposes sparse views and known camera poses

Mechanism: how is uncertainty fed into tracking and mapping?

The core of the method is that the uncertainty map $ \beta_i $ is not predicted once and discarded; it serves both as the residual weight in DBA tracking and as the rendering-loss weight in 3DGS mapping. As a result, dynamic distractors are weakened in the pose update and leave less of a trace in the Gaussian map.

Figure 2. System Overview. The uncertainty MLP predicts per-pixel uncertainty from DINOv2 features, and the same uncertainty enters the DBA in tracking and the rendering loss in mapping.

Mechanism Thread Summary

The method divides into 3D Gaussian rendering, uncertainty prediction, uncertainty-guided DBA, and uncertainty-guided map update.

Part	What it handles	Core device
3DGS rendering	Represents the static scene as a differentiable Gaussian map	Color/depth alpha blending
Uncertainty prediction	Marks pixels likely to be dynamic distractors as low-confidence	3D-aware DINOv2 feature + shallow MLP
Tracking	Reduces the influence of dynamic pixels on pose/disparity updates	$ \Sigma_{ij}/\beta_i^2 $ weighted DBA + metric depth regularization
Mapping	Mitigates dynamic objects being left in the Gaussian map	Uncertainty-weighted color/depth rendering loss

Design Choice

An important choice in WildGS-SLAM is that the uncertainty MLP and the Gaussian map are optimized independently.

Shared signal

$ \beta $ is used on both the tracking and mapping sides.

Separated optimization

Gradients between the map and the uncertainty MLP are detached.

Resulting effect

Uncertainty reduces only the distractor influence without harming map quality.

1. 3D Gaussian rendering

The static scene is represented by a Gaussian set $ \mathcal{G}=\{g_i\}_{i=1}^{K} $. Each Gaussian has a color, opacity, mean, and covariance; after projection onto the camera plane, color and depth are rendered by alpha blending.

$$\alpha_i = o_i \exp\left(-\frac{1}{2}(x' - \mu'_i)^T {\Sigma'_i}^{-1}(x' - \mu'_i)\right)\tag{1}$$

Eq. (1). Gaussian opacity contribution. The opacity contribution of a projected Gaussian at pixel $x'$.
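Eq. (1) can be evaluated directly for a single projected Gaussian. The sketch below is only an illustration under simple assumptions: the function name `gaussian_alpha`, the tuple-based 2x2 covariance, and the example values are hypothetical and do not come from the paper's code.

```python
import math

def gaussian_alpha(opacity, pixel, mean, cov):
    """Eq. (1): opacity contribution of one projected 2D Gaussian at a pixel."""
    dx, dy = pixel[0] - mean[0], pixel[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of the 2x2 projected covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    # Squared Mahalanobis distance (x' - mu')^T Sigma'^{-1} (x' - mu').
    maha = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return opacity * math.exp(-0.5 * maha)

identity = ((1.0, 0.0), (0.0, 1.0))
at_center = gaussian_alpha(0.8, (0.0, 0.0), (0.0, 0.0), identity)
one_pixel_away = gaussian_alpha(0.8, (1.0, 0.0), (0.0, 0.0), identity)
```

At the Gaussian center the contribution equals the opacity $o_i$ itself, and it decays with the Mahalanobis distance from the projected mean.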

$$\hat{I}=\sum_{i\in\mathcal{G}'} c_i\alpha_i\prod_{j=1}^{i-1}(1-\alpha_j),\qquad \hat{D}=\sum_{i\in\mathcal{G}'} \hat{d}_i\alpha_i\prod_{j=1}^{i-1}(1-\alpha_j)\tag{2}$$

Eq. (2). Color-depth alpha blending. Overlapping Gaussians are blended in depth order to compute the rendered color and depth.
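The front-to-back accumulation in Eq. (2) can be sketched for a single pixel as follows. This is a simplified illustration, assuming scalar (grayscale) colors and Gaussians that are already sorted by depth; the name `blend` is hypothetical.

```python
def blend(colors, depths, alphas):
    """Eq. (2): front-to-back alpha blending over depth-sorted Gaussians."""
    color, depth, transmittance = 0.0, 0.0, 1.0
    for c, d, a in zip(colors, depths, alphas):
        # transmittance is the running product of (1 - alpha_j) for j < i.
        weight = a * transmittance
        color += c * weight
        depth += d * weight
        transmittance *= 1.0 - a
    return color, depth

# Two Gaussians: a bright one in front (depth 2), a dark one behind (depth 4).
rendered_color, rendered_depth = blend([1.0, 0.0], [2.0, 4.0], [0.5, 0.5])
```

Because color and depth share the same blending weights, the rendered depth stays consistent with what the rendered image shows at that pixel.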

2. Uncertainty prediction

From the input image $I_i$, a 3D-aware DINOv2 feature $ \mathcal{F}_i=F(I_i) $ is extracted, and a shallow MLP $P$ predicts the uncertainty map $ \beta_i=P(\mathcal{F}_i) $. The MLP is trained online on the streamed frames and adapts to the distractors and occlusions of each sequence.
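A minimal sketch of what a shallow per-pixel predictor $P$ could look like is given below. The single hidden layer, the softplus output, the `beta_min` floor, and all weights are assumptions made for illustration; the text above only states that $P$ is a shallow MLP on DINOv2 features.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def predict_uncertainty(feature, w1, b1, w2, b2, beta_min=0.1):
    """Map one per-pixel feature vector to a strictly positive uncertainty beta."""
    hidden = [max(0.0, sum(w * f for w, f in zip(row, feature)) + b)
              for row, b in zip(w1, b1)]  # ReLU hidden layer
    raw = sum(w * h for w, h in zip(w2, hidden)) + b2
    return beta_min + softplus(raw)  # keep beta > 0 so 1 / beta**2 is defined

# Hypothetical 2-D feature and a single hidden unit.
beta_neutral = predict_uncertainty([1.0, 2.0], [[0.0, 0.0]], [0.0], [0.0], 0.0)
beta_high = predict_uncertainty([1.0, 0.0], [[1.0, 0.0]], [0.0], [2.0], 0.0)
```

Keeping the output strictly positive matters because $ \beta_i $ later appears as $1/\beta_i^2$ in both the tracking and mapping objectives.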

$$\mathcal{L}_{\mathrm{depth}}=\lvert \hat{D}_i-\tilde{D}_i\rvert_1\tag{3}$$

Eq. (3). Metric-depth L1 supervision. The L1 depth signal between the rendered depth and the Metric3D V2 metric depth.

$$\mathcal{L}_{\mathrm{uncer}}=\mathcal{L}'_{\mathrm{SSIM}}+\lambda_1\frac{\mathcal{L}_{\mathrm{uncer\_D}}}{\beta_i^2}+\lambda_2\mathcal{L}_{\mathrm{reg\_V}}+\lambda_3\mathcal{L}_{\mathrm{reg\_U}}\tag{4}$$

Eq. (4). Uncertainty learning objective. An uncertainty objective combining modified SSIM, a depth uncertainty term, feature-similarity regularization, and uncertainty growth regularization. The paper's notation is $ \mathcal{L}_{\mathrm{uncer\_D}} $; following the text, it can be read as a custom depth term that brings the L1 depth signal of Eq. (3) into the uncertainty loss.

Loss terms based on NeRF On-the-go

What comes from NeRF On-the-go

Eq. (4) of WildGS-SLAM reuses the uncertainty-learning idea of NeRF On-the-go for the SLAM/3DGS setting. Rather than a loss copied as is, it is more accurate to read it as following the modified SSIM and regularization terms and adding a depth term.

Term	Meaning in NeRF On-the-go	Role in WildGS-SLAM
$ \mathcal{L}'_{\mathrm{SSIM}} $	Multiplies the luminance, contrast, and structure differences of a patch to expose the dynamic/static difference more strongly	Lets distractors that look similar in RGB alone still receive high uncertainty through structural differences
$ \mathcal{L}_{\mathrm{uncer\_D}}/\beta_i^2 $	A WildGS-SLAM addition absent from NeRF On-the-go. In the paper's notation, $ \mathcal{L}_{\mathrm{uncer\_D}} $ is a custom depth uncertainty term tied to the L1 depth signal	Uses uncertainty to control the influence of regions where rendered depth and Metric3D depth disagree
$ \mathcal{L}_{\mathrm{reg\_V}} $	Consistency regularization: rays/pixels with similar DINO features should also have similar uncertainty	Reduces erratic uncertainty within regions of similar appearance/semantics
$ \mathcal{L}_{\mathrm{reg\_U}} $	A growth regularizer of the form $ \log \beta $	Prevents the trivial solution of growing $ \beta $ without bound at every pixel to escape the loss
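As a worked check of why the $ \log\beta $ term blocks the trivial solution, consider only the depth term and the growth regularizer of Eq. (4) at a single pixel. This is a simplification: the residual $e>0$ is treated as fixed, and $ \mathcal{L}'_{\mathrm{SSIM}} $ and $ \mathcal{L}_{\mathrm{reg\_V}} $ are ignored.

$$f(\beta)=\lambda_1\frac{e}{\beta^2}+\lambda_3\log\beta,\qquad f'(\beta)=-\frac{2\lambda_1 e}{\beta^3}+\frac{\lambda_3}{\beta}=0\;\Rightarrow\;\beta^{*}=\sqrt{\frac{2\lambda_1 e}{\lambda_3}}$$

Under this simplification the preferred uncertainty grows with the square root of the residual, so pixels with persistent depth mismatch receive larger $ \beta $. With $ \lambda_3=0 $ the function $f$ would decrease monotonically in $ \beta $, which is exactly the "grow $ \beta $ without bound" solution the regularizer rules out.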

Just the key formulas

NeRF On-the-go observes that RGB error alone can miss distractors whose color resembles the background, and learns uncertainty by separating the three components of SSIM.

$$\mathcal{L}'_{\mathrm{SSIM}}=(1-L(P,\hat{P}))(1-C(P,\hat{P}))(1-S(P,\hat{P}))$$

Auxiliary. Modified SSIM uncertainty cue. Reflects the luminance, contrast, and structure differences between patch $P$ and rendered patch $\hat{P}$.
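The product form above can be sketched for a pair of flattened grayscale patches. The component definitions follow the standard SSIM decomposition; the constants `c1`, `c2` (the usual $0.01^2$ and $0.03^2$ for intensities in $[0,1]$) and the whole-patch statistics are assumptions, not values taken from the paper.

```python
import math

def modified_ssim_loss(patch, rendered, c1=0.01 ** 2, c2=0.03 ** 2):
    """(1 - L)(1 - C)(1 - S) for two equally sized grayscale patches."""
    n = len(patch)
    mu_p, mu_r = sum(patch) / n, sum(rendered) / n
    var_p = sum((x - mu_p) ** 2 for x in patch) / n
    var_r = sum((y - mu_r) ** 2 for y in rendered) / n
    cov = sum((x - mu_p) * (y - mu_r) for x, y in zip(patch, rendered)) / n
    sd_p, sd_r = math.sqrt(var_p), math.sqrt(var_r)
    c3 = c2 / 2.0
    lum = (2.0 * mu_p * mu_r + c1) / (mu_p ** 2 + mu_r ** 2 + c1)  # luminance L
    con = (2.0 * sd_p * sd_r + c2) / (var_p + var_r + c2)          # contrast C
    struct = (cov + c3) / (sd_p * sd_r + c3)                       # structure S
    return (1.0 - lum) * (1.0 - con) * (1.0 - struct)

same = modified_ssim_loss([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
different = modified_ssim_loss([0.1, 0.2, 0.3], [0.9, 0.5, 0.7])
```

One consequence of the product form is that the cue vanishes whenever any single component matches perfectly, so it responds most strongly when luminance, contrast, and structure all disagree.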

$$\mathcal{N}(r)=\{r'\mid \cos(f,f')>\eta\},\qquad \bar{\beta}(r)=\frac{1}{|\mathcal{N}(r)|}\sum_{r'\in\mathcal{N}(r)}\beta(r')$$

Auxiliary. Feature-neighbor uncertainty average. Forms the set of rays with similar DINO features and computes the average uncertainty within it.

$$\mathcal{L}_{\mathrm{reg\_V}}(r)=\frac{1}{|\mathcal{N}(r)|}\sum_{r'\in\mathcal{N}(r)}(\bar{\beta}(r)-\beta(r'))^2,\qquad \mathcal{L}_{\mathrm{reg\_U}}=\log\beta_i$$

Auxiliary. Uncertainty regularization terms. Auxiliary terms for feature-wise uncertainty consistency and for suppressing uncertainty growth.
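The feature-neighbor regularizer can be written out for a handful of rays. The sketch below follows the two formulas above literally; the threshold `eta=0.75`, the toy 2-D features, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def reg_v(features, betas, r, eta=0.75):
    """L_reg_V for ray r: variance of beta inside its feature-neighbor set N(r)."""
    neighbors = [k for k, f in enumerate(features) if cosine(features[r], f) > eta]
    mean_beta = sum(betas[k] for k in neighbors) / len(neighbors)
    return sum((mean_beta - betas[k]) ** 2 for k in neighbors) / len(neighbors)

features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # rays 0 and 1 look alike
betas = [1.0, 3.0, 10.0]
loss_ray0 = reg_v(features, betas, r=0)
```

Only rays 0 and 1 fall into the neighbor set here, so the very different uncertainty of ray 2 is not penalized; the term only asks similar-looking regions to agree.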

Rendering / uncertainty notation

Notation: Gaussian rendering and uncertainty

WildGS-SLAM's method mixes Gaussian rendering, learned uncertainty, and DROID-SLAM-style tracking in a single chain of equations. Sorting each variable into the rendering side, the uncertainty side, or the tracking/mapping side makes the role of each loss clear.

Notation	Meaning	How to read it
$g_i$, $\mathcal G$, $\mathcal G'$	3D Gaussian, the map's Gaussian set, and the projected/sorted Gaussian set contributing to a pixel	Rendering is computed by front-to-back alpha blending in depth order.
$o_i$, $\mu'_i$, $\Sigma'_i$, $x'$	Opacity, projected mean/covariance, and image-plane pixel	Eq. (1) is the opacity contribution of a projected Gaussian to one pixel.
$\alpha_i$, $c_i$, $\hat d_i$	Per-Gaussian alpha, color, and depth contribution	The basic units entering the color/depth rendering of Eq. (2).
$\hat I$, $\hat D$, $\tilde D$	Rendered image/depth and Metric3D depth	Metric depth is used for monocular tracking and for depth regularization in mapping.
$\mathcal F_i$, $P$, $\beta_i$	DINOv2 feature, shallow MLP, and predicted uncertainty map	The part that adapts online to each sequence's distractors and occlusions.
$\mathcal{L}'_{\mathrm{SSIM}}$, $\mathcal{L}_{\mathrm{uncer\_D}}$	Modified SSIM term and depth uncertainty term	Bring RGB structural differences and depth mismatch into the uncertainty objective.
$\mathcal N(r)$, $f$, $\eta$, $\bar\beta(r)$	Feature-neighbor set, DINO feature, similarity threshold, and average uncertainty	Regularization that keeps uncertainty consistent across rays/pixels with similar features.
$\Sigma_{ij}/\beta_i^2$, $M_i$	Uncertainty-scaled covariance and metric-depth mask	Lowers the influence of dynamic distractors and unreliable depth in tracking.

3. Uncertainty-guided tracking

Tracking builds on DROID-SLAM's recurrent optical-flow update and DBA. WildGS-SLAM adds uncertainty and metric depth to it, reducing the influence of moving distractors on the flow residual and stabilizing early tracking.

$$\arg\min_{\omega,d}\sum_{(i,j)\in\mathcal{E}}\left\|\tilde{p}_{ij}-\Pi_c\left(\omega_j^{-1}\omega_i\Pi_c^{-1}(p_i,d_i)\right)\right\|_{\Sigma_{ij}/\beta_i^2}^{2}+\lambda_4\sum_{i\in\mathcal{V}}\left\|M_i\left(d_i-1/\tilde{D}_i\right)\right\|^2\tag{5}$$

Eq. (5). Uncertainty-aware DBA objective. The first term is the uncertainty-aware DBA; the second is a disparity regularization based on monocular metric depth.
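The effect of scaling each residual by $1/\beta_i^2$ can be seen in a deliberately tiny stand-in for the first term of Eq. (5). This is not the actual DBA, which solves for poses and disparities with Gauss-Newton; here the unknown is just one 2D image shift, and $ \Sigma_{ij} $ is read as a per-pixel confidence weight in the DROID-SLAM sense. Both simplifications, and the name `weighted_shift`, are assumptions for illustration.

```python
def weighted_shift(flows, confidences, betas):
    """Closed-form minimizer of sum_k (w_k / beta_k**2) * ||flow_k - t||**2."""
    weights = [w / (b * b) for w, b in zip(confidences, betas)]
    total = sum(weights)
    tx = sum(w * f[0] for w, f in zip(weights, flows)) / total
    ty = sum(w * f[1] for w, f in zip(weights, flows)) / total
    return tx, ty

# Three static pixels agree on the camera-induced shift; the last pixel
# lies on a moving object and reports a very different flow.
flows = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (5.0, 4.0)]
conf = [1.0, 1.0, 1.0, 1.0]
uniform = weighted_shift(flows, conf, [1.0, 1.0, 1.0, 1.0])
robust = weighted_shift(flows, conf, [1.0, 1.0, 1.0, 10.0])
```

With uniform uncertainty the distractor drags the estimate to `(2.0, 1.0)`, while a large $ \beta $ on the distractor pixel keeps the estimate close to the static pixels' `(1.0, 0.0)`.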

4. Uncertainty-guided mapping

When a new keyframe arrives, the 3D Gaussian map is expanded using its pose, RGB, and metric depth. Keyframes in the local window are then sampled, rendered color/depth are computed, and the map is updated with the uncertainty-weighted rendering loss.

$$\mathcal{L}_{\mathrm{render}}=\frac{\lambda_5\mathcal{L}_{\mathrm{color}}+\lambda_6\mathcal{L}_{\mathrm{depth}}}{\beta^2}+\lambda_7\mathcal{L}_{\mathrm{iso}}\tag{6}$$

Eq. (6). Uncertainty-weighted rendering loss. The Gaussian-map rendering loss: the uncertainty map jointly scales the color/depth rendering losses, and isotropic regularization is added.
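Eq. (6) can be sketched on per-pixel error lists. The sketch assumes the color and depth errors are already computed per pixel, that the weighted term is averaged over pixels, and that all $ \lambda $ values are 1; none of these choices is taken from the paper.

```python
def render_loss(color_err, depth_err, betas, iso, lam5=1.0, lam6=1.0, lam7=1.0):
    """Eq. (6): uncertainty-weighted color/depth error plus an isotropic term."""
    n = len(betas)
    weighted = sum((lam5 * c + lam6 * d) / (b * b)
                   for c, d, b in zip(color_err, depth_err, betas)) / n
    return weighted + lam7 * iso

# Second pixel has 4x the error but beta = 2, so it contributes the same
# amount as the first pixel after the 1 / beta**2 weighting.
loss = render_loss([0.2, 0.8], [0.1, 0.4], [1.0, 2.0], iso=0.5)
```

The isotropic term sits outside the division because it is a property of the Gaussians rather than of any observed pixel.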

$ \mathcal{L}_{\mathrm{iso}} $ from Gaussian Splatting SLAM

Isotropic regularization

WildGS-SLAM's $ \mathcal{L}_{\mathrm{iso}} $ follows the isotropic shape regularization of Gaussian Splatting SLAM [30]. Its main purpose is to keep Gaussian ellipsoids from stretching excessively along poorly observed directions, which would cause rendering artifacts and unstable tracking.

$$\mathcal{L}_{\mathrm{iso}}=\sum_{i=1}^{|\mathcal{G}|}\left\|s_i-\tilde{s}_i\mathbf{1}\right\|_1$$

Auxiliary. Isotropic scale regularization. Pushes the Gaussian scale vector $s_i$ toward the isotropic shape obtained by copying the mean scale $\tilde{s}_i$ to every axis.
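The isotropic term is simple enough to compute by hand. A minimal sketch, assuming each Gaussian's scale is given as a 3-tuple (the name `iso_loss` is hypothetical):

```python
def iso_loss(scales):
    """Sum over Gaussians of || s_i - mean(s_i) * 1 ||_1."""
    total = 0.0
    for s in scales:
        mean = sum(s) / len(s)
        total += sum(abs(v - mean) for v in s)
    return total

sphere = iso_loss([(1.0, 1.0, 1.0)])     # already isotropic
elongated = iso_loss([(1.0, 2.0, 3.0)])  # mean scale 2.0, deviations 1 + 0 + 1
```

A perfectly spherical Gaussian incurs no penalty, and the penalty grows linearly as any axis departs from the mean scale.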

Color rendering loss details

$$\mathcal{L}_{\mathrm{color}}=(1-\lambda_{\mathrm{ssim}})\|\hat{I}-I\|_1+\lambda_{\mathrm{ssim}}\mathcal{L}_{\mathrm{ssim}}\tag{7}$$

Eq. (7). Color rendering loss. The color rendering term combining L1 color error and SSIM loss.

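Evidence: which tasks was it validated on?

The evaluation reads best as three parts: tracking, novel view synthesis, and ablation. WildGS-SLAM's core claim is that downweighting dynamic distractors with uncertainty improves tracking and rendering together, so ATE and the rendering metrics need to be read side by side.

Evaluation conditions

Evaluation Setup

WildGS-SLAM's evaluation uses the newly collected Wild-SLAM dataset together with existing dynamic SLAM benchmarks to examine tracking, rendering, and ablation.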

Group	Details	Meaning
Wild-SLAM MoCap	Intel RealSense D455 RGB-D; OptiTrack ground truth; 10 dynamic sequences	The core in-house dataset for quantitative evaluation of tracking ATE and novel view synthesis.
Wild-SLAM iPhone	iPhone 14 Pro RGB; 7 non-staged in-the-wild sequences; no ground-truth trajectory	Qualitative check of distractors, shadows, and uncertainty maps in a monocular-only setting.
Bonn / TUM	Dynamic sequences from existing RGB-D dynamic SLAM benchmarks	Checks that tracking is stable on existing benchmarks too, not only on results fitted to the new dataset.

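Implementation / Metrics

Initial tracking and the final map refinement are settings that stabilize the evaluation numbers, so they are worth keeping in mind when reading the results.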

Item	Setting	Meaning
Initialization	DBA is initialized with the first 12 keyframes; the uncertainty weight is disabled at the start	Stabilizes tracking on early frames, before the uncertainty MLP has converged.
Final refinement	Gaussian map refinement over all keyframes after the final global BA	Re-corrects map quality with Eq. (6) after the pose update.
Metrics	Tracking: ATE RMSE; Rendering: PSNR, SSIM, LPIPS; Ablation: comparison with uncertainty/depth/disparity removed	Checks that WildGS-SLAM's claim holds for both pose stability and rendering quality.
Baselines	Classic SLAM, dynamic SLAM, neural/3DGS SLAM, and feed-forward methods are compared together	The comparison mixes methods with and without RGB-D/semantic priors, so the gap to the monocular RGB setting has to be read alongside the numbers.
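For readers less familiar with the two headline metrics, the sketch below shows how ATE RMSE and PSNR are computed in their simplest form. It assumes the estimated trajectory has already been aligned to the ground truth and that image intensities lie in $[0,1]$; the alignment step and the evaluation protocol used by the paper are not reproduced here, and the function names are hypothetical.

```python
import math

def ate_rmse(estimated, ground_truth):
    """RMSE of per-frame translation error between two aligned trajectories."""
    sq = [sum((e - g) ** 2 for e, g in zip(p, q))
          for p, q in zip(estimated, ground_truth)]
    return math.sqrt(sum(sq) / len(sq))

def psnr(image, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB for flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(image, reference)) / len(image)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# Toy values in meters: errors of 3 cm and 4 cm on two frames.
ate = ate_rmse([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
               [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)])
quality = psnr([0.5, 0.5], [0.4, 0.6])
```

Lower ATE RMSE means a more accurate trajectory, while higher PSNR means the rendering is closer to the reference image.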

Evaluation Evidence

The core evaluation is tracking on Wild-SLAM/Bonn/TUM and rendering on Wild-SLAM; the ablation verifies whether the uncertainty and depth/disparity designs are actually needed.

Core evaluation

Axis	Evidence	What to check
Tracking	Table 1, 3, 4	ATE RMSE comparison on Wild-SLAM MoCap, Bonn, and TUM.
Rendering	Table 2, Figure 3-6	Distractor removal and static-scene rendering quality.
Real-world generality	Figure 5	Shadows and distractors handled by uncertainty on iPhone RGB sequences.

Supporting evidence

Axis	Evidence	Meaning
Ablation	Table 5	The uncertainty mask, L1 depth loss, and disparity regularization all contribute to tracking stability.
Dataset contribution	Wild-SLAM MoCap / iPhone	Provides evaluation conditions with dynamic indoor/outdoor scenes, occlusion, and varied object motion.

Tracking Evidence

Table 1. Tracking Performance on Wild-SLAM MoCap Dataset. In ATE RMSE, WildGS-SLAM records an average of 0.46 cm and is more stable than the baselines on most dynamic sequences.

Table 3. Tracking Performance on Bonn RGB-D Dynamic Dataset. Compares methods that rely on RGB-D/semantics alongside monocular baselines; WildGS-SLAM has the lowest average ATE at 2.31 cm.

Table 4. Tracking Performance on TUM RGB-D Dataset. The paper states that WildGS-SLAM shows the best overall performance on the dynamic sequence subset.

Rendering Evidence

Table 2. Novel View Synthesis Evaluation on Wild-SLAM MoCap Dataset. PSNR/SSIM are higher and LPIPS is lower than for Splat-SLAM, showing that uncertainty-aware mapping improves rendering quality.

Figure 3. Input View Synthesis Results on Wild-SLAM MoCap Dataset. The paper describes the method as removing distractors regardless of their type and rendering the static scene more realistically.

Figure 4. Novel View Synthesis Results on Wild-SLAM MoCap Dataset. Novel view synthesis is evaluated on a static-scene subset, with the PSNR metric shown inside the images.

Figure 5. Input View Synthesis Results on Wild-SLAM iPhone Dataset. Shows a monocular-only comparison; the paper notes that the uncertainty map can look somewhat blurry because of the DINOv2 feature resolution and the downsampling used in mapping.

Figure 6. View Synthesis Results on Bonn RGB-D Dynamic Dataset. Compares the artifacts of ReFusion and DynaSLAM on the Balloon/Crowd sequences; WildGS-SLAM renders more stably even with motion blur.

Ablation Evidence

Table 5. WildGS-SLAM Ablation Study. Compares the full model with variants that remove the uncertainty mask, the L1 depth loss, or the disparity regularization; F denotes a sequence failure within that dataset.

Evidence Brief

The results support the view that uncertainty is not a simple mask but a shared weight affecting both tracking and rendering.

Tracking

Average ATE improves on Wild-SLAM, Bonn, and TUM.

Rendering

Artifact-free rendering and improved NVS quality on the static-scene image subset.

Uncertainty

High uncertainty is assigned to distractors and shadows on the iPhone sequences as well.

Ablation

Uncertainty, the depth signal, and disparity regularization are all confirmed as necessary parts of the design.

Usage / Limits: when is it useful and where is it weak?

WildGS-SLAM is particularly useful when both tracking and rendering are wanted from a dynamic scene with only monocular RGB. It can downweight, through uncertainty, elements that are awkward to handle with a hard mask, such as distractors with no fixed semantic class or shadows.

When to Use / Avoid

Category	Summary	Reason
Good fit	A monocular RGB dynamic scene where both pose and a static 3DGS map are needed	Uncertainty reduces dynamic influence in both tracking and mapping
Strong condition	Video where diverse distractors must be handled without semantic labels	DINOv2 features and the online MLP adapt to sequence-specific patterns
Weak condition	Few views of the same region, or complex dynamic scenes that need a motion prior	The uncertainty predictor depends on online learning from the input frames

Takeaway

(In progress...)

Problem: dynamic distractors destabilize both tracking and rendering

WildGS-SLAM starts from the static-world assumption. Monocular SLAM and 3DGS SLAM often assume that cameras observe a consistent static scene, but real videos contain people, shadows, occlusions, and lighting changes that contaminate both pose update and map optimization.

Problem Flow

The paper reframes dynamic scenes as a trust-weighting problem for both tracking and mapping, not only a segmentation problem.

01 Static-scene assumption

Feature matching and photometric consistency assume a rigid scene.

02 Dynamic distractor

Moving objects, shadows, and occlusion corrupt pose and render losses.

03 Mask/semantic limit

Predefined classes or RGB-D cues limit generalization.

04 WildGS-SLAM's reframing

Use per-pixel uncertainty as a shared weight for tracking and mapping.

Problem / Proposal

The introduction and abstract converge on the idea that dynamic distractors should be downweighted by uncertainty rather than hard-removed only by masks.

Problem axis	Bottleneck	WildGS-SLAM's view
Tracking	Moving pixels are mistaken for camera motion	Downweight DBA residuals with $ \beta_i $
Mapping	Dynamic objects remain as artifacts in the Gaussian map	Apply uncertainty weights to rendering loss
Generalization	Semantic classes, RGB-D depth, or optical-flow masks can be brittle	Use online uncertainty from DINOv2 features

Related work context

How to read prior work

The related work is best grouped by how each method detects dynamic regions and how it uses neural or 3DGS representations for SLAM.

Research line	Strength	Remaining limit
Traditional Visual SLAM	Feature/geometry-based pose estimation	Often requires semantic or RGB-D cues for dynamic-object removal
Dynamic SLAM	Handles distractors through masks, optical flow, or object motion	Depends on predefined classes or motion patterns
Neural / 3DGS SLAM	Strong dense reconstruction and view synthesis	Dynamic scenes create artifacts under static-scene assumptions
Uncertainty NeRF/GS	Models ambiguity using uncertainty	Often assumes sparse-view settings and known camera poses

Mechanism: how is uncertainty injected into tracking and mapping?

The key is not just predicting an uncertainty map $ \beta_i $. WildGS-SLAM uses the same uncertainty as a residual weight in DBA tracking and a rendering-loss weight in 3DGS mapping. Dynamic distractors therefore affect both pose update and Gaussian map optimization less.

Mechanism Thread Summary

The method consists of 3D Gaussian rendering, uncertainty prediction, uncertainty-guided DBA, and uncertainty-guided map update.

Part	Role	Core device
3DGS rendering	Represents the static scene as a differentiable Gaussian map	Color/depth alpha blending
Uncertainty prediction	Marks likely dynamic distractor pixels as lower-trust observations	3D-aware DINOv2 feature + shallow MLP
Tracking	Reduces dynamic-pixel influence on pose/disparity updates	$ \Sigma_{ij}/\beta_i^2 $ weighted DBA + metric depth regularization
Mapping	Prevents dynamic objects from remaining in the Gaussian map	Uncertainty-weighted color/depth rendering loss

Design Choice

The important design choice is to optimize the uncertainty MLP and Gaussian map independently.

Shared signal

$ \beta $ is used in both tracking and mapping.

Separated optimization

Gradients are detached between the map and uncertainty MLP.

Effect

Uncertainty reduces distractor influence without degrading map optimization.

1. 3D Gaussian rendering

The static scene is represented as a Gaussian set $ \mathcal{G}=\{g_i\}_{i=1}^{K} $. Each Gaussian has color, opacity, mean, and covariance, and rendered color/depth are obtained through alpha blending.

$$\alpha_i = o_i \exp\left(-\frac{1}{2}(x' - \mu'_i)^T {\Sigma'_i}^{-1}(x' - \mu'_i)\right)\tag{1}$$

Eq. (1). Gaussian opacity contribution. Opacity contribution of a projected Gaussian at pixel $x'$.

$$\hat{I}=\sum_{i\in\mathcal{G}'} c_i\alpha_i\prod_{j=1}^{i-1}(1-\alpha_j),\qquad \hat{D}=\sum_{i\in\mathcal{G}'} \hat{d}_i\alpha_i\prod_{j=1}^{i-1}(1-\alpha_j)\tag{2}$$

Eq. (2). Color-depth alpha blending. Rendered color and depth from depth-ordered Gaussian blending.

2. Uncertainty prediction

For an input image $I_i$, a 3D-aware DINOv2 feature $ \mathcal{F}_i=F(I_i) $ is extracted, and a shallow MLP $P$ predicts the uncertainty map $ \beta_i=P(\mathcal{F}_i) $. The MLP is trained online on streamed frames, so it can adapt to sequence-specific distractors and occlusion patterns.

$$\mathcal{L}_{\mathrm{depth}}=\lvert \hat{D}_i-\tilde{D}_i\rvert_1\tag{3}$$

Eq. (3). Metric-depth L1 supervision. L1 depth signal between rendered depth and Metric3D V2 metric depth.

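$$\mathcal{L}_{\mathrm{uncer}}=\mathcal{L}'_{\mathrm{SSIM}}+\lambda_1\frac{\mathcal{L}_{\mathrm{uncer\_D}}}{\beta_i^2}+\lambda_2\mathcal{L}_{\mathrm{reg\_V}}+\lambda_3\mathcal{L}_{\mathrm{reg\_U}}\tag{4}$$

Eq. (4). Uncertainty learning objective. Uncertainty objective combining modified SSIM, depth uncertainty, feature-similarity regularization, and uncertainty growth regularization. The paper denotes $ \mathcal{L}_{\mathrm{uncer\_D}} $ as a custom depth uncertainty term tied to the L1 depth signal in Eq. (3).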

NeRF On-the-go loss terms

What comes from NeRF On-the-go?

Eq. (4) reuses the uncertainty-learning idea from NeRF On-the-go in a SLAM/3DGS setting. It should be read as modified SSIM and regularization terms from NeRF On-the-go plus a WildGS-SLAM depth term, not as a direct copy of the entire loss.

Term	Meaning in NeRF On-the-go	Role in WildGS-SLAM
$ \mathcal{L}'_{\mathrm{SSIM}} $	Combines luminance, contrast, and structure differences to separate dynamic distractors from static background.	Gives high uncertainty to distractors even when RGB color alone is ambiguous.
$ \mathcal{L}_{\mathrm{uncer\_D}}/\beta_i^2 $	WildGS-SLAM-specific addition, not from NeRF On-the-go. In the paper notation, $ \mathcal{L}_{\mathrm{uncer\_D}} $ is a custom depth uncertainty term tied to the L1 depth signal.	Uses uncertainty to control regions where rendered depth and Metric3D depth disagree.
$ \mathcal{L}_{\mathrm{reg\_V}} $	Feature-neighbor consistency: rays/pixels with similar DINO features should have similar uncertainty.	Prevents noisy uncertainty variation inside similar appearance or semantic regions.
$ \mathcal{L}_{\mathrm{reg\_U}} $	A $ \log \beta $-style growth regularizer.	Prevents the trivial solution where every pixel receives infinitely large uncertainty.

Core formulas

NeRF On-the-go argues that pure RGB error can miss distractors with similar colors, so uncertainty is learned from a modified SSIM signal and feature-based consistency.

$$\mathcal{L}'_{\mathrm{SSIM}}=(1-L(P,\hat{P}))(1-C(P,\hat{P}))(1-S(P,\hat{P}))$$

Auxiliary. Modified SSIM uncertainty cue. Uses luminance, contrast, and structure differences between patch $P$ and rendered patch $\hat{P}$.

$$\mathcal{N}(r)=\{r'\mid \cos(f,f')>\eta\},\qquad \bar{\beta}(r)=\frac{1}{|\mathcal{N}(r)|}\sum_{r'\in\mathcal{N}(r)}\beta(r')$$

Auxiliary. Feature-neighbor uncertainty average. Builds a DINO-feature-neighbor set and averages uncertainty within it.

$$\mathcal{L}_{\mathrm{reg\_V}}(r)=\frac{1}{|\mathcal{N}(r)|}\sum_{r'\in\mathcal{N}(r)}(\bar{\beta}(r)-\beta(r'))^2,\qquad \mathcal{L}_{\mathrm{reg\_U}}=\log\beta_i$$

Auxiliary. Uncertainty regularization terms. Feature-wise uncertainty consistency and uncertainty-growth regularization.

Rendering / uncertainty notation

Notation: Gaussian rendering and uncertainty

WildGS-SLAM mixes Gaussian rendering, learned uncertainty, and DROID-SLAM-style tracking in one method chain. Separating rendering variables from uncertainty and tracking variables makes the loss terms easier to read.

Notation	Meaning	How to read it
$g_i$, $\mathcal G$, $\mathcal G'$	3D Gaussian, map Gaussian set, and projected/sorted Gaussians contributing to a pixel	Rendering uses depth-ordered front-to-back alpha blending.
$o_i$, $\mu'_i$, $\Sigma'_i$, $x'$	Opacity, projected mean/covariance, and image-plane pixel	Eq. (1) is the opacity contribution of a projected Gaussian to one pixel.
$\alpha_i$, $c_i$, $\hat d_i$	Per-Gaussian alpha, color, and depth contribution	The basic units used by the color/depth rendering equation.
$\hat I$, $\hat D$, $\tilde D$	Rendered image/depth and Metric3D depth	Metric depth regularizes monocular tracking and mapping.
$\mathcal F_i$, $P$, $\beta_i$	DINOv2 feature, shallow MLP, and predicted uncertainty map	The part that adapts online to sequence-specific distractors and occlusions.
$\mathcal{L}'_{\mathrm{SSIM}}$, $\mathcal{L}_{\mathrm{uncer\_D}}$	Modified SSIM term and depth uncertainty term	Bring RGB structure differences and depth mismatch into the uncertainty objective.
$\mathcal N(r)$, $f$, $\eta$, $\bar\beta(r)$	Feature-neighbor set, DINO feature, similarity threshold, and average uncertainty	Regularizes uncertainty to be consistent among visually/semantically similar rays.
$\Sigma_{ij}/\beta_i^2$, $M_i$	Uncertainty-scaled covariance and metric-depth mask	Reduces the influence of dynamic distractors and unreliable depth in tracking.

3. Uncertainty-guided tracking

The tracking module is based on DROID-SLAM's recurrent optical-flow update and DBA. WildGS-SLAM adds uncertainty and metric depth, reducing the influence of moving distractors on flow residuals and stabilizing early tracking.

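$$\arg\min_{\omega,d}\sum_{(i,j)\in\mathcal{E}}\left\|\tilde{p}_{ij}-\Pi_c\left(\omega_j^{-1}\omega_i\Pi_c^{-1}(p_i,d_i)\right)\right\|_{\Sigma_{ij}/\beta_i^2}^{2}+\lambda_4\sum_{i\in\mathcal{V}}\left\|M_i\left(d_i-1/\tilde{D}_i\right)\right\|^2\tag{5}$$

Eq. (5). Uncertainty-aware DBA objective. The first term is uncertainty-aware DBA; the second term is monocular metric-depth-based disparity regularization.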

4. Uncertainty-guided mapping

When a keyframe is inserted, its pose, RGB image, and metric depth expand the 3D Gaussian map. The map is then updated with uncertainty-weighted rendering loss over sampled local-window keyframes.

$$\mathcal{L}_{\mathrm{render}}=\frac{\lambda_5\mathcal{L}_{\mathrm{color}}+\lambda_6\mathcal{L}_{\mathrm{depth}}}{\beta^2}+\lambda_7\mathcal{L}_{\mathrm{iso}}\tag{6}$$

Eq. (6). Uncertainty-weighted rendering loss. Gaussian-map rendering loss where the uncertainty map weights both color and depth terms before isotropic regularization is added.

$ \mathcal{L}_{\mathrm{iso}} $ from Gaussian Splatting SLAM

Isotropic regularization

WildGS-SLAM's $ \mathcal{L}_{\mathrm{iso}} $ follows the isotropic shape regularization from Gaussian Splatting SLAM [30]. Its role is to prevent Gaussian ellipsoids from becoming excessively elongated in weakly observed directions, which can create rendering artifacts and destabilize tracking.

$$\mathcal{L}_{\mathrm{iso}}=\sum_{i=1}^{|\mathcal{G}|}\left\|s_i-\tilde{s}_i\mathbf{1}\right\|_1$$

Auxiliary. Isotropic scale regularization. Encourages the Gaussian scale vector $s_i$ to stay close to an isotropic shape formed by copying the mean scale $\tilde{s}_i$ to every axis.

Color rendering loss details

$$\mathcal{L}_{\mathrm{color}}=(1-\lambda_{\mathrm{ssim}})\|\hat{I}-I\|_1+\lambda_{\mathrm{ssim}}\mathcal{L}_{\mathrm{ssim}}\tag{7}$$

Eq. (7). Color rendering loss. Color rendering term combining L1 color error and SSIM loss.

Evidence: which tasks are tested?

The evaluation is best read through tracking, novel view synthesis, and ablation. The core claim is that uncertainty reduces dynamic-distractor influence and improves both tracking and rendering, so ATE and rendering metrics should be considered together.

Evaluation setup

Evaluation Setup

WildGS-SLAM is evaluated with the newly collected Wild-SLAM dataset and existing dynamic SLAM benchmarks across tracking, rendering, and ablation settings.

Group	Details	How to read it
Wild-SLAM MoCap	Intel RealSense D455 RGB-D; OptiTrack ground truth; 10 dynamic sequences	Main custom dataset for quantitative tracking ATE and novel-view-synthesis evaluation.
Wild-SLAM iPhone	iPhone 14 Pro RGB; 7 non-staged in-the-wild sequences; no ground-truth trajectory	Qualitative monocular-only check for distractors, shadows, and uncertainty maps.
Bonn / TUM	Dynamic sequences from existing RGB-D dynamic SLAM benchmarks	Checks whether tracking remains stable beyond the newly collected dataset.

Implementation / metric

Initialization and final map refinement are part of why the reported tracking and rendering numbers are stable, so they are worth reading with the results.

Item	Setting	Meaning
Initialization	Initial DBA with 12 keyframes; uncertainty weight is disabled early	Stabilizes early tracking before the uncertainty MLP has converged.
Final refinement	Gaussian map refinement over all keyframes after final global BA	Improves map quality after pose updates using the Eq. (6) objective.
Metrics	Tracking: ATE RMSE; Rendering: PSNR, SSIM, LPIPS; Ablation: remove uncertainty/depth/disparity components	Tests whether the core claim holds for both pose stability and rendering quality.
Baselines	Classic SLAM, dynamic SLAM, neural/3DGS SLAM, and feed-forward methods	Read the comparison together with RGB-D/semantic-prior assumptions.

Evaluation Evidence

Core evaluation covers Wild-SLAM/Bonn/TUM tracking and Wild-SLAM rendering, while ablations test whether uncertainty, depth, and disparity regularization are necessary.

Core evaluation

Axis	Evidence	What to check
Tracking	Table 1, 3, 4	ATE RMSE on Wild-SLAM MoCap, Bonn, and TUM.
Rendering	Table 2, Figure 3-6	Distractor removal and static-scene rendering quality.
Real-world generality	Figure 5	Uncertainty on iPhone RGB sequences with shadows and distractors.

Supporting evidence

Axis	Evidence	Meaning
Ablation	Table 5	Uncertainty mask, L1 depth loss, and disparity regularization all support tracking robustness.
Dataset contribution	Wild-SLAM MoCap / iPhone	Dynamic indoor/outdoor scenes, occlusion, and varied object motion.

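Tracking Evidence

Table 1. Tracking Performance on Wild-SLAM MoCap Dataset. WildGS-SLAM records an average ATE RMSE of 0.46 cm and is more stable than the baselines on most dynamic sequences.

Table 3. Tracking Performance on Bonn RGB-D Dynamic Dataset. Among RGB-D/semantic-dependent methods and monocular baselines, WildGS-SLAM has the lowest average ATE at 2.31 cm.

Table 4. Tracking Performance on TUM RGB-D Dataset. The paper reports the best overall performance for WildGS-SLAM on the dynamic sequence subset.

Rendering Evidence

Table 2. Novel View Synthesis Evaluation on Wild-SLAM MoCap Dataset. PSNR/SSIM are higher and LPIPS is lower than for Splat-SLAM.

Figure 3-6. View synthesis results on the Wild-SLAM MoCap, Wild-SLAM iPhone, and Bonn RGB-D Dynamic datasets.

Ablation Evidence

Table 5. WildGS-SLAM Ablation Study. Compares the full model with variants that remove the uncertainty mask, the L1 depth loss, or the disparity regularization; F denotes a sequence failure within that dataset.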

Evidence Brief

The results support uncertainty as a shared weight that improves both tracking and rendering.

Tracking

Average ATE improves on Wild-SLAM, Bonn, and TUM.

Rendering

Artifact-reduced rendering and NVS quality improve on static-scene subsets.

Uncertainty

iPhone sequences show high uncertainty on distractors and shadows.

Ablation

Uncertainty, depth signal, and disparity regularization are all useful.

Usage / Limits: when is it useful?

WildGS-SLAM is useful when monocular RGB dynamic videos need both camera tracking and a static 3D Gaussian map. It is especially relevant when semantic categories are unknown or when shadows and ambiguous distractors make hard masks brittle.

When to Use / Avoid

Category	Summary	Reason
Good fit	Dynamic monocular RGB scenes requiring both pose and static 3DGS map	Uncertainty reduces dynamic influence in both tracking and mapping
Strong condition	Videos with diverse distractors and no semantic labels	DINOv2 features and online MLP adapt to sequence-specific patterns
Weak condition	Limited repeated views of the same region or scenes requiring explicit motion priors	The uncertainty predictor depends on online learning from input frames

Takeaway

(In progress...)

