[논문 리뷰] WildPose: A Unified Framework for Robust Pose Estimation in the Wild

핵심 요약

WildPose는 MASt3R식 3D-aware feature, differentiable BA, edge-specific motion mask를 결합해 dynamic scene과 static scene 모두에서 안정적인 monocular pose estimation을 목표로 하는 프레임워크다.

문제dynamic → static SLAM 불안정 해결3D-aware BA + motion mask 근거dynamic / static / depth / ablation

한 문장 요약

이 논문은 in-the-wild pose estimation을 stronger 3D-aware pair feature와 edge-dependent motion mask가 differentiable BA를 함께 안정화하는 문제로 본다.

Contribution 01

Unified Pose Estimator

dynamic, low-motion, static benchmark를 함께 겨냥하는 monocular pose estimator.

Contribution 02

3D-aware BA Operator

DROID식 optimization loop 안에 MASt3R encoder feature를 넣어 frontend prior 강화.

Contribution 03

Edge Motion Mask

pairwise motion mask로 특정 edge에서 outlier가 되는 dynamic region만 residual downweight.

Contribution 04

Cross-regime Evidence

Wild-SLAM/Bonn/TUM dynamic, Sintel, TUM static, 7-Scenes, depth estimation, ablation으로 검증.

내가 얻은 인사이트

WildPose는 DROID-SLAM을 semantic segmentation 시스템으로 바꾸지 않으면서 static-scene 의존성을 줄이려는 시도로 읽으면 좋다. 핵심은 dynamic을 단순히 지우는 것이 아니라, optimizer가 더 강한 3D-aware pair evidence를 사용하게 만드는 데 있다.

처리 흐름

01RGB Sequencecalibrated monocular video

02MASt3R Encoder3D-aware pair features

03Update Operatorflow / confidence / damping

04Motion Maskedge-specific dynamic weight

05Masked BApose + disparity optimization

06Loop / Global BAlong-term consistency

접근 방식 비교

Semantic Dynamic SLAM

class prior 기반 제거

known class에는 유용하지만 unseen movable object와 segmentation failure에 취약.

MegaSaM / WildGS-SLAM

motion/rendering cue

semantic prior는 줄이지만 synthetic gap, low-capacity motion decoding, online mapping cost가 남음.

WildPose

feature-rich optimizer

MASt3R의 3D-aware feature를 differentiable BA loop에 넣고 edge별 motion mask를 학습.

논문 상세 정리

아래부터는 기존 논문 내용을 최대한 담은 상세 해석이다. 핵심 흐름에서 벗어나는 related work, 학습 세부 조건, baseline 조건 메모는 접어두었다.

Problem: static-world 가정은 in-the-wild pose를 흔든다

WildPose의 출발점은 visual SLAM/SfM이 대체로 static world assumption에 기대고 있다는 점이다. 실제 영상에는 움직이는 사람, 이동하는 물체, 낮은 ego-motion, 조명 변화가 섞이는데, 이런 요소는 pose graph의 residual을 키우고 BA가 camera motion과 object motion을 혼동하게 만든다.

Figure 1. WildPose.calibrated video에서 moving distractor를 감지하고 camera trajectory를 추정하는 목표와, 여러 SLAM benchmark에서 안정적인 성능을 보인다는 주장을 함께 제시한다.

Problem Flow

논문은 dynamic scene만 잘하는 특수 방법보다 dynamic과 static을 모두 다루는 unified pose estimator를 목표로 둔다.

01Static assumption

기존 SLAM/SfM은 정적 장면의 reprojection consistency에 의존.

02Dynamic residual

moving object의 true flow가 ego-motion flow와 달라 BA residual 증가.

03Existing gap

semantic prior, online mapping, synthetic motion decoding은 각각 일반화/속도/용량 한계 보유.

04WildPose proposal

3D-aware feature와 edge mask를 BA에 직접 연결.

Problem / Proposal

문제 제기와 실험 결과는 모두 strong frontend + robust optimization의 결합이 필요하다는 주장으로 이어진다.

문제 축	기존 접근의 병목	WildPose의 관점
Dynamic scenes	semantic prior나 mask 품질에 민감	edge-dependent learned motion mask
Static scenes	dynamic 전용 method는 static-only input에서 성능 저하 가능	static/dynamic mixed curriculum
Long trajectory	local VO만으로 drift 누적	loop closure + periodic/final global BA

Related Work 맥락 자세히 보기

흐름	강점	WildPose가 보는 한계
Feed-forward 3D reconstruction	DUSt3R/MASt3R/VGGT 계열의 robust 3D prior	long sequence와 정밀 pose estimation은 optimizer 보강 필요
Differentiable SLAM	DROID-SLAM의 learnable update + differentiable BA	static training과 scratch-trained CNN encoder로 dynamic input에 취약
Dynamic pose estimation	semantic/geometric/motion cue로 dynamic object 처리	novel object, segmentation failure, synthetic-domain gap, online cost가 남음

Mechanism: MASt3R feature와 motion mask를 BA에 어떻게 넣나

방법론의 핵심은 DROID-SLAM의 differentiable BA pipeline을 유지하되, frontend feature와 residual weighting을 바꾸는 것이다. WildPose는 MASt3R encoder feature로 update operator를 강화하고, edge별 motion mask $ \hat{M}_{i,j} $를 이용해 dynamic pixel의 reprojection residual을 낮춘다.

Figure 2. System Overview.MASt3R encoder feature가 update operator와 motion detector에 들어가고, motion detector의 mask가 BA residual weight로 사용되는 전체 구조를 보여준다.

Mechanism Thread Summary

Method는 DROID-style BA state, MASt3R frontend, edge motion detector, curriculum training, online inference로 읽으면 된다.

구간	무엇을 담당하나	핵심 장치
State	keyframe pose와 downsampled disparity 공동 최적화	$ \hat{\omega}_i $, $ \hat{d}_i $, frame graph $ \mathcal{G} $
Update operator	predicted flow, confidence, damping, upsampling mask 반복 갱신	ConvGRU + correlation + MASt3R feature
Motion detector	edge별 dynamic residual downweight	MASt3R multi-level correspondence token + DPT-style fusion
Inference	online VO, metric depth initialization, loop/global BA	Moge2 depth prior + masked BA

Design Choice

WildPose가 강조하는 선택은 dynamic을 semantic class로 지우지 않고, BA edge에서 residual weight로 다루는 것이다.

왜 MASt3R?

pointmap precision과 2D correspondence pretraining이 flow/BA update와 잘 맞음.

왜 edge mask?

object가 특정 frame pair에서만 움직이면 그 edge에서만 outlier가 됨.

왜 global BA?

local sliding window만으로 생기는 long-sequence scale/pose drift를 보정.

1. DROID-SLAM preliminaries

각 keyframe의 state는 camera pose $ \hat{\omega}_i\in SE(3) $와 downsampled disparity $ \hat{d}_i $다. Edge $ (i,j)\in\mathcal{G} $에서 geometry-induced optical flow는 projection/back-projection으로 계산된다.

$$\tilde{f}_{i,j}=\Pi_c\left(\hat{\omega}_j^{-1}\hat{\omega}_i\Pi_c^{-1}(p_i,\hat{d}_i)\right)$$

Auxiliary. Geometry-induced flow baseline.static scene이라면 ego-motion으로 induced flow가 설명되어야 한다.

$$\left(\hat{f}^{t+1}_{i,j},\hat{w}^{t+1}_{i,j},\hat{\eta}^{t+1},\hat{u}^{t+1}\right)=\mathcal{F}\left(I_i,I_j,\tilde{f}^{t}_{i,j},r^{t}_{i,j}\right)\tag{1}$$

Eq. (1). Recurrent flow update operator.update operator가 image pair와 현재 geometry-induced flow, residual을 보고 flow, confidence, damping, upsampling mask를 반복적으로 갱신한다.

$$E(\hat{\omega},\hat{d})=\sum_{(i,j)\in\mathcal{G}}\left\|\hat{f}_{ij}-\tilde{f}_{ij}\right\|_{\Sigma_{ij}}^2,\qquad \Sigma_{ij}^{-1}=\operatorname{diag}(\hat{w}_{i,j})\tag{2}$$

Eq. (2). Differentiable BA objective.DROID-style differentiable BA는 predicted flow와 geometry-induced flow의 weighted residual을 최소화한다.

Notation: update operator와 BA에서 무엇이 오가나

Eq. (1)-(2)는 WildPose의 출발점이 되는 DROID-SLAM notation이다. 핵심은 update operator가 residual을 줄이는 방향으로 flow/confidence를 갱신하고, BA가 그 값을 pose/disparity 최적화에 쓰는 구조다.

Notation	의미	읽는 포인트
$ \hat{\omega}_i $, $ \hat{d}_i $	keyframe $i$의 camera pose와 downsampled disparity	BA가 직접 최적화하는 state.
$ \mathcal{G} $, $ (i,j) $	관측 overlap이 있는 keyframe들을 잇는 frame graph와 edge	모든 수식은 frame pair edge 단위로 residual을 쌓음.
$ \Pi_c $, $ p_i $	pinhole camera projection과 reference frame $i$의 pixel location	2D pixel을 3D로 back-project한 뒤 다시 frame $j$로 project.
$ \tilde{f}^{t}_{i,j} $	현재 pose/disparity가 유도하는 geometry-induced optical flow	static scene이라면 predicted flow와 가까워야 하는 기준 flow.
$ \hat{f}^{t}_{i,j} $	update operator가 예측하는 optical flow, 원문 기준 $ \mathbb{R}^{H/8 \times W/8 \times 2} $	BA objective에서 geometry-induced flow와 비교되는 learned flow.
$ \hat{w}^{t}_{i,j} $	predicted flow confidence, 원문 기준 $ \mathbb{R}^{H/8 \times W/8 \times 2} $	$ \Sigma_{ij}^{-1}=\operatorname{diag}(\hat{w}_{i,j}) $로 BA weight가 됨.
$ r^{t}_{i,j} $	$ r^{t}_{i,j}=\hat{f}^{t}_{i,j}-\tilde{f}^{t}_{i,j} $	현재 learned flow와 geometry flow의 차이. update operator가 다음 갱신을 결정하는 핵심 residual.
$ \hat{\eta}^{t} $, $ \hat{u}^{t} $	BA 안정화를 위한 damping factor와 disparity upsampling mask	pose/disparity update가 과격해지지 않도록 보조하는 내부 변수.

2. Feed-forward transformer encoder

WildPose는 DROID의 scratch CNN encoder를 그대로 쓰지 않고, MASt3R의 pretrained ViT encoder feature를 adapter module로 변환해 ConvGRU가 사용할 local feature map과 global context로 넣는다. decoder token 전체를 쓰지 않는 이유는 ConvGRU와 직접 결합하면 학습/추론 비용이 커지기 때문이다.

3. Edge-specific motion mask

dynamic pixel의 true optical flow는 ego-motion뿐 아니라 object self-motion을 포함한다. 논문은 이를 $X_{i,j}(p_i)$로 표현하고, static BA objective가 이 residual을 그대로 맞추려 하면 pose와 disparity가 잘못 업데이트된다고 설명한다.

$$\tilde{f}^{\star}_{i,j}=\Pi_c\left(\hat{\omega}_j^{-1}\hat{\omega}_i\Pi_c^{-1}(p_i,\hat{d}_i)+X_{i,j}(p_i)\right)$$

Auxiliary. Dynamic-point flow model.object displacement가 들어오면 observed flow는 camera ego-motion만으로 설명되지 않는다.

Figure 3. Visualization of Motion Masks.per-edge mask는 frame-level mask보다 transient motion을 세밀하게 다룬다. 같은 object도 특정 edge에서 움직일 때만 BA outlier로 처리된다.

Training curriculum / loss notation 보기

4. Training curriculum

Training은 update operator와 motion detector를 한 번에 섞지 않고 단계적으로 학습한다. 먼저 static data로 ego-motion 기반 pairwise flow를 배우고, 이후 dynamic data와 mask를 넣어 BA robustness를 높이며, 마지막에는 frozen update operator와 BA에 predicted mask를 연결한다.

$$\mathcal{L}_1=w_{\mathrm{cam}}\mathcal{L}_{\mathrm{cam}}+w_{\mathrm{flow}}\mathcal{L}_{\mathrm{flow}}+w_{\mathrm{res}}\mathcal{L}_{\mathrm{res}}\tag{3}$$

Eq. (3). Update-operator training loss.첫 단계는 camera pose, flow, residual supervision으로 update operator를 학습한다.

$$\mathcal{L}_2=w_{\mathrm{cam}}\mathcal{L}_{\mathrm{cam}}+w_{\mathrm{mask}}\mathcal{L}_{\mathrm{BCE}}\tag{4}$$

Eq. (4). Motion-detector training loss.motion detector 학습 단계에서는 pose와 mask quality에 집중한다.

Training loss term 세부 보기

DROID-style loss와 motion mask supervision

Eq. (3)은 DROID-SLAM식 update-operator 학습 손실을 따르고, Eq. (4)는 motion detector를 붙이면서 mask supervision을 추가한다.

항	담당하는 신호	WildPose에서의 의미
$ \mathcal{L}_{\mathrm{cam}} $	camera pose supervision	update operator가 pose를 안정적으로 보정하도록 하는 기본 pose loss.
$ \mathcal{L}_{\mathrm{flow}} $	geometry-induced flow와 ground-truth flow 비교	static ego-motion으로 설명되는 flow를 먼저 학습해 BA update의 기준을 만든다.
$ \mathcal{L}_{\mathrm{res}} $	geometry-induced flow $ \tilde{f} $와 operator-predicted flow $ \hat{f} $의 residual supervise	ConvGRU update가 반복 단계에서 residual을 줄이도록 학습한다.
$ \mathcal{L}_{\mathrm{BCE}} $	motion mask binary supervision	dynamic residual을 BA에서 낮은 weight로 처리하기 위한 edge별 mask 학습 항.

5. Online inference and global consistency

Inference는 online visual odometry처럼 동작한다. 새 keyframe의 disparity는 Moge2 metric depth $D_i$로 초기화하고, co-visible keyframe과 edge를 만든 뒤 edge별 mask를 예측한다.

$$E(\hat{\omega},\hat{d})=\sum_{(i,j)\in\mathcal{G}}\left\|\hat{f}_{ij}-\tilde{f}_{ij}\right\|_{\bar{\Sigma}_{ij}}^2+\lambda\sum_{i\in\mathcal{G}}\left\|\hat{d}_i-1/D_i\right\|^2\tag{5}$$

Eq. (5). Mask-weighted online BA objective.reprojection residual은 predicted motion mask로 다시 가중되고, disparity는 metric depth prior와 일관되도록 regularize된다.

$$\bar{\Sigma}_{ij}^{-1}=\operatorname{diag}(\hat{w}_{i,j}\hat{M}_{i,j})$$

Auxiliary. Motion-mask BA weight.motion mask가 confidence와 함께 BA weight matrix에 들어간다. final global BA에서는 depth regularizer 항을 제거한다.

Notation: motion mask와 depth prior

WildPose가 DROID-SLAM에서 바꾸는 부분은 Eq. (5)의 weight와 depth prior다. dynamic residual은 mask로 낮추고, monocular scale은 metric depth로 붙잡는다.

Notation	의미	역할
$ X_{i,j}(p_i) $	frame $i$에서 $j$로 갈 때 moving point가 가진 3D displacement	dynamic pixel의 true flow가 ego-motion flow와 달라지는 원인.
$ \hat{M}_{i,j} $	edge $ (i,j) $별 predicted motion mask, 원문 기준 $ \mathbb{R}^{H/8 \times W/8} $	특정 frame pair에서 outlier가 되는 dynamic region만 downweight.
$ \bar{\Sigma}_{ij} $	motion mask가 들어간 BA covariance/weight	$ \hat{w}_{i,j}\hat{M}_{i,j} $로 confidence와 motion filtering을 함께 반영.
$ D_i $, $ \lambda $	Moge2 metric depth prior와 regularization weight	online BA에서 disparity가 metric depth와 크게 어긋나지 않도록 보조.

Training / inference 세부 조건 보기

학습 데이터와 절차

재현이나 결과 해석에 필요한 조건을 Data, Training, Inference로 나누어 정리한다.

Data

항목	내용
Training data	TartanAir V2, TartanGround, Dynamic Replica, OmniWorld-Game, Kubric-generated static/dynamic sequence
Kubric motion	linear translation, pure rotation, target-locked motion 보강

Training

항목	내용
Update operator	static pretraining 200K iteration, static+dynamic finetuning 100K iteration
Motion detector	frozen update operator + differentiable BA와 결합해 15K iteration 학습

Inference

항목	내용
Loop closure	temporal interval과 co-visibility 조건을 만족하는 past keyframe과 graph 구성 후 BA 수행

Evidence: 어떤 task에서 검증했나

평가는 동적 환경 pose estimation, static/low-motion robustness, downstream depth, ablation의 네 가지 claim으로 나뉜다.

Evaluation Setup

평가는 monocular SLAM pipeline의 표준처럼 Sim(3) Umeyama alignment 후 ATE RMSE를 계산한다.

평가 축	Dataset / metric	의미
Dynamic pose	Wild-SLAM MoCap, Bonn RGB-D Dynamic, TUM dynamic / ATE RMSE	moving distractor에서 pose 안정성 검증
Low-motion / static	Sintel, TUM static, 7-Scenes / ATE RMSE	dynamic 대응이 static 성능을 해치지 않는지 확인
Depth downstream	Bonn Dynamic / Abs.Rel., Log-RMSE, $ \delta_{1.25} $	pose와 mask가 long-video depth optimization에 주는 영향
Ablation	Mix. Ft., Motion Mask, GBA Depth Off	각 design choice의 기여 분리

Dynamic Pose Brief

주요 결과: Wild-SLAM MoCap에서는 평균 0.39cm로 가장 낮은 ATE를 보이고, TUM dynamic에서도 best average를 달성한다. Bonn에서는 WildGS-SLAM이 illumination handling에서 약간 앞서지만 WildPose는 전체적으로 competitive하다.

Table 1. Tracking Performance on Wild-SLAM MoCap Dataset. — Table 1. Tracking Performance on Wild-SLAM MoCap Dataset [60] (ATE RMSE ↓ [cm]).Table1/Table2 sequence에서 ViPE의 error가 커지는 이유를 논문은 semantic prior가 table motion을 static으로 취급하는 한계로 설명한다.

Table 2. Tracking Performance on Bonn RGB-D Dynamic Dataset. — Table 2. Tracking Performance on Bonn RGB-D Dynamic Dataset [34] (ATE RMSE ↓ [cm]).Bonn은 non-trivial illumination change가 포함되어 WildGS-SLAM이 강점을 보이는 조건으로 언급된다.

Table 3. Tracking Performance on TUM RGB-D dynamic Dataset. — Table 3. Tracking Performance on TUM RGB-D (dynamic) Dataset [42] (ATE RMSE ↓ [cm]).

Static / Low-motion Brief

주요 결과: Sintel처럼 짧고 low-motion인 sequence에서는 WildGS-SLAM이 충분한 map coverage를 만들기 어렵지만 WildPose는 가장 좋은 성능을 보인다. TUM static과 7-Scenes에서는 dynamic-aware method가 static에서 무너지는 문제를 피한다.

Table 4. Tracking Performance on Low-motion and Static benchmarks.Sintel은 MegaSaM을 따라 scale-normalized trajectory를 사용하며, dagger는 all-frame evaluation, asterisk는 all-frame BA를 의미한다.

Figure 4. Visualization of Camera Trajectories.estimated trajectory는 translation error(ATE)로 colorized되어 ground truth와의 alignment 차이를 보여준다.

Depth / Ablation Brief

주요 결과: MegaSaM의 long-video depth optimization pipeline에 WildPose pose와 mask를 넣으면 Abs.Rel., Log-RMSE, $ \delta_{1.25} $ 모두 개선된다. Ablation은 mixed finetuning, explicit motion mask, final global BA에서 depth regularizer 제거가 전체 시스템 성능에 기여함을 보여준다.

Table 5. Comparison on Long-video Depth Estimation on Bonn RGB-D Dynamic Dataset. — Table 5. Comparison on Long-video Depth Estimation on Bonn RGB-D Dynamic [34] Dataset.논문은 정확한 pose와 motion mask가 depth estimation에도 직접적인 이득을 준다고 해석한다.

Table 6. WildPose Ablation Study. — Table 6. WildPose Ablation Study (ATE RMSE ↓ [cm]).Mix. Ft., Mot. Mask, GBA Dep. Off가 full system 성능에 각각 기여한다.

Usage / Limits: 언제 유용하고 어디서 약한가

When to Use / Avoid

WildPose는 dynamic handling과 static robustness를 동시에 요구하는 monocular pose estimation에 적합하다.

잘 맞는 상황	주의할 상황
calibrated monocular video moving distractor가 pose graph residual을 흔드는 장면 static scene 성능도 함께 유지해야 하는 benchmark	lighting variation이 심한 Bonn류 장면에서는 WildGS-SLAM이 유리한 경우 존재 MASt3R feature와 detector/BA inference cost 고려 필요 final global BA의 depth prior는 refined estimate에서는 제거해야 할 수 있음

느낀점

(진행중...)

Problem: static-world assumptions destabilize in-the-wild pose estimation

WildPose starts from a common weakness in visual SLAM/SfM: many systems assume a static world. Real videos contain moving people, movable objects, low ego-motion, and illumination changes, which increase pose-graph residuals and make BA confuse camera motion with object motion.

Problem Flow

The paper aims for a unified pose estimator rather than a dynamic-scene-only special case.

01Static assumption

Classical consistency assumes a static scene.

02Dynamic residual

Moving objects create flow not explained by ego-motion.

03Existing gap

Semantic priors, online mapping, and low-capacity motion decoding each leave limitations.

04WildPose proposal

Connect 3D-aware features and edge masks directly to BA.

Problem / Proposal

The problem statement and experiments point to the same answer: strong frontend features plus robust optimization.

Axis	Existing bottleneck	WildPose view
Dynamic scenes	Sensitive to semantic priors or mask quality	Edge-dependent learned motion mask
Static scenes	Dynamic-only methods can degrade on static-only input	Static/dynamic mixed curriculum
Long trajectory	Local VO accumulates drift	Loop closure + periodic/final global BA

Related Work context

How to read prior work

WildPose is positioned between feed-forward 3D reconstruction, differentiable SLAM, and dynamic-scene pose estimation.

Line	Strength	Limitation identified by WildPose
Feed-forward 3D reconstruction	Robust 3D priors from DUSt3R/MASt3R/VGGT-style models	Long sequences and precise pose still benefit from optimization
Differentiable SLAM	DROID-SLAM's learnable update and differentiable BA	Static training and a scratch-trained CNN encoder are weak under dynamic input
Dynamic pose estimation	Handles dynamics through semantic, geometric, or motion cues	Unseen objects, segmentation failure, synthetic-domain gap, and online cost remain

Mechanism: how are MASt3R features and motion masks injected into BA?

The method keeps DROID-SLAM's differentiable BA pipeline, but changes the frontend and residual weighting. WildPose strengthens the update operator with MASt3R encoder features and uses an edge-specific motion mask $ \hat{M}_{i,j} $ to downweight dynamic-pixel reprojection residuals.

Mechanism Thread Summary

The method is easiest to read as DROID-style BA state, MASt3R frontend, edge motion detector, curriculum training, and online inference.

Part	Role	Key device
State	Jointly optimizes keyframe pose and downsampled disparity	$ \hat{\omega}_i $, $ \hat{d}_i $, frame graph $ \mathcal{G} $
Update operator	Iteratively updates predicted flow, confidence, damping, and upsampling mask	ConvGRU + correlation + MASt3R feature
Motion detector	Downweights dynamic residuals per graph edge	MASt3R multi-level correspondence token + DPT-style fusion
Inference	Online VO, metric depth initialization, loop/global BA	Moge2 depth prior + masked BA

Design Choice

The key choice is to treat dynamics as edge-level residual weights, not only semantic classes.

Why MASt3R?

Pointmap precision and 2D-correspondence pretraining align well with flow/BA updates.

Why edge masks?

An object is an outlier only for graph edges where it actually moves.

Why global BA?

It mitigates long-sequence scale and pose drift beyond the local sliding window.

1. DROID-SLAM preliminaries

Each keyframe state contains the camera pose $ \hat{\omega}_i\in SE(3) $ and downsampled disparity $ \hat{d}_i $. For an edge $ (i,j)\in\mathcal{G} $, geometry-induced flow is computed by projection and back-projection.

$$\tilde{f}_{i,j}=\Pi_c\left(\hat{\omega}_j^{-1}\hat{\omega}_i\Pi_c^{-1}(p_i,\hat{d}_i)\right)$$

Auxiliary. Geometry-induced flow baseline.In a static scene, ego-motion should explain the induced flow.

$$\left(\hat{f}^{t+1}_{i,j},\hat{w}^{t+1}_{i,j},\hat{\eta}^{t+1},\hat{u}^{t+1}\right)=\mathcal{F}\left(I_i,I_j,\tilde{f}^{t}_{i,j},r^{t}_{i,j}\right)\tag{1}$$

Eq. (1). Recurrent flow update operator.The update operator uses the image pair, current geometry-induced flow, and residual to iteratively refine flow, confidence, damping, and upsampling mask.

$$E(\hat{\omega},\hat{d})=\sum_{(i,j)\in\mathcal{G}}\left\|\hat{f}_{ij}-\tilde{f}_{ij}\right\|_{\Sigma_{ij}}^2,\qquad \Sigma_{ij}^{-1}=\operatorname{diag}(\hat{w}_{i,j})\tag{2}$$

Eq. (2). Differentiable BA objective.DROID-style differentiable BA minimizes weighted residuals between predicted and geometry-induced flow.

Notation: update operator and BA signals

Eq. (1)-(2) use the DROID-SLAM notation that WildPose builds on. The update operator reduces residuals by refining flow/confidence, and BA uses those values to optimize pose and disparity.

Notation	Meaning	How to read it
$ \hat{\omega}_i $, $ \hat{d}_i $	camera pose and downsampled disparity of keyframe $i$	The state optimized by BA.
$ \mathcal{G} $, $ (i,j) $	frame graph and an edge between keyframes with observation overlap	Residuals are accumulated per graph edge.
$ \Pi_c $, $ p_i $	pinhole camera projection and a pixel in reference frame $i$	Back-project the pixel to 3D, then project it into frame $j$.
$ \tilde{f}^{t}_{i,j} $	geometry-induced optical flow from the current pose/disparity at iteration $t$	The flow that should explain a static scene.
$ \hat{f}^{t}_{i,j} $	predicted optical flow, $ \mathbb{R}^{H/8 \times W/8 \times 2} $ in the paper	The learned flow compared against geometry-induced flow.
$ \hat{w}^{t}_{i,j} $	confidence of predicted flow, $ \mathbb{R}^{H/8 \times W/8 \times 2} $ in the paper	Becomes the BA weight through $ \Sigma_{ij}^{-1}=\operatorname{diag}(\hat{w}_{i,j}) $.
$ r^{t}_{i,j} $	$ r^{t}_{i,j}=\hat{f}^{t}_{i,j}-\tilde{f}^{t}_{i,j} $	The residual between learned flow and geometry flow; this is the key input for the next update.
$ \hat{\eta}^{t} $, $ \hat{u}^{t} $	BA damping factor and disparity upsampling mask	Auxiliary variables that stabilize BA and upsample disparity.

2. Feed-forward transformer encoder

WildPose replaces DROID's scratch-trained CNN encoder with pretrained MASt3R ViT encoder features. A lightweight adapter converts these features into local feature maps and global context for the ConvGRU. The paper avoids feeding full decoder tokens into the ConvGRU because that would make training and inference expensive.

3. Edge-specific motion mask

The true flow of a dynamic pixel includes object self-motion in addition to camera ego-motion. The paper writes this as $X_{i,j}(p_i)$; if static BA tries to fit this residual directly, pose and disparity are updated incorrectly.

$$\tilde{f}^{\star}_{i,j}=\Pi_c\left(\hat{\omega}_j^{-1}\hat{\omega}_i\Pi_c^{-1}(p_i,\hat{d}_i)+X_{i,j}(p_i)\right)$$

Auxiliary. Dynamic-point flow model.With object displacement, observed flow cannot be explained by camera ego-motion alone.

Training curriculum / loss notation

4. Training curriculum

Training is staged rather than fully joint from the start. The update operator first learns ego-motion-induced pairwise flow on static data, then learns robustness with dynamic data and ground-truth masks, and finally the motion detector is trained with the frozen update operator and differentiable BA.

$$\mathcal{L}_1=w_{\mathrm{cam}}\mathcal{L}_{\mathrm{cam}}+w_{\mathrm{flow}}\mathcal{L}_{\mathrm{flow}}+w_{\mathrm{res}}\mathcal{L}_{\mathrm{res}}\tag{3}$$

Eq. (3). Update-operator training loss.The first stage trains the update operator with camera, flow, and residual supervision.

$$\mathcal{L}_2=w_{\mathrm{cam}}\mathcal{L}_{\mathrm{cam}}+w_{\mathrm{mask}}\mathcal{L}_{\mathrm{BCE}}\tag{4}$$

Eq. (4). Motion-detector training loss.The detector stage focuses on pose and mask quality.

Training loss terms

DROID-style loss and motion-mask supervision

Eq. (3) follows the DROID-SLAM-style update-operator training loss, while Eq. (4) adds motion-mask supervision for the detector.

Term	Signal	Role in WildPose
$ \mathcal{L}_{\mathrm{cam}} $	Camera-pose supervision	Trains the update operator to refine pose reliably.
$ \mathcal{L}_{\mathrm{flow}} $	Geometry-induced flow versus ground-truth flow	Builds the static ego-motion flow baseline for BA updates.
$ \mathcal{L}_{\mathrm{res}} $	Residual between geometry-induced flow $ \tilde{f} $ and operator-predicted flow $ \hat{f} $	Encourages the recurrent update to reduce residuals across iterations.
$ \mathcal{L}_{\mathrm{BCE}} $	Binary motion-mask supervision	Trains edge-specific masks that downweight dynamic residuals in BA.

5. Online inference and global consistency

At inference time, WildPose runs as online visual odometry. A new keyframe is initialized with Moge2 metric depth $D_i$, connected to co-visible keyframes, and assigned edge-specific motion masks.

$$E(\hat{\omega},\hat{d})=\sum_{(i,j)\in\mathcal{G}}\left\|\hat{f}_{ij}-\tilde{f}_{ij}\right\|_{\bar{\Sigma}_{ij}}^2+\lambda\sum_{i\in\mathcal{G}}\left\|\hat{d}_i-1/D_i\right\|^2\tag{5}$$

Eq. (5). Mask-weighted online BA objective.The reprojection residual is weighted by the predicted motion mask, and disparity is regularized by a metric depth prior.

$$\bar{\Sigma}_{ij}^{-1}=\operatorname{diag}(\hat{w}_{i,j}\hat{M}_{i,j})$$

Auxiliary. Motion-mask BA weight.The motion mask enters the BA weight matrix with confidence. The final global BA removes the depth-regularizer term.

Notation: motion mask and depth prior

WildPose modifies DROID-SLAM mainly through the BA weight and the metric-depth prior. Dynamic residuals are downweighted by masks, while monocular scale is anchored by depth.

Notation	Meaning	Role
$ X_{i,j}(p_i) $	3D displacement of a moving point from frame $i$ to $j$	Explains why dynamic-pixel flow cannot be modeled by ego-motion only.
$ \hat{M}_{i,j} $	edge-specific predicted motion mask, $ \mathbb{R}^{H/8 \times W/8} $ in the paper	Downweights only the regions that are outliers for a given frame pair.
$ \bar{\Sigma}_{ij} $	BA covariance/weight after applying the motion mask	Combines confidence and motion filtering through $ \hat{w}_{i,j}\hat{M}_{i,j} $.
$ D_i $, $ \lambda $	Moge2 metric-depth prior and regularization weight	Keeps online BA disparity close to metric depth.

Training / inference details

Training data and schedule

The setup is separated into data, training, and inference conditions.

Data

Item	Content
Training data	TartanAir V2, TartanGround, Dynamic Replica, OmniWorld-Game, Kubric-generated static/dynamic sequences
Kubric motion	linear translation, pure rotation, target-locked motion

Training

Item	Content
Update operator	static pretraining for 200K iterations, static+dynamic finetuning for 100K iterations
Motion detector	trained for 15K iterations with frozen update operator and differentiable BA

Inference

Item	Content
Loop closure	adds BA edges to past co-visible keyframes when temporal and co-visibility thresholds are met

Evidence: which tasks validate the method?

The evaluation is best read through four claims: dynamic-scene pose, static/low-motion robustness, downstream depth, and ablation.

Evaluation Setup

The paper follows the common monocular SLAM protocol: Sim(3) Umeyama alignment followed by ATE RMSE.

Axis	Dataset / metric	Meaning
Dynamic pose	Wild-SLAM MoCap, Bonn RGB-D Dynamic, TUM dynamic / ATE RMSE	Pose robustness under moving distractors
Low-motion / static	Sintel, TUM static, 7-Scenes / ATE RMSE	Checks whether dynamic handling preserves static-scene performance
Depth downstream	Bonn Dynamic / Abs.Rel., Log-RMSE, $ \delta_{1.25} $	Effect of pose and masks on long-video depth optimization
Ablation	Mix. Ft., Motion Mask, GBA Depth Off	Separates contributions of the design choices

Dynamic Pose Brief

Main result: WildPose reaches the lowest average ATE on Wild-SLAM MoCap and the best average on TUM dynamic. On Bonn, WildGS-SLAM is slightly ahead under illumination-heavy conditions, while WildPose remains competitive overall.

Static / Low-motion Brief

Main result: On short low-motion Sintel sequences, WildGS-SLAM struggles to build enough map coverage, while WildPose performs best. On TUM static and 7-Scenes, WildPose avoids the static-scene degradation seen in dynamic-specialized methods.

Depth / Ablation Brief

Main result: Using WildPose poses and masks inside MegaSaM's long-video depth optimization improves Abs.Rel., Log-RMSE, and $ \delta_{1.25} $. The ablation shows that mixed finetuning, explicit motion masks, and removing depth regularization in final global BA all contribute to the full system.

Usage / Limits: when is it useful?

When to Use / Avoid

WildPose is well suited to monocular pose estimation where dynamic handling and static robustness are both required.

Good fit	Use with care
calibrated monocular video moving distractors corrupt pose-graph residuals static-scene performance must be preserved	illumination-heavy Bonn-like scenes can favor WildGS-SLAM MASt3R features and detector/BA inference cost matter final global BA may need depth regularization disabled

Takeaway

(In progress...)

Notation	의미	읽는 포인트
\( \hat{\omega}_i \), \( \hat{d}_i \)	keyframe \(i\)의 camera pose와 downsampled disparity	BA가 직접 최적화하는 state.
\( \mathcal{G} \), \( (i,j) \)	관측 overlap이 있는 keyframe들을 잇는 frame graph와 edge	모든 수식은 frame pair edge 단위로 residual을 쌓음.
\( \Pi_c \), \( p_i \)	pinhole camera projection과 reference frame \(i\)의 pixel location	2D pixel을 3D로 back-project한 뒤 다시 frame \(j\)로 project.
\( \tilde{f}^{t}_{i,j} \)	현재 pose/disparity가 유도하는 geometry-induced optical flow	static scene이라면 predicted flow와 가까워야 하는 기준 flow.
\( \hat{f}^{t}_{i,j} \)	update operator가 예측하는 optical flow, 원문 기준 \( \mathbb{R}^{H/8 \times W/8 \times 2} \)	BA objective에서 geometry-induced flow와 비교되는 learned flow.
\( \hat{w}^{t}_{i,j} \)	predicted flow confidence, 원문 기준 \( \mathbb{R}^{H/8 \times W/8 \times 2} \)	\( \Sigma_{ij}^{-1}=\operatorname{diag}(\hat{w}_{i,j}) \)로 BA weight가 됨.
\( r^{t}_{i,j} \)	\( r^{t}_{i,j}=\hat{f}^{t}_{i,j}-\tilde{f}^{t}_{i,j} \)	현재 learned flow와 geometry flow의 차이. update operator가 다음 갱신을 결정하는 핵심 residual.
\( \hat{\eta}^{t} \), \( \hat{u}^{t} \)	BA 안정화를 위한 damping factor와 disparity upsampling mask	pose/disparity update가 과격해지지 않도록 보조하는 내부 변수.

Notation	Meaning	How to read it
\( \hat{\omega}_i \), \( \hat{d}_i \)	camera pose and downsampled disparity of keyframe \(i\)	The state optimized by BA.
\( \mathcal{G} \), \( (i,j) \)	frame graph and an edge between keyframes with observation overlap	Residuals are accumulated per graph edge.
\( \Pi_c \), \( p_i \)	pinhole camera projection and a pixel in reference frame \(i\)	Back-project the pixel to 3D, then project it into frame \(j\).
\( \tilde{f}^{t}_{i,j} \)	geometry-induced optical flow from the current pose/disparity at iteration \(t\)	The flow that should explain a static scene.
\( \hat{f}^{t}_{i,j} \)	predicted optical flow, \( \mathbb{R}^{H/8 \times W/8 \times 2} \) in the paper	The learned flow compared against geometry-induced flow.
\( \hat{w}^{t}_{i,j} \)	confidence of predicted flow, \( \mathbb{R}^{H/8 \times W/8 \times 2} \) in the paper	Becomes the BA weight through \( \Sigma_{ij}^{-1}=\operatorname{diag}(\hat{w}_{i,j}) \).
\( r^{t}_{i,j} \)	\( r^{t}_{i,j}=\hat{f}^{t}_{i,j}-\tilde{f}^{t}_{i,j} \)	The residual between learned flow and geometry flow; this is the key input for the next update.
\( \hat{\eta}^{t} \), \( \hat{u}^{t} \)	BA damping factor and disparity upsampling mask	Auxiliary variables that stabilize BA and upsample disparity.

구간	무엇을 담당하나	핵심 장치
State	keyframe pose와 downsampled disparity 공동 최적화	\( \hat{\omega}_i \), \( \hat{d}_i \), frame graph \( \mathcal{G} \)
Update operator	predicted flow, confidence, damping, upsampling mask 반복 갱신	ConvGRU + correlation + MASt3R feature
Motion detector	edge별 dynamic residual downweight	MASt3R multi-level correspondence token + DPT-style fusion
Inference	online VO, metric depth initialization, loop/global BA	Moge2 depth prior + masked BA

항	담당하는 신호	WildPose에서의 의미
\( \mathcal{L}_{\mathrm{cam}} \)	camera pose supervision	update operator가 pose를 안정적으로 보정하도록 하는 기본 pose loss.
\( \mathcal{L}_{\mathrm{flow}} \)	geometry-induced flow와 ground-truth flow 비교	static ego-motion으로 설명되는 flow를 먼저 학습해 BA update의 기준을 만든다.
\( \mathcal{L}_{\mathrm{res}} \)	geometry-induced flow \( \tilde{f} \)와 operator-predicted flow \( \hat{f} \)의 residual supervise	ConvGRU update가 반복 단계에서 residual을 줄이도록 학습한다.
\( \mathcal{L}_{\mathrm{BCE}} \)	motion mask binary supervision	dynamic residual을 BA에서 낮은 weight로 처리하기 위한 edge별 mask 학습 항.

Notation	의미	역할
\( X_{i,j}(p_i) \)	frame \(i\)에서 \(j\)로 갈 때 moving point가 가진 3D displacement	dynamic pixel의 true flow가 ego-motion flow와 달라지는 원인.
\( \hat{M}_{i,j} \)	edge \( (i,j) \)별 predicted motion mask, 원문 기준 \( \mathbb{R}^{H/8 \times W/8} \)	특정 frame pair에서 outlier가 되는 dynamic region만 downweight.
\( \bar{\Sigma}_{ij} \)	motion mask가 들어간 BA covariance/weight	\( \hat{w}_{i,j}\hat{M}_{i,j} \)로 confidence와 motion filtering을 함께 반영.
\( D_i \), \( \lambda \)	Moge2 metric depth prior와 regularization weight	online BA에서 disparity가 metric depth와 크게 어긋나지 않도록 보조.

핵심 요약

Unified Pose Estimator

3D-aware BA Operator

Edge Motion Mask

Cross-regime Evidence

class prior 기반 제거

motion/rendering cue

feature-rich optimizer

논문 상세 정리

Problem: static-world 가정은 in-the-wild pose를 흔든다

Mechanism: MASt3R feature와 motion mask를 BA에 어떻게 넣나

Evidence: 어떤 task에서 검증했나

Usage / Limits: 언제 유용하고 어디서 약한가

느낀점

Problem: static-world assumptions destabilize in-the-wild pose estimation

Mechanism: how are MASt3R features and motion masks injected into BA?

Evidence: which tasks validate the method?

Usage / Limits: when is it useful?

Takeaway

Comments