[Paper Review] 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks

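Key Summary

MinkowskiCNN computes outputs only at predefined sparse output coordinates in 3D/4D perception data and gathers only the input neighbors that actually exist, so CNN architectures can be applied to high-dimensional sparse signals without the waste of a dense voxel grid.

Problem: wasted computation on empty voxels | Solution: generalized sparse conv | Evidence: 3D / 4D segmentation

One-Sentence Summary

MinkowskiCNN defines sparse tensors and kernel maps so that convolution runs only where coordinates exist, enabling 3D/4D semantic segmentation without dense-voxel waste.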

Contribution 01

Generalized Sparse Convolution

Unifies dense convolution, sparse submanifold convolution, stride, dilation, and arbitrary kernel shapes in a single formula.

Contribution 02

Minkowski Engine

Provides coordinate quantization, a coordinate manager, kernel maps, pooling, and transposed convolution for high-dimensional sparse tensors.

Contribution 03

4D Spatio-temporal ConvNets

Treats 3D video as a sparse 4D signal and handles temporal context directly with convolution instead of frame-wise aggregation.

Contribution 04

Hybrid Kernel / TS-CRF

Proposes a non-hypercubic kernel that reduces 4D cost and a 7D trilateral stationary CRF for spatio-temporal consistency.

My Insight

The key point is not simply that the paper “did 4D convolution.” What matters is that the system makes CNN layers operate only on the set of coordinates that exist, which lets existing CNN architecture ideas be reused in sparse 3D/4D domains.

Processing Flow

01 3D/4D Points: space + time

02 Quantization: continuous to lattice

03 Sparse Tensor: coordinates + features

04 Kernel Map: input-output pairs

05 MinkowskiNet: 3D / 4D sparse CNN

06 Segmentation: semantic labels

Comparison of Approaches

Dense 3D CNN

dense grid first

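Straightforward to implement, but it also computes over empty 3D space, wasting memory and compute.

Point-based Networks

point set first

Avoids dense voxels, but CNN-style local weight sharing and hierarchy are hard to carry over directly.

MinkowskiCNN

sparse coordinate first

Keeps only observed coordinates and performs convolution through per-offset kernel maps.

Detailed Paper Notes

What follows is a detailed reading that keeps as much of the paper's content as possible. Background, notation, and supplementary material that stray from the main thread are collapsed.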

Problem: how can 3D video be processed without dense grids?

The paper starts from the fact that most 3D/4D perception data is sparse. LiDAR scans, depth-camera sequences, and RGB-D reconstructions have observations at only a fraction of the positions in the full voxel grid, so applying dense 3D/4D convolution as-is computes over empty space as well.

Figure 1. An example of 3D video: 3D scenes at different time steps. The paper defines 3D video as a sequence of 3D scans that changes over time. The input is therefore not a plain point cloud but a sparse signal with both spatial coordinates and a time axis.

Problem Flow

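The problem statement in the Introduction boils down to: “we want to use 4D, but dense methods are too expensive and existing 3D methods cannot handle temporal structure directly.”

01 3D/4D data is sparse

Points exist only around observed surfaces; most voxels are empty.

02 Dense convolution is expensive

As dimension increases, the number of kernel positions and the volume size grow rapidly.

03 Frame-wise processing loses time

Processing 3D video frame by frame makes it hard to learn temporal consistency directly inside the network.

04 Sparse tensor as common format

Storing coordinates together with features lets existing CNN layers extend to the sparse domain.

Figure 2. 2D projections of hypercubes in various dimensions. As dimension grows, a hypercube kernel has more candidate positions (a size-3 kernel has $3^D$ offsets: 27 in 3D, 81 in 4D). The figure gives intuition for why using only full hypercubic kernels becomes a burden in 4D.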

Related Work context

Where prior approaches stand

Related Work is background that explains why the paper chose sparse tensors and a convolutional representation.

Family	Strength	Limitation as the paper sees it
Dense voxel CNN	Applies CNNs directly on a 3D grid	High memory/computation cost because most 3D space is empty
Point-based network	Processes point sets directly, reducing voxel waste	Hard to reuse CNN local weight sharing and hierarchical architecture priors as-is
Sparse convolution	Processes only observed coordinates	The paper generalizes it to arbitrary dimension, arbitrary kernels, and reusable coordinate maps
Early 4D perception	Handles spatio-temporal data	Few examples of deep 4D networks built on a homogeneous convolutional representation

Mechanism: what does generalized sparse convolution change?

The core of the method is a sparse tensor representation plus a convolution definition that separates the input and output coordinate sets. This allows dense convolution, sparse submanifold convolution, strided convolution, transposed convolution, and pooling to be treated as the same coordinate-map-based operation.

1. Sparse tensor representation

The paper stores a coordinate matrix and a feature matrix together. In real 4D video the coordinates include $(x,y,z,t)$, and a batch index is appended to the coordinates for batch processing.

$$ \mathbf{C}=\begin{bmatrix} x_1 & y_1 & z_1 & t_1 & b_1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ x_N & y_N & z_N & t_N & b_N \end{bmatrix},\quad \mathbf{F}=\begin{bmatrix} \mathbf{f}_1^{\top} \\ \vdots \\ \mathbf{f}_N^{\top} \end{bmatrix} \tag{1} $$

Eq. (1). Sparse tensor storage. The coordinate matrix $\mathbf{C}$ and feature matrix $\mathbf{F}$ are stored together so that only non-empty sparse-lattice positions are represented.
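
As a rough illustration of how the pair $(\mathbf{C}, \mathbf{F})$ in Eq. (1) might be built from raw points, here is a minimal pure-Python sketch. The function name `quantize`, averaging the features of points that collide in one voxel, and leaving out the batch index are all assumptions made for this example; it is not the engine's GPU quantization routine.

```python
import math
from collections import defaultdict

def quantize(points, features, voxel_size):
    """Build (C, F) from continuous (x, y, z) points with an integer frame
    index t. Averaging colliding features is an assumption of this sketch."""
    buckets = defaultdict(list)
    for (x, y, z, t), f in zip(points, features):
        # Floor the spatial axes onto the lattice; keep t as a frame index.
        key = tuple(math.floor(c / voxel_size) for c in (x, y, z)) + (int(t),)
        buckets[key].append(f)
    coords = sorted(buckets)  # unique occupied lattice coordinates
    feats = [[sum(col) / len(buckets[c]) for col in zip(*buckets[c])]
             for c in coords]
    return coords, feats

points = [(0.12, 0.40, 0.05, 0), (0.18, 0.44, 0.02, 0), (1.30, 0.10, 0.90, 1)]
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
coords, feats = quantize(points, features, voxel_size=0.5)
```

The first two points fall into the same voxel, so three input points become two rows of $\mathbf{C}$ and $\mathbf{F}$.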

2. Generalized sparse convolution

The paper first states dense convolution as Eq. (2), then generalizes it in Eq. (3) so that, for each predefined output coordinate $\mathbf{u}$, only the input neighbors that actually exist are summed. The key point is that $\mathcal{C}^{\mathrm{in}}$ and $\mathcal{C}^{\mathrm{out}}$ need not be the same.

$$ \mathbf{x}^{\mathrm{out}}_{\mathbf{u}} = \sum_{\mathbf{i}\in V^D(K)} \mathbf{W}_{\mathbf{i}}\mathbf{x}^{\mathrm{in}}_{\mathbf{u}+\mathbf{i}} \quad \text{for } \mathbf{u}\in\mathbb{Z}^D \tag{2} $$

Eq. (2). Dense convolution reference. The conventional convolution on a $D$-dimensional dense grid.

$$ \mathbf{x}^{\mathrm{out}}_{\mathbf{u}} = \sum_{\mathbf{i}\in\mathcal{N}^D(\mathbf{u},\mathcal{C}^{\mathrm{in}})} \mathbf{W}_{\mathbf{i}}\mathbf{x}^{\mathrm{in}}_{\mathbf{u}+\mathbf{i}} \quad \text{for } \mathbf{u}\in\mathcal{C}^{\mathrm{out}} \tag{3} $$

Eq. (3). Generalized sparse convolution. At output coordinate $\mathbf{u}$, only the input neighbors that actually exist are summed.

Convolution Meaning

What makes Eq. (3) a “generalization” is less the form of the formula than the freedom in the coordinate sets.

Element	Role	Reading
$\mathcal{C}^{\mathrm{in}}$	Coordinate set of the input sparse tensor	Lattice positions where features actually exist
$\mathcal{C}^{\mathrm{out}}$	Coordinate set of the output sparse tensor	Can be newly defined by stride, pooling, transposed convolution, and so on
$V^D(K)$	Candidate kernel offsets in $D$ dimensions	The set of possible offsets inside a hypercube of size $K$
$\mathcal{N}^D(\mathbf{u},\mathcal{C}^{\mathrm{in}})$	Input-neighbor offsets that are actually valid	Keeps only offsets with $\mathbf{u}+\mathbf{i}\in\mathcal{C}^{\mathrm{in}}$
$\mathbf{W}_{\mathbf{i}}$	Learnable weight per offset	Transforms the feature arriving from each offset into output channels
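
Eq. (3) can be sketched directly in a few lines of pure Python. This is a readability-first illustration under simplifying assumptions (2D coordinates, list-of-lists features, a dictionary `weights` from offset to an `n_out x n_in` matrix, all names hypothetical), not how the engine implements it.

```python
from itertools import product

def sparse_conv(in_coords, in_feats, out_coords, weights):
    """For each output coordinate u, sum W_i x_{u+i} over the offsets i
    for which u+i exists in the input coordinate set."""
    row_of = {c: r for r, c in enumerate(in_coords)}  # coordinate -> row
    n_out = len(next(iter(weights.values())))
    out = [[0.0] * n_out for _ in out_coords]
    for o, u in enumerate(out_coords):
        for offset, W in weights.items():
            neighbor = tuple(a + b for a, b in zip(u, offset))
            r = row_of.get(neighbor)
            if r is None:
                continue  # empty neighbor: contributes nothing, costs nothing
            x = in_feats[r]
            for k in range(n_out):
                out[o][k] += sum(W[k][c] * x[c] for c in range(len(x)))
    return out

in_coords = [(0, 0), (1, 0), (5, 5)]
in_feats = [[1.0], [2.0], [4.0]]
# A 3x3 kernel whose weights are all 1, so each output sums its neighbors.
weights = {off: [[1.0]] for off in product((-1, 0, 1), repeat=2)}

same_coords = sparse_conv(in_coords, in_feats, in_coords, weights)
new_coords = sparse_conv(in_coords, in_feats, [(2, 0)], weights)
```

The second call uses an output coordinate that is not in the input set, which is the freedom Eq. (3) adds over a convolution that keeps $\mathcal{C}^{\mathrm{out}}=\mathcal{C}^{\mathrm{in}}$.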

3. Kernel shape: hypercube vs hyper-cross

In 4D and above, the number of offsets in a full hypercube kernel grows quickly. To reduce this cost, the paper uses not only full kernels that look in all spatial and temporal directions, but also hyper-cross kernels built around axis-aligned offsets, and hybrid kernels.

Figure 3. Various kernels in space-time. The red arrow denotes the temporal dimension. The paper combines hypercubic and hypercross kernels to balance 4D receptive field against computation.
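
To make the cost difference concrete, the sketch below enumerates the offsets of each kernel shape for kernel size 3. The helper names are hypothetical, and the `hybrid` layout (cubic in the three spatial axes, cross-shaped along time) is this review's reading of the hybrid kernel, stated here as an assumption.

```python
from itertools import product

def hypercube(dim, k=3):
    """All offsets inside a k-sized hypercube: k**dim of them."""
    r = k // 2
    return list(product(range(-r, r + 1), repeat=dim))

def hypercross(dim, k=3):
    """Center plus offsets along each axis only: dim * (k - 1) + 1 of them."""
    r = k // 2
    offsets = [(0,) * dim]
    for axis in range(dim):
        for s in range(-r, r + 1):
            if s != 0:
                offsets.append(tuple(s if a == axis else 0 for a in range(dim)))
    return offsets

def hybrid(k=3):
    """Assumed layout: cubic over (x, y, z), cross-shaped along t."""
    r = k // 2
    spatial = [o + (0,) for o in hypercube(3, k)]
    temporal = [(0, 0, 0, t) for t in range(-r, r + 1) if t != 0]
    return spatial + temporal

n_cube, n_cross, n_hybrid = len(hypercube(4)), len(hypercross(4)), len(hybrid())
```

With size 3 in 4D this gives 81 offsets for the hypercube, 9 for the hyper-cross, and 29 for the assumed hybrid layout, and each offset carries its own weight matrix $\mathbf{W}_{\mathbf{i}}$.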

4. Minkowski Engine: coordinate manager and kernel map

The Minkowski Engine does not run a brute-force neighbor search for every sparse convolution. A coordinate manager manages the coordinate sets, and a kernel map stores, per offset, the links between input rows and output rows. When the same coordinate/kernel specification recurs, this mapping can be cached.

System Thread Summary

From an implementation standpoint, the most important flow is coordinate quantization → coordinate map → kernel map → gather/GEMM/scatter-add.

Step	What it handles	Why it is needed
Coordinate quantization	Converts continuous points to discrete lattice coordinates	Input normalization that turns a point cloud into a sparse tensor
Coordinate manager	Manages unique coordinate sets and coordinate strides	Coordinate reuse across layers and lower lookup cost
Kernel map	Stores input/output row index pairs per offset	Gathers only existing neighbors and performs scatter-add
Sparse operators	Runs convolution, pooling, and transposed convolution	Carries the layer grammar of dense CNNs over to the sparse domain
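
The coordinate-manager and kernel-map roles in the table can be sketched with a hash map and a cache. The class below is a toy stand-in written for this review: its name, methods, and cache key are assumptions and do not reproduce the Minkowski Engine API.

```python
class CoordinateManager:
    """Toy coordinate manager: stores coordinate sets under a key and
    caches kernel maps so repeated layers do not redo neighbor lookup."""

    def __init__(self):
        self._coords = {}  # key -> sorted list of unique coordinates
        self._maps = {}    # (in_key, out_key, offsets) -> kernel map
        self.builds = 0    # how many kernel maps were actually built

    def insert(self, key, coords):
        self._coords[key] = sorted(set(coords))
        return self._coords[key]

    def stride(self, in_key, out_key, s):
        """Output coordinates of a stride-s layer: snap to multiples of s."""
        snapped = {tuple((c // s) * s for c in u) for u in self._coords[in_key]}
        return self.insert(out_key, snapped)

    def kernel_map(self, in_key, out_key, offsets):
        cache_key = (in_key, out_key, tuple(offsets))
        if cache_key not in self._maps:
            self.builds += 1
            row_of = {c: r for r, c in enumerate(self._coords[in_key])}
            kmap = {}
            for off in offsets:
                pairs = []
                for o, u in enumerate(self._coords[out_key]):
                    r = row_of.get(tuple(a + b for a, b in zip(u, off)))
                    if r is not None:
                        pairs.append((r, o))  # (input row, output row)
                if pairs:
                    kmap[off] = pairs
            self._maps[cache_key] = kmap
        return self._maps[cache_key]

cm = CoordinateManager()
cm.insert("s1", [(0, 0), (1, 0), (3, 2)])
out_coords = cm.stride("s1", "s2", 2)
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]  # 2x2 kernel of a stride-2 layer
kmap = cm.kernel_map("s1", "s2", offsets)
kmap_again = cm.kernel_map("s1", "s2", offsets)  # served from the cache
```

Each `(input row, output row)` list is what a gather, one matrix multiplication with that offset's weight, and a scatter-add would consume; offsets with no existing neighbor never appear in the map.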

5. Network architecture and TS-CRF

MinkowskiNet is a ResNet-family architecture ported to sparse convolution, and MinkowskiUNet is an encoder-decoder architecture for semantic segmentation. To make the 4D network's output more consistent, the paper also uses a TS-CRF in 7D space-time-chroma space.

Figure 4. Architecture of ResNet18 (left) and MinkowskiNet18 (right). MinkowskiNet keeps the structural prior of the original ResNet block and swaps only the convolution primitive for sparse convolution.

Figure 5. Architecture of MinkowskiUNet32. In the caption, $\times$ denotes a hypercubic kernel and $+$ a hypercross kernel. The U-Net structure uses encoder-decoder skip connections to combine coarse context with local detail.

Implementation algorithms / TS-CRF equations

Algorithm roles

The paper includes several algorithms that describe the engine implementation. The key equations stay in the main text, while the supporting implementation flow is summarized by role below.

Algorithm	Role	Main operation
Alg. 1	GPU Sparse Tensor Quantization	Turns coordinates into hash keys and resolves collisions and label conflicts with sort/unique/reduce
Alg. 2	Generalized Sparse Convolution	Follows per-offset kernel maps to gather features, multiply matrices, and scatter-add into the output
Alg. 3-4	Max / Average Pooling	Reduces input features that land on the same output coordinate, or averages them with a sparse matrix product
Alg. 5	TS-CRF Variational Inference	Iterates sparse convolution and softmax over the 7D coordinates $[C,F,T]$
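
The pooling rows of the table can be illustrated with a small reduction over inputs that share an output coordinate. This sketch assumes a stride-2 pooling window and an average that divides by the number of inputs actually present (not by the window volume); the function name `pool` is hypothetical.

```python
def pool(in_coords, in_feats, stride, reduce="max"):
    """Group inputs by their strided output coordinate, then reduce each
    group channel-wise with max or with the mean over existing inputs."""
    groups = {}
    for u, f in zip(in_coords, in_feats):
        key = tuple((c // stride) * stride for c in u)
        groups.setdefault(key, []).append(f)
    out_coords = sorted(groups)
    if reduce == "max":
        out = [[max(col) for col in zip(*groups[c])] for c in out_coords]
    else:
        out = [[sum(col) / len(groups[c]) for col in zip(*groups[c])]
               for c in out_coords]
    return out_coords, out

in_coords = [(0, 0), (1, 1), (2, 0)]
in_feats = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
max_coords, max_feats = pool(in_coords, in_feats, stride=2, reduce="max")
avg_coords, avg_feats = pool(in_coords, in_feats, stride=2, reduce="avg")
```

Empty cells inside a pooling window never enter the reduction, so they neither lower the average nor need to be stored.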

TS-CRF fixed-point update

TS-CRF implements pairwise message passing over 7D space-time-chroma coordinates as a generalized sparse convolution.

$$ Q_i^{+}(x_i)=\frac{1}{Z_i}\exp\left\{ \phi_u(x_i)+\sum_{j\in\mathcal{N}^7(x_i)}\sum_{x_j} \phi_p(x_i,x_j)Q_j(x_j) \right\} \tag{4} $$

Eq. (4). TS-CRF mean-field update. The fixed-point update of the trilateral stationary CRF.
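
One iteration of Eq. (4) can be written out for a tiny graph. The sketch assumes that neighbors are given explicitly as `(j, offset_id)` pairs and that the pairwise term is one label-by-label matrix per offset (the "stationary" part); the variable names and toy numbers are illustrative only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mean_field_step(Q, phi_u, neighbors, phi_p):
    """Q_i^+(x_i) = softmax over x_i of
    phi_u(x_i) + sum_j sum_{x_j} phi_p(x_i, x_j) Q_j(x_j)."""
    new_Q = []
    for i, unary in enumerate(phi_u):
        logits = list(unary)
        for j, off in neighbors[i]:
            for xi in range(len(logits)):
                logits[xi] += sum(phi_p[off][xi][xj] * Q[j][xj]
                                  for xj in range(len(Q[j])))
        new_Q.append(softmax(logits))  # division by Z_i
    return new_Q

phi_u = [[2.0, 0.0], [0.0, 0.1]]        # node 0 is confident, node 1 is not
neighbors = {0: [(1, 0)], 1: [(0, 0)]}  # the two nodes are mutual neighbors
phi_p = {0: [[1.0, 0.0], [0.0, 1.0]]}   # rewards agreeing labels
Q = [softmax(u) for u in phi_u]
Q_next = mean_field_step(Q, phi_u, neighbors, phi_p)
```

The uncertain node is pulled toward its confident neighbor's label, which is the consistency effect the CRF adds on top of the network output.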

$$ \frac{\partial L}{\partial \phi_p} = \sum_{n}^{N}\frac{\partial L}{\partial Q_n^{+}}\frac{\partial Q_n^{+}}{\partial \phi_p}, \qquad \frac{\partial L}{\partial \phi_u} = \sum_{n}^{N}\frac{\partial L}{\partial Q_n^{+}}\frac{\partial Q_n^{+}}{\partial \phi_u} \tag{5} $$

Eq. (5). TS-CRF training gradient. Gradients for learning the unary network and the CRF compatibility function.

Evidence: which tasks validate sparse 3D/4D CNNs?

The experiments separately test whether sparse CNNs are strong at 3D segmentation, whether 4D convolution actually helps with temporal context, and whether the engine is efficient enough. The evaluation is easier to follow when read by task rather than by dataset name.

Evaluation Brief

Each experiment checks the expressiveness of generalized sparse convolution, 4D temporal modeling, or runtime efficiency from a different angle.

3D semantic segmentation

Checks the 3D scene understanding performance of sparse CNNs on ScanNet, S3DIS, and RueMonge.

4D spatio-temporal segmentation

Checks the effect of time-axis convolution and TS-CRF on Synthia 4D and noisy Synthia.

Efficiency

Compares 3D/4D MinkNet runtime across voxel sizes and video lengths.

3D semantic segmentation

Table 1. 3D Semantic Label Benchmark on ScanNet. Compares mIoU on ScanNet semantic segmentation by voxel size and network depth. Values in parentheses are voxel sizes, and $\dagger$ marks post-deadline results.

Figure 7. Visualization of Scannet predictions. From top: 3D input point cloud, network prediction, ground truth.

4D spatio-temporal segmentation

Table 2. Segmentation results on the 4D Synthia dataset. TA denotes temporal averaging. 4D MinkNet + TS-CRF is reported as the best combination in mIoU/mAcc.

Table 3. Segmentation results on the noisy Synthia 4D dataset. The paper's analysis is that, on noisy input, temporal averaging can actually introduce noise, and that the 4D network is more robust.

Figure 6. Visualizations of 3D (top) and 4D networks (bottom) on Synthia. Shows the difference between frame-wise 3D prediction and temporal 4D prediction on the same Synthia sequence.

Additional 3D benchmarks / runtime

Table 4. Stanford Area 5 Test (Fold #1) on S3DIS. On S3DIS as well, MinkowskiNet20/32 show competitive mIoU/mAcc against point-based and sparse baselines.

Figure 8. Visualization of Stanford dataset Area 5 test results. From top: RGB input, prediction, ground truth.

Table 5. RueMonge 2014 dataset (Varcity) TASK3. The paper explains that RueMonge is a small dataset on which performance saturates quickly, and that Synthia 4D is better suited to ablation.

Table 6. Time to process 3D videos with 3D and 4D MinkNet. Compares runtime in seconds for a $50\,\mathrm{m}\times50\,\mathrm{m}\times50\,\mathrm{m}$ room while varying voxel size and video length. In some settings the 4D network, while also seeing temporal context, shows a speed gain from processing the sparse volume in a batch-like manner.

Usage / Limits: when is it a good fit?

MinkowskiCNN suits data in which most of space is empty, such as point clouds and voxelized reconstructions. In particular, when a temporal point-cloud sequence can be bundled into a 4D sparse tensor, temporal information integrates more naturally than with a frame-wise 3D network.

When to Use / Avoid

The stronger the sparse coordinate structure, the clearer the advantage of this method.

Situation	Judgment	Reason
Good fit	LiDAR, RGB-D, reconstructed point cloud, 3D video segmentation	Many empty voxels, so sparse coordinate computation pays off
Strong use case	Temporal point cloud / 3D video	4D convolution handles temporal context inside the feature hierarchy
Check carefully	Choice of quantization resolution	Voxel size strongly drives the accuracy/runtime trade-off
Limitation	Nearly dense observations, or problems where coordinates are constantly regenerated	The benefits of sparse map reuse and empty-space saving can shrink

Takeaway

(Writing in progress...)

Problem: how can 3D videos be processed without dense grids?

The paper starts from a simple observation: most 3D/4D perception data is sparse. LiDAR scans, depth-camera sequences, and RGB-D reconstructions occupy only a small subset of a full voxel grid, so dense 3D/4D convolution wastes computation on empty space.

Problem Flow

The introduction argues that 4D perception is attractive, but dense processing is expensive and frame-wise 3D processing does not model temporal structure directly.

01 3D/4D data is sparse

Only observed surfaces contain points; most cells are empty.

02 Dense convolution is expensive

Kernel locations and volume size grow quickly with dimension.

03 Frame-wise processing loses time

Temporal consistency is not learned inside the network hierarchy.

04 Sparse tensor as common format

Coordinates and features allow CNN layers to move into sparse domains.

Related work context

Where prior work fits

Related work explains why the paper chooses sparse tensors and convolutional representation.

Family	Strength	Limitation in this paper
Dense voxel CNN	Direct CNN on 3D grids	High memory and compute because most 3D cells are empty
Point networks	Avoid dense voxelization	CNN-style locality and hierarchy are less direct
Sparse convolution	Computes only at observed coordinates	This paper generalizes it to arbitrary dimension, kernel shape, and reusable maps
Early 4D perception	Handles spatio-temporal data	Deep homogeneous 4D convolutional networks were limited

Mechanism: what does generalized sparse convolution change?

The method combines sparse tensor representation with a convolution definition whose input and output coordinate sets can differ. This is what lets dense convolution, sparse submanifold convolution, strided convolution, pooling, and transposed convolution share one coordinate-map view.

1. Sparse tensor representation

The paper stores coordinates and features together. In 4D video, coordinates include $(x,y,z,t)$, plus a batch index in implementation.

Eq. (1). Sparse tensor storage. The coordinate matrix $\mathbf{C}$ and feature matrix $\mathbf{F}$ represent only occupied sparse-lattice locations.

2. Generalized sparse convolution

Eq. (2) is the dense reference. Eq. (3) relaxes it by summing only over offsets that exist in the input coordinate set and by allowing $\mathcal{C}^{\mathrm{in}}$ and $\mathcal{C}^{\mathrm{out}}$ to differ.

$$ \mathbf{x}^{\mathrm{out}}_{\mathbf{u}} = \sum_{\mathbf{i}\in V^D(K)} \mathbf{W}_{\mathbf{i}}\mathbf{x}^{\mathrm{in}}_{\mathbf{u}+\mathbf{i}} \quad \text{for } \mathbf{u}\in\mathbb{Z}^D \tag{2} $$

Eq. (2). Dense convolution reference. This is the conventional convolution over every location in a $D$-dimensional dense grid.

Eq. (3). Generalized sparse convolution. For each output coordinate, the sum keeps only input-neighbor offsets that actually exist in the sparse coordinate set.

Convolution Meaning

The generalization is mainly about coordinate-set freedom.

Element	Role	Reading
$\mathcal{C}^{\mathrm{in}}$	Input coordinate set	Locations where features exist
$\mathcal{C}^{\mathrm{out}}$	Output coordinate set	Can change through stride, pooling, or transposed convolution
$V^D(K)$	Candidate kernel offsets in $D$ dimensions	All offsets inside a $K$-sized hypercube
$\mathcal{N}^D(\mathbf{u},\mathcal{C}^{\mathrm{in}})$	Valid input-neighbor offsets	Keeps only offsets where $\mathbf{u}+\mathbf{i}\in\mathcal{C}^{\mathrm{in}}$
$\mathbf{W}_{\mathbf{i}}$	Learnable offset weights	Transforms neighbor features into output channels

3. Kernel shape

For 4D and higher dimensions, full hypercubic kernels become expensive. The paper therefore combines hypercubic and hyper-cross kernels.

4. Minkowski Engine

The engine avoids brute-force neighbor search. A coordinate manager maintains sparse coordinate sets, and kernel maps store offset-wise input/output row pairs that can be reused.

System Thread Summary

The main implementation flow is quantization → coordinate map → kernel map → gather/GEMM/scatter-add.

Step	Role	Why it matters
Coordinate quantization	Converts continuous points to discrete lattice coordinates	Creates sparse tensor input
Coordinate manager	Manages unique coordinate sets and strides	Enables reuse and fast lookup
Kernel map	Stores row pairs per offset	Computes only existing neighbors
Sparse operators	Runs convolution, pooling, and transposed convolution	Transfers CNN layer grammar to sparse domains

5. Networks and TS-CRF

MinkowskiNet transfers ResNet-style blocks to sparse convolution, and MinkowskiUNet builds an encoder-decoder for semantic segmentation. TS-CRF adds a 7D space-time-chroma refinement layer.

Implementation algorithms / TS-CRF equations

Algorithm roles

The implementation algorithms are support material for understanding the engine.

Algorithm	Role	Main operation
Alg. 1	GPU Sparse Tensor Quantization	Hash, sort, unique, and reduce coordinates/labels
Alg. 2	Generalized Sparse Convolution	Gather, GEMM, and scatter-add through kernel maps
Alg. 3-4	Max / Average Pooling	Reduce or average features that map to the same output coordinate
Alg. 5	TS-CRF Variational Inference	Iterates sparse convolution and softmax on $[C,F,T]$

TS-CRF fixed-point update

$$ Q_i^{+}(x_i)=\frac{1}{Z_i}\exp\left\{ \phi_u(x_i)+\sum_{j\in\mathcal{N}^7(x_i)}\sum_{x_j} \phi_p(x_i,x_j)Q_j(x_j) \right\} \tag{4} $$

Eq. (4). TS-CRF mean-field update. The update performs message passing in 7D space-time-chroma coordinates before normalizing the label belief.

Eq. (5). TS-CRF training gradients. The gradients propagate loss through the CRF update to learn the unary term and pairwise compatibility.

Evidence: which tasks validate sparse 3D/4D CNNs?

The experiments test three questions: whether sparse CNNs work for 3D segmentation, whether 4D convolution helps temporal data, and whether the system is efficient enough in practice.

Evaluation Brief

Each evidence block corresponds to a claim about representation, temporal modeling, or runtime.

3D semantic segmentation

ScanNet, S3DIS, and RueMonge evaluate sparse 3D scene understanding.

4D segmentation

Synthia 4D and noisy Synthia test temporal context and TS-CRF.

Efficiency

Runtime varies voxel size and video length for 3D/4D networks.

3D semantic segmentation

4D spatio-temporal segmentation

Additional 3D benchmarks / runtime

Usage / Limits: when is it useful?

MinkowskiCNN is well suited to sparse point clouds, voxelized reconstructions, and temporal point-cloud sequences. Its advantage is strongest when empty-space saving and coordinate-map reuse matter.

When to Use / Avoid

The method is most useful when sparse coordinate structure is strong.

Situation	Judgment	Reason
Good fit	LiDAR, RGB-D, reconstructed point clouds, 3D video segmentation	Most voxels are empty
Strong use case	Temporal point-cloud / 3D video	4D convolution handles time inside the feature hierarchy
Check carefully	Quantization resolution	Voxel size controls accuracy/runtime trade-off
Limitation	Nearly dense observations or constantly changing coordinates	Sparse-map reuse and empty-space savings become weaker

Takeaway

(Writing in progress...)

