MonoSDF

Exploring Monocular Geometric Cues for

Neural Implicit Surface Reconstruction

NeurIPS 2022

Zehao Yu1     Songyou Peng2,3     Michael Niemeyer1,3      Torsten Sattler4      Andreas Geiger1,3
1University of Tübingen     2ETH Zurich     3MPI for Intelligent Systems, Tübingen     4Czech Technical University in Prague

TL;DR: We demonstrate that state-of-the-art depth and normal cues extracted from monocular images are complementary to multi-view reconstruction cues and hence significantly improve the performance of implicit surface reconstruction methods.

Abstract

In recent years, neural implicit surface reconstruction methods have become popular for multi-view 3D reconstruction. In contrast to traditional multi-view stereo methods, these approaches tend to produce smoother and more complete reconstructions due to the inductive smoothness bias of neural networks. State-of-the-art neural implicit methods allow for high-quality reconstructions of simple scenes from many input views. Yet, their performance drops significantly for larger and more complex scenes and for scenes captured from sparse viewpoints. This is caused primarily by the inherent ambiguity of the RGB reconstruction loss, which does not provide enough constraints, in particular in less-observed and textureless areas. Motivated by recent advances in monocular geometry prediction, we systematically explore the utility these cues provide for improving neural implicit surface reconstruction. We demonstrate that depth and normal cues, predicted by general-purpose monocular estimators, significantly improve reconstruction quality and reduce optimization time. Further, we analyse multiple design choices for representing neural implicit surfaces, ranging from monolithic MLP models to single-grid and multi-resolution grid representations. We observe that geometric monocular priors improve performance both for small-scale single-object and large-scale multi-object scenes, independent of the choice of representation.

Method

We use monocular geometric cues predicted by a general-purpose pretrained network to guide the optimization of neural implicit surface models. More specifically, for a batch of rays, we volume render predicted RGB colors, depths, and normals, and optimize them with respect to the input RGB images and the monocular geometric cues. Further, we investigate different design choices for neural implicit architectures and provide an in-depth analysis.
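To make the rendering and supervision concrete, below is a minimal PyTorch sketch of one optimization step. It assumes a VolSDF-style SDF-to-density transform and a per-batch least-squares scale-and-shift alignment of the rendered depth to the (scale-ambiguous) monocular depth, as described in the paper. All function names, tensor shapes, and loss weights (lam_d, lam_n, lam_e, beta) are illustrative assumptions, not taken from the official MonoSDF code.

import torch
import torch.nn.functional as F

def sdf_to_density(sdf, alpha=1.0, beta=0.1):
    # VolSDF-style transform: density = alpha * Laplace-CDF(-sdf).
    return alpha * torch.where(
        sdf > 0,
        0.5 * torch.exp(-sdf / beta),
        1.0 - 0.5 * torch.exp(sdf / beta),
    )

def render_rays(rgb, sdf, normals, z_vals):
    # Alpha-composite color, depth, and normals along each ray.
    # rgb: (R,S,3), sdf: (R,S), normals: (R,S,3), z_vals: (R,S).
    density = sdf_to_density(sdf)
    delta = z_vals[:, 1:] - z_vals[:, :-1]
    delta = torch.cat([delta, 1e10 * torch.ones_like(delta[:, :1])], dim=-1)
    a = 1.0 - torch.exp(-density * delta)                  # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(a[:, :1]), 1.0 - a + 1e-10], dim=-1), dim=-1
    )[:, :-1]                                              # transmittance
    w = trans * a                                          # ray weights (R,S)
    c_hat = (w[..., None] * rgb).sum(dim=1)                # rendered color
    d_hat = (w * z_vals).sum(dim=1)                        # rendered depth
    n_hat = F.normalize((w[..., None] * normals).sum(dim=1), dim=-1)
    return c_hat, d_hat, n_hat

def monosdf_loss(c_hat, d_hat, n_hat, c_gt, d_mono, n_mono, grads,
                 lam_d=0.1, lam_n=0.05, lam_e=0.1):        # illustrative weights
    l_rgb = (c_hat - c_gt).abs().mean()
    # Monocular depth is only defined up to scale and shift, so solve the
    # 2x2 normal equations for a per-batch alignment (scale, shift) first.
    a00 = (d_hat * d_hat).sum(); a01 = d_hat.sum(); n = float(d_hat.numel())
    b0 = (d_hat * d_mono).sum(); b1 = d_mono.sum()
    det = (a00 * n - a01 * a01).clamp_min(1e-8)
    scale = (n * b0 - a01 * b1) / det
    shift = (a00 * b1 - a01 * b0) / det
    l_depth = ((scale * d_hat + shift) - d_mono).pow(2).mean()
    # Normal cues: L1 difference plus angular (cosine) consistency.
    l_normal = ((n_hat - n_mono).abs().sum(-1)
                + (1.0 - (n_hat * n_mono).sum(-1)).abs()).mean()
    # Eikonal regularizer on SDF gradients at sampled 3D points.
    l_eik = (grads.norm(dim=-1) - 1.0).pow(2).mean()
    return l_rgb + lam_d * l_depth + lam_n * l_normal + lam_e * l_eik

The closed-form scale-and-shift fit mirrors the least-squares alignment commonly used when comparing against monocular depth; detaching scale and shift from the computation graph is a common variant of this sketch.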

Results

ScanNet

We test our method on the ScanNet dataset and compare to state-of-the-art methods. Our approach achieves significantly better reconstruction results.

Tanks and Temples

We test our method on the Tanks and Temples dataset and compare to state-of-the-art methods. MonoSDF is the first neural implicit model to achieve reasonable results on such a large-scale indoor scene.

Tanks and Temples with High-resolution Monocular Cues

We show preliminary results of using high-resolution monocular cues on the Tanks and Temples dataset.

DTU with 3 Input Views

We test our method on the DTU dataset with only 3 input views. In this sparse-view setting, our monocular geometric cues significantly boost reconstruction quality.

DTU with All Input Views

We test our method on the DTU dataset with all input views. Using multi-resolution feature grids together with monocular geometric cues significantly boosts the reconstruction results, as sketched below.
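For readers unfamiliar with this representation, here is a hedged sketch of what a multi-resolution feature grid can look like: features are trilinearly interpolated at each resolution level, concatenated, and decoded by a shallow MLP into an SDF value. The class name, level count, resolutions, and feature width are illustrative choices, and this dense-grid version omits the memory-saving spatial hashing used in practice.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResGridSDF(nn.Module):
    # Dense multi-resolution feature grids plus a shallow MLP decoder (sketch).
    def __init__(self, n_levels=4, base_res=16, feat_dim=8, hidden=64):
        super().__init__()
        # One dense feature grid per level; resolution doubles per level.
        self.grids = nn.ParameterList([
            nn.Parameter(0.01 * torch.randn(
                1, feat_dim, base_res * 2**l, base_res * 2**l, base_res * 2**l))
            for l in range(n_levels)])
        self.mlp = nn.Sequential(
            nn.Linear(n_levels * feat_dim, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1))  # SDF value

    def forward(self, x):
        # x: (N, 3) query points in [-1, 1]^3.
        q = x.view(1, -1, 1, 1, 3)  # grid_sample expects a 5D query tensor
        feats = [
            F.grid_sample(g, q, align_corners=True)  # (1, C, N, 1, 1)
            .view(g.shape[1], -1).t()                # -> (N, C)
            for g in self.grids]
        return self.mlp(torch.cat(feats, dim=-1)).squeeze(-1)

# Example: query SDF values at 1024 random points.
# sdf = MultiResGridSDF()(torch.rand(1024, 3) * 2 - 1)

The concatenate-then-decode design lets coarse levels capture smooth global structure while fine levels add local detail, which is why the multi-resolution variant pairs well with the monocular priors above.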

Ablation on Replica

Reconstructions

BibTeX

@article{Yu2022MonoSDF,
  author    = {Yu, Zehao and Peng, Songyou and Niemeyer, Michael and Sattler, Torsten and Geiger, Andreas},
  title     = {MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction},
  journal   = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2022},
}

Acknowledgements

This work was supported by an NVIDIA research gift. We thank the Max Planck ETH Center for Learning Systems (CLS) for supporting SP and the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting MN. ZY is supported by BMWi in the project KI Delta Learning (project number 19A19013O). AG is supported by the ERC Starting Grant LEGO-3D (850533) and DFG EXC number 2064/1 - project number 390727645. TS is supported by the EU Horizon 2020 project RICAIP (grant agreement No. 857306) and the European Regional Development Fund under project IMPACT (No. CZ.02.1.01/0.0/0.0/15_003/0000468). We also thank the authors of Manhattan-SDF for sharing baseline results on ScanNet, and Christian Reiser and Zijian Dong for proofreading.