ABSTRACT
Recent telepresence systems have shown significant quality improvements over prior work. However, they struggle to achieve low cost and high quality at the same time. In this work, we envision a future where telepresence systems become a commodity and can be installed on typical desktops. To this end, we present a high-quality view synthesis method built on a cost-effective capture system composed of commodity hardware accessible to the general public. We propose a neural renderer that takes a few RGBD cameras as input and synthesizes novel views of a user and their surroundings. At the core of the renderer is the Multi-Layer Point Cloud (MPC), a novel 3D representation that improves reconstruction accuracy by removing non-linear biases in depth cameras. Our temporally-aware renderer further improves the stability of synthesized videos by conditioning on past information. Additionally, we propose Spatial Skip Connections (SSC) to improve image upsampling under limited GPU memory. Experimental results show that our renderer outperforms recent methods in view synthesis quality. Our method generalizes to new users and challenging content (e.g., hand gestures and clothing deformation) without costly per-video optimization, object templates, or heavy pre-processing. The code and dataset will be made available.