1 - 20 of 46
1.
Article En | MEDLINE | ID: mdl-38656855

We present a novel framework named NeuralRecon for real-time 3D scene reconstruction from a monocular video. Unlike previous methods that estimate single-view depth maps separately on each key-frame and fuse them later, we propose to directly reconstruct local surfaces, represented as sparse TSDF volumes, for each video fragment sequentially by a neural network. A learning-based TSDF fusion module based on gated recurrent units is used to guide the network to fuse features from previous fragments. This design allows the network to capture the local smoothness prior and global shape prior of 3D surfaces when sequentially reconstructing the surfaces, resulting in accurate, coherent, and real-time surface reconstruction. The fused features can also be used to predict semantic labels, allowing our method to reconstruct and segment the 3D scene simultaneously. Furthermore, we propose an efficient self-supervised fine-tuning scheme that refines scene geometry based on input images through differentiable volume rendering. This fine-tuning scheme improves reconstruction quality on the fine-tuned scenes as well as generalization to similar test scenes. Experiments on the ScanNet, 7-Scenes, and Replica datasets show that our system outperforms state-of-the-art methods in terms of both accuracy and speed.
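For intuition, here is a minimal PyTorch sketch (not the released NeuralRecon code) of fusing per-fragment voxel features with a GRU so the running hidden state carries information from previous fragments; the feature dimension, module names, and TSDF head are illustrative assumptions.

```python
# Hedged sketch: GRU-based fusion of per-fragment voxel features with a
# running hidden state, then a per-voxel TSDF prediction head.
import torch
import torch.nn as nn

class FragmentGRUFusion(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.gru = nn.GRUCell(input_size=feat_dim, hidden_size=feat_dim)
        self.tsdf_head = nn.Linear(feat_dim, 1)  # per-voxel TSDF value

    def forward(self, frag_feats: torch.Tensor, hidden: torch.Tensor):
        # frag_feats, hidden: (num_voxels, feat_dim); hidden carries history
        # from previously processed fragments for the same voxels.
        new_hidden = self.gru(frag_feats, hidden)
        tsdf = torch.tanh(self.tsdf_head(new_hidden)).squeeze(-1)
        return tsdf, new_hidden

fusion = FragmentGRUFusion()
hidden = torch.zeros(1000, 64)        # running state for 1000 active voxels
for _ in range(5):                    # five sequential video fragments
    feats = torch.randn(1000, 64)     # placeholder backbone features
    tsdf, hidden = fusion(feats, hidden)
```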

2.
IEEE Trans Vis Comput Graph ; 30(5): 2098-2108, 2024 May.
Article En | MEDLINE | ID: mdl-38437081

Visual-inertial SLAM (VI-SLAM) is a key technology for Augmented Reality (AR), which allows an AR device to recover its 6-DoF motion in real-time in order to render virtual content with the corresponding pose. Nowadays, smartphones are still the mainstream devices for ordinary users to experience AR. However, the current VI-SLAM methods, although performing well on high-end phones, still face robustness challenges when deployed on the larger stock of mid- and low-end phones. Existing VI-SLAM datasets use either very ideal sensors or only a limited number of devices for data collection, and thus cannot reflect the capability gaps that VI-SLAM methods must bridge when deployed on a large variety of phone models. This work proposes 100-Phones, the first VI-SLAM dataset covering a wide range of mainstream phones on the market. The dataset consists of 350 sequences collected by 100 different models of phones. Through analysis and experiments on the collected data, we conclude that the quality of visual-inertial data varies greatly among mainstream phones, and that current open-source VI-SLAM methods still have serious robustness issues when it comes to mass deployment on mobile phones. We release the dataset to facilitate the robustness improvement of VI-SLAM and to promote the mass popularization of AR. Project page: https://github.com/zju3dv/100-Phones.

3.
Article En | MEDLINE | ID: mdl-38507384

This paper addresses the challenge of reconstructing 3D indoor scenes from multi-view images. Many previous works have shown impressive reconstruction results on textured objects, but they still have difficulty in handling low-textured planar regions, which are common in indoor scenes. An approach to solving this issue is to incorporate planar constraints into the depth map estimation in multi-view stereo-based methods, but the per-view plane estimation and depth optimization lack both efficiency and multi-view consistency. In this work, we show that the planar constraints can be conveniently integrated into the recent implicit neural representation-based reconstruction methods. Specifically, we use an MLP network to represent the signed distance function as the scene geometry. Based on the Manhattan-world assumption and the Atlanta-world assumption, planar constraints are employed to regularize the geometry in floor and wall regions predicted by a 2D semantic segmentation network. To resolve the inaccurate segmentation, we encode the semantics of 3D points with another MLP and design a novel loss that jointly optimizes the scene geometry and semantics in 3D space. Experiments on ScanNet and 7-Scenes datasets show that the proposed method outperforms previous methods by a large margin on 3D reconstruction quality. The code and supplementary materials are available at https://zju3dv.github.io/manhattan_sdf.
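As an illustration of the planar regularization described above, the following hedged sketch derives SDF normals by automatic differentiation and penalizes floor normals that deviate from the up axis and wall normals that are not horizontal; the tiny MLP, point sets, and loss weights are placeholders, not the paper's configuration.

```python
# Hedged sketch: Manhattan-style normal regularization on an SDF MLP.
import torch
import torch.nn as nn

sdf_mlp = nn.Sequential(nn.Linear(3, 64), nn.Softplus(), nn.Linear(64, 1))

def sdf_normals(pts):
    # Surface normals are the normalized gradient of the SDF w.r.t. position.
    pts = pts.requires_grad_(True)
    sdf = sdf_mlp(pts)
    (grad,) = torch.autograd.grad(sdf.sum(), pts, create_graph=True)
    return torch.nn.functional.normalize(grad, dim=-1)

up = torch.tensor([0.0, 0.0, 1.0])
floor_pts = torch.rand(128, 3)  # 3D points whose pixels were labeled "floor"
wall_pts = torch.rand(128, 3)   # points labeled "wall"

n_floor = sdf_normals(floor_pts)
n_wall = sdf_normals(wall_pts)
loss_floor = (1.0 - n_floor @ up).mean()   # floor normals align with up
loss_wall = (n_wall @ up).abs().mean()     # wall normals orthogonal to up
planar_loss = loss_floor + loss_wall
planar_loss.backward()
```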

4.
IEEE Trans Pattern Anal Mach Intell ; 46(6): 4147-4159, 2024 Jun.
Article En | MEDLINE | ID: mdl-38231799

This paper addresses the challenge of reconstructing an animatable human model from a multi-view video. Some recent works have proposed to decompose a non-rigidly deforming scene into a canonical neural radiance field and a set of deformation fields that map observation-space points to the canonical space, thereby enabling them to learn the dynamic scene from images. However, they represent the deformation field as a translational vector field or an SE(3) field, which makes the optimization highly under-constrained. Moreover, these representations cannot be explicitly controlled by input motions. Instead, we introduce blend weight fields to produce the deformation fields. Based on the skeleton-driven deformation, blend weight fields are used with 3D human skeletons to generate observation-to-canonical and canonical-to-observation correspondences. Since 3D human skeletons are more observable, they can regularize the learning of deformation fields. Moreover, the blend weight fields can be combined with input skeletal motions to generate new deformation fields to animate the human model. To improve the quality of human modeling, we further represent the human geometry as a signed distance field in the canonical space. Additionally, a neural point displacement field is introduced to enhance the capability of the blend weight field in modeling detailed human motions. Experiments show that our approach significantly outperforms recent human modeling methods.
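The skeleton-driven deformation can be sketched as linear blend skinning with a learned weight field: a network predicts per-bone skinning weights at each observation-space point, and blended inverse bone transforms map the point to canonical space. The sketch below is an assumption-laden simplification; the bone count, network size, and direct blending of inverse transforms are all illustrative.

```python
# Hedged sketch: observation-to-canonical mapping via a learned blend
# weight field and inverse bone transforms (not the paper's exact model).
import torch
import torch.nn as nn

num_bones = 24
weight_field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                             nn.Linear(64, num_bones), nn.Softmax(dim=-1))

def observation_to_canonical(x_obs, G):
    # x_obs: (N, 3) observation-space points; G: (num_bones, 4, 4) bone
    # transforms from canonical to observation pose.
    w = weight_field(x_obs)                          # (N, num_bones)
    G_inv = torch.inverse(G)                         # (num_bones, 4, 4)
    blended = torch.einsum('nk,kij->nij', w, G_inv)  # (N, 4, 4)
    x_h = torch.cat([x_obs, torch.ones(len(x_obs), 1)], dim=-1)
    x_can = torch.einsum('nij,nj->ni', blended, x_h)[:, :3]
    return x_can

G = torch.eye(4).repeat(num_bones, 1, 1)  # identity pose placeholder
x_can = observation_to_canonical(torch.randn(8, 3), G)
```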

5.
Article En | MEDLINE | ID: mdl-38215333

It is typically challenging for visual or visual-inertial odometry systems to handle the problems of dynamic scenes and pure rotation. In this work, we design a novel visual-inertial odometry (VIO) system called RD-VIO to handle both of these problems. Firstly, we propose an IMU-PARSAC algorithm which can robustly detect and match keypoints in a two-stage process. In the first stage, landmarks are matched with new keypoints using visual and IMU measurements. We collect statistical information from the matching and then use it to guide the intra-keypoint matching in the second stage. Secondly, to handle the problem of pure rotation, we detect the motion type and adapt the deferred-triangulation technique during the data-association process. Pure-rotational frames are turned into special subframes; when solving the visual-inertial bundle adjustment, they provide additional constraints on the pure-rotational motion. We evaluate the proposed VIO system on public datasets and in an online comparison. Experiments show that the proposed RD-VIO has clear advantages over other methods in dynamic environments.
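A rough sketch of the two-stage matching idea follows (heavily simplified, and not the actual IMU-PARSAC algorithm): an IMU-predicted rotation suggests where keypoints should reappear, residual statistics from confident first-stage matches define an adaptive gate, and the gate admits further matches in the second stage. All names, thresholds, and the pure-rotation camera model are assumptions.

```python
# Hedged sketch: IMU-guided two-stage keypoint gating.
import numpy as np

def predict_keypoints(kps, K, R_imu):
    # Rotate normalized bearings by the IMU delta rotation (pure-rotation model).
    pts_h = np.column_stack([kps, np.ones(len(kps))])
    bearings = (np.linalg.inv(K) @ pts_h.T).T
    rotated = (R_imu @ bearings.T).T
    proj = (K @ rotated.T).T
    return proj[:, :2] / proj[:, 2:3]

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R_imu = np.eye(3)                              # placeholder IMU rotation
prev = np.random.rand(100, 2) * [640, 480]
curr = predict_keypoints(prev, K, R_imu) + np.random.randn(100, 2)

# Stage 1: accept only very close predictions, collect residual statistics.
residuals = np.linalg.norm(curr - predict_keypoints(prev, K, R_imu), axis=1)
stage1 = residuals < 2.0
gate = residuals[stage1].mean() + 3 * residuals[stage1].std()
# Stage 2: re-match the remaining keypoints under the adaptive gate.
stage2 = (~stage1) & (residuals < gate)
```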

6.
Article En | MEDLINE | ID: mdl-37294654

Although part-based motion synthesis networks have been investigated to reduce the complexity of modeling heterogeneous human motions, their computational cost remains prohibitive in interactive applications. To this end, we propose a novel two-part transformer network that aims to achieve high-quality, controllable motion synthesis results in real-time. Our network separates the skeleton into the upper and lower body parts, reducing the expensive cross-part fusion operations, and models the motions of each part separately through two streams of auto-regressive modules formed by multi-head attention layers. However, such a design might not sufficiently capture the correlations between the parts. We thus intentionally let the two parts share the features of the root joint and design a consistency loss to penalize differences between the root features and motions estimated by the two auto-regressive modules, significantly improving the quality of synthesized motions. After training on our motion dataset, our network can synthesize a wide range of heterogeneous motions, like cartwheels and twists. Experimental and user study results demonstrate that our network is superior to state-of-the-art human motion synthesis networks in the quality of generated motions.
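To make the shared-root design concrete, here is an illustrative PyTorch sketch (not the paper's exact architecture): two per-part encoder streams each predict the root joint, and a consistency loss penalizes disagreement between the two root estimates. Feature dimensions, window length, and head counts are assumptions.

```python
# Hedged sketch: two-part streams with a root-consistency loss.
import torch
import torch.nn as nn

class PartStream(nn.Module):
    def __init__(self, dim=64, joints=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.joint_head = nn.Linear(dim, joints * 3)  # per-part joints
        self.root_head = nn.Linear(dim, 3)            # shared root estimate

    def forward(self, x):                             # x: (B, T, dim)
        h = self.encoder(x)[:, -1]                    # last-step feature
        return self.joint_head(h), self.root_head(h)

upper, lower = PartStream(), PartStream()
x = torch.randn(2, 16, 64)                            # motion feature window
_, root_u = upper(x)
_, root_l = lower(x)
consistency_loss = nn.functional.mse_loss(root_u, root_l)
```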

7.
Nat Commun ; 14(1): 2535, 2023 May 03.
Article En | MEDLINE | ID: mdl-37137891

This research proposes a deep-learning paradigm, termed functional learning (FL), to physically train a loose neuron array: a group of non-handcrafted, non-differentiable, and loosely connected physical neurons whose connections and gradients are beyond explicit expression. The paradigm targets training non-differentiable hardware, and therefore solves many interdisciplinary challenges at once: the precise modeling and control of high-dimensional systems, the on-site calibration of multimodal hardware imperfections, and the end-to-end training of non-differentiable and modeless physical neurons through implicit gradient propagation. It offers a methodology to build hardware without handcrafted design, strict fabrication, and precise assembly, thus forging paths for hardware design, chip manufacturing, physical neuron training, and system control. In addition, the functional learning paradigm is numerically and physically verified with an original light field neural network (LFNN). It realizes a programmable incoherent optical neural network, a well-known challenge, which delivers light-speed, high-bandwidth, and power-efficient neural network inference by processing parallel visible light signals in free space. As a promising supplement to existing power- and bandwidth-constrained digital neural networks, the light field neural network has various potential applications: brain-inspired optical computation, high-bandwidth power-efficient neural network inference, and light-speed programmable lenses/displays/detectors that operate in visible light.

8.
Article En | MEDLINE | ID: mdl-37027614

The combination of augmented reality (AR) and medicine is an important trend in current research. The powerful display and interaction capabilities of AR systems can assist doctors in performing more complex operations. Since the tooth itself is an exposed rigid structure, dental AR is an active research direction with strong application potential. However, none of the existing dental AR solutions are designed for wearable AR devices such as AR glasses. At the same time, these methods rely on high-precision scanning equipment or auxiliary positioning markers, which greatly increases the operational complexity and cost of clinical AR. In this work, we propose a simple and accurate neural-implicit model-driven dental AR system, named ImTooth, adapted for AR glasses. Based on the modeling capabilities and differentiable optimization properties of state-of-the-art neural implicit representations, our system fuses reconstruction and registration in a single network, greatly simplifying existing dental AR solutions and enabling reconstruction, registration, and interaction. Specifically, our method learns a scale-preserving voxel-based neural implicit model from multi-view images captured from a textureless plaster model of the tooth. Apart from color and surface, we also learn a consistent edge feature inside our representation. By leveraging the depth and edge information, our system can register the model to real images without additional training. In practice, our system uses a single Microsoft HoloLens 2 as the only sensor and display device. Experiments show that our method can reconstruct high-precision models and accomplish accurate registration. It is also robust to weak, repetitive, and inconsistent textures. We also show that our system can be easily integrated into dental diagnostic and therapeutic procedures, such as bracket placement guidance.

9.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 9895-9907, 2023 Aug.
Article En | MEDLINE | ID: mdl-37027766

This paper addresses the challenge of novel view synthesis for a human performer from a very sparse set of camera views. Some recent works have shown that learning implicit neural representations of 3D scenes achieves remarkable view synthesis quality given dense input views. However, the representation learning will be ill-posed if the views are highly sparse. To solve this ill-posed problem, our key idea is to integrate observations over video frames. To this end, we propose Neural Body, a new human body representation which assumes that the learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh, so that the observations across frames can be naturally integrated. The deformable mesh also provides geometric guidance for the network to learn 3D representations more efficiently. Furthermore, we combine Neural Body with implicit surface models to improve the learned geometry. To evaluate our approach, we perform experiments on both synthetic and real-world data, which show that our approach outperforms prior works by a large margin on novel view synthesis and 3D reconstruction. We also demonstrate the capability of our approach to reconstruct a moving person from a monocular video on the People-Snapshot dataset.
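A minimal sketch of the anchored-latent-code idea follows (assumed shapes, not the released implementation): codes attached to SMPL mesh vertices are shared across frames, posed with the mesh, and decoded to density and color at query points. The paper diffuses codes with sparse convolutions; a nearest-vertex lookup stands in for that here.

```python
# Hedged sketch: latent codes anchored to a deformable mesh, shared
# across frames, decoded to density and color at ray samples.
import torch
import torch.nn as nn

num_verts, code_dim = 6890, 16
latent_codes = nn.Parameter(torch.randn(num_verts, code_dim))
decoder = nn.Sequential(nn.Linear(code_dim + 3, 64), nn.ReLU(),
                        nn.Linear(64, 4))            # (sigma, rgb)

def query(points, posed_verts):
    # points: (N, 3) ray samples; posed_verts: (V, 3) mesh for this frame.
    d = torch.cdist(points, posed_verts)             # (N, V)
    idx = d.argmin(dim=1)                            # nearest anchored code
    feats = torch.cat([latent_codes[idx], points], dim=-1)
    out = decoder(feats)
    return out[:, :1], torch.sigmoid(out[:, 1:])     # density, color

sigma, rgb = query(torch.randn(4096, 3), torch.randn(num_verts, 3))
```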


Algorithms; Human Body; Humans; Learning
10.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 7726-7738, 2023 Jun.
Article En | MEDLINE | ID: mdl-36409815

We present a novel method for local image feature matching. Instead of performing image feature detection, description, and matching sequentially, we propose to first establish pixel-wise dense matches at a coarse level and later refine the good matches at a fine level. In contrast to dense methods that use a cost volume to search correspondences, we use self- and cross-attention layers in a Transformer to obtain feature descriptors that are conditioned on both images. The global receptive field provided by the Transformer enables our method, named LoFTR, to produce dense matches in low-texture areas, where feature detectors usually struggle to produce repeatable interest points. Experiments on indoor and outdoor datasets show that LoFTR outperforms state-of-the-art methods by a large margin. We further adapt LoFTR to modern SfM systems and illustrate its application in multiple-view geometry. The proposed method demonstrates superior performance in the Image Matching Challenge 2021 and ranks first on two public benchmarks of visual localization among the published methods. The code is available at https://zju3dv.github.io/loftr.
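The coarse stage can be sketched as follows (a simplification under assumed layer sizes, not the released LoFTR code): self- and cross-attention condition the coarse features on both images, and a dual-softmax score matrix with mutual-nearest-neighbor selection yields the coarse matches that the fine level would then refine.

```python
# Hedged sketch: attention-conditioned coarse matching with dual softmax.
import torch
import torch.nn as nn

dim = 256
self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

def coarse_match(fa, fb, temp=0.1):
    # fa, fb: (1, N, dim) flattened coarse feature maps of images A and B.
    fa, _ = self_attn(fa, fa, fa); fb, _ = self_attn(fb, fb, fb)
    fa, _ = cross_attn(fa, fb, fb); fb, _ = cross_attn(fb, fa, fa)
    sim = torch.einsum('bnd,bmd->bnm', fa, fb) / temp
    scores = sim.softmax(dim=2) * sim.softmax(dim=1)   # dual softmax
    nn_ab = scores.argmax(dim=2); nn_ba = scores.argmax(dim=1)
    mutual = nn_ba.gather(1, nn_ab) == torch.arange(fa.shape[1]).unsqueeze(0)
    return nn_ab, mutual                # coarse match indices + validity mask

fa, fb = torch.randn(1, 400, dim), torch.randn(1, 400, dim)
matches, valid = coarse_match(fa, fb)
```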

11.
IEEE Trans Vis Comput Graph ; 29(10): 4284-4295, 2023 Oct.
Article En | MEDLINE | ID: mdl-35793302

The level of detail (LOD) technique has been widely exploited as a key rendering optimization in many graphics applications. Numerous approaches have been proposed to automatically generate different kinds of LODs, such as geometric LODs or shader LODs. However, none of them consider simplifying the geometry and the shader at the same time. In this paper, we exploit the observation that simplifications of geometric and shading details can be combined to provide a greater variety of tradeoffs between performance and quality. We present a new discrete multiresolution representation of objects, which consists of mesh and shader LODs. Each level of the representation can contain simplified representations of both the shader and the mesh. To create such LODs, we propose two automatic algorithms that pursue the best simplifications of meshes and shaders at adaptively selected distances. The results show that our mesh and shader LOD achieves better performance-quality tradeoffs than prior LOD representations, such as those that only consider simplified meshes or shaders.

12.
IEEE Trans Vis Comput Graph ; 29(4): 1964-1976, 2023 Apr.
Article En | MEDLINE | ID: mdl-34919519

Controlling the size and shear of elements is crucial in pure-hex or hex-dominant meshing. To this end, non-orthonormal frame fields that are almost everywhere integrable (except at the singularities) can play a key role. However, it is often challenging or impossible to generate such a frame field under the tight control of a general Riemannian metric field. Therefore, we propose to solve a relatively weaker problem, i.e., generating such a frame field for a Riemannian metric field that is flat away from singularities. Such a metric field admits a local isometry to 3D Euclidean space. Applying Cartan's first structural equation to the associated rotation field, i.e., the rotation part of the frame field, we show that the rotation field must have zero covariant derivatives under the 3D connection induced by the metric field. This observation leads to a metric-aware smoothness measure, equivalent to local integrability. The use of such a measure can be justified on meshes associated with locally flat metric fields. We also propose a method to generate smooth metric fields under a few intuitive constraints. On cuboid shapes, our method generates singularities aware of the metric fields, which makes the parameterization match the input metric fields better than conventional methods. For generic shapes, while our method generates visually similar results to those using boundary frame fields to guide the metric field generation, the integrability and consistency of the metric fields are still improved, as reflected by the statistics.
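In standard moving-frame notation (the symbols below are assumptions chosen to match common conventions, not copied from the paper), the integrability condition described above can be summarized as:

```latex
% With coframe \theta^i = R^i_{\ j}\,\mathrm{d}x^j extracted from the
% rotation part R of the frame field, Cartan's first structural equation
% with vanishing torsion reads
\[
  \mathrm{d}\theta^i + \omega^i_{\ j} \wedge \theta^j = 0 ,
\]
% where \omega is the connection induced by the flat metric field g.
% The metric-aware smoothness measure is then equivalent to requiring a
% parallel rotation field,
\[
  \nabla^{g} R = 0 ,
\]
% i.e., zero covariant derivative of R under the metric-induced connection.
```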

13.
IEEE Trans Vis Comput Graph ; 29(6): 2950-2964, 2023 Jun.
Article En | MEDLINE | ID: mdl-35077364

Data workers use various scripting languages for data transformation, such as SAS, R, and Python. However, understanding intricate code pieces requires advanced programming skills, which hinders data workers from grasping the idea of data transformation at ease. Program visualization is beneficial for debugging and education and has the potential to illustrate transformations intuitively and interactively. In this article, we explore visualization design for demonstrating the semantics of code pieces in the context of data transformation. First, to depict individual data transformations, we structure a design space by two primary dimensions, i.e., key parameters to encode and possible visual channels to be mapped. Then, we derive a collection of 23 glyphs that visualize the semantics of transformations. Next, we design a pipeline, named Somnus, that provides an overview of the creation and evolution of data tables using a provenance graph. At the same time, it allows detailed investigation of individual transformations. User feedback on Somnus is positive: our study participants achieved better accuracy in less time using Somnus, and preferred it over carefully crafted textual descriptions. Further, we provide two example applications to demonstrate the utility and versatility of Somnus.

14.
Article En | MEDLINE | ID: mdl-36459607

Virtual content creation and interaction play an important role in modern 3D applications. Recovering detailed 3D models from real scenes can significantly expand the scope of such applications and has been studied for decades in the computer vision and computer graphics communities. In this work, we propose Vox-Surf, a voxel-based implicit surface representation. Vox-Surf divides the space into finite sparse voxels, where each voxel is a basic geometry unit that stores geometry and appearance information on its corner vertices. Due to the sparsity inherited from the voxel representation, Vox-Surf is suitable for almost any scene and can be easily trained end-to-end from multi-view images. We utilize a progressive training process to gradually cull out empty voxels and keep only valid voxels for further optimization, which greatly reduces the number of sample points and improves inference speed. Experiments show that the Vox-Surf representation can learn fine surface details and accurate colors with less memory and faster rendering than previous methods. The resulting fine voxels can also be used as bounding volumes for collision detection, which is useful in 3D interactions. We also show potential applications of Vox-Surf in scene editing and augmented reality. The source code is publicly available at https://github.com/zju3dv/Vox-Surf.
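The corner-vertex storage can be sketched as follows (assumed feature size and decoder, not the released code): a query point inside a voxel trilinearly interpolates the eight corner features before decoding to an SDF value and color.

```python
# Hedged sketch: one voxel with learnable corner features, queried by
# trilinear interpolation and decoded to (SDF, color).
import torch
import torch.nn as nn

feat_dim = 8
corner_feats = nn.Parameter(torch.randn(2, 2, 2, feat_dim))  # 8 corners
decoder = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 4))

def query_voxel(p_local):
    # p_local: (N, 3) coordinates inside the voxel, each in [0, 1].
    x, y, z = p_local.unbind(-1)
    feats = 0
    for ix in (0, 1):
        for iy in (0, 1):
            for iz in (0, 1):
                w = ((x if ix else 1 - x) * (y if iy else 1 - y)
                     * (z if iz else 1 - z))        # trilinear weight
                feats = feats + w.unsqueeze(-1) * corner_feats[ix, iy, iz]
    out = decoder(feats)
    return out[:, :1], torch.sigmoid(out[:, 1:])    # SDF, color

sdf, color = query_voxel(torch.rand(16, 3))
```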

15.
IEEE Trans Vis Comput Graph ; 28(11): 3727-3736, 2022 Nov.
Article En | MEDLINE | ID: mdl-36048987

Bundle adjustment (BA) is widely used in SLAM and SfM, which are key technologies in Augmented Reality. For real-time SLAM and large-scale SfM, the efficiency of BA is of great importance. This paper proposes CoLi-BA, a novel and efficient BA solver that significantly improves the optimization speed by compact linearization and reordering. Specifically, for each reprojection function, the redundant matrix representation of Jacobian is replaced with a tiny 3D vector, by which the computational complexity, memory storage, and cache missing for Hessian matrix construction and Schur complement are significantly reduced. Besides, we also propose a novel reordering strategy to improve the cache efficiency for Schur complement. Experiments on diverse datasets show that the speed of the proposed CoLi-BA is five times that of Ceres and two times that of g2o without sacrificing accuracy. We further verify the effectiveness by porting CoLi-BA to the open-source SLAM and SfM systems. Even when running the proposed solver in a single thread, the local BA of SLAM only takes about 20ms on a desktop PC, and the reconstruction of SfM with seven thousand photos only takes half an hour. The source code is available on the webpage: https://github.com/zju3dv/CoLi-BA.
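For background, the Schur complement step that CoLi-BA accelerates looks like this in a toy dense form (illustrative only; the paper's contribution is a compact Jacobian representation and cache-friendly reordering, which this sketch does not reproduce): landmark blocks are eliminated from the Gauss-Newton normal equations so that only camera poses are solved densely.

```python
# Toy Schur-complement elimination in a damped Gauss-Newton step.
import numpy as np

nc, nl = 4 * 6, 10 * 3                # 4 poses (6 DoF), 10 landmarks (3 DoF)
J = np.random.randn(120, nc + nl)     # stacked reprojection Jacobians
r = np.random.randn(120)              # reprojection residuals
H = J.T @ J + 1e-6 * np.eye(nc + nl)  # damped Hessian approximation
b = -J.T @ r

Hcc, Hcl = H[:nc, :nc], H[:nc, nc:]
Hll = H[nc:, nc:]                     # block-diagonal in real BA
S = Hcc - Hcl @ np.linalg.solve(Hll, Hcl.T)         # Schur complement
bc = b[:nc] - Hcl @ np.linalg.solve(Hll, b[nc:])
dx_cam = np.linalg.solve(S, bc)       # camera update
dx_lm = np.linalg.solve(Hll, b[nc:] - Hcl.T @ dx_cam)  # back-substitution
```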

16.
IEEE Trans Vis Comput Graph ; 28(5): 2212-2222, 2022 May.
Article En | MEDLINE | ID: mdl-35167466

In this paper, we present a novel monocular visual-inertial odometry system with pre-built maps deployed on a remote server, which can robustly run in real-time on a mobile device even in high-latency situations. By tightly coupling VIO with geometric priors from pre-built maps, our system can tolerate the high latency and low frequency of the global localization service, which is especially suitable for practical applications where the localization service is deployed on a remote server. Firstly, sparse point clouds are obtained from the dense mesh by ray casting according to the localization results; the dense mesh is reconstructed from point clouds generated by Structure-from-Motion. We directly use the sparse point clouds in feature tracking and state update to suppress drift. In the process of feature tracking, the high local accuracy of VIO is fully utilized to effectively remove outliers and make our system robust. Experiments on the EuRoC MAV dataset and simulation datasets show that, compared with state-of-the-art methods, our method achieves better results in terms of both precision and robustness. The effectiveness of the proposed method is further demonstrated through a real-time AR demo on a mobile phone with the aid of visual localization on the remote server.

17.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 6981-6992, 2022 10.
Article En | MEDLINE | ID: mdl-34283712

This paper addresses the problem of reconstructing 3D poses of multiple people from a few calibrated camera views. The main challenge of this problem is to find the cross-view correspondences among noisy and incomplete 2D pose predictions. Most previous methods address this challenge by directly reasoning in 3D using a pictorial structure model, which is inefficient due to the huge state space. We propose a fast and robust approach to solve this problem. Our key idea is to use a multi-way matching algorithm to cluster the detected 2D poses in all views. Each resulting cluster encodes 2D poses of the same person across different views and consistent correspondences across the keypoints, from which the 3D pose of each person can be effectively inferred. The proposed convex optimization based multi-way matching algorithm is efficient and robust against missing and false detections, without knowing the number of people in the scene. Moreover, we propose to combine geometric and appearance cues for cross-view matching. Finally, an efficient tracking method is proposed to track the detected 3D poses across the multi-view video. The proposed approach achieves the state-of-the-art performance on the Campus and Shelf datasets, while being efficient for real-time applications.
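A drastically simplified two-view sketch of combining geometric and appearance cues follows (the paper instead solves a convex multi-way matching over all views jointly): epipolar-distance affinities are fused with appearance similarity and matched with SciPy's Hungarian solver. The fundamental matrix and descriptors below are placeholders.

```python
# Hedged sketch: fuse epipolar and appearance affinities, then match.
import numpy as np
from scipy.optimize import linear_sum_assignment

def homog(kps):
    return np.concatenate([kps, np.ones((*kps.shape[:-1], 1))], axis=-1)

def epipolar_affinity(kps_a, kps_b, F, sigma=5.0):
    # Mean point-to-epipolar-line distance between corresponding joints.
    lines = homog(kps_a) @ F.T                    # (Na, J, 3) epipolar lines
    d = np.abs(np.einsum('ajk,bjk->abj', lines, homog(kps_b)))
    d /= np.linalg.norm(lines[..., :2], axis=-1)[:, None, :] + 1e-9
    return np.exp(-d.mean(-1) / sigma)            # (Na, Nb)

Na, Nb, J = 3, 3, 17
kps_a, kps_b = np.random.rand(Na, J, 2), np.random.rand(Nb, J, 2)
app_a, app_b = np.random.rand(Na, 128), np.random.rand(Nb, 128)
F = np.random.randn(3, 3)                         # placeholder fundamental mat

affinity = (0.5 * epipolar_affinity(kps_a, kps_b, F)
            + 0.5 * (app_a @ app_b.T) / 128.0)    # appearance cue
rows, cols = linear_sum_assignment(-affinity)     # maximize total affinity
```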


Algorithms; Imaging, Three-Dimensional; Humans; Imaging, Three-Dimensional/methods
18.
IEEE Trans Vis Comput Graph ; 28(12): 4490-4502, 2022 12.
Article En | MEDLINE | ID: mdl-34161241

Most commercially available optical see-through head-mounted displays (OST-HMDs) utilize optical combiners to simultaneously visualize the physical background and virtual objects. The displayed images perceived by users are a blend of rendered pixels and background colors. Enabling high fidelity color perception in mixed reality (MR) scenarios using OST-HMDs is an important but challenging task. We propose a real-time rendering scheme to enhance the color contrast between virtual objects and the surrounding background for OST-HMDs. Inspired by the discovery of color perception in psychophysics, we first formulate the color contrast enhancement as a constrained optimization problem. We then design an end-to-end algorithm to search the optimal complementary shift in both chromaticity and luminance of the displayed color. This aims at enhancing the contrast between virtual objects and the real background as well as keeping the consistency with the original displayed color. We assess the performance of our approach using a simulated OST-HMD environment and an off-the-shelf OST-HMD. Experimental results from objective evaluations and subjective user studies demonstrate that the proposed approach makes rendered virtual objects more distinguishable from the surrounding background, thereby bringing a better visual experience.
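As a toy illustration of the constrained-optimization view (not the paper's psychophysics-based formulation or its color space), the sketch below gradient-descends on a color shift that trades off contrast against the optically blended background with fidelity to the original color; the transmittance and loss weights are assumptions.

```python
# Hedged sketch: optimize a displayed-color shift for contrast vs. fidelity
# under a simple additive optical-see-through blending model.
import torch

background = torch.tensor([0.6, 0.55, 0.5])   # real-world backdrop color
original = torch.tensor([0.7, 0.5, 0.4])      # intended virtual color
t = 0.7                                        # assumed combiner transmittance
shift = torch.zeros(3, requires_grad=True)
opt = torch.optim.Adam([shift], lr=0.05)

for _ in range(200):
    displayed = (original + shift).clamp(0, 1)
    perceived = displayed + t * background     # additive optical blending
    contrast = torch.norm(perceived - background)  # vs. backdrop seen alone
    fidelity = torch.norm(displayed - original)    # stay near intended color
    loss = -contrast + 2.0 * fidelity
    opt.zero_grad(); loss.backward(); opt.step()
```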


Computer Graphics; Smart Glasses; User-Computer Interface; Head/diagnostic imaging; Equipment Design
19.
IEEE Trans Pattern Anal Mach Intell ; 44(9): 5529-5540, 2022 Sep.
Article En | MEDLINE | ID: mdl-33914683

In this paper, we propose a novel system named Disp R-CNN for 3D object detection from stereo images. Many recent works solve this problem by first recovering point clouds with disparity estimation and then applying a 3D detector. The disparity map is computed for the entire image, which is costly and fails to leverage category-specific priors. In contrast, we design an instance disparity estimation network (iDispNet) that predicts disparity only for pixels on objects of interest and learns a category-specific shape prior for more accurate disparity estimation. To address the challenge posed by the scarcity of disparity annotations in training, we propose to use a statistical shape model to generate dense disparity pseudo-ground-truth without the need for LiDAR point clouds, which makes our system more widely applicable. Experiments on the KITTI dataset show that, when LiDAR ground-truth is not used at training time, Disp R-CNN outperforms previous state-of-the-art methods based on stereo input by 20 percent in terms of average precision for all categories. The code and pseudo-ground-truth data are available at the project page: https://github.com/zju3dv/disprcnn.
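The back-projection from instance disparity to an object point cloud is standard stereo geometry, sketched below with placeholder KITTI-like intrinsics and a synthetic mask.

```python
# Hedged sketch: convert an instance's disparity map to a 3D point cloud.
import numpy as np

fx, fy, cx, cy = 721.5, 721.5, 609.6, 172.9  # placeholder intrinsics
baseline = 0.54                               # stereo baseline in meters

def disparity_to_points(disp, mask):
    # disp: (H, W) predicted instance disparity; mask: (H, W) instance mask.
    v, u = np.nonzero(mask & (disp > 0))
    z = fx * baseline / disp[v, u]            # depth from disparity
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)        # (N, 3) object point cloud

disp = np.full((375, 1242), 30.0)             # synthetic disparity
mask = np.zeros((375, 1242), dtype=bool)
mask[150:250, 500:700] = True                 # synthetic instance region
points = disparity_to_points(disp, mask)
```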

20.
IEEE Trans Pattern Anal Mach Intell ; 44(6): 3212-3223, 2022 06.
Article En | MEDLINE | ID: mdl-33360984

This paper addresses the problem of instance-level 6DoF object pose estimation from a single RGB image. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise vectors pointing to the keypoints and use these vectors to vote for keypoint locations. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occluded LINEMOD, YCB-Video, and T-LESS datasets, while being efficient for real-time pose estimation. We further create a Truncated LINEMOD dataset to validate the robustness of our approach against truncation. The code is available at https://github.com/zju3dv/pvnet.
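The voting scheme can be sketched as hypothesis generation plus vote counting (a simplification; the released code uses a more efficient RANSAC-based variant and also estimates keypoint covariances for the PnP stage). On the synthetic example at the end, the estimate recovers the true keypoint.

```python
# Hedged sketch: RANSAC-style keypoint voting from pixel-wise directions.
import numpy as np

def vote_keypoint(pixels, dirs, iters=100, thresh=0.99):
    # pixels: (N, 2) mask pixel coords; dirs: (N, 2) unit vectors to keypoint.
    best_pt, best_score = None, -1
    rng = np.random.default_rng(0)
    for _ in range(iters):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        # Intersect the two rays p_i + s*d_i and p_j + t*d_j.
        A = np.stack([dirs[i], -dirs[j]], axis=1)
        if abs(np.linalg.det(A)) < 1e-6:
            continue                              # near-parallel rays
        s, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        cand = pixels[i] + s * dirs[i]            # keypoint hypothesis
        # Count pixels whose direction agrees with the hypothesis.
        to_cand = cand - pixels
        to_cand /= np.linalg.norm(to_cand, axis=1, keepdims=True) + 1e-9
        score = ((to_cand * dirs).sum(axis=1) > thresh).sum()
        if score > best_score:
            best_pt, best_score = cand, score
    return best_pt, best_score

kp = np.array([50.0, 60.0])                       # synthetic ground truth
pixels = np.random.rand(200, 2) * 100
dirs = kp - pixels
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
est, votes = vote_keypoint(pixels, dirs)          # est is close to kp
```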


Algorithms; Politics
...