C²BG-Net: Cross-modality and cross-scale balance network with global semantics for multi-modal 3D object detection.

Ding, Bonan; Xie, Jin; Nie, Jing; Wu, Yulong; Cao, Jiale

Affiliation

Ding B; School of Big Data and Software Engineering, Chongqing University, Chongqing, 400044, China.
Xie J; School of Big Data and Software Engineering, Chongqing University, Chongqing, 400044, China. Electronic address: xiejin@cqu.edu.cn.
Nie J; School of Microelectronics and Communication Engineering, Chongqing University, Chongqing, 400044, China.
Wu Y; School of Big Data and Software Engineering, Chongqing University, Chongqing, 400044, China.
Cao J; School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China; Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China.

Neural Netw; 179: 106535, 2024 Nov.

Article in English | MEDLINE | ID: mdl-39047336

ABSTRACT

Multi-modal 3D object detection identifies and localizes objects in 3D space by combining RGB images from cameras with point cloud data from lidar sensors, and it is a fundamental technology for autonomous driving applications. Current methods commonly aggregate the multi-modal features extracted from point clouds and images with simple element-wise addition or multiplication. While these methods improve detection accuracy, such basic operations make it difficult to balance the significance of the two modalities, which can introduce noise and irrelevant information during feature aggregation. In addition, the multi-level features extracted from images have imbalanced receptive fields. To address these challenges, we propose two networks: a cross-modality balance network (CMN) and a cross-scale balance network (CSN). CMN incorporates cross-modality attention mechanisms and introduces an auxiliary 2D detection head to balance the significance of both modalities. CSN leverages cross-scale attention mechanisms to mitigate the gap in receptive fields between different image levels. We also introduce a novel Local with Global Voxel Attention Encoder (LGVAE), designed to capture global semantics by extracting more comprehensive point-level information into voxel-level features. We perform comprehensive experiments on three challenging public benchmarks: KITTI, Dense, and nuScenes. The results consistently show improvements across multiple 3D object detection frameworks, supporting the effectiveness and versatility of the proposed method. Notably, our approach achieves an absolute gain of 3.1% over the baseline MVXNet on the challenging Hard set of the Dense test set.
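
The contrast the abstract draws between element-wise aggregation and attention-based aggregation can be illustrated with a minimal sketch. This is not the paper's CMN: it omits learned query/key/value projections, the auxiliary 2D detection head, and everything else the abstract does not specify. The function names (`additive_fusion`, `cross_attention_fusion`) and the toy feature values are hypothetical, chosen only to show how attention weights let each point feature decide how much of each image feature to absorb, whereas plain addition treats every aligned pair equally.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def additive_fusion(point_feats, image_feats):
    """Baseline: element-wise addition of index-aligned point and image features."""
    return [[p + i for p, i in zip(pf, imf)]
            for pf, imf in zip(point_feats, image_feats)]

def cross_attention_fusion(point_feats, image_feats):
    """Illustrative (assumed) fusion: each point feature acts as a query that
    attends over all image features (used as both keys and values), and the
    attended image feature is added back to the point feature."""
    dim = len(point_feats[0])
    scale = math.sqrt(dim)  # standard scaled dot-product attention factor
    fused, all_weights = [], []
    for q in point_feats:
        weights = softmax([dot(q, k) / scale for k in image_feats])
        attended = [sum(w * v[d] for w, v in zip(weights, image_feats))
                    for d in range(dim)]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
        all_weights.append(weights)
    return fused, all_weights

# Toy data (hypothetical): two point features and two image features of width 2.
point_feats = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[4.0, 0.0], [0.0, 4.0]]

added = additive_fusion(point_feats, image_feats)
fused, weights = cross_attention_fusion(point_feats, image_feats)
```

In this toy setup each point feature places most of its attention weight on the image feature it is most similar to, so a dissimilar (potentially noisy) image feature contributes little to the fused result. A real implementation would learn the projections and operate on batched tensors.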

Subjects

Neural Networks, Computer; Semantics; Imaging, Three-Dimensional/methods; Humans; Algorithms; Image Processing, Computer-Assisted/methods

Keywords

3D object detection; Multi-modal learning; Neural networks

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Semantics / Neural Networks, Computer | Limits: Humans | Language: En | Journal: Neural Netw | Journal subject: NEUROLOGY | Year of publication: 2024 | Document type: Article | Country of affiliation: China