Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer.

Liu, Tengfei; Hu, Yongli; Gao, Junbin; Wang, Jiapu; Sun, Yanfeng; Yin, Baocai

Affiliations

Liu T; Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China.
Hu Y; Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China. Electronic address: huyongli@bjut.edu.cn.
Gao J; Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, NSW 2006, Australia.
Wang J; Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China.
Sun Y; Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China.
Yin B; Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China.

Neural Netw; 176: 106322, 2024 Aug.

Article in English | MEDLINE | ID: mdl-38653128

ABSTRACT

In the realm of long document classification (LDC), previous research has predominantly focused on modeling unimodal texts, overlooking the potential of multi-modal documents incorporating images. To address this gap, we introduce an innovative approach for multi-modal long document classification based on the Hierarchical Prompt and Multi-modal Transformer (HPMT). The proposed HPMT method facilitates multi-modal interactions at both the section and sentence levels, enabling a comprehensive capture of hierarchical structural features and complex multi-modal associations of long documents. Specifically, a Multi-scale Multi-modal Transformer (MsMMT) is tailored to capture the multi-granularity correlations between sentences and images. This is achieved through the incorporation of multi-scale convolutional kernels on sentence features, enhancing the model's ability to discern intricate patterns. Furthermore, to facilitate cross-level information interaction and promote learning of specific features at different levels, we introduce a Hierarchical Prompt (HierPrompt) block. This block incorporates section-level prompts and sentence-level prompts, both derived from a global prompt via distinct projection networks. Extensive experiments are conducted on four challenging multi-modal long document datasets. The results conclusively demonstrate the superiority of our proposed method, showcasing its performance advantages over existing techniques.
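
The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give layer definitions, so everything concrete here is an assumption, including the function names (`conv1d_same`, `derive_prompts`, `cross_attention`), the kernel sizes (1, 2, 3), the single linear layer with tanh standing in for each "projection network", and the choice to prepend the sentence-level prompt to the multi-scale sentence features before attending over image features.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv1d_same(x, weight):
    """1D convolution over the sentence axis with zero padding.

    x: (num_sentences, dim); weight: (kernel_size, dim, dim_out).
    The output keeps the same number of rows as x.
    """
    k = weight.shape[0]
    left = (k - 1) // 2
    right = k - 1 - left
    x_pad = np.pad(x, ((left, right), (0, 0)))
    n = x.shape[0]
    return sum(x_pad[j:j + n] @ weight[j] for j in range(k))


def derive_prompts(global_prompt, w_section, w_sentence):
    """Derive section-level and sentence-level prompts from one global prompt.

    Assumption: each projection network is one linear map followed by tanh.
    """
    return np.tanh(global_prompt @ w_section), np.tanh(global_prompt @ w_sentence)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention of text tokens over image features."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values


dim = 8
sentences = rng.normal(size=(5, dim))      # 5 sentence features (toy data)
images = rng.normal(size=(3, dim))         # 3 image features (toy data)
global_prompt = rng.normal(size=(2, dim))  # 2 global prompt tokens (toy data)

# Distinct projection weights for the two levels (hypothetical parameters).
w_section = rng.normal(scale=0.1, size=(dim, dim))
w_sentence = rng.normal(scale=0.1, size=(dim, dim))
section_prompt, sentence_prompt = derive_prompts(global_prompt, w_section, w_sentence)

# One convolution kernel per scale; the sizes are an assumption.
kernels = {k: rng.normal(scale=0.1, size=(k, dim, dim)) for k in (1, 2, 3)}
scales = [conv1d_same(sentences, w) for w in kernels.values()]

# Sentence-level prompt tokens plus all scales attend over the images.
text_tokens = np.concatenate([sentence_prompt] + scales, axis=0)
fused = cross_attention(text_tokens, images)
print(fused.shape)
```

In this sketch each kernel size mixes a different number of neighbouring sentences, which is one plausible reading of "multi-granularity correlations between sentences and images"; the section-level prompt would be used the same way in a section-level block.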

Subjects

Neural Networks, Computer; Humans; Natural Language Processing; Algorithms

Keywords

Multi-modal long document classification; Multi-modal transformer; Multi-scale multi-modal transformer; Prompt learning

Full text: 1 Collections: 01-internacional Database: MEDLINE Main subject: Neural Networks, Computer Limits: Humans Language: English Publication year: 2024 Document type: Article
