Automated classification of clinical trial eligibility criteria text based on ensemble learning and metric learning.

Zeng, Kun; Xu, Yibin; Lin, Ge; Liang, Likeng; Hao, Tianyong

Affiliation

Zeng K; School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China.
Xu Y; School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China.
Lin G; National Engineering Research Center of Digital Life, Sun Yat-Sen University, Guangzhou, China.
Liang L; School of Computer Science, South China Normal University, Guangzhou, China.
Hao T; School of Computer Science, South China Normal University, Guangzhou, China. haoty@m.scnu.edu.cn.

BMC Med Inform Decis Mak; 21(Suppl 2): 129, 2021-07-30.

Article in English | MEDLINE | ID: mdl-34330259

ABSTRACT

BACKGROUND: Eligibility criteria are the primary strategy for screening the target participants of a clinical trial. Automated classification of clinical trial eligibility criteria text with machine learning methods can improve recruitment efficiency and thereby reduce the cost of clinical research. However, existing methods suffer from poor classification performance because eligibility criteria text data are complex and imbalanced.

METHODS: An ensemble learning-based model with metric learning is proposed for eligibility criteria classification. The model integrates a set of pre-trained models: Bidirectional Encoder Representations from Transformers (BERT), A Robustly Optimized BERT Pretraining Approach (RoBERTa), XLNet, Pre-training Text Encoders as Discriminators Rather Than Generators (ELECTRA), and Enhanced Representation through Knowledge Integration (ERNIE). Focal Loss is used as the loss function to address the data imbalance problem. Metric learning is employed to train the embeddings of each base model so that features of different categories are easier to distinguish. Soft Voting is applied to obtain the final classification of the ensemble model. The dataset is from standard evaluation task 3 of the 5th China Health Information Processing Conference and contains 38,341 eligibility criteria texts in 44 categories.

RESULTS: Our ensemble method had an accuracy of 0.8497, a precision of 0.8229, and a recall of 0.8216 on the dataset. The macro F1-score was 0.8169, outperforming state-of-the-art baseline methods by 0.84% on average. In addition, the performance improvement had a p-value of 2.152e-07 with a standard t-test, indicating that the improvement achieved by our model was significant.

CONCLUSIONS: A model for classifying eligibility criteria text of clinical trials based on multi-model ensemble learning and metric learning was proposed. The experiments demonstrated that our ensemble model significantly improved classification performance. In addition, metric learning improved the word embedding representation, and the focal loss reduced the impact of data imbalance on model performance.
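As a rough illustration of two components named in the METHODS section, the sketch below shows the standard focal loss formula, -alpha * (1 - p_t)^gamma * log(p_t), and soft voting as a plain average of per-model class probabilities. It is not the authors' implementation: the function names, the gamma and alpha defaults, and the toy logits are all assumptions made for illustration, and the real base models (BERT, RoBERTa, XLNet, ELECTRA, ERNIE) are replaced here by hand-written scores.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are illustrative defaults, not values from the paper.
    With gamma = 0 and alpha = 1 this reduces to ordinary cross-entropy.
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def soft_vote(per_model_probs):
    """Average the class probabilities of all base models, then take argmax."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    avg = [sum(p[c] for p in per_model_probs) / n_models
           for c in range(n_classes)]
    label = max(range(n_classes), key=avg.__getitem__)
    return label, avg

# Hypothetical logits from three base models on a 3-class toy problem.
model_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [1.8, 1.0, 0.2]]
model_probs = [softmax(row) for row in model_logits]
label, avg_probs = soft_vote(model_probs)
print("ensemble prediction:", label)

# A confidently correct example contributes far less loss than a hard one,
# which is how focal loss shifts training effort toward difficult cases.
print(focal_loss([0.9, 0.1], 0), focal_loss([0.1, 0.9], 0))
```

In this toy case the second model prefers class 1, but the averaged probabilities still favor class 0, which shows how soft voting lets confident models outweigh a dissenting one.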

Subjects

Machine Learning; China; Humans

Keywords

Clinical trial; Eligibility criteria classification; Ensemble learning; Focal loss; Metric learning

Full text: 1 | Database: MEDLINE | Main subject: Machine Learning | Study type: Prognostic_studies | Limit: Humans | Country/Region as subject: Asia | Language: English | Journal: BMC Med Inform Decis Mak | Journal subject: Medical Informatics | Publication year: 2021 | Document type: Article | Affiliation country: China
