Search | BVS Violência e Saúde

Classification of tumor types using XGBoost machine learning model: a vector space transformation of genomic alterations.

Zelli, Veronica; Manno, Andrea; Compagnoni, Chiara; Ibraheem, Rasheed Oyewole; Zazzeroni, Francesca; Alesse, Edoardo; Rossi, Fabrizio; Arbib, Claudio; Tessitore, Alessandra.

J Transl Med; 21(1): 836, 2023 Nov 21.

Article in English | MEDLINE | ID: mdl-37990214

ABSTRACT

BACKGROUND: Machine learning (ML) is a powerful tool for capturing relationships between molecular alterations and cancer types and for extracting biological information. Here, we developed a plain ML model that distinguishes cancer types based on genetic lesions, providing an additional tool to improve cancer diagnosis, particularly for tumors of unknown origin.
METHODS: TCGA data from 9,927 samples spanning 32 different cancer types were downloaded from cBioPortal. A vector-space-model data transformation technique was designed to build consistently homogeneous new datasets containing, as predictive features, calls for somatic point mutations and arm-level copy number variations, thus allowing the use of XGBoost classifier models. Because the dataset is imbalanced, owing to large differences in the number of cases per tumor type, two preprocessing strategies were considered: i) setting a percentage cut-off threshold to remove less represented cancer types; ii) dividing cancer types into groups based on biological criteria and training a specific XGBoost model for each group. The performance of all trained models was assessed mainly by out-of-sample balanced accuracy (BACC) and AUC scores.
RESULTS: The XGBoost classifier achieved the best performance (BACC 77%; AUC 97%) on a dataset containing the 10 most represented tumor types. Moreover, when the 18 most represented cancers were divided into three groups (endocrine-related carcinomas, other carcinomas and other cancers), the group-specific models achieved 78%, 71% and 86% BACC, respectively, with AUC scores greater than 96%. In addition, the model capable of linking each group to a specific cancer type reached 81% BACC and 94% AUC. Overall, the diagnostic potential of our model was comparable to or higher than that of others already described in the literature and based on similar molecular data and ML approaches.
CONCLUSIONS: A boosted ML approach able to accurately discriminate different cancer types was developed. The methodology builds datasets simpler and more interpretable than the original data, while keeping enough information to accurately train standard ML models without resorting to sophisticated Deep Learning architectures. In combination with histopathological examinations, this approach could improve cancer diagnosis by using specific DNA alterations, processed by a replicable and easy-to-use automated technology. The study encourages new investigations which could further increase the classifier's performance, for example by considering more features and dividing tumors into their main molecular subtypes.
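
Two ideas in the abstract above can be illustrated with a minimal Python sketch: turning per-sample alteration calls into fixed-length feature vectors, and scoring predictions with balanced accuracy (the mean of per-class recall). This is not the authors' pipeline. The binary presence/absence encoding, the function names and the toy sample labels are all assumptions made for illustration; the abstract does not describe the actual encoding.

```python
from collections import defaultdict


def build_vocabulary(samples):
    """Collect every alteration label seen across all samples, sorted."""
    return sorted({alt for alts in samples.values() for alt in alts})


def to_vector(alterations, vocabulary):
    """Encode one sample as a binary vector over the vocabulary.

    Assumption: 1 means the alteration was called in the sample, 0 otherwise.
    """
    present = set(alterations)
    return [1 if feature in present else 0 for feature in vocabulary]


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which weights every class equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, predicted in zip(y_true, y_pred):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical toy data: sample id -> list of alteration calls.
samples = {
    "S1": ["TP53_mut", "8q_gain"],
    "S2": ["KRAS_mut"],
    "S3": ["TP53_mut", "KRAS_mut", "17p_loss"],
}
vocabulary = build_vocabulary(samples)
matrix = {sid: to_vector(alts, vocabulary) for sid, alts in samples.items()}

# On imbalanced labels, a classifier that always predicts the majority
# class gets high plain accuracy but only chance-level balanced accuracy.
y_true = ["A", "A", "A", "A", "B"]
y_pred = ["A", "A", "A", "A", "A"]
bacc = balanced_accuracy(y_true, y_pred)
```

In the toy labels, plain accuracy would be 4/5 while balanced accuracy is 0.5, which is why a class-balanced metric suits a dataset with very different numbers of cases per tumor type.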

Subjects

Carcinoma, DNA Copy Number Variations, Humans, DNA Copy Number Variations/genetics, Machine Learning, Genomics

An Integer Linear Programming Model to Optimize Coding DNA Sequences By Joint Control of Transcript Indicators.

Arbib, Claudio; D'ascenzo, Andrea; Rossi, Fabrizio; Santoni, Daniele.

J Comput Biol; 31(5): 416-428, 2024 May.

Article in English | MEDLINE | ID: mdl-38687334

ABSTRACT

A Coding DNA Sequence (CDS) is a fraction of DNA whose nucleotides are grouped into consecutive triplets called codons, each one encoding an amino acid. Because most amino acids can be encoded by more than one codon, the same amino acid chain can be obtained by a very large number of different CDSs. These synonymous CDSs show different features that, also depending on the organism the transcript is expressed in, could affect translational efficiency and yield. The identification of optimal CDSs with respect to given transcript indicators is in general a challenging task, but it has been observed in recent literature that integer linear programming (ILP) can be a very flexible and efficient way to achieve it. In this article, we add evidence to this observation by proposing a new ILP model that simultaneously optimizes different well-grounded indicators. With this model, we efficiently find solutions that dominate those returned by six existing codon optimization heuristics.
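
To give a sense of the "very large number" of synonymous CDSs mentioned above, the following minimal Python sketch counts them for a given amino acid chain by multiplying the number of codons available for each residue. It uses the degeneracy of the standard genetic code; the function name and the example peptide are hypothetical, and the sketch does not represent the authors' ILP model.

```python
from math import prod

# Number of codons encoding each amino acid (one-letter codes)
# in the standard genetic code; the 20 values sum to 61 sense codons.
CODONS_PER_AMINO_ACID = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_synonymous_cds(protein: str) -> int:
    """Count the distinct CDSs that encode the given amino acid chain."""
    return prod(CODONS_PER_AMINO_ACID[aa] for aa in protein)


# Hypothetical four-residue peptide: M (1) * K (2) * W (1) * L (6) = 12.
print(count_synonymous_cds("MKWL"))
```

Because the count is a product over residues, it grows exponentially with chain length, so exhaustive enumeration is impractical for realistic proteins and an optimization model is needed to search the space.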

Subjects

Algorithms, Codon, Genetic Models, Linear Programming, Codon/genetics, Base Sequence/genetics, DNA/genetics, Computational Biology/methods
