Model performance and interpretability of semi-supervised generative adversarial networks to predict oncogenic variants with unlabeled data.

Ren, Zilin; Li, Quan; Cao, Kajia; Li, Marilyn M; Zhou, Yunyun; Wang, Kai

Affiliations

Ren Z; Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
Li Q; Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
Cao K; Princess Margaret Cancer Centre, University Health Network, University of Toronto, Toronto, ON, M5G2C1, Canada.
Li MM; Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
Zhou Y; Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
Wang K; Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA.

BMC Bioinformatics; 24(1): 43, 2023 Feb 09.

Article in English | MEDLINE | ID: mdl-36759776

ABSTRACT

BACKGROUND:

Predicting the functional consequences or clinical impact of genetic variants in human diseases such as cancer remains an important challenge. A growing number of cancer variants have been discovered and documented in public databases such as COSMIC, but the vast majority have no functional or clinical annotation. Some databases, such as CIViC, provide manually annotated functional mutations, but they remain small because they depend on human curation. Since unlabeled data (millions of variants) typically outnumber labeled data (thousands of variants), computational tools that take advantage of unlabeled data may improve prediction accuracy.

RESULTS:

To leverage unlabeled data in predicting the functional importance of genetic variants, we introduced a method based on semi-supervised generative adversarial networks (SGAN) that is trained on both labeled and unlabeled data. Our SGAN model used features derived from clinical guidelines together with predictive scores from other computational tools. We also performed a comparative analysis of factors that influence prediction accuracy, such as the choice of algorithm, the types of features, and the training sample size, to provide more insight into variant prioritization. We found that, by incorporating unlabeled samples, SGAN can achieve competitive performance with small labeled training sets, which is a unique advantage over traditional machine learning methods. We also found that manually curated samples yield more stable predictive performance than publicly available datasets.
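The abstract does not state the training objective, so the following is only a minimal sketch of how a semi-supervised GAN discriminator can combine labeled and unlabeled samples, assuming the widely used formulation in which the discriminator outputs K class logits and the "real" probability is D(x) = Z / (Z + 1) with Z = sum(exp(logits)). The function names, the choice of K = 2 classes, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(l))) over a list of logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def supervised_loss(logits, label):
    """Cross-entropy over the K real classes for one labeled variant."""
    return logsumexp(logits) - logits[label]


def unsupervised_real_loss(logits):
    """-log D(x) for one unlabeled real variant, with D(x) = Z / (Z + 1)."""
    lse = logsumexp(logits)
    return softplus(lse) - lse


def unsupervised_fake_loss(logits):
    """-log(1 - D(x)) for one generated (fake) sample."""
    return softplus(logsumexp(logits))


def discriminator_loss(labeled, unlabeled, generated):
    """Sum of the mean supervised, unlabeled-real, and generated-fake terms.

    labeled:   list of (logits, class_index) pairs
    unlabeled: list of logits for real samples without labels
    generated: list of logits for generator outputs
    """
    sup = sum(supervised_loss(l, y) for l, y in labeled) / len(labeled)
    real = sum(unsupervised_real_loss(l) for l in unlabeled) / len(unlabeled)
    fake = sum(unsupervised_fake_loss(l) for l in generated) / len(generated)
    return sup + real + fake


# Toy logits standing in for discriminator outputs (hypothetical values).
labeled = [([2.0, -1.0], 0), ([-0.5, 1.5], 1)]
unlabeled = [[1.0, 0.2], [0.1, 0.9], [3.0, -2.0]]
generated = [[-2.0, -1.5], [-3.0, -2.5]]
print(discriminator_loss(labeled, unlabeled, generated))
```

Under these assumptions, only the supervised term needs labels, which is why a small labeled set can be supplemented by a much larger pool of unlabeled variants.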

CONCLUSIONS:

By incorporating much larger samples of unlabeled data, the SGAN method can improve the ability to detect novel oncogenic variants compared with other machine learning algorithms that use only labeled datasets. With more training samples and informative features, SGAN could potentially be used to predict the pathogenicity of more complex variants, such as structural or non-coding variants.

Subjects

Algorithms; Neoplasms; Humans; Machine Learning; Neoplasms/genetics; Databases, Factual; Supervised Machine Learning

Keywords

Deep learning; Generative adversarial networks; Machine learning; Somatic variants; Variants annotation; Variants interpretation

Full text: 1 Database: MEDLINE Main subject: Algorithms / Neoplasms Type of study: Guideline / Prognostic_studies / Risk_factors_studies Limits: Humans Language: En Publication year: 2023 Document type: Article
