High-performance deep learning pipeline predicts individuals in mixtures of DNA using sequencing data.
Phan, Nam Nhut; Chattopadhyay, Amrita; Lee, Tsui-Ting; Yin, Hsiang-I; Lu, Tzu-Pin; Lai, Liang-Chuan; Hwa, Hsiao-Lin; Tsai, Mong-Hsun; Chuang, Eric Y.
Affiliation
  • Phan NN; Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan.
  • Chattopadhyay A; Graduate Institute of Biomedical Electronics and Bioinformatics, Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan.
  • Lee TT; Bioinformatics and Biostatistics Core, Centre of Genomic and Precision Medicine, National Taiwan University, Taipei 10055, Taiwan.
  • Yin HI; Bioinformatics and Biostatistics Core, Centre of Genomic and Precision Medicine, National Taiwan University, Taipei 10055, Taiwan.
  • Lu TP; Department and Graduate Institute of Forensic Medicine, College of Medicine, National Taiwan University, Taipei, Taiwan.
  • Lai LC; Department and Graduate Institute of Forensic Medicine, College of Medicine, National Taiwan University, Taipei, Taiwan.
  • Hwa HL; Bioinformatics and Biostatistics Core, Centre of Genomic and Precision Medicine, National Taiwan University, Taipei 10055, Taiwan.
  • Tsai MH; Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei 10055, Taiwan.
  • Chuang EY; Bioinformatics and Biostatistics Core, Centre of Genomic and Precision Medicine, National Taiwan University, Taipei 10055, Taiwan.
Brief Bioinform; 22(6), 2021 Nov 05.
Article in En | MEDLINE | ID: mdl-34368845
ABSTRACT
In this study, we proposed a deep learning (DL) model for classifying individuals from mixtures of DNA samples using 27 short tandem repeats and 94 single nucleotide polymorphisms obtained through a massively parallel sequencing protocol. The model was trained, tested and validated with sequenced data from 6 individuals and then evaluated using mixtures from forensic DNA samples. The model identified both the major and the minor contributors with 100% accuracy for 90 DNA mixtures that were manually prepared by mixing sequence reads of 3 individuals at different ratios. Furthermore, the model identified 100% of the major contributors and 50-80% of the minor contributors in 20 two-sample external mixed samples at ratios of 1:39 and 1:9, respectively. To further demonstrate the versatility and applicability of the pipeline, we tested it on whole exome sequence data to classify subtypes of 20 breast cancer patients and achieved an area under the curve of 0.85. Overall, we present, for the first time, a complete pipeline, including sequencing data processing steps and DL steps, that is applicable across different NGS platforms. We also introduced a sliding window approach to overcome the sequence length variation problem of sequencing data, and demonstrate that it improves the model performance dramatically.
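The abstract does not describe how the sliding window is implemented. As a rough illustration only, the sketch below shows one common way to turn reads of varying length into fixed-length inputs: each read is cut into overlapping windows, and reads shorter than the window are padded. The function name `sliding_windows`, the `window` and `stride` parameters, and the `N` padding character are all assumptions made for this example, not details taken from the paper.

```python
from typing import List


def sliding_windows(seq: str, window: int, stride: int = 1, pad: str = "N") -> List[str]:
    """Cut a read into fixed-length windows (hypothetical helper, not the authors' code).

    Reads shorter than or equal to `window` yield a single window,
    right-padded with `pad`. For longer reads, windows start every
    `stride` bases; a trailing remainder shorter than `window` is dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(seq) <= window:
        return [seq.ljust(window, pad)]
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


# Reads of different lengths all map to windows of the same length.
reads = ["ACGTAC", "ACG", "ACGTACGTAC"]
for read in reads:
    print(read, sliding_windows(read, window=4, stride=2))
```

Because every window has the same length, the windows from all reads can be encoded and stacked into a single fixed-shape input for a neural network, whatever the original read lengths were.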

Full text: 1 Database: MEDLINE Main subject: DNA / Sequence Analysis, DNA / Deep Learning Type of study: Prognostic_studies / Risk_factors_studies Limits: Humans Language: En Journal: Brief Bioinform Journal subject: BIOLOGY / MEDICAL INFORMATICS Year of publication: 2021 Document type: Article Country of affiliation: Taiwan