Search | BVS España Search Portal

A survey of algorithms for the detection of genomic structural variants from long-read sequencing data.

Ahsan, Mian Umair; Liu, Qian; Perdomo, Jonathan Elliot; Fang, Li; Wang, Kai.

Nat Methods ; 20(8): 1143-1158, 2023 08.

Article in English | MEDLINE | ID: mdl-37386186

ABSTRACT

As long-read sequencing technologies are becoming increasingly popular, a number of methods have been developed for the discovery and analysis of structural variants (SVs) from long reads. Long reads enable detection of SVs that could not be previously detected from short-read sequencing, but computational methods must adapt to the unique challenges and opportunities presented by long-read sequencing. Here, we summarize over 50 long-read-based methods for SV detection, genotyping and visualization, and discuss how new telomere-to-telomere genome assemblies and pangenome efforts can improve the accuracy and drive the development of SV callers in the future.

Subject(s)

Algorithms ; Genome ; Humans ; Sequence Analysis, DNA/methods ; Genomic Structural Variation ; High-Throughput Nucleotide Sequencing/methods ; Genome, Human

Noncoding de novo mutations in SCN2A are associated with autism spectrum disorders.

Zhang, Yuan; Ahsan, Mian Umair; Wang, Kai.

medRxiv ; 2024 May 06.

Article in English | MEDLINE | ID: mdl-38766206

ABSTRACT

Coding de novo mutations (DNMs) contribute to the risk for autism spectrum disorders (ASD), but the contribution of noncoding DNMs remains relatively unexplored. Here we use whole genome sequencing (WGS) data of 12,411 individuals (including 3,508 probands and 2,218 unaffected siblings) from 3,357 families collected in Simons Foundation Powering Autism Research for Knowledge (SPARK) to detect DNMs associated with ASD, while examining Simons Simplex Collection (SSC) with 6,383 individuals from 2,274 families to replicate the results. For coding DNMs, SCN2A reached exome-wide significance (p = 2.06×10⁻¹¹) in SPARK. The 618 known dominant ASD genes as a group are more strongly enriched for coding DNMs in cases than in sibling controls (fold change = 1.51, p = 1.13×10⁻⁵ for SPARK; fold change = 1.86, p = 2.06×10⁻⁹ for SSC). For noncoding DNMs, we used two methods to assess statistical significance: a point-based test that analyzes sites with a Combined Annotation Dependent Depletion (CADD) score ≥15, and a segment-based test that analyzes 1 kb genomic segments with segment-specific background mutation rates (inferred from expected rare mutations in Gnocchi genome constraint scores). The point-based test identified SCN2A as marginally significant (p = 6.12×10⁻⁴) in SPARK, whereas the segment-based test identified CSMD1, RBFOX1 and CHD13 as exome-wide significant. We did not identify significant enrichment of noncoding DNMs (in all 1 kb segments or those with Gnocchi > 4) in the 618 known ASD genes as a group in cases relative to sibling controls. When combining evidence from both coding and noncoding DNMs, we found that SCN2A, with 11 coding and 5 noncoding DNMs, exhibited the strongest significance (p = 4.15×10⁻¹³). In summary, we identified both coding and noncoding DNMs in SCN2A associated with ASD, while nominating additional candidates for further examination in future studies.
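The abstract reports fold changes and p-values for DNM enrichment in cases versus sibling controls, but it does not say which test produced them. As a minimal sketch of one common choice, assuming a conditional exact binomial test on DNM counts, the Python below computes a per-individual fold change and a one-sided p-value. The function names and the DNM counts in the usage lines are hypothetical; only the cohort sizes (3,508 probands and 2,218 siblings) come from the abstract.

```python
import math

def fold_change(case_dnms, n_cases, sib_dnms, n_sibs):
    """Ratio of per-individual DNM rates (cases over siblings)."""
    return (case_dnms / n_cases) / (sib_dnms / n_sibs)

def binomial_enrichment_p(case_dnms, n_cases, sib_dnms, n_sibs):
    """One-sided exact p-value for an excess of DNMs in cases.

    Assumption: conditional on the total count, and with equal per-individual
    rates under the null, the case count follows
    Binomial(total, n_cases / (n_cases + n_sibs)).
    """
    total = case_dnms + sib_dnms
    q = n_cases / (n_cases + n_sibs)
    p_value = 0.0
    for k in range(case_dnms, total + 1):
        # Work in log space so large counts do not overflow.
        log_pmf = (math.lgamma(total + 1) - math.lgamma(k + 1)
                   - math.lgamma(total - k + 1)
                   + k * math.log(q) + (total - k) * math.log(1.0 - q))
        p_value += math.exp(log_pmf)
    return min(p_value, 1.0)

# Hypothetical DNM counts for one gene set; cohort sizes are from the abstract.
fc = fold_change(30, 3508, 10, 2218)
p = binomial_enrichment_p(30, 3508, 10, 2218)
print(f"fold change = {fc:.2f}, one-sided p = {p:.3g}")
```

Conditioning on the total count removes the unknown baseline mutation rate from the null model, which is why only the two cohort sizes enter the test.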

LongReadSum: A fast and flexible quality control and signal summarization tool for long-read sequencing data.

Perdomo, Jonathan Elliot; Ahsan, Mian Umair; Liu, Qian; Fang, Li; Wang, Kai.

bioRxiv ; 2024 Aug 07.

Article in English | MEDLINE | ID: mdl-39211184

ABSTRACT

While several well-established quality control (QC) tools are available for short-read sequencing data, there is a general paucity of computational tools that provide long-read metrics in a fast and comprehensive manner across all major sequencing platforms (such as PacBio, Oxford Nanopore, Illumina Complete Long Read) and data formats (such as ONT POD5, FAST5, basecall summary files and PacBio unaligned BAM). Additionally, none of the current tools provide support for summarizing Oxford Nanopore basecall signal or comprehensive base modification (methylation) information from genomic data. Furthermore, a single PromethION flowcell on the Oxford Nanopore platform can now generate terabytes of signal data, which cannot be handled by existing tools designed for small-scale flowcells. To address these challenges, here we present LongReadSum, a multi-threaded C++ tool which provides fast and comprehensive QC reports on all major aspects of sequencing data (such as read, base, base quality, alignment, and base modification metrics) and produces basecalling signal intensity information from the Oxford Nanopore platform. We demonstrate use cases to analyze cDNA sequencing, direct mRNA sequencing, reduced representation methylation sequencing (RRMS) through adaptive sequencing, as well as whole genome sequencing (WGS) data using diverse long-read platforms.

A signal processing and deep learning framework for methylation detection using Oxford Nanopore sequencing.

Ahsan, Mian Umair; Gouru, Anagha; Chan, Joe; Zhou, Wanding; Wang, Kai.

Nat Commun ; 15(1): 1448, 2024 Feb 16.

Article in English | MEDLINE | ID: mdl-38365920

ABSTRACT

Oxford Nanopore sequencing can detect DNA methylation from the ionic current signal of single molecules, offering a unique advantage over conventional methods. Additionally, adaptive sampling, a software-controlled enrichment method for targeted sequencing, allows reduced representation methylation sequencing that can be applied to CpG islands or imprinted regions. Here we present DeepMod2, a comprehensive deep-learning framework for methylation detection using ionic current signal from Nanopore sequencing. DeepMod2 implements both a bidirectional long short-term memory (BiLSTM) model and a Transformer model and can analyze POD5 and FAST5 signal files generated on R9 and R10 flowcells. Additionally, DeepMod2 can run efficiently on a central processing unit (CPU) through model pruning and can infer epihaplotypes or haplotype-specific methylation calls from phased reads. We use multiple publicly available and newly generated datasets to evaluate the performance of DeepMod2 under varying scenarios. DeepMod2 has comparable performance to Guppy and Dorado, which are the current state-of-the-art methods from Oxford Nanopore Technologies that remain closed-source. Moreover, we show a high correlation (r = 0.96) between reduced representation and whole-genome Nanopore sequencing. In summary, DeepMod2 is an open-source tool that enables fast and accurate DNA methylation detection from whole-genome or adaptive sequencing data on a diverse range of flowcell types.

Subject(s)

Deep Learning ; Nanopore Sequencing ; Nanopores ; Sequence Analysis, DNA/methods ; High-Throughput Nucleotide Sequencing/methods ; DNA Methylation
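The DeepMod2 abstract above reports a correlation of r = 0.96 between reduced representation and whole-genome methylation results. A minimal sketch of how such a comparison might be computed follows, assuming that r is Pearson's correlation over per-site methylation frequencies at sites covered by both runs. The dictionary layout, the function names and the example frequencies are hypothetical and are not taken from DeepMod2.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def shared_site_correlation(freq_a, freq_b):
    """Correlate methylation frequencies over sites present in both inputs.

    Each input maps a (chromosome, position) key to a frequency in [0, 1].
    """
    shared = sorted(set(freq_a) & set(freq_b))
    return pearson_r([freq_a[s] for s in shared], [freq_b[s] for s in shared])

# Hypothetical per-site frequencies from two sequencing runs.
targeted = {("chr1", 100): 0.10, ("chr1", 250): 0.85, ("chr2", 40): 0.50}
whole_genome = {("chr1", 100): 0.15, ("chr1", 250): 0.80, ("chr2", 40): 0.55,
                ("chr3", 7): 0.90}
print(round(shared_site_correlation(targeted, whole_genome), 3))
```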

Classification of integers based on residue classes via modern deep learning algorithms.

Wu, Da; Yang, Jingye; Ahsan, Mian Umair; Wang, Kai.

Patterns (N Y) ; 4(12): 100860, 2023 Dec 08.

Article in English | MEDLINE | ID: mdl-38106613

ABSTRACT

Judging whether an integer can be divided by prime numbers such as 2 or 3 may appear trivial to human beings, but it can be less straightforward for computers. Here, we tested multiple deep learning architectures and feature engineering approaches to classifying integers based on their residues when divided by small prime numbers. We found that classification ability depends critically on the feature space. We also evaluated automated machine learning (AutoML) platforms from Amazon, Google, and Microsoft and found that, without appropriately engineered features, they failed on this task. Furthermore, we introduced a method that utilizes linear regression on Fourier series basis vectors and demonstrated its effectiveness. Finally, we evaluated large language models (LLMs) such as GPT-4, GPT-J, LLaMA, and Falcon, and we demonstrated their failures. In conclusion, feature engineering remains an important task to improve performance and increase interpretability of machine learning models, even in the era of AutoML and LLMs.
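The abstract mentions linear regression on Fourier series basis vectors without giving details. The sketch below is one way the idea could work, not the authors' implementation. It relies on the fact that the indicator of n ≡ r (mod p) is an exact linear combination of cos(2πkn/p) and sin(2πkn/p) terms. All function names are hypothetical. Because these basis columns are mutually orthogonal over whole periods, the least-squares weights are computed column by column instead of with a general solver.

```python
import math

def fourier_features(n, p):
    """Real Fourier basis with period p, evaluated at integer n (p values)."""
    feats = [1.0]
    for k in range(1, p // 2 + 1):
        angle = 2.0 * math.pi * k * n / p
        feats.append(math.cos(angle))
        if 2 * k != p:  # the sine term is identically zero when 2k == p
            feats.append(math.sin(angle))
    return feats

def fit_residue_classifier(p, n_periods=4):
    """Least-squares weights for the one-hot residue indicators mod p.

    Training integers cover whole periods, so the feature columns are
    orthogonal and each weight is a simple projection coefficient.
    """
    train = range(p * n_periods)
    rows = [fourier_features(n, p) for n in train]
    norms = [sum(row[j] ** 2 for row in rows) for j in range(p)]
    weights = []
    for r in range(p):
        target = [1.0 if n % p == r else 0.0 for n in train]
        weights.append([
            sum(row[j] * t for row, t in zip(rows, target)) / norms[j]
            for j in range(p)
        ])
    return weights

def predict_residue(n, p, weights):
    """Return the residue class whose regression output is largest."""
    feats = fourier_features(n, p)
    scores = [sum(w * f for w, f in zip(ws, feats)) for ws in weights]
    return max(range(p), key=scores.__getitem__)

weights = fit_residue_classifier(7)
print(predict_residue(1_000_003, 7, weights))
```

The model is fitted only on the integers 0 to 27, yet it generalizes to much larger inputs because the features are periodic in n. This is consistent with the abstract's point that the feature space determines whether the task is learnable.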

Assessing the Expression of Long INterspersed Elements (LINEs) via Long-Read Sequencing in Diverse Human Tissues and Cell Lines.

Rybacki, Karleena; Xia, Mingyi; Ahsan, Mian Umair; Xing, Jinchuan; Wang, Kai.

Genes (Basel) ; 14(10), 2023 09 29.

Article in English | MEDLINE | ID: mdl-37895242

ABSTRACT

Transposable elements, such as Long INterspersed Elements (LINEs), are DNA sequences that can replicate within genomes. LINEs replicate using an RNA intermediate followed by reverse transcription and are typically a few kilobases in length. LINE activity creates genomic structural variants in human populations and leads to somatic alterations in cancer genomes. Long-read RNA sequencing technologies, including Oxford Nanopore and PacBio, can directly sequence relatively long transcripts, thus providing the opportunity to examine full-length LINE transcripts. This study focuses on the development of a new bioinformatics pipeline for the identification and quantification of active, full-length LINE transcripts in diverse human tissues and cell lines. In our pipeline, we utilized RepeatMasker to identify LINE-1 (L1) transcripts from long-read transcriptome data and incorporated several criteria, such as transcript start position, divergence, and length, to remove likely false positives. Comparisons between cancerous and normal cell lines, as well as human tissue samples, revealed elevated expression levels of young LINEs in cancer, particularly at intact L1 loci. By employing bioinformatics methodologies on long-read transcriptome data, this study demonstrates the landscape of L1 expression in tissues and cell lines.

Subject(s)

Long Interspersed Nucleotide Elements ; Neoplasms ; Humans ; Long Interspersed Nucleotide Elements/genetics ; Cell Line ; Transcriptome/genetics ; RNA ; Neoplasms/genetics
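The LINE abstract above describes removing likely false-positive L1 transcripts using criteria such as transcript start position, divergence, and length, but it gives no thresholds. The sketch below shows what such a filter could look like. The record fields, the function names and every cutoff value are hypothetical placeholders, not values from the study or from RepeatMasker.

```python
from dataclasses import dataclass

@dataclass
class L1Hit:
    read_id: str
    repeat_start: int   # start of the match within the L1 consensus (bp)
    divergence: float   # percent divergence from the consensus
    length: int         # matched length (bp)

def passes_filters(hit, max_start=100, max_divergence=5.0, min_length=5000):
    """Keep hits that start near the 5' end, are young, and are near full length.

    All three cutoffs are illustrative assumptions.
    """
    return (hit.repeat_start <= max_start
            and hit.divergence <= max_divergence
            and hit.length >= min_length)

def filter_full_length(hits, **cutoffs):
    """Return the read IDs of hits that pass every criterion."""
    return [h.read_id for h in hits if passes_filters(h, **cutoffs)]

# Hypothetical annotated reads.
hits = [
    L1Hit("read1", repeat_start=12, divergence=1.8, length=6010),
    L1Hit("read2", repeat_start=3400, divergence=2.1, length=2600),   # truncated
    L1Hit("read3", repeat_start=5, divergence=14.0, length=5900),     # old element
]
print(filter_full_length(hits))
```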

NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks.

Ahsan, Mian Umair; Liu, Qian; Fang, Li; Wang, Kai.

Genome Biol ; 22(1): 261, 2021 09 06.

Article in English | MEDLINE | ID: mdl-34488830

ABSTRACT

Long-read sequencing enables variant detection in genomic regions that are considered difficult-to-map by short-read sequencing. To fully exploit the benefits of longer reads, here we present a deep learning method NanoCaller, which detects SNPs using long-range haplotype information, then phases long reads with called SNPs and calls indels with local realignment. Evaluation on 8 human genomes demonstrates that NanoCaller generally achieves better performance than competing approaches. We experimentally validate 41 novel variants in a widely used benchmarking genome, which could not be reliably detected previously. In summary, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing.

Subject(s)

Haplotypes/genetics ; High-Throughput Nucleotide Sequencing ; INDEL Mutation/genetics ; Nanoparticles/chemistry ; Neural Networks, Computer ; Polymorphism, Single Nucleotide/genetics ; Alleles ; Base Sequence ; Benchmarking ; Chromosome Mapping ; Genome, Human ; Humans ; Major Histocompatibility Complex/genetics ; Nanopore Sequencing
