Results 1 - 20 of 91
1.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38426321

ABSTRACT

Common loci are the distinct set of human genome sites that harbor genetic variants found in at least 1% of the population. Small somatic mutations at common loci and at non-common loci (csmVariants and ncsmVariants, respectively) are presumed to occur with similar probabilities. However, our work revealed that within the coding region, common loci constituted only 1.03% of all loci, yet they accounted for 5.14% of TCGA somatic mutations. Furthermore, the small somatic mutation incidence rate at these common loci was 2.7 times that observed at non-common loci. Notably, the csmVariants exhibited a recurrence rate of 36.14%, 2.59 times that of the ncsmVariants. The C-to-T transition at CpG sites accounted for 32.41% of the csmVariants, 2.93 times the proportion among ncsmVariants. The aging-related mutational signature contributed to 13.87% of the csmVariants, 5.5 times that of ncsmVariants. Moreover, 35.93% of csmVariant contexts exhibited palindromic features, 1.84 times the proportion among ncsmVariant contexts. Notably, cancer patients with higher csmVariant rates had better progression-free survival. Furthermore, cancer patients with high-frequency csmVariants enriched with mismatch repair deficiency were also associated with better progression-free survival. The accumulation of csmVariants during carcinogenesis is a complex process influenced by various factors. These include the presence of a substantial percentage of palindromic sequences at csmVariant sites, the impact of aging, and DNA mismatch repair deficiency. Together, these factors contribute to the higher somatic mutation incidence rates of common loci and the overall accumulation of csmVariants in cancer development.
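
The abstract does not define "palindromic features". As a rough illustration only, the sketch below assumes the usual DNA sense of the word: a sequence context is palindromic when it equals its own reverse complement. The function names and the example contexts are hypothetical and are not taken from the paper.

```python
# Minimal sketch, assuming "palindromic" means "equal to its reverse complement".
# All names here are illustrative; none come from the paper.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper- or lower-case DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True when a non-empty sequence reads the same on both strands."""
    seq = seq.upper()
    return len(seq) > 0 and seq == reverse_complement(seq)


def palindromic_fraction(contexts) -> float:
    """Fraction of sequence contexts that are reverse-complement palindromes."""
    contexts = list(contexts)
    if not contexts:
        return 0.0
    return sum(is_palindromic(c) for c in contexts) / len(contexts)
```

Comparing `palindromic_fraction` over contexts around csmVariants with the same quantity over ncsmVariant contexts would give a ratio analogous in spirit to the 1.84-fold figure reported above, although the paper's exact definition may differ.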


Subject(s)
Brain Neoplasms , Colorectal Neoplasms , Neoplastic Syndromes, Hereditary , Humans , Incidence , Brain Neoplasms/genetics , Mutation
2.
Nucleic Acids Res ; 2024 Jun 17.
Article in English | MEDLINE | ID: mdl-38884260

ABSTRACT

Horizontal gene transfer (HGT) phenomena pervade the gut microbiome and significantly impact human health. Yet, no current method can accurately identify complete HGT events, including the transferred sequence and the associated deletion and insertion breakpoints, from shotgun metagenomic data. Here, we develop LocalHGT, which facilitates the reliable and swift detection of complete HGT events from shotgun metagenomic data, delivering an accuracy of 99.4% (verified by Nanopore data) across 200 gut microbiome samples, and achieving an average F1 score of 0.99 on 100 simulated datasets. LocalHGT enables a systematic characterization of HGT events within the human gut microbiome across 2098 samples, revealing that multiple recipient genome sites can become targets of a transferred sequence, microhomology is enriched in HGT breakpoint junctions (P-value = 3.3e-58), and HGTs can function as host-specific fingerprints, as indicated by the significantly higher HGT similarity of intra-personal temporal samples than inter-personal samples (P-value = 4.3e-303). Crucially, HGTs showed potential contributions to colorectal cancer (CRC) and acute diarrhoea, as evidenced by the enrichment of the butyrate metabolism pathway (P-value = 3.8e-17) and the shigellosis pathway (P-value = 5.9e-13) in the respective associated HGTs. Furthermore, differential HGTs demonstrated promise as biomarkers for predicting various diseases. Integrating HGTs into a CRC prediction model achieved an AUC of 0.87.

3.
Nucleic Acids Res ; 52(D1): D756-D761, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37904614

ABSTRACT

Bacteriophages are viruses that infect bacteria or archaea. Understanding the diverse and intricate genomic architectures of phages is essential to study microbial ecosystems and develop phage therapy strategies. However, existing phage databases lack meticulous annotations. To this end, we propose PhageScope (https://phagescope.deepomics.org), an online phage database with comprehensive annotations. PhageScope harbors a collection of 873 718 phage sequences from various sources. Applying fifteen state-of-the-art tools to perform systematic annotations and analyses, PhageScope provides annotations on genome completeness, host range, lifestyle information, taxonomy classification, nine types of structural and functional genetic elements, and three types of comparative genomic studies for curated phages. Additionally, PhageScope incorporates automatic analyses and visualizations for curated and customized phages, serving as an efficient platform for phage study.


Subject(s)
Bacteriophages , Databases, Genetic , Bacteria/virology , Bacteriophages/genetics , Genome, Viral/genetics , Genomics , Phage Therapy
4.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36752378

ABSTRACT

T-cell receptors (TCRs) play an essential role in the adaptive immune system. Probabilistic models for TCR repertoires can help decipher the underlying complex sequence patterns and provide novel insights into understanding the adaptive immune system. In this work, we develop TCRpeg, a deep autoregressive generative model to unravel the sequence patterns of TCR repertoires. TCRpeg largely outperforms state-of-the-art methods in estimating the probability distribution of a TCR repertoire, boosting the average accuracy from 0.672 to 0.906 measured by the Pearson correlation coefficient. Furthermore, with promising performance in probability inference, TCRpeg improves on a range of TCR-related tasks: profiling TCR repertoire probabilistically, classifying antigen-specific TCRs, validating previously discovered TCR motifs, generating novel TCRs and augmenting TCR data. Our results and analysis highlight the flexibility and capacity of TCRpeg to extract TCR sequence information, providing a novel approach for deciphering complex immunogenomic repertoires.


Subject(s)
Models, Statistical , Receptors, Antigen, T-Cell , Receptors, Antigen, T-Cell/genetics
5.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36715274

ABSTRACT

Advances in single-cell RNA-sequencing (scRNA-seq) shed light on cell-specific transcriptomic studies of cell development, complex diseases and cancers. Nevertheless, scRNA-seq techniques suffer from 'dropout' events, and imputation tools have been proposed to address the resulting sparsity. Here, rather than imputation, we propose a tool, SMURF, to extract low-dimensional embeddings of cells and genes using matrix factorization with a mixture of Poisson-Gamma divergence as the objective while preserving self-consistency. SMURF exhibits feasible cell subpopulation discovery efficacy with the obtained cell embeddings on replicated in silico datasets and eight wet lab scRNA datasets with ground truth cell types. Furthermore, SMURF can reduce the cell embedding to a 1D-oval space to recover the time course of the cell cycle. SMURF can also serve as an imputation tool; the in silico data assessment shows that SMURF displays the most robust gene expression recovery power, with low root mean square error and high Pearson correlation. Moreover, SMURF recovers the gene distribution for the WM989 Drop-seq data. SMURF is available at https://github.com/deepomicslab/SMURF.


Subject(s)
Single-Cell Gene Expression Analysis , Software , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Gene Expression Profiling , Cluster Analysis
6.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-37150761

ABSTRACT

The specificity of a T-cell receptor (TCR) repertoire determines personalized immune capacity. Existing methods have modeled the qualitative aspects of TCR specificity, while the quantitative aspects remained unaddressed. We developed a package, TCRanno, to quantify the specificity of TCR repertoires. We created deep-learning-based, epitope-aware vector embeddings to infer individual TCR specificity. Then we aggregated clonotype frequencies of TCRs to obtain a quantitative profile of repertoire specificity at epitope, antigen and organism levels. Applying TCRanno to 4195 TCR repertoires revealed quantitative changes in repertoire specificity upon infections, autoimmunity and cancers. Specifically, TCRanno found cytomegalovirus-specific TCRs in seronegative healthy individuals, supporting the possibility of abortive infections. TCRanno discovered an age-accumulated fraction of severe acute respiratory syndrome coronavirus 2 specific TCRs in pre-pandemic samples, which may explain the aggressive symptoms and age-related severity of coronavirus disease 2019. TCRanno also identified the encounter of Hepatitis B antigens as a potential trigger of systemic lupus erythematosus. TCRanno annotations were able to distinguish the TCR repertoires of healthy individuals from those of cancer patients, including melanoma, lung and breast cancers. TCRanno also proved useful for single-cell TCRseq+gene expression data analyses by isolating T-cells with the specificity of interest.


Subject(s)
CD8-Positive T-Lymphocytes , COVID-19 , Humans , CD8-Positive T-Lymphocytes/metabolism , COVID-19/genetics , Receptors, Antigen, T-Cell/genetics , Epitopes , Cytomegalovirus
7.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36892171

ABSTRACT

The adaptive immune receptor repertoire (AIRR), consisting of T- and B-cell receptors, is the core component of the immune system. AIRR sequencing is commonly used in cancer immunotherapy and minimal residual disease (MRD) detection of leukemia and lymphoma. The AIRR is captured by primers and sequenced to yield paired-end (PE) reads. The PE reads can be merged into one sequence using the region where they overlap. However, the wide range of AIRR data makes merging difficult, so a specialized tool is required. We developed a software package for IMmune PE reads merger of sequencing data, named IMperm. We used the k-mer-and-vote strategy to pin down the overlapped region rapidly. IMperm could handle all types of PE reads, eliminate adapter contamination and successfully merge low-quality and minor/non-overlapping reads. Compared with existing tools, IMperm performed better on both simulated and sequencing data. Notably, IMperm was well suited to processing the data of MRD detection in leukemia and lymphoma and detected 19 novel MRD clones in 14 patients with leukemia from previously published data. Additionally, IMperm can handle PE reads from other sources, and we demonstrated its effectiveness on two genomic datasets and one cell-free deoxyribonucleic acid dataset. IMperm is implemented in the C programming language and consumes little runtime and memory. It is freely available at https://github.com/zhangwei2015/IMperm.
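
The abstract names a k-mer-and-vote strategy but gives no details. The sketch below shows one plausible reading, offered as an assumption rather than as IMperm's actual algorithm (IMperm itself is written in C): index the k-mers of read 1, look up each k-mer of the reverse-complemented read 2, and let every shared k-mer vote for the implied alignment offset. The function names, the default `k`, and the merge rule, which ignores base qualities and adapters, are all hypothetical.

```python
from collections import Counter

# Illustrative only: not IMperm's implementation. Names and defaults are assumptions.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_offset_by_vote(read1, read2_rc, k=8):
    """Return the most-voted start of read2_rc within read1's coordinates, or None."""
    # Index every k-mer of read1 by its start positions.
    index = {}
    for i in range(len(read1) - k + 1):
        index.setdefault(read1[i:i + k], []).append(i)
    # Each shared k-mer votes for the offset (read1 position - read2_rc position).
    votes = Counter()
    for j in range(len(read2_rc) - k + 1):
        for i in index.get(read2_rc[j:j + k], ()):
            votes[i - j] += 1
    if not votes:
        return None
    offset, _count = votes.most_common(1)[0]
    return offset


def merge_pair(read1, read2, k=8):
    """Merge a read pair at the voted offset; return None if no usable overlap."""
    read2_rc = reverse_complement(read2)
    offset = find_offset_by_vote(read1, read2_rc, k)
    if offset is None or offset < 0:
        return None  # no shared k-mer, or read-through that this sketch does not handle
    # Bases from read2_rc are used in the overlap; a real tool would weigh qualities.
    tail = read1[offset + len(read2_rc):]
    return read1[:offset] + read2_rc + tail
```

Voting makes the offset estimate tolerant of a few mismatched or repeated k-mers, because a spurious match contributes a single vote while the true overlap contributes one vote per shared k-mer.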


Subject(s)
Genomics , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA , Software , Genome , Algorithms
8.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38603603

ABSTRACT

MOTIVATION: Genome sequencing technologies reveal a huge amount of genomic sequences. Neural network-based methods can be prime candidates for retrieving insights from these sequences because of their applicability to large and diverse datasets. However, the highly variable lengths of genome sequences severely impair the presentation of sequences as input to the neural network. Genetic variations further complicate tasks that involve sequence comparison or alignment. RESULTS: Inspired by the theory and applications of "spaced seeds," we propose a graph representation of genome sequences called "gapped pattern graph." These graphs can be transformed through a Graph Convolutional Network to form lower-dimensional embeddings for downstream tasks. On the basis of the gapped pattern graphs, we implemented a neural network model and demonstrated its performance on diverse tasks involving microbe and mammalian genome data. Our method consistently outperformed all the other state-of-the-art methods across various metrics on all tasks, especially for the sequences with limited homology to the training data. In addition, our model was able to identify distinct gapped pattern signatures from the sequences. AVAILABILITY AND IMPLEMENTATION: The framework is available at https://github.com/deepomicslab/GCNFrame.
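
The abstract does not spell out how a gapped pattern graph is built. As a loose sketch under stated assumptions, the code below uses the standard spaced-seed idea: a binary mask in which `1` marks positions that are read and `0` marks positions that are skipped. It then counts each gapped pattern as a node and each pair of patterns at adjacent positions as a weighted edge. The mask `"1101"`, the edge definition and the function names are hypothetical and may differ from the GCNFrame implementation.

```python
from collections import Counter

# Illustrative sketch; the graph definition here is an assumption, not the paper's.


def gapped_patterns(seq, mask="1101"):
    """Slide a spaced-seed mask over seq and collect the characters at '1' positions."""
    if not mask or set(mask) - {"0", "1"} or "1" not in mask:
        raise ValueError("mask must be a non-empty string of 0s and 1s with at least one 1")
    keep = [i for i, m in enumerate(mask) if m == "1"]
    span = len(mask)
    return ["".join(seq[i + p] for p in keep) for i in range(len(seq) - span + 1)]


def gapped_pattern_graph(seq, mask="1101"):
    """Return (node counts, edge counts) for patterns at adjacent sequence positions."""
    patterns = gapped_patterns(seq, mask)
    nodes = Counter(patterns)
    edges = Counter(zip(patterns, patterns[1:]))
    return nodes, edges
```

Because the `0` positions are ignored, a substitution that falls on a skipped position leaves the pattern unchanged, which is the property that makes spaced seeds tolerant of genetic variation. The node and edge counts form a fixed-vocabulary graph regardless of sequence length, which is the kind of input a graph convolutional network can consume.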

9.
Nucleic Acids Res ; 51(2): e9, 2023 01 25.
Article in English | MEDLINE | ID: mdl-36373664

ABSTRACT

Cells possess functional diversity hierarchically. However, most single-cell analyses neglect the nested structures while detecting and visualizing the functional diversity. Here, we incorporate cell hierarchy to study functional diversity at subpopulation, club (i.e., sub-subpopulation), and cell layers. Accordingly, we implement a package, SEAT, to construct cell hierarchies utilizing structure entropy by minimizing the global uncertainty in cell-cell graphs. With cell hierarchies, SEAT deciphers functional diversity in 36 datasets covering scRNA, scDNA, scATAC, and scRNA-scATAC multiome. First, SEAT finds optimal cell subpopulations with high clustering accuracy. It identifies cell types or fates from omics profiles and boosts accuracy from 0.34 to 1. Second, SEAT detects insightful functional diversity among cell clubs. The hierarchy of breast cancer cells reveals that the specific tumor cell club drives AREG-EGFT signaling. We identify a dense co-accessibility network of cis-regulatory elements specified by one cell club in GM12878. Third, the cell order from the hierarchy infers periodic pseudo-time of cells, improving accuracy from 0.79 to 0.89. Moreover, we incorporate cell hierarchy layers as prior knowledge to refine nonlinear dimension reduction, enabling us to visualize hierarchical cell layouts in low-dimensional space.


Subject(s)
Cluster Analysis , Single-Cell Analysis , RNA, Small Cytoplasmic , Single-Cell Analysis/methods , Uncertainty
10.
Nucleic Acids Res ; 51(15): e81, 2023 08 25.
Article in English | MEDLINE | ID: mdl-37403780

ABSTRACT

Single-cell sequencing technology enables the simultaneous capture of multiomic data from multiple cells. The captured data can be represented by tensors, i.e. higher-order generalizations of matrices. However, existing analysis tools often treat the data as a collection of second-order matrices, discarding the correspondences among the features. Consequently, we propose a probabilistic tensor decomposition framework, SCOIT, to extract embeddings from single-cell multiomic data. SCOIT incorporates various distributions, including Gaussian, Poisson, and negative binomial distributions, to deal with sparse, noisy, and heterogeneous single-cell data. Our framework can decompose a multiomic tensor into a cell embedding matrix, a gene embedding matrix, and an omic embedding matrix, allowing for various downstream analyses. We applied SCOIT to eight single-cell multiomic datasets from different sequencing protocols. With cell embeddings, SCOIT achieves superior performance for cell clustering compared to nine state-of-the-art tools under various metrics, demonstrating its ability to dissect cellular heterogeneity. With the gene embeddings, SCOIT enables cross-omics gene expression analysis and integrative gene regulatory network study. Furthermore, the embeddings allow simultaneous cross-omics imputation, outperforming current imputation methods with the Pearson correlation coefficient increased by 3.38-39.26%; moreover, SCOIT accommodates the scenario in which subsets of the cells have only one omic profile available.


Subject(s)
Benchmarking , Multiomics , Cluster Analysis , Correlation of Data , Cytosol , Single-Cell Analysis
11.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34671807

ABSTRACT

Recent advances in single-cell copy number variation (CNV) analysis play an essential role in addressing intratumor heterogeneity, identifying tumor subgroups and restoring tumor-evolving trajectories at single-cell scale. Informative visualization of copy number analysis results boosts productive scientific exploration, validation and sharing. Several figures in published articles and software packages have shown how effective visualization can be for understanding single-cell genomics. However, almost all of them lack real-time interaction, and they are hard to reproduce. Moreover, existing tools are time-consuming and memory-intensive at large-scale single-cell throughputs. We present an online visualization platform, single-cell Somatic Variant Analysis Suite (scSVAS), for real-time interactive single-cell genomics data visualization. scSVAS is specifically designed for large-scale single-cell genomic analysis and provides an arsenal of unique functionalities. After the specified input files are uploaded, scSVAS deploys the online interactive visualization automatically. Users may conduct scientific discoveries, share interactive visualizations and download high-quality publication-ready figures. scSVAS provides versatile utilities for managing, investigating, sharing and publishing single-cell CNV profiles. We envision this online platform will expedite the biological understanding of cancer clonal evolution at single-cell resolution. All visualizations are publicly hosted at https://sc.deepomics.org.


Subject(s)
DNA Copy Number Variations , Software , Data Visualization , Genome , Genomics/methods
12.
Nucleic Acids Res ; 50(15): e88, 2022 08 26.
Article in English | MEDLINE | ID: mdl-35639502

ABSTRACT

Topologically associated domains (TADs) are crucial chromatin structural units. Evidence has illustrated that RNA-chromatin and RNA-RNA spatial interactions, so-called RNA-associated interactions (RAIs), may be associated with TAD-like domains (TLDs). To decode hierarchical TLDs from RAIs, we proposed SuperTLD, a domain detection algorithm incorporating imputation. We applied SuperTLD on four RAI data sets and compared TLDs with the TADs identified from the corresponding Hi-C datasets. The TLDs and TADs share a moderate similarity of hierarchies ≥ 0.5312 and the finest structures ≥ 0.8295. Comparison between boundaries and domains further demonstrated the novelty of TLDs. Enrichment analysis of epigenetic characteristics illustrated that the novel TLDs exhibit an enriched CTCF by 0.6245 fold change and H3 histone marks enriched within domains. GO analysis on the TLD novel boundaries exhibited enriched diverse terms, revealing TLDs' formation mechanism related closely to gene regulation.


Subject(s)
Chromatin , RNA , Algorithms , Chromatin/genetics , Chromosomes , Histone Code , RNA/genetics
13.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34463709

ABSTRACT

Oncovirus integrations cause copy number variations and complex structural variations (SVs) on host genomes. However, the understanding of how inserted viral DNA impacts the local genome remains limited. The linear structure of the oncovirus integrated local genomic map (LGM) will lay the foundations to understand how oncovirus integrations emerge and compromise the host genome's functioning. We propose a conjugate graph model to reconstruct the rearranged LGM at integrated loci. Simulation tests demonstrate the reliability and credibility of the algorithm. Applying the algorithm to whole-genome sequencing data of human papillomavirus (HPV) and hepatitis B virus (HBV)-infected cancer samples yielded biological insights on oncovirus integrations. We observed four patterns by which oncovirus integrations affect the host genome in the HPV- and HBV-integrated cancer samples: coding-frame truncation, hyper-amplification of a tumor gene, viral cis-regulation inserted at a single intron, and viral cis-regulation inserted at an intergenic region. We found that focal duplicates and host SVs are frequent in the HPV-integrated LGMs, while focal deletions are prevalent in HBV-integrated LGMs. Furthermore, with the results yielded by our method, we found that enhanced microhomology-mediated end joining might lead to both HPV and HBV integrations and conjectured that HPV integrations might mainly occur during the DNA replication process. The conjugate graph algorithm code and LGM construction pipeline are available at https://github.com/deepomicslab/FuseSV.


Subject(s)
Computational Biology/methods , DNA Copy Number Variations , Genome, Human , Retroviridae/physiology , User-Computer Interface , Virus Integration , Algorithms , Base Sequence , DNA, Viral , Databases, Genetic , Humans , Neoplasms/etiology
14.
Nucleic Acids Res ; 49(19): e114, 2021 11 08.
Article in English | MEDLINE | ID: mdl-34403470

ABSTRACT

Haplotype phasing plays an important role in understanding the genetic data of diploid eukaryotic organisms. Different sequencing technologies (such as next-generation sequencing or third-generation sequencing) produce various genetic data that require haplotype assembly. Although multiple diploid haplotype phasing algorithms exist, only a few work equally well across all sequencing technologies. In this work, we propose SpecHap, a novel haplotype assembly tool that leverages spectral graph theory. On both in silico and whole-genome sequencing datasets, SpecHap consumed less memory and required less CPU time, yet achieved accuracy comparable with state-of-the-art methods across all the test instances, which comprise sequencing data from next-generation sequencing, linked-reads, high-throughput chromosome conformation capture, PacBio single-molecule real-time, and Oxford Nanopore long-reads. Furthermore, SpecHap successfully phased an individual Ambystoma mexicanum, a species with a gigantic diploid genome, within 6 CPU hours and 945MB peak memory usage, while other tools failed to yield results either due to memory overflow (40GB) or time limit exceeded (5 days). Our results demonstrated that SpecHap is scalable, efficient, and accurate for diploid phasing across many sequencing platforms.


Subject(s)
Algorithms , Ambystoma mexicanum/genetics , Genome , High-Throughput Nucleotide Sequencing/statistics & numerical data , Sequence Analysis, DNA/methods , Whole Genome Sequencing/statistics & numerical data , Animals , Benchmarking , Datasets as Topic , Diploidy , Haplotypes , Humans , Nanopores , Time Factors
15.
Nucleic Acids Res ; 48(W1): W415-W426, 2020 07 02.
Article in English | MEDLINE | ID: mdl-32392343

ABSTRACT

Genetics data visualization plays an important role in the sharing of knowledge from cancer genome research. Many types of visualization are widely used, most of which are static and require sufficient coding experience to create. Here, we present Oviz-Bio, a web-based platform that provides interactive and real-time visualizations of cancer genomics data. Researchers can interactively explore visual outputs and export high-quality diagrams. Oviz-Bio supports a diverse range of visualizations on common cancer mutation types, including annotation and signatures of small-scale mutations, haplotype view and focal clusters of copy number variations, split-reads alignment and heatmap view of structural variations, transcript junction of fusion genes and genomic hotspot of oncovirus integrations. Furthermore, Oviz-Bio offers a landscape view for investigating multi-layered data across a sample cohort. All Oviz-Bio visual applications are freely available at https://bio.oviz.org/.


Subject(s)
Genomics/methods , Neoplasms/genetics , Software , Computer Graphics , Data Visualization , Gene Fusion , Genetic Variation , Haplotypes , Humans , Internet , Mutation , Retroviridae/genetics , Virus Integration
16.
BMC Genomics ; 22(Suppl 5): 651, 2021 Nov 16.
Article in English | MEDLINE | ID: mdl-34789142

ABSTRACT

BACKGROUND: Copy number variation is crucial in deciphering the mechanism and cure of complex disorders and cancers. The recent advancement of scDNA sequencing technology sheds light upon addressing intratumor heterogeneity, detecting rare subclones, and reconstructing tumor evolution lineages at single-cell resolution. Nevertheless, the current circular binary segmentation based approach fails to identify copy number shifts efficiently and effectively in some exceptional cases. RESULTS: Here, we propose SCYN, a CNV segmentation method powered by dynamic programming. SCYN resolves the precise segmentation on an in silico dataset. We then verified that SCYN infers copy numbers accurately on triple negative breast cancer scDNA data, with array comparative genomic hybridization results of purified bulk samples as ground truth validation. We tested SCYN on two datasets from the newly emerged 10x Genomics CNV solution. SCYN successfully recognizes gastric cancer cells in the 1% and 10% spike-in 10x datasets. Moreover, SCYN is about 150 times faster than a state-of-the-art tool when dealing with datasets of approximately 2000 cells. CONCLUSIONS: SCYN robustly and efficiently detects segmentations and infers copy number profiles on single-cell DNA sequencing data. It serves to reveal intratumor heterogeneity. The source code of SCYN can be accessed at https://github.com/xikanfeng2/SCYN.
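
The abstract says only that SCYN's segmentation is powered by dynamic programming, without stating the objective. The sketch below is a textbook example of the general idea and should not be read as SCYN's method: it splits a one-dimensional read-depth signal into exactly `k` contiguous segments that minimize the total within-segment squared error. The function name, the squared-error cost and the fixed `k` are assumptions made for illustration.

```python
# Textbook segmentation by dynamic programming; not SCYN's actual objective.


def segment(signal, k):
    """Split signal into k contiguous segments minimizing the summed squared error.

    Returns a list of (start, end) half-open index pairs. Runs in O(k * n^2) time.
    """
    n = len(signal)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(signal)")

    # Prefix sums give the squared error of any interval in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(signal):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def sse(a, b):
        """Squared error of signal[a:b] around its own mean."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    inf = float("inf")
    # best[j][i]: minimal cost of covering signal[:i] with exactly j segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for t in range(j - 1, i):
                cost = best[j - 1][t] + sse(t, i)
                if cost < best[j][i]:
                    best[j][i] = cost
                    back[j][i] = t

    # Walk the back-pointers from the end to recover the breakpoints.
    bounds = []
    i = n
    for j in range(k, 0, -1):
        t = back[j][i]
        bounds.append((t, i))
        i = t
    bounds.reverse()
    return bounds
```

Unlike a greedy recursive split, the dynamic program considers every breakpoint combination implicitly, so the returned segmentation is optimal for the chosen cost. In practice `k` would have to be selected by a penalty or model-selection criterion, which this sketch leaves out.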


Subject(s)
DNA Copy Number Variations , Software , Algorithms , Comparative Genomic Hybridization , Genomics , Sequence Analysis, DNA
17.
Proc Natl Acad Sci U S A ; 115(45): 11567-11572, 2018 11 06.
Article in English | MEDLINE | ID: mdl-30348779

ABSTRACT

Whole-exome sequencing has been successful in identifying genetic factors contributing to familial or sporadic Parkinson's disease (PD). However, this approach has not been applied to explore the impact of de novo mutations on PD pathogenesis. Here, we sequenced the exomes of 39 early onset patients, their parents, and 20 unaffected siblings to investigate the effects of de novo mutations on PD. We identified 12 genes with de novo mutations (MAD1L1, NUP98, PPP2CB, PKMYT1, TRIM24, CEP131, CTTNBP2, NUS1, SMPD3, MGRN1, IFI35, and RUSC2), which could be functionally relevant to PD pathogenesis. Further analyses of two independent case-control cohorts (1,852 patients and 1,565 controls in one cohort and 3,237 patients and 2,858 controls in the other) revealed that NUS1 harbors significantly more rare nonsynonymous variants (P = 1.01E-5, odds ratio = 11.3) in PD patients than in controls. Functional studies in Drosophila demonstrated that the loss of NUS1 could reduce the climbing ability, dopamine level, and number of dopaminergic neurons in 30-day-old flies and could induce apoptosis in fly brain. Together, our data suggest that de novo mutations could contribute to early onset PD pathogenesis and identify NUS1 as a candidate gene for PD.


Subject(s)
Brain/metabolism , Dopaminergic Neurons/metabolism , Mutation , Nerve Tissue Proteins/genetics , Parkinson Disease/genetics , Receptors, Cell Surface/genetics , Adult , Age of Onset , Animals , Apoptosis/genetics , Aryl Hydrocarbon Receptor Nuclear Translocator/antagonists & inhibitors , Aryl Hydrocarbon Receptor Nuclear Translocator/genetics , Aryl Hydrocarbon Receptor Nuclear Translocator/metabolism , Base Sequence , Brain/pathology , Case-Control Studies , Cohort Studies , Disease Models, Animal , Dopamine/metabolism , Dopaminergic Neurons/pathology , Drosophila Proteins/antagonists & inhibitors , Drosophila Proteins/genetics , Drosophila Proteins/metabolism , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , Early Diagnosis , Female , Gene Expression , Gene Regulatory Networks , Humans , Male , Nerve Tissue Proteins/metabolism , Parents , Parkinson Disease/diagnosis , Parkinson Disease/metabolism , Parkinson Disease/pathology , RNA, Small Interfering/genetics , RNA, Small Interfering/metabolism , Receptors, Cell Surface/metabolism , Siblings
18.
BMC Genomics ; 21(Suppl 11): 893, 2020 Dec 29.
Article in English | MEDLINE | ID: mdl-33372605

ABSTRACT

BACKGROUND: Horizontal Gene Transfer (HGT) refers to the sharing of genetic materials between distant species that are not in a parent-offspring relationship. The HGT insertion sites are important to understand the HGT mechanisms. Recent studies of the main agents of HGT, such as transposons and plasmids, demonstrate that insertion sites usually hold specific sequence features. This motivates us to find a method to infer HGT insertion sites according to sequence features. RESULTS: In this paper, we propose a deep residual network, DeepHGT, to recognize HGT insertion sites. To train DeepHGT, we extracted about 1.55 million sequence segments as training instances from 262 metagenomic samples, where the ratio between positive instances and negative instances is about 1:1. These segments are randomly partitioned into three subsets: 80% of them as the training set, 10% as the validation set, and the remaining 10% as the test set. The training loss of DeepHGT is 0.4163 and the validation loss is 0.423. On the test set, DeepHGT achieved an area under the curve (AUC) of 0.8782. Furthermore, to evaluate the generalization of DeepHGT, we constructed an independent test set containing 689,312 sequence segments from another 147 gut metagenomic samples. DeepHGT achieved an AUC of 0.8428, which approaches the previous test AUC value. As a comparison, the gradient boosting classifier model implemented in PyFeat achieves AUC values of 0.694 and 0.686 on the above two test sets, respectively. Furthermore, DeepHGT could learn discriminant sequence features; for example, DeepHGT has learned a sequence pattern of palindromic subsequences as a significant (P-value = 0.0182) local feature. Hence, DeepHGT is a reliable model to recognize the HGT insertion site. CONCLUSION: DeepHGT is the first deep learning model that can accurately recognize HGT insertion sites on genomes according to the sequence pattern.
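
The random 80%/10%/10% partition described in the abstract can be sketched as below. This is a generic illustration, not the authors' code: the function name, the fixed seed and the rounding rule are assumptions, and the abstract does not say whether the split was stratified by label or by sample.

```python
import random

# Generic shuffled split; names, seed and rounding are illustrative assumptions.


def split_dataset(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items reproducibly and cut them into train, validation and test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    items = list(items)
    random.Random(seed).shuffle(items)  # local RNG, so global random state is untouched
    n = len(items)
    n_train = round(n * fractions[0])
    n_val = round(n * fractions[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder, so no item is dropped
    return train, val, test
```

Taking the test set as the remainder guarantees that the three subsets are disjoint and together cover every segment, even when the sizes do not divide evenly.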


Subject(s)
Deep Learning , Gene Transfer, Horizontal , Base Sequence , Genome , Metagenomics , Phylogeny
19.
BMC Genomics ; 21(Suppl 10): 618, 2020 Nov 18.
Article in English | MEDLINE | ID: mdl-33208097

ABSTRACT

BACKGROUND: Single-cell RNA-sequencing (scRNA-seq) is becoming indispensable in the study of cell-specific transcriptomes. However, in scRNA-seq techniques, only a small fraction of the genes are captured due to "dropout" events. These dropout events require intensive treatment when analyzing scRNA-seq data. For example, imputation tools have been proposed to estimate dropout events and de-noise data. The performance of these imputation tools is often evaluated, or fine-tuned, using various clustering criteria based on ground-truth cell subgroup labels. This limits their effectiveness in cases where we lack cell subgroup knowledge. We consider an alternative strategy which requires the imputation to follow a "self-consistency" principle; that is, the imputation process refines its results until there is no internal inconsistency or dropouts in the data. RESULTS: We propose the use of "self-consistency" as a main criterion in performing imputation. To demonstrate this principle we devised I-Impute, a "self-consistent" method, to impute scRNA-seq data. I-Impute optimizes continuous similarities and dropout probabilities in iterative refinements until a self-consistent imputation is reached. On the in silico data sets, I-Impute consistently exhibited the highest Pearson correlations for different dropout rates compared with the state-of-the-art methods SAVER and scImpute. Furthermore, we collected three wet lab datasets (a mouse bladder cells dataset, an embryonic stem cells dataset, and an aortic leukocyte cells dataset) to evaluate the tools. I-Impute exhibited feasible cell subpopulation discovery efficacy on all three datasets. It achieves the highest clustering accuracy compared with SAVER and scImpute. CONCLUSIONS: A strategy based on "self-consistency", captured through our method, I-Impute, gave imputation results better than the state-of-the-art tools. Source code of I-Impute can be accessed at https://github.com/xikanfeng2/I-Impute.


Subject(s)
RNA , Single-Cell Analysis , Animals , Gene Expression Profiling , Mice , Sequence Analysis, RNA , Software
20.
BMC Bioinformatics ; 20(Suppl 23): 648, 2019 Dec 27.
Article in English | MEDLINE | ID: mdl-31881818

ABSTRACT

BACKGROUND: With recent advances in high-throughput technologies, matrix factorization techniques are increasingly being utilized for mapping quantitative omics profiling matrix data into low-dimensional embedding space, in the hope of uncovering insights in the underlying biological processes. Nevertheless, current matrix factorization tools fall short in handling noisy data and missing entries, both deficiencies that are often found in real-life data. RESULTS: Here, we propose DeepMF, a deep neural network-based factorization model. DeepMF disentangles the association between molecular feature-associated and sample-associated latent matrices, and is tolerant to noisy and missing values. It exhibited feasible cancer subtype discovery efficacy on mRNA, miRNA, and protein profiles of medulloblastoma cancer, leukemia cancer, breast cancer, and small-blue-round-cell cancer, achieving the highest clustering accuracy of 76%, 100%, 92%, and 100% respectively. When analyzing data sets with 70% missing entries, DeepMF gave the best recovery capacity with silhouette values of 0.47, 0.6, 0.28, and 0.44, outperforming other state-of-the-art MF tools on the cancer data sets Medulloblastoma, Leukemia, TCGA BRCA, and SRBCT. Its embedding strength as measured by clustering accuracy is 88%, 100%, 84%, and 96% on these data sets, which improves on the current best methods 76%, 100%, 78%, and 87%. CONCLUSION: DeepMF demonstrated robust denoising, imputation, and embedding ability. It offers insights to uncover the underlying biological processes such as cancer subtype discovery. Our implementation of DeepMF can be found at https://github.com/paprikachan/DeepMF.


Subject(s)
Deep Learning , Genomics , Software , Algorithms , Computer Simulation , Databases, Genetic , Humans , MicroRNAs/genetics , MicroRNAs/metabolism , Neoplasms/genetics , Neural Networks, Computer , RNA, Messenger/genetics , RNA, Messenger/metabolism