Results 1 - 20 of 148
1.
PLoS Comput Biol ; 15(8): e1007273, 2019 08.
Article in English | MEDLINE | ID: mdl-31433799

ABSTRACT

Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, few open-source tools are available. Error rates, particularly for inversions and fusions across chromosomes, remain higher than with alternative scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.


Subjects
Chromosomes, Human/genetics; Genome, Human; Genomics/methods; Algorithms; Animals; Computational Biology; Computer Simulation; Databases, Nucleic Acid/statistics & numerical data; Genomic Library; Genomics/statistics & numerical data; High-Throughput Nucleotide Sequencing/methods; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; Sequence Analysis, DNA/methods; Sequence Analysis, DNA/statistics & numerical data; Software
2.
PLoS Comput Biol ; 15(8): e1007274, 2019 08.
Article in English | MEDLINE | ID: mdl-31465436

ABSTRACT

The popularity of CRISPR-based gene editing has resulted in an abundance of tools to design CRISPR-Cas9 guides. This is also driven by the fact that designing highly specific and efficient guides is a crucial, but not trivial, task in using CRISPR for gene editing. Here, we thoroughly analyse the performance of 18 design tools. They are evaluated based on runtime performance, compute requirements, and guides generated. To achieve this, we implemented a method for auditing system resources while a given tool executes, and tested each tool on datasets of increasing size, derived from the mouse genome. We found that only five tools had a computational performance that would allow them to analyse an entire genome in a reasonable time, and without exhausting computing resources. There was wide variation in the guides identified, with some tools reporting every possible guide while others filtered for predicted efficiency. Some tools also failed to exclude guides that would target multiple positions in the genome. We also considered two collections with over a thousand guides each, for which experimental data is available. There is a lot of variation in performance between the datasets, but the relative order of the tools is partially conserved. Importantly, the most striking result is a lack of consensus between the tools. Our results show that CRISPR-Cas9 guide design tools need further work in order to achieve rapid whole-genome analysis and that improvements in guide design will likely require combining multiple approaches.


Subjects
CRISPR-Cas Systems; Gene Editing/methods; RNA, Guide/genetics; Animals; Benchmarking/methods; Benchmarking/statistics & numerical data; Computational Biology; Databases, Nucleic Acid/statistics & numerical data; Gene Editing/standards; Gene Editing/statistics & numerical data; Mice; Software
3.
PLoS Comput Biol ; 15(8): e1007040, 2019 08.
Article in English | MEDLINE | ID: mdl-31469823

ABSTRACT

Single-cell RNA-sequencing (scRNA-seq) provides new opportunities to gain a mechanistic understanding of many biological processes. Current approaches for single cell clustering are often sensitive to the input parameters and have difficulty dealing with cell types with different densities. Here, we present Panoramic View (PanoView), an iterative method integrated with a novel density-based clustering, Ordering Local Maximum by Convex hull (OLMC), that uses a heuristic approach to estimate the required parameters based on the input data structures. In each iteration, PanoView will identify the most confident cell clusters and repeat the clustering with the remaining cells in a new PCA space. Without adjusting any parameter in PanoView, we demonstrated that PanoView was able to detect major and rare cell types simultaneously and outperformed other existing methods in both simulated datasets and published single-cell RNA-sequencing datasets. Finally, we conducted scRNA-Seq analysis of embryonic mouse hypothalamus, and PanoView was able to reveal known cell types and several rare cell subpopulations.


Subjects
Algorithms; Sequence Analysis, RNA/statistics & numerical data; Animals; Cluster Analysis; Computational Biology; Computer Simulation; Databases, Nucleic Acid/statistics & numerical data; Hypothalamus/cytology; Hypothalamus/embryology; Hypothalamus/metabolism; Mice; Single-Cell Analysis/statistics & numerical data
4.
PLoS Comput Biol ; 15(4): e1006937, 2019 04.
Article in English | MEDLINE | ID: mdl-30973878

ABSTRACT

Gestational alcohol exposure causes fetal alcohol spectrum disorder (FASD) and is a prominent cause of neurodevelopmental disability. Whole transcriptome sequencing (RNA-Seq) offers insights into mechanisms underlying FASD, but gene-level analysis provides limited information regarding complex transcriptional processes such as alternative splicing and non-coding RNAs. Moreover, traditional analytical approaches that use multiple hypothesis testing with a false discovery rate adjustment prioritize genes based on an adjusted p-value, which is not always biologically relevant. We address these limitations with a novel approach: we implemented an unsupervised machine learning model and applied it to an exon-level analysis to reduce data complexity to the most likely functionally relevant exons, without loss of novel information. This was performed on an RNA-Seq paired-end dataset derived from alcohol-exposed neural fold-stage chick crania, wherein alcohol causes facial deficits recapitulating those of FASD. A principal component analysis along with k-means clustering was used to extract exons that deviated from baseline expression. This identified 6857 differentially expressed exons representing 1251 geneIDs; 391 of these genes were identified in a prior gene-level analysis of this dataset. It also identified exons encoding 23 microRNAs (miRNAs) having significantly differential expression profiles in response to alcohol. We developed an RDAVID pipeline to identify KEGG pathways represented by these exons, and separately identified predicted KEGG pathways targeted by these miRNAs. Several of these (ribosome biogenesis, oxidative phosphorylation) were identified in our prior gene-level analysis. Other pathways are crucial to facial morphogenesis and represent both novel (focal adhesion, FoxO signaling, insulin signaling) and known (Wnt signaling) alcohol targets. Importantly, there was substantial overlap between the exomes themselves and the predicted miRNA targets, suggesting these miRNAs contribute to the gene-level expression changes. Our novel application of unsupervised machine learning in conjunction with statistical analyses facilitated the discovery of signaling pathways and miRNAs that inform mechanisms underlying FASD.
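The combination of principal component analysis and k-means clustering described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the exon names, the toy expression values, the choice of a single principal component, the use of two clusters, and the rule that the smaller cluster holds the deviating exons are all assumptions made for illustration.

```python
import math

def first_pc_scores(rows, iters=200):
    """Score each row on the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the columns (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance at all: nothing to project onto
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]

def kmeans_1d(values, iters=50):
    """Two-cluster k-means on scalars, seeded with the minimum and maximum."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(x - lo) <= abs(x - hi) else 1 for x in values]
        group0 = [x for x, lab in zip(values, labels) if lab == 0]
        group1 = [x for x, lab in zip(values, labels) if lab == 1]
        if group0:
            lo = sum(group0) / len(group0)
        if group1:
            hi = sum(group1) / len(group1)
    return labels

def deviating_exons(table):
    """Return names in the smaller cluster (assumed to deviate from baseline)."""
    names = list(table)
    labels = kmeans_1d(first_pc_scores([table[name] for name in names]))
    minority = 1 if labels.count(1) <= labels.count(0) else 0
    return [name for name, lab in zip(names, labels) if lab == minority]

# Hypothetical log fold changes for six exons across three replicates.
TOY_EXONS = {
    "exon_a": [0.1, -0.1, 0.0],
    "exon_b": [-0.2, 0.1, 0.1],
    "exon_c": [0.0, 0.2, -0.1],
    "exon_d": [2.1, 1.9, 2.3],
    "exon_e": [1.8, 2.2, 2.0],
    "exon_f": [0.1, 0.0, -0.2],
}

print(deviating_exons(TOY_EXONS))
```

A real exon-level analysis would retain several components and choose the number of clusters from the data; the sketch only shows how projection and clustering fit together.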


Subjects
Exons/genetics; Fetal Alcohol Spectrum Disorders/genetics; MicroRNAs/genetics; Unsupervised Machine Learning; Animals; Big Data; Chick Embryo; Cluster Analysis; Computational Biology; Databases, Nucleic Acid/statistics & numerical data; Disease Models, Animal; Ethanol/toxicity; Female; Gene Expression Profiling/statistics & numerical data; Humans; Pregnancy; Principal Component Analysis; Unsupervised Machine Learning/statistics & numerical data
5.
Pac Symp Biocomput ; 24: 76-87, 2019.
Article in English | MEDLINE | ID: mdl-30864312

ABSTRACT

Noncoding single nucleotide polymorphisms (SNPs) and their target genes are important components of the heritability of diseases and other polygenic traits. Identifying these SNPs and target genes could potentially reveal new molecular mechanisms and advance precision medicine. For polygenic traits, genome-wide association studies (GWAS) are preferred tools for identifying trait-associated regions. However, identifying causal noncoding SNPs within such regions is a difficult problem in computational biology. The DNA sequence context of a noncoding SNP is well-established as an important source of information that is beneficial for discriminating functional from nonfunctional noncoding SNPs. We describe the use of a deep residual network (ResNet)-based model, entitled Res2s2aM, that fuses flanking DNA sequence information with additional SNP annotation information to discriminate functional from nonfunctional noncoding SNPs. On a ground-truth set of disease-associated SNPs compiled from the Genome-wide Repository of Associations between SNPs and Phenotypes (GRASP) database, Res2s2aM improves the prediction accuracy of functional SNPs significantly in comparison to models based only on sequence information as well as a leading tool for post-GWAS noncoding SNP prioritization (RegulomeDB).


Subjects
Deep Learning; Polymorphism, Single Nucleotide; Algorithms; Computational Biology; Databases, Nucleic Acid/statistics & numerical data; Genome-Wide Association Study/statistics & numerical data; Humans; Models, Genetic; Molecular Sequence Annotation; Sequence Analysis, DNA
6.
Pac Symp Biocomput ; 24: 100-111, 2019.
Article in English | MEDLINE | ID: mdl-30864314

ABSTRACT

Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. An insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) from the scientific article associated with the sequence and disambiguating the locations to their coordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, a disambiguation accuracy of 91% and an overall resolution F1 score of 0.88, which are significantly higher than those of previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.
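For readers unfamiliar with the metric, a strict detection F1 score can be computed as sketched below. The sketch assumes that "strict" means a predicted toponym counts only when its character span matches a gold span exactly; the spans are invented for illustration and are unrelated to the paper's data.

```python
def strict_f1(gold_spans, predicted_spans):
    """Return (precision, recall, F1) under exact span matching."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical (start, end) character offsets of toponyms in one article.
GOLD = [(0, 5), (10, 18), (30, 36)]
PREDICTED = [(0, 5), (10, 17), (30, 36), (40, 44)]  # (10, 17) is off by one

precision, recall, f1 = strict_f1(GOLD, PREDICTED)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this definition the off-by-one span counts as both a false positive and a false negative, which is why strict scores are usually lower than overlap-based ones.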


Subjects
Phylogeography/methods; Computational Biology; Databases, Nucleic Acid/statistics & numerical data; Metadata; Natural Language Processing; Phylogeography/statistics & numerical data
7.
Pac Symp Biocomput ; 24: 160-171, 2019.
Article in English | MEDLINE | ID: mdl-30864319

ABSTRACT

BACKGROUND: MicroRNAs (miRNAs) are small, non-coding RNAs that regulate gene expression through post-transcriptional silencing. Differential expression observed in miRNAs, combined with advancements in deep learning (DL), has the potential to improve cancer classification by modelling non-linear miRNA-phenotype associations. We propose a novel miRNA-based deep cancer classifier (DCC) incorporating genomic and hierarchical tissue annotation, capable of accurately predicting the presence of cancer in a wide range of human tissues. METHODS: miRNA expression profiles were analyzed for 1746 neoplastic and 3871 normal samples, across 26 types of cancer involving six organ sub-structures and 68 cell types. miRNAs were ranked and filtered using a specificity score representing their information content in relation to neoplasticity, incorporating 3 levels of hierarchical biological annotation. A DL architecture composed of stacked autoencoders (AE) and a multi-layer perceptron (MLP) was trained to predict neoplasticity using 497 abundant and informative miRNAs. Additional DCCs were trained using expression of miRNA cistrons and sequence families, and combined as a diagnostic ensemble. Important miRNAs were identified using backpropagation, and analyzed in Cytoscape using iCTNet and BiNGO. RESULTS: Nested four-fold cross-validation was used to assess the performance of the DL model. The model achieved an accuracy, AUC/ROC, sensitivity, and specificity of 94.73%, 98.6%, 95.1%, and 94.3%, respectively. CONCLUSION: Deep autoencoder networks are a powerful tool for modelling complex miRNA-phenotype associations in cancer. The proposed DCC improves classification accuracy by learning from the biological context of both samples and miRNAs, using anatomical and genomic annotation. Analyzing the deep structure of DCCs with backpropagation can also facilitate biological discovery, by performing gene ontology searches on the most highly significant features.


Subjects
Deep Learning; MicroRNAs/genetics; Neoplasms/genetics; Computational Biology; Databases, Nucleic Acid/statistics & numerical data; Diagnosis, Computer-Assisted/methods; Female; Gene Expression Profiling/statistics & numerical data; Gene Expression Regulation, Neoplastic; Gene Ontology; High-Throughput Nucleotide Sequencing; Humans; Male; MicroRNAs/classification; Molecular Sequence Annotation; Neoplasms/classification; Neoplasms/diagnosis; Sequence Analysis, RNA
8.
Pac Symp Biocomput ; 24: 184-195, 2019.
Article in English | MEDLINE | ID: mdl-30864321

ABSTRACT

Genetic variations of the human genome are linked to many disease phenotypes. While whole-genome sequencing and genome-wide association studies (GWAS) have uncovered a number of genotype-phenotype associations, their functional interpretation remains challenging given most single nucleotide polymorphisms (SNPs) fall into the non-coding region of the genome. Advances in chromatin immunoprecipitation sequencing (ChIP-seq) have made large-scale repositories of epigenetic data available, allowing investigation of coordinated mechanisms of epigenetic markers and transcriptional regulation and their influence on biological function. To address this, we propose SNPs2ChIP, a method to infer biological functions of non-coding variants through unsupervised statistical learning methods applied to publicly-available epigenetic datasets. We systematically characterized latent factors by applying singular value decomposition to ChIP-seq tracks of lymphoblastoid cell lines, and annotated the biological function of each latent factor using the genomic region enrichment analysis tool. Using these annotated latent factors as reference, we developed SNPs2ChIP, a pipeline that takes genomic region(s) as an input, identifies the relevant latent factors with quantitative scores, and returns them along with their inferred functions. As a case study, we focused on systemic lupus erythematosus and demonstrated our method's ability to infer relevant biological function. We systematically applied SNPs2ChIP on publicly available datasets, including known GWAS associations from the GWAS catalogue and ChIP-seq peaks from a previously published study. Our approach to leverage latent patterns across genome-wide epigenetic datasets to infer the biological function will advance understanding of the genetics of human diseases by accelerating the interpretation of non-coding genomes.


Subjects
Chromatin Immunoprecipitation/statistics & numerical data; Polymorphism, Single Nucleotide; Algorithms; Cell Line; Computational Biology/methods; Databases, Nucleic Acid/statistics & numerical data; Epigenesis, Genetic; Genetic Association Studies; Genome, Human; Genome-Wide Association Study/statistics & numerical data; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; Lupus Erythematosus, Systemic/genetics; Lymphocytes/metabolism; Receptors, Calcitriol/genetics
9.
Pac Symp Biocomput ; 24: 196-207, 2019.
Article in English | MEDLINE | ID: mdl-30864322

ABSTRACT

The Sequence Read Archive (SRA) contains over one million publicly available sequencing runs from various studies using a variety of sequencing library strategies. These data inherently contain information about underlying genomic sequence variants, which we exploit to extract allelic read counts on an unprecedented scale. We reprocessed over 250,000 human sequencing runs (>1000 TB of raw sequence data) into a single unified dataset of allelic read counts for nearly 300,000 variants of biomedical relevance curated by NCBI dbSNP, where germline variants were detected in a median of 912 sequencing runs, and somatic variants were detected in a median of 4,876 sequencing runs, suggesting that this dataset facilitates identification of sequencing runs that harbor variants of interest. Allelic read counts obtained using a targeted alignment were very similar to read counts obtained from whole-genome alignment. Analyzing allelic read count data for matched DNA and RNA samples from tumors, we find that RNA-seq can also recover variants identified by Whole Exome Sequencing (WXS), suggesting that reprocessed allelic read counts can support variant detection across different library strategies in SRA. This study provides a rich database of known human variants across SRA samples that can support future meta-analyses of human sequence variation.


Subjects
Alleles; Databases, Nucleic Acid/statistics & numerical data; Genome, Human; High-Throughput Nucleotide Sequencing/statistics & numerical data; Big Data; Computational Biology; Genetic Variation; Humans; Metadata; Neoplasms/genetics; Polymorphism, Single Nucleotide; Single-Cell Analysis; Whole Exome Sequencing/statistics & numerical data
10.
Pac Symp Biocomput ; 24: 338-349, 2019.
Article in English | MEDLINE | ID: mdl-30864335

ABSTRACT

Cell trajectory reconstruction based on single cell RNA sequencing is important for obtaining the landscape of different cell types and discovering cell fate transitions. Despite intense effort, analyzing massive single cell RNA-seq datasets is still challenging. We propose a new method named Landmark Isomap for Single-cell Analysis (LISA). LISA is an unsupervised approach to build cell trajectory and compute pseudo-time in the isometric embedding based on geodesic distances. The advantages of LISA include: (1) It utilizes a k-nearest-neighbor graph and hierarchical clustering to identify cell clusters, peaks and valleys in a low-dimensional representation of the data; (2) Based on Landmark Isomap, it constructs the main geometric structure of cell lineages; (3) It projects cells to the edges of the main cell trajectory to generate the global pseudo-time. Assessments on simulated and real datasets demonstrate the advantages of LISA on cell trajectory and pseudo-time reconstruction compared to Monocle2 and TSCAN. LISA is accurate, fast, and requires less memory, allowing its application to massive single cell datasets generated from current experimental platforms.
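The geodesic distances that Isomap-style methods rely on can be sketched as shortest paths over a k-nearest-neighbor graph. The example below is a simplified illustration, not LISA itself: the two-dimensional points, the value of k, and the use of geodesic distance from a single root cell as a stand-in for pseudo-time are all assumptions.

```python
import heapq
import math

def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbor graph weighted by Euclidean distance."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for distance, j in nearest:
            graph[i][j] = distance
            graph[j][i] = distance
    return graph

def geodesic_from(graph, root):
    """Dijkstra shortest-path distances from root along graph edges."""
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, weight in graph[u].items():
            candidate = d + weight
            if candidate < dist.get(v, math.inf):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist

# Hypothetical cells lying on a U-shaped trajectory in a 2-D embedding.
POINTS = [(0, 0), (0, 1), (0, 2), (1, 2.5), (2, 2.5), (3, 2), (3, 1), (3, 0)]

GEODESIC = geodesic_from(knn_graph(POINTS, k=2), root=0)
ORDERING = sorted(GEODESIC, key=GEODESIC.get)  # cells ordered along the curve
print(ORDERING)
print(round(GEODESIC[7], 3), math.dist(POINTS[0], POINTS[7]))
```

The two ends of the U are close in straight-line distance but far apart along the graph, which is the property that lets geodesic distances follow a curved trajectory. Landmark Isomap reduces cost by computing such distances from a subset of landmark points only.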


Subjects
/methods; Single-Cell Analysis/methods; Algorithms; Animals; Blastocyst/cytology; Blastocyst/metabolism; Cluster Analysis; Computational Biology/methods; Computer Simulation; Databases, Nucleic Acid/statistics & numerical data; Humans; Principal Component Analysis; Single-Cell Analysis/statistics & numerical data; Stochastic Processes; Time Factors; Unsupervised Machine Learning; Zebrafish/embryology; Zebrafish/genetics
11.
Methods Mol Biol ; 1912: 251-285, 2019.
Article in English | MEDLINE | ID: mdl-30635897

ABSTRACT

One of the most important resources for researchers of noncoding RNAs is the information available in public databases spread over the internet. However, the effective exploration of these data can be a daunting task, given the large number of databases available and the variety of stored data. This chapter describes a classification of databases based on information source, type of RNA, source organisms, data formats, and the mechanisms for information retrieval, detailing the relevance of each of these classifications and its usability by researchers. This classification is used to update a 2012 review, indexing now more than 229 public databases. This review will include an assessment of the new trends for ncRNA research based on the information that is being offered by the databases. Additionally, we will expand the previous analysis focusing on the usability and application of these databases in pathogen and disease research. Finally, this chapter will analyze how currently available database schemas can help the development of new and improved web resources.


Subjects
Computational Biology/methods; Databases, Nucleic Acid/trends; Information Storage and Retrieval/trends; RNA, Untranslated/genetics; Computational Biology/trends; Databases, Nucleic Acid/statistics & numerical data; Datasets as Topic; Humans; Information Storage and Retrieval/statistics & numerical data
12.
Brief Bioinform ; 20(1): 288-298, 2019 01 18.
Article in English | MEDLINE | ID: mdl-29028903

ABSTRACT

RNA sequencing (RNA-seq) has become a standard procedure to investigate transcriptional changes between conditions and is routinely used in research and clinics. While standard differential expression (DE) analysis between two conditions has been extensively studied and improved over the past decades, RNA-seq time course (TC) DE analysis algorithms are still in their early stages. In this study, we compare, for the first time, existing TC RNA-seq tools on an extensive simulation data set and validate the best performing tools on published data. Surprisingly, TC tools were outperformed by the classical pairwise comparison approach on short time series (<8 time points) in terms of overall performance and robustness to noise, mostly because of a high number of false positives, with the exception of ImpulseDE2. Overlapping the candidate lists of different tools mitigated this shortcoming, as the majority of false-positive, but not true-positive, candidates were unique to each method. On longer time series, the pairwise approach performed worse overall than splineTC and maSigPro, which did not identify any false-positive candidates.
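The idea of overlapping candidate lists can be sketched as a simple vote across tools. The tool names, gene identifiers, and the threshold of two tools below are hypothetical and chosen only to show the mechanics; they are not results from the study.

```python
from collections import Counter

def consensus_candidates(candidate_lists, min_tools=2):
    """Keep candidates reported by at least min_tools different tools."""
    votes = Counter(
        gene for genes in candidate_lists.values() for gene in set(genes)
    )
    return sorted(gene for gene, count in votes.items() if count >= min_tools)

# Hypothetical differential-expression calls from three tools.
CALLS = {
    "tool_a": ["g1", "g2", "g3"],
    "tool_b": ["g2", "g3", "g4"],
    "tool_c": ["g3", "g5"],
}

print(consensus_candidates(CALLS))
print(consensus_candidates(CALLS, min_tools=3))
```

Candidates reported by a single tool are dropped, which removes false positives that are unique to one method at the cost of losing any true positive that only one tool detects.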


Subjects
Gene Expression Profiling/methods; Sequence Analysis, RNA/methods; Bayes Theorem; Computational Biology/methods; Computer Simulation; Databases, Nucleic Acid/statistics & numerical data; Gene Expression Profiling/statistics & numerical data; Humans; Markov Chains; Models, Statistical; Molecular Sequence Annotation/statistics & numerical data; Sequence Analysis, RNA/statistics & numerical data; Signal-To-Noise Ratio; Software; Time Factors
13.
Brief Bioinform ; 20(1): 66-76, 2019 01 18.
Article in English | MEDLINE | ID: mdl-28968629

ABSTRACT

Cardiovascular diseases (CVDs) continue to be a major cause of morbidity and mortality, and non-coding RNAs (ncRNAs) play critical roles in CVDs. With the recent emergence of high-throughput technologies, including small RNA sequencing, investigations of CVDs have been transformed from candidate-based studies into genome-wide undertakings, and a number of ncRNAs in CVDs were discovered in various studies. A comprehensive review of these ncRNAs would be highly valuable for researchers to get a complete picture of the ncRNAs in CVD. To address these knowledge gaps and clinical needs, in this review, we first discuss dysregulated ncRNAs and their critical roles in cardiovascular development and related diseases. Moreover, we reviewed >28 561 published papers and documented the ncRNA-CVD association benchmarking data sets to summarize the principles of ncRNA regulation in CVDs. This data set included 13 249 curated relationships between 9503 ncRNAs and 139 CVDs in 12 species. Based on this comprehensive resource, we summarized the regulatory principles of dysregulated ncRNAs in CVDs, including the complex associations between ncRNAs and CVDs, tissue specificity and ncRNA synergistic regulation. The highlighted principles are that CVD microRNAs (miRNAs) are highly expressed in heart tissue and that they play central roles in the miRNA-miRNA functional synergistic network. In addition, CVD-related miRNAs are close to one another in the functional network, indicating the modular characteristic features of CVD miRNAs. We believe that the regulatory principles summarized here will further contribute to our understanding of ncRNA function and dysregulation mechanisms in CVDs.


Subjects
Cardiovascular Diseases/genetics; RNA, Untranslated/genetics; Animals; Big Data; Computational Biology; Databases, Nucleic Acid/statistics & numerical data; Gene Expression Profiling/statistics & numerical data; Genetic Association Studies/statistics & numerical data; Genetic Markers; Humans; Mice; MicroRNAs/genetics; Tissue Distribution
14.
Brief Bioinform ; 20(1): 102-109, 2019 01 18.
Article in English | MEDLINE | ID: mdl-28968662

ABSTRACT

Adenosine-to-inosine (A-to-I) editing by adenosine deaminase acting on RNA (ADAR) proteins is one of the most frequent modifications during post- and co-transcription. To facilitate the assignment of biological functions to specific editing sites, we designed an automatic online platform to annotate A-to-I RNA editing sites in pre-mRNA splicing signals, microRNAs (miRNAs) and miRNA target untranslated regions (3' UTRs) from human (Homo sapiens) high-throughput sequencing data and predict their effects based on large-scale bioinformatic analysis. After analysing a large number of previously reported RNA editing events and high-throughput RNA sequencing data from normal human tissues, >60 000 potentially effective RNA editing events on functional genes were found. The RNA Editing Plus platform is available for free at https://www.rnaeditplus.org/, and we believe our platform, which incorporates multiple optimized methods, will improve further studies of A-to-I-induced editing post-transcriptional regulation.


Subjects
Adenosine Deaminase/metabolism; RNA Editing; RNA-Binding Proteins/metabolism; Software; 3' Untranslated Regions; Adenosine/genetics; Adenosine/metabolism; Alternative Splicing/genetics; Base Sequence; Computational Biology; Databases, Nucleic Acid/statistics & numerical data; Female; Gene Ontology; Humans; Inosine/genetics; Inosine/metabolism; Male; MicroRNAs/genetics; MicroRNAs/metabolism; Mutation, Missense; RNA Editing/genetics; RNA Precursors/genetics; RNA Precursors/metabolism; Sequence Analysis, RNA/statistics & numerical data; Support Vector Machine; Tissue Distribution
15.
Brief Bioinform ; 20(1): 144-155, 2019 01 18.
Article in English | MEDLINE | ID: mdl-28968766

ABSTRACT

Ribosome profiling is emerging as a powerful technique that enables genome-wide investigation of in vivo translation at sub-codon resolution. The increasing application of ribosome profiling in recent years has achieved remarkable progress toward understanding the composition, regulation and mechanism of translation. This benefits from not only the awesome power of ribosome profiling but also an extensive range of computational resources available for ribosome profiling. At present, however, a comprehensive review on these resources is still lacking. Here, we survey the recent computational advances guided by ribosome profiling, with a focus on databases, Web servers and software tools for storing, visualizing and analyzing ribosome profiling data. This review is intended to provide experimental and computational biologists with a reference to make appropriate choices among existing resources for the question at hand.


Subjects
Computational Biology/methods; Gene Expression Profiling/statistics & numerical data; Ribosomes/genetics; Ribosomes/metabolism; Software; Algorithms; Animals; Computational Biology/trends; Databases, Genetic/statistics & numerical data; Databases, Nucleic Acid/statistics & numerical data; Humans; Internet; Protein Biosynthesis; Sequence Analysis, RNA/statistics & numerical data
16.
Brief Bioinform ; 20(1): 58-65, 2019 01 18.
Article in English | MEDLINE | ID: mdl-28968841

ABSTRACT

Circular RNAs exist widely in eukaryotes. However, there is as yet no tissue-specific Arabidopsis circular RNA database, which hinders the study of circular RNA in plants. Here, we used 622 Arabidopsis RNA sequencing data sets from 87 independent studies hosted at NCBI SRA and developed AtCircDB to systematically identify, store and retrieve circular RNAs. By analyzing back-splicing sites, we characterized 84 685 circular RNAs, 30 648 tissue-specific circular RNAs and 3486 microRNA-circular RNA interactions. In addition, we used a metric (detection score) to measure the detection ability of the circular RNAs using a big-data approach. By experimental validation, we demonstrate that this metric improves the accuracy of the detection algorithm. We also defined the regions hosting enriched circular RNAs as super circular RNA regions. The results suggest that these regions are highly related to alternative splicing and chloroplasts. Finally, we developed a comprehensive tissue-specific database (AtCircDB) to help the community store, retrieve, visualize and download Arabidopsis circular RNAs. This database will greatly expand our understanding of circular RNAs and their related regulatory networks. AtCircDB is freely available at http://genome.sdau.edu.cn/circRNA.


Subjects
Arabidopsis/genetics; Databases, Nucleic Acid/statistics & numerical data; RNA, Plant/genetics; RNA/genetics; Algorithms; Big Data; Computational Biology; Internet; MicroRNAs/genetics; Sequence Analysis, RNA/statistics & numerical data; Tissue Distribution/genetics; User-Computer Interface
17.
PLoS Comput Biol ; 14(12): e1006651, 2018 12.
Article in English | MEDLINE | ID: mdl-30532261

ABSTRACT

An expanded chemical space is essential for improved identification of small molecules for emerging therapeutic targets. However, the identification of targets for novel compounds is biased towards the synthesis of known scaffolds that bind familiar protein families, limiting the exploration of chemical space. To change this paradigm, we validated a new pipeline that identifies small molecule-protein interactions and works even for compounds lacking similarity to known drugs. Based on differential mRNA profiles in multiple cell types exposed to drugs and in which gene knockdowns (KD) were conducted, we showed that drugs induce gene regulatory networks that correlate with those produced after silencing protein-coding genes. Next, we applied supervised machine learning to exploit drug-KD signature correlations and enriched our predictions using an orthogonal structure-based screen. As a proof-of-principle for this regimen, top-10/top-100 target prediction accuracies of 26% and 41%, respectively, were achieved on a validation set of 152 FDA-approved drugs and 3104 potential targets. We then predicted targets for 1680 compounds and validated chemical interactors with four targets that have proven difficult to chemically modulate, including non-covalent inhibitors of HRAS and KRAS. Importantly, drug-target interactions manifest as gene expression correlations between drug treatment and both target gene KD and KD of genes that act up- or down-stream of the target, even for relatively weak binders. These correlations provide new insights on the cellular response of disrupting protein interactions and highlight the complex genetic phenotypes of drug treatment. With further refinement, our pipeline may accelerate the identification and development of novel chemical classes by screening compound-target interactions.


Subjects
Drug Discovery/methods ; Gene Expression Profiling/methods ; Proteins/chemistry ; Proteins/drug effects ; Cell Line ; Computational Biology ; Computer Simulation ; Databases, Nucleic Acid/statistics & numerical data ; Drug Discovery/statistics & numerical data ; Drug Evaluation, Preclinical/methods ; Drug Evaluation, Preclinical/statistics & numerical data ; Gene Expression Profiling/statistics & numerical data ; Gene Knockdown Techniques ; Gene Ontology ; Gene Regulatory Networks/drug effects ; Humans ; Models, Molecular ; Molecular Docking Simulation ; Protein Kinase Inhibitors/chemistry ; Protein Kinase Inhibitors/pharmacology ; Proteins/genetics ; Ubiquitin-Protein Ligases/antagonists & inhibitors ; Ubiquitin-Protein Ligases/chemistry ; Ubiquitin-Protein Ligases/genetics ; Wortmannin/chemistry ; Wortmannin/pharmacology ; ras Proteins/antagonists & inhibitors ; ras Proteins/chemistry ; ras Proteins/genetics
18.
Math Biosci ; 306: 1-9, 2018 12.
Article in English | MEDLINE | ID: mdl-30336146

ABSTRACT

The last few decades have verified the vital roles of microRNAs in the development of human diseases and witnessed increasing interest in the prediction of potential disease-miRNA associations. Owing to the open access of many miRNA-related databases, various feasible in silico models have been proposed up until recently. In this work, we developed a computational model of Maximal Entropy Random Walk on heterogenous network for MiRNA-disease Association prediction (MERWMDA). MERWMDA integrated known disease-miRNA associations, pair-wise functional relations of miRNAs and pair-wise semantic relations of diseases into an information-rich heterogenous network comprised of disease and miRNA nodes. As a widely applied biased walk process with more randomness, MERW was then implemented on the heterogenous network to reveal potential disease-miRNA associations. Cross validation was further performed to evaluate the performance of MERWMDA. As a result, MERWMDA obtained AUCs of 0.8966 and 0.8491 in global and local leave-one-out cross validation, respectively. What's more, three different case study strategies on four complex human diseases were conducted to comprehensively assess the quality of the model. Specifically, one kind of case study, on esophageal cancer and prostate cancer, was conducted based on the HMDD v2.0 database; 94% and 88% of the top 50 ranked miRNAs were confirmed by recent literature, respectively. To simulate a new disease without known related miRNAs, lung cancer (confirmed ratio 94%) associated miRNAs were removed for a case study. Lymphoma (verified ratio 88%) was adopted to assess the prediction robustness of MERWMDA based on the HMDD v1.0 database. We anticipated that MERWMDA could offer valuable candidates for in vitro biomedical experiments in future.
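
The maximal entropy random walk (MERW) named in this abstract can be sketched with its standard textbook definition. The sketch assumes a symmetric, non-negative, connected adjacency matrix A with largest eigenvalue λ and principal eigenvector ψ, giving transition probabilities P_ij = A_ij ψ_j / (λ ψ_i). The four-node toy network and all function names are hypothetical; how MERWMDA actually builds and weights its heterogenous network is not specified in the abstract.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def principal_eigenpair(adj: Matrix, iters: int = 1000,
                        tol: float = 1e-12) -> Tuple[float, List[float]]:
    """Largest eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    n = len(adj)
    vec = [1.0 / n ** 0.5] * n
    for _ in range(iters):
        # Iterate with A + I so that bipartite graphs, whose spectrum
        # contains both +lambda and -lambda, still converge.
        nxt = [vec[i] + sum(adj[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        nxt = [x / norm for x in nxt]
        delta = max(abs(a - b) for a, b in zip(nxt, vec))
        vec = nxt
        if delta < tol:
            break
    # Rayleigh quotient of the unit-length vector gives the eigenvalue of A.
    lam = sum(vec[i] * sum(adj[i][j] * vec[j] for j in range(n))
              for i in range(n))
    return lam, vec


def merw_transition(adj: Matrix) -> Matrix:
    """MERW transition matrix P_ij = A_ij * psi_j / (lambda * psi_i)."""
    lam, psi = principal_eigenpair(adj)
    n = len(adj)
    return [[adj[i][j] * psi[j] / (lam * psi[i]) for j in range(n)]
            for i in range(n)]


# Hypothetical toy network: a 4-node chain (e.g. miRNA - miRNA - disease - disease).
toy_adj = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
toy_lambda, toy_psi = principal_eigenpair(toy_adj)
toy_P = merw_transition(toy_adj)
```

On this chain an ordinary random walk leaves node 1 toward either neighbour with probability 0.5, whereas MERW favours the better-connected interior node, which is the sense in which the walk is biased.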


Subjects
Genetic Predisposition to Disease ; MicroRNAs/genetics ; MicroRNAs/metabolism ; Neoplasms/genetics ; Computational Biology ; Computer Simulation ; Databases, Nucleic Acid/statistics & numerical data ; Entropy ; Female ; Gene Regulatory Networks ; Humans ; Male ; Models, Genetic ; Predictive Value of Tests
19.
J Biomed Inform ; 85: 80-92, 2018 09.
Article in English | MEDLINE | ID: mdl-30041017

ABSTRACT

With the surge of next-generation high-throughput technologies, RNA-seq data is playing an increasingly important role in disease diagnosis, in which normalization is assumed to be an essential procedure to produce comparable samples. Recent studies have seen different normalization methods proposed to remove various technical biases in RNA sequencing. However, there are no previous studies evaluating the impacts of normalization on RNA-seq disease diagnosis. In this study, we investigate this problem by analyzing structured big data: RNA-seq data acquired from the TCGA portal, chosen for its popularity in RNA-seq disease diagnosis. We propose a novel normalization effect test algorithm, diagnostic index (d-index), and data entropy to analyze and evaluate the impacts of normalization on RNA-seq disease diagnosis by using state-of-the-art machine learning models. Furthermore, we present an original visualization analysis to compare the performance of normalized data versus raw data. We have found that normalized data generally yields diagnostic performance equivalent to or even lower than that of its raw data. Moreover, some normalization approaches (e.g. RPKM) even have negative effects on disease diagnosis. On the other hand, raw data seems to have the potential to decipher pathological status better than, or at least comparably to, normalized data. Our visualization analysis also shows that some normalization methods even introduce 'outliers', which unavoidably decrease sample detectability in diagnosis. More importantly, our data entropy analysis shows that normalized data usually demonstrates entropy values equivalent to or lower than those of raw data. Data with high entropy values tend to achieve better diagnosis than data with low entropy values. In addition, we found that high-dimensional imbalance (HDI) data is unaffected by any normalization procedure in diagnosis, and causes almost all machine learning models to fail by recognizing only majority types, whether the data are raw or normalized.
Our results suggest that normalized data may not demonstrate statistically significant advantages in disease diagnosis over its raw form. They further imply that normalization may not be an indispensable procedure in RNA-seq disease diagnosis, or at least that some normalization processes may not be. Instead, raw data may perform better at capturing more original transcriptome patterns in different pathological conditions.
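
For readers unfamiliar with RPKM, the normalization singled out in this abstract, the following is a minimal sketch of the standard definition: reads per kilobase of transcript per million mapped reads, i.e. count × 10^9 / (total mapped reads × gene length in bp). The gene names and counts are hypothetical, and the study's own preprocessing of TCGA data is not described in the abstract.

```python
from typing import Dict


def rpkm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """RPKM per gene: count * 1e9 / (total mapped reads * gene length in bp)."""
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("sample has no mapped reads")
    return {
        gene: count * 1e9 / (total_reads * lengths_bp[gene])
        for gene, count in counts.items()
    }


# Hypothetical sample: equal raw counts, but gene_b is twice as long as gene_a.
toy_counts = {"gene_a": 500, "gene_b": 500}
toy_lengths = {"gene_a": 1000, "gene_b": 2000}
toy_rpkm = rpkm(toy_counts, toy_lengths)
```

Two genes with identical raw counts end up with different normalized values once length is taken into account, so the feature values a classifier sees change under normalization.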


Subjects
Diagnostic Techniques and Procedures/statistics & numerical data ; Disease/genetics ; High-Throughput Nucleotide Sequencing/statistics & numerical data ; Sequence Analysis, RNA/statistics & numerical data ; Algorithms ; Big Data ; Computational Biology ; Databases, Nucleic Acid/statistics & numerical data ; Gene Expression Profiling/statistics & numerical data ; Humans ; Machine Learning ; Models, Statistical ; Support Vector Machine
20.
Nat Commun ; 9(1): 1612, 2018 04 24.
Article in English | MEDLINE | ID: mdl-29691392

ABSTRACT

Protein-truncating variants can have profound effects on gene function and are critical for clinical genome interpretation and generating therapeutic hypotheses, but their relevance to medical phenotypes has not been systematically assessed. Here, we characterize the effect of 18,228 protein-truncating variants across 135 phenotypes from the UK Biobank and find 27 associations between medical phenotypes and protein-truncating variants in genes outside the major histocompatibility complex. We perform phenome-wide analyses and directly measure the effect in homozygous carriers, commonly referred to as "human knockouts," across medical phenotypes for genes implicated as being protective against disease or associated with at least one phenotype in our study. We find several genes with strong pleiotropic or non-additive effects. Our results illustrate the importance of protein-truncating variants in a variety of diseases.


Subjects
Databases, Nucleic Acid ; Proteins/genetics ; Sequence Deletion ; Databases, Nucleic Acid/statistics & numerical data ; Genome-Wide Association Study ; Humans ; Phenotype ; United Kingdom