Results 1 - 20 of 26
1.
PLoS One ; 17(4): e0267291, 2022.
Article in English | MEDLINE | ID: mdl-35476804

ABSTRACT

BACKGROUND: MicroRNAs (miRNAs) are frequently deregulated in various types of cancer. While antisense oligonucleotides are used to block oncomiRs, delivery of tumour-suppressive miRNAs holds great potential as a potent anti-cancer strategy. Here, we aim to determine, and functionally analyse, miRNAs that are lowly expressed in various types of tumour but abundantly expressed in multiple normal tissues. METHODS: The miRNA sequencing data of 14 cancer types were downloaded from the TCGA dataset. Significant differences in miRNA expression between tumor and normal samples were calculated using limma package (R programming). An adjusted p value < 0.05 was used to compare normal versus tumor miRNA expression profiles. The predicted gene targets were obtained using TargetScan, miRanda, and miRDB and then subjected to gene ontology analysis using Enrichr. Only GO terms with an adjusted p < 0.05 were considered statistically significant. All data from wet-lab experiments (cell viability assays and flow cytometry) were expressed as means ± SEM, and their differences were analyzed using GraphPad Prism software (Student's t test, p < 0.05). RESULTS: By compiling all publicly available miRNA profiling data from The Cancer Genome Atlas (TCGA) Pan-Cancer Project, we reveal a small set of tumour-suppressing miRNAs (which we designate as 'normomiRs') that are highly expressed in 14 types of normal tissues but poorly expressed in corresponding tumour tissues. Interestingly, muscle-enriched miRNAs (e.g. miR-133a/b and miR-206) and miRNAs from DLK1-DIO3 locus (e.g. miR-381 and miR-411) constitute a large fraction of the normomiRs. Moreover, we define that the CCCGU motif is absent in the oncomiRs' seed sequences but present in a fraction of tumour-suppressive miRNAs. Finally, the gain of function of candidate normomiRs across several cancer cell types indicates that miR-206 and miR-381 exert the most potent inhibition on multiple cancer types in vitro. 
CONCLUSION: Our results reveal a pan-cancer set of tumour-suppressing miRNAs and highlight the potential of miRNA-replacement therapies for targeting multiple types of tumour.
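
The adjusted p < 0.05 cutoff described in the METHODS can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg procedure; the abstract does not name the adjustment method, and the study itself used the limma package in R rather than this code. The miRNA labels and p-values below are hypothetical placeholders, not data from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant(names, p_values, alpha=0.05):
    """Keep the names whose adjusted p-value is below alpha."""
    adjusted = benjamini_hochberg(p_values)
    return [name for name, q in zip(names, adjusted) if q < alpha]


# Hypothetical labels and raw p-values, for illustration only.
names = ["mirna_a", "mirna_b", "mirna_c", "mirna_d"]
raw_p = [0.01, 0.04, 0.03, 0.20]
print(significant(names, raw_p))
```

Note that a raw p-value below 0.05 (such as 0.04 or 0.03 in the placeholder input) can fail the cutoff once it is adjusted for the number of tests.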


Subjects
MicroRNAs , Neoplasms , Databases, Factual , Gene Expression Regulation, Neoplastic , Gene Ontology , Humans , MicroRNAs/genetics , MicroRNAs/metabolism , Neoplasms/genetics
2.
J Comput Biol ; 27(3): 436-439, 2020 03.
Article in English | MEDLINE | ID: mdl-32160033

ABSTRACT

Graph Traversal Edit Distance (GTED) is a measure of distance (or dissimilarity) between two graphs. This measure is based on the minimum edit distance between two strings formed by the edge labels of respective Eulerian traversals of the two graphs. GTED was motivated by and provides the first mathematical formalism for sequence coassembly and de novo variation detection in bioinformatics. Many problems in applied machine learning deal with graphs (also called networks), including social networks, security, web data mining, protein function prediction, and genome informatics. The kernel paradigm beautifully decouples the learning algorithm from the underlying geometric space, which renders graph kernels important for the aforementioned applications. In this article, we introduce a tool, PyGTED, to compute GTED. It implements the polynomial time algorithm devised for GTED by the authors.


Subjects
Computational Biology/methods , Data Mining , Machine Learning , Programming, Linear , Software
3.
J Comput Biol ; 27(3): 317-329, 2020 03.
Article in English | MEDLINE | ID: mdl-32058803

ABSTRACT

Many problems in applied machine learning deal with graphs (also called networks), including social networks, security, web data mining, protein function prediction, and genome informatics. The kernel paradigm beautifully decouples the learning algorithm from the underlying geometric space, which renders graph kernels important for the aforementioned applications. In this article, we give a new graph kernel, which we call graph traversal edit distance (GTED). We introduce the GTED problem and give the first polynomial time algorithm for it. Informally, the GTED is the minimum edit distance between two strings formed by the edge labels of respective Eulerian traversals of the two graphs. Also, GTED is motivated by and provides the first mathematical formalism for sequence co-assembly and de novo variation detection in bioinformatics. We demonstrate that GTED admits a polynomial time algorithm using a linear program in the graph product space that is guaranteed to yield an integer solution. To the best of our knowledge, this is the first approach to this problem. We also give a linear programming relaxation algorithm for a lower bound on GTED. We use GTED as a graph kernel and evaluate it by computing the accuracy of a support vector machine (SVM) classifier on a few data sets in the literature. Our results suggest that our kernel outperforms many of the common graph kernels in the tested data sets. As a second set of experiments, we successfully cluster viral genomes using GTED on their assembly graphs obtained from de novo assembly of next-generation sequencing reads.
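
The informal definition of GTED can be made concrete with a brute-force sketch for tiny directed graphs. This is not the linear programming algorithm from the article: it enumerates every edge ordering, which is exponential, and it treats a traversal as any trail that uses every edge exactly once, which is an assumption of this sketch. The function names and the edge-tuple representation are hypothetical.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance between two sequences with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # match or substitute
        prev = curr
    return prev[-1]


def eulerian_label_strings(edges):
    """Yield the label string of every Eulerian trail of a tiny directed graph.

    edges is a list of (source, target, label) tuples. Every ordering of the
    edges is tried, so this is only usable for a handful of edges.
    """
    seen = set()
    for order in permutations(edges):
        connected = all(order[i][1] == order[i + 1][0]
                        for i in range(len(order) - 1))
        if connected:
            labels = "".join(label for _, _, label in order)
            if labels not in seen:
                seen.add(labels)
                yield labels


def gted_bruteforce(edges1, edges2):
    """Minimum edit distance over all pairs of Eulerian label strings."""
    best = None
    for s in eulerian_label_strings(edges1):
        for t in eulerian_label_strings(edges2):
            d = edit_distance(s, t)
            if best is None or d < best:
                best = d
    return best  # None if either graph has no Eulerian trail


# Two 3-cycles whose edge labels differ in a single edge.
g1 = [(0, 1, "A"), (1, 2, "C"), (2, 0, "G")]
g2 = [(0, 1, "A"), (1, 2, "T"), (2, 0, "G")]
print(gted_bruteforce(g1, g2))
```

The factorial cost of this enumeration is the reason a polynomial time formulation matters for graphs of realistic size.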


Subjects
Computational Biology/methods , Programming, Linear , Algorithms , Animals , Data Mining , High-Throughput Nucleotide Sequencing , Humans , Support Vector Machine
4.
Biol Reprod ; 102(1): 156-169, 2020 02 12.
Article in English | MEDLINE | ID: mdl-31504222

ABSTRACT

Gonadotropes represent approximately 5-15% of the total endocrine cell population in the mammalian anterior pituitary. Therefore, the effects of experimental manipulation on virtually any parameter of gonadotrope biology are difficult to detect and parse from background noise. In non-rodent species, applying techniques such as high-throughput ribonucleic acid (RNA) sequencing is problematic due to difficulty in isolating and analyzing individual endocrine cell populations. Herein, we exploited cell-specific properties inherent to the proximal promoter of the human glycoprotein hormone alpha subunit gene (CGA) to genetically target the expression of a fluorescent reporter (green fluorescent protein [GFP]) selectively to ovine gonadotropes. Dissociated ovine pituitary cells were cultured and infected with an adenoviral reporter vector (Ad-hαCGA-eGFP). We established efficient gene targeting by successfully enriching dispersed GFP-positive cells with flow cytometry. Confirming enrichment of gonadotropes specifically, we detected elevated levels of luteinizing hormone (LH) but not thyrotropin-stimulating hormone (TSH) in GFP-positive cell populations compared to GFP-negative populations. Subsequently, we used next-generation sequencing to obtain the transcriptional profile of GFP-positive ovine gonadotropes in the presence or absence of estradiol 17-beta (E2), a key modulator of gonadotrope function. Compared to non-sorted cells, enriched GFP-positive cells revealed a distinct transcriptional profile consistent with established patterns of gonadotrope gene expression. Importantly, we also detected nearly 200 E2-responsive genes in enriched gonadotropes, which were not apparent in parallel experiments on non-enriched cell populations. From these data, we conclude that CGA-targeted adenoviral gene transfer is an effective means for selectively labeling and enriching ovine gonadotropes suitable for investigation by numerous experimental approaches.


Subjects
Estradiol/pharmacology , Gonadotrophs/drug effects , Pituitary Gland, Anterior/drug effects , Adenoviridae , Animals , Gonadotrophs/metabolism , Luteinizing Hormone/metabolism , Pituitary Gland, Anterior/metabolism , Sheep , Thyrotropin/metabolism
5.
Sci Rep ; 9(1): 16526, 2019 11 11.
Article in English | MEDLINE | ID: mdl-31712594

ABSTRACT

Despite great advances, molecular cancer pathology is often limited to the use of a small number of biomarkers rather than the whole transcriptome, partly due to computational challenges. Here, we introduce a novel architecture of Deep Neural Networks (DNNs) that is capable of simultaneous inference of various properties of biological samples, through multi-task and transfer learning. It encodes the whole transcription profile into a strikingly low-dimensional latent vector of size 8, and then recovers mRNA and miRNA expression profiles, tissue and disease type from this vector. This latent space is significantly better than the original gene expression profiles for discriminating samples based on their tissue and disease. We employed this architecture on mRNA transcription profiles of 10750 clinical samples from 34 classes (one healthy and 33 different types of cancer) from 27 tissues. Our method significantly outperforms prior works and classical machine learning approaches in predicting tissue-of-origin, normal or disease state and cancer type of each sample. For tissues with more than one type of cancer, it reaches 99.4% accuracy in identifying the correct cancer subtype. We also show this system is very robust against noise and missing values. Collectively, our results highlight applications of artificial intelligence in molecular cancer pathology and oncological research. DeePathology is freely available at https://github.com/SharifBioinf/DeePathology .


Subjects
Computational Biology , Deep Learning , Gene Expression Profiling , Neoplasms/genetics , Neoplasms/pathology , Transcriptome , Algorithms , Computational Biology/methods , Data Mining , Disease Susceptibility , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , Humans , Neural Networks, Computer , Organ Specificity/genetics , Pathology, Molecular/methods , Reproducibility of Results
6.
Sci Rep ; 9(1): 2342, 2019 02 20.
Article in English | MEDLINE | ID: mdl-30787315

ABSTRACT

Understanding cell identity is an important task in many biomedical areas. Expression patterns of specific marker genes have been used to characterize some limited cell types, but exclusive markers are not available for many cell types. A second approach is to use machine learning to discriminate cell types based on the whole gene expression profiles (GEPs). The accuracies of simple classification algorithms such as linear discriminators or support vector machines are limited due to the complexity of biological systems. We used deep neural networks to analyze 1040 GEPs from 16 different human tissues and cell types. After comparing different architectures, we identified a specific structure of deep autoencoders that can encode a GEP into a vector of 30 numeric values, which we call the cell identity code (CIC). The original GEP can be reproduced from the CIC with an accuracy comparable to technical replicates of the same experiment. Although we use an unsupervised approach to train the autoencoder, we show different values of the CIC are connected to different biological aspects of the cell, such as different pathways or biological processes. This network can use CIC to reproduce the GEP of the cell types it has never seen during the training. It also can resist some noise in the measurement of the GEP. Furthermore, we introduce classifier autoencoder, an architecture that can accurately identify cell type based on the GEP or the CIC.


Subjects
Cells/metabolism , Deep Learning , Gene Expression Profiling , Neural Networks, Computer , Algorithms , Cell Compartmentation , Humans , Organ Specificity/genetics
7.
Zoological Lett ; 4: 24, 2018.
Article in English | MEDLINE | ID: mdl-30181897

ABSTRACT

BACKGROUND: Planarians are non-parasitic Platyhelminthes (flatworms) famous for their regeneration ability and for having a well-organized brain. Dugesia japonica is a typical planarian species that is widely distributed in East Asia. Extensive cellular and molecular experimental methods have been developed to identify the functions of thousands of genes in this species, making this planarian a good experimental model for regeneration biology and neurobiology. However, no genome-level information is available for D. japonica, and few gene regulatory networks have been identified thus far. RESULTS: To obtain whole-genome information on this species and to study its gene regulatory networks, we extracted genomic DNA from 200 planarians derived from a laboratory-bred asexual clonal strain, and sequenced 476 Gb of data by second-generation sequencing. K-mer frequency graphing and fosmid sequence analysis indicated a complex genome that would be difficult to assemble using second-generation sequencing short reads. To address this challenge, we developed a new assembly strategy and improved the de novo genome assembly, producing a 1.56 Gb genome sequence (DjGenome ver1.0, including 202,925 scaffolds and N50 length 27,741 bp) that covers 99.4% of all 19,543 genes in the assembled transcriptome, although the genome is fragmented as 80% of the genome consists of repeated sequences (genomic frequency ≥ 2). By genome comparison between two planarian genera, we identified conserved non-coding elements (CNEs), which are indicative of gene regulatory elements. Transgenic experiments using Xenopus laevis indicated that one of the CNEs in the Djndk gene may be a regulatory element, suggesting that the regulation of the ndk gene and the brain formation mechanism may be conserved between vertebrates and invertebrates. CONCLUSION: This draft genome and CNE analysis will contribute to resolving gene regulatory networks in planarians. The genome database is available at: http://www.planarian.jp.
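
The N50 length quoted for the assembly is a standard contiguity statistic: the length of the scaffold at which the cumulative length, counted from the longest scaffold down, first reaches half of the total assembly length. A minimal sketch of that calculation follows; the function name and the example lengths are hypothetical and are not taken from DjGenome.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths."""
    total = sum(lengths)
    running = 0
    # Accumulate from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths in base pairs, for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

A fragmented assembly with many short scaffolds yields a low N50 even when the total assembled length is large.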

8.
BMC Genomics ; 18(1): 964, 2017 Dec 12.
Article in English | MEDLINE | ID: mdl-29233090

ABSTRACT

BACKGROUND: DNA methylation at promoters is largely correlated with inhibition of gene expression. However, the role of DNA methylation at enhancers is not fully understood, although a crosstalk with chromatin marks is expected. Actually, there exist contradictory reports about positive and negative correlations between DNA methylation and H3K4me1, a chromatin hallmark of enhancers. RESULTS: We investigated the relationship between DNA methylation and active chromatin marks through genome-wide correlations, and found anti-correlation between H3K4me1 and H3K4me3 enrichment at low and intermediate DNA methylation loci. We hypothesized "seesaw" dynamics between H3K4me1 and H3K4me3 in the low and intermediate DNA methylation range, in which DNA methylation discriminates between enhancers and promoters, marked by H3K4me1 and H3K4me3, respectively. Low methylated regions are H3K4me3 enriched, while those with intermediate DNA methylation levels are progressively H3K4me1 enriched. Additionally, the enrichment of H3K27ac, distinguishing active from primed enhancers, follows a plateau in the lower range of the intermediate DNA methylation level, corresponding to active enhancers, and decreases linearly in the higher range of the intermediate DNA methylation. Thus, the decrease of the DNA methylation switches smoothly the state of the enhancers from a primed to an active state. We summarize these observations into a rule of thumb of one-out-of-three methylation marks: "In each genomic region only one out of these three methylation marks {DNA methylation, H3K4me1, H3K4me3} is high. If it is the DNA methylation, the region is inactive. If it is H3K4me1, the region is an enhancer, and if it is H3K4me3, the region is a promoter". To test our model, we used available genome-wide datasets of H3K4 methyltransferases knockouts. 
Our analysis suggests that CXXC proteins, as readers of non-methylated CpGs, would regulate the "seesaw" mechanism that focuses H3K4me3 to unmethylated sites, while being repulsed from H3K4me1 decorated enhancers and CpG island shores. CONCLUSIONS: Our results show that DNA methylation discriminates promoters from enhancers through an H3K4me1-H3K4me3 seesaw mechanism, and suggest its possible function in the inheritance of chromatin marks after cell division. Our analyses suggest aberrant formation of promoter-like regions and ectopic transcription of hypomethylated regions of DNA. Such a mechanism can have important implications in biological processes in which abnormal DNA methylation status has been reported, such as cancer and aging.


Subjects
DNA Methylation , Enhancer Elements, Genetic , Histone Code , Promoter Regions, Genetic , Animals , Cytosine/metabolism , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/metabolism , Gene Expression , Histones/metabolism , Mice , Protein Domains
9.
Sci Rep ; 7(1): 6894, 2017 07 31.
Article in English | MEDLINE | ID: mdl-28761171

ABSTRACT

In budding yeast, the 3' end processing of mRNA and the coupled termination of transcription by RNAPII requires the CF IA complex. We have earlier demonstrated a role for the Clp1 subunit of this complex in termination and promoter-associated transcription of CHA1. To assess the generality of the observed function of Clp1 in transcription, we tested the effect of Clp1 on transcription on a genomewide scale using the Global Run-On-Seq (GRO-Seq) approach. GRO-Seq analysis showed the polymerase reading through the termination signal in the downstream region of highly transcribed genes in a temperature-sensitive mutant of Clp1 at elevated temperature. No such terminator readthrough was observed in the mutant at the permissive temperature. The poly(A)-independent termination of transcription of snoRNAs, however, remained unaffected in the absence of Clp1 activity. These results strongly suggest a role for Clp1 in poly(A)-coupled termination of transcription. Furthermore, the density of antisense transcribing polymerase upstream of the promoter region exhibited an increase in the absence of Clp1 activity, thus implicating Clp1 in promoter directionality. The overall conclusion of these results is that Clp1 plays a general role in poly(A)-coupled termination of RNAPII transcription and in enhancing promoter directionality in budding yeast.


Subjects
RNA, Messenger/metabolism , Saccharomycetales/metabolism , mRNA Cleavage and Polyadenylation Factors/genetics , mRNA Cleavage and Polyadenylation Factors/metabolism , Fungal Proteins/genetics , Fungal Proteins/metabolism , Mutation , Polyadenylation , Promoter Regions, Genetic , RNA Polymerase II/metabolism , RNA, Fungal/genetics , RNA, Fungal/metabolism , RNA, Messenger/genetics , Saccharomycetales/genetics , Sequence Analysis, RNA , Transcription, Genetic
10.
Sci Rep ; 7(1): 3666, 2017 06 16.
Article in English | MEDLINE | ID: mdl-28623339

ABSTRACT

The human protein disulfide isomerase (hPDI) is an essential four-domain multifunctional enzyme. As a result of disulfide shuffling in its terminal domains, hPDI exists in two oxidation states with different conformational preferences which are important for substrate binding and functional activities. Here, we address the redox-dependent conformational dynamics of hPDI through molecular dynamics (MD) simulations. Collective domain motions are identified by the principal component analysis of MD trajectories and redox-dependent opening-closing structure variations are highlighted on projected free energy landscapes. Then, important structural features that exhibit considerable differences in dynamics of redox states are extracted by statistical machine learning methods. Mapping the structural variations to time series of residue interaction networks also provides a holistic representation of the dynamical redox differences. With an emphasis on persistent, long-lasting interactions, we propose an approach that compiles these time series networks into a single dynamic residue interaction network (DRIN). Differential comparison of DRIN in oxidized and reduced states reveals chains of residue interactions that represent potential allosteric paths between catalytic and ligand binding sites of hPDI.


Subjects
Machine Learning , Molecular Dynamics Simulation , Protein Disulfide-Isomerases/chemistry , Protein Interaction Domains and Motifs , Humans , Oxidation-Reduction , Protein Conformation , Protein Disulfide-Isomerases/metabolism , Protein Interaction Mapping , Protein Interaction Maps
11.
J Biomed Semantics ; 8(1): 14, 2017 Apr 07.
Article in English | MEDLINE | ID: mdl-28388928

ABSTRACT

BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are among the most important types of genetic variations influencing common diseases and phenotypes. Recently, some corpora and methods have been developed with the purpose of extracting mutations and diseases from texts. However, there is no available corpus for extracting associations from texts that is annotated with linguistic-based negation, modality markers, neutral candidates, and confidence level of associations. METHOD: In this research, different steps were presented so as to produce the SNPPhenA corpus. They include automatic Named Entity Recognition (NER) followed by the manual annotation of SNP and phenotype names, annotation of the SNP-phenotype associations and their level of confidence, as well as modality markers. Moreover, the produced corpus was annotated with negation scopes and cues as well as neutral candidates, which play a crucial role in handling negation and modality in extraction tasks. RESULT: The agreement between annotators was measured by Cohen's Kappa coefficient, and the resulting scores indicated the reliability of the corpus. The Kappa score was 0.79 for annotating the associations and 0.80 for the confidence degree of associations. Further presented were the basic statistics of the annotated features of the corpus in addition to the results of our first experiments related to the extraction of ranked SNP-Phenotype associations. The prepared guideline documents make the corpus easier to use. The corpus, guidelines and inter-annotator agreement analysis are available on the website of the corpus: http://nil.fdi.ucm.es/?q=node/639 . CONCLUSION: Specifying the confidence degree of SNP-phenotype associations from articles helps identify the strength of associations that could in turn assist genomics scientists in determining phenotypic plasticity and the importance of environmental factors. What is more, our first experiments with the corpus show that linguistic-based confidence alongside other non-linguistic features can be utilized in order to estimate the strength of the observed SNP-phenotype associations. TRIAL REGISTRATION: Not Applicable.
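
Cohen's Kappa, used above to measure inter-annotator agreement, compares the observed agreement with the agreement expected by chance from each annotator's label frequencies. A minimal sketch for two annotators follows. The function name and the example labels are hypothetical and do not come from the SNPPhenA corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement from the product of the marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations: "y" = association present, "n" = absent.
annotator_1 = ["y", "y", "y", "y", "y", "n", "n", "n", "n", "n"]
annotator_2 = ["y", "y", "y", "y", "n", "n", "n", "n", "n", "y"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

In this placeholder input the annotators agree on 8 of 10 items, yet kappa is lower than 0.8 because half of that agreement would be expected by chance.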


Subjects
Gene Ontology , Information Storage and Retrieval/methods , Phenotype , Polymorphism, Single Nucleotide , Mutation , Semantics
12.
PLoS One ; 11(10): e0163480, 2016.
Article in English | MEDLINE | ID: mdl-27695078

ABSTRACT

MOTIVATION: Supervised biomedical relation extraction plays an important role in biomedical natural language processing, endeavoring to obtain the relations between biomedical entities. Drug-drug interactions, which are investigated in the present paper, are notably among the critical biomedical relations. Thus far many methods have been developed with the aim of extracting DDI relations. However, unfortunately there has been a scarcity of comprehensive studies on the effects of negation, complex sentences, clause dependency, and neutral candidates in the course of DDI extraction from biomedical articles. RESULTS: Our study proposes clause dependency features and a number of features for identifying neutral candidates as well as negation cues and scopes. Furthermore, our experiments indicate that the proposed features significantly improve the performance of the relation extraction task combined with other kernel methods. We characterize the contribution of each category of features and finally conclude that neutral candidate features have the most prominent role among all of the three categories.


Subjects
Biomedical Research , Data Mining , Drug Interactions , Publications , Algorithms , Artificial Intelligence , Humans , Natural Language Processing
13.
Article in English | MEDLINE | ID: mdl-27243002

ABSTRACT

As the vast majority of all microbes are unculturable, single-cell sequencing has become a significant method to gain insight into microbial physiology. Single-cell sequencing methods, currently powered by multiple displacement genome amplification (MDA), have passed important milestones such as finishing and closing the genome of a prokaryote. However, the quality and reliability of genome assemblies from single cells are still unsatisfactory due to uneven coverage depth and the absence of scattered chunks of the genome in the final collection of reads caused by MDA bias. In this work, our new algorithm Hybrid De novo Assembler (HyDA) demonstrates the power of coassembly of multiple single-cell genomic data sets through significant improvement of the assembly quality in terms of predicted functional elements and length statistics. Coassemblies contain significantly more base pairs and protein coding genes, cover more subsystems, and consist of longer contigs compared to individual assemblies by the same algorithm as well as state-of-the-art single-cell assemblers SPAdes and IDBA-UD. Hybrid De novo Assembler (HyDA) is also able to avoid chimeric assemblies by detecting and separating shared and exclusive pieces of sequence for input data sets. By replacing one deep single-cell sequencing experiment with a few single-cell sequencing experiments of lower depth, the coassembly method can hedge against the risk of failure and loss of the sample, without significantly increasing sequencing cost. Application of the single-cell coassembler HyDA to the study of three uncultured members of an alkane-degrading methanogenic community validated the usefulness of the coassembly concept. HyDA is open source and publicly available at http://chitsazlab.org/software.html, and the raw reads are available at http://chitsazlab.org/research.html.

14.
BMC Syst Biol ; 9: 23, 2015 Jun 02.
Article in English | MEDLINE | ID: mdl-26033487

ABSTRACT

BACKGROUND: Understanding the mechanisms by which hundreds of diverse cell types develop from a single mammalian zygote has been a central challenge of developmental biology. Conrad H. Waddington, in his metaphoric "epigenetic landscape" visualized the early embryogenesis as a hierarchy of lineage bifurcations. In each bifurcation, a single progenitor cell type produces two different cell lineages. The tristable dynamical systems are used to model the lineage bifurcations. It is also shown that a genetic circuit consisting of two auto-activating transcription factors (TFs) with cross inhibitions can form a tristable dynamical system. RESULTS: We used gene expression profiles of pre-implantation mouse embryos at the single cell resolution to visualize the Waddington landscape of the early embryogenesis. For each lineage bifurcation we identified two clusters of TFs - rather than two single TFs as previously proposed - that had opposite expression patterns between the pair of bifurcated cell types. The regulatory circuitry among each pair of TF clusters resembled a genetic circuit of a pair of single TFs; it consisted of positive feedbacks among the TFs of the same cluster, and negative interactions among the members of the opposite clusters. Our analyses indicated that the tristable dynamical system of the two-cluster regulatory circuitry is more robust than the genetic circuit of two single TFs. CONCLUSIONS: We propose that a modular hierarchy of regulatory circuits, each consisting of two mutually inhibiting and auto-activating TF clusters, can form hierarchical lineage bifurcations with improved safeguarding of critical early embryogenesis against biological perturbations. Furthermore, our computationally fast framework for modeling and visualizing the epigenetic landscape can be used to obtain insights from experimental data of development at the single cell resolution.


Subjects
Embryonic Development , Models, Biological , Transcription Factors/metabolism , Animals , Blastocyst/cytology , Blastocyst/metabolism , Gene Expression Profiling , Mice , Single-Cell Analysis
15.
J Comput Biol ; 22(6): 463-73, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25756654

ABSTRACT

We consider the problem of exact learning of parameters of a linear RNA energy model from secondary structure data. A necessary and sufficient condition for learnability of parameters is derived, which is based on computing the convex hull of union of translated Newton polytopes of input sequences. The set of learned energy parameters is characterized as the convex cone generated by the normal vectors to those facets of the resulting polytope that are incident to the origin. In practice, the sufficient condition may not be satisfied by the entire training data set; hence, computing a maximal subset of training data for which the sufficient condition is satisfied is often desired. We show that the problem is NP-hard in general for an arbitrary dimensional feature space. Using a randomized greedy algorithm, we select a subset of RNA STRAND v2.0 database that satisfies the sufficient condition for separate A-U, C-G, G-U base pair counting model. The set of learned energy parameters includes experimentally measured energies of A-U, C-G, and G-U pairs; hence, our parameter set is in agreement with the Turner parameters.


Subjects
RNA/chemistry , Algorithms , Base Pairing , Databases, Nucleic Acid , Models, Molecular , Nucleic Acid Conformation , Thermodynamics
16.
BMC Bioinformatics ; 15 Suppl 9: S12, 2014.
Article in English | MEDLINE | ID: mdl-25252881

ABSTRACT

MOTIVATION: Although there are many different algorithms and software tools for aligning sequencing reads, fast gapped sequence search is far from solved. Strong interest in fast alignment is best reflected in the $10^6 prize for the Innocentive competition on aligning a collection of reads to a given database of reference genomes. In addition, de novo assembly of next-generation sequencing long reads requires fast overlap-layout-consensus algorithms which depend on fast and accurate alignment. CONTRIBUTION: We introduce ARYANA, a fast gapped read aligner, developed on the base of BWA indexing infrastructure with a completely new alignment engine that makes it significantly faster than three other aligners: Bowtie2, BWA and SeqAlto, with comparable generality and accuracy. Instead of the time-consuming backtracking procedures for handling mismatches, ARYANA comes with the seed-and-extend algorithmic framework and a significantly improved efficiency by integrating novel algorithmic techniques including dynamic seed selection, bidirectional seed extension, reset-free hash tables, and gap-filling dynamic programming. As the read length increases, ARYANA's superiority in terms of speed and alignment rate becomes more evident. This is in perfect harmony with the read length trend as the sequencing technologies evolve. The algorithmic platform of ARYANA makes it easy to develop mission-specific aligners for other applications using the ARYANA engine. AVAILABILITY: ARYANA with complete source code can be obtained from http://github.com/aryana-aligner.
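
The seed-and-extend idea named above can be sketched in a few lines. This toy version is not ARYANA's engine: it assumes a plain dictionary of k-mers instead of the BWA index, fixed-length seeds instead of dynamic seed selection, and ungapped extension that only counts mismatches. All function names and parameters are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return (reference_start, mismatches) of the best ungapped hit, or None."""
    best = None
    tried = set()
    for offset in range(len(read) - k + 1):
        # Seed: exact lookup of one k-mer of the read.
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset  # implied start of the whole read
            if start in tried or start < 0 or start + len(read) > len(reference):
                continue
            tried.add(start)
            # Extend: compare the full read against the reference window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_kmer_index(reference, 4)
print(seed_and_extend("TAGCCGTT", reference, index, 4))
```

Each candidate start is extended only once, so several seeds pointing at the same location do not repeat the comparison.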


Subjects
Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Genome, Human , High-Throughput Nucleotide Sequencing/economics , High-Throughput Nucleotide Sequencing/methods , Humans , Sequence Alignment/economics , Sequence Analysis, DNA/economics
17.
BMC Genomics ; 15 Suppl 10: S9, 2014.
Article in English | MEDLINE | ID: mdl-25558875

ABSTRACT

MOTIVATION: Intimately tied to assembly quality is the complexity of the de Bruijn graph built by the assembler. Thus, many paradigms have been developed to decrease the complexity of the de Bruijn graph. One obvious combinatorial paradigm is to allow the value of k to vary, using a larger value of k where the graph is more complex and a smaller value of k where the graph would likely contain fewer spurious edges and vertices. One open problem that affects the practicality of this method is how to predict the value of k prior to building the de Bruijn graph. We show that optimal values of k can be predicted prior to assembly by using the information contained in a phylogenetically close genome, and therefore help make the use of multiple values of k practical for genome assembly. RESULTS: We present HyDA-Vista, a genome assembler that uses homology information to choose a value of k for each read prior to the de Bruijn graph construction. The chosen k is optimal if there are no sequencing errors and the coverage is sufficient. Fundamental to our method is the construction of the maximal sequence landscape, a data structure that stores, for each position in the input string, the largest repeated substring containing that position. In particular, we show that the maximal sequence landscape can be constructed in O(n + n log n) time and O(n) space. HyDA-Vista first constructs the maximal sequence landscape for a homologous genome. The reads are then aligned to this reference genome, and values of k are assigned to each read using the maximal sequence landscape and the alignments. Finally, all the reads are assembled by an iterative de Bruijn graph construction method. Our results and comparison to other assemblers demonstrate that HyDA-Vista achieves the best assembly of E. coli before repeat resolution or scaffolding. AVAILABILITY: HyDA-Vista is freely available. The code for constructing the maximal sequence landscape and choosing the optimal value of k for each read is also separately available on the website and could be incorporated into any genome assembler.
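
To make the definition of the maximal sequence landscape concrete, here is a naive sketch that stores, for each position, the length of the largest repeated substring containing it. It is not the construction described in the abstract: it runs in quadratic time and space rather than the stated bounds, and it assumes that two occurrences of a repeat may overlap. The function name is hypothetical.

```python
def maximal_sequence_landscape(s):
    """For each position p of s, return the length of the longest substring
    that occurs at least twice in s and has an occurrence covering p."""
    n = len(s)
    # longest[i]: length of the longest repeated substring starting at i.
    longest = [0] * n
    # prev[j + 1] holds the common-prefix length of suffixes i + 1 and j + 1.
    prev = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        cur = [0] * (n + 1)
        for j in range(n - 1, i, -1):
            if s[i] == s[j]:
                cur[j] = prev[j + 1] + 1
                longest[i] = max(longest[i], cur[j])
                longest[j] = max(longest[j], cur[j])
        prev = cur

    # A repeat of length L starting at i covers positions i .. i + L - 1.
    landscape = [0] * n
    for i, length in enumerate(longest):
        for p in range(i, i + length):
            landscape[p] = max(landscape[p], length)
    return landscape
```

For "banana", the repeat "ana" occurs at positions 1 and 3, so every position except the initial "b" gets the value 3. How the landscape values are turned into a per-read k is described only at a high level in the abstract and is not reproduced here.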


Subjects
Algorithms , Sequence Analysis, DNA/methods , Computer Simulation , Escherichia coli/genetics , Genome , Humans , Sequence Homology, Nucleic Acid
18.
ISME J ; 8(4): 757-67, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24152715

ABSTRACT

Microbial interactions play a key role in global geochemical cycles. Although we possess significant knowledge about the general biochemical processes occurring in microbial communities, we are often unable to decipher key functions of individual microorganisms within the environment, in part owing to the inability to cultivate or study them in isolation. Here, we circumvent this shortcoming through the use of single-cell genome sequencing and a novel low-input metatranscriptomics protocol to reveal the intricate metabolic capabilities and microbial interactions of an alkane-degrading methanogenic community. This methanogenic consortium oxidizes saturated hydrocarbons under anoxic conditions through a thus-far-uncharacterized biochemical process. The genome of a dominant bacterial member of this community, belonging to the genus Smithella, was sequenced and served as the basis for subsequent analysis through metabolic reconstruction. Metatranscriptomic data generated from less than 500 pg of mRNA highlighted metabolically active genes during anaerobic alkane oxidation in comparison with growth on fatty acids. These data sets suggest that Smithella is not activating hexadecane by fumarate addition. Differential expression assisted in the identification of hypothetical proteins with no known homology that may be involved in hexadecane activation. Additionally, the combination of 16S rDNA sequence and metatranscriptomic data enabled the study of other prevalent organisms within the consortium and their interactions with Smithella, thus yielding a comprehensive characterization of individual constituents at the genome scale during methanogenic alkane oxidation.


Subjects
Alkanes/metabolism , Deltaproteobacteria/genetics , Deltaproteobacteria/metabolism , Ecosystem , Genome , Transcriptome , Anaerobiosis , Euryarchaeota/genetics , Euryarchaeota/metabolism , Fatty Acids/metabolism , Genes, Bacterial/genetics , Molecular Sequence Data , RNA, Ribosomal, 16S/genetics , Single-Cell Analysis
19.
Bioinformatics ; 29(19): 2395-401, 2013 Oct 01.
Article in English | MEDLINE | ID: mdl-23918251

ABSTRACT

MOTIVATION: Identification of every single genome present in a microbial sample is an important and challenging task with crucial applications. It is challenging because there are typically millions of cells in a microbial sample, the vast majority of which elude cultivation. The most accurate method to date is exhaustive single-cell sequencing using multiple displacement amplification, which is simply intractable for a large number of cells. However, there is hope for breaking this barrier, as the number of different cell types with distinct genome sequences is usually much smaller than the number of cells. RESULTS: Here, we present a novel divide-and-conquer method to sequence and de novo assemble all distinct genomes present in a microbial sample with a sequencing cost and computational complexity proportional to the number of genome types, rather than the number of cells. The method is implemented in a tool called Squeezambler. We evaluated Squeezambler on simulated data. The proposed divide-and-conquer method successfully reduces the cost of sequencing in comparison with the naïve exhaustive approach. AVAILABILITY: Squeezambler and datasets are available at http://compbio.cs.wayne.edu/software/squeezambler/.


Subjects
Genome, Microbial , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , Humans , Intestines/microbiology , Sequence Homology, Nucleic Acid
20.
J Comput Biol ; 20(7): 486-94, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23829650

ABSTRACT

It has been shown that the minimum free-energy structure predicted for RNAs and RNA-RNA interactions is often incorrect due to inaccuracies in the energy parameters and inherent limitations of the energy model. In contrast, ensemble-based quantities such as melting temperature and equilibrium concentrations can be predicted more reliably. Even structure prediction by sampling from the ensemble and clustering those structures, as done by Sfold, has proven to be more reliable than minimum free-energy structure prediction. The main obstacle for ensemble-based approaches is the computational complexity of the partition function and base-pairing probabilities. For instance, the space complexity of the partition function for RNA-RNA interaction is O(n^4) and the time complexity is O(n^6), which is prohibitively large. Our goal in this article is to present a fast algorithm, based on sparse folding, to calculate an upper bound on the partition function. Our work is based on the recent algorithm of Hazan and Jaakkola (2012). The space complexity of our algorithm is the same as that of sparse folding algorithms, and the time complexity of our algorithm is O(MFE(n)ℓ) for single RNA and O(MFE(m, n)ℓ) for RNA-RNA interaction in practice, in which MFE is the running time of sparse folding and ℓ ≤ n (ℓ ≤ n + m for RNA-RNA interaction) is a sequence-dependent parameter.
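
As a rough illustration of the quantity being bounded, the sketch below computes an exact partition function for a single RNA under a toy energy model. It is not the algorithm of the article and does not use sparse folding: it assumes every canonical base pair contributes the same energy, ignores stacking and loop energies, forbids pseudoknots, and runs in cubic time. The names and default parameters (pair_energy, rt, min_loop) are hypothetical.

```python
import math

# Watson-Crick and wobble pairs allowed in this toy model.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_function(seq, pair_energy=-1.0, rt=1.0, min_loop=3):
    """Sum of Boltzmann weights over all non-crossing secondary structures.

    Each base pair contributes exp(-pair_energy / rt); a hairpin must
    enclose at least min_loop unpaired bases.
    """
    n = len(seq)
    if n == 0:
        return 1.0
    weight = math.exp(-pair_energy / rt)
    z = [[1.0] * n for _ in range(n)]

    def get(i, j):
        # Empty intervals admit exactly one (empty) structure.
        return z[i][j] if i <= j else 1.0

    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            total = get(i, j - 1)  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if (seq[k], seq[j]) in CANONICAL:
                    total += get(i, k - 1) * get(k + 1, j - 1) * weight
            z[i][j] = total
    return z[0][n - 1]
```

With pair_energy set to 0 the weight becomes 1 and the function simply counts the admissible structures, which is a convenient sanity check on small inputs.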


Subjects
Algorithms , RNA/chemistry , RNA/metabolism , Computational Biology , Humans , RNA/genetics , Thermodynamics