Results 1 - 20 of 45
1.
Cell ; 173(6): 1356-1369.e22, 2018 05 31.
Article in English | MEDLINE | ID: mdl-29856954

ABSTRACT

Genetic changes causing brain size expansion in human evolution have remained elusive. Notch signaling is essential for radial glia stem cell proliferation and is a determinant of neuronal number in the mammalian cortex. We find that three paralogs of human-specific NOTCH2NL are highly expressed in radial glia. Functional analysis reveals that different alleles of NOTCH2NL have varying potencies to enhance Notch signaling by interacting directly with NOTCH receptors. Consistent with a role in Notch signaling, NOTCH2NL ectopic expression delays differentiation of neuronal progenitors, while deletion accelerates differentiation into cortical neurons. Furthermore, NOTCH2NL genes provide the breakpoints in 1q21.1 distal deletion/duplication syndrome, where duplications are associated with macrocephaly and autism and deletions with microcephaly and schizophrenia. Thus, the emergence of human-specific NOTCH2NL genes may have contributed to the rapid evolution of the larger human neocortex, accompanied by loss of genomic stability at the 1q21.1 locus and resulting recurrent neurodevelopmental disorders.


Subjects
Brain/embryology, Cerebral Cortex/physiology, Neurogenesis/physiology, Notch2 Receptor/metabolism, Signal Transduction, Animals, Cell Differentiation, Embryonic Stem Cells/metabolism, Female, Gene Deletion, Reporter Genes, Gorilla gorilla, HEK293 Cells, Humans, Neocortex/cytology, Neural Stem Cells/metabolism, Neuroglia/metabolism, Neurons/metabolism, Pan troglodytes, Notch2 Receptor/genetics, RNA Sequence Analysis
2.
Nature ; 617(7960): 312-324, 2023 05.
Article in English | MEDLINE | ID: mdl-37165242

ABSTRACT

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals [1]. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.


Subjects
Human Genome, Genomics, Humans, Diploidy, Human Genome/genetics, Haplotypes/genetics, DNA Sequence Analysis, Genomics/standards, Reference Standards, Cohort Studies, Alleles, Genetic Variation
3.
Nature ; 604(7906): 437-446, 2022 04.
Article in English | MEDLINE | ID: mdl-35444317

ABSTRACT

The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goal of constructing the highest-possible quality human pangenome reference. Our goal is to improve data representation and streamline analyses to enable routine assembly of complete diploid genomes. With attention to ethical frameworks, the human pangenome reference will contain a more accurate and diverse representation of global genomic variation, improve gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.


Subjects
Human Genome, Genomics, Human Genome/genetics, Haplotypes/genetics, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis
4.
Nat Methods ; 20(2): 239-247, 2023 02.
Article in English | MEDLINE | ID: mdl-36646895

ABSTRACT

Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our toolchain, which consists of additions to the VG toolkit and a standalone tool, RPVG, can construct spliced pangenome graphs, map RNA sequencing data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. We show that this workflow improves accuracy over state-of-the-art RNA sequencing mapping methods, and that it can efficiently quantify haplotype-specific transcript expression without needing to characterize the haplotypes of a sample beforehand.


Subjects
Computational Biology, Gene Expression Profiling, Haplotypes, Metagenomics, Transcriptome
5.
Nature ; 587(7833): 246-251, 2020 11.
Article in English | MEDLINE | ID: mdl-33177663

ABSTRACT

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies [1-3]. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database [4] increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies [5] are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus [6], a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.


Subjects
Genome/genetics, Genomics/methods, Sequence Alignment/methods, Software, Vertebrates/genetics, Amnion, Animals, Computer Simulation, Genomics/standards, Haplotypes, Humans, Quality Control, Sequence Alignment/standards, Software/standards
6.
Int J Mol Sci ; 25(15)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39125675

ABSTRACT

Membrane-type metalloproteinases (including MMP-14 and MMP-15) are enzymes involved in the degradation of extracellular matrix components. In cancer, they are involved in processes such as cellular invasion, angiogenesis and metastasis. Therefore, the aim of this study was to evaluate the expression, content and activity of MMP-14 and MMP-15 in human renal cell carcinoma. Samples of healthy kidney tissue (n = 20) and tissue from clear-cell kidney cancer (n = 20) were examined. The presence and contents of the MMPs were assessed using Western blot and ELISA techniques, respectively. Their activity, both actual and specific, was evaluated using fluorimetric analysis. Both control and cancer human kidney tissues contain MMP-14 and MMP-15 enzymes in the form of high-molecular-weight complexes. Moreover, these enzymes occur in both active and latent forms. Their content in cancer tissues is very similar, but with a noteworthy decrease in content with an increase in the kidney cancer grade for both membrane-type metalloproteinases. Even more notable is the highest content of the investigated enzymes represented by MMP-14 in the control tissues. Considering the actual and specific activity outcomes, MMP-14 dominates over MMP-15 in all of the investigated tissues. Nevertheless, we also noted a significant enhancement of the activity of both metalloproteinases with an increase in the grade of renal cancer. The expression and activity of both enzymes were detected in all examined renal cancer tissues. However, our findings suggest that transmembrane metalloproteinase 14 (MMP-14) plays a much more significant and essential role than MMP-15 in the studied renal carcinoma tissues. Therefore, it seems that MMP-14 could be a promising target in the diagnosis, prognosis and therapy of renal cell carcinoma.


Subjects
Renal Cell Carcinoma, Kidney Neoplasms, Matrix Metalloproteinase 14, Matrix Metalloproteinase 15, Humans, Matrix Metalloproteinase 14/metabolism, Kidney Neoplasms/pathology, Kidney Neoplasms/metabolism, Kidney Neoplasms/enzymology, Renal Cell Carcinoma/metabolism, Renal Cell Carcinoma/pathology, Renal Cell Carcinoma/enzymology, Matrix Metalloproteinase 15/metabolism, Matrix Metalloproteinase 15/genetics, Female, Male, Middle Aged, Aged, Adult
7.
Annu Rev Genomics Hum Genet ; 21: 139-162, 2020 08 31.
Article in English | MEDLINE | ID: mdl-32453966

ABSTRACT

Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linear reference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.


Subjects
Algorithms, Computational Biology/methods, Computer Graphics, Human Genome, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis
8.
Bioinformatics ; 36(21): 5139-5144, 2021 01 29.
Article in English | MEDLINE | ID: mdl-33040146

ABSTRACT

MOTIVATION: Pangenomics is a growing field within computational genomics. Many pangenomic analyses use bidirected sequence graphs as their core data model. However, implementing and correctly using this data model can be difficult, and the scale of pangenomic datasets can be challenging to work at. These challenges have impeded progress in this field. RESULTS: Here, we present a stack of two C++ libraries, libbdsg and libhandlegraph, which use a simple, field-proven interface, designed to expose elementary features of these graphs while preventing common graph manipulation mistakes. The libraries also provide a Python binding. Using a diverse collection of pangenome graphs, we demonstrate that these tools allow for efficient construction and manipulation of large genome graphs with dense variation. For instance, the speed and memory usage are up to an order of magnitude better than the prior graph implementation in the VG toolkit, which has now transitioned to using libbdsg's implementations. AVAILABILITY AND IMPLEMENTATION: libhandlegraph and libbdsg are available under an MIT License from https://github.com/vgteam/libhandlegraph and https://github.com/vgteam/libbdsg.


Subjects
Libraries, Software, Genome, Genomics
9.
Bioinformatics ; 36(Suppl_1): i146-i153, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657356

ABSTRACT

MOTIVATION: Graph representations of genomes are capable of expressing more genetic variation and can therefore better represent a population than standard linear genomes. However, due to the greater complexity of genome graphs relative to linear genomes, some functions that are trivial on linear genomes become much more difficult in genome graphs. Calculating distance is one such function that is simple in a linear genome but complicated in a graph context. In read mapping algorithms such distance calculations are fundamental to determining if seed alignments could belong to the same mapping. RESULTS: We have developed an algorithm for quickly calculating the minimum distance between positions on a sequence graph using a minimum distance index. We have also developed an algorithm that uses the distance index to cluster seeds on a graph. We demonstrate that our implementations of these algorithms are efficient and practical to use for a new generation of mapping algorithms based upon genome graphs. AVAILABILITY AND IMPLEMENTATION: Our algorithms have been implemented as part of the vg toolkit and are available at https://github.com/vgteam/vg.


Subjects
Genome, Software, Algorithms, Cluster Analysis, DNA Sequence Analysis
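
To make the distance problem in the abstract above concrete, here is a minimal Python sketch of a naive baseline: a Dijkstra search that returns the minimum number of bases between two positions on a small sequence graph. It is not the minimum distance index described in the paper and does not use the vg toolkit. The function name `min_distance`, the `(node_id, offset)` position encoding, and the toy graph are all assumptions made for illustration, and the graph is simplified to forward-strand edges only rather than a bidirected graph.

```python
import heapq

def min_distance(seqs, edges, src, dst):
    """Minimum number of bases to walk from src to dst, or None if unreachable.

    seqs:  dict mapping node_id -> sequence string (hypothetical toy format)
    edges: dict mapping node_id -> list of successor node_ids (forward only)
    src, dst: (node_id, offset) positions
    """
    (a, i), (b, j) = src, dst
    if a == b and j >= i:
        return j - i  # both positions lie on the same node, in order

    # Dijkstra over "distance from src to the first base of each node".
    heap = [(len(seqs[a]) - i, n) for n in edges.get(a, [])]
    heapq.heapify(heap)
    best = {}
    while heap:
        d, n = heapq.heappop(heap)
        if n in best:
            continue  # already settled with a shorter distance
        best[n] = d
        if n == b:
            return d + j
        for m in edges.get(n, []):
            if m not in best:
                heapq.heappush(heap, (d + len(seqs[n]), m))
    return None

# Toy "bubble": node 1 branches to a short allele (2) or a long allele (3).
seqs = {1: "ACG", 2: "T", 3: "GGGGG", 4: "CA"}
edges = {1: [2, 3], 2: [4], 3: [4]}
print(min_distance(seqs, edges, (1, 1), (4, 1)))
```

Each query in this sketch re-runs a graph search, which is the cost a precomputed index is meant to avoid when a mapper has to compare many seed positions.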
10.
Bioinformatics ; 36(2): 400-407, 2020 01 15.
Article in English | MEDLINE | ID: mdl-31406990

ABSTRACT

MOTIVATION: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. RESULTS: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. AVAILABILITY AND IMPLEMENTATION: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Haplotypes, Algorithms, Genome, DNA Sequence Analysis, Software
11.
Genome Res ; 27(5): 665-676, 2017 05.
Article in English | MEDLINE | ID: mdl-28360232

ABSTRACT

The human reference genome is part of the foundation of modern human biology and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph-based models. Here, we survey various projects underway to build and apply these graph-based structures, which we collectively refer to as genome graphs, and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.


Subjects
Human Genome, Genome-Wide Association Study/methods, Genomics/methods, Genome-Wide Association Study/standards, Genomics/standards, Humans, Genetic Polymorphism
12.
Bioinformatics ; 35(24): 5318-5320, 2019 12 15.
Article in English | MEDLINE | ID: mdl-31368484

ABSTRACT

MOTIVATION: Compared to traditional haploid reference genomes, graph genomes are an efficient and compact data structure for storing multiple genomic sequences, for storing polymorphisms or for mapping sequencing reads with greater sensitivity. Further, graphs are well-studied computer science objects that can be efficiently analyzed. However, their adoption in genomic research is slow, in part because of the cognitive difficulty in interpreting graphs. RESULTS: We present an intuitive graphical representation for graph genomes that re-uses well-honed techniques developed to display public transport networks, and demonstrate it as a web tool. AVAILABILITY AND IMPLEMENTATION: Code: https://github.com/vgteam/sequenceTubeMap. DEMONSTRATION: https://vgteam.github.io/sequenceTubeMap/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Genome, Software, Genomics, DNA Sequence Analysis
13.
Bioinformatics ; 34(13): i105-i114, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949989

ABSTRACT

Motivation: Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community. Results: We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50× coverage Illumina data and 10× PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants. Availability and implementation: https://github.com/whatshap/whatshap. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Diploidy, Fungal Genome, DNA Sequence Analysis/methods, Data Visualization, Haplotypes, High-Throughput Nucleotide Sequencing/methods, Yeasts/genetics
14.
Bioinformatics ; 31(22): 3569-76, 2015 Nov 15.
Article in English | MEDLINE | ID: mdl-26220960

ABSTRACT

MOTIVATION: Sequence mapping is the cornerstone of modern genomics. However, most existing sequence mapping algorithms are insufficiently general. RESULTS: We introduce context schemes: a method that allows the unambiguous recognition of a reference base in a query sequence by testing the query for substrings from an algorithmically defined set. Context schemes only map when there is a unique best mapping, and define this criterion uniformly for all reference bases. Mappings under context schemes can also be made stable, so that extension of the query string (e.g. by increasing read length) will not alter the mapping of previously mapped positions. Context schemes are general in several senses. They natively support the detection of arbitrary complex, novel rearrangements relative to the reference. They can scale over orders of magnitude in query sequence length. Finally, they are trivially extensible to more complex reference structures, such as graphs, that incorporate additional variation. We demonstrate empirically the existence of high-performance context schemes, and present efficient context scheme mapping algorithms. AVAILABILITY AND IMPLEMENTATION: The software test framework created for this study is available from https://registry.hub.docker.com/u/adamnovak/sequence-graphs/. CONTACT: anovak@soe.ucsc.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genomics/methods, Software, Algorithms, Computer Simulation, Genetic Loci, Humans, Major Histocompatibility Complex/genetics, Sequence Alignment
15.
BMC Bioinformatics ; 16: 108, 2015 Apr 01.
Article in English | MEDLINE | ID: mdl-25888064

ABSTRACT

BACKGROUND: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. RESULTS: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. CONCLUSIONS: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. 
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign.


Subjects
Algorithms, Computational Biology/methods, Computer Graphics, Statistical Models, Sequence Alignment/methods, Software, Computer Simulation, Humans, Uncertainty
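
The alignment DAG idea in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WeaveAlign implementation: the column encoding (per row, the count of residues consumed so far plus a gap flag), the `START`/`END` sentinels, and the function names are all hypothetical. The sketch builds the DAG from two sampled pairwise alignments, estimates column probabilities from empirical frequencies, and counts the paths through the graph.

```python
from collections import Counter, defaultdict

START, END = "START", "END"

def columns(rows):
    """Encode a gapped alignment as a list of hashable column keys."""
    seen = [0] * len(rows)  # residues consumed so far in each row
    cols = []
    for chars in zip(*rows):
        key = []
        for r, ch in enumerate(chars):
            gap = ch == "-"
            if not gap:
                seen[r] += 1
            key.append((seen[r], gap))
        cols.append(tuple(key))
    return cols

def build_dag(samples):
    """Return (column frequencies, successor sets) for sampled alignments."""
    counts = Counter()
    succ = defaultdict(set)
    for rows in samples:
        path = [START] + columns(rows) + [END]
        counts.update(path[1:-1])
        for u, v in zip(path, path[1:]):
            succ[u].add(v)
    freqs = {c: n / len(samples) for c, n in counts.items()}
    return freqs, succ

def count_paths(succ, node=START, memo=None):
    """Count START-to-END paths, i.e. alignments encoded by the DAG."""
    if memo is None:
        memo = {}
    if node == END:
        return 1
    if node not in memo:
        memo[node] = sum(count_paths(succ, nxt, memo) for nxt in succ[node])
    return memo[node]

# Two sampled alignments of the same pair of sequences (toy data).
samples = [
    ["AAGTT", "A-GT-"],
    ["AAGTT", "-AG-T"],
]
freqs, succ = build_dag(samples)
print(len(freqs), count_paths(succ))
```

In this toy case the two samples share only the middle column, yet the DAG contains four paths, which illustrates the abstract's point that the graph encodes more alignments than were sampled.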
16.
Mol Biol Evol ; 31(9): 2251-66, 2014 Sep.
Article in English | MEDLINE | ID: mdl-24899668

ABSTRACT

For sequences that are highly divergent, there is often insufficient information to infer accurate alignments, and phylogenetic uncertainty may be high. One way to address this issue is to make use of protein structural information, since structures generally diverge more slowly than sequences. In this work, we extend a recently developed stochastic model of pairwise structural evolution to multiple structures on a tree, analytically integrating over ancestral structures to permit efficient likelihood computations under the resulting joint sequence-structure model. We observe that the inclusion of structural information significantly reduces alignment and topology uncertainty, and reduces the number of topology and alignment errors in cases where the true trees and alignments are known. In some cases, the inclusion of structure results in changes to the consensus topology, indicating that structure may contain additional information beyond that which can be obtained from sequences. We use the model to investigate the order of divergence of cytoglobins, myoglobins, and hemoglobins and observe a stabilization of phylogenetic inference: although a sequence-based inference assigns significant posterior probability to several different topologies, the structural model strongly favors one of these over the others and is more robust to the choice of data set.


Subjects
Bayes Theorem, Computational Biology/methods, Globins/chemistry, Hemoglobins/chemistry, Myoglobin/chemistry, Animals, Cytoglobin, Globins/genetics, Hemoglobins/genetics, Humans, Markov Chains, Molecular Models, Mutation, Myoglobin/genetics, Phylogeny, Protein Conformation, Sequence Alignment, Protein Sequence Analysis
17.
Bioinformatics ; 29(5): 654-5, 2013 Mar 01.
Article in English | MEDLINE | ID: mdl-23335014

ABSTRACT

MOTIVATION: Comparative modeling of RNA is known to be important for making accurate secondary structure predictions. RNA structure prediction tools such as PPfold or RNAalifold use an aligned set of sequences in predictions. Obtaining a multiple alignment from a set of sequences is quite a challenging problem itself, and the quality of the alignment can affect the quality of a prediction. By implementing RNA secondary structure prediction in a statistical alignment framework, and predicting structures from multiple alignment samples instead of a single fixed alignment, it may be possible to improve predictions. RESULTS: We have extended the program StatAlign to make use of RNA-specific features, which include RNA secondary structure prediction from multiple alignments using either a thermodynamic approach (RNAalifold) or a Stochastic Context-Free Grammars (SCFGs) approach (PPfold). We also provide the user with scores relating to the quality of a secondary structure prediction, such as information entropy values for the combined space of secondary structures and sampled alignments, and a reliability score that predicts the expected number of correctly predicted base pairs. Finally, we have created RNA secondary structure visualization plugins and automated the process of setting up Markov Chain Monte Carlo runs for RNA alignments in StatAlign. AVAILABILITY AND IMPLEMENTATION: The software is available from http://statalign.github.com/statalign/.


Subjects
RNA/chemistry, Sequence Alignment/methods, RNA Sequence Analysis, Software, Algorithms, Base Pairing, Bayes Theorem, Markov Chains, Nucleic Acid Conformation, Thermodynamics
18.
Nat Biotechnol ; 42(4): 663-673, 2024 Apr.
Article in English | MEDLINE | ID: mdl-37165083

ABSTRACT

Pangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph's ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. We also demonstrate construction of a Drosophila melanogaster pangenome.


Subjects
Drosophila melanogaster, High-Throughput Nucleotide Sequencing, Humans, Animals, Drosophila melanogaster/genetics, Haplotypes/genetics, High-Throughput Nucleotide Sequencing/methods, Alleles, DNA Sequence Analysis, Human Genome/genetics
19.
BMC Bioinformatics ; 14: 149, 2013 May 01.
Article in English | MEDLINE | ID: mdl-23634662

ABSTRACT

BACKGROUND: With the advancement of next-generation sequencing and transcriptomics technologies, regulatory effects involving RNA, in particular RNA structural changes, are being detected. These results often rely on RNA secondary structure predictions. However, current approaches to RNA secondary structure modelling produce predictions with a high variance in predictive accuracy, and we have little quantifiable knowledge about the reasons for these variances. RESULTS: In this paper we explore a number of factors which can contribute to poor RNA secondary structure prediction quality. We establish a quantified relationship between alignment quality and loss of accuracy. Furthermore, we define two new measures to quantify uncertainty in alignment-based structure predictions. One of the measures improves on the "reliability score" reported by PPfold, and considers alignment uncertainty as well as base-pair probabilities. The other measure considers the information entropy for SCFGs over a space of input alignments. CONCLUSIONS: Our predictive accuracy improves on the PPfold reliability score. We can successfully characterize many of the underlying reasons for and variances in poor prediction. However, there is still variability unaccounted for, which we therefore suggest comes from the RNA secondary structure predictive model itself.


Subjects
RNA/chemistry, Sequence Alignment/methods, RNA Sequence Analysis, Algorithms, Base Pairing, Molecular Evolution, Nucleic Acid Conformation, Probability, Reproducibility of Results, Sequence Alignment/standards
20.
BMC Bioinformatics ; 14 Suppl 2: S22, 2013.
Article in English | MEDLINE | ID: mdl-23368905

ABSTRACT

Comparative methods for RNA secondary structure prediction use evolutionary information from RNA alignments to increase prediction accuracy. The model is often described in terms of stochastic context-free grammars (SCFGs), which generate a probability distribution over secondary structures. It is, however, unclear how this probability distribution changes as a function of the input alignment. As prediction programs typically only return a single secondary structure, better characterisation of the underlying probability space of RNA secondary structures is of great interest. In this work, we show how to efficiently compute the information entropy of the probability distribution over RNA secondary structures produced for RNA alignments by a phylo-SCFG, and implement it for the PPfold model. We also discuss interpretations and applications of this quantity, including how it can clarify reasons for low prediction reliability scores. PPfold and its source code are available from http://birc.au.dk/software/ppfold/.


Subjects
Algorithms, Theoretical Models, Nucleic Acid Conformation, RNA/chemistry, Base Sequence, Computational Biology/methods, Entropy, Probability, Software
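
The quantity discussed in the abstract above is the Shannon entropy of a probability distribution over secondary structures. As a rough illustration only, the Python sketch below computes that entropy by brute force from an explicitly enumerated list of structure probabilities. It is not the efficient phylo-SCFG computation the paper describes and it is not part of PPfold; the function name `shannon_entropy` and the example probabilities are assumptions.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution.

    probs: non-negative weights, one per candidate structure; they are
    normalised here, so unnormalised scores are accepted as well.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    h = 0.0
    for p in probs:
        if p > 0:  # a zero-probability structure contributes nothing
            q = p / total
            h -= q * math.log2(q)
    return h

# Hypothetical distributions over four candidate structures.
peaked = [0.97, 0.01, 0.01, 0.01]   # one structure dominates
flat = [0.25, 0.25, 0.25, 0.25]     # no structure is preferred
print(shannon_entropy(peaked) < shannon_entropy(flat))
```

A low entropy means the probability mass is concentrated on few structures, while a high entropy means it is spread over many alternatives, which is one way such a quantity could help explain a low reliability score for a single reported structure.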