Search | BVS Search Portal

1.

Natural mutagenesis of human genomes by endogenous retrotransposons.

Iskow, Rebecca C; McCabe, Michael T; Mills, Ryan E; Torene, Spencer; Pittard, W Stephen; Neuwald, Andrew F; Van Meir, Erwin G; Vertino, Paula M; Devine, Scott E.

Cell ; 141(7): 1253-61, 2010 Jun 25.

Article in English | MEDLINE | ID: mdl-20603005

ABSTRACT

Two abundant classes of mobile elements, namely Alu and L1 elements, continue to generate new retrotransposon insertions in human genomes. Estimates suggest that these elements have generated millions of new germline insertions in individual human genomes worldwide. Unfortunately, current technologies are not capable of detecting most of these young insertions, and the true extent of germline mutagenesis by endogenous human retrotransposons has been difficult to examine. Here, we describe technologies for detecting these young retrotransposon insertions and demonstrate that such insertions indeed are abundant in human populations. We also found that new somatic L1 insertions occur at high frequencies in human lung cancer genomes. Genome-wide analysis suggests that altered DNA methylation may be responsible for the high levels of L1 mobilization observed in these tumors. Our data indicate that transposon-mediated mutagenesis is extensive in human genomes and is likely to have a major impact on human biology and diseases.

Subjects

Alu Elements , Genome, Human , Long Interspersed Nucleotide Elements , Mutagenesis , Sequence Analysis, DNA/methods , Brain Neoplasms/genetics , Humans , Lung Neoplasms/genetics , Methylation

2.

eCOMPASS: evaluative comparison of multiple protein alignments by statistical score.

Neuwald, Andrew F; Kolaczkowski, Bryan D; Altschul, Stephen F.

Bioinformatics ; 37(20): 3456-3463, 2021 Oct 25.

Article in English | MEDLINE | ID: mdl-33983436

ABSTRACT

MOTIVATION: Detecting subtle biologically relevant patterns in protein sequences often requires the construction of a large and accurate multiple sequence alignment (MSA). Methods for constructing MSAs are usually evaluated using benchmark alignments, which, however, typically contain very few sequences and are therefore inappropriate when dealing with large numbers of proteins. RESULTS: eCOMPASS addresses this problem using a statistical measure of relative alignment quality based on direct coupling analysis (DCA): to maintain protein structural integrity over evolutionary time, substitutions at one residue position typically result in compensating substitutions at other positions. eCOMPASS computes the statistical significance of the congruence between high scoring directly coupled pairs and 3D contacts in corresponding structures, which depends upon properly aligned homologous residues. We illustrate eCOMPASS using both simulated and real MSAs. AVAILABILITY AND IMPLEMENTATION: The eCOMPASS executable, C++ open source code and input data sets are available at https://www.igs.umaryland.edu/labs/neuwald/software/compass. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
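The abstract above describes testing whether high-scoring directly coupled pairs coincide with 3D contacts more often than chance would allow. The sketch below illustrates that general idea with a plain hypergeometric tail probability. It is a simplified stand-in, not the statistic eCOMPASS computes, and the function name and parameters are hypothetical.

```python
from math import comb

def overlap_p_value(n_pairs, n_contacts, n_top, n_hits):
    """Upper-tail hypergeometric probability P(X >= n_hits).

    Assumed setup (illustrative only): n_pairs residue pairs exist,
    n_contacts of them are 3D contacts, and the n_top highest-scoring
    pairs contain n_hits contacts. Under the null hypothesis the top
    pairs are a random draw from all pairs.
    """
    upper = min(n_top, n_contacts)
    total = comb(n_pairs, n_top)
    tail = sum(
        comb(n_contacts, k) * comb(n_pairs - n_contacts, n_top - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total
```

A small p-value would suggest that the top-scoring pairs are enriched for structural contacts, which in turn depends on homologous residues being aligned correctly.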

3.

IntAPT: integrated assembly of phenotype-specific transcripts from multiple RNA-seq profiles.

Shi, Xu; Neuwald, Andrew F; Wang, Xiao; Wang, Tian-Li; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua.

Bioinformatics ; 37(5): 650-658, 2021 May 05.

Article in English | MEDLINE | ID: mdl-33016988

ABSTRACT

MOTIVATION: High-throughput RNA sequencing has revolutionized the scope and depth of transcriptome analysis. Accurate reconstruction of a phenotype-specific transcriptome is challenging due to the noise and variability of RNA-seq data. This requires computational identification of transcripts from multiple samples of the same phenotype, given the underlying consensus transcript structure. RESULTS: We present a Bayesian method, integrated assembly of phenotype-specific transcripts (IntAPT), that identifies phenotype-specific isoforms from multiple RNA-seq profiles. IntAPT features a novel two-layer Bayesian model to capture the presence of isoforms at the group layer and to quantify the abundance of isoforms at the sample layer. A spike-and-slab prior is used to model the isoform expression and to enforce the sparsity of expressed isoforms. Dependencies between the existence of isoforms and their expression are modeled explicitly to facilitate parameter estimation. Model parameters are estimated iteratively using Gibbs sampling to infer the joint posterior distribution, from which the presence and abundance of isoforms can reliably be determined. Studies using both simulations and real datasets show that IntAPT consistently outperforms existing methods. Experimental results demonstrate that, despite sequencing errors, IntAPT exhibits a robust performance among multiple samples, resulting in notably improved identification of expressed isoforms of low abundance. AVAILABILITY AND IMPLEMENTATION: The IntAPT package is available at http://github.com/henryxushi/IntAPT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Gene Expression Profiling , Transcriptome , Bayes Theorem , Phenotype , RNA-Seq , Sequence Analysis, RNA , Software
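The spike-and-slab prior mentioned in this abstract can be illustrated with a short sampling sketch. Each value is exactly zero with some probability (the spike) and is otherwise drawn from a continuous distribution (the slab). The exponential slab, the function name and the parameters are assumptions for illustration; the abstract does not specify the distributions IntAPT uses.

```python
import random

def sample_spike_and_slab(n, inclusion_prob, slab_mean, rng):
    """Draw n values from a spike-and-slab prior.

    With probability inclusion_prob a value comes from the slab
    (here an exponential with mean slab_mean, an assumed choice);
    otherwise it is exactly 0.0, which is what enforces sparsity.
    """
    values = []
    for _ in range(n):
        if rng.random() < inclusion_prob:
            values.append(rng.expovariate(1.0 / slab_mean))
        else:
            values.append(0.0)
    return values

rng = random.Random(0)
draws = sample_spike_and_slab(1000, 0.3, 5.0, rng)
expressed = sum(1 for v in draws if v > 0.0)
```

With an inclusion probability of 0.3, roughly 30% of the simulated isoforms are expressed and the rest are exactly zero.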

4.

ChIP-GSM: Inferring active transcription factor modules to predict functional regulatory elements.

Chen, Xi; Neuwald, Andrew F; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua.

PLoS Comput Biol ; 17(7): e1009203, 2021 Jul.

Article in English | MEDLINE | ID: mdl-34292930

ABSTRACT

Transcription factors (TFs) often function as a module including both master factors and mediators binding at cis-regulatory regions to modulate nearby gene transcription. ChIP-seq profiling of multiple TFs makes it feasible to infer functional TF modules. However, when inferring TF modules based on co-localization of ChIP-seq peaks, often many weak binding events are missed, especially for mediators, resulting in incomplete identification of modules. To address this problem, we develop a ChIP-seq data-driven Gibbs Sampler to infer Modules (ChIP-GSM) using a Bayesian framework that integrates ChIP-seq profiles of multiple TFs. ChIP-GSM samples read counts of module TFs iteratively to estimate the binding potential of a module to each region and, across all regions, estimates the module abundance. Using inferred module-region probabilistic bindings as feature units, ChIP-GSM then employs logistic regression to predict active regulatory elements. Validation of ChIP-GSM predicted regulatory regions on multiple independent datasets sharing the same context confirms the advantage of using TF modules for predicting regulatory activity. In a case study of K562 cells, we demonstrate that the ChIP-GSM inferred modules form as groups, activate gene expression at different time points, and mediate diverse functional cellular processes. Hence, ChIP-GSM infers biologically meaningful TF modules and improves the prediction accuracy of regulatory region activities.

Subjects

Chromatin Immunoprecipitation Sequencing/methods , Gene Regulatory Networks , Regulatory Sequences, Nucleic Acid/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Bayes Theorem , Binding Sites/genetics , Chromatin/genetics , Chromatin/metabolism , Chromatin Immunoprecipitation Sequencing/statistics & numerical data , Computational Biology , Enhancer Elements, Genetic , Epigenesis, Genetic , Gene Expression Regulation , Humans , K562 Cells , MCF-7 Cells , Models, Statistical , Promoter Regions, Genetic

5.

Identifying Function Determining Residues in Neuroimmune Semaphorin 4A.

Chapoval, Svetlana P; Lee, Mariah; Lemmer, Aaron; Ajayi, Oluwaseyi; Qi, Xiulan; Neuwald, Andrew F; Keegan, Achsah D.

Int J Mol Sci ; 23(6), 2022 Mar 11.

Article in English | MEDLINE | ID: mdl-35328445

ABSTRACT

Semaphorin 4A (Sema4A) exerts a stabilizing effect on human Treg cells in PBMC and CD4+ T cell cultures by engaging Plexin B1. Sema4A-deficient mice display enhanced allergic airway inflammation accompanied by fewer Treg cells, while Sema4D-deficient mice display reduced inflammation and increased Treg cell numbers, even though both Sema4 subfamily members engage Plexin B1. The main objectives of this study were (1) to compare the in vitro effects of Sema4A and Sema4D proteins on human Treg cells and (2) to identify function-determining residues in Sema4A critical for binding to Plexin B1 based on Sema4D homology modeling. We report here that Sema4A and Sema4D display opposite effects on human Treg cells in in vitro PBMC cultures; Sema4D inhibited the CD4+CD25+Foxp3+ cell numbers and CD25/Foxp3 expression. Sema4A and Sema4D competitively bind to Plexin B1 in vitro and hence may be doing so in vivo as well. Bayesian Partitioning with Pattern Selection (BPPS) partitioned 4505 Sema domains from diverse organisms into subgroups based on distinguishing sequence patterns that are likely responsible for functional differences. BPPS groups Sema3 and Sema4 into one family and further separates Sema4A and Sema4D into distinct subfamilies. Residues distinctive of the Sema3,4 family and of Sema4A (and by homology of Sema4D) tend to cluster around the Plexin B1 binding site. This suggests that the residues both common to and distinctive of Sema4A and Sema4D may mediate binding to Plexin B1, with subfamily residues mediating functional specificity. We mutated the Sema4A-specific residues M198 and F223 to alanine; notably, F223 in Sema4A corresponds to alanine in Sema4D. Mutant proteins were assayed for Plexin B1-binding and Treg stimulation activities. The F223A mutant was unable to stimulate Treg stability in in vitro PBMC cultures despite binding Plexin B1 with an affinity similar to the WT protein. This research is a first step in generating potent mutant Sema4A molecules with stimulatory function for Treg cells with a view to designing immunotherapeutics for asthma.

Subjects

Leukocytes, Mononuclear , Semaphorins/metabolism , Alanine , Animals , Bayes Theorem , Forkhead Transcription Factors/genetics , Humans , Inflammation , Leukocytes, Mononuclear/metabolism , Mice , Nerve Tissue Proteins/metabolism

6.

ChIP-BIT2: a software tool to detect weak binding events using a Bayesian integration approach.

Chen, Xi; Shi, Xu; Neuwald, Andrew F; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua.

BMC Bioinformatics ; 22(1): 193, 2021 Apr 15.

Article in English | MEDLINE | ID: mdl-33858322

ABSTRACT

BACKGROUND: ChIP-seq combines chromatin immunoprecipitation assays with sequencing and identifies genome-wide binding sites for DNA binding proteins. While many binding sites have strong ChIP-seq 'peak' observations and are well captured, there are still regions bound by proteins weakly, with a relatively low ChIP-seq signal enrichment. These weak binding sites, especially those at promoters and enhancers, are functionally important because they also regulate nearby gene expression. Yet, it remains a challenge to accurately identify weak binding sites in ChIP-seq data due to the ambiguity in differentiating these weak binding sites from the amplified background DNAs. RESULTS: ChIP-BIT2 ( http://sourceforge.net/projects/chipbitc/ ) is a software package for ChIP-seq peak detection. ChIP-BIT2 employs a mixture model integrating protein and control ChIP-seq data and predicts strong or weak protein binding sites at promoters, enhancers, or other genomic locations. For binding sites at gene promoters, ChIP-BIT2 simultaneously predicts their target genes. ChIP-BIT2 has been validated on benchmark regions and tested using large-scale ENCODE ChIP-seq data, demonstrating its high accuracy and wide applicability. CONCLUSION: ChIP-BIT2 is an efficient ChIP-seq peak caller. It provides a better lens to examine weak binding sites and can refine or extend the existing binding site collection, providing additional regulatory regions for decoding the mechanism of gene expression regulation.

Subjects

High-Throughput Nucleotide Sequencing , Software , Bayes Theorem , Binding Sites , Chromatin Immunoprecipitation , Oligonucleotide Array Sequence Analysis , Sequence Analysis, DNA
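The mixture-model idea in this abstract, separating weakly bound regions from amplified background, can be sketched with a two-component mixture and Bayes' rule. The Poisson components, the rates and the function names below are illustrative assumptions; the abstract does not state the distributions ChIP-BIT2 uses.

```python
from math import exp, lgamma, log

def poisson_log_pmf(count, rate):
    """Log probability of observing `count` reads under a Poisson(rate)."""
    return count * log(rate) - rate - lgamma(count + 1)

def posterior_bound(count, background_rate, signal_rate, prior_bound):
    """Posterior probability that a region is bound, given its read count.

    Assumes a two-component Poisson mixture (a simplification):
    background regions have mean background_rate, bound regions have
    mean signal_rate, and prior_bound is the prior fraction bound.
    """
    log_bound = log(prior_bound) + poisson_log_pmf(count, signal_rate)
    log_unbound = log(1.0 - prior_bound) + poisson_log_pmf(count, background_rate)
    # Subtract the larger term before exponentiating for numerical stability.
    m = max(log_bound, log_unbound)
    w_bound = exp(log_bound - m)
    w_unbound = exp(log_unbound - m)
    return w_bound / (w_bound + w_unbound)
```

In this toy setting a weak site is a region whose count is only modestly above background, so its posterior lands between the prior and 1 rather than being called confidently either way.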

7.

A survey of TIR domain sequence and structure divergence.

Toshchakov, Vladimir Y; Neuwald, Andrew F.

Immunogenetics ; 72(3): 181-203, 2020 Apr.

Article in English | MEDLINE | ID: mdl-32002590

ABSTRACT

Toll-interleukin-1R resistance (TIR) domains are ubiquitously present in all forms of cellular life. They are most commonly found in signaling proteins, as units responsible for signal-dependent formation of protein complexes that enable amplification and spatial propagation of the signal. A less common function of TIR domains is their ability to catalyze nicotinamide adenine dinucleotide degradation. This survey analyzes 26,414 TIR domains, automatically classified based on group-specific sequence patterns presumably determining biological function, using a statistical approach termed Bayesian partitioning with pattern selection (BPPS). We examine these groups and patterns in the light of available structures and biochemical analyses. Proteins within each of thirteen eukaryotic groups (10 metazoans and 3 plants) typically appear to perform similar functions, whereas proteins within each prokaryotic group typically exhibit diverse domain architectures, suggesting divergent functions. Groups are often uniquely characterized by structural fold variations associated with group-specific sequence patterns and by herein identified sequence motifs defining TIR domain functional divergence. For example, BPPS identifies, in helices C and D of TIRAP and MyD88 orthologs, conserved surface-exposed residues apparently responsible for specificity of TIR domain interactions. In addition, BPPS clarifies the functional significance of the previously described Box 2 and Box 3 motifs, each of which is a part of a larger, group-specific block of conserved, intramolecularly interacting residues.

Subjects

Adaptor Proteins, Signal Transducing/genetics , Protein Domains/genetics , Protein Domains/physiology , Adaptor Proteins, Signal Transducing/metabolism , Amino Acid Sequence , Animals , Bayes Theorem , Databases, Genetic , Drosophila Proteins/genetics , Drosophila Proteins/metabolism , Humans , Interleukins , Models, Molecular , Myeloid Differentiation Factor 88/genetics , Myeloid Differentiation Factor 88/metabolism , Protein Structure, Secondary , Receptors, Interleukin-1/genetics , Receptors, Interleukin-1/metabolism , Signal Transduction/genetics , Signal Transduction/physiology , Toll-Like Receptors/genetics , Toll-Like Receptors/metabolism

8.

Statistical investigations of protein residue direct couplings.

Neuwald, Andrew F; Altschul, Stephen F.

PLoS Comput Biol ; 14(12): e1006237, 2018 Dec.

Article in English | MEDLINE | ID: mdl-30596639

ABSTRACT

Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at http://evaldca.igs.umaryland.edu.

Subjects

Binding Sites/physiology , Proteins/chemistry , Sequence Analysis, Protein/methods , Algorithms , Models, Molecular , Protein Conformation , Protein Interaction Domains and Motifs/physiology , Protein Structural Elements , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data

9.

Correction to: A survey of TIR domain sequence and structure divergence.

Toshchakov, Vladimir Y; Neuwald, Andrew F.

Immunogenetics ; 74(2): 269, 2022 Apr.

Article in English | MEDLINE | ID: mdl-34977974

10.

Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations.

Neuwald, Andrew F; Altschul, Stephen F.

PLoS Comput Biol ; 12(12): e1005294, 2016 Dec.

Article in English | MEDLINE | ID: mdl-28002465

ABSTRACT

Over evolutionary time, members of a superfamily of homologous proteins sharing a common structural core diverge into subgroups filling various functional niches. At the sequence level, such divergence appears as correlations that arise from residue patterns distinct to each subgroup. Such a superfamily may be viewed as a population of sequences corresponding to a complex, high-dimensional probability distribution. Here we model this distribution as hierarchical interrelated hidden Markov models (hiHMMs), which describe these sequence correlations implicitly. By characterizing such correlations one may hope to obtain information regarding functionally-relevant properties that have thus far evaded detection. To do so, we infer a hiHMM distribution from sequence data using Bayes' theorem and Markov chain Monte Carlo (MCMC) sampling, which is widely recognized as the most effective approach for characterizing a complex, high dimensional distribution. Other routines then map correlated residue patterns to available structures with a view to hypothesis generation. When applied to N-acetyltransferases, this reveals sequence and structural features indicative of functionally important, yet generally unknown biochemical properties. Even for sets of proteins for which nothing is known beyond unannotated sequences and structures, this can lead to helpful insights. We describe, for example, a putative coenzyme-A-induced-fit substrate binding mechanism mediated by arginine residue switching between salt bridge and π-π stacking interactions. A suite of programs implementing this approach is available (psed.igs.umaryland.edu).

Subjects

Acetyltransferases/chemistry , Models, Molecular , Sequence Analysis, Protein/methods , Acetyltransferases/genetics , Acetyltransferases/metabolism , Amino Acid Sequence , Animals , Caenorhabditis elegans Proteins/chemistry , Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans Proteins/metabolism , Computational Biology , Humans , Markov Chains , Monte Carlo Method , Sequence Alignment/methods

11.

Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties.

Neuwald, Andrew F; Altschul, Stephen F.

PLoS Comput Biol ; 12(5): e1004936, 2016 May.

Article in English | MEDLINE | ID: mdl-27192614

ABSTRACT

We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a "top-down" strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO's superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/.

Subjects

Proteins/chemistry , Sequence Alignment/statistics & numerical data , Algorithms , Bayes Theorem , Computational Biology , Databases, Protein , Markov Chains , Monte Carlo Method , Sequence Alignment/standards , Software

12.

Identification and classification of small molecule kinases: insights into substrate recognition and specificity.

Oruganty, Krishnadev; Talevich, Eric E; Neuwald, Andrew F; Kannan, Natarajan.

BMC Evol Biol ; 16: 7, 2016 Jan 06.

Article in English | MEDLINE | ID: mdl-26738562

ABSTRACT

BACKGROUND: Many prokaryotic kinases that phosphorylate small molecule substrates, such as antibiotics, lipids and sugars, are evolutionarily related to Eukaryotic Protein Kinases (EPKs). These Eukaryotic-Like Kinases (ELKs) share the same overall structural fold as EPKs, but differ in their modes of regulation, substrate recognition and specificity, the sequence and structural determinants of which are poorly understood. RESULTS: To better understand the basis for ELK specificity, we applied a Bayesian classification procedure designed to identify sequence determinants responsible for functional divergence. This reveals that a large and diverse family of aminoglycoside kinases, characterized members of which are involved in antibiotic resistance, falls into major sub-groups based on differences in putative substrate recognition motifs. Aminoglycoside kinase substrate specificity follows simple rules of alternating hydroxyl and amino groups that are strongly correlated with variations at the DFG + 1 position. CONCLUSIONS: Substrate specificity determining features in small molecule kinases are mostly confined to the catalytic core and can be identified based on quantitative sequence and crystal structure comparisons.

Subjects

Protein Kinases/classification , Amino Acid Sequence , Bayes Theorem , Protein Kinases/chemistry , Protein Kinases/metabolism , Protein Structure, Tertiary , Substrate Specificity

13.

Protein domain hierarchy Gibbs sampling strategies.

Neuwald, Andrew F.

Stat Appl Genet Mol Biol ; 13(4): 497-517, 2014 Aug.

Article in English | MEDLINE | ID: mdl-24988248

ABSTRACT

Hierarchically-arranged multiple sequence alignment profiles are useful for modeling protein domains that have functionally diverged into evolutionarily-related subgroups. Currently such alignment hierarchies are largely constructed through manual curation, as for the NCBI Conserved Domain Database (CDD). Recently, however, I developed a Gibbs sampler that uses an approach termed statistical evolutionary dynamics analysis to accomplish this task automatically while, at the same time, identifying sequence determinants of protein function. Here I describe the statistical model and sampling strategies underlying this sampler. When implemented and applied to simulated protein sequences (which conform to the underlying statistical model precisely), these sampling strategies efficiently converge on the hierarchy used to generate the sequences. However, for real protein sequences the sampler finds alternative, nearly-optimal hierarchies for many domains, indicating a significant degree of ambiguity. I illustrate how both the nature of such ambiguities and the most robust ("consensus") features of a hierarchy may be determined from an ensemble of independently generated hierarchies for the same domain. Such consensus hierarchies can provide reliably stable models of protein domain functional divergence.

Subjects

Amino Acid Sequence , Models, Genetic , Proteins/genetics , Algorithms , Conserved Sequence , Databases, Protein , Models, Statistical , Protein Structure, Tertiary , Sequence Alignment/methods , Sequence Analysis, Protein/methods

14.

Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures.

Neuwald, Andrew F; Lanczycki, Christopher J; Marchler-Bauer, Aron.

BMC Bioinformatics ; 13: 144, 2012 Jun 22.

Article in English | MEDLINE | ID: mdl-22726767

ABSTRACT

BACKGROUND: The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches. RESULTS: Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related, a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day. CONCLUSIONS: This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time-consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups.

Subjects

Conserved Sequence , Databases, Protein , Protein Structure, Tertiary , Proteins , Algorithms , Amino Acid Sequence , Markov Chains , Monte Carlo Method , Phylogeny , Protein Folding , Proteins/chemistry , Proteins/classification , Sequence Alignment

15.

Surveying the manifold divergence of an entire protein class for statistical clues to underlying biochemical mechanisms.

Neuwald, Andrew F.

Stat Appl Genet Mol Biol ; 10: Article 36, 2011.

Article in English | MEDLINE | ID: mdl-22331370

ABSTRACT

Certain residues have no known function yet are co-conserved across distantly related protein families and diverse organisms, suggesting that they perform critical roles associated with as-yet-unidentified molecular properties and mechanisms. This raises the question of how to obtain additional clues regarding these mysterious biochemical phenomena with a view to formulating experimentally testable hypotheses. One approach is to access the implicit biochemical information encoded within the vast amount of genomic sequence data now becoming available. Here, a new Gibbs sampling strategy is formulated and implemented that can partition hundreds of thousands of sequences within a major protein class into multiple, functionally-divergent categories based on those pattern residues that best discriminate between categories. The sampler precisely defines the partition and pattern for each category by explicitly modeling unrelated, non-functional and related-yet-divergent proteins that would otherwise obscure the analysis. To aid biological interpretation, auxiliary routines can characterize pattern residues within available crystal structures and identify those structures most likely to shed light on the roles of pattern residues. This approach can be used to define and annotate automatically subgroup-specific conserved domain profiles based on statistically-rigorous empirical criteria rather than on the subjective and labor-intensive process of manual curation. Incorporating such profiles into domain database search sites (such as the NCBI BLAST site) will provide biologists with previously inaccessible molecular information useful for hypothesis generation and experimental design. Analyses of P-loop GTPases and of AAA+ ATPases illustrate the sampler's ability to obtain such information.

Subjects

Algorithms , Proteins/chemistry , Databases, Protein , Proteins/classification , Sequence Analysis, Protein

16.

The CHAIN program: forging evolutionary links to underlying mechanisms.

Neuwald, Andrew F.

Trends Biochem Sci ; 32(11): 487-93, 2007 Nov.

Article in English | MEDLINE | ID: mdl-17962021

ABSTRACT

Proteins evolve new functions by modifying and extending the molecular machinery of an ancestral protein. Such changes show up as divergent sequence patterns, which are conserved in descendent proteins that maintain the divergent function. After multiply-aligning a set of input sequences, the CHAIN program partitions the sequences into two functionally divergent groups and then outputs an alignment that is annotated to reveal the selective pressures imposed on divergent residue positions. If atomic coordinates are also provided, hydrogen bonds and other atomic interactions associated with various categories of divergent residues are graphically displayed. Such analyses establish links between protein evolutionary divergence and functionally crucial atomic features and, as a result, can suggest plausible molecular mechanisms for experimental testing. This is illustrated here by its application to bacterial clamp-loader ATPases.

Subjects

Biological Evolution , Proteins/genetics , Adenosine Triphosphatases/chemistry , Amino Acid Sequence , Hydrogen Bonding , Molecular Sequence Data , Proteins/chemistry , Sequence Homology, Amino Acid

17.

SPARC: Structural properties associated with residue constraints.

Neuwald, Andrew F; Yang, Hui; Nixon, B Tracy.

Comput Struct Biotechnol J ; 20: 1702-1715, 2022.

Article in English | MEDLINE | ID: mdl-35495120

ABSTRACT

SPARC facilitates the generation of plausible hypotheses regarding underlying biochemical mechanisms by structurally characterizing protein sequence constraints. Such constraints appear as residues co-conserved in functionally related subgroups, as subtle pairwise correlations (i.e., direct couplings), and as correlations among these sequence features or with structural features. SPARC performs three types of analyses. First, based on pairwise sequence correlations, it estimates the biological relevance of alternative conformations and of homomeric contacts, as illustrated here for death domains. Second, it estimates the statistical significance of the correspondence between directly coupled residue pairs and interactions at heterodimeric interfaces. Third, given molecular dynamics simulated structures, it characterizes interactions among constrained residues or between such residues and ligands that: (a) are stably maintained during the simulation; (b) undergo correlated formation and/or disruption of interactions with other constrained residues; or (c) switch between alternative interactions. We illustrate this for two homohexameric complexes: the bacterial enhancer binding protein (bEBP) NtrC1, which activates transcription by remodeling RNA polymerase (RNAP) containing σ54, and for DnaB helicase, which opens DNA at the bacterial replication fork. Based on the NtrC1 analysis, we hypothesize possible mechanisms for inhibiting ATP hydrolysis until ADP is released from an adjacent subunit and for coupling ATP hydrolysis to restructuring of σ54 binding loops. Based on the DnaB analysis, we hypothesize that DnaB 'grabs' ssDNA by flipping every fourth base and inserting it into cavities between subunits and that flipping of a DnaB-specific glutamine residue triggers ATP hydrolysis.

18.

Bayesian shadows of molecular mechanisms cast in the light of evolution.

Neuwald, Andrew F.

Trends Biochem Sci ; 31(7): 374-82, 2006 Jul.

Article in English | MEDLINE | ID: mdl-16766187

ABSTRACT

A great many carefully designed experiments will be required to fully understand biological mechanisms in atomic detail. A complementary approach is to use powerful statistical procedures to rapidly test numerous scientific hypotheses using vast numbers of protein sequences, the cell's own blueprints for specifying biological mechanisms. Bayesian inference of the evolutionary constraints imposed on functionally divergent proteins can reveal key components of the molecular machinery and thereby suggest likely mechanisms to test experimentally. This approach is demonstrated by considering how DNA polymerase clamp-loader AAA+ ATPases couple DNA recognition to ATP hydrolysis and clamp loading.

Subjects

Adenosine Triphosphatases/metabolism , Bayes Theorem , DNA Polymerase III/physiology , Molecular Evolution , Adenosine Triphosphate/metabolism , Amino Acid Sequence , Molecular Models , Molecular Sequence Data , Replication Protein C/physiology , Sequence Alignment

19.

Identifying intracellular signaling modules and exploring pathways associated with breast cancer recurrence.

Chen, Xi; Gu, Jinghua; Neuwald, Andrew F; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua.

Sci Rep ; 11(1): 385, 2021 Jan 11.

Article in English | MEDLINE | ID: mdl-33432018

ABSTRACT

Exploring complex modularization of intracellular signal transduction pathways is critical to understanding aberrant cellular responses during disease development and drug treatment. IMPALA (Inferred Modularization of PAthway LAndscapes) integrates information from high-throughput gene expression experiments and genome-scale knowledge databases to identify aberrant pathway modules, thereby providing a powerful sampling strategy to reconstruct and explore pathway landscapes. Here IMPALA identifies pathway modules associated with breast cancer recurrence and Tamoxifen resistance. Focusing on estrogen-receptor (ER) signaling, IMPALA identifies alternative pathways from gene expression data of Tamoxifen-treated ER-positive breast cancer patient samples. These pathways were often interconnected through cytoplasmic genes such as IRS1/2, JAK1, YWHAZ, CSNK2A1, MAPK1 and HSP90AA1 and significantly enriched with ErbB, MAPK, and JAK-STAT signaling components. Characterization of the pathway landscape revealed key modules associated with ER signaling and with cell cycle and apoptosis signaling. We validated IMPALA-identified pathway modules using data from four different breast cancer cell lines, including Tamoxifen-sensitive and Tamoxifen-resistant models. Results showed that a majority of genes in cell cycle/apoptosis modules that were up-regulated in breast cancer patients with short survival (< 5 years) were also over-expressed in drug-resistant cell lines, whereas the transcription factors JUN, FOS, and STAT3 were down-regulated in both patient samples and drug-resistant cell lines. Hence, IMPALA-identified pathways were associated with Tamoxifen resistance and an increased risk of breast cancer recurrence. The IMPALA package is available at https://dlrl.ece.vt.edu/software/ .

Subjects

Breast Neoplasms/pathology , Computational Biology , Local Neoplasm Recurrence/genetics , Algorithms , Breast Neoplasms/drug therapy , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Antineoplastic Drug Resistance/genetics , Female , Neoplastic Gene Expression Regulation , Gene Regulatory Networks/physiology , BRCA1 Genes , Humans , Neoplasm Metastasis , Local Neoplasm Recurrence/metabolism , ErbB-2 Receptor/genetics , ErbB-2 Receptor/metabolism , Estrogen Receptors/genetics , Estrogen Receptors/metabolism , Signal Transduction/genetics , Tamoxifen/pharmacology , Tamoxifen/therapeutic use

20.

A Bayesian approach for accurate de novo transcriptome assembly.

Shi, Xu; Wang, Xiao; Neuwald, Andrew F; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua.

Sci Rep ; 11(1): 17663, 2021 Sep 03.

Article in English | MEDLINE | ID: mdl-34480063

ABSTRACT

De novo transcriptome assembly from billions of RNA-seq reads is very challenging due to alternative splicing and various levels of expression, which often leads to incorrect, mis-assembled transcripts. BayesDenovo addresses this problem by using both a read-guided strategy to accurately reconstruct splicing graphs from the RNA-seq data and a Bayesian strategy to estimate, from these graphs, the probability of transcript expression without penalizing poorly expressed transcripts. Simulation and cell line benchmark studies demonstrate that BayesDenovo is very effective in reducing false positives and achieves much higher accuracy than other assemblers, especially for alternatively spliced genes and for highly or poorly expressed transcripts. Moreover, BayesDenovo is more robust on multiple replicates by assembling a larger portion of common transcripts. When applied to breast cancer data, BayesDenovo identifies phenotype-specific transcripts associated with breast cancer recurrence.

Subjects

Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing , Transcriptome , Bayes Theorem , Computer Simulation , Humans , RNA Sequence Analysis
