Results 1 - 20 of 97
1.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.


Subjects
Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Protein Sequence Analysis/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Protein Databases
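
The abstract of record 1 defines the E-score between two residues as the cosine similarity of their embedding vectors. A minimal Python sketch of that definition follows. The function name `e_score` and the toy 4-dimensional vectors are assumptions made for illustration; real embeddings would come from a protein language model such as ProtT5, and this is not the authors' implementation.

```python
import math


def e_score(u, v):
    """Cosine similarity between two residue embedding vectors.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real residue embeddings (made-up values).
emb_a = [1.0, 0.0, 2.0, 1.0]
emb_b = [2.0, 0.0, 4.0, 2.0]  # same direction as emb_a
emb_c = [0.0, 3.0, 0.0, 0.0]  # orthogonal to emb_a

print(e_score(emb_a, emb_b))
print(e_score(emb_a, emb_c))
```

Because the score depends only on the angle between the vectors, the same amino acid can score differently against a partner depending on the context that produced each embedding, which is the contrast with a fixed matrix such as BLOSUM that the abstract draws.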
2.
Microbiol Spectr ; 12(4): e0358423, 2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38436242

ABSTRACT

We conducted an in silico analysis to better understand the potential factors impacting host adaptation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in white-tailed deer, humans, and mink due to the strong evidence of sustained transmission within these hosts. Classification models trained on single nucleotide and amino acid differences between samples effectively identified white-tailed deer-, human-, and mink-derived SARS-CoV-2. For example, the balanced accuracy score of Extremely Randomized Trees classifiers was 0.984 ± 0.006. Eighty-eight commonly identified predictive mutations are found at sites under strong positive and negative selective pressure. A large fraction of sites under selection (86.9%) or identified by machine learning (87.1%) are found in genes other than the spike. Some locations encoded by these gene regions are predicted to be B- and T-cell epitopes or are implicated in modulating the immune response suggesting that host adaptation may involve the evasion of the host immune system, modulation of the class-I major-histocompatibility complex, and the diminished recognition of immune epitopes by CD8+ T cells. Our selection and machine learning analysis also identified that silent mutations, such as C7303T and C9430T, play an important role in discriminating deer-derived samples across multiple clades. Finally, our investigation into the origin of the B.1.641 lineage from white-tailed deer in Canada discovered an additional human sequence from Michigan related to the B.1.641 lineage sampled near the emergence of this lineage. 
These findings demonstrate that machine-learning approaches can be used in combination with evolutionary genomics to identify factors possibly involved in the cross-species transmission of viruses and the emergence of novel viral lineages. IMPORTANCE: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly transmissible virus capable of infecting and establishing itself in human and wildlife populations, such as white-tailed deer. This fact highlights the importance of developing novel ways to identify genetic factors that contribute to its spread and adaptation to new host species. This is especially important since these populations can serve as reservoirs that potentially facilitate the re-introduction of new variants into human populations. In this study, we apply machine learning and phylogenetic methods to uncover biomarkers of SARS-CoV-2 adaptation in mink and white-tailed deer. We find evidence demonstrating that both non-synonymous and silent mutations can be used to differentiate animal-derived sequences from human-derived ones and each other. This evidence also suggests that host adaptation involves the evasion of the immune system and the suppression of antigen presentation. Finally, the methods developed here are general and can be used to investigate host adaptation in viruses other than SARS-CoV-2.


Subjects
COVID-19 , Deer , Animals , Humans , SARS-CoV-2/genetics , Phylogeny , Mink
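
The abstract of record 2 reports classifier quality as a balanced accuracy score. As a rough illustration of what that metric measures, the sketch below computes it as the mean of per-class recall, which is the usual definition. The function name and the toy host labels are assumptions for this example and are not the study's data or code.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Made-up, imbalanced toy labels: 2 deer, 6 human, 2 mink samples.
y_true = ["deer", "deer"] + ["human"] * 6 + ["mink", "mink"]
y_pred = ["deer", "human"] + ["human"] * 6 + ["mink", "mink"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)
print(balanced_accuracy(y_true, y_pred))
```

In this toy case one of the two deer samples is misclassified, so plain accuracy stays at 9 out of 10 while balanced accuracy falls to the mean of the recalls 0.5, 1.0 and 1.0. That sensitivity to small classes is why the metric suits host-labelled data where human-derived sequences are likely to outnumber animal-derived ones.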
3.
J Mol Evol ; 92(2): 153-168, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38485789

ABSTRACT

Protein low complexity regions (LCRs) are compositionally biased amino acid sequences, many of which have significant evolutionary impacts on the proteins which contain them. They are mutationally unstable, experiencing higher rates of indels and substitutions than higher complexity regions. LCRs also impact the expression of their proteins, likely through multiple effects along the path from gene transcription, through translation, and eventual protein degradation. It has been observed that proteins which contain LCRs are associated with elevated transcript abundance (TAb), despite having lower protein abundance. We have gathered and integrated human data to investigate the co-evolution of TAb and LCRs through ancestral reconstructions and model inference using an approximate Bayesian calculation based method. We observe that on short evolutionary timescales TAb evolution is significantly impacted by changes in LCR length, with insertions driving TAb down. In contrast, the observed data are best explained by indel rates in LCRs which are unaffected by shifts in TAb. Our work demonstrates a coupling between LCR and TAb evolution, and the utility of incorporating multiple responses into evolutionary analyses.


Subjects
Molecular Evolution , Proteins , Humans , Bayes Theorem , Proteins/genetics , Proteins/chemistry , Amino Acid Sequence , Protein Domains
4.
Bioinformatics ; 40(1), 2024 01 02.
Article in English | MEDLINE | ID: mdl-38212995

ABSTRACT

MOTIVATION: Proteins accomplish cellular functions by interacting with each other, which makes the prediction of interaction sites a fundamental problem. As experimental methods are expensive and time consuming, computational prediction of the interaction sites has been studied extensively. Structure-based programs are the most accurate, while the sequence-based ones are much more widely applicable, as the sequences available outnumber the structures by two orders of magnitude. Ideally, we would like a tool that has the quality of the former and the applicability of the latter. RESULTS: We provide here the first solution that achieves these two goals. Our new sequence-based program, Seq-InSite, greatly surpasses the performance of sequence-based models, matching the quality of state-of-the-art structure-based predictors, thus effectively superseding the need for models requiring structure. The predictive power of Seq-InSite is illustrated using an analysis of evolutionary conservation for four protein sequences. AVAILABILITY AND IMPLEMENTATION: Seq-InSite is freely available as a web server at http://seq-insite.csd.uwo.ca/ and as free source code, including trained models and all datasets used for training and testing, at https://github.com/lucian-ilie/Seq-InSite.


Subjects
Proteins , Software , Proteins/chemistry , Amino Acid Sequence
5.
PLoS Pathog ; 19(7): e1011538, 2023 07.
Article in English | MEDLINE | ID: mdl-37523413

ABSTRACT

Brucellosis is a disease caused by the bacterium Brucella and typically transmitted through contact with infected ruminants. It is one of the most common chronic zoonotic diseases and of particular interest to public health agencies. Despite its well-known transmission history and characteristic symptoms, we lack a more complete understanding of the evolutionary history of its best-known species, Brucella melitensis. To address this knowledge gap we fortuitously found, sequenced and assembled a high-quality ancient B. melitensis draft genome from the kidney stone of a 14th-century Italian friar. The ancient strain contained fewer core genes than modern B. melitensis isolates, carried a complete complement of virulence genes, and did not contain any indication of significant antimicrobial resistances. The ancient B. melitensis genome fell as a basal sister lineage to a subgroup of B. melitensis strains within the Western Mediterranean phylogenetic group, with a short branch length indicative of its earlier sampling time, along with a similar gene content. By calibrating the molecular clock we suggest that the speciation event between B. melitensis and B. abortus is contemporaneous with the estimated time frame for the domestication of both sheep and goats. These results confirm the existence of the Western Mediterranean clade as a separate group in the 14th century CE and suggest that its divergence was due to human and ruminant co-migration.


Subjects
Brucella melitensis , Brucellosis , Humans , Animals , Sheep , Brucella melitensis/genetics , Brucella abortus/genetics , Phylogeny , Brucellosis/microbiology , Zoonoses , Goats
6.
bioRxiv ; 2023 Apr 07.
Article in English | MEDLINE | ID: mdl-37066254

ABSTRACT

Barton et al. [1] raise several statistical concerns regarding our original analyses [2] that highlight the challenge of inferring natural selection using ancient genomic data. We show here that these concerns have limited impact on our original conclusions. Specifically, we recover the same signature of enrichment for high F_ST values at the immune loci relative to putatively neutral sites after switching the allele frequency estimation method to a maximum likelihood approach, filtering to only consider known human variants, and down-sampling our data to the same mean coverage across sites. Furthermore, using permutations, we show that the rs2549794 variant near ERAP2 continues to emerge as the strongest candidate for selection (p = 1.2 × 10⁻⁵), falling below the Bonferroni-corrected significance threshold recommended by Barton et al. Importantly, the evidence for selection on ERAP2 is further supported by functional data demonstrating the impact of the ERAP2 genotype on the immune response to Y. pestis and by epidemiological data from an independent group showing that the putatively selected allele during the Black Death protects against severe respiratory infection in contemporary populations.

7.
Mol Biol Evol ; 40(4), 2023 04 04.
Article in English | MEDLINE | ID: mdl-37036379

ABSTRACT

Low complexity sequences (LCRs) are well known within coding as well as non-coding sequences. A low complexity region within a protein must be encoded by the underlying DNA sequence. Here, we examine the relationship between the entropy of the protein sequence and that of the DNA sequence which encodes it. We show that they are poorly correlated whether starting with a low complexity region within the protein and comparing it to the corresponding sequence in the DNA or by finding a low complexity region within coding DNA and comparing it to the corresponding sequence in the protein. We show this is the case within the proteomes of five model organisms: Homo sapiens, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana. We also report a significant bias against mononucleic codons in LCR encoding sequences. By comparison with simulated proteomes, we show that highly repetitive LCRs may be explained by neutral, slippage-based evolution, but compositionally biased LCRs with cryptic repeats are not. We demonstrate that other biological biases and forces must be acting to create and maintain these LCRs. Uncovering these forces will improve our understanding of protein LCR evolution.


Subjects
Drosophila melanogaster , Proteome , Animals , Drosophila melanogaster/genetics , DNA , Amino Acid Sequence , Saccharomyces cerevisiae/genetics
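
The abstract of record 7 compares the entropy of a protein sequence with that of the DNA encoding it. The sketch below shows one common way to compute such a value, the Shannon entropy of the symbol frequencies in a sequence. The abstract does not state the exact entropy definition or windowing the authors used, so the function and the toy sequences here are assumptions for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per symbol) of the symbol frequencies in seq."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    counts = Counter(seq)
    # Sum of p * log2(1/p) over observed symbols.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Toy example: a glutamine tract and one possible coding sequence for it.
# CAG and CAA both encode glutamine (Q) in the standard genetic code.
protein = "QQQQ"
dna = "CAGCAACAGCAA"

print(shannon_entropy(protein))
print(shannon_entropy(dna))
```

The homopolymer has zero entropy at the protein level, while its coding DNA mixes three nucleotides and two synonymous codons and therefore scores above zero. This is a small example of how low complexity in a protein need not imply equally low complexity in the underlying DNA.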
8.
Microbiol Spectr ; : e0206522, 2023 Mar 06.
Article in English | MEDLINE | ID: mdl-36877086

ABSTRACT

Developing an understanding of how microbial communities vary across conditions is an important analytical step. We used 16S rRNA data isolated from human stool samples to investigate whether learned dissimilarities, such as those produced using unsupervised decision tree ensembles, can be used to improve the analysis of the composition of bacterial communities in patients suffering from Crohn's disease and adenomas/colorectal cancers. We also introduce a workflow capable of learning dissimilarities, projecting them into a lower dimensional space, and identifying features that impact the location of samples in the projections. For example, when used with the centered log ratio transformation, our new workflow (TreeOrdination) could identify differences in the microbial communities of Crohn's disease patients and healthy controls. Further investigation of our models elucidated the global impact amplicon sequence variants (ASVs) had on the locations of samples in the projected space and how each ASV impacted individual samples in this space. Furthermore, this approach can be used to integrate patient data easily into the model and results in models that generalize well to unseen data. Models employing multivariate splits can improve the analysis of complex high-throughput sequencing data sets because they are better able to learn about the underlying structure of the data set. IMPORTANCE: There is an ever-increasing level of interest in accurately modeling and understanding the roles that commensal organisms play in human health and disease. We show that learned representations can be used to create informative ordinations. We also demonstrate that the application of modern model introspection algorithms can be used to investigate and quantify the impacts of taxa in these ordinations, and that the taxa identified by these approaches have been associated with immune-mediated inflammatory diseases and colorectal cancer.
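
The abstract of record 8 mentions the centered log ratio (CLR) transformation as a preprocessing step. A minimal sketch of the standard CLR definition is given below: each value's logarithm minus the mean logarithm of the sample. The pseudocount used to handle zero counts is an assumption of this example. The abstract does not say how zeros were treated, and this is not the TreeOrdination code.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log ratio of one sample: log(x_i) minus the mean of log(x).

    The pseudocount (an assumption here) keeps zero counts finite.
    """
    if not counts:
        raise ValueError("at least one count is required")
    shifted = [c + pseudocount for c in counts]
    if any(v <= 0 for v in shifted):
        raise ValueError("all shifted counts must be positive")
    logs = [math.log(v) for v in shifted]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Made-up ASV counts for a single stool sample.
sample = [10, 0, 5, 85]
transformed = clr(sample)
print(transformed)
print(sum(transformed))  # CLR values sum to zero up to rounding error
```

CLR values describe each feature relative to the geometric mean of its sample, which removes the dependence on total read depth that makes raw amplicon counts hard to compare between samples.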

9.
Curr Biol ; 33(6): 1147-1152.e5, 2023 03 27.
Article in English | MEDLINE | ID: mdl-36841239

ABSTRACT

The historical epidemiology of plague is controversial due to the scarcity and ambiguity of available data [1,2]. A common source of debate is the extent and pattern of plague re-emergence and local continuity in Europe during the 14th-18th century CE [3]. Despite having a uniquely long history of plague (∼5,000 years), Scandinavia is relatively underrepresented in the historical archives [4,5]. To better understand the historical epidemiology and evolutionary history of plague in this region, we performed in-depth (n = 298) longitudinal screening (800 years) for the plague bacterium Yersinia pestis (Y. pestis) across 13 archaeological sites in Denmark from 1000 to 1800 CE. Our genomic and phylogenetic data captured the emergence, continuity, and evolution of Y. pestis in this region over a period of 300 years (14th-17th century CE), for which the plague-positivity rate was 8.3% (3.3%-14.3% by site). Our phylogenetic analysis revealed that the Danish Y. pestis sequences were interspersed with those from other European countries, rather than forming a single cluster, indicative of the generation, spread, and replacement of bacterial variants through communities rather than their long-term local persistence. These results provide an epidemiological link between Y. pestis and the unknown pestilence that afflicted medieval and early modern Europe. They also demonstrate how population-scale genomic evidence can be used to test hypotheses on disease mortality and epidemiology and help pave the way for the next generation of historical disease research.


Subjects
Plague , Yersinia pestis , Humans , Yersinia pestis/genetics , Plague/epidemiology , Plague/microbiology , Phylogeny , Bacterial Genome , Denmark
10.
Commun Biol ; 6(1): 23, 2023 01 19.
Article in English | MEDLINE | ID: mdl-36658311

ABSTRACT

Plague has an enigmatic history as a zoonotic pathogen. This infectious disease will unexpectedly appear in human populations and disappear just as suddenly. As a result, a long-standing line of inquiry has been to estimate when and where plague appeared in the past. However, there have been significant disparities between phylogenetic studies of the causative bacterium, Yersinia pestis, regarding the timing and geographic origins of its reemergence. Here, we curate and contextualize an updated phylogeny of Y. pestis using 601 genome sequences sampled globally. Through a detailed Bayesian evaluation of temporal signal in subsets of these data we demonstrate that a Y. pestis-wide molecular clock is unstable. To resolve this, we developed a new approach in which each Y. pestis population was assessed independently, enabling us to recover substantial temporal signal in five populations, including the ancient pandemic lineages which we now estimate may have emerged decades, or even centuries, before a pandemic was historically documented from European sources. Despite this methodological advancement, we only obtain robust divergence dates from populations sampled over a period of at least 90 years, indicating that genetic evidence alone is insufficient for accurately reconstructing the timing and spread of short-term plague epidemics.


Subjects
Plague , Yersinia pestis , Humans , Plague/epidemiology , Plague/genetics , Plague/microbiology , Yersinia pestis/genetics , Phylogeny , Bayes Theorem , Bacterial Genome