1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
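The E-score definition in this abstract (cosine similarity between two residue embedding vectors) can be illustrated with a minimal sketch. The function name `e_score` and the toy 3-dimensional vectors are assumptions made here for illustration; real contextual embeddings such as those from ProtT5 have far more dimensions and come from a trained model, and this is not the code from the authors' repository.

```python
import math
from typing import Sequence


def e_score(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two residue embedding vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real contextual embeddings (assumed values).
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]   # same direction as emb_a, so similarity is close to 1
emb_c = [0.0, 0.0, 3.0]   # orthogonal to emb_a, so similarity is 0

print(e_score(emb_a, emb_b))
print(e_score(emb_a, emb_c))
```

Because the score depends on the embedding of each residue in its sequence context, the same pair of amino acids can receive different scores at different positions, which is what distinguishes it from a fixed matrix lookup.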


Subjects
Algorithms; Computational Biology; Sequence Alignment; Sequence Alignment/methods; Computational Biology/methods; Software; Sequence Analysis, Protein/methods; Amino Acid Sequence; Proteins/chemistry; Proteins/genetics; Deep Learning; Databases, Protein
2.
J Mol Evol ; 92(2): 153-168, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38485789

ABSTRACT

Protein low complexity regions (LCRs) are compositionally biased amino acid sequences, many of which have significant evolutionary impacts on the proteins which contain them. They are mutationally unstable, experiencing higher rates of indels and substitutions than higher complexity regions. LCRs also impact the expression of their proteins, likely through multiple effects along the path from gene transcription, through translation, and eventual protein degradation. It has been observed that proteins which contain LCRs are associated with elevated transcript abundance (TAb), despite having lower protein abundance. We have gathered and integrated human data to investigate the co-evolution of TAb and LCRs through ancestral reconstructions and model inference using an approximate Bayesian calculation based method. We observe that on short evolutionary timescales TAb evolution is significantly impacted by changes in LCR length, with insertions driving TAb down. But in contrast, the observed data is best explained by indel rates in LCRs which are unaffected by shifts in TAb. Our work demonstrates a coupling between LCR and TAb evolution, and the utility of incorporating multiple responses into evolutionary analyses.


Subjects
Evolution, Molecular; Proteins; Humans; Bayes Theorem; Proteins/genetics; Proteins/chemistry; Amino Acid Sequence; Protein Domains
3.
Microbiol Spectr ; 12(4): e0358423, 2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38436242

ABSTRACT

We conducted an in silico analysis to better understand the potential factors impacting host adaptation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in white-tailed deer, humans, and mink due to the strong evidence of sustained transmission within these hosts. Classification models trained on single nucleotide and amino acid differences between samples effectively identified white-tailed deer-, human-, and mink-derived SARS-CoV-2. For example, the balanced accuracy score of Extremely Randomized Trees classifiers was 0.984 ± 0.006. Eighty-eight commonly identified predictive mutations are found at sites under strong positive and negative selective pressure. A large fraction of sites under selection (86.9%) or identified by machine learning (87.1%) are found in genes other than the spike. Some locations encoded by these gene regions are predicted to be B- and T-cell epitopes or are implicated in modulating the immune response suggesting that host adaptation may involve the evasion of the host immune system, modulation of the class-I major-histocompatibility complex, and the diminished recognition of immune epitopes by CD8+ T cells. Our selection and machine learning analysis also identified that silent mutations, such as C7303T and C9430T, play an important role in discriminating deer-derived samples across multiple clades. Finally, our investigation into the origin of the B.1.641 lineage from white-tailed deer in Canada discovered an additional human sequence from Michigan related to the B.1.641 lineage sampled near the emergence of this lineage. 
These findings demonstrate that machine-learning approaches can be used in combination with evolutionary genomics to identify factors possibly involved in the cross-species transmission of viruses and the emergence of novel viral lineages. IMPORTANCE: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly transmissible virus capable of infecting and establishing itself in human and wildlife populations, such as white-tailed deer. This fact highlights the importance of developing novel ways to identify genetic factors that contribute to its spread and adaptation to new host species. This is especially important since these populations can serve as reservoirs that potentially facilitate the re-introduction of new variants into human populations. In this study, we apply machine learning and phylogenetic methods to uncover biomarkers of SARS-CoV-2 adaptation in mink and white-tailed deer. We find evidence demonstrating that both non-synonymous and silent mutations can be used to differentiate animal-derived sequences from human-derived ones and each other. This evidence also suggests that host adaptation involves the evasion of the immune system and the suppression of antigen presentation. Finally, the methods developed here are general and can be used to investigate host adaptation in viruses other than SARS-CoV-2.


Subjects
COVID-19; Deer; Animals; Humans; SARS-CoV-2/genetics; Phylogeny; Mink
4.
Bioinformatics ; 40(1)2024 01 02.
Article in English | MEDLINE | ID: mdl-38212995

ABSTRACT

MOTIVATION: Proteins accomplish cellular functions by interacting with each other, which makes the prediction of interaction sites a fundamental problem. As experimental methods are expensive and time consuming, computational prediction of the interaction sites has been studied extensively. Structure-based programs are the most accurate, while the sequence-based ones are much more widely applicable, as the sequences available outnumber the structures by two orders of magnitude. Ideally, we would like a tool that has the quality of the former and the applicability of the latter. RESULTS: We provide here the first solution that achieves these two goals. Our new sequence-based program, Seq-InSite, greatly surpasses the performance of sequence-based models, matching the quality of state-of-the-art structure-based predictors, thus effectively superseding the need for models requiring structure. The predictive power of Seq-InSite is illustrated using an analysis of evolutionary conservation for four protein sequences. AVAILABILITY AND IMPLEMENTATION: Seq-InSite is freely available as a web server at http://seq-insite.csd.uwo.ca/ and as free source code, including trained models and all datasets used for training and testing, at https://github.com/lucian-ilie/Seq-InSite.


Subjects
Proteins; Software; Proteins/chemistry; Amino Acid Sequence
5.
PLoS Pathog ; 19(7): e1011538, 2023 07.
Article in English | MEDLINE | ID: mdl-37523413

ABSTRACT

Brucellosis is a disease caused by the bacterium Brucella and typically transmitted through contact with infected ruminants. It is one of the most common chronic zoonotic diseases and of particular interest to public health agencies. Despite its well-known transmission history and characteristic symptoms, we lack a more complete understanding of the evolutionary history of its best-known species, Brucella melitensis. To address this knowledge gap we fortuitously found, sequenced and assembled a high-quality ancient B. melitensis draft genome from the kidney stone of a 14th-century Italian friar. The ancient strain contained fewer core genes than modern B. melitensis isolates, carried a complete complement of virulence genes, and did not contain any indication of significant antimicrobial resistances. The ancient B. melitensis genome fell as a basal sister lineage to a subgroup of B. melitensis strains within the Western Mediterranean phylogenetic group, with a short branch length indicative of its earlier sampling time, along with a similar gene content. By calibrating the molecular clock we suggest that the speciation event between B. melitensis and B. abortus is contemporaneous with the estimated time frame for the domestication of both sheep and goats. These results confirm the existence of the Western Mediterranean clade as a separate group in the 14th century CE and suggest that its divergence was due to human and ruminant co-migration.


Subjects
Brucella melitensis; Brucellosis; Humans; Animals; Sheep; Brucella melitensis/genetics; Brucella abortus/genetics; Phylogeny; Brucellosis/microbiology; Zoonoses; Goats
6.
bioRxiv ; 2023 Apr 07.
Article in English | MEDLINE | ID: mdl-37066254

ABSTRACT

Barton et al. [1] raise several statistical concerns regarding our original analyses [2] that highlight the challenge of inferring natural selection using ancient genomic data. We show here that these concerns have limited impact on our original conclusions. Specifically, we recover the same signature of enrichment for high FST values at the immune loci relative to putatively neutral sites after switching the allele frequency estimation method to a maximum likelihood approach, filtering to only consider known human variants, and down-sampling our data to the same mean coverage across sites. Furthermore, using permutations, we show that the rs2549794 variant near ERAP2 continues to emerge as the strongest candidate for selection (p = 1.2 × 10^-5), falling below the Bonferroni-corrected significance threshold recommended by Barton et al. Importantly, the evidence for selection on ERAP2 is further supported by functional data demonstrating the impact of the ERAP2 genotype on the immune response to Y. pestis and by epidemiological data from an independent group showing that the putatively selected allele during the Black Death protects against severe respiratory infection in contemporary populations.

7.
Mol Biol Evol ; 40(4)2023 04 04.
Article in English | MEDLINE | ID: mdl-37036379

ABSTRACT

Low complexity sequences (LCRs) are well known within coding as well as non-coding sequences. A low complexity region within a protein must be encoded by the underlying DNA sequence. Here, we examine the relationship between the entropy of the protein sequence and that of the DNA sequence which encodes it. We show that they are poorly correlated whether starting with a low complexity region within the protein and comparing it to the corresponding sequence in the DNA or by finding a low complexity region within coding DNA and comparing it to the corresponding sequence in the protein. We show this is the case within the proteomes of five model organisms: Homo sapiens, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana. We also report a significant bias against mononucleic codons in LCR encoding sequences. By comparison with simulated proteomes, we show that highly repetitive LCRs may be explained by neutral, slippage-based evolution, but compositionally biased LCRs with cryptic repeats are not. We demonstrate that other biological biases and forces must be acting to create and maintain these LCRs. Uncovering these forces will improve our understanding of protein LCR evolution.
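The comparison described in this abstract, between the entropy of a protein region and that of the DNA encoding it, can be made concrete with a small sketch. It assumes plain Shannon entropy over single-symbol frequencies and a made-up polyglutamine example; the abstract does not state the exact entropy measure or window size the authors used, so `shannon_entropy` and both sequences are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the symbol frequencies in seq."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    # p * log2(1/p) summed over the observed symbols
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


# Hypothetical example: a pure glutamine run encoded by alternating CAG/CAA codons.
protein = "QQQQQQ"
coding_dna = "CAGCAACAGCAACAGCAA"

print(shannon_entropy(protein))     # a single repeated residue carries no entropy
print(shannon_entropy(coding_dna))  # the encoding DNA still mixes three nucleotides
```

In this toy case the protein region has the lowest possible entropy while its coding DNA does not, which is the kind of mismatch between the two levels that the abstract reports.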


Subjects
Drosophila melanogaster; Proteome; Animals; Drosophila melanogaster/genetics; DNA; Amino Acid Sequence; Saccharomyces cerevisiae/genetics
8.
Microbiol Spectr ; : e0206522, 2023 Mar 06.
Article in English | MEDLINE | ID: mdl-36877086

ABSTRACT

Developing an understanding of how microbial communities vary across conditions is an important analytical step. We used 16S rRNA data isolated from human stool samples to investigate whether learned dissimilarities, such as those produced using unsupervised decision tree ensembles, can be used to improve the analysis of the composition of bacterial communities in patients suffering from Crohn's disease and adenomas/colorectal cancers. We also introduce a workflow capable of learning dissimilarities, projecting them into a lower dimensional space, and identifying features that impact the location of samples in the projections. For example, when used with the centered log ratio transformation, our new workflow (TreeOrdination) could identify differences in the microbial communities of Crohn's disease patients and healthy controls. Further investigation of our models elucidated the global impact amplicon sequence variants (ASVs) had on the locations of samples in the projected space and how each ASV impacted individual samples in this space. Furthermore, this approach can be used to integrate patient data easily into the model and results in models that generalize well to unseen data. Models employing multivariate splits can improve the analysis of complex high-throughput sequencing data sets because they are better able to learn about the underlying structure of the data set. IMPORTANCE: There is an ever-increasing level of interest in accurately modeling and understanding the roles that commensal organisms play in human health and disease. We show that learned representations can be used to create informative ordinations. We also demonstrate that the application of modern model introspection algorithms can be used to investigate and quantify the impacts of taxa in these ordinations, and that the taxa identified by these approaches have been associated with immune-mediated inflammatory diseases and colorectal cancer.
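The centered log ratio (CLR) transformation mentioned in this abstract has a standard definition: each log-transformed component minus the mean of the logs for that sample. The sketch below shows that definition only; the pseudocount used to handle zero counts and the example ASV counts are assumptions for illustration, and nothing here reproduces the TreeOrdination workflow itself.

```python
import math
from typing import List, Sequence


def clr(counts: Sequence[float], pseudocount: float = 1.0) -> List[float]:
    """Centered log ratio of one sample's counts.

    The pseudocount is an assumption of this sketch: it keeps log() defined
    when a count is zero, which is common in amplicon count tables.
    """
    shifted = [c + pseudocount for c in counts]
    if any(v <= 0 for v in shifted):
        raise ValueError("all shifted counts must be positive")
    logs = [math.log(v) for v in shifted]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical ASV counts for a single stool sample.
sample = [10, 0, 5, 85]
transformed = clr(sample)
print(transformed)
print(sum(transformed))  # CLR values sum to (approximately) zero
```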

9.
Curr Biol ; 33(6): 1147-1152.e5, 2023 03 27.
Article in English | MEDLINE | ID: mdl-36841239

ABSTRACT

The historical epidemiology of plague is controversial due to the scarcity and ambiguity of available data [1,2]. A common source of debate is the extent and pattern of plague re-emergence and local continuity in Europe during the 14th-18th century CE [3]. Despite having a uniquely long history of plague (∼5,000 years), Scandinavia is relatively underrepresented in the historical archives [4,5]. To better understand the historical epidemiology and evolutionary history of plague in this region, we performed in-depth (n = 298) longitudinal screening (800 years) for the plague bacterium Yersinia pestis (Y. pestis) across 13 archaeological sites in Denmark from 1000 to 1800 CE. Our genomic and phylogenetic data captured the emergence, continuity, and evolution of Y. pestis in this region over a period of 300 years (14th-17th century CE), for which the plague-positivity rate was 8.3% (3.3%-14.3% by site). Our phylogenetic analysis revealed that the Danish Y. pestis sequences were interspersed with those from other European countries, rather than forming a single cluster, indicative of the generation, spread, and replacement of bacterial variants through communities rather than their long-term local persistence. These results provide an epidemiological link between Y. pestis and the unknown pestilence that afflicted medieval and early modern Europe. They also demonstrate how population-scale genomic evidence can be used to test hypotheses on disease mortality and epidemiology and help pave the way for the next generation of historical disease research.


Subjects
Plague; Yersinia pestis; Humans; Yersinia pestis/genetics; Plague/epidemiology; Plague/microbiology; Phylogeny; Genome, Bacterial; Denmark
10.
Commun Biol ; 6(1): 23, 2023 01 19.
Article in English | MEDLINE | ID: mdl-36658311

ABSTRACT

Plague has an enigmatic history as a zoonotic pathogen. This infectious disease will unexpectedly appear in human populations and disappear just as suddenly. As a result, a long-standing line of inquiry has been to estimate when and where plague appeared in the past. However, there have been significant disparities between phylogenetic studies of the causative bacterium, Yersinia pestis, regarding the timing and geographic origins of its reemergence. Here, we curate and contextualize an updated phylogeny of Y. pestis using 601 genome sequences sampled globally. Through a detailed Bayesian evaluation of temporal signal in subsets of these data we demonstrate that a Y. pestis-wide molecular clock is unstable. To resolve this, we developed a new approach in which each Y. pestis population was assessed independently, enabling us to recover substantial temporal signal in five populations, including the ancient pandemic lineages which we now estimate may have emerged decades, or even centuries, before a pandemic was historically documented from European sources. Despite this methodological advancement, we only obtain robust divergence dates from populations sampled over a period of at least 90 years, indicating that genetic evidence alone is insufficient for accurately reconstructing the timing and spread of short-term plague epidemics.


Subjects
Plague; Yersinia pestis; Humans; Plague/epidemiology; Plague/genetics; Plague/microbiology; Yersinia pestis/genetics; Phylogeny; Bayes Theorem; Genome, Bacterial
11.
J Comput Biol ; 30(2): 149-160, 2023 02.
Article in English | MEDLINE | ID: mdl-35939266

ABSTRACT

A partial cover of a string or sequence of length n, which we model as an array x=x[1..n], is a repeating substring u of x such that "many" positions in x lie within occurrences of u. A maximal cover u* (introduced in 2018 by Mhaskar and Smyth as optimal cover) is a partial cover that, over all partial covers u, maximizes the positions covered. Applying data structures also introduced by Mhaskar and Smyth, our software MAXCOVER for the first time enables efficient computation of u* for any x; in particular, as described here, for protein sequences of Arabidopsis, Caenorhabditis elegans, Drosophila melanogaster, and humans. In this protein context, we also compare an extended version of MAXCOVER with existing software (MUMmer's repeat-match) for the closely related task of computing non-extendible repeating substrings (a.k.a. maximal repeats). In practice, MAXCOVER is an order of magnitude faster than MUMmer, with much lower space requirements, while producing more compact output that, nevertheless, yields a more exact and user-friendly specification of the repeats.
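The definitions in this abstract (positions covered by occurrences of a repeating substring u, and a cover that maximizes that count) can be checked on small strings with a brute-force sketch. This is deliberately naive and is not the MAXCOVER algorithm, which relies on specialized data structures. The function names, the treatment of overlapping occurrences as repeats, and the tie-break toward the longer substring are all assumptions made here for illustration.

```python
from typing import Tuple


def covered_positions(x: str, u: str) -> int:
    """Count positions of x that lie within at least one occurrence of u."""
    if not u:
        raise ValueError("u must not be empty")
    covered = [False] * len(x)
    start = x.find(u)
    while start != -1:
        for i in range(start, start + len(u)):
            covered[i] = True
        start = x.find(u, start + 1)  # advance by one so overlapping occurrences count
    return sum(covered)


def brute_force_maximal_cover(x: str) -> Tuple[str, int]:
    """Try every repeating substring of x and keep the one covering most positions.

    Assumption of this sketch: ties are broken in favor of the longer substring.
    Far too slow for proteomes; intended for toy strings only.
    """
    best, best_cov = "", 0
    seen = set()
    for i in range(len(x)):
        for j in range(i + 1, len(x) + 1):
            u = x[i:j]
            if u in seen:
                continue
            seen.add(u)
            first = x.find(u)
            if x.find(u, first + 1) == -1:
                continue  # u occurs only once, so it is not a repeating substring
            cov = covered_positions(x, u)
            if cov > best_cov or (cov == best_cov and len(u) > len(best)):
                best, best_cov = u, cov
    return best, best_cov


print(covered_positions("abcabcab", "abc"))
print(brute_force_maximal_cover("abcabcab"))
```

On the toy string, "abc" covers six of the eight positions, while the overlapping repeat "abcab" covers all eight, so it is reported as the maximal cover under the assumptions above.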


Subjects
Algorithms; Drosophila melanogaster; Humans; Animals; Software; Amino Acid Sequence; Proteins
12.
Nature ; 611(7935): 312-319, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36261521

ABSTRACT

Infectious diseases are among the strongest selective pressures driving human evolution [1,2]. This includes the single greatest mortality event in recorded history, the first outbreak of the second pandemic of plague, commonly called the Black Death, which was caused by the bacterium Yersinia pestis [3]. This pandemic devastated Afro-Eurasia, killing up to 30-50% of the population [4]. To identify loci that may have been under selection during the Black Death, we characterized genetic variation around immune-related genes from 206 ancient DNA extracts, stemming from two different European populations before, during and after the Black Death. Immune loci are strongly enriched for highly differentiated sites relative to a set of non-immune loci, suggesting positive selection. We identify 245 variants that are highly differentiated within the London dataset, four of which were replicated in an independent cohort from Denmark, and represent the strongest candidates for positive selection. The selected allele for one of these variants, rs2549794, is associated with the production of a full-length (versus truncated) ERAP2 transcript, variation in cytokine response to Y. pestis and increased ability to control intracellular Y. pestis in macrophages. Finally, we show that protective variants overlap with alleles that are today associated with increased susceptibility to autoimmune diseases, providing empirical evidence for the role played by past pandemics in shaping present-day susceptibility to disease.


Subjects
DNA, Ancient; Genetic Predisposition to Disease; Immunity; Plague; Selection, Genetic; Yersinia pestis; Humans; Aminopeptidases/genetics; Aminopeptidases/immunology; Plague/genetics; Plague/immunology; Plague/microbiology; Plague/mortality; Yersinia pestis/immunology; Yersinia pestis/pathogenicity; Selection, Genetic/immunology; Europe/epidemiology; Europe/ethnology; Immunity/genetics; Datasets as Topic; London/epidemiology; Denmark/epidemiology
13.
Int J Paleopathol ; 39: 20-34, 2022 12.
Article in English | MEDLINE | ID: mdl-36174312

ABSTRACT

OBJECTIVE: To investigate variation in ancient DNA recovery of Brucella melitensis, the causative agent of brucellosis, from multiple tissues belonging to one individual. MATERIALS: 14 samples were analyzed from the mummified remains of the Blessed Sante, a 14th-century Franciscan friar from central Italy, with macroscopic diagnosis of probable brucellosis. METHODS: Shotgun sequencing data were examined to determine the presence of Brucella DNA. RESULTS: Three of the 14 samples contained authentic ancient DNA, identified as belonging to B. melitensis. A genome (23.81X depth coverage, 0.98 breadth coverage) was recovered from a kidney stone. Nine of the samples contained reads classified as B. melitensis (7-169), but for many the data quality was insufficient to withstand our identification and authentication criteria. CONCLUSIONS: We identified significant variation in the preservation and abundance of B. melitensis DNA present across multiple tissues, with calcified nodules yielding the highest number of authenticated reads. This shows how greatly sample selection can impact pathogen identification. SIGNIFICANCE: Our results demonstrate variation in the preservation and recovery of pathogen DNA across tissues. This study highlights the importance of sample selection in the reconstruction of infectious disease burden and highlights the importance of a holistic approach to identifying disease. LIMITATIONS: Study focuses on pathogen recovery in a single individual. SUGGESTIONS FOR FURTHER RESEARCH: Further analysis of how sampling impacts aDNA recovery will improve pathogen aDNA recovery and advance our understanding of disease in past peoples.


Subjects
Brucella melitensis; Brucellosis; Monks; Humans; Brucella melitensis/genetics; DNA, Ancient; Italy
14.
Commun Biol ; 5(1): 599, 2022 06 16.
Article in English | MEDLINE | ID: mdl-35710940

ABSTRACT

Escherichia coli, one of the most characterized bacteria and a major public health concern, remains invisible across the temporal landscape. Here, we present the meticulous reconstruction of the first ancient E. coli genome from a 16th century gallstone from an Italian mummy with chronic cholecystitis. We isolated ancient DNA and reconstructed the ancient E. coli genome. It consisted of one chromosome of 4446 genes and two putative plasmids with 52 genes. The E. coli strain belonged to phylogroup A and an exceptionally rare sequence type, 4995. The type VI secretion system component genes appear to have been horizontally acquired from Klebsiella aerogenes; however, we could not identify any pathovar-specific genes or any acquired antibiotic resistances. A sepsis mouse assay showed that a closely related contemporary E. coli strain was avirulent. Our reconstruction of this ancient E. coli helps paint a more complete picture of the burden of opportunistic infections of the past.


Subjects
Escherichia coli Infections; Opportunistic Infections; Animals; Bile; Escherichia coli/genetics; Escherichia coli Infections/genetics; Escherichia coli Infections/microbiology; Genome, Bacterial; Mice
15.
Mol Biol Evol ; 39(5)2022 05 03.
Article in English | MEDLINE | ID: mdl-35482425

ABSTRACT

Low Complexity Regions (LCRs) are present in a surprisingly large number of eukaryotic proteins. These highly repetitive and compositionally biased sequences are often structurally disordered, bind promiscuously, and evolve rapidly. Although LCRs are frequently studied in terms of evolutionary dynamics, little is known about how they affect the expression of the proteins which contain them. It would be expected that rapidly evolving LCRs are unlikely to be tolerated in strongly conserved, highly abundant proteins, leading to lower overall abundance in proteins which contain LCRs. To test this hypothesis and examine the associations of protein abundance and transcript abundance with the presence of LCRs, we have integrated high-throughput data from across mammals. We have found that LCRs are indeed associated with reduced protein abundance, but are also associated with elevated transcript abundance. These associations are qualitatively consistent across 12 human tissues and nine mammalian species. The differential impacts of LCRs on abundance at the protein and transcript level are not explained by differences in either protein degradation rates or the inefficiency of translation for LCR-containing proteins. We suggest that rapidly evolving LCRs are a source of selective pressure on the regulatory mechanisms which maintain steady-state protein abundance levels.


Subjects
Evolution, Molecular; Proteins; Animals; Humans; Mammals/genetics; Protein Domains; Proteins/genetics
16.
BMC Bioinformatics ; 23(1): 110, 2022 Mar 31.
Article in English | MEDLINE | ID: mdl-35361114

ABSTRACT

BACKGROUND: Identification of biomarkers, which are measurable characteristics of biological datasets, can be challenging. Although amplicon sequence variants (ASVs) can be considered potential biomarkers, identifying important ASVs in high-throughput sequencing datasets is challenging. Noise, algorithmic failures to account for specific distributional properties, and feature interactions can complicate the discovery of ASV biomarkers. In addition, these issues can impact the replicability of various models and elevate false-discovery rates. Contemporary machine learning approaches can be leveraged to address these issues. Ensembles of decision trees are particularly effective at classifying the types of data commonly generated in high-throughput sequencing (HTS) studies due to their robustness when the number of features in the training data is orders of magnitude larger than the number of samples. In addition, when combined with appropriate model introspection algorithms, machine learning algorithms can also be used to discover and select potential biomarkers. However, the construction of these models could introduce various biases which potentially obfuscate feature discovery. RESULTS: We developed a decision tree ensemble, LANDMark, which uses oblique and non-linear cuts at each node. In synthetic and toy tests LANDMark consistently ranked as the best classifier and often outperformed the Random Forest classifier. When trained on the full metabarcoding dataset obtained from Canada's Wood Buffalo National Park, LANDMark was able to create highly predictive models and achieved an overall balanced accuracy score of 0.96 ± 0.06. The use of recursive feature elimination did not impact LANDMark's generalization performance and, when trained on data from the BE amplicon, it was able to outperform the Linear Support Vector Machine, Logistic Regression models, and Stochastic Gradient Descent models (p ≤ 0.05). 
Finally, LANDMark distinguishes itself due to its ability to learn smoother non-linear decision boundaries. CONCLUSIONS: Our work introduces LANDMark, a meta-classifier which blends the characteristics of several machine learning models into a decision tree and ensemble learning framework. To our knowledge, this is the first study to apply this type of ensemble approach to amplicon sequencing data and we have shown that analyzing these datasets using LANDMark can produce highly predictive and consistent models.
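The balanced accuracy score reported in this abstract is, in its usual definition, the mean of the per-class recalls, which keeps a majority class from dominating the metric. The sketch below implements that usual definition in plain Python; the class labels and predictions are invented for illustration and have nothing to do with the Wood Buffalo National Park data.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced example: 8 samples of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

Plain accuracy looks high because the majority class is predicted perfectly, whereas the balanced score reflects that only half of the minority class was recovered.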


Subjects
Algorithms; High-Throughput Nucleotide Sequencing; Biomarkers; Machine Learning; Support Vector Machine
17.
Genome ; 65(5): 287-299, 2022 May.
Article in English | MEDLINE | ID: mdl-35073184

ABSTRACT

Genomic reorganization, such as rearrangements and inversions, influences how genetic information is organized within bacterial genomes. Inversions, in particular, facilitate genome evolution through gene gain and loss, and can alter gene expression. Previous studies have investigated the impact of inversions on gene expression by inducing inversions that target specific genes or by examining inversions between distantly related species. This fails to encompass a genome-wide perspective of naturally occurring inversions and their post-adaptation impact on gene expression. Here, we used bioinformatic techniques and multiple RNA-seq datasets to investigate the short- and long-range impact inversions have on genomic gene expression within Escherichia coli. We observed differences in gene expression between homologous inverted and non-inverted genes even after long-term exposure to adaptive selection. In 4% of inversions representing 33 genes, differential gene expression between inverted and non-inverted homologs was detected, with greater than two-thirds (71%) of differentially expressed inverted genes having 9.4-85.6-fold higher gene expression. The identified inversions had more overlap than expected with nucleoid-associated protein binding sites, which assist in the regulation of genomic gene expression. Some inversions can drastically impact gene expression, even between different strains of E. coli, and could provide a mechanism for the diversification of genetic content through controlled expression changes.


Subjects
Chromosome Inversion; Escherichia coli; Escherichia coli/genetics; Gene Expression; Genome, Bacterial; Genomics; Humans; Protein Binding
18.
BMC Ecol Evol ; 21(1): 119, 2021 06 12.
Article in English | MEDLINE | ID: mdl-34118864

ABSTRACT

BACKGROUND: Natural populations harbor significant levels of genetic variability. Because of this standing genetic variation, the number of possible genotypic combinations is many orders of magnitude greater than the population size. This means that any given population contains only a tiny fraction of all possible genotypic combinations. RESULTS: We show that recombination allows a finite population to resample the genotype pool, i.e., the universe of all possible genotypic combinations. Recombination, in combination with natural selection, enables an evolving sexual population to replace existing genotypes with new, higher-fitness genotypic combinations that did not previously exist in the population. This process allows the sexual population to gradually increase its fitness far beyond the range of fitnesses in the initial population. In contrast to this, an asexual population is limited to selection among existing lower fitness genotypes. CONCLUSIONS: The results provide an explanation for the ubiquity of sexual reproduction in evolving natural populations, especially when natural selection is acting on the standing genetic variation.
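The claim in this abstract that the number of possible genotypic combinations dwarfs any population size can be illustrated with simple arithmetic. The sketch assumes independent biallelic loci in a diploid organism (three genotypes per locus) and uses made-up numbers for the locus count and population size; none of these values come from the paper.

```python
def possible_genotypes(n_loci: int, genotypes_per_locus: int = 3) -> int:
    """Number of multilocus genotypes, assuming independent loci.

    Three genotypes per locus corresponds to a diploid biallelic locus
    (AA, Aa, aa); this is an assumption of the sketch.
    """
    if n_loci < 0:
        raise ValueError("n_loci must be non-negative")
    return genotypes_per_locus ** n_loci


# Hypothetical values chosen only to show the orders of magnitude involved.
n_loci = 100
population_size = 10**6

combinations = possible_genotypes(n_loci)
fraction_present_at_most = population_size / combinations

print(f"{combinations:.3e} possible genotypes")
print(f"a population of {population_size} can hold at most "
      f"{fraction_present_at_most:.3e} of them at once")
```

Even with only 100 variable loci, a population of a million individuals can contain at most a vanishing fraction of the genotype pool at any one time, which is the setting in which recombination can keep generating combinations that did not previously exist.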


Subjects
Reproduction , Genetic Selection , Genotype , Population Density , Reproduction/genetics
19.
Genome Biol Evol ; 13(1)2021 01 07.
Article in English | MEDLINE | ID: mdl-33320172

ABSTRACT

Increasing evidence supports the notion that different regions of a genome have unique rates of molecular change. This variation is particularly evident in bacterial genomes, where previous studies have reported that gene expression and essentiality tend to decrease, whereas substitution rates usually increase, with increasing distance from the origin of replication. Genomic reorganizations such as rearrangements occur frequently in bacteria and allow for the introduction and restructuring of genetic content, creating gradients of molecular traits along genomes. Here, we explore the interplay of these phenomena by mapping substitutions to the genomes of Escherichia coli, Bacillus subtilis, Streptomyces, and Sinorhizobium meliloti, quantifying how many substitutions have occurred at each position in the genome. Preceding work indicates that substitution rate significantly increases with distance from the origin. Using a larger sample size and accounting for genome rearrangements through ancestral reconstruction, our analysis demonstrates that the correlation between the number of substitutions and the distance from the origin of replication is significant but small and inconsistent in direction. Some replicons had a significantly decreasing trend (E. coli and the chromosome of S. meliloti), whereas others showed the opposite significant trend (B. subtilis, Streptomyces, and pSymA and pSymB in S. meliloti). dN, dS, and ω were examined across all genes, and there was no significant correlation between those values and distance from the origin. This study highlights the impact that genomic rearrangements and location have on molecular trends in some bacteria, illustrating the importance of considering spatial trends in molecular evolutionary analysis. Assuming that molecular trends are exclusively in one direction can be problematic.
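The quantity at the centre of this abstract, distance from the origin of replication on a circular replicon, and its association with substitution counts can be sketched as follows. This is an illustration only: the abstract does not specify the statistical model used, so a Spearman rank correlation is swapped in here as a simple stand-in, and the genome length, origin position, and substitution counts are invented toy values.

```python
def distance_from_origin(position, origin, genome_length):
    """Shortest distance along a circular replicon between a position and the origin."""
    d = abs(position - origin) % genome_length
    return min(d, genome_length - d)

def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy data: (genome position, substitution count) pairs.
genome_length, origin = 1000, 900
sites = [(950, 1), (50, 2), (200, 3), (400, 5), (600, 4), (800, 2)]
distances = [distance_from_origin(p, origin, genome_length) for p, _ in sites]
counts = [c for _, c in sites]
rho = spearman(distances, counts)
```

The sign of `rho` gives the direction of the trend for one replicon; the abstract's point is that this sign differs between replicons, so it should be estimated per replicon and not assumed.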


Subjects
Bacteria/genetics , Drug Substitution , Bacterial Genome , Bacillus subtilis/genetics , Bacterial Proteins/genetics , DNA Replication , Escherichia coli/genetics , Molecular Evolution , Bacterial Gene Expression Regulation , Gene Rearrangement , Bacterial Genes , Genomics , Logistic Models , Phylogeny , Replicon , Sinorhizobium meliloti/genetics , Streptomyces/genetics
20.
Bioinformatics ; 37(7): 896-904, 2021 05 17.
Article in English | MEDLINE | ID: mdl-32840562

ABSTRACT

MOTIVATION: Proteins usually perform their functions by interacting with other proteins, which is why accurately predicting protein-protein interaction (PPI) binding sites is a fundamental problem. Experimental methods are slow and expensive. Therefore, great efforts are being made towards increasing the performance of computational methods. RESULTS: We propose DEep Learning Prediction of Highly probable protein Interaction sites (DELPHI), a new sequence-based deep learning suite for PPI binding site prediction. DELPHI has an ensemble structure that combines a CNN and an RNN component with a fine-tuning technique. Three novel features, HSP, position information, and ProtVec, are used in addition to nine existing ones. We comprehensively compare DELPHI to nine state-of-the-art programmes on five datasets, and DELPHI outperforms the competing methods in all metrics even though its training dataset shares the least similarity with the testing datasets. In the most important metrics, AUPRC and MCC, it surpasses the second-best programmes by as much as 18.5% and 27.7%, respectively. We also demonstrate that the improvement is essentially due to using the ensemble model and, especially, the three new features. Using DELPHI, we show that there is a strong correlation between protein-binding residues (PBRs) and sites with strong evolutionary conservation. In addition, DELPHI's predicted PBR sites closely match known data from Pfam. DELPHI is available as open-source standalone software and as a web server. AVAILABILITY AND IMPLEMENTATION: The DELPHI web server can be found at delphi.csd.uwo.ca/, with all datasets and results in this study. The trained models, the DELPHI standalone source code, and the feature computation pipeline are freely available at github.com/lucian-ilie/DELPHI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
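The two headline metrics named in this abstract, MCC and AUPRC, can be computed for per-residue predictions roughly as sketched below. This is not DELPHI's evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions, and AUPRC is approximated by average precision, one common estimator of the area under the precision-recall curve.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = binding residue)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def average_precision(y_true, scores):
    """Average of the precision values at the rank of each true positive.

    Residues are ranked by descending score; tied scores keep input order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical toy example: eight residues, three of them true binding sites.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

ap = average_precision(labels, scores)
m = mcc(labels, predicted)
```

Both metrics suit this task because binding residues are a small minority of all residues: average precision ignores true negatives entirely, and MCC stays near zero for a predictor that labels everything as non-binding.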


Subjects
Proteins , Software , Binding Sites , Computational Biology , Protein Binding , Proteins/metabolism , Research Design
...