ABSTRACT
Interspecies hybridization is prevalent in various eukaryotic lineages and plays important roles in phenotypic diversification, adaptation, and speciation. To better understand the changes that occurred in the different subgenomes of a hybrid species and how they facilitate adaptation, we completed chromosome-level de novo assemblies of all chromosomes for a recently formed hybrid yeast, Saccharomyces bayanus strain CBS380, using Nanopore MinION long-read sequencing. We characterized the S. bayanus genome and compared it with those of its parent species, S. uvarum and S. eubayanus, and with other S. bayanus genomes to better understand genome evolution after a relatively recent hybridization event. We observed multiple recombination events between the subgenomes in each chromosome, followed by loss of heterozygosity (LOH) in nine chromosome pairs. In addition to maintaining nearly all gene content and synteny from its parental genomes, S. bayanus has acquired many genes from other yeast species, primarily through introgression from S. cerevisiae, such as those involved in maltose metabolism. Finally, the patterns of recombination and LOH suggest an allotetraploid origin of S. bayanus. The gene acquisition and rapid LOH in the hybrid genome probably facilitated its adaptation to maltose brewing environments and mitigated the maladaptive effect of hybridization. This manuscript describes the first in-depth study of an S. bayanus hybrid genome using long-read sequencing technology, which may serve as an excellent reference for future studies of this important yeast and other yeast strains.
ABSTRACT
BACKGROUND: Diverse microbiome communities drive biogeochemical processes and the evolution of animals in their ecosystems. Many microbiome projects have demonstrated the power of metagenomics for understanding the structures of microbiomes and the factors influencing their function in their environments. To characterize the effects of microbiome composition on human health, disease, and even ecosystems, one must first understand the relationship between microbes and their environment in different samples. Applying machine learning models to metagenomic sequencing data is encouraged for this purpose, but building an appropriate machine learning model for diverse metagenomic datasets is not an easy task. RESULTS: We introduce MegaR, an R Shiny package and web application, to build an unbiased machine learning model effortlessly with interactive visual analysis. MegaR employs taxonomic profiles from either whole metagenome sequencing or 16S rRNA sequencing data to develop machine learning models and classify samples into two or more categories. It provides various options for model fine-tuning throughout the analysis pipeline, such as data processing, multiple machine learning techniques, model validation, and unknown sample prediction, which can be used to achieve the highest prediction accuracy possible for any given dataset while still maintaining a user-friendly experience. CONCLUSIONS: Metagenomic sample classification and phenotype prediction are important, particularly when applied as a diagnostic method for identifying and predicting microbe-related human diseases. MegaR provides various interactive visualizations that help users build an accurate machine learning model without difficulty. Unknown sample prediction with a properly trained MegaR model will enable researchers to identify sample properties with a fast turnaround time.
Subjects
Machine Learning; Metagenome; Metagenomics; Humans; Phenotype; RNA, Ribosomal, 16S/genetics
ABSTRACT
BACKGROUND & AIMS: Chronic atrophic gastritis can lead to gastric metaplasia and increase risk of gastric adenocarcinoma. Metaplasia is a precancerous lesion associated with an increased risk for carcinogenesis, but the mechanism(s) by which inflammation induces metaplasia are poorly understood. We investigated transcriptional programs in mucous neck cells and chief cells as they progress to metaplasia in mice with chronic gastritis. METHODS: We analyzed previously generated single-cell RNA-sequencing (scRNA-seq) data of gastric corpus epithelium to define transcriptomes of individual epithelial cells from healthy BALB/c mice (controls) and TxA23 mice, which have chronically inflamed stomachs with metaplasia. Chronic gastritis was induced in B6 mice by Helicobacter pylori infection. Gastric tissues from mice and human patients were analyzed by immunofluorescence to verify findings at the protein level. Pseudotime trajectory analysis of scRNA-seq data was used to predict differentiation of normal gastric epithelium to metaplastic epithelium in chronically inflamed stomachs. RESULTS: Analyses of gastric epithelial transcriptomes revealed that gastrokine 3 (Gkn3) mRNA is a specific marker of mouse gastric corpus metaplasia (spasmolytic polypeptide expressing metaplasia, SPEM). Gkn3 mRNA was undetectable in healthy gastric corpus; its expression in chronically inflamed stomachs (from TxA23 mice and mice with Helicobacter pylori infection) identified more metaplastic cells throughout the corpus than previously recognized. Staining of healthy and diseased human gastric tissue samples paralleled these results. Although mucous neck cells and chief cells from healthy stomachs each had distinct transcriptomes, in chronically inflamed stomachs, these cells had distinct transcription patterns that converged upon a pre-metaplastic pattern, which lacked the metaplasia-associated transcripts.
Finally, pseudotime trajectory analysis confirmed the convergence of mucous neck cells and chief cells into a pre-metaplastic phenotype that ultimately progressed to metaplasia. CONCLUSIONS: In analyses of tissues from chronically inflamed stomachs of mice and humans, we expanded the definition of gastric metaplasia to include Gkn3 mRNA and GKN3-positive cells in the corpus, allowing a more accurate assessment of SPEM. Under conditions of chronic inflammation, chief cells and mucous neck cells are plastic and converge into a pre-metaplastic cell type that progresses to metaplasia.
Subjects
Chief Cells, Gastric/pathology; Gastritis, Atrophic/immunology; Helicobacter Infections/immunology; Precancerous Conditions/diagnosis; Stomach Neoplasms/prevention & control; Animals; Biomarkers/analysis; Biomarkers/metabolism; Carcinogenesis/genetics; Carcinogenesis/immunology; Carrier Proteins/analysis; Carrier Proteins/metabolism; Chief Cells, Gastric/immunology; Disease Models, Animal; Female; Gastritis, Atrophic/microbiology; Gastritis, Atrophic/pathology; Helicobacter Infections/genetics; Helicobacter Infections/microbiology; Helicobacter Infections/pathology; Helicobacter pylori/immunology; Humans; Male; Membrane Proteins/analysis; Membrane Proteins/metabolism; Metaplasia/diagnosis; Metaplasia/genetics; Metaplasia/immunology; Metaplasia/pathology; Mice; Precancerous Conditions/genetics; Precancerous Conditions/immunology; Precancerous Conditions/pathology; RNA-Seq; Single-Cell Analysis; Stomach Neoplasms/pathology
ABSTRACT
OBJECTIVE: Spasmolytic polypeptide-expressing metaplasia (SPEM) is a regenerative lesion in the gastric mucosa and is a potential precursor to intestinal metaplasia/gastric adenocarcinoma in a chronic inflammatory setting. The goal of these studies was to define the transcriptional changes associated with SPEM at the individual cell level in response to acute drug injury and chronic inflammatory damage in the gastric mucosa. DESIGN: Epithelial cells were isolated from the gastric corpus of healthy stomachs and stomachs with drug-induced and inflammation-induced SPEM lesions. Single cell RNA sequencing (scRNA-seq) was performed on tissue samples from each of these settings. The transcriptomes of individual epithelial cells from healthy, acutely damaged and chronically inflamed stomachs were analysed and compared. RESULTS: scRNA-seq revealed a population of Mucin 6 (Muc6)+ gastric intrinsic factor (Gif)+ cells in healthy tissue, but these cells did not express transcripts associated with SPEM. Furthermore, analyses of SPEM cells from drug-injured and chronically inflamed corpus yielded two major findings: (1) SPEM and neck cell hyperplasia/hypertrophy are nearly identical in the expression of SPEM-associated transcripts and (2) SPEM programmes induced by drug-mediated parietal cell ablation and chronic inflammation are nearly identical, although the induction of transcripts involved in immunomodulation was unique to SPEM cells in the chronic inflammatory setting. CONCLUSIONS: These data necessitate an expansion of the definition of SPEM to include Tff2+Muc6+ cells that do not express mature chief cell transcripts such as Gif. Our data demonstrate that SPEM arises by a highly conserved cellular programme independent of aetiology and develops immunoregulatory capabilities in a setting of chronic inflammation.
Subjects
Gastric Mucosa/metabolism; Gastritis/chemically induced; Intercellular Signaling Peptides and Proteins/metabolism; Animals; Female; Fluorescent Antibody Technique; Gastric Mucosa/drug effects; Gastric Mucosa/pathology; Gastritis/metabolism; Gastritis/pathology; Gene Expression Profiling; In Situ Hybridization; Male; Metaplasia/chemically induced; Metaplasia/metabolism; Mice; Mice, Inbred BALB C; Mucin-6/metabolism; Sequence Analysis, RNA; Single-Cell Analysis; Tamoxifen/pharmacology; Trefoil Factor-2/metabolism
ABSTRACT
BACKGROUND: De novo genome assembly is a technique that builds the genome of a specimen from overlaps between genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds using these overlaps. The quality of a de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been proposed. However, these techniques require large amounts of computer memory when very large overlap graphs are resolved, and the computation is challenging to parallelize. RESULTS: To address these limitations, we propose an innovative algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction algorithms using Apache Spark. SORA's implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently compacts the number of edges on enormous graph paths by adapting scalable features of the graph processing libraries provided by Apache Spark, GraphX and GraphFrames. CONCLUSIONS: We shared the algorithms and the experimental results at our project website, https://github.com/BioHPC/SORA. We evaluated SORA with human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed mid-to-small size graphs on a single workstation within a short time frame. Overall, SORA scaled linearly as the number of computing instances increased.
Subjects
Algorithms; Genome; Sequence Analysis, DNA; Base Sequence; Conyza/genetics; Databases, Genetic; Genome, Human; Genome, Plant; Humans
ABSTRACT
Motivation: Reprogramming somatic cells into neurons holds great promise to model neuronal development and disease. The efficiency and success rate of neuronal reprogramming, however, may vary between different conversion platforms and cell types, thereby necessitating an unbiased, systematic approach to estimate neuronal identity of converted cells. Recent studies have demonstrated that long genes (>100 kb from transcription start to end) are highly enriched in neurons, which provides an opportunity to identify neurons based on the expression of these long genes. Results: We have developed a versatile R package, LONGO, to analyze gene expression based on gene length. We propose a systematic analysis of long gene expression (LGE) with a metric termed the long gene quotient (LQ) that quantifies LGE in RNA-seq or microarray data to validate neuronal identity at the single-cell and population levels. This unique feature of neurons provides an opportunity to utilize measurements of LGE in transcriptome data to quickly and easily distinguish neurons from non-neuronal cells. By combining this conceptual advancement and statistical tool in a user-friendly and interactive software package, we intend to encourage and simplify further investigation into LGE, particularly as it applies to validating and improving neuronal differentiation and reprogramming methodologies. Availability and implementation: LONGO is freely available for download at https://github.com/biohpc/longo. Supplementary information: Supplementary data are available at Bioinformatics online.
Subjects
Cellular Reprogramming; Gene Expression Profiling/methods; Neurons/metabolism; Software; Aged; Female; Humans; Male; Middle Aged; Neurons/physiology; Oligonucleotide Array Sequence Analysis/methods; Sequence Analysis, RNA/methods; Transcriptome
ABSTRACT
MOTIVATION: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis. RESULTS: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic reads to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. The algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains. AVAILABILITY AND IMPLEMENTATION: Sigma was implemented in C++ with source codes and binaries freely available at http://sigma.omicsbio.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Biosurveillance; Computational Biology/methods; DNA, Bacterial/analysis; Genome, Bacterial; Metagenomics/methods; Sequence Analysis, DNA/methods; Software; Algorithms; Humans
ABSTRACT
Since the first two complete bacterial genome sequences were published in 1995, the science of bacteria has dramatically changed. Using third-generation DNA sequencing, it is possible to completely sequence a bacterial genome in a few hours and identify some types of methylation sites along the genome as well. Sequencing of bacterial genomes is now a standard procedure, and the information from tens of thousands of bacterial genomes has had a major impact on our views of the bacterial world. In this review, we explore a series of questions to highlight some insights that comparative genomics has produced. To date, there are genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. However, the distribution is quite skewed towards a few phyla that contain model organisms. But the breadth is continuing to improve, with projects dedicated to filling in less characterized taxonomic groups. The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas system provides bacteria with immunity against viruses, which outnumber bacteria by tenfold. How fast can we go? Second-generation sequencing has produced a large number of draft genomes (close to 90 % of bacterial genomes in GenBank are currently not complete); third-generation sequencing can potentially produce a finished genome in a few hours, and at the same time provide methylation sites along the entire chromosome. The diversity of bacterial communities is extensive, as is evident from the genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. Genome sequencing can help in classifying an organism, and in the case where multiple genomes of the same species are available, it is possible to calculate the pan- and core genomes; comparison of more than 2000 Escherichia coli genomes finds an E. coli core genome of about 3100 gene families and a total of about 89,000 different gene families.
Why do we care about bacterial genome sequencing? There are many practical applications, such as genome-scale metabolic modeling, biosurveillance, bioforensics, and infectious disease epidemiology. In the near future, high-throughput sequencing of patient metagenomic samples could revolutionize medicine in terms of speed and accuracy of finding pathogens and knowing how to treat them.
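The pan- and core-genome calculation mentioned in this abstract can be illustrated with set operations. The following is a minimal sketch, assuming each genome has already been reduced to a set of gene-family identifiers; the strain names and family labels are hypothetical toy data, not results from the review.

```python
def pan_and_core(genomes):
    """Return (pan, core) gene-family sets.

    genomes: dict mapping a genome name to the set of gene families it carries.
    The pan-genome is the union of all sets; the core genome is their intersection.
    """
    family_sets = [set(families) for families in genomes.values()]
    if not family_sets:
        return set(), set()
    pan = set().union(*family_sets)
    core = family_sets[0].intersection(*family_sets[1:])
    return pan, core


# Hypothetical toy input: three strains, five gene families in total.
toy_genomes = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam2", "fam5"},
}
pan, core = pan_and_core(toy_genomes)
print(len(pan), "families in the pan-genome;", len(core), "in the core genome")
```

In practice the hard part is the upstream step of clustering genes into families, which this sketch takes as given.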
Subjects
Genome, Bacterial; Bacteria/classification; Bacterial Proteins/genetics; Codon; Genetic Variation; Genome Size; Genomics; Metagenomics; Molecular Sequence Annotation; Phylogeny; Sequence Analysis, DNA
ABSTRACT
MOTIVATION: Metagenomic sequencing allows reconstruction of microbial genomes directly from environmental samples. Omega (overlap-graph metagenome assembler) was developed for assembling and scaffolding Illumina sequencing data of microbial communities. RESULTS: Omega found overlaps between reads using a prefix/suffix hash table. The overlap graph of reads was simplified by removing transitive edges and trimming short branches. Unitigs were generated based on minimum cost flow analysis of the overlap graph and then merged to contigs and scaffolds using mate-pair information. In comparison with three de Bruijn graph assemblers (SOAPdenovo, IDBA-UD and MetaVelvet), Omega provided comparable overall performance on a HiSeq 100-bp dataset and superior performance on a MiSeq 300-bp dataset. In comparison with Celera on the MiSeq dataset, Omega provided more continuous assemblies overall using a fraction of the computing time of existing overlap-layout-consensus assemblers. This indicates Omega can more efficiently assemble longer Illumina reads, and at deeper coverage, for metagenomic datasets. AVAILABILITY AND IMPLEMENTATION: Implemented in C++ with source code and binaries freely available at http://omega.omicsbio.org.
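The graph simplification step described in this abstract (removing transitive edges from the overlap graph) can be sketched on a toy graph. This is a simplified illustration, not Omega's implementation: it assumes a plain directed graph stored as a dictionary, ignores overlap lengths and read orientation, and treats an edge a -> c as transitive whenever a two-hop path a -> b -> c exists. The read names are hypothetical.

```python
def remove_transitive_edges(graph):
    """Return a copy of `graph` without edges implied by a two-hop path.

    graph: dict mapping each read to the set of reads it overlaps (successors).
    Detection uses the original graph, so the result does not depend on the
    order in which edges are visited.
    """
    reduced = {node: set(successors) for node, successors in graph.items()}
    for a, successors in graph.items():
        for b in successors:
            for c in graph.get(b, ()):
                # a -> c is redundant if a -> b and b -> c both exist.
                if c in successors and c != b and c != a:
                    reduced[a].discard(c)
    return reduced


# Four hypothetical reads tiling one region; every read overlaps all later ones.
overlaps = {
    "r1": {"r2", "r3", "r4"},
    "r2": {"r3", "r4"},
    "r3": {"r4"},
    "r4": set(),
}
reduced = remove_transitive_edges(overlaps)
print(reduced)
```

After reduction only the chain r1 -> r2 -> r3 -> r4 remains, which is the kind of unbranched path that an assembler can then collapse into a single longer sequence.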
Subjects
Computational Biology/methods; DNA, Bacterial/analysis; Sequence Analysis, DNA/methods; Software; Algorithms; Computers; Genome, Bacterial; Internet; Metagenome; Metagenomics/methods; Programming Languages
ABSTRACT
BACKGROUND: Phylogenetic studies have provided detailed knowledge on the evolutionary mechanisms of genes and species in Bacteria and Archaea. However, the evolution of cellular functions, represented by metabolic pathways and biological processes, has not been systematically characterized. Many clades in the prokaryotic tree of life have now been covered by sequenced genomes in GenBank. This enables a large-scale functional phylogenomics study of many computationally inferred cellular functions across all sequenced prokaryotes. RESULTS: A total of 14,727 GenBank prokaryotic genomes were re-annotated using a new protein family database, UniFam, to obtain consistent functional annotations for accurate comparison. The functional profile of a genome was represented by the biological process Gene Ontology (GO) terms in its annotation. The GO term enrichment analysis differentiated the functional profiles between selected archaeal taxa. 706 prokaryotic metabolic pathways were inferred from these genomes using Pathway Tools and MetaCyc. The consistency between the distribution of metabolic pathways in the genomes and the phylogenetic tree of the genomes was measured using parsimony scores and retention indices. The ancestral functional profiles at the internal nodes of the phylogenetic tree were reconstructed to track the gains and losses of metabolic pathways in evolutionary history. CONCLUSIONS: Our functional phylogenomics analysis shows divergent functional profiles of taxa and clades. Such function-phylogeny correlation stems from a set of clade-specific cellular functions with low parsimony scores. On the other hand, many cellular functions are sparsely dispersed across many clades with high parsimony scores. These different types of cellular functions have distinct evolutionary patterns reconstructed from the prokaryotic tree.
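The parsimony scores and retention indices used in this abstract can be illustrated for a single presence/absence character (for example, one metabolic pathway) on a small rooted binary tree. The sketch below uses Fitch's small-parsimony algorithm and the standard retention index RI = (g - s) / (g - m); the tree, taxon names, and pathway patterns are hypothetical, and the study's actual software and settings are not specified in the abstract.

```python
def fitch(tree, states):
    """Return (candidate_states, changes) for a rooted binary tree.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping each leaf name to its observed state (0 or 1).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_changes + right_changes
    # Disjoint child sets force one state change at this node.
    return left_set | right_set, left_changes + right_changes + 1


def retention_index(score, states):
    """RI = (g - s) / (g - m) for a binary character; None if uninformative."""
    present = sum(states.values())
    g = min(present, len(states) - present)  # maximum possible steps
    m = 1                                    # minimum steps for two states
    if g <= m:
        return None
    return (g - score) / (g - m)


tree = ((("A", "B"), ("C", "D")), ("E", "F"))
clade_specific = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0, "F": 0}
dispersed = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 1, "F": 0}

clade_score = fitch(tree, clade_specific)[1]
dispersed_score = fitch(tree, dispersed)[1]
print(clade_score, retention_index(clade_score, clade_specific))
print(dispersed_score, retention_index(dispersed_score, dispersed))
```

The pathway confined to one clade needs a single change on the tree, while the dispersed pathway needs three, which matches the abstract's contrast between clade-specific functions with low parsimony scores and sparsely dispersed functions with high ones.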
Subjects
Archaea/genetics; Bacteria/genetics; Molecular Sequence Annotation/methods; Databases, Protein; Genome, Archaeal; Genome, Bacterial; Phylogeny
ABSTRACT
SUMMARY: Sipros/ProRata is an open-source software package for end-to-end data analysis in a wide variety of community proteomics measurements. A database-searching program, Sipros 3.0, was developed for accurate general-purpose protein identification and broad-range post-translational modification searches. Hybrid Message Passing Interface/OpenMP parallelism of the new Sipros architecture allowed its computation to be scalable from desktops to supercomputers. The upgraded ProRata 3.0 performs label-free quantification and isobaric chemical labeling quantification in addition to metabolic labeling quantification. Sipros/ProRata is a versatile informatics system that enables identification and quantification of proteins and their variants in many types of community proteomics studies. AVAILABILITY: Both programs are freely available under the GNU GPL license at Sipros.omicsbio.org and ProRata.omicsbio.org.
Subjects
Proteomics/methods; Software; Databases, Protein; Protein Processing, Post-Translational; Proteins/analysis; Proteins/chemistry; Tandem Mass Spectrometry
ABSTRACT
Change-point detection (CPD) is a challenging problem that has a number of applications across various real-world domains. The primary objective of CPD is to identify specific time points where the underlying system undergoes transitions between different states, each characterized by its distinct data distribution. Precise identification of change points in time series omics data can provide insights into the dynamic and temporal characteristics inherent to complex biological systems. Many change-point detection methods have traditionally focused on the direct estimation of data distributions. However, these approaches become unrealistic in high-dimensional data analysis. Density ratio methods have emerged as promising approaches for change-point detection since estimating density ratios is easier than directly estimating individual densities. Nevertheless, the divergence measures used in these methods may suffer from numerical instability during computation. Additionally, the most popular measure, the α-relative Pearson divergence, does not measure the dissimilarity between the two data distributions themselves but between one distribution and a mixture of the two. To overcome the limitations of existing density ratio-based methods, we propose a novel approach called the Pearson-like scaled-Bregman divergence-based (PLsBD) density ratio estimation method for change-point detection. Our theoretical studies derive an analytical expression for the Pearson-like scaled Bregman divergence using a mixture measure. We integrate the PLsBD with a kernel regression model and apply a random sampling strategy to identify change points in both synthetic data and real-world high-dimensional genomics data of Drosophila. Our PLsBD method demonstrates superior performance compared to many other change-point detection methods.
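To make the remark about the α-relative Pearson divergence concrete, the sketch below evaluates it for two discrete distributions, using the definition common in the relative density-ratio literature: PE_α(P‖Q) = ½ Σ m(x) (p(x)/m(x) − 1)², with mixture m = αp + (1 − α)q. This illustrates the existing baseline measure only, not the proposed PLsBD, whose expression is not given in the abstract; the two toy distributions are hypothetical.

```python
import math


def relative_pearson_divergence(p, q, alpha):
    """Alpha-relative Pearson divergence between discrete distributions p and q.

    The density ratio is taken against the mixture alpha*p + (1-alpha)*q,
    so for alpha > 0 the ratio is bounded above by 1/alpha.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        mix = alpha * p_i + (1.0 - alpha) * q_i
        if mix == 0.0:
            if p_i > 0.0:
                # Only possible when alpha == 0: the plain ratio p/q blows up.
                return math.inf
            continue
        total += mix * (p_i / mix - 1.0) ** 2
    return 0.5 * total


# Two hypothetical distributions with only partly overlapping support.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

plain = relative_pearson_divergence(p, q, alpha=0.0)      # ordinary Pearson divergence
smoothed = relative_pearson_divergence(p, q, alpha=0.5)   # compares p with the mixture
print(plain, smoothed)
```

With α = 0 the measure reduces to the ordinary Pearson divergence and is unbounded here because q has no mass where p does; with α = 0.5 the value is finite, but it now quantifies how far p is from the mixture rather than from q itself.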
ABSTRACT
Interspecies hybridization is prevalent in various eukaryotic lineages and plays important roles in phenotypic diversification, adaptation, and speciation. To better understand the changes that occurred in the different subgenomes of a hybrid species and how they facilitated adaptation, we completed chromosome-level de novo assemblies of all 16 chromosome pairs for a recently formed hybrid yeast, Saccharomyces bayanus strain CBS380 (IFO11022), using Nanopore MinION long-read sequencing. Characterization of the S. bayanus subgenomes and comparative analysis with the genomes of its parent species, S. uvarum and S. eubayanus, provide several new insights into genome evolution after a relatively recent hybridization. For instance, multiple recombination events between the two subgenomes were observed in each chromosome, followed by loss of heterozygosity (LOH) in nine chromosome pairs. In addition to maintaining nearly all gene content and synteny from its parental genomes, S. bayanus has acquired many genes from other yeast species, primarily through introgression from S. cerevisiae, such as those involved in maltose metabolism. In addition, the patterns of recombination and LOH suggest an allotetraploid origin of S. bayanus. The gene acquisition and rapid LOH in the hybrid genome probably facilitated its adaptation to maltose brewing environments and mitigated the maladaptive effect of hybridization.
ABSTRACT
The diversity within different microbiome communities that drive biogeochemical processes influences many different phenotypes. Analyses of these communities and their diversity by countless microbiome projects have revealed an important role of metagenomics in understanding the complex relation between microbes and their environments. This relationship can be understood in the context of the microbiome composition of specific known environments. These compositions can then be used as a template for predicting the status of similar environments. Machine learning has been applied as a key component of this predictive task. Several analysis tools utilizing machine learning methods for metagenomic analysis have already been published. Despite these previously proposed machine learning models, the performance of deep neural networks is still under-researched. Given the nature of metagenomic data, deep neural networks could substantially improve prediction accuracy in metagenomic analysis applications. To meet this urgent demand, we present a deep learning based tool that uses a deep neural network implementation for phenotypic prediction of unknown metagenomic samples. (1) First, our tool takes as input taxonomic profiles from 16S or WGS sequencing data. (2) Second, given the samples, our tool builds a model based on a deep neural network by computing multi-level classification. (3) Lastly, given the model, our tool classifies an unknown sample from its unlabeled taxonomic profile. In the benchmark experiments, we found that an analysis method employing a deep neural network, such as our tool, can show promising results in increasing the prediction accuracy on several samples compared to other machine learning models.
ABSTRACT
BACKGROUND: Recent advances in sequencing technologies have driven studies identifying the microbiome as a key regulator of overall health and disease in the host. Both 16S amplicon and whole genome shotgun sequencing technologies are currently being used to investigate this relationship; however, the choice of sequencing technology often depends on the nature and experimental design of the study. In principle, the outputs rendered by analysis pipelines are heavily influenced by the data used as input; it is then important to consider that the genomic features produced by different sequencing technologies may emphasize different results. RESULTS: In this work, we use public 16S amplicon and whole genome shotgun sequencing (WGS) data from the same dogs to investigate the relationship between sequencing technology and the captured gut metagenomic landscape in dogs. In our analyses, we compare the taxonomic resolution at the species and phylum levels and benchmark 12 classification algorithms in their ability to accurately identify host phenotype using only taxonomic relative abundance information from 16S and WGS datasets with identical study designs. Our best performing model, a random forest trained on the WGS dataset, identified a species (Bacteroides coprocola) that predominantly contributes to the abundance of leuB, a gene involved in branched chain amino acid biosynthesis, a risk factor for glucose intolerance, insulin resistance, and type 2 diabetes. This trend was not conserved when we trained the model using 16S sequencing profiles from the same dogs. CONCLUSIONS: Our results indicate that WGS sequencing of dog microbiomes detects a greater taxonomic diversity than 16S sequencing of the same dogs at the species level and with respect to four gut-enriched phyla. This difference in detection does not significantly impact the performance metrics of machine learning algorithms after down-sampling.
Although the important features extracted from our best performing model are not conserved between the two technologies, the important features extracted from either instance indicate the utility of machine learning algorithms in identifying biologically meaningful relationships between the host and microbiome community members. In conclusion, this work provides the first systematic machine learning comparison of dog 16S and WGS microbiomes derived from identical study designs.
ABSTRACT
Transcription initiation is regulated in a highly organized fashion to ensure proper cellular functions. Accurate identification of transcription start sites (TSSs) and quantitative characterization of transcription initiation activities are fundamental steps for studies of regulated transcription and core promoter structures. Several high-throughput techniques have been developed to sequence the very 5' end of RNA transcripts (TSS sequencing) on the genome scale. Bioinformatics tools are essential for processing, analysis, and visualization of TSS sequencing data. Here, we present TSSr, an R package that provides rich functions for mapping TSSs and characterizing the structures and activities of core promoters based on all types of TSS sequencing data. Specifically, TSSr implements several newly developed algorithms for accurately identifying TSSs from mapped sequencing reads and inferring core promoters, which are prerequisites for subsequent functional analyses of TSS data. Furthermore, TSSr enables users to export various types of TSS data that can be visualized in a genome browser for inspection of promoter activities in association with other genomic features, and to generate publication-ready TSS graphs. These user-friendly features could greatly facilitate studies of transcription initiation based on TSS sequencing data. The source code and detailed documentation of TSSr can be freely accessed at https://github.com/Linlab-slu/TSSr.
ABSTRACT
The pathogen exposure history of an individual is recorded in their T-cell repertoire and can be accessed through the study of T-cell receptors (TCRs), provided the tools to identify them are available. For each T-cell, the TCR loci undergo genetic rearrangement that creates a unique DNA sequence. In theory, these unique sequences can be used as biomarkers for tracking T-cell responses and cataloging immunological history. We developed the immune Cell Analysis Tool (iCAT), an R software package that analyzes TCR sequencing data from exposed (positive) and unexposed (negative) samples to identify TCR sequences statistically associated with positive samples. The presence and absence of associated sequences in samples trains a classifier to diagnose pathogen-specific exposure. We demonstrate the high accuracy of iCAT by testing on three TCR sequencing datasets. First, iCAT successfully diagnosed smallpox vaccinated versus naïve samples in an independent cohort of mice with 95% accuracy. Second, iCAT displayed 100% accuracy classifying naïve and monkeypox vaccinated mice. Finally, we demonstrate the use of iCAT on human samples before and after exposure to SARS-CoV-2, the virus behind the COVID-19 global pandemic. We were able to correctly classify the exposed samples with perfect accuracy. These experimental results show that iCAT capitalizes on the power of TCR sequencing to simplify infection diagnostics. iCAT provides the option of a graphical, user-friendly interface on top of the usual R interface, allowing it to reach a wider audience.
Subjects
COVID-19 , Animals , High-Throughput Nucleotide Sequencing , Humans , Mice , T-Cell Antigen Receptors/genetics , SARS-CoV-2 , Software
ABSTRACT
The evolution of herbicide-resistant weed species is a serious threat to weed control. An improved understanding of how gene regulation confers herbicide resistance is therefore needed in order to slow the evolution of resistance. The present study analyzed genes differentially expressed after glyphosate treatment in a glyphosate-resistant Tennessee ecotype (TNR) of horseweed (Conyza canadensis), compared with a susceptible biotype (TNS). A total of 100.2 M reads was sequenced on the Illumina platform and subjected to de novo assembly, resulting in 77,072 gene-level contigs, of which 32,493 were uniquely annotated by BlastX alignment based on protein sequence similarity. The most differentially expressed genes were enriched in the gene ontology (GO) term for transmembrane transport proteins. In addition, fifteen upregulated genes were identified in TNR after glyphosate treatment but were not detected in TNS. Ten of these upregulated genes encoded transmembrane transporter or kinase receptor proteins. Therefore, a combination of changes in gene expression among transmembrane receptor and kinase receptor proteins may be important for endowing C. canadensis with non-target-site glyphosate resistance.
Subjects
Conyza/genetics , Glycine/analogs & derivatives , Herbicide Resistance/genetics , Herbicides/pharmacology , Computational Biology , Conyza/drug effects , Plant DNA , Plant Genes , Glycine/pharmacology , DNA Sequence Analysis/methods , Transcriptome , Weed Control/methods , Glyphosate
ABSTRACT
A quantitative chimerism test monitors engraftment of donor hematopoietic stem cells or relapse of leukemias or lymphomas in hematopoietic stem cell transplantation patients. The most common method used for chimerism testing is PCR amplification of short tandem repeat loci, followed by capillary gel electrophoresis. Manual data analysis is tedious and time-consuming, as it involves selecting informative loci and repeatedly quantifying the chimerism percentage for multiple loci from multiple cell types. It is also susceptible to human error. Currently, there is no free software that fully automates chimerism data analysis. Rchimerism, an R Shiny package, was developed to automatically pick informative loci, calculate chimerism percentage, and display the results through a user-friendly interface. The accuracy of the program was compared with manual calculation on 60 patient samples, with 100% concordance. Compared with manual calculation, Rchimerism drastically reduces analysis time, from 20 to 40 minutes for single-donor transplantation samples and from 40 to 80 minutes for double-donor transplantation samples, to <1 minute. Rchimerism can be downloaded and used freely by noncommercial laboratories.
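The two automated steps named above, picking informative loci and computing a chimerism percentage, can be sketched in Python under simplifying assumptions. This is not Rchimerism's algorithm. The sketch assumes a single donor, treats a locus as informative only when donor and recipient each carry at least one allele the other lacks, and estimates donor percentage from the peak areas of those unshared alleles. Function names and the data layout are hypothetical.

```python
from statistics import mean


def locus_donor_percent(donor_alleles, recipient_alleles, peak_areas):
    """Donor percentage at one STR locus, or None if the locus is uninformative.

    peak_areas maps allele -> electropherogram peak area in the post-transplant sample.
    Only alleles unique to donor or recipient are used; shared alleles are ignored.
    """
    donor_only = set(donor_alleles) - set(recipient_alleles)
    recipient_only = set(recipient_alleles) - set(donor_alleles)
    if not donor_only or not recipient_only:
        return None  # simplified informativeness rule (assumption)
    donor_signal = sum(peak_areas.get(a, 0.0) for a in donor_only)
    recipient_signal = sum(peak_areas.get(a, 0.0) for a in recipient_only)
    total = donor_signal + recipient_signal
    if total == 0:
        return None  # no usable signal at this locus
    return 100.0 * donor_signal / total


def mean_donor_percent(donor, recipient, sample_peaks):
    """Average the per-locus donor percentage over all informative loci."""
    values = []
    for locus, donor_alleles in donor.items():
        if locus not in recipient or locus not in sample_peaks:
            continue
        pct = locus_donor_percent(donor_alleles, recipient[locus], sample_peaks[locus])
        if pct is not None:
            values.append(pct)
    return mean(values) if values else None


if __name__ == "__main__":
    # Toy genotypes and peak areas; the numbers are illustrative only.
    donor = {"D3S1358": (15, 17), "TH01": (6, 9)}
    recipient = {"D3S1358": (16, 17), "TH01": (6, 9)}
    peaks = {"D3S1358": {15: 300.0, 16: 100.0, 17: 400.0},
             "TH01": {6: 500.0, 9: 480.0}}
    print(mean_donor_percent(donor, recipient, peaks))
```

Loci where one party's alleles are all shared with the other can still be informative in practice, but they require subtracting the shared-allele contribution. The sketch skips them to stay short.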
Subjects
Chimera/genetics , Chimerism , Data Analysis , Graft Rejection/genetics , Graft Survival/genetics , Hematopoietic Stem Cell Transplantation , User-Computer Interface , Algorithms , Alleles , Biomarkers , Data Accuracy , Capillary Electrophoresis , Genetic Loci , Humans , Microsatellite Repeats , Polymerase Chain Reaction , Tissue Donors , Transplant Recipients
ABSTRACT
Zika virus (ZIKV) is a significant global health threat due to its potential for rapid emergence and its association with severe congenital malformations when infection occurs during pregnancy. Despite the urgent need, accurate diagnosis of ZIKV infection remains a major hurdle. Contributing to the inaccuracy of most serology-based diagnostic assays for ZIKV is the substantial geographic and antigenic overlap with other flaviviruses, including the four serotypes of dengue virus (DENV). In this study, we used a novel T cell receptor (TCR) sequencing platform to distinguish between ZIKV and DENV infections. Using high-throughput TCR sequencing of lymphocytes isolated from DENV- and ZIKV-infected mice, we developed an algorithm that identifies TCR sequences uniquely associated with a prior ZIKV or DENV infection in mice. Using this algorithm, we were then able to separate mice that had been exposed to ZIKV from those exposed to DENV with 97% accuracy. Overall, this study serves as a proof of principle that T cell receptor sequencing can be used as a diagnostic tool capable of distinguishing between closely related viruses. Our results demonstrate the potential for this innovative platform to be used to accurately diagnose Zika virus infection and, potentially, infection with the next emerging pathogen(s).