Search | VHL Regional Portal

1.

Chromosome-level Subgenome-aware de novo Assembly of Saccharomyces bayanus Provides Insight into Genome Divergence after Hybridization.

Gardner, Cory; Chen, Junhao; Hadfield, Christina; Lu, Zhaolian; Debruin, David; Zhan, Yu; Donlin, Maureen J; Lin, Zhenguo; Ahn, Tae-Hyuk.

bioRxiv ; 2024 Mar 19.

Article in English | MEDLINE | ID: mdl-38562692

ABSTRACT

Interspecies hybridization is prevalent in various eukaryotic lineages and plays important roles in phenotypic diversification, adaption, and speciation. To better understand the changes that occurred in the different subgenomes of a hybrid species and how they facilitated adaptation, we completed chromosome-level de novo assemblies of all 16 chromosome pairs for a recently formed hybrid yeast, Saccharomyces bayanus strain CBS380 (IFO11022), using Nanopore MinION long-read sequencing. Characterization of S. bayanus subgenomes and comparative analysis with the genomes of its parent species, S. uvarum and S. eubayanus, provide several new insights into understanding genome evolution after a relatively recent hybridization. For instance, multiple recombination events between the two subgenomes have been observed in each chromosome, followed by loss of heterozygosity (LOH) in most chromosomes in nine chromosome pairs. In addition to maintaining nearly all gene content and synteny from its parental genomes, S. bayanus has acquired many genes from other yeast species, primarily through the introgression of S. cerevisiae, such as those involved in maltose metabolism. In addition, the patterns of recombination and LOH suggest an allotetraploid origin of S. bayanus. The gene acquisition and rapid LOH in the hybrid genome probably facilitated its adaption to maltose brewing environments and mitigated the maladaptive effect of hybridization.

2.

MegaD: Deep Learning for Rapid and Accurate Disease Status Prediction of Metagenomic Samples.

Mreyoud, Yassin; Song, Myoungkyu; Lim, Jihun; Ahn, Tae-Hyuk.

Life (Basel) ; 12(5), 2022 Apr 30.

Article in English | MEDLINE | ID: mdl-35629336

ABSTRACT

The diversity within different microbiome communities that drive biogeochemical processes influences many different phenotypes. Analyses of these communities and their diversity by countless microbiome projects have revealed an important role of metagenomics in understanding the complex relation between microbes and their environments. This relationship can be understood in the context of microbiome composition of specific known environments. These compositions can then be used as a template for predicting the status of similar environments. Machine learning has been applied as a key component to this predictive task. Several analysis tools have already been published utilizing machine learning methods for metagenomic analysis. Despite the previously proposed machine learning models, the performance of deep neural networks is still under-researched. Given the nature of metagenomic data, deep neural networks could provide a strong boost to growth in the prediction accuracy in metagenomic analysis applications. To meet this urgent demand, we present a deep learning based tool that utilizes a deep neural network implementation for phenotypic prediction of unknown metagenomic samples. (1) First, our tool takes as input taxonomic profiles from 16S or WGS sequencing data. (2) Second, given the samples, our tool builds a model based on a deep neural network by computing multi-level classification. (3) Lastly, given the model, our tool classifies an unknown sample with its unlabeled taxonomic profile. In the benchmark experiments, we deduced that an analysis method facilitating a deep neural network such as our tool can show promising results in increasing the prediction accuracy on several samples compared to other machine learning models.

3.

TSSr: an R package for comprehensive analyses of TSS sequencing data.

Lu, Zhaolian; Berry, Keenan; Hu, Zhenbin; Zhan, Yu; Ahn, Tae-Hyuk; Lin, Zhenguo.

NAR Genom Bioinform ; 3(4): lqab108, 2021 Dec.

Article in English | MEDLINE | ID: mdl-34805991

ABSTRACT

Transcription initiation is regulated in a highly organized fashion to ensure proper cellular functions. Accurate identification of transcription start sites (TSSs) and quantitative characterization of transcription initiation activities are fundamental steps for studies of regulated transcriptions and core promoter structures. Several high-throughput techniques have been developed to sequence the very 5' end of RNA transcripts (TSS sequencing) on the genome scale. Bioinformatics tools are essential for processing, analysis, and visualization of TSS sequencing data. Here, we present TSSr, an R package that provides rich functions for mapping TSS and characterizations of structures and activities of core promoters based on all types of TSS sequencing data. Specifically, TSSr implements several newly developed algorithms for accurately identifying TSSs from mapped sequencing reads and inference of core promoters, which are a prerequisite for subsequent functional analyses of TSS data. Furthermore, TSSr also enables users to export various types of TSS data that can be visualized by a genome browser for inspection of promoter activities in association with other genomic features, and to generate publication-ready TSS graphs. These user-friendly features could greatly facilitate studies of transcription initiation based on TSS sequencing data. The source code and detailed documentation of TSSr can be freely accessed at https://github.com/Linlab-slu/TSSr.
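The idea of inferring core promoters from mapped TSS signal can be illustrated with a minimal Python sketch. It is not TSSr's algorithm (TSSr is an R package and its methods are not described in this abstract); it assumes a simple distance-based rule in which neighbouring TSS positions on one strand are merged when they lie within a hypothetical `max_gap`, and the position with the most reads is reported as the dominant TSS.

```python
def _summarize(positions, tags):
    """Collapse one run of positions into (start, end, dominant, total reads)."""
    dominant = max(positions, key=lambda p: tags[p])
    total = sum(tags[p] for p in positions)
    return (positions[0], positions[-1], dominant, total)


def cluster_tss(tags, max_gap=20):
    """Group TSS positions into clusters (assumed stand-in for core promoters).

    tags maps genomic position -> read count for one strand of one chromosome.
    max_gap is a hypothetical merging distance, not a TSSr default.
    """
    clusters = []
    current = []
    for pos in sorted(tags):
        # Start a new cluster when the gap to the previous position is too large.
        if current and pos - current[-1] > max_gap:
            clusters.append(_summarize(current, tags))
            current = []
        current.append(pos)
    if current:
        clusters.append(_summarize(current, tags))
    return clusters


if __name__ == "__main__":
    example = {100: 5, 105: 20, 110: 3, 300: 7}
    for cluster in cluster_tss(example):
        print(cluster)
```

Each returned tuple holds the cluster boundaries, the dominant TSS, and the summed read count, which is the kind of per-promoter summary a genome-browser track would display.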

4.

Novel Candidate Genes Differentially Expressed in Glyphosate-Treated Horseweed (Conyza canadensis).

Yang, Yongil; Gardner, Cory; Gupta, Pallavi; Peng, Yanhui; Piasecki, Cristiano; Millwood, Reginald J; Ahn, Tae-Hyuk; Stewart, C Neal.

Genes (Basel) ; 12(10), 2021 10 14.

Article in English | MEDLINE | ID: mdl-34681011

ABSTRACT

The evolution of herbicide-resistant weed species is a serious threat for weed control. Therefore, we need an improved understanding of how gene regulation confers herbicide resistance in order to slow the evolution of resistance. The present study analyzed differentially expressed genes after glyphosate treatment on a glyphosate-resistant Tennessee ecotype (TNR) of horseweed (Conyza canadensis), compared to a susceptible biotype (TNS). A read size of 100.2 M was sequenced on the Illumina platform and subjected to de novo assembly, resulting in 77,072 gene-level contigs, of which 32,493 were uniquely annotated by a BlastX alignment of protein sequence similarity. The most differentially expressed genes were enriched in the gene ontology (GO) term of the transmembrane transport protein. In addition, fifteen upregulated genes were identified in TNR after glyphosate treatment but were not detected in TNS. Ten of these upregulated genes were transmembrane transporter or kinase receptor proteins. Therefore, a combination of changes in gene expression among transmembrane receptor and kinase receptor proteins may be important for endowing non-target-site glyphosate-resistant C. canadensis.

Subjects

Conyza/genetics , Glycine/analogs & derivatives , Herbicide Resistance/genetics , Herbicides/pharmacology , Computational Biology , Conyza/drug effects , Plant DNA , Plant Genes , Glycine/pharmacology , DNA Sequence Analysis/methods , Transcriptome , Weed Control/methods , Glyphosate

5.

Comparison of 16S and whole genome dog microbiomes using machine learning.

Lewis, Scott; Nash, Andrea; Li, Qinghong; Ahn, Tae-Hyuk.

BioData Min ; 14(1): 41, 2021 Aug 21.

Article in English | MEDLINE | ID: mdl-34419136

ABSTRACT

BACKGROUND: Recent advances in sequencing technologies have driven studies identifying the microbiome as a key regulator of overall health and disease in the host. Both 16S amplicon and whole genome shotgun sequencing technologies are currently being used to investigate this relationship, however, the choice of sequencing technology often depends on the nature and experimental design of the study. In principle, the outputs rendered by analysis pipelines are heavily influenced by the data used as input; it is then important to consider that the genomic features produced by different sequencing technologies may emphasize different results. RESULTS: In this work, we use public 16S amplicon and whole genome shotgun sequencing (WGS) data from the same dogs to investigate the relationship between sequencing technology and the captured gut metagenomic landscape in dogs. In our analyses, we compare the taxonomic resolution at the species and phyla levels and benchmark 12 classification algorithms in their ability to accurately identify host phenotype using only taxonomic relative abundance information from 16S and WGS datasets with identical study designs. Our best performing model, a random forest trained by the WGS dataset, identified a species (Bacteroides coprocola) that predominantly contributes to the abundance of leuB, a gene involved in branched chain amino acid biosynthesis; a risk factor for glucose intolerance, insulin resistance, and type 2 diabetes. This trend was not conserved when we trained the model using 16S sequencing profiles from the same dogs. CONCLUSIONS: Our results indicate that WGS sequencing of dog microbiomes detects a greater taxonomic diversity than 16S sequencing of the same dogs at the species level and with respect to four gut-enriched phyla levels. This difference in detection does not significantly impact the performance metrics of machine learning algorithms after down-sampling. 
Although the important features extracted from our best performing model are not conserved between the two technologies, the important features extracted from either instance indicate the utility of machine learning algorithms in identifying biologically meaningful relationships between the host and microbiome community members. In conclusion, this work provides the first systematic machine learning comparison of dog 16S and WGS microbiomes derived from identical study designs.

6.

iCAT: diagnostic assessment tool of immunological history using high-throughput T-cell receptor sequencing.

Rajeh, Ahmad; Wolf, Kyle; Schiebout, Courtney; Sait, Nabeel; Kosfeld, Tim; DiPaolo, Richard J; Ahn, Tae-Hyuk.

F1000Res ; 10: 65, 2021.

Article in English | MEDLINE | ID: mdl-34316355

ABSTRACT

The pathogen exposure history of an individual is recorded in their T-cell repertoire and can be accessed through the study of T-cell receptors (TCRs) if the tools to identify them were available. For each T-cell, the TCR loci undergo genetic rearrangement that creates a unique DNA sequence. In theory these unique sequences can be used as biomarkers for tracking T-cell responses and cataloging immunological history. We developed the immune Cell Analysis Tool (iCAT), an R software package that analyzes TCR sequencing data from exposed (positive) and unexposed (negative) samples to identify TCR sequences statistically associated with positive samples. The presence and absence of associated sequences in samples trains a classifier to diagnose pathogen-specific exposure. We demonstrate the high accuracy of iCAT by testing on three TCR sequencing datasets. First, iCAT successfully diagnosed smallpox vaccinated versus naïve samples in an independent cohort of mice with 95% accuracy. Second, iCAT displayed 100% accuracy classifying naïve and monkeypox vaccinated mice. Finally, we demonstrate the use of iCAT on human samples before and after exposure to SARS-CoV-2, the virus behind the COVID-19 global pandemic. We were able to correctly classify the exposed samples with perfect accuracy. These experimental results show that iCAT capitalizes on the power of TCR sequencing to simplify infection diagnostics. iCAT provides the option of a graphical, user-friendly interface on top of the usual R interface, allowing it to reach a wider audience.

Subjects

COVID-19 , Animals , High-Throughput Nucleotide Sequencing , Humans , Mice , T-Cell Antigen Receptors/genetics , SARS-CoV-2 , Software
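The iCAT abstract describes finding TCR sequences statistically associated with positive samples and then using their presence or absence for classification. A minimal Python sketch of that idea follows. It is an illustration only: the abstract does not name the statistical test, so the one-sided Fisher exact test, the `alpha` threshold, and the function names below are all assumptions, and iCAT itself is an R package.

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a/b: positive samples with/without the sequence,
    c/d: negative samples with/without the sequence.
    Returns P(X >= a) under the hypergeometric null.
    """
    n_pos, n_neg = a + b, c + d
    k = a + c  # samples that contain the sequence
    total = comb(n_pos + n_neg, k)
    tail = sum(comb(n_pos, x) * comb(n_neg, k - x)
               for x in range(a, min(n_pos, k) + 1))
    return tail / total


def associated_sequences(positive, negative, alpha=0.05):
    """positive/negative: lists of sets of TCR sequences, one set per sample."""
    hits = set()
    for seq in set().union(*positive):
        a = sum(seq in s for s in positive)
        c = sum(seq in s for s in negative)
        p = fisher_greater(a, len(positive) - a, c, len(negative) - c)
        if p < alpha:
            hits.add(seq)
    return hits


def score(sample, hits):
    """Fraction of the associated sequences that are present in one sample."""
    return len(sample & hits) / len(hits) if hits else 0.0
```

A downstream classifier could then threshold `score` or feed the presence/absence vector into any standard model; that step is left out here because the abstract does not specify it.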

7.

MegaR: an interactive R package for rapid sample classification and phenotype prediction using metagenome profiles and machine learning.

Dhungel, Eliza; Mreyoud, Yassin; Gwak, Ho-Jin; Rajeh, Ahmad; Rho, Mina; Ahn, Tae-Hyuk.

BMC Bioinformatics ; 22(1): 25, 2021 Jan 18.

Article in English | MEDLINE | ID: mdl-33461494

ABSTRACT

BACKGROUND: Diverse microbiome communities drive biogeochemical processes and evolution of animals in their ecosystems. Many microbiome projects have demonstrated the power of using metagenomics to understand the structures and factors influencing the function of the microbiomes in their environments. In order to characterize the effects from microbiome composition for human health, diseases, and even ecosystems, one must first understand the relationship of microbes and their environment in different samples. Running a machine learning model with metagenomic sequencing data is encouraged for this purpose, but it is not an easy task to make an appropriate machine learning model for all diverse metagenomic datasets. RESULTS: We introduce MegaR, an R Shiny package and web application, to build an unbiased machine learning model effortlessly with interactive visual analysis. MegaR employs taxonomic profiles from either whole metagenome sequencing or 16S rRNA sequencing data to develop machine learning models and classify the samples into two or more categories. It provides various options for model fine tuning throughout the analysis pipeline such as data processing, multiple machine learning techniques, model validation, and unknown sample prediction that can be used to achieve the highest prediction accuracy possible for any given dataset while still maintaining a user-friendly experience. CONCLUSIONS: Metagenomic sample classification and phenotype prediction is important particularly when it applies to a diagnostic method for identifying and predicting microbe-related human diseases. MegaR provides various interactive visualizations for users to build an accurate machine-learning model without difficulty. Unknown sample prediction with a properly trained model using MegaR will help researchers identify the sample property with a fast turnaround time.

Subjects

Machine Learning , Metagenome , Metagenomics , Humans , Phenotype , 16S Ribosomal RNA/genetics

8.

Diagnostic differentiation of Zika and dengue virus exposure by analyzing T cell receptor sequences from peripheral blood of infected HLA-A2 transgenic mice.

Hassert, Mariah; Wolf, Kyle J; Rajeh, Ahmad; Schiebout, Courtney; Hoft, Stella G; Ahn, Tae-Hyuk; DiPaolo, Richard J; Brien, James D; Pinto, Amelia K.

PLoS Negl Trop Dis ; 14(12): e0008896, 2020 12.

Article in English | MEDLINE | ID: mdl-33270635

ABSTRACT

Zika virus (ZIKV) is a significant global health threat due to its potential for rapid emergence and association with severe congenital malformations during infection in pregnancy. Despite the urgent need, accurate diagnosis of ZIKV infection is still a major hurdle that must be overcome. Contributing to the inaccuracy of most serologically-based diagnostic assays for ZIKV, is the substantial geographic and antigenic overlap with other flaviviruses, including the four serotypes of dengue virus (DENV). Within this study, we have utilized a novel T cell receptor (TCR) sequencing platform to distinguish between ZIKV and DENV infections. Using high-throughput TCR sequencing of lymphocytes isolated from DENV and ZIKV infected mice, we were able to develop an algorithm which could identify virus-associated TCR sequences uniquely associated with either a prior ZIKV or DENV infection in mice. Using this algorithm, we were then able to separate mice that had been exposed to ZIKV or DENV infection with 97% accuracy. Overall this study serves as a proof-of-principle that T cell receptor sequencing can be used as a diagnostic tool capable of distinguishing between closely related viruses. Our results demonstrate the potential for this innovative platform to be used to accurately diagnose Zika virus infection and potentially the next emerging pathogen(s).

Subjects

Dengue/diagnosis , HLA-A2 Antigen/genetics , T-Cell Antigen Receptors/metabolism , Zika Virus Infection/diagnosis , Animals , Viral Antibodies/blood , Cross Reactions/immunology , Dengue/blood , Mice , Transgenic Mice , T-Cell Antigen Receptors/chemistry , Serologic Tests/methods , Zika Virus Infection/blood

9.

Single-Cell Transcriptional Analyses Identify Lineage-Specific Epithelial Responses to Inflammation and Metaplastic Development in the Gastric Corpus.

Bockerstett, Kevin A; Lewis, Scott A; Noto, Christine N; Ford, Eric L; Saenz, José B; Jackson, Nicholas M; Ahn, Tae-Hyuk; Mills, Jason C; DiPaolo, Richard J.

Gastroenterology ; 159(6): 2116-2129.e4, 2020 12.

Article in English | MEDLINE | ID: mdl-32835664

ABSTRACT

BACKGROUND & AIMS: Chronic atrophic gastritis can lead to gastric metaplasia and increase risk of gastric adenocarcinoma. Metaplasia is a precancerous lesion associated with an increased risk for carcinogenesis, but the mechanism(s) by which inflammation induces metaplasia are poorly understood. We investigated transcriptional programs in mucous neck cells and chief cells as they progress to metaplasia in mice with chronic gastritis. METHODS: We analyzed previously generated single-cell RNA-sequencing (scRNA-seq) data of gastric corpus epithelium to define transcriptomes of individual epithelial cells from healthy BALB/c mice (controls) and TxA23 mice, which have chronically inflamed stomachs with metaplasia. Chronic gastritis was induced in B6 mice by Helicobacter pylori infection. Gastric tissues from mice and human patients were analyzed by immunofluorescence to verify findings at the protein level. Pseudotime trajectory analysis of scRNA-seq data was used to predict differentiation of normal gastric epithelium to metaplastic epithelium in chronically inflamed stomachs. RESULTS: Analyses of gastric epithelial transcriptomes revealed that gastrokine 3 (Gkn3) mRNA is a specific marker of mouse gastric corpus metaplasia (spasmolytic polypeptide expressing metaplasia, SPEM). Gkn3 mRNA was undetectable in healthy gastric corpus; its expression in chronically inflamed stomachs (from TxA23 mice and mice with Helicobacter pylori infection) identified more metaplastic cells throughout the corpus than previously recognized. Staining of healthy and diseased human gastric tissue samples paralleled these results. Although mucous neck cells and chief cells from healthy stomachs each had distinct transcriptomes, in chronically inflamed stomachs, these cells had distinct transcription patterns that converged upon a pre-metaplastic pattern, which lacked the metaplasia-associated transcripts.
Finally, pseudotime trajectory analysis confirmed the convergence of mucous neck cells and chief cells into a pre-metaplastic phenotype that ultimately progressed to metaplasia. CONCLUSIONS: In analyses of tissues from chronically inflamed stomachs of mice and humans, we expanded the definition of gastric metaplasia to include Gkn3 mRNA and GKN3-positive cells in the corpus, allowing a more accurate assessment of SPEM. Under conditions of chronic inflammation, chief cells and mucous neck cells are plastic and converge into a pre-metaplastic cell type that progresses to metaplasia.

Subjects

Gastric Chief Cells/pathology , Atrophic Gastritis/immunology , Helicobacter Infections/immunology , Precancerous Conditions/diagnosis , Stomach Neoplasms/prevention & control , Animals , Biomarkers/analysis , Biomarkers/metabolism , Carcinogenesis/genetics , Carcinogenesis/immunology , Carrier Proteins/analysis , Carrier Proteins/metabolism , Gastric Chief Cells/immunology , Animal Disease Models , Female , Atrophic Gastritis/microbiology , Atrophic Gastritis/pathology , Helicobacter Infections/genetics , Helicobacter Infections/microbiology , Helicobacter Infections/pathology , Helicobacter pylori/immunology , Humans , Male , Membrane Proteins/analysis , Membrane Proteins/metabolism , Metaplasia/diagnosis , Metaplasia/genetics , Metaplasia/immunology , Metaplasia/pathology , Mice , Precancerous Conditions/genetics , Precancerous Conditions/immunology , Precancerous Conditions/pathology , RNA-Seq , Single-Cell Analysis , Stomach Neoplasms/pathology

10.

Rchimerism: An R Package for Automated Chimerism Data Analysis.

Siddiqui, Zohair; Maldonado, Juan; Grojean, Jeremy; Ye, Fei; Zhang, David; Longtine, Janina; Ahn, Tae-Hyuk; Guo, Huazhang.

J Mol Diagn ; 22(1): 21-29, 2020 01.

Article in English | MEDLINE | ID: mdl-31605803

ABSTRACT

A quantitative chimerism test monitors engraftment of donor hematopoietic stem cells or relapse of leukemias or lymphomas in hematopoietic stem cell transplantation patients. The most common method used for chimerism testing is PCR amplification of short tandem repeat loci, followed by capillary gel electrophoresis. Manual data analysis is tedious and time consuming, as it involves the selection of informative loci and the repetition of quantifying chimerism percentage for multiple loci from multiple cell types. It is also susceptible to human errors. Currently, there is no free software to fully automate chimerism data analysis. Rchimerism, an R Shiny package, was developed to automatically pick informative loci, calculate chimerism percentage, and display the results through a user-friendly interface. The accuracy of the program was compared with manual calculation on 60 patient samples with 100% concordance. Compared with manual calculation, Rchimerism drastically reduces analysis time from 20 to 40 minutes for single donor transplantation samples and from 40 to 80 minutes for double donor transplantation samples to <1 minute. Rchimerism can be downloaded and used freely by noncommercial laboratories.

Subjects

Chimera/genetics , Chimerism , Data Analysis , Graft Rejection/genetics , Graft Survival/genetics , Hematopoietic Stem Cell Transplantation , User-Computer Interface , Algorithms , Alleles , Biomarkers , Data Accuracy , Capillary Electrophoresis , Genetic Loci , Humans , Microsatellite Repeats , Polymerase Chain Reaction , Tissue Donors , Transplant Recipients
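The two steps named in the Rchimerism abstract, picking informative loci and calculating a chimerism percentage, can be sketched in Python for the single-donor case. This is a simplified illustration and not Rchimerism's implementation: it assumes a locus is informative only when donor and recipient each carry at least one allele the other lacks, and it computes percent donor from the peak areas of those unshared alleles. All function names and the input layout are hypothetical.

```python
def informative_alleles(donor, recipient):
    """Return (donor_only, recipient_only) allele sets for one STR locus."""
    donor, recipient = set(donor), set(recipient)
    return donor - recipient, recipient - donor


def donor_percent(donor, recipient, peaks):
    """Percent donor signal at one locus, or None if the locus is skipped.

    peaks maps allele -> peak area measured in the post-transplant sample.
    Simplifying assumption: only loci where both genotypes have a unique
    allele are used, so shared-allele peaks never enter the ratio.
    """
    donor_only, recipient_only = informative_alleles(donor, recipient)
    if not donor_only or not recipient_only:
        return None  # not informative under this simplified rule
    d = sum(peaks.get(a, 0.0) for a in donor_only)
    r = sum(peaks.get(a, 0.0) for a in recipient_only)
    if d + r == 0:
        return None  # no signal from either unique allele
    return 100.0 * d / (d + r)


def mean_donor_percent(loci):
    """Average percent donor over informative loci.

    loci is an iterable of (donor_alleles, recipient_alleles, peaks).
    """
    values = [donor_percent(d, r, p) for d, r, p in loci]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None
```

Averaging over several informative loci is one common way to stabilise the estimate; double-donor samples, which the abstract also mentions, would need a three-way version of the same ratio.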

11.

Single-cell transcriptional analyses of spasmolytic polypeptide-expressing metaplasia arising from acute drug injury and chronic inflammation in the stomach.

Bockerstett, Kevin A; Lewis, Scott A; Wolf, Kyle J; Noto, Christine N; Jackson, Nicholas M; Ford, Eric L; Ahn, Tae-Hyuk; DiPaolo, Richard J.

Gut ; 69(6): 1027-1038, 2020 06.

Article in English | MEDLINE | ID: mdl-31481545

ABSTRACT

OBJECTIVE: Spasmolytic polypeptide-expressing metaplasia (SPEM) is a regenerative lesion in the gastric mucosa and is a potential precursor to intestinal metaplasia/gastric adenocarcinoma in a chronic inflammatory setting. The goal of these studies was to define the transcriptional changes associated with SPEM at the individual cell level in response to acute drug injury and chronic inflammatory damage in the gastric mucosa. DESIGN: Epithelial cells were isolated from the gastric corpus of healthy stomachs and stomachs with drug-induced and inflammation-induced SPEM lesions. Single cell RNA sequencing (scRNA-seq) was performed on tissue samples from each of these settings. The transcriptomes of individual epithelial cells from healthy, acutely damaged and chronically inflamed stomachs were analysed and compared. RESULTS: scRNA-seq revealed a population of Mucin 6 (Muc6)+ gastric intrinsic factor (Gif)+ cells in healthy tissue, but these cells did not express transcripts associated with SPEM. Furthermore, analyses of SPEM cells from drug injured and chronically inflamed corpus yielded two major findings: (1) SPEM and neck cell hyperplasia/hypertrophy are nearly identical in the expression of SPEM-associated transcripts and (2) SPEM programmes induced by drug-mediated parietal cell ablation and chronic inflammation are nearly identical, although the induction of transcripts involved in immunomodulation was unique to SPEM cells in the chronic inflammatory setting. CONCLUSIONS: These data necessitate an expansion of the definition of SPEM to include Tff2+Muc6+ cells that do not express mature chief cell transcripts such as Gif. Our data demonstrate that SPEM arises by a highly conserved cellular programme independent of aetiology and develops immunoregulatory capabilities in a setting of chronic inflammation.

Subjects

Gastric Mucosa/metabolism , Gastritis/chemically induced , Intercellular Signaling Peptides and Proteins/metabolism , Animals , Female , Fluorescent Antibody Technique , Gastric Mucosa/drug effects , Gastric Mucosa/pathology , Gastritis/metabolism , Gastritis/pathology , Gene Expression Profiling , In Situ Hybridization , Male , Metaplasia/chemically induced , Metaplasia/metabolism , Mice , Inbred BALB C Mice , Mucin-6/metabolism , RNA Sequence Analysis , Single-Cell Analysis , Tamoxifen/pharmacology , Trefoil Factor-2/metabolism

12.

Using Apache Spark on genome assembly for scalable overlap-graph reduction.

Paul, Alexander J; Lawrence, Dylan; Song, Myoungkyu; Lim, Seung-Hwan; Pan, Chongle; Ahn, Tae-Hyuk.

Hum Genomics ; 13(Suppl 1): 48, 2019 10 22.

Article in English | MEDLINE | ID: mdl-31639049

ABSTRACT

BACKGROUND: De novo genome assembly is a technique that builds the genome of a specimen using overlaps of genomic fragments without additional work with a reference sequence. Sequence fragments (called reads) are assembled as contigs and scaffolds by the overlaps. The quality of the de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly of species, existing sequencing techniques have been proposed, for example, high-throughput next-generation sequencing and long-reads-producing third-generation sequencing. However, these techniques require large amounts of computer memory when very large overlap graphs are resolved. Parallel computation is also challenging. RESULTS: To address the limitations, we propose an innovative algorithmic approach, called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction algorithms using Apache Spark. SORA's implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently compacts the number of edges on enormous graphing paths by adapting scalable features of graph processing libraries provided by Apache Spark, GraphX and GraphFrames. CONCLUSIONS: We shared the algorithms and the experimental results at our project website, https://github.com/BioHPC/SORA. We evaluated SORA with human genome samples. First, it processed a nearly one billion edge graph on a distributed cloud cluster. Second, it processed mid-to-small size graphs on a single workstation within a short time frame. Overall, SORA achieved linear scaling as the number of computing instances increased.

Subjects

Algorithms , Genome , DNA Sequence Analysis , Base Sequence , Conyza/genetics , Genetic Databases , Human Genome , Plant Genome , Humans
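One standard string-graph reduction step, transitive edge removal, gives a feel for what "overlap-graph reduction" in the SORA abstract means. The sketch below is plain single-machine Python for illustration only. It does not use Apache Spark, GraphX, or GraphFrames, it ignores overlap lengths and read orientation, and it assumes an acyclic overlap graph; whether SORA's own reduction matches it in detail is not stated in the abstract.

```python
def transitive_reduce(edges):
    """Drop every edge u->w for which a two-step path u->v->w also exists.

    edges is an iterable of (source_read, target_read) pairs.
    Returns the remaining edges as a sorted list.
    """
    edges = set(edges)
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)

    redundant = set()
    for u, direct in successors.items():
        for v in direct:
            # Any successor of v that u also reaches directly is implied by u->v->w.
            for w in successors.get(v, ()):
                if w != v and w in direct:
                    redundant.add((u, w))
    return sorted(edges - redundant)


if __name__ == "__main__":
    overlaps = [("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("r3", "r4")]
    print(transitive_reduce(overlaps))
```

In a distributed setting the same test (does a neighbour of my neighbour duplicate one of my own edges?) would be expressed as a join over the edge table, which is the kind of operation graph libraries on Spark are built for.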

13.

Massive metagenomic data analysis using abundance-based machine learning.

Harris, Zachary N; Dhungel, Eliza; Mosior, Matthew; Ahn, Tae-Hyuk.

Biol Direct ; 14(1): 12, 2019 08 01.

Article in English | MEDLINE | ID: mdl-31370905

ABSTRACT

BACKGROUND: Metagenomics is the application of modern genomic techniques to investigate the members of a microbial community directly in their natural environments and is widely used in many studies to survey the communities of microbial organisms that live in diverse ecosystems. In order to understand the metagenomic profile of one of the densest interaction spaces for millions of people, the public transit system, the MetaSUB International Consortium has collected and sequenced metagenomes from subways of different cities across the world. In collaboration with CAMDA, MetaSUB has made the metagenomic samples from these cities available for an open challenge of data analysis including, but not limited in scope to, the identification of unknown samples. RESULTS: To distinguish the metagenomic profiling among different cities and also predict unknown samples precisely based on the profiling, two different approaches are proposed using machine learning techniques; one is a read-based taxonomy profiling of each sample and prediction method, and the other is a reduced representation assembly-based method. Among various machine learning techniques tested, the random forest technique showed promising results as a suitable classifier for both approaches. Random forest models developed from read-based taxonomic profiling could achieve an accuracy of 91% with 95% confidence interval between 80 and 93%. The assembly-based random forest model prediction also reached 90% accuracy. However, both models achieved roughly the same accuracy on the testing set, whereby they both failed to predict the most abundant label. CONCLUSION: Our results suggest that both read-based and assembly-based approaches are powerful tools for the analysis of metagenomics data. Moreover, our results suggest that reduced representation assembly-based methods are able to simultaneously provide high-accuracy prediction on available data.
Overall, we show that metagenomic samples can be traced back to their location with careful generation of features from the composition of microbes and utilizing existing machine learning algorithms. Proposed approaches show high accuracy of prediction, but require careful inspection before making any decisions due to sample noise or complexity. REVIEWERS: This article was reviewed by Eugene V. Koonin, Jing Zhou and Serghei Mangul.

Subjects

Data Analysis , Machine Learning , Metagenome , Metagenomics/methods , Microbiota/genetics

14.

YeasTSS: an integrative web database of yeast transcription start sites.

McMillan, Jonathan; Lu, Zhaolian; Rodriguez, Judith S; Ahn, Tae-Hyuk; Lin, Zhenguo.

Database (Oxford) ; 2019, 2019 01 01.

Article in English | MEDLINE | ID: mdl-31032841

ABSTRACT

The transcription initiation landscape of eukaryotic genes is complex and highly dynamic. In eukaryotes, genes can generate multiple transcript variants that differ in 5' boundaries due to usages of alternative transcription start sites (TSSs), and the abundance of transcript isoforms is highly variable. Due to the large number and complexity of the TSSs, it is not feasible to depict details of the transcript initiation landscape of all genes using text-format genome annotation files. Therefore, it is necessary to provide data visualization of TSSs to represent quantitative TSS maps and the core promoters (CPs). In addition, the selection and activity of TSSs are influenced by various factors, such as transcription factors, chromatin remodeling and histone modifications. Thus, integration and visualization of functional genomic data related to these features could provide a better understanding of the gene promoter architecture and regulatory mechanism of transcription initiation. Yeast species play important roles in research and human society, yet no database provides visualization and integration of functional genomic data in yeast. Here, we generated quantitative TSS maps for 12 important yeast species, inferred their CPs and built a public database, YeasTSS (www.yeastss.org). YeasTSS was designed as a central portal for visualization and integration of the TSS maps, CPs and functional genomic data related to transcription initiation in yeast. YeasTSS is expected to benefit the research community and public education for improving genome annotation, studies of promoter structure, regulated control of transcription initiation and inferring gene regulatory network.

Subjects

Databases, Nucleic Acid , Internet , Transcription Initiation Site , Yeasts/genetics , Yeasts/classification

15.

Identifying and Tracking Low-Frequency Virus-Specific TCR Clonotypes Using High-Throughput Sequencing.

Wolf, Kyle; Hether, Tyler; Gilchuk, Pavlo; Kumar, Amrendra; Rajeh, Ahmad; Schiebout, Courtney; Maybruck, Julie; Buller, R Mark; Ahn, Tae-Hyuk; Joyce, Sebastian; DiPaolo, Richard J.

Cell Rep ; 25(9): 2369-2378.e4, 2018 11 27.

Article in English | MEDLINE | ID: mdl-30485806

ABSTRACT

Tracking antigen-specific T cell responses over time within individuals is difficult because of lack of knowledge of antigen-specific TCR sequences, limitations in sample size, and assay sensitivities. We hypothesized that analyses of high-throughput sequencing of TCR clonotypes could provide functional readouts of individuals' immunological histories. Using high-throughput TCR sequencing, we develop a database of TCRβ sequences from large cohorts of mice before (naive) and after smallpox vaccination. We computationally identify 315 vaccine-associated TCR sequences (VATS) that are used to train a diagnostic classifier that distinguishes naive from vaccinated samples in mice up to 9 months post-vaccination with >99% accuracy. We determine that the VATS library contains virus-responsive TCRs by in vitro expansion assays and virus-specific tetramer sorting. These data outline a platform for advancing our capabilities to identify pathogen-specific TCR sequences, which can be used to identify and quantitate low-frequency pathogen-specific TCR sequences in circulation over time with exceptional sensitivity.

Subjects

Cell Tracking , High-Throughput Nucleotide Sequencing/methods , Receptors, Antigen, T-Cell/metabolism , Viruses/metabolism , Amino Acid Sequence , Animals , Clone Cells , Female , Gene Library , Male , Mice, Inbred C57BL , Orthopoxvirus , Peptides/chemistry , Poxviridae Infections/virology , Receptors, Antigen, T-Cell/chemistry , Vaccination

16.

LONGO: an R package for interactive gene length dependent analysis for neuronal identity.

McCoy, Matthew J; Paul, Alexander J; Victor, Matheus B; Richner, Michelle; Gabel, Harrison W; Gong, Haijun; Yoo, Andrew S; Ahn, Tae-Hyuk.

Bioinformatics ; 34(13): i422-i428, 2018 07 01.

Article in English | MEDLINE | ID: mdl-29950021

ABSTRACT

Motivation: Reprogramming somatic cells into neurons holds great promise to model neuronal development and disease. The efficiency and success rate of neuronal reprogramming, however, may vary between different conversion platforms and cell types, thereby necessitating an unbiased, systematic approach to estimate neuronal identity of converted cells. Recent studies have demonstrated that long genes (>100 kb from transcription start to end) are highly enriched in neurons, which provides an opportunity to identify neurons based on the expression of these long genes. Results: We have developed a versatile R package, LONGO, to analyze gene expression based on gene length. We propose a systematic analysis of long gene expression (LGE) with a metric termed the long gene quotient (LQ) that quantifies LGE in RNA-seq or microarray data to validate neuronal identity at the single-cell and population levels. This unique feature of neurons provides an opportunity to utilize measurements of LGE in transcriptome data to quickly and easily distinguish neurons from non-neuronal cells. By combining this conceptual advancement and statistical tool in a user-friendly and interactive software package, we intend to encourage and simplify further investigation into LGE, particularly as it applies to validating and improving neuronal differentiation and reprogramming methodologies. Availability and implementation: LONGO is freely available for download at https://github.com/biohpc/longo. Supplementary information: Supplementary data are available at Bioinformatics online.

Subjects

Cellular Reprogramming , Gene Expression Profiling/methods , Neurons/metabolism , Software , Aged , Female , Humans , Male , Middle Aged , Neurons/physiology , Oligonucleotide Array Sequence Analysis/methods , Sequence Analysis, RNA/methods , Transcriptome

17.

Insights from 20 years of bacterial genome sequencing.

Land, Miriam; Hauser, Loren; Jun, Se-Ran; Nookaew, Intawat; Leuze, Michael R; Ahn, Tae-Hyuk; Karpinets, Tatiana; Lund, Ole; Kora, Guruprased; Wassenaar, Trudy; Poudel, Suresh; Ussery, David W.

Funct Integr Genomics ; 15(2): 141-61, 2015 Mar.

Article in English | MEDLINE | ID: mdl-25722247

ABSTRACT

Since the first two complete bacterial genome sequences were published in 1995, the science of bacteria has dramatically changed. Using third-generation DNA sequencing, it is possible to completely sequence a bacterial genome in a few hours and identify some types of methylation sites along the genome as well. Sequencing of bacterial genome sequences is now a standard procedure, and the information from tens of thousands of bacterial genomes has had a major impact on our views of the bacterial world. In this review, we explore a series of questions to highlight some insights that comparative genomics has produced. To date, there are genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. However, the distribution is quite skewed towards a few phyla that contain model organisms. But the breadth is continuing to improve, with projects dedicated to filling in less characterized taxonomic groups. The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas system provides bacteria with immunity against viruses, which outnumber bacteria by tenfold. How fast can we go? Second-generation sequencing has produced a large number of draft genomes (close to 90% of bacterial genomes in GenBank are currently not complete); third-generation sequencing can potentially produce a finished genome in a few hours, and at the same time provide methylation sites along the entire chromosome. The diversity of bacterial communities is extensive as is evident from the genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. Genome sequencing can help in classifying an organism, and in the case where multiple genomes of the same species are available, it is possible to calculate the pan- and core genomes; comparison of more than 2000 Escherichia coli genomes finds an E. coli core genome of about 3100 gene families and a total of about 89,000 different gene families.
Why do we care about bacterial genome sequencing? There are many practical applications, such as genome-scale metabolic modeling, biosurveillance, bioforensics, and infectious disease epidemiology. In the near future, high-throughput sequencing of patient metagenomic samples could revolutionize medicine in terms of speed and accuracy of finding pathogens and knowing how to treat them.

Subjects

Genome, Bacterial , Bacteria/classification , Bacterial Proteins/genetics , Codon , Genetic Variation , Genome Size , Genomics , Metagenomics , Molecular Sequence Annotation , Phylogeny , Sequence Analysis, DNA

18.

Sigma: strain-level inference of genomes from metagenomic analysis for biosurveillance.

Ahn, Tae-Hyuk; Chai, Juanjuan; Pan, Chongle.

Bioinformatics ; 31(2): 170-7, 2015 Jan 15.

Article in English | MEDLINE | ID: mdl-25266224

ABSTRACT

MOTIVATION: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis. RESULTS: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic reads to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. The algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains. AVAILABILITY AND IMPLEMENTATION: Sigma was implemented in C++ with source codes and binaries freely available at http://sigma.omicsbio.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Biosurveillance , Computational Biology/methods , DNA, Bacterial/analysis , Genome, Bacterial , Metagenomics/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Humans

19.

Functional phylogenomics analysis of bacteria and archaea using consistent genome annotation with UniFam.

Chai, Juanjuan; Kora, Guruprasad; Ahn, Tae-Hyuk; Hyatt, Doug; Pan, Chongle.

BMC Evol Biol ; 14: 207, 2014 Oct 09.

Article in English | MEDLINE | ID: mdl-25293379

ABSTRACT

BACKGROUND: Phylogenetic studies have provided detailed knowledge on the evolutionary mechanisms of genes and species in Bacteria and Archaea. However, the evolution of cellular functions, represented by metabolic pathways and biological processes, has not been systematically characterized. Many clades in the prokaryotic tree of life have now been covered by sequenced genomes in GenBank. This enables a large-scale functional phylogenomics study of many computationally inferred cellular functions across all sequenced prokaryotes. RESULTS: A total of 14,727 GenBank prokaryotic genomes were re-annotated using a new protein family database, UniFam, to obtain consistent functional annotations for accurate comparison. The functional profile of a genome was represented by the biological process Gene Ontology (GO) terms in its annotation. The GO term enrichment analysis differentiated the functional profiles between selected archaeal taxa. 706 prokaryotic metabolic pathways were inferred from these genomes using Pathway Tools and MetaCyc. The consistency between the distribution of metabolic pathways in the genomes and the phylogenetic tree of the genomes was measured using parsimony scores and retention indices. The ancestral functional profiles at the internal nodes of the phylogenetic tree were reconstructed to track the gains and losses of metabolic pathways in evolutionary history. CONCLUSIONS: Our functional phylogenomics analysis shows divergent functional profiles of taxa and clades. Such function-phylogeny correlation stems from a set of clade-specific cellular functions with low parsimony scores. On the other hand, many cellular functions are sparsely dispersed across many clades with high parsimony scores. These different types of cellular functions have distinct evolutionary patterns reconstructed from the prokaryotic tree.

Subjects

Archaea/genetics , Bacteria/genetics , Molecular Sequence Annotation/methods , Databases, Protein , Genome, Archaeal , Genome, Bacterial , Phylogeny

20.

Diverse and divergent protein post-translational modifications in two growth stages of a natural microbial community.

Li, Zhou; Wang, Yingfeng; Yao, Qiuming; Justice, Nicholas B; Ahn, Tae-Hyuk; Xu, Dong; Hettich, Robert L; Banfield, Jillian F; Pan, Chongle.

Nat Commun ; 5: 4405, 2014 Jul 25.

Article in English | MEDLINE | ID: mdl-25059763

ABSTRACT

Detailed characterization of post-translational modifications (PTMs) of proteins in microbial communities remains a significant challenge. Here we directly identify and quantify a broad range of PTMs (hydroxylation, methylation, citrullination, acetylation, phosphorylation, methylthiolation, S-nitrosylation and nitration) in a natural microbial community from an acid mine drainage site. Approximately 29% of the identified proteins of the dominant Leptospirillum group II bacteria are modified, and 43% of modified proteins carry multiple PTM types. Most PTM events, except S-nitrosylations, have low fractional occupancy. Notably, PTM events are detected on Cas proteins involved in antiviral defense, an aspect of Cas biochemistry not considered previously. Further, Cas PTM profiles from Leptospirillum group II differ in early versus mature biofilms. PTM patterns are divergent on orthologues of two closely related, but ecologically differentiated, Leptospirillum group II bacteria. Our results highlight the prevalence and dynamics of PTMs of proteins, with potential significance for ecological adaptation and microbial evolution.

Subjects

Microbial Consortia/physiology , Protein Processing, Post-Translational , Acetylation , Bacteria/growth & development , Bacteria/metabolism , Bacterial Proteins/metabolism , Biofilms , California , Ecosystem , Escherichia coli/metabolism , Hydroxylation , Methylation , Phosphorylation , Proteome/analysis
