ABSTRACT
Identification of neoepitopes that can control tumor growth in vivo remains a challenge even 10 years after the first genomics-defined cancer neoepitopes were identified. In this study, we identify a neoepitope, resulting from a mutation in the junction plakoglobin (Jup) gene (chromosome 11), from the mouse colon cancer line MC38-FABF (C57BL/6). This neoepitope, Jup mutant (JupMUT), was detected during mass spectrometry of MHC class I-eluted peptides from the tumor. JupMUT has a predicted binding affinity of 564 nM for the Kb molecule and a higher predicted affinity of 82 nM for Db. However, whereas structural modeling of JupMUT and its unmutated counterpart Jup wild-type indicates that there is little conformational difference between the two epitopes bound to Db, large structural divergences are predicted between the two epitopes bound to Kb. Together with in vitro binding data with RMA-S cells, these data suggest that Kb rather than Db is the relevant MHC class I molecule of JupMUT. Immunization of naive C57BL/6 mice with JupMUT elicits CD8-dependent tumor control of an MC38-FABF challenge. Despite the CD8 dependence of JupMUT-mediated tumor control in vivo, CD8+ T cells from JupMUT-immunized mice do not produce higher levels of IFN-γ than do those from naive mice. The structural and immunological characteristics of JupMUT are substantially different from those of many other neoepitopes that have been shown to mediate tumor control.
ABSTRACT
BACKGROUND: Personalized cancer vaccines are emerging as one of the most promising approaches to immunotherapy of advanced cancers. However, only a small proportion of the neoepitopes generated by somatic DNA mutations in cancer cells lead to tumor rejection. Since it is impractical to experimentally assess all candidate neoepitopes prior to vaccination, developing accurate methods for predicting tumor-rejection mediating neoepitopes (TRMNs) is critical for enabling routine clinical use of cancer vaccines. RESULTS: In this paper we introduce Positive-unlabeled Learning using AuTOml (PLATO), a general semi-supervised approach to improving accuracy of model-based classifiers. PLATO generates a set of high confidence positive calls by applying a stringent filter to model-based predictions, then rescores remaining candidates by using positive-unlabeled learning. To achieve robust performance on clinical samples with large patient-to-patient variation, PLATO further integrates AutoML hyper-parameter tuning, classification threshold selection based on spies, and support for bootstrapping. CONCLUSIONS: Experimental results on real datasets demonstrate that PLATO has improved performance compared to model-based approaches for two key steps in TRMN prediction, namely somatic variant calling from exome sequencing data and peptide identification from MS/MS data.
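The spy-based threshold selection mentioned in this abstract can be illustrated with a toy sketch. It is not PLATO's implementation: the one-dimensional feature, the nearest-centroid scorer, the function names, and the rule for choosing spies are all assumptions made to keep the example self-contained. Some known positives are hidden among the unlabeled items as "spies", a scorer is fitted, and the threshold is set so that the spies are still recovered.

```python
def centroid_scorer(positives, unlabeled):
    """Fit a toy scorer: higher score means closer to the positive centroid."""
    mean_pos = sum(positives) / len(positives)
    mean_unl = sum(unlabeled) / len(unlabeled)
    return lambda x: abs(x - mean_unl) - abs(x - mean_pos)

def spy_threshold(spy_scores, spy_miss_rate=0.0):
    """Largest threshold that misses at most `spy_miss_rate` of the spies."""
    ranked = sorted(spy_scores)
    allowed_misses = int(len(ranked) * spy_miss_rate)
    return ranked[allowed_misses]

def likely_positives(positives, unlabeled, spy_step=5):
    # Hide every `spy_step`-th positive among the unlabeled items (hypothetical rule).
    spies = positives[::spy_step]
    kept = [p for i, p in enumerate(positives) if i % spy_step != 0]
    score = centroid_scorer(kept, unlabeled + spies)
    threshold = spy_threshold([score(s) for s in spies])
    # Unlabeled items scoring at least as well as the spies are called positive.
    return [u for u in unlabeled if score(u) >= threshold]

positives = [0.9, 1.0, 1.1, 0.95, 1.05, 1.0, 0.85, 1.15, 0.98, 1.02]
unlabeled = [0.0, 0.1, -0.1, 0.05, 1.0, 0.97, 0.2, -0.05]
calls = likely_positives(positives, unlabeled)
```

Because the threshold is derived from items known to be positive, it adapts to each dataset instead of relying on a fixed cutoff, which is the property the abstract associates with robustness to patient-to-patient variation.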
Subjects
Immunotherapy, Neoplasms/therapy, Peptides/analysis, Precision Medicine, Supervised Machine Learning, Epitopes/immunology, Epitopes/metabolism, Humans, Single Nucleotide Polymorphism, Tandem Mass Spectrometry, Exome Sequencing
ABSTRACT
BACKGROUND: Single cell transcriptomics is critical for understanding cellular heterogeneity and identifying novel cell types. Leveraging the recent advances in single cell RNA sequencing (scRNA-Seq) technology requires novel unsupervised clustering algorithms that are robust to high levels of technical and biological noise and scale to datasets of millions of cells. RESULTS: We present novel computational approaches for clustering scRNA-Seq data based on the Term Frequency - Inverse Document Frequency (TF-IDF) transformation that has been successfully used in the field of text analysis. CONCLUSIONS: Experimental results show that TF-IDF methods consistently outperform commonly used scRNA-Seq clustering approaches.
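As a rough illustration of how the TF-IDF transformation carries over from text to expression data, the sketch below treats each cell as a "document" and each gene as a "term". It uses the textbook definition (term frequency multiplied by the log of the inverse document frequency); the exact variant used in the paper may differ, and the function name and toy matrix are assumptions.

```python
import math

def tfidf_transform(counts):
    """counts: list of cells, each a list of per-gene read counts."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    # Document frequency: number of cells in which each gene is detected.
    df = [sum(1 for cell in counts if cell[g] > 0) for g in range(n_genes)]
    idf = [math.log(n_cells / d) if d else 0.0 for d in df]
    transformed = []
    for cell in counts:
        total = sum(cell)
        # Term frequency: share of the cell's reads assigned to each gene.
        transformed.append(
            [(c / total) * idf[g] if total else 0.0 for g, c in enumerate(cell)]
        )
    return transformed

counts = [
    [10, 0, 5],  # cell 0
    [8, 2, 0],   # cell 1
    [9, 0, 0],   # cell 2
]
scores = tfidf_transform(counts)
```

In this toy matrix the first gene is detected in every cell, so its inverse document frequency is zero and it contributes nothing to the transformed profile, while genes detected in a single cell are up-weighted. That down-weighting of ubiquitous genes is what makes the transformation attractive as a preprocessing step for clustering.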
Subjects
Gene Expression Profiling/methods, RNA Sequence Analysis/methods, Algorithms, Cluster Analysis, Single-Cell Analysis
ABSTRACT
BACKGROUND: While RNA is often created from linear splicing during transcription, recent studies have found that non-canonical splicing sometimes occurs. Non-canonical splicing joins the 3' and 5' ends and forms so-called circular RNA. It is now believed that circular RNA plays important biological roles, such as affecting susceptibility to some diseases. During the past several years, multiple experimental methods have been developed to enrich circular RNA while degrading linear RNA. Although several useful software tools for circular RNA detection have been developed as well, these tools are based on read mapping and may miss many circular RNAs. Also, existing tools are slow on large data sets because of their dependence on read mapping. METHOD: In this paper, we present a new computational approach, named CircMarker, based on k-mers rather than read mapping for circular RNA detection. CircMarker takes advantage of transcriptome annotation files to create the k-mer table for circular RNA detection. RESULTS: Empirical results show that CircMarker outperforms existing tools in circular RNA detection in both accuracy and efficiency on many simulated and real datasets. CONCLUSIONS: We develop a new circular RNA detection method called CircMarker based on k-mer analysis. Our results on both simulated and real data demonstrate that CircMarker runs much faster and can find more circular RNAs, with higher consensus-based sensitivity and a high accuracy ratio, compared with existing tools.
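The idea of detecting circular RNA from an annotation-derived k-mer table, without mapping reads, can be sketched as follows. This is a deliberately simplified illustration and not CircMarker's algorithm: the table here holds only k-mers that span a back-splice junction (the end of a donor exon joined to the start of the same or an upstream exon), and the exon sequences, the value of k, and the function names are assumptions.

```python
def junction_kmers(exons, k=8):
    """Map each back-splice junction k-mer to its (donor, acceptor) exon indices."""
    half = k // 2
    table = {}
    for donor in range(len(exons)):
        # In a back-splice the acceptor exon is the donor itself or lies upstream of it.
        for acceptor in range(donor + 1):
            if len(exons[donor]) < half or len(exons[acceptor]) < k - half:
                continue
            kmer = exons[donor][-half:] + exons[acceptor][:k - half]
            table.setdefault(kmer, (donor, acceptor))
    return table

def find_back_splice(read, table, k=8):
    """Return the first junction hit in the read, or None if there is none."""
    for i in range(len(read) - k + 1):
        hit = table.get(read[i:i + k])
        if hit is not None:
            return hit
    return None

exons = ["ACGTACGGTTCA", "GGCATTCAGGAT", "TTGACCATGCAA"]  # toy annotation
table = junction_kmers(exons)
# A read crossing the junction from the end of exon 2 back to the start of exon 0.
read = exons[2][-6:] + exons[0][:6]
hit = find_back_splice(read, table)
```

Each read costs a handful of hash lookups instead of an alignment, which is where the speed advantage of a k-mer approach comes from. A practical tool would also need to filter junction k-mers that occur in linear transcripts and require support from multiple reads.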
Subjects
Algorithms, RNA/analysis, RNA Sequence Analysis/methods, Humans, RNA/chemistry, Circular RNA, Software
ABSTRACT
Dendritic cells play a critical role in initiating T-cell responses. In spite of this recognition, they have not been used widely as adjuvants, nor is the mechanism of their adjuvanticity fully understood. Here, using a mutated neoepitope of a mouse fibrosarcoma as the antigen, and tumor rejection as the end point, we show that dendritic cells but not macrophages possess superior adjuvanticity. Several types of dendritic cells, such as bone marrow-derived dendritic cells (GM-CSF cultured or FLT3-ligand induced) or monocyte-derived ones, are powerful adjuvants, although GM-CSF-cultured cells show the highest activity. Among these, the CD11c+ MHCIIlo sub-set, distinguishable by a distinct transcriptional profile including a higher expression of heat shock protein receptors CD91 and LOX1, mannose receptors and TLRs, is significantly superior to the CD11c+ MHCIIhi sub-set. Finally, dendritic cells exert their adjuvanticity by acting as both antigen donor cells (i.e., antigen reservoirs) as well as antigen presenting cells.
Subjects
CD11c Antigen/immunology, Dendritic Cells/immunology, Dendritic Cells/transplantation, Fibrosarcoma/therapy, Granulocyte-Macrophage Colony-Stimulating Factor/pharmacology, Histocompatibility Antigens Class II/immunology, Adoptive Immunotherapy/methods, Animals, Neoplasm Antigens/immunology, Bone Marrow Cells/drug effects, Bone Marrow Cells/immunology, Dendritic Cells/drug effects, Epitopes/immunology, Female, Fibrosarcoma/immunology, Mice, Inbred BALB C Mice, Inbred C57BL Mice, T-Lymphocytes/immunology
ABSTRACT
SUMMARY: This note presents IsoEM2 and IsoDE2, new versions with enhanced features and faster runtime of the IsoEM and IsoDE packages for expression level estimation and differential expression. IsoEM2 estimates fragments per kilobase of transcript per million mapped fragments (FPKM) and transcripts per million (TPM) levels for genes and isoforms with confidence intervals through bootstrapping, while IsoDE2 performs differential expression analysis using the bootstrap samples generated by IsoEM2. Both tools are available with a command line interface as well as a graphical user interface (GUI) through wrappers for the Galaxy platform. AVAILABILITY AND IMPLEMENTATION: The source code of this software suite is available at https://github.com/mandricigor/isoem2. The Galaxy wrappers are available at https://toolshed.g2.bx.psu.edu/view/saharlcc/isoem2_isode2/. CONTACT: imandric1@student.gsu.edu or ion@engr.uconn.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
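For readers unfamiliar with the two expression units and with bootstrap confidence intervals, the following sketch shows the standard definitions. It is not IsoEM2 code: IsoEM2 derives fragment assignments with an expectation-maximization algorithm, whereas here the per-transcript fragment counts are simply given, and the function names are assumptions.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths)]

def tpm(counts, lengths):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths)]
    denom = sum(rates)
    return [r / denom * 1e6 for r in rates]

def percentile_interval(bootstrap_estimates, level=0.95):
    """Percentile confidence interval from a list of bootstrap estimates."""
    ranked = sorted(bootstrap_estimates)
    last = len(ranked) - 1
    lower = round((1 - level) / 2 * last)
    upper = round((1 + level) / 2 * last)
    return ranked[lower], ranked[upper]

counts = [100, 300]    # fragments assigned to two transcripts
lengths = [1000, 3000]  # transcript lengths in bases
fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
interval = percentile_interval(list(range(201)))
```

The two transcripts have the same fragment density, so they receive equal FPKM and equal TPM even though their raw counts differ threefold. The interval function would be applied to the expression estimates obtained from each bootstrap resample of the reads.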
Subjects
Computational Biology/methods, Confidence Intervals, Gene Expression Profiling/methods, RNA Sequence Analysis/methods, Software
ABSTRACT
BACKGROUND: For fighting cancer, earlier detection is crucial. Circulating auto-antibodies produced by the patient's own immune system after exposure to cancer proteins are promising bio-markers for the early detection of cancer. Since an antibody recognizes not the whole antigen but 4-7 critical amino acids within the antigenic determinant (epitope), the whole proteome can be represented by a random peptide phage display library. This opens the possibility of developing an early cancer detection test based on a set of peptide sequences identified by comparing cancer patients' and healthy donors' global peptide profiles of antibody specificities. RESULTS: Because of the enormous number of peptide sequences contained in global peptide profiles generated by next-generation sequencing, a large number of cancer and control sera is required to identify cancer-specific peptides with a high degree of statistical significance. To decrease the number of peptides in profiles generated by next-generation sequencing without losing cancer-specific sequences, we generated the profiles from a phage library enriched by panning on a pool of cancer sera. To further decrease the complexity of the profiles, we used computational methods to transform the list of peptides constituting the mimotope profiles into a list of motifs formed by similar peptide sequences. CONCLUSION: We have shown that the amino-acid order is meaningful in mimotope motifs, since they contain significantly more peptides than motifs among peptides in which the amino acids are randomly permuted. Also, the single-sample motifs differ significantly from motifs in peptides drawn from multiple samples. Finally, multiple cancer-specific motifs have been identified.
Subjects
Autoantibodies, Tumor Biomarkers/blood, Epitopes, Neoplasms, Autoantibodies/chemistry, Autoantibodies/immunology, Computational Biology, Early Detection of Cancer, Epitopes/chemistry, Epitopes/immunology, Humans, Neoplasms/blood, Neoplasms/chemistry, Neoplasms/diagnosis, Neoplasms/immunology, Peptide Library
ABSTRACT
BACKGROUND: The retina as a model system with extensive information on genes involved in development/maintenance is of great value for investigations employing deep sequencing to capture transcriptome change over time. This in turn could enable us to find patterns in gene expression across time to reveal transitions in biological processes. METHODS: We developed a bioinformatics pipeline to categorize genes based on their differential expression and their alternative splicing status across time by binning genes based on their transcriptional kinetics. Genes within the same bins were then leveraged to query gene annotation databases to discover molecular programs employed by the developing retina. RESULTS: Using our pipeline on RNA-Seq data obtained from fractionated (nucleus/cytoplasm) developing retina at embryonic day (E) 16 and postnatal day (P) 0, we captured high-resolution differences between the cytoplasm and the nucleus at the same developmental time. We found de novo transcription of genes whose transcripts were exclusively found in the nuclear transcriptome at P0. Further analysis showed that these genes were enriched for functions that are known to be executed during postnatal development, thus showing that the P0 nuclear transcriptome is temporally ahead of that of its cytoplasm. We extended our strategy to perform temporal analysis comparing P0 data to either P21-Nrl-wildtype (WT) or P21-Nrl-knockout (KO) retinae, which predicted that the KO retina would have compromised vasculature. Indeed, histological manifestation of vasodilation has been reported at a later time point (P60). CONCLUSIONS: Thus, our approach was predictive of a phenotype before it presented histologically. Our strategy can be extended to investigating the development and/or disease progression of other tissue types.
Subjects
Retina/metabolism, Transcriptome, Alternative Splicing, Animals, Computational Biology, Disease Progression, Gene Expression Profiling, Kinetics, Mice, Knockout Mice, Retina/abnormalities, Retina/embryology, RNA Sequence Analysis, Spatio-Temporal Analysis
ABSTRACT
BACKGROUND: Assessing pathway activity levels is a plausible way to quantify metabolic differences between various conditions. This is usually inferred from microarray expression data. Wide availability of NGS technology has triggered a demand for bioinformatics tools capable of analyzing pathway activity directly from RNA-Seq data. In this paper we introduce XPathway, a set of tools that compares pathway activity by analyzing the mapping of contigs assembled from RNA-Seq reads to KEGG pathways. The XPathway analysis of pathway activity is based on expectation maximization and topological properties of pathway graphs. RESULTS: XPathway tools have been applied to RNA-Seq data from the marine bryozoan Bugula neritina with and without its symbiotic bacterium "Candidatus Endobugula sertula". We successfully identified several metabolic pathways with differential activity levels. The expression of enzymes from the identified pathways has been further validated through quantitative PCR (qPCR). CONCLUSIONS: Our results show that XPathway is able to detect and quantify the metabolic difference between two samples. The software is implemented in C, Python and shell scripting and is capable of running on Linux/Unix platforms. The source code and installation instructions are available at http://alan.cs.gsu.edu/NGS/?q=content/xpathway .
Subjects
Metabolic Networks and Pathways, Transcriptome, Animals, Bryozoa/genetics, Bryozoa/metabolism, Computational Biology, RNA Sequence Analysis, Software, Symbiosis
ABSTRACT
MOTIVATION: Next-generation sequencing (NGS) allows for analyzing a large number of viral sequences from infected patients, providing an opportunity to implement large-scale molecular surveillance of viral diseases. However, despite improvements in technology, traditional protocols for NGS of large numbers of samples are still highly cost- and labor-intensive. One possible cost-effective alternative is combinatorial pooling. Although a number of pooling strategies for consensus sequencing of DNA samples and detection of SNPs have been proposed, these strategies cannot be applied to sequencing of highly heterogeneous viral populations. RESULTS: We developed a cost-effective and reliable protocol for sequencing of viral samples that combines NGS using barcoding and combinatorial pooling with a computational framework including algorithms for optimal virus-specific pool design and deconvolution of individual samples from sequenced pools. Evaluation of the framework on experimental and simulated data for hepatitis C virus showed that it substantially reduces the sequencing costs and allows deconvolution of viral populations with high accuracy. AVAILABILITY AND IMPLEMENTATION: The source code and experimental data sets are available at http://alan.cs.gsu.edu/NGS/?q=content/pooling.
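The general principle behind combinatorial pooling and deconvolution can be shown with a minimal sketch. It is a generic textbook design, not the virus-specific pool design from the paper: each sample is assigned a distinct subset of pools, and a sequence is attributed to the sample whose subset matches the set of pools in which the sequence was observed. The function names and parameters are assumptions, and the sketch ignores the case of variants shared between samples, which is exactly what makes heterogeneous viral populations hard.

```python
from itertools import combinations

def design_pools(n_samples, n_pools, pools_per_sample):
    """Assign each sample a distinct subset of pools (its signature)."""
    signatures = list(combinations(range(n_pools), pools_per_sample))
    if len(signatures) < n_samples:
        raise ValueError("not enough distinct pool subsets for all samples")
    return {sample: frozenset(signatures[sample]) for sample in range(n_samples)}

def deconvolve(design, pools_with_variant):
    """Return the samples whose signature equals the observed set of pools."""
    observed = frozenset(pools_with_variant)
    return [sample for sample, signature in design.items() if signature == observed]

# Six samples are sequenced in four pools instead of six separate runs.
design = design_pools(n_samples=6, n_pools=4, pools_per_sample=2)
origin = deconvolve(design, [2, 1])
```

With four pools and two pools per sample there are six distinct signatures, so six samples can be resolved from four sequencing runs. The saving grows quickly with the number of pools.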
Subjects
Algorithms, Computational Biology/methods, Viral DNA/genetics, Viral Genome, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Viruses/classification, Viruses/genetics, Genetic Variation, Hepacivirus/classification, Hepacivirus/genetics, Humans
ABSTRACT
Metabolic pathways are composed of a series of chemical reactions occurring within a cell. In each pathway, enzymes catalyze the conversion of substrates into structurally similar products. Thus, structural similarity provides a potential means for mapping newly identified biochemical compounds to known metabolic pathways. In this paper, we present TrackSM, a cheminformatics tool designed to associate a chemical compound to a known metabolic pathway based on molecular structure matching techniques. Validation experiments show that TrackSM is capable of associating 93% of tested structures to their correct KEGG pathway class and 88% to their correct individual KEGG pathway. This suggests that TrackSM may be a valuable tool to aid in associating previously unknown small molecules to known biochemical pathways and improve our ability to link metabolomics, proteomic, and genomic data sets. TrackSM is freely available at http://metabolomics.pharm.uconn.edu/?q=Software.html .
Subjects
Algorithms, Metabolic Networks and Pathways, Metabolomics/methods, Molecular Structure, Reproducibility of Results, Software
ABSTRACT
BACKGROUND: Interest in de novo genome assembly has been renewed in the past decade due to rapid advances in high-throughput sequencing (HTS) technologies, which generate relatively short reads resulting in highly fragmented assemblies consisting of contigs. Additional long-range linkage information is typically used to orient, order, and link contigs into larger structures referred to as scaffolds. Due to library preparation artifacts and erroneous mapping of reads originating from repeats, scaffolding remains a challenging problem. In this paper, we provide a scalable scaffolding algorithm (SILP2) employing a maximum likelihood model capturing read mapping uncertainty and/or non-uniformity of contig coverage, which is solved using integer linear programming. A Non-Serial Dynamic Programming (NSDP) paradigm is applied to render our algorithm useful in the processing of larger mammalian genomes. To compare scaffolding tools, we employ novel quantitative metrics in addition to the extant metrics in the field. We have also expanded the set of experiments to include scaffolding of low-complexity metagenomic samples. RESULTS: SILP2 achieves better scalability through a more efficient NSDP algorithm than the previous release of SILP. The results show that SILP2 compares favorably to previous methods OPERA and MIP in both scalability and accuracy for scaffolding single genomes of up to human size, and significantly outperforms them on scaffolding low-complexity metagenomic samples. CONCLUSIONS: Equipped with NSDP, SILP2 is able to scaffold large mammalian genomes, resulting in the longest and most accurate scaffolds. The ILP formulation for the maximum likelihood model is shown to be flexible enough to handle metagenomic samples.
Subjects
Genome, Genomics/methods, Likelihood Functions, DNA Sequence Analysis/methods, Algorithms, Animals, High-Throughput Nucleotide Sequencing/methods, Humans, Metagenomics/methods, Probability, Linear Programming
ABSTRACT
BACKGROUND: There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algorithms to use with that biomarker set, can be a daunting task. Existing surveys of feature selection and classification algorithms typically focus on a single data type, such as gene expression microarrays, and rarely explore the model's performance across multiple biological data types. RESULTS: This paper presents the results of a large scale empirical study whereby a large number of popular feature selection and classification algorithms are used to identify the tissue of origin for the NCI-60 cancer cell lines. A computational pipeline was implemented to maximize predictive accuracy of all models at all parameters on five different data types available for the NCI-60 cell lines. A validation experiment was conducted using external data in order to demonstrate robustness. CONCLUSIONS: As expected, the data type and number of biomarkers have a significant effect on the performance of the predictive models. Although no model or data type uniformly outperforms the others across the entire range of tested numbers of markers, several clear trends are visible. At low numbers of biomarkers gene and protein expression data types are able to differentiate between cancer cell lines significantly better than the other three data types, namely SNP, array comparative genome hybridization (aCGH), and microRNA data.
Subjects
Algorithms, Tumor Biomarkers/analysis, Computational Biology/methods, Factual Databases, Biological Models, Neoplasms/classification, DNA Copy Number Variations/genetics, Data Mining, Humans, MicroRNAs/genetics, Neoplasms/genetics, Neoplasms/metabolism, Oligonucleotide Array Sequence Analysis/methods, Single Nucleotide Polymorphism/genetics, Proteins/metabolism, Messenger RNA/genetics, Cultured Tumor Cells
ABSTRACT
A major application of RNA-Seq is to perform differential gene expression analysis. Many tools exist to analyze differentially expressed genes in the presence of biological replicates. Frequently, however, RNA-Seq experiments have no or very few biological replicates and development of methods for detecting differentially expressed genes in these scenarios is still an active research area. In this paper we introduce a novel method, called IsoDE, for differential gene expression analysis based on bootstrapping. We compared IsoDE against four existing methods (Fisher's exact test, GFOLD, edgeR and Cuffdiff) on RNA-Seq datasets generated using three different sequencing technologies, both with and without replicates. Experiments on MAQC RNA-Seq datasets without replicates show that IsoDE has consistently high accuracy as defined by the qPCR ground truth, frequently higher than that of the compared methods, particularly for low coverage data and at lower fold change thresholds. In experiments on RNA-Seq datasets with up to 7 replicates, IsoDE has also achieved high accuracy. Furthermore, unlike GFOLD and edgeR, IsoDE accuracy varies smoothly with the number of replicates, and is relatively uniform across the entire range of gene expression levels. The proposed non-parametric method based on bootstrapping has practical running time, and achieves robust performance over a broad range of technologies, number of replicates, sequencing depths, and minimum fold change thresholds.
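A bootstrap-based differential expression call of the kind described here can be sketched in a few lines. This is an illustration under stated assumptions rather than IsoDE's actual procedure: it compares every pair of bootstrap expression estimates from the two conditions and calls a gene differentially expressed when a large enough fraction of pairs exceeds the minimum fold change. The function names, the all-pairs comparison, the pseudocount, and the 95% support level are assumptions.

```python
def bootstrap_support(estimates_a, estimates_b, min_fold_change=2.0, pseudocount=1e-9):
    """Fraction of bootstrap pairs in which b exceeds a by the minimum fold change."""
    pairs = [(a, b) for a in estimates_a for b in estimates_b]
    hits = sum(
        1 for a, b in pairs
        if (b + pseudocount) / (a + pseudocount) >= min_fold_change
    )
    return hits / len(pairs)

def is_differentially_expressed(estimates_a, estimates_b,
                                min_fold_change=2.0, min_support=0.95):
    """Call a gene when either direction of change has enough bootstrap support."""
    up = bootstrap_support(estimates_a, estimates_b, min_fold_change)
    down = bootstrap_support(estimates_b, estimates_a, min_fold_change)
    return max(up, down) >= min_support

# Hypothetical bootstrap FPKM estimates for one gene in two conditions.
condition_a = [10.0, 11.0, 9.0, 10.5]
condition_b = [25.0, 27.0, 24.0, 26.0]
called = is_differentially_expressed(condition_a, condition_b)
```

Because the decision rests on resampled estimates from each sample rather than on variance across replicates, the same rule applies with or without biological replicates, which matches the use case the abstract emphasizes.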
Subjects
Genetic Databases, Gene Expression Profiling/methods, RNA Sequence Analysis/methods, Computational Biology, Software
ABSTRACT
BACKGROUND: Highly mutable RNA viruses exist in infected hosts as heterogeneous populations of genetically close variants known as quasispecies. Next-generation sequencing (NGS) allows for analysing a large number of viral sequences from infected patients, presenting a novel opportunity for studying the structure of a viral population and understanding virus evolution, drug resistance and immune escape. Accurate reconstruction of the genetic composition of intra-host viral populations involves assembling the NGS short reads into whole-genome sequences and estimating frequencies of individual viral variants. Although a few approaches were developed for this task, accurate reconstruction of quasispecies populations remains largely unresolved. RESULTS: Two new methods, AmpMCF and ShotMCF, for reconstruction of the whole-genome intra-host viral variants and estimation of their frequencies were developed, based on Multicommodity Flows (MCFs). AmpMCF was designed for NGS reads obtained from individual PCR amplicons and ShotMCF for NGS shotgun reads. While AmpMCF, based on a covering formulation, identifies a minimal set of quasispecies explaining all observed reads, ShotMCF, based on a packing formulation, engages the maximal number of reads to generate the most probable set of quasispecies. Both methods were evaluated on simulated data in comparison to Maximum Bandwidth and ViSpA, previously developed state-of-the-art algorithms for estimating quasispecies spectra from NGS amplicon and shotgun reads, respectively. Both algorithms were accurate in estimation of quasispecies frequencies, especially from large datasets. CONCLUSIONS: The problem of viral population reconstruction from amplicon or shotgun NGS reads was solved using the MCF formulation. The two methods, ShotMCF and AmpMCF, developed here afford accurate reconstruction of the structure of intra-host viral populations from NGS reads. The implementations of the algorithms are available at http://alan.cs.gsu.edu/vira.html (AmpMCF) and http://alan.cs.gsu.edu/NGS/?q=content/shotmcf (ShotMCF).
Subjects
Algorithms, Genetic Variation, Viral Genome, RNA Viruses/genetics, RNA Sequence Analysis/methods, Hepacivirus/classification, Hepacivirus/genetics, RNA Viruses/classification
ABSTRACT
Current methods of structure identification in mass-spectrometry-based nontargeted metabolomics rely on matching experimentally determined features of an unknown compound to those of candidate compounds contained in biochemical databases. A major limitation of this approach is the relatively small number of compounds currently included in these databases. If the correct structure is not present in a database, it cannot be identified, and if it cannot be identified, it cannot be included in a database. Thus, there is an urgent need to augment metabolomics databases with rationally designed biochemical structures using alternative means. Here we present the In Vivo/In Silico Metabolites Database (IIMDB), a database of in silico enzymatically synthesized metabolites, to partially address this problem. The database, which is available at http://metabolomics.pharm.uconn.edu/iimdb/, includes ~23,000 known compounds (mammalian metabolites, drugs, secondary plant metabolites, and glycerophospholipids) collected from existing biochemical databases plus more than 400,000 computationally generated human phase-I and phase-II metabolites of these known compounds. IIMDB features a user-friendly web interface and a programmer-friendly RESTful web service. Ninety-five percent of the computationally generated metabolites in IIMDB were not found in any existing database. However, 21,640 were identical to compounds already listed in PubChem, HMDB, KEGG, or HumanCyc. Furthermore, the vast majority of these in silico metabolites were scored as biological using BioSM, a software program that identifies biochemical structures in chemical structure space. These results suggest that in silico biochemical synthesis represents a viable approach for significantly augmenting biochemical databases for nontargeted metabolomics applications.
Subjects
Factual Databases, Enzymes/metabolism, Metabolomics/methods, Animals, Glycerophospholipids/metabolism, Humans, Internet, Pharmaceutical Preparations/metabolism, Plants/metabolism, User-Computer Interface
ABSTRACT
The structural identification of unknown biochemical compounds in complex biofluids continues to be a major challenge in metabolomics research. Using LC/MS, there are currently two major options for solving this problem: searching small biochemical databases, which often do not contain the unknown of interest, or searching large chemical databases, which include large numbers of nonbiochemical compounds. Searching larger chemical databases (larger chemical space) increases the odds of identifying an unknown biochemical compound, but only if nonbiochemical structures can be eliminated from consideration. In this paper we present BioSM, a cheminformatics tool that uses known endogenous mammalian biochemical compounds (as scaffolds) and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSM identifies endogenous mammalian biochemical structures with high accuracy. In a leave-one-out cross validation experiment, BioSM correctly predicted 95% of 1388 Kyoto Encyclopedia of Genes and Genomes (KEGG) compounds as endogenous mammalian biochemicals using 1565 scaffolds. Analysis of two additional biological data sets containing 2330 human metabolites (HMDB) and 2416 plant secondary metabolites (KEGG) resulted in biochemical annotations of 89% and 72% of the compounds, respectively. When a data set of 3895 drugs (DrugBank and USAN) was tested, 48% of these structures were predicted to be biochemical. However, when a set of synthetic chemical compounds (Chembridge and Chemsynthesis databases) was examined, only 29% of the 458,207 structures were predicted to be biochemical. Moreover, BioSM predicted that 34% of 883,199 randomly selected compounds from PubChem were biochemical. We then expanded the scaffold list to 3927 biochemical compounds and reevaluated the above data sets to determine whether scaffold number influenced model performance. Although there were significant improvements in model sensitivity and specificity using the larger scaffold list, the data set comparison results were very similar. These results suggest that additional biochemical scaffolds will not further improve our representation of biochemical structure space and that the model is reasonably robust. BioSM provides a qualitative (yes/no) and quantitative (ranking) method for endogenous mammalian biochemical annotation of chemical space and, thus, will be useful in the identification of unknown biochemical structures in metabolomics. BioSM is freely available at http://metabolomics.pharm.uconn.edu.
Subjects
Mammals/metabolism, Metabolomics/methods, Algorithms, Animals, Artificial Intelligence, Body Fluids/chemistry, Cytochromes, Protein Databases, Humans, Chemical Models, Molecular Models, Reproducibility of Results, Small Molecule Libraries
ABSTRACT
High-throughput DNA and RNA sequencing are revolutionizing precision oncology, enabling personalized therapies such as cancer vaccines designed to target tumor-specific neoepitopes generated by somatic mutations expressed in cancer cells. Identification of these neoepitopes from next-generation sequencing data of clinical samples remains challenging and requires the use of complex bioinformatics pipelines. In this paper, we present GeNeo, a bioinformatics toolbox for genomics-guided neoepitope prediction. GeNeo includes a comprehensive set of tools for somatic variant calling and filtering, variant validation, and neoepitope prediction and filtering. For ease of use, GeNeo tools can be accessed via web-based interfaces deployed on a Galaxy portal publicly accessible at https://neo.engr.uconn.edu/. A virtual machine image for running GeNeo locally is also available to academic users upon request.
Subjects
Neoplasms, Humans, Neoplasms/genetics, Neoplasms/therapy, Precision Medicine, Genomics/methods, Computational Biology, Immunotherapy, High-Throughput Nucleotide Sequencing
ABSTRACT
BACKGROUND: Massively parallel transcriptome sequencing (RNA-Seq) is becoming the method of choice for studying functional effects of genetic variability and establishing causal relationships between genetic variants and disease. However, RNA-Seq poses new technical and computational challenges compared to genome sequencing. In particular, mapping transcriptome reads onto the genome is more challenging than mapping genomic reads due to splicing. Furthermore, detection and genotyping of single nucleotide variants (SNVs) requires statistical models that are robust to variability in read coverage due to unequal transcript expression levels. RESULTS: In this paper we present a strategy to more reliably map transcriptome reads by taking advantage of the availability of both the genome reference sequence and transcript databases such as CCDS. We also present a novel Bayesian model for SNV discovery and genotyping based on quality scores. CONCLUSIONS: Experimental results on RNA-Seq data generated from blood cell tissue of three HapMap individuals show that our methods yield increased accuracy compared to several widely used methods. The open source code implementing our methods, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/NGSTools/.
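To make the phrase "Bayesian model for SNV discovery and genotyping based on quality scores" concrete, here is a minimal sketch of the standard diploid genotype posterior computed from base calls and Phred quality scores. It is not the model from the paper: it assumes a flat prior over the ten unordered genotypes, equal expression of both alleles, and errors spread evenly over the three other bases. The function name and the toy pileup are likewise assumptions.

```python
import math
from itertools import combinations_with_replacement

def genotype_posteriors(pileup, priors=None):
    """pileup: list of (base, phred_quality) pairs observed at one site."""
    genotypes = list(combinations_with_replacement("ACGT", 2))
    if priors is None:
        # Flat prior over the ten unordered diploid genotypes (assumption).
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    log_post = {}
    for g in genotypes:
        total = math.log(priors[g])
        for base, qual in pileup:
            err = 10 ** (-qual / 10)  # Phred score to error probability
            per_allele = [(1 - err) if base == allele else err / 3 for allele in g]
            # Each read is assumed to come from either allele with probability 1/2.
            total += math.log(0.5 * per_allele[0] + 0.5 * per_allele[1])
        log_post[g] = total
    # Normalize in log space to avoid underflow at high coverage.
    peak = max(log_post.values())
    weights = {g: math.exp(v - peak) for g, v in log_post.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}

pileup = [("A", 30)] * 6 + [("G", 30)] * 5
posteriors = genotype_posteriors(pileup)
best = max(posteriors, key=posteriors.get)
```

In RNA-Seq data the two alleles of a heterozygous site may be expressed unequally, so the 50/50 mixing assumption used here is precisely the kind of simplification a transcriptome-aware model has to relax.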
Subjects
Genotyping Techniques/methods, Single Nucleotide Polymorphism, RNA Sequence Analysis/methods, Transcriptome, Bayes Theorem, High-Throughput Nucleotide Sequencing, Humans, Software
ABSTRACT
The inference of disease transmission networks is an important problem in epidemiology. One popular approach for building transmission networks is to reconstruct a phylogenetic tree using sequences from disease strains sampled from infected hosts and infer transmissions based on this tree. However, most existing phylogenetic approaches for transmission network inference are highly computationally intensive and cannot take within-host strain diversity into account. Here, we introduce a new phylogenetic approach for inferring transmission networks, TNet, that addresses these limitations. TNet uses multiple strain sequences from each sampled host to infer transmissions and is simpler and more accurate than existing approaches. Furthermore, TNet is highly scalable and able to distinguish between ambiguous and unambiguous transmission inferences. We evaluated TNet on a large collection of 560 simulated transmission networks of various sizes and diverse host, sequence, and transmission characteristics, as well as on 10 real transmission datasets with known transmission histories. Our results show that TNet outperforms two other recently developed methods, phyloscanner and SharpTNI, that also consider within-host strain diversity. We also applied TNet to a large collection of SARS-CoV-2 genomes sampled from infected individuals in many countries around the world, demonstrating how our inference framework can be adapted to accurately infer geographical transmission networks. TNet is freely available from https://compbio.engr.uconn.edu/software/TNet/.