ABSTRACT
One of the aims of population genetics is to identify genetic differences/similarities among individuals of multiple ancestries. Many approaches including principal component analysis, clustering, and maximum likelihood techniques can be used to assign individuals to a given ancestry based on their genetic makeup. Although there are several tools that implement such algorithms, there is a lack of interactive visual platforms to run a variety of algorithms in one place. Therefore, we developed PopMLvis, a platform that offers an interactive environment to visualize genetic similarity data using several algorithms, and generate figures that can be easily integrated into scientific articles.
Subjects
Algorithms , Population Genetics , Genome-Wide Association Study , Genotype , Software , Genome-Wide Association Study/methods , Population Genetics/methods , Humans , Principal Component Analysis
ABSTRACT
Stochastic epigenetic mutations (SEMs) have been proposed as novel aging biomarkers to capture heterogeneity in age-related DNA methylation changes. SEMs are defined as outlier methylation patterns at cytosine-guanine dinucleotide sites, categorized as hypermethylated (hyperSEM) or hypomethylated (hypoSEM) relative to a reference. Because SEMs are defined by their outlier status, it is critical to differentiate extreme values due to technical noise or data artifacts from those due to real biology. Using technical replicate data, we found SEM detection is not reliable: across 3 datasets, 24 to 39% of hypoSEMs and 46 to 67% of hyperSEMs are not shared between replicates. We identified factors influencing SEM reliability, including blood cell type composition, probe beta-value statistics, genomic location, and presence of SNPs. We used these factors in a training dataset to build a machine learning-based filter that removes unreliable SEMs, and found this filter enhances reliability in two independent validation datasets. We assessed associations between SEM loads and aging phenotypes in the Framingham Heart Study and discovered that associations with aging outcomes were in large part driven by hypoSEMs at baseline methylated probes and hyperSEMs at baseline unmethylated probes, which are the same subsets that demonstrate highest technical reliability. These aging associations were preserved after filtering out unreliable SEMs and were enhanced after adjusting for blood cell composition. Finally, we utilized these insights to formulate best practices for SEM detection and introduce a novel R package, SEMdetectR, which uses parallel programming for efficient SEM detection with comprehensive options for detection, filtering, and analysis.
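The outlier-based definition of SEMs can be illustrated with a minimal Python sketch. The abstract does not state the cutoff, so the sketch assumes the interquartile-range rule that is often used for SEM calling (a beta value above Q3 + k×IQR is a hyperSEM, one below Q1 − k×IQR is a hypoSEM, with k = 3). The function `call_sems`, the parameter `k`, and the toy beta values are hypothetical and are not part of the SEMdetectR interface.

```python
from statistics import quantiles


def call_sems(betas, k=3.0):
    """Label each sample's beta value at one CpG probe.

    Assumed rule (not taken from the abstract): values above Q3 + k*IQR are
    'hyperSEM', values below Q1 - k*IQR are 'hypoSEM', everything else is None.
    """
    # Quartiles of the probe's beta values across all samples.
    q1, _, q3 = quantiles(betas, n=4, method="inclusive")
    iqr = q3 - q1
    upper = q3 + k * iqr
    lower = q1 - k * iqr

    calls = []
    for beta in betas:
        if beta > upper:
            calls.append("hyperSEM")
        elif beta < lower:
            calls.append("hypoSEM")
        else:
            calls.append(None)
    return calls


# Toy probe: mostly unmethylated across samples, one sample strongly methylated.
probe = [0.10, 0.11, 0.12, 0.10, 0.09, 0.11, 0.12, 0.10, 0.11, 0.65]
print(call_sems(probe))
```

Because a single extreme value is enough to produce a call, a rule of this kind is sensitive to technical noise, which is the reliability concern the abstract raises.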
Subjects
Aging , DNA Methylation , Epigenesis, Genetic , Humans , DNA Methylation/genetics , Aging/genetics , Aging/physiology , Epigenesis, Genetic/genetics , Female , Male , Reproducibility of Results , Aged , Cardiovascular Diseases/genetics , Middle Aged , Mutation , Stochastic Processes , Machine Learning
ABSTRACT
BACKGROUND: Aging is a phenomenon which occurs over time and leads to the decay of living organisms. During the progression of aging, some age-associated diseases including cardiovascular disease, cancers, and neurological, mental, and physical disorders could develop. Genetic and epigenetic factors like microRNAs, as one of the post-transcriptional regulators of genes, play important roles in senescence. The self-renewal and differentiation capacity of mesenchymal stem cells (MSCs) makes them good candidates for regenerative medicine. OBJECTIVE: The objective of this study is to evaluate senescence-related miRNAs in human MSCs using bioinformatics analysis. METHODS: In this study, the Gene Expression Omnibus (GEO) database was used to investigate the senescence-related genome profile. Then, down-regulated genes were selected for further bioinformatics analysis with the assumption that their decreased expression is associated with an increased aging process. Considering that miRNAs can interfere with gene expression, miRNAs complementary to these genes were identified using bioinformatics software. RESULTS: Through bioinformatics analysis, we predicted hsa-miR-590-3p, hsa-miR-10b-3p, the hsa-miR-548 family, hsa-miR-144-3p, and hsa-miR-30b-5p, which are involved in cellular senescence and the aging of human MSCs. CONCLUSION: miRNA mimics or anti-miRNA agents have the potential to be used as anti-aging tools for MSCs.
ABSTRACT
CRISPR/Cas9 stands as a revolutionary and versatile gene editing technology. At its core, the Cas9 DNA endonuclease is guided with precision by a specifically designed single-guide RNA (gRNA). This guidance system facilitates the introduction of double-stranded breaks (DSBs) within the DNA. Subsequent imprecise repairs, mainly through the non-homologous end-joining (NHEJ) pathway, yield insertions or deletions, resulting in frameshift mutations. These mutations are instrumental in achieving the successful knockout of the target gene. In this chapter, we describe all the steps necessary to design and create a gRNA for knocking out a target gene before transferring it into a target plant.
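The first computational step of gRNA design, locating candidate target sites, can be sketched as follows. This is a minimal illustration and not the chapter's protocol: it assumes the commonly used Streptococcus pyogenes Cas9, which requires an NGG PAM immediately 3' of a 20-nt protospacer, and the function names and the example sequence are hypothetical. Real design tools also score off-target risk, GC content, and position within the coding sequence, none of which is shown here.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_grna_candidates(seq, spacer_len=20):
    """List candidate protospacers that are followed by an NGG PAM.

    Returns (strand, start, spacer, pam) tuples. `start` is the 0-based offset
    within the scanned strand; for '-' that is the reverse complement.
    Assumes SpCas9 (NGG PAM), which the abstract does not specify.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % spacer_len)
    hits = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for match in pattern.finditer(strand_seq):
            hits.append((strand, match.start(), match.group(1), match.group(2)))
    return hits


# Hypothetical 25-nt fragment with a single AGG PAM on the forward strand.
for hit in find_grna_candidates("ATGCATGCATGCATGCATGCAGGTT"):
    print(hit)
```

Scanning both strands matters because Cas9 can target either one; a knockout design would normally keep only the candidates that fall early in the coding sequence.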
Subjects
CRISPR-Cas Systems , Gene Editing , Gene Knockout Techniques , RNA, Guide, CRISPR-Cas Systems , RNA, Guide, CRISPR-Cas Systems/genetics , Gene Knockout Techniques/methods , Gene Editing/methods , Computer Simulation , DNA End-Joining Repair/genetics
ABSTRACT
Here, we present AtacAnnoR, a two-round annotation method for scATAC-seq data using well-annotated scRNA-seq data as reference. We evaluate AtacAnnoR's performance against six competing methods on 11 benchmark datasets. Our results show that AtacAnnoR achieves the highest mean accuracy and the highest mean balanced accuracy and performs particularly well when unpaired scRNA-seq data are used as the reference. Furthermore, AtacAnnoR implements a 'Combine and Discard' strategy to further improve annotation accuracy when annotations of multiple references are available. AtacAnnoR has been implemented in an R package and can be directly integrated into currently popular scATAC-seq analysis pipelines.
Subjects
Chromatin Immunoprecipitation Sequencing , Single-Cell Analysis , Chromatin Immunoprecipitation Sequencing/methods , Single-Cell Analysis/methods , Benchmarking , Agriculture , Exome Sequencing , RNA Sequence Analysis/methods
ABSTRACT
MOTIVATION: Compositional heterogeneity, which arises when the proportions of nucleotides and amino acids are not broadly similar across the dataset, is a cause of a great number of phylogenetic artefacts. Whilst a variety of methods can identify it post-hoc, few metrics exist to quantify compositional heterogeneity prior to the computationally intensive task of phylogenetic tree reconstruction. Here we assess the efficacy of one such existing, widely used metric, Relative Composition Frequency Variability (RCFV), using both real and simulated data. RESULTS: Our results show that RCFV can be biased by sequence length, the number of taxa, and the number of possible character states within the dataset. However, we also find that missing data does not appear to have an appreciable effect on RCFV. We discuss the theory behind this and its consequences for future use of the RCFV value, and propose a new metric, nRCFV, which accounts for these biases. Alongside this, we present new software that calculates both RCFV and nRCFV, called nRCFV_Reader. AVAILABILITY AND IMPLEMENTATION: nRCFV has been implemented in RCFV_Reader, available at: https://github.com/JFFleming/RCFV_Reader . Both our simulation and real data are available at Datadryad: https://doi.org/10.5061/dryad.wpzgmsbpn .
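For orientation, the unnormalized RCFV metric can be sketched in a few lines of Python. The abstract does not give the formula, so the sketch assumes the definition commonly given in the literature: for each character state, the absolute deviation of every taxon's frequency from the mean frequency across taxa is summed, and the total is divided by the number of taxa. The function `rcfv`, its arguments, and the toy alignments are hypothetical and are not the RCFV_Reader interface. The nRCFV normalization proposed by the authors is not reproduced.

```python
from collections import Counter


def rcfv(alignment, alphabet="ACGT"):
    """Compute RCFV for an alignment given as {taxon: sequence}.

    Assumed definition: sum over character states c and taxa t of
    |freq(c, t) - mean_freq(c)| / n_taxa. Characters outside `alphabet`
    (gaps, ambiguity codes) are ignored. Every taxon is assumed to contain
    at least one character from `alphabet`.
    """
    n_taxa = len(alignment)

    # Per-taxon relative frequency of each character state.
    freqs = []
    for seq in alignment.values():
        counts = Counter(ch for ch in seq.upper() if ch in alphabet)
        total = sum(counts.values())
        freqs.append({ch: counts[ch] / total for ch in alphabet})

    value = 0.0
    for ch in alphabet:
        mean_freq = sum(f[ch] for f in freqs) / n_taxa
        value += sum(abs(f[ch] - mean_freq) for f in freqs) / n_taxa
    return value


# Identical composition gives 0; completely different composition gives 1 here.
print(rcfv({"taxon1": "ACGTACGT", "taxon2": "TGCATGCA"}))
print(rcfv({"taxon1": "AAAA", "taxon2": "CCCC"}))
```

The sum runs over every character state, which gives an intuition for why the abstract reports that the number of possible states can bias the raw value.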
Subjects
Amino Acids , Nucleotides , Phylogeny , Software , Computer Simulation
ABSTRACT
Liquid chromatography coupled with bottom-up mass spectrometry (LC-MS/MS)-based proteomics is increasingly used to detect changes in posttranslational modifications (PTMs) in samples from different conditions. Analysis of data from such experiments faces numerous statistical challenges. These include the low abundance of modified proteoforms, the small number of observed peptides that span modification sites, and confounding between changes in the abundance of PTM and the overall changes in the protein abundance. Therefore, statistical approaches for detecting differential PTM abundance must integrate all the available information pertaining to a PTM site and consider all the relevant sources of confounding and variation. In this manuscript, we propose such a statistical framework, which is versatile, accurate, and leads to reproducible results. The framework requires an experimental design, which quantifies, for each sample, both peptides with PTMs and peptides from the same proteins with no modification sites. The proposed framework supports both label-free and tandem mass tag-based LC-MS/MS acquisitions. The statistical methodology separately summarizes the abundances of peptides with and without the modification sites, by fitting separate linear mixed effects models appropriate for the experimental design. Next, model-based inferences regarding the PTM and the protein-level abundances are combined to account for the confounding between these two sources. Evaluations on computer simulations, a spike-in experiment with known ground truth, and three biological experiments with different organisms, modification types, and data acquisition types demonstrate the improved fold change estimation and detection of differential PTM abundance, as compared to currently used approaches. The proposed framework is implemented in the free and open-source R/Bioconductor package MSstatsPTM.
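The final step, in which PTM-level and protein-level inferences are combined, can be illustrated with a small worked sketch. It assumes one simple form of the adjustment: the protein-level log2 fold change is subtracted from the PTM-level log2 fold change, and the two standard errors are combined as if the estimates were independent. The function name and the example numbers are hypothetical, and the sketch does not reproduce the mixed-effects model fitting or the degrees-of-freedom calculation of MSstatsPTM.

```python
import math


def adjust_ptm_fold_change(ptm_log2fc, ptm_se, protein_log2fc, protein_se):
    """Remove the protein-level change from a PTM-site fold change.

    Assumes the two estimates are independent, so their variances add.
    Returns (adjusted log2 fold change, its standard error, test statistic).
    """
    adjusted = ptm_log2fc - protein_log2fc
    se = math.sqrt(ptm_se ** 2 + protein_se ** 2)
    return adjusted, se, adjusted / se


# Hypothetical site: modified peptides rise 1.5 log2 units, but the protein
# itself rises 1.0, so only 0.5 is attributable to the modification.
adjusted, se, statistic = adjust_ptm_fold_change(1.5, 0.3, 1.0, 0.4)
print(round(adjusted, 3), round(se, 3), round(statistic, 3))
```

The example shows why unadjusted analyses can mislead: a site can appear strongly regulated purely because its parent protein changed in abundance.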
Subjects
Proteomics , Tandem Mass Spectrometry , Proteomics/methods , Liquid Chromatography , Post-Translational Protein Processing , Proteins , Peptides/chemistry
ABSTRACT
Existing small noncoding RNA analysis tools are optimized for processing short sequencing reads (17-35 nucleotides) to monitor microRNA expression. However, these strategies under-represent many biologically relevant classes of small noncoding RNAs in the 36-200 nucleotide length range (tRNAs, snoRNAs, etc.). To address this, we developed DANSR, a tool for the detection of annotated and novel small RNAs using sequencing reads with variable lengths (ranging from 17 to 200 nt). While DANSR is broadly applicable to any small RNA dataset, we applied it to a cohort of matched normal, primary, and distant metastatic colorectal cancer specimens to demonstrate its ability to quantify annotated small RNAs, discover novel genes, and calculate differential expression. DANSR is available as an open-source tool.
ABSTRACT
Recently, a likely prion was found in the proteome of Arabidopsis thaliana based on inclusive compositional similarity to known yeast prion-like domains (PrLDs) and gene ontology analysis. A total of 474 proteins in the Arabidopsis thaliana proteome showed significant compositional similarity to known PrLDs in yeast, warranting further analysis. In this chapter, we describe the use and limitations of the PLAAC (Prion-Like Amino Acid Composition) software for the identification of prions, specifically as it has recently been applied to identifying the first prion in plants. Our interest in this method, though presented from a plant-based perspective here, is broad and is primarily in using the method for comparative assessment with novel prion identification algorithms currently under development in our lab. This chapter is not meant to serve as a comprehensive description of the architecture and use of hidden Markov models (HMMs) in prion prediction in general but is intended to serve as a reference for implementation and interpretation of output from PLAAC and its application to plant proteomes.
Subjects
Prions/analysis , Arabidopsis/genetics , Proteome , Saccharomyces cerevisiae , Saccharomyces cerevisiae Proteins
ABSTRACT
metaQuantome is a software suite that enables the quantitative analysis, statistical evaluation, and visualization of mass-spectrometry-based metaproteomics data. In the latest update of this software, we have provided several extensions, including a step-by-step training guide, the ability to perform statistical analysis on samples from multiple conditions, and a comparative analysis of metatranscriptomics data. The training module, accessed via the Galaxy Training Network, will help users to use the suite effectively for both functional and taxonomic analysis. We extend the ability of metaQuantome to now perform multi-data-point quantitative and statistical analyses so that studies with measurements across multiple conditions, such as time-course studies, can be analyzed. With an eye on the multiomics analysis of microbial communities, we have also initiated the use of metaQuantome statistical and visualization tools on outputs from metatranscriptomics data, which complements the metagenomic and metaproteomic analyses already available. For this, we have developed a tool named MT2MQ ("metatranscriptomics to metaQuantome"), which takes in outputs from the ASaiM metatranscriptomics workflow and transforms them so that the data can be used as an input for comparative statistical analysis and visualization via metaQuantome. We believe that these improvements to metaQuantome will facilitate the use of the software for quantitative metaproteomics and metatranscriptomics and will enable multipoint data analysis. These improvements will take us a step toward integrative multiomic microbiome analysis so as to understand dynamic taxonomic and functional responses of these complex systems in a variety of biological contexts. The updated metaQuantome and MT2MQ are open-source software and are available via the Galaxy Toolshed and GitHub.
Subjects
Microbiota , Proteomics , Mass Spectrometry , Metagenomics , Software
ABSTRACT
BACKGROUND: Life scientists routinely face massive and heterogeneous data analysis tasks and must find and access the most suitable databases or software in a jungle of web-accessible resources. The diversity of information used to describe life-scientific digital resources presents an obstacle to their utilization. Although several standardization efforts are emerging, no information schema has been sufficiently detailed to enable uniform semantic and syntactic description, and cataloguing, of bioinformatics resources. FINDINGS: Here we describe biotoolsSchema, a formalized information model that balances the needs of conciseness for rapid adoption against the provision of rich technical information and scientific context. biotoolsSchema results from a series of community-driven workshops and is deployed in the bio.tools registry, providing the scientific community with >17,000 machine-readable and human-understandable descriptions of software and other digital life-science resources. We compare our approach to related initiatives and provide alignments to foster interoperability and reusability. CONCLUSIONS: biotoolsSchema supports the formalized, rigorous, and consistent specification of the syntax and semantics of bioinformatics resources, and enables cataloguing efforts such as bio.tools that help scientists to find, comprehend, and compare resources. The use of biotoolsSchema in bio.tools promotes the FAIRness of research software, a key element of open and reproducible developments for data-intensive sciences.
Subjects
Biological Science Disciplines , Computational Biology , Factual Databases , Humans , Semantics , Software
ABSTRACT
INTRODUCTION: Cellular metabolites are generated by a complex network of biochemical reactions. This makes interpreting changes in metabolites exceptionally challenging. OBJECTIVES: To develop a computational tool that integrates multiomics data at the level of reactions. METHODS: Changes in metabolic reactions are modeled with input from transcriptomics/proteomics measurements of enzymes and metabolomic measurements of metabolites. RESULTS: We developed SUMMER, which identified more relevant signals, key metabolic reactions, and relevant underlying biological pathways in a real-world case study. CONCLUSION: SUMMER performs integrative analysis for data interpretation and exploration. SUMMER is freely accessible at http://summer.salk.edu and the code is available at https://bitbucket.org/salkigc/summer .
Subjects
Computational Biology/methods , Metabolomics , Software , Algorithms , Animals , Data Analysis , Gene Expression Profiling/methods , Humans , Metabolomics/methods , Mice , Proteomics/methods , ROC Curve , Web Browser , Workflow
ABSTRACT
Stochastic changes in DNA methylation (i.e., spontaneous epimutations) contribute to methylome diversity in plants. Here, we describe AlphaBeta, a computational method for estimating the precise rate of such stochastic events using pedigree-based DNA methylation data as input. We demonstrate how AlphaBeta can be employed to study transgenerationally heritable epimutations in clonal or sexually derived mutation accumulation lines, as well as somatic epimutations in long-lived perennials. Application of our method to published and new data reveals that spontaneous epimutations accumulate neutrally at the genome-wide scale, originate mainly during somatic development and that they can be used as a molecular clock for age-dating trees.
Subjects
DNA Methylation , Epigenome , Plant Genome , Genomics/methods , Software , Arabidopsis , Populus , Taraxacum
ABSTRACT
BACKGROUND: Tandem repeat sequences are widespread in the human genome, and their expansions cause multiple repeat-mediated disorders. Genome-wide discovery approaches are needed to fully elucidate their roles in health and disease, but resolving tandem repeat variation accurately remains a challenging task. While traditional mapping-based approaches using short-read data have severe limitations in the size and type of tandem repeats they can resolve, recent third-generation sequencing technologies exhibit substantially higher sequencing error rates, which complicates repeat resolution. RESULTS: We developed TRiCoLOR, a freely available tool for tandem repeat profiling using error-prone long reads from third-generation sequencing technologies. The method can identify repetitive regions in sequencing data without prior knowledge of their motifs or locations and resolve repeat multiplicity and period size in a haplotype-specific manner. The tool includes methods to interactively visualize the identified repeats and to trace their Mendelian consistency in pedigrees. CONCLUSIONS: TRiCoLOR demonstrates excellent performance and improved sensitivity and specificity compared with alternative tools on synthetic data. For real human whole-genome sequencing data, TRiCoLOR achieves high validation rates, suggesting its suitability to identify tandem repeat variation in personal genomes.
Subjects
Human Genome , Tandem Repeat Sequences , High-Throughput Nucleotide Sequencing , Humans , Sensitivity and Specificity , DNA Sequence Analysis , Whole Genome Sequencing
ABSTRACT
iMOKA (interactive multi-objective k-mer analysis) is software that enables comprehensive analysis of sequencing data from large cohorts to generate robust classification models or explore specific genetic elements associated with disease etiology. iMOKA uses a fast and accurate feature reduction step that combines a Naïve Bayes classifier augmented by an adaptive entropy filter and a graph-based filter to rapidly reduce the search space. By using a flexible file format and distributed indexing, iMOKA can easily integrate data from multiple experiments while reducing disk space requirements, and it can identify changes in transcript levels and single nucleotide variants. iMOKA is available at https://github.com/RitchieLabIGH/iMOKA and Zenodo https://doi.org/10.5281/zenodo.4008947 .
Subjects
DNA Sequence Analysis , Software , Algorithms , Breast Neoplasms/classification , Breast Neoplasms/drug therapy , Breast Neoplasms/genetics , Neoplasm Drug Resistance/genetics , Female , Humans , Ovarian Neoplasms/drug therapy , Ovarian Neoplasms/genetics , Pharmacogenomic Variants
ABSTRACT
Over the past decade, modern methods of mass spectrometry (MS) have emerged that allow reliable, fast and cost-effective identification of pathogenic microorganisms. Although MALDI-TOF MS has already revolutionized the way microorganisms are identified, recent years have also witnessed substantial progress in the development of liquid chromatography (LC)-MS based proteomics for microbiological applications. For example, LC-tandem MS (LC-MS2) has been proposed for microbial characterization by means of multiple discriminative peptides that enable identification at the species, or sometimes at the strain level. However, such investigations can be laborious and time-consuming, especially if the experimental LC-MS2 data are tested against sequence databases covering a broad panel of different microbiological taxa. In this proof of concept study, we present an alternative bottom-up proteomics method for microbial identification. The proposed approach involves efficient extraction of proteins from cultivated microbial cells, digestion by trypsin and LC-MS measurements. Peptide masses are then extracted from MS1 data and systematically tested against an in silico library of all possible peptide mass data compiled in-house. The library has been computed from the UniProt Knowledgebase covering Swiss-Prot and TrEMBL databases and comprises more than 12,000 strain-specific in silico profiles, each containing tens of thousands of peptide mass entries. Identification analysis involves computation of score values derived from correlation coefficients between experimental and strain-specific in silico peptide mass profiles and compilation of score ranking lists. The taxonomic positions of the microbial samples are then determined by using the best-matching database entries. The suggested method is computationally efficient (less than 2 min per sample) and has been successfully tested on a set of 39 LC-MS1 peak lists obtained from 19 different microbial pathogens. The proposed method is rapid, simple and automatable, and we foresee wide potential for future microbiological applications.
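How an in silico peptide mass profile can be built and compared with measured masses is sketched below in Python. The sketch makes several assumptions that are not taken from the study: trypsin is modeled as cleaving after K or R unless the next residue is P, with no missed cleavages; masses are neutral monoisotopic masses from standard residue values; and the score is a simple fraction of matched masses within a ppm tolerance, standing in for the correlation-based score described in the abstract. All function names and parameters are hypothetical.

```python
import re

# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.010565  # one water is added per peptide for the free termini


def tryptic_peptides(protein, min_len=6):
    """Split a protein after K or R, except when the next residue is P."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein.upper())
    return [p for p in pieces if len(p) >= min_len]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


def match_fraction(observed, theoretical, ppm=10.0):
    """Fraction of theoretical masses with an observed mass within `ppm`."""
    if not theoretical:
        return 0.0
    hits = 0
    for mass in theoretical:
        tolerance = mass * ppm * 1e-6
        if any(abs(obs - mass) <= tolerance for obs in observed):
            hits += 1
    return hits / len(theoretical)


# Hypothetical protein and a peak list containing one of its two peptides.
profile = [peptide_mass(p) for p in tryptic_peptides("AAAAAKGGGGGRPGGGK")]
print(match_fraction([profile[0] + 0.001], profile))
```

In a library search of the kind described, a profile like `profile` would be precomputed for every strain, and the strains would be ranked by their score against each sample's peak list.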
Subjects
Bacteria/isolation & purification , Computer Simulation , Peptide Library , Tandem Mass Spectrometry , Liquid Chromatography , Data Analysis , Species Specificity
ABSTRACT
Pathway analyses are key methods to analyze 'omics experiments. Nevertheless, integrating data from different 'omics technologies and different species still requires considerable bioinformatics knowledge. Here we present the novel ReactomeGSA resource for comparative pathway analyses of multi-omics datasets. ReactomeGSA can be used through Reactome's existing web interface and the novel ReactomeGSA R Bioconductor package with explicit support for scRNA-seq data. Data from different species is automatically mapped to a common pathway space. Public data from ExpressionAtlas and Single Cell ExpressionAtlas can be directly integrated in the analysis. ReactomeGSA greatly reduces the technical barrier for multi-omics, cross-species, comparative pathway analyses. We used ReactomeGSA to characterize the role of B cells in anti-tumor immunity. We compared B cell-rich and B cell-poor human cancer samples from five of the Cancer Genome Atlas (TCGA) transcriptomics and two of the Clinical Proteomic Tumor Analysis Consortium (CPTAC) proteomics studies. B cell-rich lung adenocarcinoma samples lacked the otherwise present activation through NFkappaB. This may be linked to the presence of a specific subset of tumor associated IgG+ plasma cells that lack NFkappaB activation in scRNA-seq data from human melanoma. This showcases how ReactomeGSA can derive novel biomedical insights by integrating large multi-omics datasets.
Subjects
Genetic Databases , Proteomics , Software , B-Lymphocytes/immunology , Humans , Internet , User-Computer Interface
ABSTRACT
Tandem mass tag (TMT) is a multiplexing technology widely used in proteomic research. It enables relative quantification of proteins from multiple biological samples in a single MS run with high efficiency and high throughput. However, experiments often require more biological replicates or conditions than can be accommodated by a single run, and involve multiple TMT mixtures and multiple runs. Such larger-scale experiments combine sources of biological and technical variation in patterns that are complex, unique to TMT-based workflows, and challenging for the downstream statistical analysis. These patterns cannot be adequately characterized by statistical methods designed for other technologies, such as label-free proteomics or transcriptomics. This manuscript proposes a general statistical approach for relative protein quantification in MS-based experiments with TMT labeling. It is applicable to experiments with multiple conditions, multiple biological replicate runs and multiple technical replicate runs, and unbalanced designs. It is based on a flexible family of linear mixed-effects models that handle complex patterns of technical artifacts and missing values. The approach is implemented in MSstatsTMT, a freely available open-source R/Bioconductor package compatible with data processing tools such as Proteome Discoverer, MaxQuant, OpenMS, and SpectroMine. Evaluation on a controlled mixture, simulated datasets, and three biological investigations with diverse designs demonstrated that MSstatsTMT balanced the sensitivity and the specificity of detecting differentially abundant proteins in large-scale experiments with multiple biological mixtures.
Subjects
Isotope Labeling , Proteome/metabolism , Statistics as Topic , Tandem Mass Spectrometry , Humans , Proteomics
ABSTRACT
A key point in achieving accurate intact glycopeptide identification is the definition of the glycan composition file that is used to match experimental with theoretical masses by a glycoproteomics search engine. At present, these files are mainly built from searching the literature and/or querying data sources focused on posttranslational modifications. Most glycoproteomics search engines include a default composition file that is readily used when processing MS data. Here we introduce GlyConnect Compozitor, a tool for visualizing and comparing glycan compositions that is associated with the GlyConnect database. It offers a web interface through which the database can be queried to bring out contextual information relative to a set of glycan compositions. The tool takes advantage of compositions being related to one another through shared monosaccharide counts and outputs interactive graphs summarizing information searched in the database. These results provide a guide for selecting or deselecting compositions in a file in order to reflect the context of a study as closely as possible. They also confirm the consistency of a set of compositions based on the content of the GlyConnect database. As part of the tool collection of the Glycomics@ExPASy initiative, Compozitor is hosted at https://glyconnect.expasy.org/compozitor/ where it can be run as a web application. It is also directly accessible from the GlyConnect database.
Subjects
Glycomics , Polysaccharides/metabolism , Animals , CHO Cells , Cricetulus , Factual Databases , Humans , Immunoglobulin G/metabolism , Integrins/metabolism , Mucins/metabolism , Polysaccharides/chemistry
ABSTRACT
Mass spectrometry (MS) has made enormous contributions to comprehensive protein identification and quantification in proteomics. MS is also gaining momentum for structural biology in a variety of ways, complementing conventional structural biology techniques. Here, we review how MS-based techniques, such as hydrogen/deuterium exchange, covalent labeling, and chemical cross-linking, enable the characterization of protein structure, dynamics, and interactions, especially from the perspective of their data analyses. Structural information encoded by chemical probes in intact proteins is decoded by interpreting MS data at the peptide level, i.e., revealing conformational and dynamic changes in local regions of proteins. Structural MS data are not amenable to analysis in traditional proteomics workflows and require dedicated software for each type of data. We first provide basic principles of data interpretation, including isotopic distribution and peptide sequencing. We then focus particularly on computational methods for structural MS data analyses and discuss outstanding challenges in proteome-wide, large-scale analysis.