ABSTRACT
Bioinformatics analysis and visualization of high-throughput gene expression data require extensive computer programming skills, posing a bottleneck for many wet-lab scientists. In this work, we present an intuitive user-friendly platform for gene expression data analysis and visualization called FungiExpresZ. FungiExpresZ aims to help wet-lab scientists with little to no knowledge of computer programming to become self-reliant in bioinformatics analysis and generating publication-ready figures. The platform contains many commonly used data analysis tools and an extensive collection of pre-processed public ribonucleic acid sequencing (RNA-seq) datasets of many fungal species, including important human, plant and insect pathogens. Users may analyse their data alone or in combination with public RNA-seq data for an integrated analysis. The FungiExpresZ platform helps wet-lab scientists to overcome their limitations in genomics data analysis and can be applied to analyse data of any organism. FungiExpresZ is available as an online web-based tool (https://cparsania.shinyapps.io/FungiExpresZ/) and an offline R-Shiny package (https://github.com/cparsania/FungiExpresZ).
Subjects
Genomics, Software, Humans, Gene Expression Profiling, Data Analysis, RNA/genetics, Gene Expression
ABSTRACT
The rapid expansion of biological sequence databases due to high-throughput genomic and proteomic sequencing methods has left a considerable number of identified protein sequences with unclear or incomplete functional annotations. Domains of unknown function (DUFs) are protein domains that lack functional annotations but are present in numerous proteins. To address the challenge of finding functional annotations for DUFs, we have developed a computational method that efficiently identifies and annotates these enigmatic protein domains by utilizing the position-specific iterative basic local alignment search tool (PSI-BLAST) and data mining techniques. Our pipeline identifies putative functions of DUFs, thereby narrowing the gap between known sequences and known functions. The tool can also annotate user-supplied sequences. We executed our pipeline on 5111 unique DUF sequences obtained from Pfam, resulting in putative annotations for 2007 of these. These annotations were subsequently incorporated into a comprehensive database and interfaced with a web-based server named "AnnoDUF". AnnoDUF is freely accessible to both academic and industrial users at http://bts.ibab.ac.in/annoduf.php. All scripts used in this study are available in the GitHub repository at https://github.com/BioToolSuite/AnnoDUF.
Subjects
Protein Databases, Internet, Molecular Sequence Annotation, Software, Protein Domains, Computational Biology/methods, Data Mining, Proteins/chemistry, Proteins/genetics, Proteins/metabolism, Proteomics/methods
ABSTRACT
MAIN CONCLUSION: Mfind is a tool to analyze the impact of microsatellite presence on DNA barcode specificity. We found a significant correlation between barcode entropy and microsatellite count in angiosperms. Genetic barcodes and microsatellites are among the identification methods used in taxonomy and biodiversity research. It is important to establish a relationship between microsatellite quantification and genetic information in barcodes. To clarify the association between the genetic information in barcodes (expressed as Shannon's Measure of Information, SMI) and microsatellite count, a total of 330,809 DNA barcodes from the BOLD database (Barcode of Life Data System) were analyzed. A parallel sliding-window algorithm was developed to compute the Shannon entropy of the barcodes, and this was compared with the quantification of microsatellites such as (AT)n, (AC)n, and (AG)n. The microsatellite search method used an algorithm written in the Java programming language, which systematically examined the genetic barcodes from an angiosperm database. For this purpose, a computational tool named Mfind was developed, and its search methodology is detailed. This comprehensive study provided a broad overview of microsatellites within barcodes, revealing an inverse correlation between the sum of microsatellite counts and barcode information. The use of the Mfind tool demonstrated that the presence of microsatellites affects barcode information when entropy is taken as the metric. This effect might be attributed to the short length of DNA barcodes and the repetitive nature of microsatellites, which together directly influence the entropy of the barcodes.
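The two quantities compared in this abstract, windowed Shannon entropy and microsatellite count, can be sketched in a few lines. This is a minimal, sequential illustration in Python, not the authors' parallel Java implementation; the function names, the window length, the minimum repeat number and the example sequence are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy, in bits per symbol, of the characters in seq."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def windowed_entropy(seq: str, window: int = 20, step: int = 1) -> list[float]:
    """Entropy of every window of length `window`, advancing `step` bases."""
    return [shannon_entropy(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


def count_microsatellites(seq: str, motif: str, min_repeats: int = 3) -> int:
    """Count non-overlapping runs of `motif` repeated at least `min_repeats` times."""
    pattern = re.compile(f"(?:{re.escape(motif)}){{{min_repeats},}}")
    return len(pattern.findall(seq))


# Hypothetical barcode fragment containing one (AT)n and one (AC)n run.
barcode = "ATATATATGCGTACACACACTTGA"
for motif in ("AT", "AC", "AG"):
    print(motif, count_microsatellites(barcode, motif))
print([round(h, 3) for h in windowed_entropy(barcode)])
```

A repeat-rich window draws on fewer distinct symbols in less even proportions, so its entropy is lower, which is the intuition behind the inverse correlation reported above.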
Subjects
Algorithms, Taxonomic DNA Barcoding, Magnoliopsida, Microsatellite Repeats, Microsatellite Repeats/genetics, Taxonomic DNA Barcoding/methods, Magnoliopsida/genetics, Plant DNA/genetics
ABSTRACT
A macrohaplotype combines multiple types of phased DNA variants, increasing forensic discrimination power. High-quality long sequencing reads, for example PacBio HiFi reads, provide data to detect macrohaplotypes in multiploidy and DNA mixtures. However, bioinformatics tools for detecting macrohaplotypes are lacking. In this study, we developed a bioinformatics software, MacroHapCaller, in which targeted loci (i.e., short tandem repeats [STRs], single nucleotide polymorphisms, and insertions and deletions) are genotyped and combined with novel algorithms to call macrohaplotypes from long reads. MacroHapCaller uses physical phasing (i.e., read-backed phasing) to identify macrohaplotypes, and thus it can detect multi-allelic macrohaplotypes for a given sample. MacroHapCaller was validated with data generated from our designed targeted PacBio HiFi sequencing pipeline, which sequenced ~8-kb amplicon regions harboring 20 core forensic STR loci in human benchmark samples HG002 and HG003. MacroHapCaller was also validated on whole-genome long-read sequencing data. Robust and accurate genotyping and phased macrohaplotypes were obtained with MacroHapCaller compared with the known ground truth. MacroHapCaller achieved higher or comparable genotyping accuracy and faster speed than the existing tools HipSTR and DeepVar. MacroHapCaller enables efficient macrohaplotype analysis from high-throughput sequencing data and supports applications using discriminating macrohaplotypes.
Subjects
Haplotypes, High-Throughput Nucleotide Sequencing, Single Nucleotide Polymorphism, Polyploidy, DNA Sequence Analysis, Software, Humans, DNA Sequence Analysis/methods, High-Throughput Nucleotide Sequencing/methods, Algorithms, Computational Biology/methods, DNA/genetics, DNA/analysis, Microsatellite Repeats/genetics, Forensic Genetics/methods, Genotyping Techniques/methods
ABSTRACT
BACKGROUND: A variety of high-throughput analyses, such as transcriptome, proteome, and metabolome analysis, have been developed, producing unprecedented amounts of omics data. These studies generate large gene lists whose biological significance needs to be understood in depth. However, manually interpreting these lists is difficult, especially for scientists without a bioinformatics background. RESULTS: We developed an R package and a corresponding web server, Genekitr, to assist biologists in exploring large gene sets. Genekitr comprises four modules: gene information retrieval, ID (identifier) conversion, enrichment analysis and publication-ready plotting. Currently, the information retrieval module can retrieve information on up to 23 attributes for genes of 317 organisms. The ID conversion module assists in ID-mapping of genes, probes, proteins, and aliases. The enrichment analysis module organizes 315 gene set libraries in different biological contexts for over-representation analysis and gene set enrichment analysis. The plotting module produces customizable, high-quality illustrations that can be used directly in presentations or publications. CONCLUSIONS: This web server tool will make bioinformatics more accessible to scientists who might not have programming expertise, allowing them to perform bioinformatics tasks without coding.
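Over-representation analysis, one of the two enrichment methods named above, is commonly computed as a one-sided hypergeometric test. The abstract does not say how Genekitr implements it, so the following Python sketch is only an assumption-laden illustration; the function name, argument names and the example numbers are hypothetical.

```python
from math import comb


def ora_p_value(universe: int, set_size: int, list_size: int, overlap: int) -> float:
    """One-sided hypergeometric p-value for over-representation.

    universe  : number of background genes
    set_size  : genes in the annotated gene set (e.g. one pathway)
    list_size : genes in the user's list
    overlap   : genes shared by the list and the gene set
    Returns P(X >= overlap) when list_size genes are drawn without replacement.
    """
    upper = min(set_size, list_size)
    total = comb(universe, list_size)
    tail = sum(comb(set_size, k) * comb(universe - set_size, list_size - k)
               for k in range(overlap, upper + 1))
    return tail / total


# Hypothetical numbers: 20,000 background genes, a 100-gene pathway,
# a 200-gene list, and 8 genes in common.
print(ora_p_value(20_000, 100, 200, 8))
```

In practice the p-values obtained across many gene sets would also need multiple-testing correction, which this sketch omits.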
Subjects
Computational Biology, Computers, Gene Library, Information Storage and Retrieval, Psychological Power
ABSTRACT
The selection of a suitable proteotypic peptide remains a challenge when designing a targeted quantitative proteomics assay. Although the criteria are well established in the literature, the selection of these peptides is often performed in a subjective and time-consuming manner. Here, we have developed a practical and semiautomated workflow implemented in an open-source program named Typic. Typic is designed to run from the command line or a graphical interface to help select a list of proteotypic peptides for targeted quantitation. The tool combines the input data and downloads additional data from public repositories to produce one output file per protein. Each output file includes information relevant to the selection of proteotypic peptides organized in a table, a colored ranking of peptides according to their potential value as targets for quantitation, and auxiliary plots to assist users in proteotypic peptide selection. Taken together, Typic provides practical and straightforward data extraction from multiple data sets, allowing the identification of the most suitable proteotypic peptides based on established criteria in an unbiased and standardized manner, ultimately leading to a more robust targeted proteomics assay.
Subjects
Proteome, Proteomics, Peptides
ABSTRACT
BACKGROUND: Sequence logo visualization has been an active area in the development of bioinformatics tools. ggseqlogo, written in R, has been the most popular API since it was published. With the rise of artificial intelligence and deep learning, Python is currently the most popular programming language, and bioinformaticians have begun to shift to it. Providing Python APIs similar to those in R can reduce the cost of learning a new programming language. Compared with ggplot2 in R, plotting frameworks in Python are not as easy to use. The appearance of plotnine (a Python implementation of ggplot2) makes it possible to unify the programming methods of bioinformatics visualization tools between R and Python. RESULTS: Here, we introduce plotnineSeqSuite, a new plotnine-based Python package that provides a ggseqlogo-like API for programmatic drawing of sequence logos, sequence alignment diagrams and sequence histograms. It supports custom letters, color themes, and fonts. Moreover, the class for drawing layers is based on object-oriented design so that users can easily encapsulate and extend it. CONCLUSIONS: plotnineSeqSuite is the first ggplot2-style package to implement visualization of sequence-related graphs in Python. It enhances the uniformity of programmatic plotting between R and Python. Compared with existing tools, plotnineSeqSuite supports a more complete range of plot categories. The source code of plotnineSeqSuite can be obtained on GitHub (https://github.com/caotianze/plotnineseqsuite) and PyPI (https://pypi.org/project/plotnineseqsuite), and the documentation homepage is freely available on GitHub at https://caotianze.github.io/plotnineseqsuite/.
Subjects
Artificial Intelligence, Software, Programming Languages, Computational Biology, Position-Specific Scoring Matrices
ABSTRACT
Mucopolysaccharidosis VI (Maroteaux-Lamy syndrome) is a metabolic disorder caused by loss of the enzyme activity of N-acetylgalactosamine-4-sulphatase arising from mutations in the ARSB gene. Mutated ARSB leads to the accumulation of glycosaminoglycans (GAGs) within the lysosome, producing severe growth deformities and lysosomal storage disease. The main focus of this study is to identify deleterious variants by applying bioinformatics tools to predict the conservation, pathogenicity, stability, and effect of the ARSB variants. We examined 170 missense variants, of which G137V and G144R emerged as the variants predicted to be detrimental and to contribute to disease progression. The native, G137V and G144R structures were fixed as receptors and subjected to molecular docking with the small molecule Odiparcil to analyze the binding efficiency and the varied interactions of the receptors with the drug. Docking gave similar scores of -7.3 kcal/mol, indicating effective binding, and consistent interactions of the drug with residues CYS117, GLN118, THR182, and GLN517 for the native, G137V and G144R structures. Molecular dynamics simulations were conducted to validate the stability and flexibility of the native and variant structures on ligand binding. Overall, the study indicates that the drug has similar therapeutic potential towards the native and variant structures, based on the high binding affinity, and that the complexes are stable, with an average RMS value of 0.2 nm. This can aid the future development of therapeutics for Maroteaux-Lamy syndrome.
ABSTRACT
Detecting copy number variations (CNVs) and alterations (CNAs) in the BRCA1 and BRCA2 genes is essential for testing patients for targeted therapy applicability. However, the available bioinformatics tools were initially designed for identifying CNVs/CNAs in whole-genome or -exome (WES) NGS data or targeted NGS data without adaptation to the BRCA1/2 genes. Most of these tools were tested on sample cohorts of limited size, with their use restricted to specific library preparation kits or sequencing platforms. We developed BRACNAC, a new tool for detecting CNVs and CNAs in the BRCA1 and BRCA2 genes in NGS data of different origin. The underlying mechanism of this tool involves various coverage normalization steps complemented by CNV probability evaluation. We estimated the sensitivity and specificity of our tool to be 100% and 94%, respectively, with an area under the curve (AUC) of 94%. The estimation was performed using the NGS data obtained from 213 ovarian and prostate cancer samples tested with in-house and commercially available library preparation kits and additionally using multiplex ligation-dependent probe amplification (MLPA) (12 CNV-positive samples). Using freely available WES and targeted NGS data from other research groups, we demonstrated that BRACNAC could also be used for these two types of data, with an AUC of up to 99.9%. In addition, we determined the limitations of the tool in terms of the minimum number of samples per NGS run (≥20 samples) and the minimum expected percentage of CNV-negative samples (≥80%). We expect that our findings will improve the efficacy of BRCA1/2 diagnostics. BRACNAC is freely available at the GitHub server.
Subjects
DNA Copy Number Variations, Ovarian Neoplasms, Prostatic Neoplasms, Female, Humans, Male, BRCA1 Protein/genetics, BRCA2 Protein/genetics, BRCA2 Genes, High-Throughput Nucleotide Sequencing/methods, Ovarian Neoplasms/genetics, Ovarian Neoplasms/diagnosis, Prostatic Neoplasms/genetics
ABSTRACT
In liquid-chromatography-tandem-mass-spectrometry-based proteomics, information about the presence and stoichiometry of protein modifications is not readily available. To overcome this problem, we developed multiFLEX-LF, a computational tool that builds upon FLEXIQuant, which detects modified peptide precursors and quantifies their modification extent by monitoring the differences between observed and expected intensities of the unmodified precursors. multiFLEX-LF relies on robust linear regression to calculate the modification extent of a given precursor relative to a within-study reference. multiFLEX-LF can analyze entire label-free discovery proteomics data sets in a precursor-centric manner without preselecting a protein of interest. To analyze modification dynamics and coregulated modifications, we hierarchically clustered the precursors of all proteins based on their computed relative modification scores. We applied multiFLEX-LF to a data-independent-acquisition-based data set acquired using the anaphase-promoting complex/cyclosome (APC/C) isolated at various time points during mitosis. The clustering of the precursors allows for identifying varying modification dynamics and ordering the modification events. Overall, multiFLEX-LF enables the fast identification of potentially differentially modified peptide precursors and the quantification of their differential modification extent in large data sets using a personal computer. Additionally, multiFLEX-LF can drive the large-scale investigation of the modification dynamics of peptide precursors in time-series and case-control studies. multiFLEX-LF is available at https://gitlab.com/SteenOmicsLab/multiflex-lf.
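The core idea described above, fitting a robust line between reference and sample intensities and reading the modification extent from how far a precursor falls below that line, can be sketched as follows. The abstract says only "robust linear regression" without naming the estimator, so this Python sketch substitutes the Theil-Sen estimator as a stand-in; the function names and the intensity values are hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen(x: list[float], y: list[float]) -> tuple[float, float]:
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


def relative_scores(reference: list[float], sample: list[float]) -> list[float]:
    """Observed / expected intensity per precursor; values well below 1
    suggest that part of the unmodified precursor pool has been modified."""
    slope, intercept = theil_sen(reference, sample)
    return [obs / (slope * ref + intercept) for ref, obs in zip(reference, sample)]


# Hypothetical intensities of five unmodified precursors of one protein.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
sample = [2.0, 4.0, 6.0, 8.0, 5.0]   # the last precursor is depleted
print(relative_scores(reference, sample))
```

Because the fit follows the majority of precursors, the single depleted precursor does not drag the line down, and its score stays clearly separated from the others.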
Subjects
Proteins, Proteomics, Liquid Chromatography, Mass Spectrometry, Peptides
ABSTRACT
In clinical cancer treatment, genomic alterations often affect the response of patients to anticancer drugs. Studies have shown that molecular features of tumors can be biomarkers predictive of sensitivity or resistance to anticancer agents, but the identification of actionable mutations is often constrained by the incomplete understanding of cancer genomes. Recent progress in next-generation sequencing technology has greatly facilitated the extensive molecular characterization of tumors and promoted precision medicine in cancer. A growing number of clinical studies, cancer cell line studies, CRISPR screening studies and patient-derived model studies have been performed to identify potentially actionable mutations predictive of drug response, providing rich resources of molecularly and pharmacologically profiled cancer samples at different levels. This abundance of data also enables the development of various computational models and algorithms that integrate multiomics data to address drug sensitivity prediction, biomarker identification and in silico drug prioritization. Here, we review recent developments in methods and resources that identify mutation-dependent effects for cancer treatment in clinical studies, functional genomics studies and computational studies, and discuss the remaining gaps and future directions in this area.
Subjects
Antineoplastic Agents, High-Throughput Nucleotide Sequencing, Neoplasms, Precision Medicine, Antineoplastic Agents/therapeutic use, Genomics, Humans, Molecular Targeted Therapy, Mutation, Neoplasms/genetics, Neoplasms/therapy, Precision Medicine/methods
ABSTRACT
The current COronaVIrus Disease 2019 (COVID-19) pandemic started in December 2019. COVID-19 cases are confirmed by the detection of SARS-CoV-2 RNA in biological samples by RT-qPCR. However, only a limited number of SARS-CoV-2 genomes were available when the first RT-qPCR methods were developed in January 2020 for initial in silico specificity evaluation and to verify whether the targeted loci are highly conserved. Now that more whole genome data have become available, we used the bioinformatics tool SCREENED and a total of 4755 publicly available SARS-CoV-2 genomes, downloaded at two different time points, to evaluate the specificity of 12 RT-qPCR tests (consisting of a total of 30 primer and probe sets) used for SARS-CoV-2 detection and the impact of the virus' genetic evolution on four of them. The exclusivity of these methods was also assessed using the human reference genome and 2624 genomes of other, closely related respiratory viruses. The specificity of the assays was generally good and stable over time. An exception is the first method developed by the China Center for Disease Control and Prevention (CDC), which exhibits three primer mismatches present in 358 SARS-CoV-2 genomes sequenced mainly in Europe from February 2020 onwards. The best results were obtained for the assay of Chan et al. (2020) targeting the gene coding for the spike protein (S). This demonstrates that our user-friendly strategy can be used for a first in silico specificity evaluation of future RT-qPCR tests, as well as for verifying that earlier methods are still capable of detecting circulating SARS-CoV-2 variants.
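The kind of primer mismatch counting reported above can be illustrated with a simple ungapped scan of a primer along a genome. This is a hedged Python sketch of the general idea only; it is not how SCREENED works internally (the abstract does not describe that), and the function name and toy sequences are hypothetical.

```python
def min_mismatches(primer: str, genome: str) -> tuple[int, int]:
    """Return (fewest mismatches, 0-based start) over all ungapped placements
    of `primer` along `genome`. Ties keep the leftmost placement."""
    if not primer or len(primer) > len(genome):
        raise ValueError("primer must be non-empty and no longer than genome")
    best = (len(primer) + 1, -1)
    for start in range(len(genome) - len(primer) + 1):
        window = genome[start:start + len(primer)]
        mismatches = sum(a != b for a, b in zip(primer, window))
        if mismatches < best[0]:
            best = (mismatches, start)
    return best


# Toy example: the primer anneals at position 2 with one mismatch.
print(min_mismatches("ACGT", "TTACGATT"))
```

A real evaluation would also handle reverse-complement primers, degenerate bases and indels, and would typically weight mismatches near the 3' end more heavily.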
Subjects
Betacoronavirus/genetics, Coronavirus Infections/diagnosis, Viral Genome, Viral Pneumonia/diagnosis, Viral RNA/metabolism, Real-Time Polymerase Chain Reaction/methods, Betacoronavirus/isolation & purification, COVID-19, Coronavirus Infections/virology, Genetic Databases, Humans, Open Reading Frames/genetics, Pandemics, Viral Pneumonia/virology, Single Nucleotide Polymorphism, Viral RNA/analysis, RNA-Dependent RNA Polymerase/genetics, SARS-CoV-2, Sensitivity and Specificity, Whole Genome Sequencing
ABSTRACT
BACKGROUND: miRNAs regulate the expression of several genes: one miRNA can target multiple genes, and one gene can be simultaneously targeted by more than one miRNA. It has therefore become indispensable to shorten the long list of miRNA-target interactions to focus on, in order to gain insight into the regulatory mechanisms orchestrated by miRNAs in various cellular processes. A reasonable solution is to prioritize miRNA-target interactions to maximize the effectiveness of the downstream analysis. RESULTS: We propose a new and easy-to-use web tool, MIENTURNET (MicroRNA ENrichment TURned NETwork), that takes as input a list of miRNAs or mRNAs and tackles the problem of prioritizing miRNA-target interactions by performing a statistical analysis followed by a fully featured network-based visualization and analysis. The statistical analysis assesses the significance of an over-representation of miRNA-target interactions, and MIENTURNET then filters on the statistical significance associated with each miRNA-target interaction. In addition, the holistic approach of network theory is used to infer possible evidence of miRNA regulation by capturing emergent properties of the miRNA-target regulatory network that would not be evident through a pairwise analysis of the individual components. CONCLUSION: MIENTURNET makes it possible to perform both statistical and network-based analyses consistently within a single tool, leading to a more effective prioritization of miRNA-target interactions. This can spare researchers without computational and informatics skills from navigating multiple websites, allowing them to independently investigate miRNA activity in any cellular process of interest in an easy yet exhaustive way, thanks to the intuitive web interface. The web application, along with a well-documented and comprehensive user guide, is freely available at http://userver.bio.uniroma1.it/apps/mienturnet/ without any login requirement.
Subjects
Computational Biology/methods, MicroRNAs/genetics, Computational Biology/instrumentation, Gene Regulatory Networks, Internet, Messenger RNA/genetics
ABSTRACT
moFF is a modular and operating-system-independent tool for quantitative analysis of label-free mass-spectrometry-based proteomics data. The moFF workflow, comprising matching-between-runs and apex quantification, can be applied to any upstream search engine's output, along with the corresponding Thermo or mzML raw file. We here present moFF 2.0, with improvements in speed through multithreading, the use of a new raw file access library, and a novel filtering approach in the matching-between-runs module. This filter allows moFF to correctly identify features that are present in one run but not in another, as demonstrated using spiked-in iRT peptides. Moreover, moFF 2.0 also provides a new peptide summary export that can be used in downstream statistical analysis. moFF is open source and freely available and can be downloaded from https://github.com/compomics/moFF.
Subjects
Algorithms, Statistical Data Interpretation, Proteomics/methods, Data Analysis, Peptides/analysis, Peptides/chemistry, Software
ABSTRACT
Evaluation of the functional impact of cancer-associated missense variants is more difficult than for protein-truncating mutations, and consequently standard guidelines for the interpretation of sequence variants have recently been proposed. A number of algorithms and software products have been developed to predict the impact of cancer-associated missense mutations on protein structure and function. Importantly, direct assessment of the variants with high-throughput functional assays in simple genetic systems can help speed up the functional evaluation of newly identified cancer-associated variants. We developed the web tool CRIMEtoYHU (CTY) to help geneticists evaluate the functional impact of cancer-associated missense variants. Humans and the yeast Saccharomyces cerevisiae share thousands of protein-coding genes although they have diverged for a billion years. Therefore, yeast humanization can be helpful in deciphering the functional consequences of human genetic variants found in cancer and can give information on the pathogenicity of missense variants. To humanize specific positions within yeast genes, human and yeast genes have to share functional homology. If a mutation in a specific residue is associated with a particular phenotype in humans, a similar substitution in the yeast counterpart may reveal its effect at the organism level. CTY simultaneously finds yeast homologous genes, identifies the corresponding variants and determines the transferability of human variants to yeast counterparts by assigning a reliability score (RS) that may be predictive of the validity of a functional assay. CTY analyzes newly identified mutations or retrieves mutations reported in the COSMIC database, provides information about the functional conservation between yeast and human, and shows the mutation distribution in human genes. The analysis is aborted when no yeast homologue is found. Then, on the basis of the protein domain localization and functional conservation between yeast and human, the selected variants are ranked by the RS. The RS is assigned by an algorithm that takes into account functional data, type of mutation, chemistry of the amino acid substitution and the degree of mutation transferability between human and yeast protein. Mutations giving a positive RS are highly transferable to yeast and, therefore, yeast functional assays will be more predictable. To validate the web application, we analyzed 8078 cancer-associated variants located in 31 genes that have a yeast homologue. More than 50% of the variants are transferable to yeast, and 88% of all transferable mutations have a reliability score >0. Moreover, we used CTY to analyze 72 functionally validated missense variants located in yeast genes at positions corresponding to the human cancer-associated variants. All these variants gave a positive RS. To further validate CTY, we analyzed 3949 protein variants (with positive RS) with the predictive algorithm PROVEAN. This analysis shows that yeast-based functional assays will be more predictable for the variants with positive RS. We believe that CTY could be an important resource for the cancer research community by providing information concerning the functional impact of specific mutations, as well as for the design of functional assays useful for decision support in precision medicine.
Subjects
Population Biological Variation, Computational Biology/methods, DNA Mutational Analysis, Molecular Biology/methods, Mutant Proteins/genetics, Neoplasms/genetics, Saccharomyces cerevisiae/genetics, Humans, Internet, Mutant Proteins/metabolism, Missense Mutation
ABSTRACT
Short structural variants (SSVs) are short genomic variants (<50 bp) other than SNPs. It has been suggested that SSVs contribute to many human complex traits. However, high-throughput analysis of SSVs presents numerous technical challenges. In order to facilitate the discovery and assessment of SSVs, we have developed a prototype bioinformatics tool, "SSV evaluation system," which is a searchable, annotated database of SSVs in the human genome, with associated customizable scoring software that is used to evaluate and prioritize SSVs that are most likely to have significant biological effects and impact on disease risk. This new bioinformatics tool is a component in a larger strategy that we have been using to discover potentially important SSVs within candidate genomic regions that have been identified in genome-wide association studies, with the goal to prioritize potential functional/causal SSVs and focus the follow-up experiments on a relatively small list of strong candidate SSVs. We describe our strategy and discuss how we have used the SSV evaluation system to discover candidate causal variants related to complex neurodegenerative diseases. We present the SSV evaluation system as a powerful tool to guide genetic investigations aiming to uncover SSVs that underlie human complex diseases including neurodegenerative diseases in aging.
Subjects
Computational Biology/methods, Genetic Predisposition to Disease, Genetic Variation, Genome-Wide Association Study, Genomics, Humans, Software
ABSTRACT
BACKGROUND: Recent advances in genomics indicate the functional significance of a majority of genome sequences and their long-range interactions. As a detailed examination of genome organization and function requires a very high quality genome sequence, the objective of this study was to improve the reference genome assembly of banana (Musa acuminata). RESULTS: We have developed a modular bioinformatics pipeline to improve genome sequence assemblies, which can handle various types of data. The pipeline comprises several semi-automated tools. Unlike classical automated tools that are based on global parameters, the semi-automated tools offer an expert mode in which the user decides on suggested improvements through local compromises. The pipeline was used to improve the draft genome sequence of Musa acuminata. Genotyping by sequencing (GBS) of a segregating population and paired-end sequencing were used to detect and correct scaffold misassemblies. Long insert size paired-end reads identified scaffold junctions and fusions missed by automated assembly methods. GBS markers were used to anchor scaffolds to pseudo-molecules with a new bioinformatics approach that avoids the tedious step of marker ordering during genetic map construction. Furthermore, a genome map was constructed and used to assemble scaffolds into super scaffolds. Finally, a consensus gene annotation was projected onto the new assembly from two pre-existing annotations. This approach reduced the total Musa scaffold number from 7513 to 1532 (i.e. by 80%), with an N50 that increased from 1.3 Mb (65 scaffolds) to 3.0 Mb (26 scaffolds). 89.5% of the assembly was anchored to the 11 Musa chromosomes compared to the previous 70%. Unknown sites (N) were reduced from 17.3 to 10.0%. CONCLUSION: The release of the Musa acuminata reference genome version 2 provides a platform for detailed analysis of banana genome variation, function and evolution. Bioinformatics tools developed in this work can be used to improve genome sequence assemblies in other species.
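The N50 statistic quoted above, together with the scaffold count given in parentheses (presumably the number of scaffolds needed to reach half of the assembly), can be computed as in the following Python sketch. The function name and the example lengths are hypothetical and are not taken from the Musa assembly.

```python
def n50(lengths: list[int]) -> tuple[int, int]:
    """Return (N50, count): the length of the scaffold at which the running
    total, taken from longest to shortest, first reaches half of the assembly
    size, and the number of scaffolds needed to get there."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths; total = 30, half = 15.
print(n50([2, 10, 4, 8, 6]))
```

A higher N50 reached with fewer scaffolds, as reported for the improved assembly, indicates that more of the genome sits in long contiguous pieces.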
Subjects
Computational Biology/methods, Plant Genome, Musa/genetics, Contig Mapping, Genetic Markers, High-Throughput Nucleotide Sequencing, Molecular Sequence Annotation, DNA Sequence Analysis
ABSTRACT
Gly m Bd 28K is one of the major allergens in soybeans, but there is limited information on its IgG-binding epitopes. Thirty-four overlapping peptides that covered the entire sequence of Gly m Bd 28K were synthesized, and 3 monoclonal antibodies against Gly m Bd 28K were utilized to identify the IgG-binding regions of Gly m Bd 28K. Three dominant peptides corresponding to (28)GDKKSPKSLFLMSNS(42)(G28-S42), (56)LKSHGGRIFYRHMHI(70)(L56-I70), and (154)ETFQSFYIGGGANSH(168)(E154-H168) were recognized. L56-I70 is the most important epitope, and a competitive ELISA indicated that it could inhibit the binding of monoclonal antibody to Gly m Bd 28K protein. Alanine scanning of L56-I70 documented that F64, Y65, and R66 were the critical amino acids of this epitope. Two bioinformatics tools, ABCpred and BepiPred, were used to predict the epitopes of Gly m Bd 28K, and the predictions were compared with the epitopes that we had located by monoclonal antibodies.
Assuntos
Antígenos de Plantas/química , Antígenos de Plantas/imunologia , Mapeamento de Epitopos , Glicoproteínas/química , Glicoproteínas/imunologia , Imunoglobulina G/imunologia , Proteínas de Soja/química , Proteínas de Soja/imunologia , Sequência de Aminoácidos , Animais , Antígenos de Plantas/metabolismo , Biologia Computacional , Glicoproteínas/antagonistas & inibidores , Glicoproteínas/metabolismo , Camundongos , Peptídeos/química , Peptídeos/farmacologia , Proteínas de Soja/antagonistas & inibidores , Proteínas de Soja/metabolismo
RESUMO
Somatically acquired chromosomal rearrangements occur at early stages during tumorigenesis and can be used to indirectly detect tumor cells, serving as highly sensitive and tumor-specific biomarkers. Advances in high-throughput sequencing have allowed the genome-wide identification of patient-specific chromosomal rearrangements to be used as personalized biomarkers to efficiently assess response to treatment, detect residual disease and monitor disease recurrence. However, sequencing and data processing costs still represent major obstacles to the widespread application of personalized biomarkers in oncology. We developed a computational pipeline (ICRmax) for the cost-effective identification of a minimal set of tumor-specific interchromosomal rearrangements (ICRs). We examined ICRmax performance on sequencing data from rectal tumors and on simulated data, achieving an average accuracy of 68% for ICR identification. ICRmax identifies ICRs from low-coverage sequenced tumors, eliminates the need to sequence a matched normal tissue and significantly reduces the costs that limit the utilization of personalized biomarkers in the clinical setting.
Assuntos
Biomarcadores Tumorais/metabolismo , Aberrações Cromossômicas , Biologia Computacional/métodos , Neoplasias/diagnóstico , Humanos
RESUMO
Gene expression profiling technologies have revolutionized cell biology, enabling researchers to identify gene signatures linked to various biological attributes of melanomas, such as pigmentation status, differentiation state, proliferative versus invasive capacity, and disease progression. Although the discovery of gene signatures has significantly enhanced our understanding of melanocytic phenotypes, reconciling the numerous signatures reported across independent studies and different profiling platforms remains a challenge. Current methods for classifying melanocytic gene signatures depend on exact gene overlap and comparison with unstandardized baseline transcriptomes. In this study, we aimed to categorize published gene signatures into clusters based on their similar patterns of expression across clinical cutaneous melanoma specimens. We analyzed nearly 800 melanoma samples from six gene expression repositories and developed a classification framework for gene signatures that is resilient against biases in gene identification across profiling platforms and inconsistencies in baseline standards. Using 39 frequently cited published gene signatures, our analysis revealed seven principal classes of gene signatures that correlate with previously identified phenotypes: Differentiated, Mitotic/MYC, AXL, Amelanotic, Neuro, Hypometabolic, and Invasive. Each class is consistent with the phenotypes that the constituent gene signatures represent, and our classification method does not rely on overlapping genes between signatures. To facilitate broader application, we created WIMMS (what is my melanocytic signature, available at https://wimms.tanlab.org/), a user-friendly web application. WIMMS allows users to categorize any gene signature, determining its relationship to predominantly cited signatures and its representation within the seven principal classes.
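A small sketch can illustrate how gene signatures may be grouped by their expression patterns across samples rather than by shared genes. It is not the WIMMS method: the abstract does not specify the scoring scheme or similarity measure, so the mean z-score per sample and the Pearson correlation used here are assumptions, and all gene names, signature names, and values are toy placeholders.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one gene's expression across samples."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]


def signature_scores(expression, signature):
    """Score a signature per sample as the mean z-score of its genes.

    Genes missing from the expression table are skipped, so a signature
    can still be scored when platforms differ in gene coverage.
    """
    genes = [g for g in signature if g in expression]
    if not genes:
        raise ValueError("no signature genes found in expression table")
    per_gene = [zscores(expression[g]) for g in genes]
    return [mean(column) for column in zip(*per_gene)]


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


# Toy expression table: gene -> values across four samples (placeholders).
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],
    "gene_c": [1.5, 2.5, 3.5, 4.5],
    "gene_d": [4.0, 3.0, 2.0, 1.0],
}
sig_1 = ["gene_a", "gene_b"]  # shares no genes with sig_2
sig_2 = ["gene_c"]
sig_3 = ["gene_d"]

s1 = signature_scores(expression, sig_1)
s2 = signature_scores(expression, sig_2)
s3 = signature_scores(expression, sig_3)
print(pearson(s1, s2), pearson(s1, s3))
```

In this toy table, `sig_1` and `sig_2` have no genes in common yet rise and fall together across samples, so they correlate strongly and would fall into the same class, while `sig_3` runs in the opposite direction. A matrix of such pairwise correlations could then be passed to any clustering routine.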