ABSTRACT
We describe an update of MirGeneDB, the manually curated microRNA gene database. Adhering to uniform and consistent criteria for microRNA annotation and nomenclature, we substantially expanded MirGeneDB with 30 additional species representing previously missing metazoan phyla such as sponges, jellyfish, rotifers and flatworms. MirGeneDB 2.1 now consists of 75 species spanning over ~800 million years of animal evolution, and contains a total number of 16 670 microRNAs from 1549 families. Over 6000 microRNAs were added in this update using ~550 datasets with ~7.5 billion sequencing reads. By adding new phylogenetically important species, especially those relevant for the study of whole genome duplication events, and through updating evolutionary nodes of origin for many families and genes, we were able to substantially refine our nomenclature system. All changes are traceable in the specifically developed MirGeneDB version tracker. The performance of read-pages is improved and microRNA expression matrices for all tissues and species are now also downloadable. Altogether, this update represents a significant step toward a complete sampling of all major metazoan phyla, and a widely needed foundation for comparative microRNA genomics and transcriptomics studies. MirGeneDB 2.1 is part of RNAcentral and Elixir Norway, publicly and freely available at http://www.mirgenedb.org/.
Subjects
Computational Biology , Databases, Genetic , Evolution, Molecular , Genomics , Animals , Humans , MicroRNAs/classification , MicroRNAs/genetics , Phylogeny
ABSTRACT
Fecal microRNAs represent promising molecules with potential clinical interest as non-invasive diagnostic and prognostic biomarkers. Colorectal cancer (CRC) screening based on the fecal immunochemical test (FIT) is an effective tool for prevention of cancer development. However, due to the poor sensitivity of FIT especially for premalignant lesions, there is a need for implementation of complementary tests. Improving the identification of individuals who would benefit from further investigation with colonoscopy using molecular analysis, such as miRNA profiling of FIT samples, would be ideal due to their widespread use. In the present study, we assessed the feasibility of applying small RNA sequencing to measure human miRNAs in FIT leftover buffer in samples from two European screening populations. We showed robust detection of miRNAs with profiles similar to those obtained from specimens sampled using the established protocol of RNA stabilizing buffers, or in long-term archived samples. Detected miRNAs exhibited differential abundances for CRC, advanced adenoma, and control samples that were consistent for FIT and RNA-stabilizing buffers. Interestingly, the sequencing data also allowed for concomitant evaluation of small RNA-based microbial profiles. We demonstrated that it is possible to explore the human miRNome in FIT leftover samples across populations and envision that the analysis of small RNA biomarkers can complement the FIT in large scale screening settings.
Subjects
Colorectal Neoplasms , MicroRNAs , Humans , MicroRNAs/genetics , Colorectal Neoplasms/diagnosis , Colorectal Neoplasms/genetics , Colorectal Neoplasms/pathology , Feces/chemistry , Early Detection of Cancer/methods , Biomarkers
ABSTRACT
BACKGROUND AND AIMS: Gallbladder cancer (GBC) is a highly aggressive malignancy of the biliary tract. Most cases of GBC are diagnosed in low-income and middle-income countries, and research into this disease has long been limited. In this study we therefore investigate the epigenetic changes along the model of GBC carcinogenesis represented by the sequence gallstone disease → dysplasia → GBC in Chile, the country with the highest incidence of GBC worldwide. APPROACH AND RESULTS: To perform epigenome-wide methylation profiling, genomic DNA extracted from sections of formalin-fixed, paraffin-embedded gallbladder tissue was analyzed using Illumina Infinium MethylationEPIC BeadChips. Preprocessed, quality-controlled data from 82 samples (gallstones n = 32, low-grade dysplasia n = 13, high-grade dysplasia n = 9, GBC n = 28) were available to identify differentially methylated markers, regions, and pathways as well as changes in copy number variations (CNVs). The number and magnitude of epigenetic changes increased with disease development and predominantly involved the hypermethylation of cytosine-guanine dinucleotide islands and gene promoter regions. The methylation of genes implicated in Wnt signaling, Hedgehog signaling, and tumor suppression increased with tumor grade. CNVs also increased with GBC development and affected cyclin-dependent kinase inhibitor 2A, MDM2 proto-oncogene, tumor protein P53, and cyclin D1 genes. Gains in the targetable Erb-B2 receptor tyrosine kinase 2 gene were detected in 14% of GBC samples. CONCLUSIONS: Our results indicate that GBC carcinogenesis comprises three main methylation stages: early (gallstone disease and low-grade dysplasia), intermediate (high-grade dysplasia), and late (GBC). The identified gradual changes in methylation and CNVs may help to enhance our understanding of the mechanisms underlying this aggressive disease and eventually lead to improved treatment and early diagnosis of GBC.
Subjects
DNA Methylation , Epigenesis, Genetic , Gallbladder Neoplasms/genetics , Gallstones/genetics , Hyperplasia/genetics , Carcinogenesis , Cell Line, Tumor , DNA Copy Number Variations , Female , Genes, Neoplasm/genetics , Humans , Male
ABSTRACT
Introduction: Effective strategies for early detection of epithelial ovarian cancer are lacking. We evaluated whether a panel of 14 previously established circulating microRNAs could discriminate between cases diagnosed <2 years after serum collection and those diagnosed 2-7 years after serum collection. miRNA sequencing data from subsequent ovarian cancer cases were obtained as part of the ongoing multi-cancer JanusRNA project, utilizing pre-diagnostic serum samples from the Janus Serum Bank and linked to the Cancer Registry of Norway for cancer outcomes. Methods: We included a total of 80 ovarian cancer cases contributing 80 serum samples and compared 40 serum samples from cases with samples collected <2 years prior to diagnosis with 40 serum samples from cases with sample collection ≥2 to 7 years. We employed the extreme gradient boosting (XGBoost) algorithm to train a binary classification model using 70% of the available data, while the model was tested on the remaining 30% of the dataset. Results: The performance of the model was evaluated using repeated holdout validation. The previously established set of miRNAs achieved a median area under the receiver operating characteristic curve (AUC) of 0.771 in the test sets. Four out of 14 miRNAs (hsa-miR-200a-3p, hsa-miR-1246, hsa-miR-203a-3p, hsa-miR-23b-3p) exhibited higher expression levels closer to diagnosis, consistent with the previously reported upregulation in cancer cases, with statistical significance observed only for hsa-miR-200a-3p (beta=0.14; p=0.04). Discussion: The discrimination potential of the selected models provides evidence of the robustness of the miRNA signature for ovarian cancer.
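The evaluation scheme described above (repeated 70%/30% holdout splits, summarized by the median test-set AUC) can be sketched as follows. This is a minimal illustration, not the study's pipeline: the data are synthetic, the number of repeats and the per-class (stratified) splitting are assumptions, and `fit_mean_difference` is a hypothetical stand-in for the XGBoost classifier so that the sketch needs only the Python standard library.

```python
import random
import statistics


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_holdout(labels, train_fraction, rng):
    """Return (train_idx, test_idx), splitting each class separately (assumption)."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train += idx[:cut]
        test += idx[cut:]
    return train, test


def repeated_holdout_auc(X, y, fit, n_repeats=100, train_fraction=0.7, seed=0):
    """Median test-set AUC over repeated random train/test splits."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_repeats):
        train, test = stratified_holdout(y, train_fraction, rng)
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores = [predict(X[i]) for i in test]
        aucs.append(auc(scores, [y[i] for i in test]))
    return statistics.median(aucs)


def fit_mean_difference(X_train, y_train):
    """Hypothetical stand-in for XGBoost: project onto the class-mean difference."""
    def mean(cls, j):
        vals = [x[j] for x, y in zip(X_train, y_train) if y == cls]
        return sum(vals) / len(vals)

    w = [mean(1, j) - mean(0, j) for j in range(len(X_train[0]))]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


# Synthetic data only: 40 + 40 samples, 14 features, a small shift in feature 0.
rng = random.Random(1)
y = [0] * 40 + [1] * 40
X = [[rng.gauss(0.8 * label if j == 0 else 0.0, 1.0) for j in range(14)] for label in y]
median_auc = repeated_holdout_auc(X, y, fit_mean_difference)
```

Reporting the median over many splits, rather than a single split, reduces the dependence of the estimate on one particular partition of a small sample.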
ABSTRACT
Lung cancer (LC) prognosis is closely linked to the stage of disease when diagnosed. We investigated the biomarker potential of serum RNAs for the early detection of LC in smokers at different prediagnostic time intervals and histological subtypes. In total, 1061 samples from 925 individuals were analyzed. RNA sequencing with an average of 18 million reads per sample was performed. We generated machine learning models using normalized serum RNA levels and found that smokers later diagnosed with LC in 10 years can be robustly separated from healthy controls regardless of histology with an average area under the ROC curve (AUC) of 0.76 (95% CI, 0.68-0.83). Furthermore, the strongest models that took both time to diagnosis and histology into account successfully predicted non-small cell LC (NSCLC) between 6 and 8 years, with an AUC of 0.82 (95% CI, 0.76-0.88), and SCLC between 2 and 5 years, with an AUC of 0.89 (95% CI, 0.77-1.0), before diagnosis. The most important separators were microRNAs, miscellaneous RNAs, isomiRs, and tRNA-derived fragments. We have shown that LC can be detected years before diagnosis and manifestation of disease symptoms independently of histological subtype. However, the highest AUCs were achieved for specific subtypes and time intervals before diagnosis. The collection of models may therefore also predict the severity of cancer development and its histology. Our study demonstrates that serum RNAs can be promising prediagnostic biomarkers in an LC screening setting, from early detection to risk assessment.
Subjects
Carcinoma, Non-Small-Cell Lung , Lung Neoplasms , MicroRNAs , RNA, Neoplasm , Biomarkers, Tumor/blood , Biomarkers, Tumor/genetics , Carcinoma, Non-Small-Cell Lung/blood , Carcinoma, Non-Small-Cell Lung/genetics , Early Detection of Cancer , Humans , Lung Neoplasms/blood , Lung Neoplasms/diagnosis , Lung Neoplasms/genetics , Lung Neoplasms/pathology , MicroRNAs/blood , MicroRNAs/genetics , RNA, Neoplasm/blood , RNA, Neoplasm/genetics , ROC Curve
ABSTRACT
BACKGROUND: Computational biology provides software tools for testing and making inferences about biological data. In the face of increasing volumes of data, heuristic methods that trade software speed for accuracy may be employed. We have studied these trade-offs using the results of a large number of independent software benchmarks, and evaluated whether external factors, including speed, author reputation, journal impact, recency and developer efforts, are indicative of accurate software. RESULTS: We find that software speed, author reputation, journal impact, number of citations and age are unreliable predictors of software accuracy. This is unfortunate because these are frequently cited reasons for selecting software tools. However, GitHub-derived statistics and high version numbers show that accurate bioinformatic software tools are generally the product of many improvements over time. We also find an excess of slow and inaccurate bioinformatic software tools, and this is consistent across many sub-disciplines. There are few tools that are middle-of-road in terms of accuracy and speed trade-offs. CONCLUSIONS: Our findings indicate that accurate bioinformatic software is primarily the product of long-term commitments to software development. In addition, we hypothesise that bioinformatics software suffers from publication bias. Software that is intermediate in terms of both speed and accuracy may be difficult to publish, possibly due to author, editor and reviewer practises. This leaves an unfortunate hole in the literature, as ideal tools may fall into this gap. High accuracy tools are not always useful if they are slow, while high speed is not useful if the results are also inaccurate.
Subjects
Computational Biology , Software , Publishing
ABSTRACT
BACKGROUND: Single-cell RNA sequencing (scRNA-seq) provides high-resolution transcriptome data to understand the heterogeneity of cell populations at the single-cell level. The analysis of scRNA-seq data requires the utilization of numerous computational tools. However, nonexpert users usually experience installation issues, a lack of critical functionality or batch analysis modes, and the steep learning curves of existing pipelines. RESULTS: We have developed cellsnake, a comprehensive, reproducible, and accessible single-cell data analysis workflow, to overcome these problems. Cellsnake offers advanced features for standard users and facilitates downstream analyses in both R and Python environments. It is also designed for easy integration into existing workflows, allowing for rapid analyses of multiple samples. CONCLUSION: As an open-source tool, cellsnake is accessible through Bioconda, PyPi, Docker, and GitHub, making it a cost-effective and user-friendly option for researchers. By using cellsnake, researchers can streamline the analysis of scRNA-seq data and gain insights into the complex biology of single cells.
Subjects
Software , Transcriptome , Single-Cell Analysis , Workflow , Sequence Analysis, RNA , Gene Expression Profiling , RNA
ABSTRACT
Long noncoding RNAs (lncRNAs) play key roles in cell processes and are good candidates for cancer risk prediction. Few studies have investigated the association between individual genotypes and lncRNA expression. Here we integrate three separate datasets with information on lncRNA expression only, both lncRNA expression and genotype, and genotype information only to identify circulating lncRNAs associated with the risk of gallbladder cancer (GBC) using robust linear and logistic regression techniques. In the first dataset, we preselect lncRNAs based on expression changes along the sequence "gallstones → dysplasia → GBC". In the second dataset, we validate associations between genetic variants and serum expression levels of the preselected lncRNAs (cis-lncRNA-eQTLs) and build lncRNA expression prediction models. In the third dataset, we predict serum lncRNA expression based on individual genotypes and assess the association between genotype-based expression and GBC risk. AC084082.3 and LINC00662 showed increasing expression levels (p-value = 0.009), while C22orf34 expression decreased in the sequence from gallstones to GBC (p-value = 0.04). We identified and validated two cis-LINC00662-eQTLs (r2 = 0.26) and three cis-C22orf34-eQTLs (r2 = 0.24). Only LINC00662 showed a genotype-based serum expression associated with GBC risk (OR = 1.25 per log2 expression unit, 95% CI 1.04-1.52, p-value = 0.02). Our results suggest that preselection of lncRNAs based on tissue samples and exploitation of cis-lncRNA-eQTLs may facilitate the identification of circulating noncoding RNAs linked to cancer risk.
ABSTRACT
Elucidation of microRNA activity is a crucial step in understanding gene regulation. One key problem in this effort is how to model the pairwise interactions of microRNAs with their targets. As this interaction is strongly mediated by their sequences, it is desirable to set up a probabilistic model to explain the binding preferences between a microRNA sequence and the sequence of a putative target. To this end, we introduce a new model of microRNA-target binding, which transforms an aligned duplex to a new sequence and defines the likelihood of this sequence using a Variable Length Markov Chain. It offers a complementary representation of microRNA-mRNA pairs for microRNA target prediction tools or other probabilistic frameworks of integrative gene regulation analysis. The performance of the present model is evaluated by its ability to predict microRNA-target mRNA interaction given a mature microRNA sequence and a putative mRNA binding site. In regard to classification accuracy, it outperforms two recent methods based on thermodynamic stability and sequence complementarity. The experiments can also unveil the effects of base pairing types and non-seed region in duplex formation.
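The idea of scoring an aligned duplex with a Variable Length Markov Chain can be sketched roughly as below. The abstract does not specify how the duplex is transformed into a sequence, so `encode_duplex` (one symbol per aligned column) is a hypothetical encoding, and the context-selection rule (longest suffix seen at least `min_count` times, with additive smoothing) is one simple variable-order scheme rather than the authors' estimator. The training strings are toy examples, not real binding data.

```python
import math
from collections import defaultdict


def encode_duplex(mirna_aligned, target_aligned):
    """Hypothetical encoding: one symbol per aligned column of the duplex."""
    assert len(mirna_aligned) == len(target_aligned)
    return [a + b for a, b in zip(mirna_aligned, target_aligned)]


class VariableLengthMarkovChain:
    def __init__(self, max_order=3, min_count=2, alpha=1.0):
        self.max_order = max_order
        self.min_count = min_count  # a context is used only if seen this often
        self.alpha = alpha          # additive smoothing constant
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alphabet = set()

    def fit(self, sequences):
        for seq in sequences:
            self.alphabet.update(seq)
            for i, symbol in enumerate(seq):
                # Count the symbol under every context length 0..max_order.
                for k in range(min(i, self.max_order) + 1):
                    self.counts[tuple(seq[i - k:i])][symbol] += 1
        return self

    def _context(self, history):
        # Longest suffix of the history with enough support; else empty context.
        for k in range(min(len(history), self.max_order), 0, -1):
            context = tuple(history[-k:])
            seen = self.counts.get(context)
            if seen is not None and sum(seen.values()) >= self.min_count:
                return context
        return ()

    def log_likelihood(self, seq):
        vocab = len(self.alphabet) + 1  # one extra slot for unseen symbols
        total = 0.0
        for i, symbol in enumerate(seq):
            nexts = self.counts.get(self._context(seq[:i]), {})
            num = nexts.get(symbol, 0) + self.alpha
            den = sum(nexts.values()) + self.alpha * vocab
            total += math.log(num / den)
        return total


# Toy aligned duplexes (illustrative only), target written in pairing order.
training = [
    encode_duplex("UGAGGUAG", "ACUCCAUC"),
    encode_duplex("UGAGGUAG", "ACUCCAUC"),
    encode_duplex("UAAGGCAC", "AUUCCGUG"),
]
model = VariableLengthMarkovChain(max_order=2).fit(training)
paired = encode_duplex("UGAGGUAG", "ACUCCAUC")
mismatched = encode_duplex("UGAGGUAG", "GGGGGGGG")
```

Under this sketch, a classifier could compare `model.log_likelihood(...)` for a candidate duplex against a model trained on non-binding pairs; that two-model setup is likewise an assumption.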
Subjects
Computer Simulation , MicroRNAs/chemistry , Models, Chemical , RNA, Messenger/chemistry , Probability
ABSTRACT
Although testicular germ cell tumor (TGCT) overall is highly curable, patients may experience late effects after treatment. An increased understanding of the mechanisms behind the development of TGCT may pave the way for better outcomes for patients. To elucidate molecular changes prior to TGCT diagnosis we sequenced small RNAs in serum from 69 patients who were later diagnosed with TGCT and 111 matched controls. The deep RNA profiles, with on average 18 million sequences per sample, comprised nine classes of RNA, including microRNA. We found that circulating RNA signals differed significantly between cases and controls regardless of time to diagnosis. Different levels of TSIX related to X-chromosome inactivation and TEX101 involved in spermatozoa production are among the interesting findings. The RNA signals differed between seminoma and non-seminoma TGCT subtypes, with seminoma cases showing lower levels of RNAs and non-seminoma cases showing higher levels of RNAs, compared with controls. The differentially expressed RNAs were typically associated with cancer-related pathways. Our results indicate that circulating RNA profiles change during TGCT development according to histology and may be useful for early detection of this tumor type.
ABSTRACT
Cancer cell lines allow the identification of clinically relevant alterations and the prediction of drug response. However, sequencing data for hepatobiliary cancer cell lines in general, and particularly gallbladder cancer (GBC), are sparse. Here, we apply RNA sequencing to characterize 10 GBC, eight hepatocellular carcinoma, and five cholangiocarcinoma (CCA) cell lines. RNA extraction, quality control, library preparation, sequencing, and pre-processing of sequencing data were implemented using state-of-the-art techniques. Public data from the MSK-IMPACT database and a large cohort of Japanese biliary tract cancer patients were used to illustrate the usage of the released data. The total number of exonic mutations varied from 7207 for the cell line NOZ to 9760 for HuCCT1. Researchers planning experiments that require TP53 mutations could use the cell lines NOZ, OCUG-1, SNU308, or YoMi. Mz-Cha-1 showed mutations in ATM, SNU308 presented SMAD4 mutations, and the only investigated cell line that showed ARID1A mutations was GB-d1. SNU478 was the cell line with the global gene expression pattern most similar to GBC, intrahepatic CCA, and extrahepatic CCA. EGFR, KMT2D, and KMT2C generally presented a higher expression in the investigated cell lines than in Japanese primary GBC tumors. We provide the scientific community with detailed mutation and gene expression data, together with three showcase applications, with the aim of facilitating the design of future in vitro cell culture assays for research on hepatobiliary cancer.
ABSTRACT
Small non-coding RNAs (sncRNAs) are regulators of cell functions and circulating sncRNAs from the majority of RNA classes are potential non-invasive biomarkers. Understanding how common traits influence sncRNA expression is essential for assessing their biomarker potential. In this study, we identify associations between sncRNA expression and common traits (sex, age, self-reported smoking, body mass, self-reported physical activity). We used RNAseq data from 526 serum samples from the Janus Serum Bank and traits from health examination surveys. Ageing showed the strongest association with sncRNA expression, both in terms of statistical significance and number of RNAs, regardless of RNA class. piRNAs were abundant in the serum samples and they were associated with sex. Interestingly, smoking cessation generally restored RNA expression to non-smoking levels, although for some sncRNAs smoking-related expression levels persisted. Pathway analysis suggests that smoking-related sncRNAs target the cholinergic synapses and may therefore potentially play a role in smoking addiction. Our results show that common traits influence circulating sncRNA expression. It is clear that sncRNA biomarker analyses should be adjusted for age and sex. In addition, for specific sncRNAs, analyses should also be adjusted for body mass, smoking, physical activity and technical factors.
Subjects
Aging/blood , Exercise , RNA, Small Untranslated/blood , Smoking/blood , Adult , Aged , Aging/genetics , Body Mass Index , Female , Humans , Male , Middle Aged , RNA, Small Interfering/blood , RNA, Small Interfering/genetics , RNA, Small Untranslated/genetics , Smoking/genetics , Transcriptome
ABSTRACT
Classifying sequences is one of the central problems in computational biosciences. Several tools have been released to map an unknown molecular entity to one of the known classes using solely its sequence data. However, all of the existing tools are problem-specific and restricted to an alphabet constrained by relevant biological structure. Here, we introduce TRAINER, a new online tool designed to serve as a generic sequence classification platform that enables users to provide their own training data with any alphabet defined therein. TRAINER allows users to select among several feature representation schemes and supervised machine learning methods with relevant parameters. Trained models can be saved for future use without retraining by other users. Two case studies are reported for effective use of the system for DNA and protein sequences: candidate effector prediction and nucleolar localization signal prediction. Biological relevance of the results is discussed.
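As a rough illustration of alphabet-agnostic sequence classification, the sketch below derives k-mer count features directly from user-supplied training strings and fits a multinomial naive Bayes classifier. It does not reproduce TRAINER: the abstract does not list its feature schemes or learning methods, so the k-mer representation, the classifier, the class name `KmerNaiveBayes`, and the toy data over the alphabet {x, y, z} are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def kmer_counts(seq, k=2):
    """Count overlapping k-mers; works for any alphabet the strings use."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class KmerNaiveBayes:
    """Hypothetical stand-in classifier: multinomial naive Bayes on k-mers."""

    def __init__(self, k=2, alpha=1.0):
        self.k = k
        self.alpha = alpha  # additive smoothing constant

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)
        self.kmer_totals = {c: Counter() for c in self.class_counts}
        for seq, label in zip(sequences, labels):
            self.kmer_totals[label].update(kmer_counts(seq, self.k))
        # The vocabulary is learned from the training data, not fixed in advance.
        self.vocab = set().union(*self.kmer_totals.values())
        return self

    def predict(self, seq):
        n = sum(self.class_counts.values())
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen k-mers
        features = kmer_counts(seq, self.k)
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            total = sum(self.kmer_totals[label].values())
            score = math.log(self.class_counts[label] / n)
            for kmer, count in features.items():
                num = self.kmer_totals[label][kmer] + self.alpha
                score += count * math.log(num / (total + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data over an arbitrary alphabet (illustrative only).
train_seqs = ["xyxyxy", "xyxyxx", "zzzyzz", "zzzzzy"]
train_labels = ["A", "A", "B", "B"]
clf = KmerNaiveBayes(k=2).fit(train_seqs, train_labels)
```

Because the feature vocabulary comes from the training strings themselves, the same code applies unchanged to DNA, protein, or any other symbol set.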