ABSTRACT
Since its establishment in 2009, single-cell RNA sequencing (RNA-seq) has been a major driver of progress in biomedical research. The ability to profile single cells confers particular benefits in developmental biology and stem cell studies. Although most studies still focus on individual tissues or organs, the recent development of ultra-high-throughput single-cell RNA-seq has demonstrated the potential to characterize more complex systems, or even the entire body. However, although multiple ultra-high-throughput single-cell RNA-seq systems have attracted attention, no systematic comparison of these systems has been performed. Here, using the same cell line and bioinformatics pipeline, we generated directly comparable datasets for each of three widely used droplet-based ultra-high-throughput single-cell RNA-seq systems: inDrop, Drop-seq, and 10X Genomics Chromium. Although each system is capable of profiling single-cell transcriptomes, detailed comparison revealed the distinguishing features and most suitable applications of each system.
Subject(s)
Gene Expression Profiling/methods; High-Throughput Nucleotide Sequencing; Microfluidic Analytical Techniques; RNA/genetics; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Transcriptome; Automation, Laboratory; Base Sequence; Cell Line; Computational Biology; Cost-Benefit Analysis; DNA Barcoding, Taxonomic; Gene Expression Profiling/economics; High-Throughput Nucleotide Sequencing/economics; Humans; Microfluidic Analytical Techniques/economics; Reproducibility of Results; Sequence Analysis, RNA/economics; Single-Cell Analysis/economics; Workflow
ABSTRACT
Filopodia are slender, actin-filled membrane projections used by various cell types to explore their environment. Analyzing filopodia often involves visualizing them using actin, filopodia tip or membrane markers. Because filopodia are extended by such diverse cell types, from amoeboid to mammalian, it can be challenging to find a reliable filopodia analysis workflow suited to a given cell type and preferred visualization method. The lack of an automated workflow capable of analyzing amoeboid filopodia with only a filopodia tip label prompted the development of filoVision. filoVision is an adaptable deep learning platform featuring two tools, filoTips and filoSkeleton. filoTips labels filopodia tips and the cytosol using a single tip marker, allowing information extraction without actin or membrane markers. In contrast, filoSkeleton combines tip marker signals with actin labeling for a more comprehensive analysis of filopodia shafts in addition to tip protein analysis. The ZeroCostDL4Mic deep learning framework facilitates accessibility and customization for different datasets and cell types, making filoVision a flexible tool for automated analysis of tip-marked filopodia across various cell types and user data.
Subject(s)
Actins; Deep Learning; Animals; Actins/metabolism; Pseudopodia/metabolism; Mammals/metabolism
ABSTRACT
Although top-down (TD) proteomics techniques, aimed at the analysis of intact proteins and proteoforms, are becoming increasingly popular, efforts are needed at several levels to generalise their adoption. In this context, numerous improvements are possible in the area of open science practices, including a greater application of the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles: for example, increased data sharing and readily available open data standards. Additionally, the field would benefit from the development of open data analysis workflows that enable the reuse of public datasets, something that is increasingly common in other proteomics fields.
Subject(s)
Proteins; Proteomics; Proteomics/methods; Proteins/analysis; Workflow
ABSTRACT
DNA methylation is a major epigenetic modification involved in many physiological processes. Normal methylation patterns are disrupted in many diseases, and methylation-based biomarkers have shown promise in several contexts. Marker discovery typically involves the analysis of publicly available DNA methylation data from high-throughput assays. Numerous methods for identifying differentially methylated biomarkers have been developed, making the need for best-practice guidelines and context-specific analysis workflows exceedingly high. To this end, we propose TASA, a novel method for simulating methylation array data under various scenarios. We then comprehensively assess different data analysis workflows using real and simulated data and suggest optimal start-to-finish analysis workflows. Our study demonstrates that the choice of analysis pipeline for DNA methylation-based marker discovery is crucial and differs across contexts.
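The abstract does not specify TASA's generative model, but the shape of such a simulate-then-assess loop can be sketched: draw beta values from Beta distributions (a common model for methylation array data), plant a block of hypothetical differentially methylated probes, and test each probe on M-values. All parameters below are illustrative, not TASA's.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_probes, n_per_group, n_dmp = 1000, 20, 50

    # Group A: baseline; group B: first n_dmp probes shifted (the "true" markers).
    a = rng.uniform(1, 10, n_probes)
    b = rng.uniform(1, 10, n_probes)
    shift = np.r_[np.full(n_dmp, 5.0), np.zeros(n_probes - n_dmp)]
    grp_a = rng.beta(a, b, size=(n_per_group, n_probes))
    grp_b = rng.beta(a + shift, b, size=(n_per_group, n_probes))

    def m_values(beta, eps=1e-6):
        # Logit-transform beta values; M-values behave more like Gaussian data.
        beta = np.clip(beta, eps, 1 - eps)
        return np.log2(beta / (1 - beta))

    t, p = stats.ttest_ind(m_values(grp_a), m_values(grp_b), axis=0)
    print("probes called at p < 0.05:", int((p < 0.05).sum()))

A workflow assessment then asks how many of the planted probes each candidate pipeline recovers, and at what false-positive cost.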
Subject(s)
Biomedical Research; DNA Methylation; Workflow; Epigenesis, Genetic; Data Analysis
ABSTRACT
BACKGROUND: With the increasing dimensionality of flow cytometry data over the past years, there is a growing need to replace or complement traditional manual analysis (i.e. iterative 2D gating) with automated data analysis pipelines. A crucial part of these pipelines consists of pre-processing and applying quality control filtering to the raw data, in order to use only high-quality events in the downstream analyses. This part can in turn be split into a number of elementary steps: signal compensation or unmixing, scale transformation, removal of debris, doublets and dead cells, batch effect correction, etc. However, assembling and assessing the pre-processing part can be challenging for a number of reasons. First, each of the elementary steps can be implemented using various methods and R packages. Second, the order of the steps can affect the downstream analysis results. Finally, each method typically comes with its own specific, non-standardized diagnostics and visualizations, making objective comparison difficult for the end user. RESULTS: Here, we present CytoPipeline and CytoPipelineGUI, two R packages for building, comparing and assessing pre-processing pipelines for flow cytometry data. To exemplify these new tools, we present the steps involved in designing a pre-processing pipeline on a real-life dataset and demonstrate different visual assessment use cases. We also set up a benchmark comparing two pre-processing pipelines that differ by their quality control methods, and show how the packages' visualization utilities can provide crucial insight into the obtained benchmark metrics. CONCLUSION: CytoPipeline and CytoPipelineGUI are two Bioconductor R packages that help build, visualize and assess pre-processing pipelines for flow cytometry data. They increase productivity during pipeline development and testing, and complement benchmarking tools by providing intuitive user insight into benchmarking results.
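CytoPipeline itself is an R/Bioconductor package; the Python sketch below only illustrates the core abstraction the abstract describes: a pre-processing pipeline as an ordered list of named, swappable steps, so that two pipelines differing in a single step (here, the QC method) can be built, traced and compared on the same raw events. All step implementations are placeholders.

    from typing import Callable, List, Tuple

    Step = Tuple[str, Callable]              # (step name, events -> events)

    def run_pipeline(events, steps: List[Step], trace=None):
        # Apply each step in order, optionally recording intermediate states
        # so pipelines can be compared step by step, not just end to end.
        for name, fn in steps:
            events = fn(events)
            if trace is not None:
                trace[name] = events
        return events

    # Placeholder steps (identity functions standing in for real methods).
    compensate = lambda e: e
    transform = lambda e: e
    qc_method_a = lambda e: e
    qc_method_b = lambda e: e

    pipeline_a = [("compensate", compensate), ("transform", transform), ("qc", qc_method_a)]
    pipeline_b = [("compensate", compensate), ("transform", transform), ("qc", qc_method_b)]

    raw_events = ["ev1", "ev2"]              # stand-in for raw cytometry events
    trace_a, trace_b = {}, {}
    run_pipeline(raw_events, pipeline_a, trace_a)
    run_pipeline(raw_events, pipeline_b, trace_b)

Because step order can change downstream results, representing the pipeline as explicit data rather than hard-coded calls is what makes reordering and benchmarking tractable.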
Subject(s)
Data Analysis; Software; Flow Cytometry/methods
ABSTRACT
Mouse tumour models are extensively used as a pre-clinical research tool in oncology, playing an important role in anticancer drug discovery. Accordingly, the demand for next-generation sequencing (NGS) in cancer genomics research is increasing, and the need for suitable data analysis pipelines is growing with it. Most NGS data analysis solutions to date do not support mouse data or require highly specific configuration. Here, we present a genome analysis pipeline for mouse tumour NGS data, comprising a whole-genome sequencing (WGS) analysis flow for somatic variant discovery and an RNA-seq flow for differential expression, functional analysis and neoantigen prediction. The pipeline is based on standards and best practices and integrates mouse genome references and annotations. In a recent study, the pipeline was applied to demonstrate the efficacy of low-dose 6-thioguanine (6TG) treatment of low-mutation melanoma in a pre-clinical mouse model. Here, we extend this study, describe the pipeline and results in detail in terms of tumour mutational burden (TMB) and number of predicted neoantigens, and correlate these with 6TG effects on tumour volume. The pipeline was expanded to include a neoantigen analysis, resulting in neopeptide prediction and MHC class I antigen presentation evaluation. We observed that the number of predicted neoepitopes was a more accurate indicator of tumour immune control than TMB. In conclusion, this study demonstrates the usability of the proposed pipeline and suggests it could serve as a robust genome analysis platform for future mouse genomic analyses.
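TMB, one of the two reported indicators, is conventionally the number of somatic mutations per megabase of callable sequence. A minimal sketch (the variant count and callable-region size below are placeholders, not the study's values):

    def tumour_mutational_burden(n_somatic_variants: int, callable_bases: int) -> float:
        # Somatic mutations per megabase of callable genome.
        return n_somatic_variants / (callable_bases / 1e6)

    # e.g. 2,500 somatic variants over 2.5 Gb of callable genome -> 1.0 mutations/Mb
    print(tumour_mutational_burden(2_500, 2_500_000_000))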
Subject(s)
Melanoma; Thioguanine; Animals; Mice; Thioguanine/pharmacology; Genomics/methods; Mutation; RNA-Seq
ABSTRACT
Chromatin immunoprecipitation coupled with sequencing (ChIP-seq) is a technique used to identify protein-DNA interaction sites through antibody pull-down, sequencing and analysis, with enrichment 'peak' calling being the most critical analytical step. Benchmarking studies have consistently shown that peak callers have distinct selectivity and specificity characteristics that are not additive and seldom completely overlap in many scenarios, even after parameter optimization. We therefore developed ChIP-AP, an integrated ChIP-seq analysis pipeline utilizing four independent peak callers, which seamlessly processes raw sequencing files to final results. This approach (1) enables better gauging of peak confidence through detection by multiple algorithms, and (2) surveys the binding landscape more thoroughly by capturing peaks not detected by individual callers. Final analysis results are integrated into a single output table, enabling users to explore their data by applying the selectivity and sensitivity thresholds that best address their biological questions, without any additional reprocessing. ChIP-AP therefore presents investigators with a more comprehensive coverage of the binding landscape without requiring additional wet-lab observations.
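One simple way to realize the two benefits described, gauging confidence by multi-caller detection while keeping caller-specific peaks, is to merge all reported intervals and count how many distinct callers support each merged peak. This is an illustrative sketch only; the caller names are placeholders and ChIP-AP's actual integration logic is more involved.

    def merge_with_support(calls):
        # calls: dict of caller name -> list of (start, end) peaks on one chromosome.
        events = sorted((s, e, c) for c, ivs in calls.items() for s, e in ivs)
        merged = []                                  # [(start, end, {callers})]
        for s, e, c in events:
            if merged and s <= merged[-1][1]:        # overlaps the current cluster
                ps, pe, cs = merged[-1]
                merged[-1] = (ps, max(pe, e), cs | {c})
            else:
                merged.append((s, e, {c}))
        return [(s, e, len(cs)) for s, e, cs in merged]

    calls = {"caller1": [(100, 200), (500, 650)],
             "caller2": [(120, 210)],
             "caller3": [(580, 700), (900, 950)]}
    print(merge_with_support(calls))
    # -> [(100, 210, 2), (500, 700, 2), (900, 950, 1)]

Thresholding on the support count then trades sensitivity (count >= 1) against selectivity (count equal to the number of callers), which is exactly the choice the single output table leaves to the user.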
Subject(s)
Chromatin Immunoprecipitation Sequencing; Oligonucleotide Array Sequence Analysis/methods; Sequence Analysis, DNA/methods; Algorithms; Benchmarking; Cell Line; Chromatin Immunoprecipitation; Software; Transcription Factors
ABSTRACT
The nucleolus is a large nuclear body that serves as the primary site for ribosome biogenesis. Recent studies have suggested that it also plays an important role in organizing chromatin architecture. However, to establish a causal relationship between nucleolar ribosome assembly and chromatin architecture, genetic tools are required to disrupt nucleolar ribosome biogenesis. In this study, we used ATAC-seq to investigate changes in chromatin accessibility upon specific depletion of two ribosome biogenesis components, RPOA-2 and GRWD-1, in the model organism Caenorhabditis elegans. To facilitate the analysis of ATAC-seq data, we introduced two tools: SRAlign, an extensible NGS data processing workflow, and SRAtac, a customizable end-to-end ATAC-seq analysis pipeline. Our results revealed highly comparable changes in chromatin accessibility following both RPOA-2 and GRWD-1 perturbations. However, we observed a weak correlation between changes in chromatin accessibility and gene expression. While our findings corroborate the idea of a feedback mechanism between ribosomal RNA synthesis, nucleolar ribosome large subunit biogenesis, and chromatin structure during the L1 stage of C. elegans development, they also prompt questions regarding the functional impact of these alterations on gene expression.
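The reported weak correlation is the kind of quantity obtained by pairing each gene's accessibility change with its expression change and computing a rank correlation; a minimal sketch with hypothetical values:

    import numpy as np
    from scipy.stats import spearmanr

    atac_lfc = np.array([0.8, -0.2, 1.5, 0.1, -1.0])  # per-gene accessibility log fold-change
    rna_lfc = np.array([0.3, 0.1, 0.2, 0.0, -0.1])    # matched expression log fold-change
    rho, p = spearmanr(atac_lfc, rna_lfc)
    print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")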
Subject(s)
Caenorhabditis elegans; Chromatin Immunoprecipitation Sequencing; Animals; Caenorhabditis elegans/genetics; Chromatin/genetics; RNA, Ribosomal/genetics; Ribosomes
ABSTRACT
Microorganisms in deep-sea hydrothermal vents provide valuable insights into life under extreme conditions. Mass spectrometry-based proteomics has been widely used to identify protein expression and function. However, metaproteomic studies of deep-sea microbiota have been constrained largely by low protein and peptide identification rates. To improve the efficiency of metaproteomics for hydrothermal vent microbiota, we first constructed a microbial gene database (HVentDB) based on 117 public metagenomic samples from hydrothermal vents and proposed a metaproteomic analysis strategy that takes advantage not only of the sample-matched metagenome, but also of the metagenomic information released publicly by the hydrothermal vent community. A two-stage false discovery rate method was then applied to control the risk of false positives. By applying our community-supported strategy to a hydrothermal vent sediment sample, about twice as many peptides were identified compared with searches against the sample-matched metagenome or the public reference database. In addition, the HVentDB-based approach exclusively detected more enriched and interpretable taxonomic and functional profiles, as well as many important proteins involved in methane, amino acid, sugar and glycan metabolism, DNA repair, and other processes. The new metaproteomic analysis strategy will enhance our understanding of microbiota, including their lifestyles and metabolic capabilities, in extreme environments. HVentDB is freely accessible at http://lilab.life.sjtu.edu.cn:8080/HventDB/main.html.
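The two-stage procedure itself is not detailed in the abstract, but the target-decoy estimate underlying false discovery rate control at each stage is standard: at a given score threshold, FDR is approximated by the ratio of decoy to target matches. A minimal sketch with made-up peptide-spectrum matches (PSMs):

    def fdr_at_threshold(psms, threshold):
        # psms: list of (score, is_decoy). Estimated FDR among hits >= threshold.
        targets = sum(1 for s, d in psms if s >= threshold and not d)
        decoys = sum(1 for s, d in psms if s >= threshold and d)
        return decoys / targets if targets else 0.0

    psms = [(50, False), (48, False), (47, True), (45, False), (40, True)]
    print(fdr_at_threshold(psms, 44))   # 1 decoy / 3 targets ~= 0.33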
Subject(s)
Hydrothermal Vents/microbiology; Metagenome; Metagenomics/methods; Microbiota/genetics; Peptides/genetics; Proteogenomics/methods; Proteome/genetics; Amino Acid Sequence/genetics; DNA, Ribosomal/genetics; Databases, Genetic; Genes, Microbial; Phylogeny
ABSTRACT
BACKGROUND: Genome-wide protein-DNA binding is popularly assessed using specific antibody pull-down in Chromatin Immunoprecipitation Sequencing (ChIP-Seq) or Cleavage Under Targets and Release Using Nuclease (CUT&RUN) experiments. These technologies generate high-throughput sequencing data whose analysis requires multiple sophisticated, computationally intensive genomic tools, which often have a high barrier to use because of computational resource constraints. RESULTS: We present a comprehensive, infrastructure-independent computational pipeline called SEAseq, which leverages field-standard, open-source tools for processing and analyzing ChIP-Seq/CUT&RUN data. SEAseq performs extensive analyses on the raw output of the experiment, including alignment, peak calling, motif analysis, promoter and metagene coverage profiling, peak annotation distribution, identification of clustered/stitched peaks (e.g. super-enhancers), and multiple relevant quality assessment metrics, and it automatically interfaces with data in GEO/SRA. SEAseq provides a rapid and cost-effective resource for the analysis of both new and publicly available datasets, as demonstrated in our comparative case studies. CONCLUSIONS: The easy-to-use and versatile design of SEAseq makes it a reliable and efficient resource for ensuring high-quality analysis. Its cloud implementation enables a broad suite of analyses in environments with constrained computational resources. SEAseq is platform-independent and aims to be usable by everyone, with or without programming skills. It is available on the cloud at https://platform.stjude.cloud/workflows/seaseq and can be installed locally from the repository at https://github.com/stjude/seaseq.
Subject(s)
Chromatin; Software; Chromatin Immunoprecipitation; Chromatin Immunoprecipitation Sequencing; Cloud Computing; High-Throughput Nucleotide Sequencing
ABSTRACT
BACKGROUND: Bioassessment and biomonitoring of meat products aim to identify and quantify adulterants and contaminants, such as meat from unexpected sources and microbes. Several methods for determining the biological composition of mixed samples have been used, including metabarcoding, metagenomics and mitochondrial metagenomics. In this study, we aimed to develop a method based on next-generation DNA sequencing to assess samples that might contain meat from any of 15 mammalian and avian species commonly relevant to meat bioassessment and biomonitoring. RESULTS: We found that the meat composition of the 15 species could not be identified with the metabarcoding approach because of the lack of universal primers or insufficient discrimination power. Consequently, we developed and evaluated a meat mitochondrial metagenomics (3MG) method. The 3MG method has four steps: (1) extraction of sequencing reads from mitochondrial genomes (mitogenomes); (2) assembly of mitogenomes; (3) mapping of mitochondrial reads to the assembled mitogenomes; and (4) biomass estimation based on the number of uniquely mapped reads. The method was implemented in a Python script called 3MG. Analysis of simulated datasets showed that the method can determine contaminant composition at a proportion of 2% with a relative error of < 5%. To evaluate the performance of 3MG, we constructed and analysed mixed samples derived from 15 animal species in equal mass proportions, and then mixed samples derived from two animal species (pork and chicken) in differing ratios. DNA was extracted and used to construct 21 libraries for next-generation sequencing. Analysis of the 15-species mix successfully identified 12 of the 15 (80%) animal species tested. Analysis of the two-species mixes revealed correlation coefficients of 0.98 for pork and 0.98 for chicken between the number of uniquely mapped reads and the mass proportion. CONCLUSION: To the best of our knowledge, this study is the first to demonstrate the potential of the non-targeted 3MG method as a tool for accurately estimating biomass in meat mix samples. The method has potentially broad applications in meat product safety.
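Step (4), biomass estimation from uniquely mapped reads, reduces to normalizing per-species read counts into proportions. The sketch below adds a mitogenome-length normalization as an assumption (species' mitogenomes differ in length, and the abstract does not state how, or whether, 3MG corrects for this); the lengths shown are approximate:

    def biomass_proportions(unique_reads, mitogenome_lengths):
        # Both arguments: dict species -> uniquely mapped read count / length in bp.
        density = {sp: unique_reads[sp] / mitogenome_lengths[sp] for sp in unique_reads}
        total = sum(density.values())
        return {sp: d / total for sp, d in density.items()}

    print(biomass_proportions({"pork": 9_800, "chicken": 200},
                              {"pork": 16_613, "chicken": 16_775}))
    # -> roughly {'pork': 0.98, 'chicken': 0.02}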
Subject(s)
Genome, Mitochondrial; Metagenomics; Animals; Mammals; Meat; Sequence Analysis, DNA
ABSTRACT
The reproducibility crisis in neuroimaging has led to an increased demand for standardized data processing workflows. Within the ENIGMA consortium, we developed HALFpipe (Harmonized Analysis of Functional MRI pipeline), an open-source, containerized, user-friendly tool that facilitates reproducible analysis of task-based and resting-state fMRI data through uniform application of preprocessing, quality assessment, single-subject feature extraction, and group-level statistics. It provides state-of-the-art preprocessing using fMRIPrep without the requirement for input data in Brain Imaging Data Structure (BIDS) format. HALFpipe extends the functionality of fMRIPrep with additional preprocessing steps, which include spatial smoothing, grand mean scaling, temporal filtering, and confound regression. HALFpipe generates an interactive quality assessment (QA) webpage to rate the quality of key preprocessing outputs and raw data in general. HALFpipe features myriad post-processing functions at the individual subject level, including calculation of task-based activation, seed-based connectivity, network-template (or dual) regression, atlas-based functional connectivity matrices, regional homogeneity (ReHo), and fractional amplitude of low-frequency fluctuations (fALFF), offering support to evaluate a combinatorial number of features or preprocessing settings in one run. Finally, flexible factorial models can be defined for mixed-effects regression analysis at the group level, including multiple comparison correction. Here, we introduce the theoretical framework in which HALFpipe was developed, and present an overview of the main functions of the pipeline. HALFpipe offers the scientific community a major advance toward addressing the reproducibility crisis in neuroimaging, providing a workflow that encompasses preprocessing, post-processing, and QA of fMRI data, while broadening core principles of data analysis for producing reproducible results. Instructions and code can be found at https://github.com/HALFpipe/HALFpipe.
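Of the preprocessing steps listed, confound regression is the most compact to illustrate: nuisance time series (e.g. motion parameters) are regressed out of each voxel's signal by ordinary least squares, and the residuals are carried forward. This is a generic sketch, not HALFpipe's implementation:

    import numpy as np

    def regress_confounds(y, confounds):
        # y: (n_timepoints,) voxel signal; confounds: (n_timepoints, k) nuisance regressors.
        X = np.column_stack([np.ones(len(y)), confounds])   # include an intercept
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ beta                                 # residual time series

    rng = np.random.default_rng(0)
    motion = rng.standard_normal((200, 6))                  # e.g. 6 motion parameters
    signal = motion @ rng.standard_normal(6) + rng.standard_normal(200)
    cleaned = regress_confounds(signal, motion)
    print(float(np.std(cleaned)) < float(np.std(signal)))   # True: confound variance removed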
Subject(s)
Image Processing, Computer-Assisted; Magnetic Resonance Imaging; Brain/diagnostic imaging; Brain/physiology; Brain Mapping/methods; Humans; Image Processing, Computer-Assisted/methods; Magnetic Resonance Imaging/methods; Neuroimaging/methods; Reproducibility of Results
ABSTRACT
Imaging Mass Cytometry (IMC) is a powerful high-throughput technique enabling resolution of up to 37 markers in a single fixed tissue section while preserving in situ spatial relationships. Currently, IMC processing and analysis necessitate multiple software packages, labour-intensive pipeline development, different operating systems and knowledge of bioinformatics, all of which are a barrier to many potential users. Here we present TITAN - an open-source, single-environment, end-to-end pipeline that can be utilized for image visualization, segmentation, analysis and export of IMC data. TITAN is implemented as an extension within the publicly available 3D Slicer software. We demonstrate the utility, application, reliability and comparability of TITAN using publicly available IMC data from recently published breast cancer and COVID-19 lung injury studies. Compared with current IMC analysis methods, TITAN provides a user-friendly, efficient single environment to accurately visualize, segment and analyze IMC data for all users.
Subject(s)
COVID-19; Data Analysis; Humans; Image Cytometry/methods; Image Processing, Computer-Assisted/methods; Reproducibility of Results; Software
ABSTRACT
BACKGROUND: Rapid analysis of SARS-CoV-2 genomic data plays a crucial role in surveillance and in the adoption of measures to control the spread of COVID-19. Fast, inclusive and adaptive methods are required for the heterogeneous SARS-CoV-2 sequence data being generated at an unprecedented rate. RESULTS: We present an updated version of the SARS-CoV-2 analysis module of our automated computational pipeline, Infectious Pathogen Detector (IPD) 2.0, which performs genomic analysis to understand the variability and dynamics of the virus. It adopts the recent clade nomenclature and demonstrates a clade prediction accuracy of 92.8%. IPD 2.0 also contains a SARS-CoV-2 updater module, allowing automatic updating of the variant database using genome sequences from GISAID. As a proof of principle, by analyzing 208,911 SARS-CoV-2 genome sequences, we generated an extensive database of 2.58 million sample-wise variants. A comparative account of lineage-specific mutations in the newer SARS-CoV-2 strains emerging in the UK, South Africa and Brazil, together with data reported from India, identifies overlapping and lineage-specific acquired mutations, suggesting repeated convergent and adaptive evolution. CONCLUSIONS: A novel and dynamic feature of the SARS-CoV-2 module of IPD 2.0 makes it a contemporary tool for analyzing the diverse and growing genomic strains of the virus, and a vital tool to help facilitate rapid genomic surveillance in a population and identify variants involved in breakthrough infections. IPD 2.0 is freely available from http://www.actrec.gov.in/pi-webpages/AmitDutt/IPD/IPD.html and the web application is available at http://ipd.actrec.gov.in/ipdweb/.
Subject(s)
COVID-19; SARS-CoV-2; Brazil; Genome, Viral; Humans; Mutation; Phylogeny
ABSTRACT
BACKGROUND: Current high-throughput technologies - e.g. whole genome sequencing, RNA-Seq, ChIP-Seq - generate huge amounts of data, and their usage grows more widespread with each passing year. Complex analysis pipelines involving several computationally intensive steps have to be applied to an increasing number of samples. Workflow management systems allow parallelization and more efficient usage of computational power. Nevertheless, this mostly happens by assigning the available cores to the pipeline of a single sample, or a few samples, at a time; we refer to this approach as the naive parallel strategy (NPS). Here, we discuss an alternative approach, the concurrent execution strategy (CES), which equally distributes the available processors across every sample's pipeline. RESULTS: Theoretically, we show that under loose conditions the CES yields a substantial speedup, with an ideal gain ranging from 1 up to the number of samples. We also observe that the CES yields even faster executions in practice, since parallelizable tasks scale sub-linearly. We tested both strategies on a whole exome sequencing pipeline applied to three publicly available matched tumour-normal sample pairs of gastrointestinal stromal tumour. The CES achieved speedups in latency of up to 2-2.4x compared to the NPS. CONCLUSIONS: Our results suggest that if resource distribution is further tailored to specific situations, an even greater gain in the performance of multi-sample pipeline execution could be achieved. For this to be feasible, benchmarking of the tools included in the pipeline would be necessary; in our opinion, these benchmarks should be performed consistently by the tools' developers. Finally, these results suggest that concurrent strategies might also lead to energy and cost savings by making the usage of low-power machine clusters feasible.
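The 1-to-n ideal range can be made concrete with a minimal Amdahl-style model (our illustration; the paper's stated conditions are looser). With n samples, p cores and per-sample serial fraction s, running samples one after another with all p cores (NPS) versus all at once with p/n cores each (CES) gives

    T_{\mathrm{NPS}} = n\left(s + \frac{1-s}{p}\right), \qquad
    T_{\mathrm{CES}} = s + \frac{(1-s)\,n}{p}, \qquad
    S = \frac{T_{\mathrm{NPS}}}{T_{\mathrm{CES}}} = \frac{n\,(sp + 1 - s)}{sp + (1-s)\,n},

so S -> 1 for a perfectly parallel pipeline (s = 0) and S -> n for a fully serial one (s = 1); sub-linear scaling of the parallel portion pushes real speedups above this idealized S.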
Subject(s)
Computational Biology; Exome Sequencing; High-Throughput Nucleotide Sequencing; Software; Chromatin Immunoprecipitation Sequencing; Computational Biology/methods; Exome Sequencing/standards; Workflow
ABSTRACT
Molecular profiling of tumor biopsies plays an increasingly important role not only in cancer research, but also in the clinical management of cancer patients. Multi-omics approaches hold the promise of improving diagnostics, prognostics and personalized treatment. To deliver on this promise of precision oncology, appropriate bioinformatics methods for managing, integrating and analyzing large and complex data are necessary. Here, we discuss the specific requirements of bioinformatics methods and software that arise in the setting of clinical oncology, owing to a stricter regulatory environment and the need for rapid, highly reproducible and robust procedures. We describe the workflow of a molecular tumor board and the specific bioinformatics support that it requires, from the primary analysis of raw molecular profiling data to the automatic generation of a clinical report and its delivery to decision-making clinical oncologists. Such workflows have to various degrees been implemented in many clinical trials, as well as in molecular tumor boards at specialized cancer centers and university hospitals worldwide. We review these and more recent efforts to include other high-dimensional multi-omics patient profiles into the tumor board, as well as the state of clinical decision support software to translate molecular findings into treatment recommendations.
Subject(s)
Computational Biology; Medical Oncology; Precision Medicine; High-Throughput Nucleotide Sequencing; Humans
ABSTRACT
Fish skin hosts the mucosal microbiome of the largest and oldest group of vertebrates, making it an ideal site for studying microbial community ecology, with practical applications in agriculture and veterinary medicine. These selective microbiomes are dominated by Proteobacteria, with compositions distinct from the surrounding water. Core taxa make up a small percentage of those present and are currently functionally uncharacterized. Methods for skin sampling, DNA extraction and amplification, and sequence data processing vary widely across the field, and reanalysis of recent studies using a consistent pipeline revealed that some conclusions changed in statistical significance. Furthermore, 16S rRNA gene sequencing approaches lack quantitation of microbes and copy-number adjustment. This inconsistency is therefore a serious limitation for comparison across studies. The most significant area for future study, requiring metagenomic and metabolomic data, is the biochemical pathways and functions within the microbiome community, the interactions between members, and how the resulting effects on fish host health are linked to specific nutrients and microbial species. Genes linked to skin colonization, such as those for attachment or mucin degradation, need to be uncovered and explored. Skin immunity factors need to be directly linked to microbiome composition and individual taxa. The basic foundation has been laid, and many exciting discoveries remain.
Subject(s)
Bacteria; Microbiota; Animals; Bacteria/genetics; Fishes; Metagenomics; Microbiota/genetics; Skin
ABSTRACT
BACKGROUND: Analysing whole genome bisulfite sequencing datasets is a data-intensive task that requires comprehensive and reproducible workflows to generate valid results. While many algorithms have been developed for individual tasks such as alignment, comprehensive end-to-end pipelines are still scarce. Furthermore, previous pipelines lack features or show technical deficiencies that impede analyses. RESULTS: We developed wg-blimp (whole genome bisulfite sequencing methylation analysis pipeline) as an end-to-end pipeline to ease whole genome bisulfite sequencing data analysis. It integrates established algorithms for alignment, quality control, methylation calling, detection of differentially methylated regions, and methylome segmentation, requiring only a reference genome and raw sequencing data as input. Comparing wg-blimp to previous end-to-end pipelines reveals similar setups for common sequence processing tasks, but differences in post-alignment analyses. We improve on previous pipelines by providing a more comprehensive analysis workflow as well as an interactive user interface. To demonstrate wg-blimp's ability to produce correct results, we used it to call differentially methylated regions for two publicly available datasets. We were able to replicate 112 of 114 previously published regions, and found results to be consistent with previous findings. We further applied wg-blimp to a publicly available embryonic stem cell sample to showcase methylome segmentation. As expected, unmethylated regions were in close proximity to transcription start sites. Segmentation results were consistent with previous analyses, despite different reference genomes and sequencing techniques. CONCLUSIONS: wg-blimp provides a comprehensive analysis pipeline for whole genome bisulfite sequencing data as well as a user interface for simplified result inspection. We demonstrated its applicability by analysing multiple publicly available datasets. Thus, wg-blimp is a relevant alternative to previous analysis pipelines and may facilitate future epigenetic research.
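Among the integrated steps, methylation calling reduces at each CpG to a ratio of bisulfite-converted read counts; a minimal sketch of that per-site computation:

    def methylation_level(n_methylated: int, n_unmethylated: int) -> float:
        # Fraction of reads supporting methylation at a single CpG site.
        covered = n_methylated + n_unmethylated
        return n_methylated / covered if covered else float("nan")

    print(methylation_level(27, 3))   # 27 of 30 reads methylated -> 0.9

Differentially methylated region callers then aggregate such per-site levels across neighbouring CpGs and compare them between sample groups.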
Subject(s)
Sequence Analysis, DNA; Software; Sulfites/chemistry; Whole Genome Sequencing; DNA Methylation; Databases, Genetic; Humans; User-Computer Interface
ABSTRACT
Single-cell RNA-seq has enabled gene expression to be studied at an unprecedented resolution. The promise of this technology is attracting a growing user base for single-cell analysis methods. As more analysis tools become available, it is becoming increasingly difficult to navigate this landscape and produce an up-to-date workflow to analyse one's data. Here, we detail the steps of a typical single-cell RNA-seq analysis, including pre-processing (quality control, normalization, data correction, feature selection, and dimensionality reduction) and cell- and gene-level downstream analysis. We formulate current best-practice recommendations for these steps based on independent comparison studies. We have integrated these best-practice recommendations into a workflow, which we apply to a public dataset to further illustrate how these steps work in practice; our documented case study can be found at https://www.github.com/theislab/single-cell-tutorial. This review will serve as a workflow tutorial for new entrants into the field, and help established users update their analysis pipelines.
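The tutorial itself builds on the scanpy ecosystem; an abbreviated sketch of the pre-processing order described above follows (the input path and parameter values are illustrative, not the tutorial's recommendations):

    import scanpy as sc

    adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")   # hypothetical input path

    # Quality control: drop low-quality cells and rarely detected genes.
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)

    # Normalization: per-cell depth scaling followed by log transform.
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Feature selection and dimensionality reduction.
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata = adata[:, adata.var.highly_variable].copy()
    sc.pp.pca(adata, n_comps=50)

    # Cell-level downstream analysis: neighbourhood graph, clustering, embedding.
    sc.pp.neighbors(adata)
    sc.tl.leiden(adata)
    sc.tl.umap(adata)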
Subject(s)
Gene Expression Profiling/methods; Single-Cell Analysis/methods; Guidelines as Topic; High-Throughput Nucleotide Sequencing; Internet; Sequence Analysis, RNA; Workflow
ABSTRACT
BACKGROUND: RNA-Seq technology is routinely used to characterize the transcriptome and to detect gene expression differences among cell types, genotypes and conditions. Advances in short-read sequencing instruments such as the Illumina NextSeq have yielded easy-to-operate machines with high throughput at a lower price per base. However, processing these data requires bioinformatics expertise to tailor and execute specific solutions for each type of library preparation. RESULTS: To enable fast and user-friendly data analysis, we developed an intuitive and scalable transcriptome pipeline that executes the full process, starting from cDNA sequences derived by RNA-Seq [Nat Rev Genet 10:57-63, 2009] and bulk MARS-Seq [Science 343:776-779, 2014] and ending with sets of differentially expressed genes. Output files are placed in structured folders, and results are summarized in rich and comprehensive reports containing dozens of plots, tables and links. CONCLUSION: Our User-friendly Transcriptome Analysis Pipeline (UTAP) is an open-source, web-based intuitive platform available to the biomedical research community, enabling researchers to efficiently and accurately analyse transcriptome sequence data.
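UTAP wraps established statistical tools for the final differential expression step; in spirit, the per-gene comparison rests on depth-normalized between-condition fold changes, as in this toy sketch (counts and library sizes are made up):

    import numpy as np

    counts = {"geneA": ([250, 300, 280], [900, 1000, 950]),    # (cond1 reps, cond2 reps)
              "geneB": ([500, 520, 480], [510, 470, 505])}
    lib_sizes = ([1.0e6, 1.2e6, 1.1e6], [1.05e6, 1.0e6, 0.95e6])

    for gene, (c1, c2) in counts.items():
        cpm1 = np.mean(np.array(c1) / np.array(lib_sizes[0]) * 1e6)   # counts per million
        cpm2 = np.mean(np.array(c2) / np.array(lib_sizes[1]) * 1e6)
        print(gene, "log2FC =", round(float(np.log2(cpm2 / cpm1)), 2))

Real pipelines additionally model count dispersion and test significance; the sketch shows only the normalization-and-ratio core.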