ABSTRACT
Quantifying microbiome species and composition from metagenomic assays is often challenging because the analysis is time-consuming and computationally complex. In bioinformatics, k-mer-based approaches have long been established to expedite the analysis of large sequencing data and are now widely used to annotate metagenomic data. We make use of k-mer counting techniques for efficient and accurate compositional analysis of microbiota from whole metagenome sequencing. Mibianto solves this problem by operating directly on read files, without manual preprocessing or complete data exchange. It handles diverse sequencing platforms, including short single-end, paired-end, and long read technologies. Our sketch-based workflow significantly reduces the data volume transferred from the user to the server (up to 99.59% size reduction) to subsequently perform taxonomic profiling with enhanced efficiency and privacy. Mibianto offers functionality beyond k-mer quantification; it supports advanced community composition estimation, including diversity, ordination, and differential abundance analysis. Our tool aids in the standardization of computational workflows, thus supporting reproducibility of scientific sequencing studies. It is adaptable to small- and large-scale experimental designs and offers a user-friendly interface, thus making it an invaluable tool for both clinical and research-oriented metagenomic studies. Mibianto is freely available without the need for a login at: https://www.ccb.uni-saarland.de/mibianto.
Subject(s)
Metagenomics , Microbiota , Software , Metagenomics/methods , Microbiota/genetics , Humans , Metagenome , High-Throughput Nucleotide Sequencing/methods , Internet , Workflow , Sequence Analysis, DNA/methods , Computational Biology/methods
ABSTRACT
MOTIVATION: Automated chromatin segmentation based on ChIP-seq (chromatin immunoprecipitation followed by sequencing) data reveals insights into the epigenetic regulation of chromatin accessibility. Existing segmentation methods are constrained by simplifying modeling assumptions, which may have a negative impact on the segmentation quality. RESULTS: We introduce EpiSegMix, a novel segmentation method based on a hidden Markov model with flexible read count distribution types and state duration modeling, allowing for a more flexible modeling of both histone signals and segment lengths. In a comparison with existing tools, ChromHMM, Segway, and EpiCSeg, we show that EpiSegMix is more predictive of cell biology, such as gene expression. Its flexible framework enables it to fit an accurate probabilistic model, which has the potential to increase the biological interpretability of chromatin states. AVAILABILITY AND IMPLEMENTATION: Source code: https://gitlab.com/rahmannlab/episegmix.
Subject(s)
Chromatin , Epigenesis, Genetic , Sequence Analysis, DNA/methods , Histones/metabolism , Software , High-Throughput Nucleotide Sequencing/methods
ABSTRACT
MOTIVATION: Clustering T-cell receptor repertoire (TCRR) sequences according to antigen specificity is challenging. The previously published tool GLIPH needs several days to weeks for clustering large repertoires, making its use impractical in larger studies. In addition, the methodology used in GLIPH suffers from shortcomings, including non-determinism, potential loss of significant antigen-specific sequences or inclusion of too many unspecific sequences. RESULTS: We present an algorithm for clustering TCRR sequences that scales efficiently to large repertoires. We clustered 36 real datasets with up to 62 000 unique CDR3β sequences using an implementation of our method called ting, as well as GLIPH and its successor GLIPH2. While GLIPH required multiple weeks, ting only needed about one minute for the same task. GLIPH2 is comparably fast, but uses a different grouping paradigm. In addition, we found that in naïve repertoires, where no or very few antigen-specific CDR3 sequences or clusters should exist, our method indeed selects far fewer motifs and produces smaller clusters. AVAILABILITY AND IMPLEMENTATION: Our method has been implemented in Python as a tool called ting. It is available from GitHub (https://github.com/FelixMoelder/ting) or PyPI under the MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
MOTIVATION: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. RESULTS: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. AVAILABILITY AND IMPLEMENTATION: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
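To make the exponential-tail statement concrete, the following minimal Python sketch fits a rate parameter to the score excesses above a cutoff and turns it into a tail probability. It is an assumption-laden illustration: the abstract does not specify the authors' estimator, so this uses the textbook maximum-likelihood estimate for an exponential distribution, and the names `fit_exponential_tail`, `tail_probability` and the `threshold` parameter are hypothetical. Alignment scores are discrete, so the continuous model is only an approximation.

```python
import math


def fit_exponential_tail(scores, threshold):
    """Estimate the rate of an exponential tail from scores above threshold.

    For excesses x_i = s_i - threshold modelled as Exp(rate), the
    maximum-likelihood estimate is rate = n / sum(x_i).
    """
    excess = [s - threshold for s in scores if s > threshold]
    if not excess:
        raise ValueError("no scores above threshold")
    return len(excess) / sum(excess)


def tail_probability(score, threshold, rate):
    """P(S >= score | S >= threshold) under the fitted exponential tail."""
    return math.exp(-rate * max(0.0, score - threshold))


if __name__ == "__main__":
    # Hypothetical scores, e.g. from alignments of shuffled queries.
    null_scores = [10, 12, 14, 16]
    rate = fit_exponential_tail(null_scores, threshold=10)
    print(rate, tail_probability(14, 10, rate))
```

A score whose tail probability is very small under a fit to spurious alignments would then be a candidate for genuine homology; how the cutoff and the null scores are chosen is left open here.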
ABSTRACT
The translation of successful preclinical and clinical proof-of-concept studies on cardioprotection to the benefit of patients with reperfused acute myocardial infarction has been difficult so far. This difficulty has been attributed to confounders which patients with myocardial infarction typically have but experimental animals usually do not have. The metabolic syndrome is a typical confounder. We hypothesised that there may also be a genuine non-responsiveness to cardioprotection and used Ossabaw minipigs, which have a genetic predisposition to develop a diet-induced metabolic syndrome, before they had developed the diseased phenotype. Using a prospective study design, a reperfused acute myocardial infarction was induced in 62 lean Ossabaw minipigs by 60 min coronary occlusion and 180 min reperfusion. Ischaemic preconditioning by 3 cycles of 5 min coronary occlusion and 10 min reperfusion was used as cardioprotective intervention. Ossabaw minipigs were stratified for their single nucleotide polymorphism as homozygous for valine (V/V) or isoleucine (I/I) in the γ-subunit of adenosine monophosphate-activated protein kinase. Endpoints were infarct size and area of no-reflow. Infarct size (V/V: 54 ± 8, I/I: 54 ± 13% of area at risk, respectively) was not reduced by ischaemic preconditioning (V/V: 55 ± 11, I/I: 46 ± 11%) nor was the area of no-reflow (V/V: 57 ± 18, I/I: 49 ± 21 vs. V/V: 57 ± 21, I/I: 47 ± 21% of infarct size). Bioinformatic comparison of the Ossabaw genome to that of Sus scrofa and Göttingen minipigs identified differences in clusters of genes encoding mitochondrial and inflammatory proteins, including the janus kinase (JAK)-signal transducer and activator of transcription (STAT) pathway. The phosphorylation of STAT3 at early reperfusion was not increased by ischaemic preconditioning, different from the established STAT3 activation by cardioprotective interventions in other pig strains. Ossabaw pigs not only have the genetic predisposition to develop a metabolic syndrome but also are not amenable to cardioprotection by ischaemic preconditioning.
ABSTRACT
MOTIVATION: Genome Architecture Mapping (GAM) was recently introduced as a digestion- and ligation-free method to detect chromatin conformation. Orthogonal to existing approaches based on chromatin conformation capture (3C), GAM's ability to capture both inter- and intra-chromosomal contacts from low amounts of input data makes it particularly well suited for allele-specific analyses in a clinical setting. Allele-specific analyses are powerful tools to investigate the effects of genetic variants on many cellular phenotypes including chromatin conformation, but require the haplotypes of the individuals under study to be known a priori. So far, however, no algorithm exists for haplotype reconstruction and phasing of genetic variants from GAM data, hindering the allele-specific analysis of chromatin contact points in non-model organisms or individuals with unknown haplotypes. RESULTS: We present GAMIBHEAR, a tool for accurate haplotype reconstruction from GAM data. GAMIBHEAR aggregates allelic co-observation frequencies from GAM data and employs a GAM-specific probabilistic model of haplotype capture to optimize phasing accuracy. Using a hybrid mouse embryonic stem cell line with known haplotype structure as a benchmark dataset, we assess correctness and completeness of the reconstructed haplotypes, and demonstrate the power of GAMIBHEAR to infer accurate genome-wide haplotypes from GAM data. AVAILABILITY AND IMPLEMENTATION: GAMIBHEAR is available as an R package under the open-source GPL-2 license at https://bitbucket.org/schwarzlab/gamibhear. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
BACKGROUND: Analysing whole genome bisulfite sequencing datasets is a data-intensive task that requires comprehensive and reproducible workflows to generate valid results. While many algorithms have been developed for tasks such as alignment, comprehensive end-to-end pipelines are still sparse. Furthermore, previous pipelines lack features or show technical deficiencies, thus impeding analyses. RESULTS: We developed wg-blimp (whole genome bisulfite sequencing methylation analysis pipeline) as an end-to-end pipeline to ease whole genome bisulfite sequencing data analysis. It integrates established algorithms for alignment, quality control, methylation calling, detection of differentially methylated regions, and methylome segmentation, requiring only a reference genome and raw sequencing data as input. Comparing wg-blimp to previous end-to-end pipelines reveals similar setups for common sequence processing tasks, but shows differences for post-alignment analyses. We improve on previous pipelines by providing a more comprehensive analysis workflow as well as an interactive user interface. To demonstrate wg-blimp's ability to produce correct results we used it to call differentially methylated regions for two publicly available datasets. We were able to replicate 112 of 114 previously published regions, and found results to be consistent with previous findings. We further applied wg-blimp to a publicly available sample of embryonic stem cells to showcase methylome segmentation. As expected, unmethylated regions were in close proximity of transcription start sites. Segmentation results were consistent with previous analyses, despite different reference genomes and sequencing techniques. CONCLUSIONS: wg-blimp provides a comprehensive analysis pipeline for whole genome bisulfite sequencing data as well as a user interface for simplified result inspection. We demonstrated its applicability by analysing multiple publicly available datasets. 
Thus, wg-blimp is a relevant alternative to previous analysis pipelines and may facilitate future epigenetic research.
Subject(s)
Sequence Analysis, DNA , Software , Sulfites/chemistry , Whole Genome Sequencing , DNA Methylation , Databases, Genetic , Humans , User-Computer Interface
ABSTRACT
BACKGROUND: Hybridization is a central mechanism in evolution, producing new species or introducing important genetic variation into existing species. In plant-pathogenic fungi, adaptation and specialization to exploit a host species are key determinants of evolutionary success. Here, we performed experimental crosses between the two pathogenic Microbotryum species, M. lychnidis-dioicae and M. silenes-acaulis, that are specialized to different hosts. The resulting offspring were analyzed on phenotypic and genomic levels to describe genomic characteristics of hybrid offspring and genetic factors likely involved in host-specialization. RESULTS: Genomic analyses of interspecific fungal hybrids revealed that individuals were most viable if the majority of loci were inherited from one species. Interestingly, species-specific loci were strictly controlled by the species' origin of the mating type locus. Moreover, we detected signs of crossing over and chromosome duplications in the genomes of the analyzed hybrids. In Microbotryum, mitochondrial DNA was found to be uniparentally inherited from the a2 mating type. Genome comparison revealed that most gene families are shared and the majority of genes are conserved between the two species, indicating very similar biological features, including infection and pathogenicity processes. Moreover, we detected 211 candidate genes that were retained under host-driven selection of backcrossed lines. These genes might therefore either play a crucial role in host specialization or be linked to genes that are essential for specialization. CONCLUSION: The combination of genome analyses with experimental selection and hybridization is a promising way to investigate host-pathogen interactions. This study reveals genetic factors of host specialization that are required for successful biotrophic infection of the post-zygotic stage, but also demonstrates the strong influence of intra-genomic conflicts or instabilities on the viability of hybrids in the haploid host-independent stage.
Subject(s)
Basidiomycota , Genome, Fungal , Meiosis , Recombination, Genetic , Basidiomycota/genetics , Basidiomycota/pathogenicity , Crosses, Genetic , DNA, Mitochondrial/genetics , Species Specificity , Virulence
ABSTRACT
The switch/sucrose non-fermenting (SWI/SNF) complex is an ATP-dependent chromatin remodeller that regulates the spacing of nucleosomes and thereby controls gene expression. Heterozygous mutations in genes encoding subunits of the SWI/SNF complex have been reported in individuals with Coffin-Siris syndrome (CSS), with the majority of the mutations in ARID1B. CSS is a rare congenital disorder characterized by facial dysmorphisms, digital anomalies, and variable intellectual disability. We hypothesized that mutations in genes encoding subunits of the ubiquitously expressed SWI/SNF complex may lead to alterations of the nucleosome profiles in different cell types. We performed the first study on CSS-patient samples and investigated the nucleosome landscapes of cell-free DNA (cfDNA) isolated from blood plasma by whole-genome sequencing. In addition, we studied the nucleosome landscapes of CD14+ monocytes from CSS-affected individuals by nucleosome occupancy and methylome-sequencing (NOMe-seq) as well as their expression profiles. In cfDNA of CSS-affected individuals with heterozygous ARID1B mutations, we did not observe major changes in the nucleosome profile around transcription start sites. In CD14+ monocytes, we found few genomic regions with different nucleosome occupancy when compared to controls. RNA-seq analysis of CD14+ monocytes of these individuals detected only a few differentially expressed genes, which were not in proximity to any of the identified differential nucleosome-depleted regions. In conclusion, we show that heterozygous mutations in the human SWI/SNF subunit ARID1B do not have a major impact on the nucleosome landscape or gene expression in blood cells. This might be due to functional redundancy, cell-type specificity, or alternative functions of ARID1B.
Subject(s)
Abnormalities, Multiple/genetics , DNA-Binding Proteins/genetics , Face/abnormalities , Hand Deformities, Congenital/genetics , Intellectual Disability/genetics , Micrognathism/genetics , Neck/abnormalities , Nuclear Proteins/genetics , Nucleosomes/genetics , Transcription Factors/genetics , Adolescent , Cell-Free Nucleic Acids/blood , Cell-Free Nucleic Acids/genetics , Child , Child, Preschool , Female , Genome, Human/genetics , Genome-Wide Association Study , Humans , Male , Monocytes/cytology , Young Adult
ABSTRACT
The notion of cancer as a complex evolutionary system has been validated by in-depth molecular analyses of tumor progression over the last years. While a complex interplay of cell-autonomous programs and cell-cell interactions determines proliferation and differentiation during normal development, intrinsic and acquired plasticity of cancer cells allows for evasion of growth factor limitations, apoptotic signals, or attacks from the immune system. Treatment-induced molecular selection processes have been described by a number of studies already, but understanding of those events facilitating metastatic spread, organ-specific homing, and resistance to anoikis is still in its early days. In principle, somatic events giving rise to cancer progression should be easier to follow in childhood tumors bearing fewer mutations and genomic aberrations than their counterparts in adulthood. We have previously reported on the genetic events accompanying relapsing neuroblastoma, a solid tumor of early childhood. Our results indicated significantly higher numbers of single nucleotide variants in relapse tumors and gave hints of branched tumor evolution upon treatment and of clonal selection, as deduced from shifts in allelic frequencies between primary and relapsing neuroblastoma. Here, we will review these findings and give an outlook on dealing with intratumoral heterogeneity and sub-clonal diversity in neuroblastoma for future targeted treatments.
Subject(s)
Clone Cells/pathology , Mutation/genetics , Neuroblastoma/genetics , Neuroblastoma/pathology , Animals , Humans , Immunotherapy , Neuroblastoma/immunology , Neuroblastoma/therapy , Recurrence , Tumor Microenvironment
ABSTRACT
BACKGROUND: High-throughput sequencing (HTS) technologies are increasingly applied to analyse complex microbial ecosystems by mRNA sequencing of whole communities, also known as metatranscriptome sequencing. This approach is at the moment largely limited to prokaryotic communities and communities of few eukaryotic species with sequenced genomes. For eukaryotes the analysis is hindered mainly by a low and fragmented coverage of the reference databases to infer the community composition, but also by a lack of automated workflows for the task. RESULTS: From the databases of the National Center for Biotechnology Information and Marine Microbial Eukaryote Transcriptome Sequencing Project, 142 references were selected in such a way that the taxa represent the main lineages within each of the seven supergroups of eukaryotes and possess predominantly complete transcriptomes or genomes. From these references, we created an annotated microeukaryotic reference database. We developed a tool called TaxMapper for reliable mapping of sequencing reads against this database and filtering of unreliable assignments. For filtering, a classifier was trained and tested on each of the following: sequences of taxa in the database, sequences of taxa related to those in the database, and random sequences. Additionally, TaxMapper is part of a metatranscriptomic Snakemake workflow developed to perform quality assessment, functional and taxonomic annotation and (multivariate) statistical analysis including environmental data. The workflow is provided and described in detail to empower researchers to apply it for metatranscriptome analysis of any environmental sample. CONCLUSIONS: TaxMapper shows superior performance compared to standard approaches, resulting in a higher number of true positive taxonomic assignments.
Both the TaxMapper tool and the workflow are available as open-source code at Bitbucket under the MIT license: https://bitbucket.org/dbeisser/taxmapper and as a Bioconda package: https://bioconda.github.io/recipes/taxmapper/README.html .
Subject(s)
Databases, Genetic , Eukaryota/genetics , Metagenomics/standards , High-Throughput Nucleotide Sequencing , Reference Standards , Software
ABSTRACT
MOTIVATION: Third generation sequencing methods provide longer reads than second generation methods and have distinct error characteristics. While there exist many read simulators for second generation data, there is a very limited choice for third generation data. RESULTS: We analyzed public data from Pacific Biosciences (PacBio) SMRT sequencing, developed an error model and implemented it in a new read simulator called SimLoRD. It offers options to choose the read length distribution and to model error probabilities depending on the number of passes through the sequencer. The new error model makes SimLoRD the most realistic SMRT read simulator available. AVAILABILITY AND IMPLEMENTATION: SimLoRD is available open source at http://bitbucket.org/genomeinformatics/simlord/ and installable via Bioconda (http://bioconda.github.io). CONTACT: Bianca.Stoecker@uni-due.de or Sven.Rahmann@uni-due.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA/methods , Computer Simulation , Genomics/methods , Software
ABSTRACT
Toll-like receptor (TLR) 13 and TLR2 are the major sensors of Gram-positive bacteria in mice. TLR13 recognizes Sa19, a specific 23S ribosomal (r) RNA-derived fragment and bacterial modification of Sa19 ablates binding to TLR13, and to antibiotics such as erythromycin. Similarly, RNase A-treated Staphylococcus aureus activate human peripheral blood mononuclear cells (PBMCs) only via TLR2, implying single-stranded (ss) RNA as major stimulant. Here, we identify human TLR8 as functional TLR13 equivalent that promiscuously senses ssRNA. Accordingly, Sa19 and mitochondrial (mt) 16S rRNA sequence-derived oligoribonucleotides (ORNs) stimulate PBMCs in a MyD88-dependent manner. These ORNs, as well as S. aureus-, Escherichia coli-, and mt-RNA, also activate differentiated human monocytoid THP-1 cells, provided they express TLR8. Moreover, Unc93b1(-/-)- and Tlr8(-/-)-THP-1 cells are refractory, while endogenous and ectopically expressed TLR8 confers responsiveness in a UR/URR RNA ligand consensus motif-dependent manner. If TLR8 function is inhibited by suppression of lysosomal function, antibiotic treatment efficiently blocks bacteria-driven inflammatory responses in infected human whole blood cultures. Sepsis therapy might thus benefit from interfering with TLR8 function.
Subject(s)
Escherichia coli/genetics , Escherichia coli/immunology , RNA, Bacterial/chemistry , RNA, Bacterial/immunology , RNA/chemistry , RNA/immunology , Toll-Like Receptor 8/immunology , Animals , Cell Line, Tumor , Humans , Leukocytes, Mononuclear/immunology , Mice , Oligoribonucleotides , RNA/genetics , RNA, Bacterial/genetics , RNA, Mitochondrial , RNA, Ribosomal, 16S , Staphylococcus aureus/genetics , Staphylococcus aureus/immunology , Toll-Like Receptor 8/chemistry , Toll-Like Receptor 8/genetics
ABSTRACT
The pathogenesis of T-cell large granular lymphocytic leukemia (T-LGL) is poorly understood, as STAT3 mutations are the only known frequent genetic lesions. Here, we identified non-synonymous alterations in the TNFAIP3 tumor suppressor gene in 3 of 39 T-LGL. In two cases these were somatic mutations, in one case the somatic origin was likely. A further case harbored a SNP that is a known risk allele for autoimmune diseases and B cell lymphomas. Thus, TNFAIP3 mutations represent recurrent genetic lesions in T-LGL that affect about 8% of cases, likely contributing to deregulated NF-κB activity in this leukemia.
Subject(s)
DNA-Binding Proteins/genetics , Genetic Variation , Intracellular Signaling Peptides and Proteins/genetics , Leukemia, Large Granular Lymphocytic/genetics , Nuclear Proteins/genetics , Cohort Studies , DNA Copy Number Variations , DNA-Binding Proteins/metabolism , Exons , Humans , Intracellular Signaling Peptides and Proteins/metabolism , Leukemia, Large Granular Lymphocytic/metabolism , Leukemia, Large Granular Lymphocytic/pathology , Mutation , Nuclear Proteins/metabolism , Polymorphism, Single Nucleotide , STAT3 Transcription Factor/genetics , Sequence Analysis, DNA , Tumor Necrosis Factor alpha-Induced Protein 3
ABSTRACT
Inferring ecosystem functioning and ecosystem services through inspections of the species inventory is a major aspect of ecological field studies. Ecosystem functions are often stable despite considerable species turnover. Using metatranscriptome analyses, we analyse a thus-far unparalleled freshwater data set which comprises 21 mainland European freshwater lakes from the Sierra Nevada (Spain) to the Carpathian Mountains (Romania) and from northern Germany to the Apennines (Italy) and covers an altitudinal range from 38 m above sea level (a.s.l) to 3110 m a.s.l. The dominant taxa were Chlorophyta and streptophytic algae, Ciliophora, Bacillariophyta and Chrysophyta. Metatranscriptomics provided insights into differences in community composition and into functional diversity via the relative share of taxa to the overall read abundance of distinct functional genes on the ecosystem level. The dominant metabolic pathways in terms of the fraction of expressed sequences in the cDNA libraries were affiliated with primary metabolism, specifically oxidative phosphorylation, photosynthesis and the TCA cycle. Our analyses indicate that community composition is a good first proxy for the analysis of ecosystem functions. However, differential gene regulation modifies the relative importance of taxa in distinct pathways. Whereas taxon composition varies considerably between lakes, the relative importance of distinct metabolic pathways is much more stable, indicating that ecosystem functioning is buffered against shifts in community composition through a functional redundancy of taxa.
Subject(s)
Biodiversity , Ecosystem , Lakes , Chlorophyta/classification , Ciliophora/classification , Diatoms/classification , Germany , Italy , Romania , Spain , Transcriptome
Subject(s)
Cyclin-Dependent Kinase Inhibitor p16/deficiency , Cyclin-Dependent Kinase Inhibitor p16/metabolism , Ependymoma/pathology , Supratentorial Neoplasms/pathology , Transcription Factor RelA/metabolism , Cohort Studies , Ependymoma/diagnosis , Ependymoma/drug therapy , Ependymoma/metabolism , Humans , Prognosis , Supratentorial Neoplasms/metabolism
ABSTRACT
BACKGROUND: An ion mobility (IM) spectrometer coupled with a multi-capillary column (MCC) measures volatile organic compounds (VOCs) in the air or in exhaled breath. This technique is utilized in several biotechnological and medical applications. Each peak in an MCC/IM measurement represents a certain compound, which may be known or unknown. For clustering and classification of measurements, the raw data matrix must be reduced to a set of peaks. Each peak is described by its coordinates (retention time in the MCC and reduced inverse ion mobility) and shape (signal intensity, further shape parameters). This fundamental step is referred to as peak extraction. It is the basis for identifying discriminating peaks, and hence putative biomarkers, between two classes of measurements, such as a healthy control group and a group of patients with a confirmed disease. Current state-of-the-art peak extraction methods require human interaction, such as hand-picking approximate peak locations, assisted by a visualization of the data matrix. In a high-throughput context, however, it is preferable to have robust methods for fully automated peak extraction. RESULTS: We introduce PEAX, a modular framework for automated peak extraction. The framework consists of several steps in a pipeline architecture. Each step performs a specific sub-task and can be instantiated by different methods implemented as modules. We provide open-source software for the framework and several modules for each step. Additionally, an interface that allows easy extension by a new module is provided. Combining the modules in all reasonable ways leads to a large number of peak extraction methods. We evaluate all combinations using intrinsic error measures and by comparing the resulting peak sets with an expert-picked one. CONCLUSIONS: Our software PEAX is able to automatically extract peaks from MCC/IM measurements within a few seconds. 
The automatically obtained results keep up with the results provided by current state-of-the-art peak extraction methods. This opens a high-throughput context for the MCC/IM application field. Our software is available at http://www.rahmannlab.de/research/ims.
Subject(s)
Computational Biology/methods , Signal Processing, Computer-Assisted , Software , Spectrum Analysis/methods , Volatile Organic Compounds/analysis , Biomarkers/analysis , Breath Tests , Case-Control Studies , Humans , Ions/analysis
ABSTRACT
Using both high-throughput sequencing and real-time PCR, the miRNA transcriptome can be analyzed in complementary ways. We describe the necessary bioinformatics pipeline, including software tools, and key methodological steps in the process, such as adapter removal, read mapping, normalization, and multiple testing issues for biomarker identification. The methods are exemplified by the analysis of five favorable (event-free survival) vs. five unfavorable (died of disease) neuroblastoma tumor samples with a total of over 188 million reads.
Subject(s)
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/standards , MicroRNAs/genetics , Real-Time Polymerase Chain Reaction/standards , Algorithms , Biomarkers, Tumor/metabolism , Chromosome Mapping , Databases, Genetic , Gene Expression Profiling/standards , Genome, Human , Humans , MicroRNAs/isolation & purification , MicroRNAs/metabolism , Neuroblastoma/genetics , Neuroblastoma/metabolism , Quality Control , Reference Standards , Sequence Homology, Nucleic Acid , Software
ABSTRACT
This study introduces a pioneering approach to automate the creation of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches over a partitioned pattern. Each search specifies the processing order of these parts and the cumulative lower and upper bounds on the number of errors in each part of the pattern. Together, these searches ensure the identification of all approximate occurrences of a search pattern within a predefined limit of k errors. While existing literature offers designed schemes for up to k = 4 errors, designing search schemes for larger k values incurs escalating computational costs. Our method integrates a greedy algorithm and a novel Integer Linear Programming (ILP) formulation to design efficient search schemes for up to k = 7 errors. Comparative analyses demonstrate the superiority of our ILP-optimal schemes over alternative strategies in both theoretical and practical contexts. Additionally, we propose a dynamic scheme selection technique tailored to specific search patterns, further enhancing efficiency. Combined, this yields runtime reductions of up to 53% for higher k values. To facilitate search scheme generation, we present Hato, an open-source software tool (AGPL-3.0 license) employing the greedy algorithm and utilizing CPLEX for ILP solving. Furthermore, we introduce Columba 1.2, an open-source lossless read-mapper (AGPL-3.0 license) implemented in C++. Columba surpasses existing state-of-the-art tools by identifying all approximate occurrences of 100,000 Illumina reads (150 bp) in the human reference genome within 24 seconds (maximum edit distance of 4) and 75 seconds (maximum edit distance of 6) using a single CPU core. Notably, our study showcases Columba's capability to align 100,000 reads of length 50, with high error rates and up to an edit distance of 7, in a mere 2 hours and 15 minutes. 
This achievement is unmatched by other lossless aligners, which require over 3 hours for edit distance 5 alignments. Moreover, Columba exhibits a mapping rate four times higher than that of a lossy tool for this dataset.
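The definition of a search scheme given in this abstract (an order over the pattern parts plus cumulative lower and upper error bounds per search) can be made concrete with a small losslessness check. The following Python sketch is illustrative only and is not taken from Hato or Columba: the tuple representation, the function names `covers` and `is_lossless`, and the two-part example scheme for k = 1 are assumptions chosen for clarity.

```python
from itertools import product


def covers(search, errors):
    """Return True if one search covers a given error distribution.

    search = (order, lower, upper): `order` lists part indices in processing
    order; lower[i] and upper[i] bound the cumulative number of errors after
    the i-th processed part. errors[p] is the number of errors in part p.
    """
    order, lower, upper = search
    total = 0
    for idx, part in enumerate(order):
        total += errors[part]
        if not lower[idx] <= total <= upper[idx]:
            return False
    return True


def is_lossless(scheme, num_parts, k):
    """Check that every distribution of at most k errors over the parts
    is covered by at least one search of the scheme."""
    for errors in product(range(k + 1), repeat=num_parts):
        if sum(errors) > k:
            continue
        if not any(covers(search, errors) for search in scheme):
            return False
    return True


# Hypothetical example: pattern split into 2 parts, up to k = 1 error.
scheme_k1 = [
    ((0, 1), (0, 0), (0, 1)),  # part 0 exact, then up to 1 error overall
    ((1, 0), (0, 1), (0, 1)),  # part 1 exact, then exactly 1 error overall
]

if __name__ == "__main__":
    print(is_lossless(scheme_k1, num_parts=2, k=1))
```

The brute-force enumeration grows quickly with the number of parts and k, which gives a feel for why designing schemes for larger k becomes computationally expensive; it verifies a scheme but does not attempt to optimize one.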