ABSTRACT
Some scientific studies involve amounts of bioinformatics data so large that they cannot be analyzed on the personal computers researchers use for day-to-day activities, and instead require effective computational infrastructures that work in a distributed way. For this purpose, distributed computing systems have become useful tools for analyzing large amounts of bioinformatics data and generating relevant results in virtual environments, where software can run for hours or even days without tying up a researcher's personal computer or laptop. Although distributed computing resources have become pivotal in many bioinformatics laboratories, researchers and students often use them incorrectly, making mistakes that can cause the distributed computers to underperform or even produce wrong outcomes. In this context, we present ten quick tips for using Apache Spark distributed computing systems for bioinformatics analyses: ten simple guidelines that, if taken into account, can help users avoid common mistakes and run their bioinformatics analyses smoothly. Although we designed our recommendations for beginners and students, experts should follow them too. We think these quick tips can help anyone use Apache Spark distributed computing systems more efficiently and ultimately generate better, more reliable scientific results.
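The individual tips are not reproduced in this abstract. Purely as an illustration of the kind of mistake such guidelines target, the hedged sketch below shows a well-known Spark idiom: persisting a dataset that is reused by several actions, so it is not recomputed from scratch each time. The input path, thresholds and class names are hypothetical and are not taken from the paper's ten tips.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class CachePitfallSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("cache-pitfall-sketch");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical input: one DNA read per line.
                JavaRDD<String> reads = sc.textFile("hdfs:///data/reads.txt");

                // Without persist(), every action below would re-read and re-filter the input.
                JavaRDD<String> longReads = reads
                        .filter(r -> r.length() >= 100)
                        .persist(StorageLevel.MEMORY_AND_DISK());

                long count = longReads.count();          // first action: materializes and caches
                long gcRich = longReads                  // second action: reuses the cached data
                        .filter(r -> gcContent(r) > 0.6)
                        .count();

                System.out.println("long reads: " + count + ", GC-rich among them: " + gcRich);
            }
        }

        private static double gcContent(String read) {
            long gc = read.chars().filter(c -> c == 'G' || c == 'C' || c == 'g' || c == 'c').count();
            return read.isEmpty() ? 0.0 : (double) gc / read.length();
        }
    }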
Subjects
Computational Biology, Software, Humans, Computational Biology/methods, Computers, Computer Communication Networks
ABSTRACT
MOTIVATION: Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of the false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a state of the art is methodologically problematic, since information regarding a key feature such as power is either missing or limited. RESULTS: By concentrating on a representative set of word-frequency-based AF functions, we perform the first coherent and uniform evaluation of their power, also covering Type I error for completeness. Two alternative models of important genomic features (cis-regulatory modules and horizontal gene transfer), a wide range of sequence lengths from a few thousand to millions, and different values of k have been used. As a result, we provide a characterization of those AF functions that is novel and informative. Indeed, we identify weak and strong points of each function considered, which may be used as a guide to choose one for analysis tasks. Remarkably, of the 15 functions that we have considered, only four stand out, with small differences between the short and long sequence length scenarios. Finally, to encourage the use of our methodology for validation of future AF functions, the Big Data platform supporting it is public. AVAILABILITY AND IMPLEMENTATION: The software is available at: https://github.com/pipp8/power_statistics. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
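As a concrete reference for readers unfamiliar with the D2 family mentioned above, the minimal sketch below computes the basic D2 statistic, i.e. the inner product of the k-mer count vectors of two sequences. It is only one representative word-frequency-based AF function and is not the evaluation pipeline used in the study; the toy sequences are illustrative.

    import java.util.HashMap;
    import java.util.Map;

    /** Minimal sketch: the basic D2 statistic, the inner product of k-mer count vectors. */
    public class D2Sketch {

        static Map<String, Long> kmerCounts(String seq, int k) {
            Map<String, Long> counts = new HashMap<>();
            for (int i = 0; i + k <= seq.length(); i++) {
                counts.merge(seq.substring(i, i + k), 1L, Long::sum);
            }
            return counts;
        }

        /** D2(x, y) = sum over all k-mers w of count_x(w) * count_y(w). */
        static long d2(String x, String y, int k) {
            Map<String, Long> cx = kmerCounts(x, k);
            Map<String, Long> cy = kmerCounts(y, k);
            long score = 0;
            for (Map.Entry<String, Long> e : cx.entrySet()) {
                score += e.getValue() * cy.getOrDefault(e.getKey(), 0L);
            }
            return score;
        }

        public static void main(String[] args) {
            System.out.println(d2("ACGTACGTGG", "ACGTTTACGT", 3)); // toy example, k = 3
        }
    }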
Subjects
Algorithms, Software, Sequence Analysis, Genomics
ABSTRACT
BACKGROUND: Huge amounts of molecular interaction data are continuously produced and stored in public databases. Although many bioinformatics tools have been proposed in the literature for their analysis, based on modeling the data through different types of biological networks, several problems remain unsolved when the analysis is performed at a large scale. RESULTS: We propose DIAMIN, a high-level software library that facilitates the development of applications for the efficient analysis of large-scale molecular interaction networks. DIAMIN relies on distributed computing and is implemented in Java on top of the Apache Spark framework. It delivers a set of functionalities implementing different tasks on an abstract representation of very large graphs, providing built-in support for methods and algorithms commonly used to analyze these networks. DIAMIN has been tested on data retrieved from two of the most widely used molecular interaction databases, proving to be highly efficient and scalable. As shown by the provided examples, DIAMIN can be exploited by users without any distributed programming experience to perform various types of data analysis and to implement new algorithms based on its primitives. CONCLUSIONS: DIAMIN has proved successful in allowing users to solve specific biological problems that can be modeled through biological networks. The software is freely available, which will hopefully foster its rapid diffusion through the scientific community, for both specific data analyses and more complex tasks.
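The sketch below is not the DIAMIN API. It is a hedged illustration, in plain Java/Spark, of the kind of distributed graph primitive such a library builds on: computing node degrees from an interaction edge list and extracting hub proteins. The input format, file paths and degree threshold are hypothetical.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class DegreeSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("degree-sketch");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical input: one tab-separated interaction per line, e.g. "P12345<TAB>Q67890".
                JavaRDD<String> edges = sc.textFile("hdfs:///data/interactions.tsv");

                // Emit each endpoint of every undirected interaction once, then sum per node.
                JavaPairRDD<String, Integer> degrees = edges
                        .flatMapToPair(line -> {
                            String[] p = line.split("\t");
                            List<Tuple2<String, Integer>> ends = Arrays.asList(
                                    new Tuple2<>(p[0], 1),
                                    new Tuple2<>(p[1], 1));
                            return ends.iterator();
                        })
                        .reduceByKey(Integer::sum);

                // Hub proteins: nodes whose degree exceeds a purely illustrative threshold.
                degrees.filter(t -> t._2 > 100)
                       .take(20)
                       .forEach(t -> System.out.println(t._1 + "\t" + t._2));
            }
        }
    }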
Subjects
Computational Biology, Software, Computational Biology/methods, Algorithms, Factual Databases, Gene Library
ABSTRACT
MOTIVATION: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms for computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. RESULTS: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions coming out of a recent hallmark benchmarking study. FADE's development and potential impact comprise several novel aspects of interest. Namely: (i) a considerable distributed-algorithm engineering effort, whose most tangible result is a much faster execution time for reference methods such as MASH and FSWM; (ii) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (iii) its ability to support both data- and compute-intensive tasks. In this regard, we provide a novel and much-needed analysis of how informative and robust AF functions are in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE. AVAILABILITY AND IMPLEMENTATION: The software and the datasets are available at https://github.com/fpalini/fade. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
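FADE's internal design is not reproduced here. As a hedged sketch of the general idea of distributed alignment-free analysis, the example below builds per-sequence k-mer frequency profiles in parallel and then computes a simple pairwise AF distance (the squared Euclidean distance between profiles) for all sequence pairs; FADE supports far more sophisticated functions and optimizations. File paths, the input format and the choice of k are illustrative.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class PairwiseAfSketch {

        static Map<String, Double> kmerFrequencies(String seq, int k) {
            Map<String, Double> freq = new HashMap<>();
            int n = seq.length() - k + 1;
            for (int i = 0; i + k <= seq.length(); i++) {
                freq.merge(seq.substring(i, i + k), 1.0 / n, Double::sum);
            }
            return freq;
        }

        /** Squared Euclidean distance between two k-mer frequency profiles. */
        static double euclidean2(Map<String, Double> a, Map<String, Double> b) {
            double d = 0.0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                double diff = e.getValue() - b.getOrDefault(e.getKey(), 0.0);
                d += diff * diff;
            }
            for (Map.Entry<String, Double> e : b.entrySet()) {
                if (!a.containsKey(e.getKey())) d += e.getValue() * e.getValue();
            }
            return d;
        }

        public static void main(String[] args) {
            int k = 4;
            SparkConf conf = new SparkConf().setAppName("pairwise-af-sketch");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical input: "sequenceId<TAB>sequence" per line.
                JavaPairRDD<String, Map<String, Double>> profiles = sc
                        .textFile("hdfs:///data/sequences.tsv")
                        .mapToPair(line -> {
                            String[] p = line.split("\t");
                            return new Tuple2<>(p[0], kmerFrequencies(p[1], k));
                        });

                // All unordered pairs of distinct sequences; fine for moderate numbers of genomes.
                profiles.cartesian(profiles)
                        .filter(pair -> pair._1._1.compareTo(pair._2._1) < 0)
                        .mapToPair(pair -> new Tuple2<>(
                                pair._1._1 + "," + pair._2._1,
                                euclidean2(pair._1._2, pair._2._2)))
                        .saveAsTextFile("hdfs:///out/af-distances");
            }
        }
    }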
ABSTRACT
BACKGROUND: Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, Big Data technologies are seen as the future for genomic data storage and processing, with MapReduce-Hadoop as the leader. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop, and their deployment there is not exactly immediate. Such a state of the art is problematic. RESULTS: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the Hadoop Distributed File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, with respect to the use of the generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework when it is used to process FASTA/Q files stored on the Hadoop Distributed File System. CONCLUSIONS: Our methods and the corresponding software substantially contribute to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Since our approach is general, it is very likely that it can also be applied to FASTA/Q compression methods that will appear in the future. AVAILABILITY: The software and the datasets are available at https://github.com/fpalini/fastdoopc.
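The two deployment methods proposed in the paper are not reproduced here. As a hedged sketch of one standard Hadoop mechanism for transparent (de)compression, not necessarily the one adopted by the paper, the example below shows how Hadoop selects a compression codec by file extension through the io.compression.codecs property; a specialized FASTA/Q codec would be registered in the same list (the class name in the comment is hypothetical), while the example itself uses the stock BZip2 codec.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CodecConfigSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("codec-config-sketch");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                Configuration hadoopConf = sc.hadoopConfiguration();

                // Hadoop picks a codec by file extension from this comma-separated list.
                // A specialized FASTA/Q codec would be added here as an extra class name
                // (hypothetical example: "org.example.SpecializedFastqCodec").
                hadoopConf.set("io.compression.codecs",
                        "org.apache.hadoop.io.compress.DefaultCodec,"
                      + "org.apache.hadoop.io.compress.GzipCodec,"
                      + "org.apache.hadoop.io.compress.BZip2Codec");

                // Transparent decompression: the .bz2 extension selects BZip2Codec.
                JavaRDD<String> lines = sc.textFile("hdfs:///data/reads.fastq.bz2");
                System.out.println("records (lines/4): " + lines.count() / 4);
            }
        }
    }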
Subjects
Data Compression, Genomics, Software, Algorithms, Big Data
ABSTRACT
BACKGROUND: Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial to achieving good performance, especially on very large amounts of data. We choose k-mer counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, with arbitrary values of k. RESULTS: One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among those based on Big Data technologies, while exhibiting very good scalability. CONCLUSIONS: We provide evidence that the use of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account in the algorithm design and implementation.
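FastKmer's balancing module is not reproduced here. The hedged sketch below shows only the baseline Spark formulation of distributed k-mer counting (flatMap to (k-mer, 1) pairs, then reduceByKey); the comment marks the point where a skew-aware partitioning strategy such as the one described above would intervene. Paths, the value of k and the partition count are illustrative.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class KmerCountSketch {
        public static void main(String[] args) {
            int k = 15;
            SparkConf conf = new SparkConf().setAppName("kmer-count-sketch");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical input: one sequence per line.
                JavaPairRDD<String, Long> counts = sc
                        .textFile("hdfs:///data/sequences.txt")
                        .flatMapToPair(seq -> {
                            List<Tuple2<String, Long>> kmers = new ArrayList<>();
                            for (int i = 0; i + k <= seq.length(); i++) {
                                kmers.add(new Tuple2<>(seq.substring(i, i + k), 1L));
                            }
                            return kmers.iterator();
                        })
                        // The number of reduce partitions controls how the aggregation workload is
                        // spread over the cluster; a skew-aware partitioner (what FastKmer's
                        // balancing module aims for) could be plugged in at this stage.
                        .reduceByKey(Long::sum, 512);

                counts.saveAsTextFile("hdfs:///out/kmer-counts");
            }
        }
    }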
Subjects
Data Analysis, Nucleic Acid Databases, Genome, Statistics as Topic, Algorithms, Base Sequence, Software, Time Factors
ABSTRACT
Motivation: Information theoretic and compositional/linguistic analysis of genomes has a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}^k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data now available in applications demands resorting to parallel and distributed computing. Indeed, such algorithms have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices, concurrently on sets of genomes. Results: Following the approach, well established in many disciplines and increasingly successful in bioinformatics, of resorting to MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to perform concurrent informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with respect to the parallel and distributed algorithms highly specialized to k-mer statistics collection for genome assembly problems. In conclusion, KCH is a much needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications. Availability and implementation: The software, including instructions for running it over Amazon AWS, as well as the datasets, are available at http://www.di-srv.unisa.it/KCH. Contact: umberto.ferraro@uniroma1.it. Supplementary information: Supplementary data are available at Bioinformatics online.
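KCH's actual algorithms are not reproduced here. As a hedged baseline sketch of k-mer statistics collection in the MapReduce model, the example below implements the classic mapper/combiner/reducer pattern in Hadoop; informational indices such as the empirical k-mer entropy could then be derived from the resulting counts in a further pass. The configuration key and the default value of k are hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class KmerCountMR {

        public static class KmerMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text kmer = new Text();
            private int k;

            @Override
            protected void setup(Context context) {
                // "kch.sketch.k" is a hypothetical configuration key used only in this sketch.
                k = context.getConfiguration().getInt("kch.sketch.k", 12);
            }

            @Override
            protected void map(LongWritable offset, Text value, Context context)
                    throws IOException, InterruptedException {
                String seq = value.toString();  // assumes one sequence (or sequence line) per record
                for (int i = 0; i + k <= seq.length(); i++) {
                    kmer.set(seq.substring(i, i + k));
                    context.write(kmer, ONE);   // emit (k-mer, 1)
                }
            }
        }

        public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) sum += v.get();   // total occurrences of this k-mer
                context.write(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("kch.sketch.k", 12);
            Job job = Job.getInstance(conf, "kmer-count-sketch");
            job.setJarByClass(KmerCountMR.class);
            job.setMapperClass(KmerMapper.class);
            job.setCombinerClass(SumReducer.class);   // local aggregation reduces shuffle volume
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }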
Subjects
Genomics/methods, Linguistics, DNA Sequence Analysis/methods, Software, Algorithms, Animals, Bacteria/genetics, Cluster Analysis, Epigenomics/methods, Eukaryota/genetics, Humans, Metagenome
ABSTRACT
SUMMARY: MapReduce Hadoop bioinformatics applications require the availability of special-purpose routines to manage the input of sequence files. Unfortunately, the Hadoop framework does not provide any built-in support for the most popular sequence file formats such as FASTA or BAM. Moreover, the development of these routines is not easy, both because of the diversity of these formats and because of the need to efficiently manage sequence datasets that may contain billions of characters. We present FASTdoop, a generic Hadoop library for the management of FASTA and FASTQ files. We show that, with respect to analogous input management routines that have appeared in the literature, it offers versatility and efficiency. That is, it can handle collections of reads, with or without quality scores, as well as long genomic sequences, while the existing routines concentrate mainly on NGS sequence data. Moreover, in the domain where a comparison is possible, the routines proposed here are faster than the available ones. In conclusion, FASTdoop is a much needed addition to Hadoop-BAM. AVAILABILITY AND IMPLEMENTATION: The software and the datasets are available at http://www.di.unisa.it/FASTdoop/. CONTACT: umberto.ferraro@uniroma1.it. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
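The sketch below does not use FASTdoop's own classes. It shows the generic Spark entry point (newAPIHadoopFile) through which any Hadoop InputFormat, including FASTA/FASTQ-aware ones such as those provided by FASTdoop, can be plugged in; here the stock TextInputFormat is used as a placeholder, so each record is a single line rather than a whole sequence with its header. Paths are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class HadoopInputFormatSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("inputformat-sketch");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                Configuration hadoopConf = new Configuration();

                // Stock TextInputFormat: each record is one line of the file.
                // A FASTA/FASTQ-aware InputFormat (as provided by a library such as FASTdoop)
                // would be substituted here, so that each record is a whole sequence with its
                // header and input splits never break a record in half.
                JavaPairRDD<LongWritable, Text> records = sc.newAPIHadoopFile(
                        "hdfs:///data/reads.fasta",
                        TextInputFormat.class,
                        LongWritable.class,
                        Text.class,
                        hadoopConf);

                long headers = records.values()
                        .filter(line -> line.toString().startsWith(">"))
                        .count();
                System.out.println("FASTA records (header lines): " + headers);
            }
        }
    }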
Subjects
Database Management Systems, Genomics/methods, Information Storage and Retrieval, DNA Sequence Analysis/methods, Gene Library, Human Genome, Humans
ABSTRACT
CONTEXT: The utility of thyroglobulin (Tg) in the follow-up of differentiated thyroid cancer (DTC) patients has been well documented. Although third-generation immunoassays have improved accuracy, limitations persist (interfering anti-Tg antibodies and measurement variability). Evolving treatment strategies require a reevaluation of Tg thresholds for optimal patient management. OBJECTIVE: To assess the performance of serum Tg testing in two populations: patients receiving total thyroidectomy and radioiodine remnant ablation (RRA), and patients treated with thyroidectomy alone. DESIGN: Prospective observational study. SETTING: Centers contributing to the Italian Thyroid Cancer Observatory (ITCO) database. PATIENTS: We included 540 patients with 5 years of follow-up and negative anti-Tg antibodies. INTERVENTIONS: Serum Tg levels assessed at the 1-year follow-up visit. MAIN OUTCOME MEASURE: Detection of structural disease within 5 years of follow-up. RESULTS: After excluding 26 patients with structural disease detected at any time point, the median Tg did not differ between patients treated with or without radioiodine. Data-driven Tg thresholds were established based on the 97th percentile of Tg levels in disease-free individuals: 1.97 ng/mL for patients undergoing thyroidectomy alone (lower than proposed by the MSKCC protocol and ESMO Guidelines, yet demonstrating good predictive ability, with a negative predictive value (NPV) of 98%) and 0.84 ng/mL for patients receiving post-surgical RRA. High sensitivity and NPV supported the potential of these thresholds for excluding structural disease. CONCLUSIONS: This real-world study provides evidence for the continued reliability of 1-year serum Tg levels. The proposed data-driven Tg thresholds offer valuable insights for clinical decision-making in patients undergoing total thyroidectomy with or without RRA.
ABSTRACT
We discuss the challenge of comparing three gene prioritization methods: network propagation, integer linear programming rank aggregation (RA), and statistical RA. These methods are based on different biological categories and estimate disease-gene association. Previously proposed comparison schemes are based on three measures of performance: the receiver operating characteristic curve, the area under the curve, and the median rank ratio. Although these may capture important aspects of gene prioritization performance, they may fail to capture important differences in the rankings of individual genes. We suggest that comparison schemes could be improved by also considering recently proposed measures of similarity between gene rankings. We tested this suggestion on comparison schemes for prioritizations of genes associated with autism that were obtained using brain- and tissue-specific data. Our results show the effectiveness of our measures of similarity in clustering brain regions based on their relevance to autism.
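The specific ranking-similarity measures referred to above are not detailed in this abstract. As a hedged illustration of the general idea, the sketch below computes the normalized Kendall tau distance between two prioritizations of the same gene set, a standard way of quantifying how strongly two rankings disagree; the gene lists are toy examples.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Illustrative only: normalized Kendall tau distance between two gene rankings. */
    public class RankSimilaritySketch {

        /** Fraction of gene pairs ordered differently by the two rankings (0 = identical, 1 = reversed). */
        static double kendallTauDistance(List<String> rankingA, List<String> rankingB) {
            Map<String, Integer> posB = new HashMap<>();
            for (int i = 0; i < rankingB.size(); i++) posB.put(rankingB.get(i), i);

            int n = rankingA.size();
            long discordant = 0;
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    // rankingA puts gene i before gene j; check whether rankingB agrees.
                    if (posB.get(rankingA.get(i)) > posB.get(rankingA.get(j))) discordant++;
                }
            }
            return (double) discordant / (n * (n - 1L) / 2);
        }

        public static void main(String[] args) {
            List<String> byPropagation = Arrays.asList("SHANK3", "CHD8", "SCN2A", "PTEN");
            List<String> byRankAggregation = Arrays.asList("CHD8", "SHANK3", "PTEN", "SCN2A");
            // 2 of the 6 gene pairs are ordered differently, so this prints about 0.33.
            System.out.println(kendallTauDistance(byPropagation, byRankAggregation));
        }
    }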
Subjects
Autistic Disorder/genetics, Algorithms, Brain/pathology, Cluster Analysis, Gene Regulatory Networks/genetics, Genetic Predisposition to Disease/genetics, Humans
ABSTRACT
Background: The role of minimal extrathyroidal extension (mETE) as a risk factor for persistent papillary thyroid carcinoma (PTC) is still debated. The aims of this study were to assess the clinical impact of mETE as a predictor of worse initial treatment response in PTC patients and to verify the impact of radioiodine therapy after surgery in patients with mETE. Methods: We reviewed all records in the Italian Thyroid Cancer Observatory database and selected 2237 consecutive patients with PTC who satisfied the inclusion criteria (PTC with no lymph node metastases and at least 1 year of follow-up). For each case, we considered initial surgery, histological variant of PTC, tumor diameter, recurrence risk class according to the American Thyroid Association (ATA) risk stratification system, use of radioiodine therapy, and initial therapy response, as suggested by ATA guidelines. Results: At 1-year follow-up, 1831 patients (81.8%) had an excellent response, 296 (13.2%) had an indeterminate response, 55 (2.5%) had a biochemical incomplete response, and 55 (2.5%) had a structural incomplete response. Statistical analysis suggested that mETE (odds ratio [OR] 1.16, p = 0.65), tumor size >2 cm (OR 1.45, p = 0.34), aggressive PTC histology (OR 0.55, p = 0.15), and age at diagnosis (OR 0.90, p = 0.32) were not significant risk factors for a worse initial therapy response. When evaluating the combination of mETE, tumor size, and aggressive PTC histology, the presence of mETE with a >2 cm tumor was significantly associated with a worse outcome (OR 5.27 [95% confidence interval], p = 0.014). The role of radioiodine ablation in patients with mETE was also evaluated. When considering radioiodine treatment, propensity score-based matching was performed, and no significant differences were found between treated and nontreated patients (p = 0.24). Conclusions: This study failed to show the prognostic value of mETE in predicting initial therapy response in a large cohort of PTC patients without lymph node metastases. The study suggests that the combination of tumor diameter and mETE can be used as a reliable prognostic factor for persistence and could be easily applied in clinical practice to manage PTC patients with low-to-intermediate risk of recurrent/persistent disease.
Subjects
Papillary Thyroid Cancer/pathology, Thyroid Neoplasms/pathology, Adult, Female, Humans, Iodine Radioisotopes, Longitudinal Studies, Male, Middle Aged, Prospective Studies, Papillary Thyroid Cancer/therapy, Thyroid Neoplasms/therapy, Thyroidectomy
ABSTRACT
Background: One of the most widely used risk stratification systems for estimating individual patients' risk of persistent or recurrent differentiated thyroid cancer (DTC) is the one proposed by the American Thyroid Association (ATA) guidelines. The 2015 ATA version, which has increased the number of patients considered at low or intermediate risk, has been validated in several retrospective, single-center studies. The aims of this study were to evaluate the real-world performance of the 2015 ATA risk stratification system in predicting the response to treatment 12 months after the initial treatment and to determine the extent to which this performance is affected by the treatment center in which it is used. Methods: A prospective cohort of DTC patients collected by the Italian Thyroid Cancer Observatory web-based database was analyzed. We reviewed all records present in the database and selected consecutive cases that satisfied the inclusion criteria: (i) histological diagnosis of DTC, with the exclusion of noninvasive follicular thyroid neoplasm with papillary-like nuclear features; (ii) complete data on the initial treatment and pathological features; and (iii) results of the 1-year follow-up visit (6-18 months after the initial treatment), including all data needed to classify the estimated response to treatment. Results: The final cohort was composed of 2071 patients from 40 centers. The ATA risk of persistent/recurrent disease was classified as low in 1109 patients (53.6%), intermediate in 796 (38.4%), and high in 166 (8.0%). Structural incomplete responses were documented in only 86 (4.2%) patients: 1.5% in the low-risk, 5.7% in the intermediate-risk, and 14.5% in the high-risk group. The baseline ATA risk class proved to be a significant predictor of structural persistent disease, both for the intermediate-risk (odds ratio [OR] 4.67; 95% confidence interval [CI] 2.59-8.43) and high-risk groups (OR 16.48; CI 7.87-34.5). The individual center did not significantly influence the prediction of the 1-year disease status. Conclusions: The ATA risk stratification system is a reliable predictor of short-term outcomes in patients with DTC in real-world clinical settings characterized by center heterogeneity in terms of size, location, level of care, local management strategies, and resource availability.